Feeding SOLR with its own logs

One of the latest projects I worked on was an apartment listing website. The main search engine used to query for properties in different cities was SOLR, and the product owner requested some kind of analytic tool to dissect user searches, get the top searched cities, and so on. My first reaction was to read the SOLR logs that print the queries, parse each line properly and store the result in a new SOLR collection. Implementing that from scratch would have required at least a few days of work. After some investigation I found a nice piece of integration software called Solr Log Manager, which is basically a bridge between Logstash and SOLR.

Logstash is a data pipeline that processes logs and other event data from a variety of systems; you can use it to collect, parse and store logs for later use. Setting up Solr Log Manager by following its Readme.md file is straightforward, and the Manual.md file provides extra information.

Configuring the lw_solr.conf file

After you set up Solr Log Manager you will have to customize the lw_solr.conf file to fit your needs. Below is the one I used on the project, followed by a description of the important parts. Many of them are intuitive; you can read the official Logstash documentation for more information.

# Input logs
input {
  file {
    type => "solrlog"
    path => [ "/opt/solr/logs/*" ]
    exclude => ["*.gz","*.zip","*.tgz"]
    sincedb_path => "/dev/null"
    start_position => "beginning"
  }
}
# Add name=value pairs as fields
filter {
  if [type] == "solrlog" {
    grok {
      patterns_dir => "./patterns"
      match => ["message", "INFO %{DATA} %{TIMESTAMP_ISO8601:received_at}; %{DATA}; \[%{DATA:collection}\] webapp=%{DATA:webapp} path=%{DATA:search_handler} params={%{DATA}%{SORT:sort}%{DATA}%{QUERY_TERMS:query_terms}%{DATA}%{FILTER_QUERY_TERMS:filter_query_terms}%{DATA}} hits=%{BASE10NUM:hits} status=%{BASE10NUM:status} QTime=%{BASE10NUM:qtime}"]
    }
    if ("_grokparsefailure" in [tags]) {
      drop{}
    }
    date {
      # Try to pull the time stamp from the 'received_at' field (parsed above with grok)
      match => [ "received_at", "yyyy-MM-dd HH:mm:ss.SSS" ]
    }
    mutate {
      # Remove unwanted characters and normalize data
      gsub => [ 
        "query_terms", "q=", "",
        "query_terms", "\+", " ",
        "sort", "sort=", "",
        "sort", "\+", " ",
        "filter_query_terms", "fq=", "",
        "filter_query_terms", "&", " AND ",
        "filter_query_terms", " AND $", "",
        "filter_query_terms", "\+", " ",
        "received_at", " ", "T",
        "received_at", "(\d$)", "\0Z"
      ]
    }
    urldecode {
      # decode URL-encoded values in all fields
      all_fields => true
    }
    mutate {
      # separate city and state
      add_field => [ "city", "%{query_terms}", "state", "%{query_terms}" ]
    }
    mutate {
      gsub => [
        "city", "PropertyCity:", "",
        "city", "PropertyPostalCode:", "",
        "city", "\*", "",
        "city", " AND .+", "",
        "state", "PropertyStateCode:", "",
        "state", "\*", "",
        "state", ".+ AND ", ""
      ]
    }
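    # Flag whether the search used the geofilt (radius) filter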
    grep {
      drop => false
      match => [ "filter_query_terms", ".+geofilt" ]
      add_field => [ "radius_search", "true" ]
    }
    grep {
      drop => false
      negate => true
      match => [ "filter_query_terms", ".+geofilt" ]
      add_field => [ "radius_search", "false" ]
    }
  }
}
# Output solr
output {
  stdout { debug => true codec => "rubydebug"}
  lucidworks_solr_lsv133 { collection_host => "localhost" collection_port => "8983" collection_name => "analytic" field_prefix => "event_" force_commit => false flush_size => 1000 idle_flush_time => 1 }
}

type => "solrlog" specifies the log type that will be later referenced on the filter section.

path => [ "/opt/solr/logs/*" ] location of solr logs.

patterns_dir => "./patterns" directory where I defined some useful custom patterns to be used on match section.

match => ["message", "INFO %{DATA} %{TIMESTAMP_ISO8601:received_at}; %{DATA}; \[%{DATA:collection}\] webapp=%{DATA:webapp} path=%{DATA:search_handler} params={%{DATA}%{SORT:sort}%{DATA}%{QUERY_TERMS:query_terms}%{DATA}%{FILTER_QUERY_TERMS:filter_query_terms}%{DATA}} hits=%{BASE10NUM:hits} status=%{BASE10NUM:status} QTime=%{BASE10NUM:qtime}"] is a regular expression that matches and encapsulate each log line section in its own field (defined after colon character).

For example, webapp=%{DATA:webapp} specifies that everything after webapp= and before path= should be matched against the grok DATA pattern and, if it matches, stored in the webapp field.

Here is a list of grok built-in patterns.

Below are some custom patterns that I defined and placed in a file inside the ./patterns directory:

QUERY_TERMS q=[^&]+
FILTER_QUERY_TERMS (?:fq=.+&)+
SORT sort=[^&]+
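
To get a feel for what these patterns capture, here is a small Python sketch using rough regex equivalents of SORT, QUERY_TERMS and FILTER_QUERY_TERMS on a shortened params string (a sketch only; it does not reproduce grok's %{DATA} glue around them):

import re

# Rough Python equivalents of the custom grok patterns above
params = ('sort=PropertyIsDisplayed+desc&start=0'
          '&q=PropertyCity:"san+francisco"+AND+PropertyStateCode:CA*'
          '&fq=Bathrooms:[1+TO+*]&fq=Bedrooms:[0+TO+*]&rows=20')

sort = re.search(r"sort=[^&]+", params).group()                 # SORT
query_terms = re.search(r"q=[^&]+", params).group()             # QUERY_TERMS
filter_query_terms = re.search(r"(?:fq=.+&)+", params).group()  # FILTER_QUERY_TERMS

print(sort)                # sort=PropertyIsDisplayed+desc
print(query_terms)         # q=PropertyCity:"san+francisco"+AND+PropertyStateCode:CA*
print(filter_query_terms)  # fq=Bathrooms:[1+TO+*]&fq=Bedrooms:[0+TO+*]&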

Filters are applied in sequence from top to bottom. mutate.gsub is useful to remove unwanted characters and normalize data after matching; as you can see, the already matched fields (which store each portion of the log line) are referenced in gsub.
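
As a rough Python approximation (not the actual filter code) of what the first gsub block plus urldecode do to the captured fields:

import re
from urllib.parse import unquote

# Sketch of the gsub normalization: strip the parameter names,
# turn '+' into spaces and join multiple fq clauses with ' AND '
query_terms = 'q=PropertyCity:"san+francisco"+AND+PropertyStateCode:CA*'
filter_query_terms = 'fq=Bathrooms:[1+TO+*]&fq=Bedrooms:[0+TO+*]&'

query_terms = unquote(query_terms.replace("q=", "").replace("+", " "))
filter_query_terms = filter_query_terms.replace("fq=", "").replace("&", " AND ")
filter_query_terms = re.sub(r" AND $", "", filter_query_terms).replace("+", " ")

print(query_terms)         # PropertyCity:"san francisco" AND PropertyStateCode:CA*
print(filter_query_terms)  # Bathrooms:[1 TO *] AND Bedrooms:[0 TO *]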

mutate.add_field is used to add extra fields, in this case city and state, both filled with the query_terms field data. I used this trick to separate the city and state information while keeping the query_terms field intact, and then applied some transformations to clean up the city and state values.
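
In Python terms, the city/state split boils down to something like this sketch (with a hard-coded query_terms value):

import re

# Sketch of the city/state trick: copy query_terms into two fields,
# then strip everything except the city in one and the state in the other
query_terms = 'PropertyCity:"san francisco" AND PropertyStateCode:CA*'

city = query_terms
for unwanted in ("PropertyCity:", "PropertyPostalCode:", r"\*", " AND .+"):
    city = re.sub(unwanted, "", city)

state = query_terms
for unwanted in ("PropertyStateCode:", r"\*", ".+ AND "):
    state = re.sub(unwanted, "", state)

print(city)   # "san francisco"
print(state)  # CA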

lucidworks_solr_lsv133 contains the information needed to reach the SOLR instance and the collection that will be fed.
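
Once events are flowing in, the analytic collection can be queried like any other SOLR collection. For example, a facet query for the top searched cities might look like the sketch below; it assumes the field_prefix setting produces field names such as event_city (an assumption, adjust to whatever your schema ends up with):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical facet query for the top 10 searched cities.
# The field name event_city is an assumption based on field_prefix => "event_".
params = urlencode({
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "event_city",
    "facet.limit": 10,
    "wt": "json",
})
response = json.load(urlopen("http://localhost:8983/solr/analytic/select?" + params))
print(response["facet_counts"]["facet_fields"]["event_city"])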

Here is an example SOLR log line that matches the above configuration:

INFO  - 2015-06-30 14:23:49.683; org.apache.solr.core.SolrCore; [collection1] webapp=/solr path=/select params={group.ngroups=true&sort=PropertyIsDisplayed+desc,+PropertySearchSortOrderKey+asc,PropertySearchListingRank+desc&fl=PropertyKey,+PropertyName,+PropertyAddress1,+PropertyCity,+PropertyStateCode,+PropertyPostalCode,+PropertyExternalUrl,+PropertyMinRent,+PropertyMaxRent,+PropertyMinBed,+PropertyMaxBed,PropertyMinBath,+PropertyMaxBath,+PropertyId,+PropertyStructureTypeId,+PropertyLatitude,+PropertyLongitude,+PropertyRevenueModelId,+PropertyMediaIds,+PropertyPostalCode,+PropertyDirectContactPhone,+PropertyMobileSiteNumber,+PropertySemTrackingNumber,+PropertyWalkScore,+PropertyTransitScore,+PropertyShortDescription,+PropertyFullDescription,+PropertyBrandName,+PropertyFloorplanImageTagIds,+PropertyIsDisplayed&start=0&q=PropertyCity:"san+francisco"+AND+PropertyStateCode:CA*&group.field=PropertyKey&group=true&wt=standard&fq=Bathrooms:[1+TO+*]&fq=Bedrooms:[0+TO+*]&fq=FloorPlanMaxRent:[0+TO+*]&fq=FloorPlanMinRent:[0+TO+*]&fq=PropertyStructureTypeId:(33+OR+34+OR+35+OR+36+OR+37+OR+38+OR+41)+AND+PropertyMediaIds:[''+TO+*]+AND+PropertyRevenueModelId:*+AND+!PropertySearchSortOrderKey:500&rows=20} hits=2049 status=0 QTime=2631
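
Roughly, the filters above turn that line into an event with fields like these (a sketch of the main fields only; the output plugin's field_prefix setting also comes into play before indexing):

received_at: 2015-06-30T14:23:49.683Z
collection: collection1
webapp: /solr
search_handler: /select
query_terms: PropertyCity:"san francisco" AND PropertyStateCode:CA*
city: "san francisco"
state: CA
radius_search: false
hits: 2049
status: 0
qtime: 2631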