How To Use Sub Aggregations With Searchkick To Return Multiple Terms Per Document

29 January 2017 on . 6 minutes to read

Aggregations for Elastisearch: Lightning Fast

Elasticsearch is a fantastic way to store denormalized data for searching or serving up as an API in order to reduce database load. Those cases are just the surface of what elasticsearch has to offer. The next step in is using aggregations (formerly known as filters). Though a simple terms count aggregation (very similar to count(*) distinct in sql) is a great place to start, I’m going to dive into something more complex and powerful: Sub Aggregations.

Up to date list of articles diving into my ecommerce performance investigations:



For a Rails application I recently worked with, there was a need to take this one step further, as an exercise to further reduce database load. For our use case, this could be done with sub aggregations, and a single inject statement to pull out the needed data into an array of Hashie::Mash’s.

Though the Hashie::Mash’s may not be required for you, they did allow a seamless drop in due to the attr accessors provided for the hash keys. This prevented having to go through and refactor many logical statement into the views (ie, if .is_a?(Hash) do one thing, else, continue with the legacy behavior). Additionally, in the future it will allow us to transition directly to hashes/json very easily if needed. There should be as little churn on the code going forward this way, which will make everyone’s lives easier.

Aggregations are powerful. Sub aggregations make it even moreso. Combine it with the fact that Elasticsearch now caches all size: 0 aggregations, and it will make future queries even quicker.

Elasticsearch and Postgres Rails Implementation

The needed code to fetch and create the {:name, :id} hash from elasticsearch.

#app/models/product.rb
def self.categories_agg_search
  search "*", body_options: {aggs: categories_agg}, limit: 0
end

def self.categories_agg
  {  
    "categories": {
      "terms": {
        "field": "category_name",
        size: 100000000 # Placeholder to ensure all data was fetched. ES wasn't aggregating across all documents, and this was quicker than going into in depth troubleshooting after a few likely candidate solutions were tested. Don't do this in production.
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category_id"
          }
        }
      }
    }
  }
end

def self.category_names_from_aggregation(search = categories_agg_search)
  return [] unless search.aggregations.present?
  category_name_agg = search.aggregations.dig("categories", "buckets")
  return [] unless category_name_agg.present?
  category_name_agg.inject([]) {|categories, i| categories << Hashie::Mash.new(name: i['key'], id: i.dig("categories", "buckets")[0]["key"])}
end

And the corresponding code to fetch what was needed from the database.

#app/models/product.rb
def self.categories_from_db
  Category.where(id: Product.pluck('DISTINCT category_id'))
end

Benchmarks: Unleash the Numbers

Before the official benchmark, let’s checkout an initial, manual run to see how these two compare:

Product.category_names_from_aggregation
  Product Search (467.8ms)  curl http://localhost:9200/products_development/_search?pretty -d '{"query":{"match_all":{}},"size":0,"from":0,"timeout":"11s","_source":false,"aggs":{"categories":{"terms":{"field":"category_name","size":100000000},"aggs":{"categories":{"terms":{"field":"category_id"}}}}}}'

Product.categories_from_db
   (352.9ms)  SELECT DISTINCT category_id FROM "products"
  Category Load (0.6ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" IN[...]

Rails.logger.level = :info

Benchmark.ips do |x|
  x.report("Category names, ids from Elasticsearch Aggregation") {Product.category_names_from_aggregation}
  x.report("Category names, ids from PG Distinct Pluck") {Product.categories_from_db}
  x.compare!
end
Warming up --------------------------------------
Category names, ids from Elasticsearch Aggregation
                         8.000  i/100ms
Category names, ids from PG Distinct Pluck
                         1.000  i/100ms
Calculating -------------------------------------
Category names, ids from Elasticsearch Aggregation
                        213.169  (±23.0%) i/s -    896.000  in   5.020203s
Category names, ids from PG Distinct Pluck
                          4.203  (± 0.0%) i/s -     22.000  in   5.235945s

Comparison:
Category names, ids from Elasticsearch Aggregation:      213.2 i/s
Category names, ids from PG Distinct Pluck:        4.2 i/s - 50.72x  slower

Win: Elasticsearch. But wait. Look at those numbers; they don’t bear any resemblance to our initial runs above, which had the following result:

  • Elasticsearch: 468ms
  • Postgrs: 352ms

Sharp eyes if you spotted that. If you wonder why, it has to do with elasticsearch’s aggregations caching which is automatically enabled for any size: 0 queries, and all aggregations. Another manual run reveals this:

Product.category_names_from_aggregation
  Product Search (31.0ms)  curl http://localhost:9200/products_development/_search?pretty -d '{"query":{"match_all":{}},"size":0,"from":0,"timeout":"11s","_source":false,"aggs":{"categories":{"terms":{"field":"category_name","size":100000000},"aggs":{"categories":{"terms":{"field":"category_id"}}}}}}'

Product.categories_from_db
   (242.3ms)  SELECT DISTINCT category_id FROM "products"
  Category Load (0.7ms)  SELECT "categories".* FROM "categories" WHERE "categories"."id" IN

HUGE difference.

Wrap Up: Use Elasticsearch for Big Data Sets

Again, elasticsearch comes out much quicker when using large datasets. If you ever find yourself needing to pull out this sort of aggregated data from larger sets quickly, then elasticsearch has demonstrated it’s a very capable solution. Again, the size of your data set, how often you need to access this data between the backing information changing, the number of nodes in your elasticsearch cluster, and how you’ve architected that cluster will all play a huge part in the specific numbers for your application.

When you reach the point where you’re considering implementing elasticsearch or something similar for your application to do that which your database wasn’t explicitly designed for, be sure to consider all the variations both to your needed data and to the architecture of any solutions you want to implement. Need a hand with that? Contact me; I’ll be glad to lend my expertise to your project and get you up and running as soon as possible while delivering the maximum value.


If you enjoy having free time and the peace of mind that a professional is on your side, then you’d love to have me work on your project.

Contact or view a list of available services to see how I’ll make your life better, easier and bring satisfaction back into you running your business.