Simple Elasticsearch Aggregation vs Postgres Counts Benchmark

01 February 2017

Aggregations for Elasticsearch: Quick Like F1

In my last post, I dove into sub aggregations. This time, we’ll look at aggregations, which are the elasticsearch equivalent of SQL’s COUNT with GROUP BY. If you’ve seen ecommerce sites that show how many products belong to each category or brand, this is most likely how they accomplish that quickly and at scale.
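To make the SQL comparison concrete, here is a minimal sketch of what a terms aggregation request body looks like as a Ruby hash. The `brand` field and the `brands` label are hypothetical names for illustration, not from the post.

```ruby
# SQL equivalent (illustrative):
#   SELECT brand, COUNT(*) FROM products GROUP BY brand;
#
# The same counts expressed as an elasticsearch terms aggregation body.
# "brand" and "brands" are hypothetical names.
aggregation_body = {
  size: 0, # we only want the counts, not the matching documents
  aggs: {
    brands: {
      terms: { field: "brand" }
    }
  }
}
```

Sending this body to elasticsearch returns one bucket per distinct `brand` value, each with a document count.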

Read More...

How To Use Sub Aggregations With Searchkick To Return Multiple Terms Per Document

29 January 2017

Aggregations for Elasticsearch: Lightning Fast

Elasticsearch is a fantastic way to store denormalized data for searching, or to serve up as an API in order to reduce database load. Those cases are just the surface of what elasticsearch has to offer. The next step is using aggregations (formerly known as facets). Though a simple terms count aggregation (very similar to COUNT(*) with GROUP BY in SQL) is a great place to start, I’m going to dive into something more complex and powerful: Sub Aggregations.
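As a rough sketch of what a sub aggregation adds over a plain terms aggregation, here is a nested request body as a Ruby hash. The `brand` and `category` fields are hypothetical, chosen to match the ecommerce examples in these posts.

```ruby
# A hypothetical sub aggregation: count products per brand, and within
# each brand bucket, count products per category. Field names are
# illustrative, not from the post.
sub_aggregation_body = {
  size: 0,
  aggs: {
    brands: {
      terms: { field: "brand" },
      aggs: { # this nested aggs key is what makes it a sub aggregation
        categories: {
          terms: { field: "category" }
        }
      }
    }
  }
}
```

Each `brands` bucket in the response then carries its own `categories` buckets, so one round trip answers both questions.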

Read More...

Event Driven Elasticsearch Bulk Updates

25 January 2017

Always update elasticsearch in bulk

You’ve got (or are considering) elasticsearch hooked up to bring some real speed back to your application, but have hit a snag. Each time a model is updated, it’s reindexed, but doing these operations one at a time adds a huge load to your application, database, and elasticsearch instance because of highly inefficient processing. Sit back and read how I implemented a fix for this on top of the Searchkick gem that brought efficiency back to the reindexing process.

We needed to handle large updates to our document set while ensuring that any updates were picked up immediately. That second requirement ruled out a scheduled job, and since polling redis for updates seemed like a terrible idea, I was left with an event driven design. (Note: since Searchkick 2 was released, very similar functionality, though not quite as fast as the custom implementation here, is now part of the gem.) What follows is a close approximation of what was implemented.

High Level:

  • Insert record IDs into a redis set
  • After inserting an ID, queue up a Searchkick::ForemanJob
  • Searchkick::ForemanJob takes chunks of ids and sends each chunk to a Searchkick::BulkUpdaterJob
  • Searchkick::BulkUpdaterJob sends bulk requests to elasticsearch

Callback

# models/product.rb
after_commit :add_id_to_redis_queue

def add_id_to_redis_queue
  # a redis set enforces uniqueness, so repeated saves only queue the id once
  Redis.new.sadd :products, id
end

Foreman Job

# jobs/searchkick/foreman_job.rb
def perform
  redis = Redis.new
  product_ids = redis.smembers :products
  product_ids.each_slice(Product.searchkick_index.options[:batch_size] || 1000) do |product_id_batch|
    Searchkick::BulkUpdaterJob.perform_later(product_id_batch)
    redis.srem :products, product_id_batch
  end
end
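The chunking step at the heart of the foreman job can be seen in isolation with plain Ruby. This is a standalone sketch (the 1000 mirrors the fallback batch size in the job above, and the helper name is hypothetical):

```ruby
# How each_slice carves the full id set into batches for the bulk
# updater jobs. 1000 mirrors the fallback batch size used above.
BATCH_SIZE = 1000

def batches_for(ids, batch_size = BATCH_SIZE)
  ids.each_slice(batch_size).to_a
end

ids = (1..2500).to_a
batches = batches_for(ids)
# 2500 ids -> three batches of 1000, 1000, and 500
```

Each of those batches becomes one `perform_later` call, so 2500 pending updates turn into three background jobs instead of 2500 individual reindex operations.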

Bulk Updater Job

# jobs/searchkick/bulk_updater_job.rb
def perform(product_ids)
  products = Product.where(id: product_ids).search_import
  Product.searchkick_index.import(products) # import is a bulk operation
end

Wrap Up

Put all together, this leverages redis sets to enforce uniqueness of queued ids and Searchkick’s bulk import to efficiently load data from your database and then send it off to elasticsearch. This provides a good boost in speed when reindexing a single model, but really shines when data from other models is denormalized onto the search document. Denormalizing requires loading information from associated models, so doing bulk loads from the database dramatically reduces the resource load on your Postgres instance, while at the same time reducing the load on elasticsearch.
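To see why bulk loading matters for the database, here is a toy sketch (not Searchkick code) that counts queries issued when loading records one at a time versus in batches. The `FakeDatabase` class and batch size of 25 are invented for illustration:

```ruby
# Toy model of the database cost: one-at-a-time reindexing issues one
# query per record, while the bulk path issues one query per batch.
class FakeDatabase
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  def find(id)          # like Product.find(id)
    @query_count += 1
    @rows[id]
  end

  def where(ids)        # like Product.where(id: ids)
    @query_count += 1
    ids.map { |id| @rows[id] }
  end
end

rows = (1..100).map { |id| [id, { id: id }] }.to_h

one_by_one = FakeDatabase.new(rows)
(1..100).each { |id| one_by_one.find(id) }

bulk = FakeDatabase.new(rows)
(1..100).each_slice(25) { |ids| bulk.where(ids) }

# one_by_one.query_count => 100, bulk.query_count => 4
```

With associations in the mix, the gap widens further, since each one-at-a-time load would also trigger its own association queries.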

Is your current search solution lacking in speed or causing extreme load on your application’s resources and you’d like an expert to check it out? Reach out and I’d be happy to discuss possible solutions to bring your project back up to speed.

Read More...

Why You Should Benchmark With Production Services: Redis Edition

25 January 2017

Learn How to Benchmark Redis with Ruby

Intro

There comes a time to benchmark your application and its services. However, if it’s not done properly, your results can be very misleading. In this post, I’ll throw some numbers at you to demonstrate the massive differences between benchmarking locally, on production, and locally pointing at a cloud resource. The numbers for each set of tests differ by an order of magnitude, demonstrating the importance of using the proper setup.
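The basic measurement pattern from Ruby’s standard library looks like this. In this sketch the Redis round trip is stubbed with a plain Ruby operation so it runs anywhere; against a real service you would put e.g. a `redis.set` call in the block instead (the iteration count is arbitrary):

```ruby
require "benchmark"

# Benchmark.realtime returns wall-clock seconds for the block, which is
# what matters when network latency to the service dominates.
ITERATIONS = 10_000

elapsed = Benchmark.realtime do
  ITERATIONS.times { |i| i.to_s } # stand-in for a round trip to Redis
end
```

Running the same block locally, on production, and locally against a cloud-hosted Redis is what produces the order-of-magnitude spread the post describes.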

Read More...

Searchkick 2: How to Quickly Reindex Your Elasticsearch Models

25 January 2017

Intro

You’ve got your app hooked up to elasticsearch for some blazing speeds, but all of a sudden, your elasticsearch cluster comes under heavy load. The problem? Each model update is being sent to elasticsearch solo, putting huge load on your cluster. See, the best way to keep elasticsearch in sync is with bulk indexing. Luckily for you, Searchkick now supports this, so you don’t have to implement it manually anymore.

Searchkick 2 introduces a great way to keep your elasticsearch cluster in sync with your models: queuing. In prior versions, this had to be implemented manually; having it built into the gem is a huge time saver. In this article, I’ll run through the setup, and crunch some numbers as well.

Setup

#config/initializers/searchkick.rb
Searchkick.redis = ConnectionPool.new { Redis.new }
class Product < ActiveRecord::Base
  searchkick callbacks: :queue
end
#Procfile
searchkick_worker: bundle exec sidekiq -c5 -qsearchkick

Manual Test

To verify your setup is correct:

  • Start your procfile (or run bundle exec sidekiq -c5 -qsearchkick in a console tab)
  • Add some ids to the queue Redis.new.lpush "searchkick:reindex_queue:products_development", Product.limit(3000).pluck(:id)
  • Start the queueing job Searchkick::ProcessQueueJob.perform_later(class_name: "Product")

If successful, you’ll see some Searchkick::ProcessQueueJobs and Searchkick::ProcessBatchJobs kick off in your sidekiq worker.

Reindex Everything

What follows is a quick benchmark against a local elasticsearch instance. The comparison times batched updates for 20,000 records.

Rails.logger.level = :info
records = Product.order(:id).limit(20000);
single = Benchmark.measure {Product.where("id <= ?", records.last.id).find_each(&:reindex)}
bulk = Benchmark.measure {Searchkick.callbacks(:bulk) {Product.where("id <= ?", records.last.id).find_each(&:reindex)}}

# Note, the following only measure starting the job and adding the IDs to redis. Job times will be added later.
batch_worker = Benchmark.measure {
  Redis.new.lpush "searchkick:reindex_queue:products_development", records.pluck(:id) # the key name searchkick uses for this model and environment
  Searchkick::ProcessQueueJob.perform_later(class_name: "Product")
}


single        #<Benchmark::Tms:0x007ffbd7184e00 @label="", @real=69.67301190498983, @cstime=0.0, @cutime=0.0, @stime=11.990000000000002, @utime=28.52000000000001, @total=40.51000000000001>
bulk          #<Benchmark::Tms:0x007ffbd8406678 @label="", @real=14.314875675016083, @cstime=0.0, @cutime=0.0, @stime=5.420000000000002, @utime=8.309999999999988, @total=13.72999999999999>
batch_worker  #<Benchmark::Tms:0x007ffbc30d1ff8 @label="", @real=0.11151637800503522, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.10999999999998522, @total=0.10999999999998522> # + (06:29.387 - 06:14.327): Time it took jobs to execute. 06:29.387 is time of completion for last job, 06:14.327 is start time for first job

single        28.520000  11.990000  40.510000 ( 69.673012)
bulk          8.310000   5.420000  13.730000 ( 14.314876)
batch_worker  0.110000   0.000000   0.110000 (  0.111516) # Add 15.06 (time for all jobs to complete) = 15.17 (total time)

So, either of those methods will be much quicker than using the classic, inline reindex callbacks.

Other Thoughts

Prior to Searchkick 2, this functionality had to be implemented manually, so having it baked into the gem is a huge added bonus, for multiple reasons. First, it’s standardized and open source. Second, moving it into a job allows you to monitor via the sidekiq web interface, which will also notify and retry should anything go wrong. There are a couple of places to improve the background job method, which I’ve used before to implement very similar functionality in Searchkick 1.5.1. I’ll have some PRs along to add those improvements, hopefully soon. Regardless, Ankane does a fantastic job with the Searchkick gem; it’s hands down the best way to use elasticsearch with Rails. The reasoning behind that, as well as the PRs, will be featured in an upcoming post.

Read More...
