Searchkick 2: How to Quickly Reindex Your Elasticsearch Models

25 January 2017

Intro

You’ve got your app hooked up to Elasticsearch for some blazing speeds, but all of a sudden your Elasticsearch cluster comes under heavy load. The problem? Each model update is being sent to Elasticsearch individually, putting huge load on your cluster. See, the best way to keep Elasticsearch in sync is with bulk indexing. Luckily for you, Searchkick now supports this, so you don’t have to implement it manually anymore.

Searchkick 2 introduces a great way to keep your elasticsearch cluster in sync with your models: queuing. In prior versions, this had to be implemented manually; having it built into the gem is a huge time saver. In this article, I’ll run through the setup, and crunch some numbers as well.

Setup

# config/initializers/searchkick.rb
# Requires the redis and connection_pool gems
Searchkick.redis = ConnectionPool.new { Redis.new }

# app/models/product.rb
class Product < ActiveRecord::Base
  searchkick callbacks: :queue
end

# Procfile
searchkick_worker: bundle exec sidekiq -c5 -qsearchkick
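
With callbacks: :queue in place, saving a record should only push its id onto the Redis reindex queue rather than reindexing inline. A quick way to see that from a console (illustrative; assumes the development environment, a local Redis, and that Product has a name column):

product = Product.first
product.update!(name: "Renamed product")
Redis.new.lrange "searchkick:reindex_queue:products_development", 0, -1
# => ["1"]  (the updated record's id, waiting to be processed)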

Manual Test

To verify your setup is correct:

  • Start your Procfile (or run bundle exec sidekiq -c5 -qsearchkick in a console tab)
  • Add some ids to the queue: Redis.new.lpush "searchkick:reindex_queue:products_development", Product.limit(3000).pluck(:id)
  • Start the queueing job: Searchkick::ProcessQueueJob.perform_later(class_name: "Product")

If successful, you’ll see some Searchkick::ProcessQueueJobs and Searchkick::ProcessBatchJobs kick off in your sidekiq worker.
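
You can also watch the queue drain directly in Redis (plain Redis commands, nothing Searchkick-specific); the length should drop towards zero as the batch jobs work through it:

Redis.new.llen "searchkick:reindex_queue:products_development"
# => 3000 right after the lpush above, shrinking as the Searchkick::ProcessBatchJobs complete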

Reindex Everything

What follows is a quick benchmark against a local Elasticsearch instance, timing batched updates for 20,000 records.

Rails.logger.level = :info
records = Product.order(:id).limit(20000);
single = Benchmark.measure {Product.where("id <= ?", records.last.id).find_each(&:reindex)}
bulk = Benchmark.measure {Searchkick.callbacks(:bulk) {Product.where("id <= ?", records.last.id).find_each(&:reindex)}}

# Note: the following only measures starting the job and adding the IDs to Redis. Job times will be added later.
batch_worker = Benchmark.measure {
  Redis.new.lpush "searchkick:reindex_queue:products_development", records.pluck(:id) #The key name for redis
  Searchkick::ProcessQueueJob.perform_later(class_name: "Product")
}


single        #<Benchmark::Tms:0x007ffbd7184e00 @label="", @real=69.67301190498983, @cstime=0.0, @cutime=0.0, @stime=11.990000000000002, @utime=28.52000000000001, @total=40.51000000000001>
bulk          #<Benchmark::Tms:0x007ffbd8406678 @label="", @real=14.314875675016083, @cstime=0.0, @cutime=0.0, @stime=5.420000000000002, @utime=8.309999999999988, @total=13.72999999999999>
batch_worker  #<Benchmark::Tms:0x007ffbc30d1ff8 @label="", @real=0.11151637800503522, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.10999999999998522, @total=0.10999999999998522> # + (06:29.387 - 06:14.327): the time it took the jobs themselves to execute. 06:29.387 is the completion time of the last job, 06:14.327 the start time of the first

single        28.520000  11.990000  40.510000 ( 69.673012)
bulk          8.310000   5.420000  13.730000 ( 14.314876)
batch_worker  0.110000   0.000000   0.110000 (  0.111516) # Add 15.06 (time for all jobs to complete) = 15.17 (total time)

So, either of those methods will be much quicker than using the classic, inline reindex callbacks.

Other Thoughts

Prior to Searchkick 2, this functionality had to be implemented manually, so having it baked into the gem is a huge added bonus, for multiple reasons. First, it’s standardized and open source. Second, moving it into a job allows you to monitor via the sidekiq web interface, and it will also notify and retry should anything go wrong. There are a couple of places the background job method could be improved; I’ve implemented very similar functionality before on Searchkick 1.5.1. I’ll have some PRs along to add those improvements, hopefully soon. Regardless, Ankane does a fantastic job with the Searchkick gem; it’s hands down the best way to use Elasticsearch with Rails. The reasoning behind that, as well as the PRs, will be featured in an upcoming post.
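
As an aside, if the sidekiq web interface isn’t already set up in your app, it only needs a mount in your routes (standard Sidekiq setup, nothing Searchkick-specific):

# config/routes.rb
require "sidekiq/web"

Rails.application.routes.draw do
  mount Sidekiq::Web => "/sidekiq"
end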

Read More...

Performance Testing a Postgres Database vs Elasticsearch 5: Column Statistics

24 January 2017

This is the first post on benchmarking a Postgres database vs a (1 node) Elasticsearch instance. The subject of this test is numeric column statistics, based on 10 million products inserted into both the database and the Elasticsearch index.

Up to date list of articles diving into my ecommerce performance investigations:



Rails.logger.level = :info

Benchmark.ips do |x|
  column = :brand_id
  x.report("Product Brand ID Elasticsearch Stats") {Product.elasticsearch_stats(column)}
  x.report("Product Brand ID PG Stats") {Product.pg_stats(column)}
  x.compare!
end
Warming up --------------------------------------
Product Brand ID Elasticsearch Stats
                        42.000  i/100ms
Product Brand ID PG Stats
                         1.000  i/100ms
Calculating -------------------------------------
Product Brand ID Elasticsearch Stats
                        451.179  (± 8.4%) i/s -      2.268k in   5.066563s
Product Brand ID PG Stats
                          3.249  (± 0.0%) i/s -     17.000  in   5.236520s

Comparison:
Product Brand ID Elasticsearch Stats:      451.2 i/s
Product Brand ID PG Stats:        3.2 i/s - 138.86x  slower

Point, blouses Elasticsearch.
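
The implementations of elasticsearch_stats and pg_stats aren’t shown in this excerpt; as a rough, hypothetical sketch of the shape they might take (assuming a numeric column, the standard Elasticsearch stats aggregation, and plain SQL aggregates on the Postgres side):

class Product < ActiveRecord::Base
  searchkick

  # Hypothetical sketch: a stats aggregation against the Searchkick-managed index
  def self.elasticsearch_stats(column)
    Searchkick.client.search(
      index: searchkick_index.name,
      body: { size: 0, aggs: { stats: { stats: { field: column } } } }
    )["aggregations"]["stats"]
  end

  # Hypothetical sketch: the equivalent aggregates straight from Postgres
  def self.pg_stats(column)
    connection.select_one(
      "SELECT COUNT(#{column}) AS count, MIN(#{column}) AS min, MAX(#{column}) AS max, " \
      "AVG(#{column}) AS avg, SUM(#{column}) AS sum FROM #{table_name}"
    )
  end
end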

Read More...

Intro to the Ecommerce SaaS Benchmark Application

24 January 2017

In my search for speed and scalability, I’ve had the pleasure of spending a lot of time recently with Elasticsearch. It’s fast, powerful, and continually updated to make it better at all it does. Besides Elasticsearch, I have my eyes on other technologies such as RELC (Redis Labs Enterprise Cluster), Citus DB, and many others geared towards scalability and ultimate performance. As a consultant, much of what I do revolves around enabling businesses to make money quicker and more efficiently. The core of many businesses these days is ecommerce. As such, I’ve created a stubbed-out Ecommerce SaaS project which will be used specifically to benchmark various technologies and how they scale at different orders of magnitude.

As time progresses, I’ll collect more data and expand the application’s features to more closely mimic an actual ecommerce app, so that we can investigate what effects different technologies, platforms, and data sets will have on the app’s performance.

Up to date list of articles diving into my ecommerce performance investigations:



Read More...

The ABC of My Life: Always Be Constructing

10 January 2017

Always Be Closing for Productivity and Profit

For sales, there’s the classic line from Glengarry Glen Ross, “ABC: Always Be Closing”. It’s used as the mantra driving salespeople’s actions towards the end goal of more sales. Over the last 18 months (since I decided to become a consultant), I’d been living by my own ABC, though I hadn’t sat down and thought about it much till now. The ABC for my life is this: Always Be Constructing.

Read More...

Think Big: Continue on the Path to Scalability as a Lead Developer

08 January 2017

As the lead developer on a project, you’ve already either created or been given the high-level design by your project’s software architect and will now have to implement it. What sort of goals should you keep in mind and shoot for as you lead development of the project in order to maintain the initial momentum towards a scalable product? Thinking big is still part of the game; you must identify specific challenges and potential or actual bottlenecks which could threaten the long-term viability of your web application. Whether that’s performing volume testing on specific and vital endpoints of your application or performance testing some common user flows, you have to be cognizant at all times of areas that could become pain points during the growth of your product.

Here are some actions to take during development:

  • Leave a SQL logger running and see if any specific requests generate more queries than you’d expect (see the sketch after this list)
  • Go wild: Add a million items to a shopping cart, spam likes and comments
  • Be evil: Try to break things. Create loops in parent/child categories for instance.
  • Add a ton of web processes on a production clone to see how your database handles it (connection pooling/raw resources)
  • Perform simple requests with stupid amounts of test data. Accidentally loading all records from your DB anywhere?
  • Ensure any services such as Redis or Elasticsearch can handle traffic spikes.
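
On the first point, one lightweight way to watch query counts per request, rather than eyeballing raw SQL logs, is an ActiveSupport::Notifications subscriber. A rough sketch (the initializer path and the 50-query threshold are just illustrative):

# config/initializers/query_counter.rb
# Count sql.active_record events per request and warn when a request looks chatty.
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, _start, _finish, _id, payload|
  unless payload[:name] == "SCHEMA"
    Thread.current[:query_count] = Thread.current[:query_count].to_i + 1
  end
end

ActiveSupport::Notifications.subscribe("process_action.action_controller") do |_name, _start, _finish, _id, payload|
  count = Thread.current[:query_count].to_i
  Thread.current[:query_count] = 0
  if count > 50 # illustrative threshold
    Rails.logger.warn "#{payload[:controller]}##{payload[:action]} ran #{count} queries"
  end
end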

There are many more places to take action and monitor; the above should be a starting point to inspire other actions. What do combinations of the above yield, and how do they apply to your application? Thinking on and answering that will provide new ideas which you can combine with the originals until you’ve synthesized a large list of things to take care of and think about. Whether you formalize testing of these or not, always remember that they all revolve around two points. Any endpoint, user action, or automated action could be a weak spot that exacerbates an unidentified hot spot, so keep the following two in mind:

  1. Malicious actions (intentional or no)
  2. Large Amounts of Information (whether data or users)

Be aware and mindful of those two, and let them guide you as you review features and perform final testing. A little preemptive action on these will go a long way towards saving you on the day you get slashdotted or decide to turn your product into a SaaS offering. Covering and catching even the few most likely candidates for slowdowns will save you massive amounts of time later.

Read More...
