The Gnar Company

Useful Sidekiq Patterns

by Dan Frenette

TL;DR

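Because Sidekiq workers run in the background, it's not hard to get comfortable using them as a "junk drawer" for logic and high-effort work. But by:

  • nesting your workers,
  • re-checking the state of your records right before acting on them ("belt and suspenders" checking), and
  • keeping the code your workers call idempotent,

it's possible to create very resilient and well-organized worker classes.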

Background

Sidekiq is a Ruby gem that aims to make background processing in Ruby applications efficient and straightforward.

Problem

We deployed some code that made heavy use of a particular Sidekiq worker, but neglected to follow the best practices as described below. The result was far from desirable, and immediate changes had to be made.

What follows is a hopefully not-too-contrived example of the challenges we faced.

Here we have a worker that runs in the background and invites active users to our app, so long as they haven't been invited. We can tell if they've been invited or not by looking at the invited_at timestamp column on the User model.

# app/workers/invite_users_worker.rb

class InviteUsersWorker < ApplicationWorker
  def perform
    uninvited_users = User.active.where(invited_at: nil)

    uninvited_users.each do |user|
      InviteUserService.new(user).invite!
    end
  end
end

At first glance this seems pretty reasonable. We're fetching some information from the database using a scope that looks like it was provided by a Rails enum, querying it with a where clause, and then iterating through each of those users. We're then passing each of those users to a service object that will, based on the class name and #invite! method, do whatever our app needs to do when a user is invited. Looks super clean, right?

Yep - don't investigate that, just ship it. It'll be great.

Our tests are green and we've seen it work in the staging environment. What could go wrong at this point?

Well, when your production data is hundreds of times the volume of your test data…lots! Several tens of thousands of workers in the retry queue later, we woefully accept that a mistake was made and shut down the workers, then get to work examining what went wrong.

So wait a second - what exactly does InviteUserService#invite! do?

A quick peek into that file reveals over 100 lines of code across several private methods, all wrapped up in a transaction. Additionally, at different points during the execution of these hundreds of lines, two other service objects of similar size and scale are called, that end up calling other workers, that end up calling other service objects, and it's turtles all the way down for a little while - fun!

Following our developer's intuition, we decide to watch this cycle run in staging to see how long it takes to invite a single user.

The results are in: 5 seconds.

Realizing that our code is just shy of blazing speeds, we send a quick email to our project manager to find out how many users need to be invited for this week's demo (where the code failed).

The results are in: 1000 an hour - unfun!
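
To see why that number is a problem, here is a rough back-of-the-envelope calculation. It assumes the single worker invites users strictly one at a time and that the 5-second staging measurement carries over to production; neither is guaranteed.

```ruby
# Rough throughput check for one worker that invites users serially.
SECONDS_PER_INVITE = 5      # measured in staging
USERS_PER_HOUR     = 1_000  # what the demo requires

SECONDS_NEEDED = SECONDS_PER_INVITE * USERS_PER_HOUR        # 5_000 seconds
MINUTES_NEEDED = (SECONDS_NEEDED / 60.0).round(1)           # 83.3 minutes
MAX_SERIAL_INVITES_PER_HOUR = 3_600 / SECONDS_PER_INVITE    # 720 invites

puts "Inviting #{USERS_PER_HOUR} users one at a time takes about #{MINUTES_NEEDED} minutes"
puts "One serial worker can finish at most #{MAX_SERIAL_INVITES_PER_HOUR} invites per hour"
```

Under those assumptions, one serial worker tops out at 720 invites an hour, well short of the 1,000 we need, and that is before a single retry.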

Okay, so how do we fix this?

Solution

Aside from a handful of smaller optimizations we found while refactoring the code, there were three patterns that we implemented that took our worker, which was practically doomed to fail, into a more resilient piece of code that's gone on to do several times that initial workload.

Nest your workers

In my opinion, the simplest and most effective of these three patterns is using nested workers. If your worker contains a lot of logic, and especially if it contains loops, then this is something you should consider. So our code will now look like this:

# app/workers/invite_users_worker.rb

class InviteUsersWorker < ApplicationWorker
  def perform
    uninvited_users = User.select(:id).active.where(invited_at: nil)

    uninvited_users.each do |user|
      InviteUserWorker.perform_async(user.id)
    end
  end
end

# app/workers/invite_user_worker.rb

class InviteUserWorker < ApplicationWorker
  def perform(id)
    user = User.find(id)

    InviteUserService.new(user).invite!
  end
end

This new approach makes for a much more scalable application - now instead of doing all the work in one gigantic worker, we're spreading the work out into many smaller workers, each responsible for a single task - inviting just one user.

The most significant benefit here is resilience. If something goes wrong while trying to invite a user, now instead of the whole process failing, the only thing that will fail is the worker responsible for the user that couldn't be invited.

What's more, because Sidekiq displays the arguments passed to every worker in the Sidekiq UI, we'll have the id of the user that couldn't be invited and can begin debugging the code!
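
To make that isolation concrete, here is a minimal sketch in plain Ruby. ToyQueue is a hypothetical stand-in written for this example and is not part of Sidekiq; it only mimics the idea of one job per user, each succeeding or failing on its own.

```ruby
# Toy stand-in for a job queue. This is NOT Sidekiq's API; it only mimics
# "enqueue an id now, run one job per id later" so the sketch runs anywhere.
class ToyQueue
  attr_reader :succeeded, :failed

  def initialize
    @jobs = []
    @succeeded = []
    @failed = []
  end

  # Plays the role of InviteUserWorker.perform_async(user_id).
  def enqueue(user_id)
    @jobs << user_id
  end

  # Runs every job independently; an error in one job does not stop the rest.
  def drain
    until @jobs.empty?
      user_id = @jobs.shift
      begin
        yield user_id
        @succeeded << user_id
      rescue StandardError => e
        # The failed job keeps its argument, so we know which user to debug.
        @failed << { user_id: user_id, error: e.message }
      end
    end
  end
end

queue = ToyQueue.new
[1, 2, 3, 4].each { |id| queue.enqueue(id) }

# Hypothetical invite logic: pretend user 3 has bad data.
queue.drain do |user_id|
  raise ArgumentError, "missing email" if user_id == 3
end

p queue.succeeded  # => [1, 2, 4]
queue.failed.each { |job| puts "user #{job[:user_id]} failed: #{job[:error]}" }
```

In real Sidekiq the failed job would land in the retry queue with its arguments visible rather than in an array, but the shape of the outcome is the same: three users invited, one job left to investigate.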

One thing to note, however, is the addition of select(:id) to our user query. Since the users' ids are now the only thing this initial worker needs to do its job, we can use select to avoid wasting resources that would be taken up by loading entire user objects into memory.

Belt and Suspenders Checking

If your workers are updating records based on their state (such as updating all the records created after a specific date or all the records with a particular attribute), consider doing one final check before that data is actually changed. Some devs at The Gnar like to call this a "belt and suspenders" approach to workers because it's redundant. However, the safety (and by extension, peace of mind) that this ensures is worth the extra work.

Let's think back to our example from before.

If it takes one of our "nested" workers a full five seconds to complete, and we're inviting 1,000 users every hour, there's probably going to be a significant "wait time" for some workers to start running.

What if, after our initial job - the one that enqueues all the other "nested" jobs - has run, something in the data changes? For example, let's say a few minutes after we've enqueued all the workers, one of the users with an enqueued worker is updated to no longer be active. In other words, if we ran the initial job an hour from now, this user wouldn't have a worker enqueued for them, but they currently do.

We don't want to be sending an invite email and login credentials to someone who doesn't want (or shouldn't have) them.

If we employ this belt and suspenders style of redundant checking however, this isn't an issue:

# app/workers/invite_user_worker.rb

class InviteUserWorker < ApplicationWorker
  def perform(id)
    user = User.find(id)

    if user.active?
      InviteUserService.new(user).invite!
    end
  end
end

Because our code relies on the records in our system being in a particular state before it executes, we need to ensure we check the state of those records as close to the code execution as possible. The check added here takes place immediately before our code is executed, which is ideal.
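
The timing problem is easier to see when it's simulated. The sketch below is plain Ruby with hypothetical stand-ins: FakeUser replaces the ActiveRecord model, a hash replaces the database, and perform_invite plays the role of the nested worker's perform method.

```ruby
# Hypothetical in-memory stand-in for the User model (not ActiveRecord).
FakeUser = Struct.new(:id, :active) do
  def active?
    active
  end
end

# Stand-in for the nested worker's #perform: reload the record, then re-check
# its state immediately before doing the work.
def perform_invite(users_by_id, id, invited_ids)
  user = users_by_id.fetch(id)   # plays the role of User.find(id)
  return unless user.active?     # the redundant "suspenders" check

  invited_ids << user.id         # plays the role of InviteUserService#invite!
end

users = { 1 => FakeUser.new(1, true), 2 => FakeUser.new(2, true) }

# 1. The parent job enqueues ids for everyone who is active right now.
enqueued_ids = users.values.select(&:active?).map(&:id)

# 2. While those jobs wait in the queue, user 2 is deactivated.
users[2].active = false

# 3. The nested jobs finally run and skip the user who no longer qualifies.
invited_ids = []
enqueued_ids.each { |id| perform_invite(users, id, invited_ids) }

p enqueued_ids  # => [1, 2]
p invited_ids   # => [1]
```

Both users had a job enqueued, but only the one who was still active when their job ran got invited.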

At this point you might be thinking, "wait, this isn't finished though, you should also have a check here to see if the user was invited or not using the timestamp" - and you're right! *high five*

But just for the sake of example, we're going to handle that a slightly different way using the last pattern.

Idempotency

The code in your workers must be idempotent. What that means is that if your code is run once, and then is rerun once, ten, or 1,000 more times, nothing will change on those subsequent runs. If one of your worker's responsibilities is to fire off an email to a specific user, you need to make sure it doesn't happen any more than it should.

Throughout this example, we've seen that all the code that changes any data lives inside the InviteUserService. One of the benefits of this is that your workers don't get any more complicated than they need to. This is great, because they're not the best place to store your logic.

However, we've also introduced a problem in our code. Here's an example of what our service object looked like before:

# app/services/invite_user_service.rb

class InviteUserService
  def initialize(user)
    @user = user
  end

  def invite!
    update_user
    send_user_invitation_email
  end

  private

  attr_reader :user

  ...
end

If we passed the same user to the #invite! method five times, that user would get five emails, which is a problem when they'll only need one. To fix this, let's add a check to our class to see if the user's invited_at column has been set to a value.

# app/services/invite_user_service.rb

class InviteUserService
  def initialize(user)
    @user = user
  end

  def invite!
    return if user_has_already_been_invited?

    update_user
    send_user_invitation_email
  end

  private

  attr_reader :user

  def user_has_already_been_invited?
    user.invited_at.present?
  end
  ...
end

You might be asking - "what's the difference between the check we added in our worker and the check used by the service object?" They serve two similar yet distinct purposes.

The purpose of the check used by the nested worker is to ensure that the data is in the same state as when the initial worker enqueued it. This protects against issues where the data might have changed while the jobs were enqueued.

The point of the check used by the service object is to ensure that if something goes wrong in your worker, and the worker retries itself, the service object won't cause any headaches due to being run on the same set of data more than once. It's not uncommon for the first check to also do the work of the second, but even in those cases, there's still a huge benefit to doing this in that your service object becomes far safer to use.

Note: We could also move this check into the worker before the service object is called if we wanted to, but this way we can ensure the service object is never executing code when it shouldn't be regardless of the caller.

The service object is safer because it now knows when it should and shouldn't update its data. Think of it as a worst-case scenario, last-ditch effort to keep these classes from changing your data when you don't want them to.
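
As a quick sanity check on the idea, here is a minimal, runnable sketch of the guard in isolation. SketchUser and SketchInviteService are hypothetical stand-ins for this example: a Struct replaces the ActiveRecord model and an array replaces the mailer, so the only thing being shown is the early return.

```ruby
# Hypothetical stand-ins: a plain Struct instead of the ActiveRecord model,
# and an array ("outbox") instead of a real mailer.
SketchUser = Struct.new(:email, :invited_at)

class SketchInviteService
  def initialize(user, outbox)
    @user = user
    @outbox = outbox
  end

  def invite!
    return if user.invited_at   # guard: already invited, so do nothing

    user.invited_at = Time.now  # plays the role of update_user
    outbox << user.email        # plays the role of send_user_invitation_email
  end

  private

  attr_reader :user, :outbox
end

outbox = []
user = SketchUser.new("dana@example.com", nil)

# Simulate a job that runs once and is then retried four more times.
5.times { SketchInviteService.new(user, outbox).invite! }

p outbox.size  # => 1
```

Five calls, one email. Without the guard on the first line of invite!, the same loop would put five messages in the outbox.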

Other Useful Tidbits

  • If you can be reasonably confident that retrying a worker in the event of failure won't change the outcome, consider setting the retry count of that worker to a low number like five. This way you're still accounting for edge cases, but without running more jobs than you need to. I think I read something once about trying the same thing more than once and expecting a different result…
  • Organize your workers into subdirectories. This advice could be applied to almost any type of Ruby class in a Rails project (models, service objects, controllers, etc.). Still, if it's starting to feel painful to look for a specific worker, perhaps because many workers look and are named very similarly, consider organizing your workers with subdirectory folders.
  • Depending on your app's infrastructure, you might need to make sure any workers that are automated only run at certain times. Tools like the Periodic Jobs functionality of Sidekiq Enterprise, or open-source projects such as Sidekiq Scheduler, make this very straightforward and are a great way to keep your servers from being overburdened.
  • Lastly, it's never a bad idea to be regularly viewing the Sidekiq UI (or an APM tool such as Kibana) for spikes in retries in your background workers.

Outcome/Takeaways

Rails workers often perform critical tasks, such as updating large batches of data or performing maintenance tasks at automated, regular intervals. Because of this, it's crucial to ensure that workers are not only able to perform well under extensive workloads, but also handle errors and retries effectively under these conditions.
