- Diagnose first, write code later.
- Everything in moderation, especially indexes.
heroku pg:diagnoseis an excellent tool for projects using Heroku Postgres instances.
EXPLAINon the generated query, which can have some powerful insights on how your queries are running.
The Power of Postgres
Let's be clear from the start here. Postgres can handle a lot of data, and oftentimes the reason for slow queries are the fault of the programmer and not of the database. In tackling some of the issues here at The Gnar Company, we often expected a query to be extremely straightforward and thus placed the blame on Postgres. "Our Postgres instance is too small", "there are too many concurrent queries", and "we should just shard this database" all came up at various points in the process, but in reality, we just weren't using Postgres to its full potential.
A current project calls for a lot of processing power that came in the form of Sidekiq jobs. To date, this project has processed ~34 million jobs in a few weeks (with a peak of 3.4 million jobs a day), and each of those jobs was making several queries to our database, especially on one specific table. This table has ~300 million records, so performance is crucial. As we said before, Postgres is powerful, and managing 300 million rows is completely reasonable when handled properly.
This project uses Rails and Postgres and is hosted on Heroku, so some of the tools we use are specific to these technologies, but the ideas will apply to many other stacks.
Diagnosing the Problem
For any given problem, there may be many solutions. There's rarely a silver bullet, but incremental improvements can make a big difference in the long run.
One tool that was immensely helpful was Heroku's Postgres diagnostics tool.
For projects using a Heroku Postgres instance, running
heroku pg:diagnose on the Heroku CLI will give you an overview of potential problems such as blocking queries, bloat, low cache hit rate, and long-running queries.
For more in-depth analysis, there's also a plugin for the Heroku CLI called
heroku-pg-extras that can give similar diagnostics in more detail.
These serve as a good starting place to dive into more complex Postgres issues.
Too many long-running queries may prevent garbage collection on your tables, or a low cache hit rate may mean you need a bigger Postgres instance. In our case, we were seeing some queries that were taking far too long (as in, 9 hours too long).
Another great option is the
explain method, which can be tacked on to any ActiveRecord call to get a peek at how Postgres decides the best way to run a given query.
This will run the Postgres command
EXPLAIN on your query, giving output like this:
irb(main):008:0> Account.where(user_id: 1).explain D, [2019-05-13T20:54:50.542986 #4] DEBUG -- : Account Load (1.7ms) SELECT "accounts".* FROM "accounts" WHERE "accounts"."user_id" = $1 [["user_id", 1]] => EXPLAIN for: SELECT "accounts".* FROM "accounts" WHERE "accounts"."user_id" = $1 [["user_id", 1]] QUERY PLAN --------------------------------------------------------------------------------------------- Index Scan using index_accounts_on_user_id on accounts (cost=0.06..5.51 rows=25 width=143) Index Cond: (user_id = '1'::bigint) (2 rows)
This output can look scary, but there are articles out there to help, and with enough practice, reading this explanation can save a lot of headaches.
Here, we see our query using an Index Scan, which is exactly what we'd want for a common lookup.
However, some queries may not take the path you expect.
Does your query do a sequential scan when you expected an index scan?
Or maybe you see a Bitmap Heap Scan?
None of these are inherently bad, but
explain may show an unexpected plan, such as ignoring an index where you would expect one to be used.
This is a good place to start our investigation.
Remember that 9-hour query I mentioned before?
Let's start there.
We tried our hands at many different solutions, and many things helped in varying degrees. Let's start with what was most useful here at The Gnar and then cover some other common problems.
Indexes in Moderation
When it comes to indexes in Postgres, oftentimes less is more. Our project features a lot of writing to the database, and at one point we had six separate indexes on our largest table, which meant each write took far longer than we would like. This is a huge roadblock for the most common operation in the app. Using only three was more than enough.
Fewer indexes are of course more helpful for optimizing writes, but what about reading from the database?
This is where the diagnostics tools come into play.
heroku pg:diagnose is helpful, as you may find that some indexes are never even used at all!
Some indexes are more helpful than others, and the Postgres planner may find a better path that doesn't use your expected index.
This was certainly the case on our project. In our database, records are associated with an account and a date, and there may be several records associated with the same date. We had an index on the account id and on the date, so those two indexes together should be enough, right?
Well, not quite.
Postgres can only use one of those indexes, so which does it use? Well, it could look up every record for an account (~50,000 records) or every record for a given date (~200,000 records), but then it has to go through all of those records individually. Neither of these is ideal. The real solution is to index on both of these columns together, which narrows the results down to a handful of records for any given account and date. Instead of scanning through thousands of records, Postgres now only has to go through ~20 records, which decreased our query time from a few seconds to a few milliseconds.
When it comes to indexes, it varies from project to project, but a few very powerful indexes can be extremely performant when used intentionally.
As mentioned earlier, this project in particular had hundreds of jobs processing at a time, so we did our best to lump many small jobs into fewer larger jobs to efficiently use our database connections and increase throughput. We did this by selecting and upserting data in bulk.
Before we made our optimizations, background jobs would upsert records one at a time by using
find_or_initialize_by, performing our business logic, and then passing data on to the database.
Again, this is inefficient when doing hundreds of jobs a minute.
Instead, we began pulling data in large chunks (such as pulling all records for a given chunk of time), modifying that data in Ruby, and then upserting it all at once.
Pulling this work into Ruby and doing it manually meant a little more work on our part, but it took a lot of pressure off of Postgres and allowed it to handle queries more effectively.
Bulk upserts are being introduced in Rails 6, but since this project is being wrapped up before the official release, we opted instead to use activerecord-import, which allowed us to bulk upsert records with ease, all while still using model objects and running validations.
Using bulk inserts can combine thousands of queries into a single
INSERT statement, which can save a lot of time in projects that spend a lot of time writing to the database.
Other common solutions
While minimizing our indexes and bulk upserting data where the two most effective solutions for our needs, there are some other common optimizations that can have major impacts on a project's performance.
N+1 Queries and Eager Loading
N+1 queries are a common culprit on many projects.
An n+1 query is a common problem where every record from a single query generates many more queries for each record retrieved.
This often happens when retrieving instances in
Book.all.each do |book| puts book.author.first_name end
In this example, every iteration must make another query to retrieve each book's
This can be solved by intentionally loading each
Author at the same time as each
This is called "eager loading," and can be accomplished with the
Book.all.includes(:author).each do |book| puts book.author.first_name end
Bullet is an excellent gem that can be used to detect n+1 queries.
Plucking Specific Attributes
Another common action in Rails projects is selecting a single attribute from a model.
There are many ways to do this, such as using
map to select the attribute from every item in the returned array.
author = Author.first Book.where(author: author).map(&:title)
The second line generates the following SQL:
SELECT "books".* FROM "books" WHERE "books"."author_id" = 1;
This would find every Book that has the matching Author, pull each entire row, and then select the title attribute. However, there's a more efficient way of doing this.
This produces this SQL instead:
SELECT "books"."title" FROM "books" WHERE "books"."author_id" = 1;
This accomplishes the same thing, so what's the difference?
pluck pulls only the given attribute from the database.
We can see this in the sql that each generates:
"books"."title" instead of
Instead of pulling all of the columns from the
Books table, we're only getting the titles.
For large data sets or tables with a lot of columns, using
pluck (or related methods like
select) can be a big performance boost.
As I mentioned from the beginning, there's no silver bullet for fixing database queries. Increasing performance is an iterative process. The key takeaway here is that writing code should come at the end of the process, not the beginning. Using diagnostics tools should shed some light on the causes of your performance issues, and from there you can explore solutions.
Running diagnostics and learning about your options is a great start, and once the source of the problem is identified, the code will follow.