A Tale of Query Optimization

A story of optimising SQL queries and the lessons I learned that day.

A tale of Query Optimization

Last December, I got an interesting bug report from the VWO support team. It seemed like the loading time of a particular analytics report was ridiculously slow for a big Enterprise customer. Since I am part of the Data platform, a flag was immediately raised and I got involved in debugging the issue.

Background

To provide a bit of context, I’ll explain the basics of VWO. It’s a platform through which people can run targeted campaigns of various types on their websites: perform A/B experiments, track visitors and conversions, do funnel analysis, render heatmaps and play visitor recordings.

The real power, however, lies in the reporting of the platform. All of the above mentioned features are connected together. And for Enterprise customers, ingesting all this vast data would be useless without providing a powerful platform that can deliver those insights.

Using the platform, you can make arbitrarily complex queries on large data sets. Here is a simplistic example:

Show all clicks by visitors on webpage "abc.com"
FROM <date d1> TO <date d2>
for people who were either
using Chrome as a browser OR 
(were browsing from Europe AND were using iPhone devices)

Notice the boolean operators. Those are provided in the query interface for customers, so they can make arbitrarily complex queries to slice and dice the data.

The Slow Query

The customer in question was trying to do something which should be reasonably fast if you think about it intuitively:

Show me all session recordings 
for users who visited any webpage 
containing the url that matches the pattern "/jobs"

This was a site that got huge amount of traffic and we had more than a million unique URLs stored for them alone. And they wanted to search for a fairly common URL pattern that relates to their business model.

Preliminary Investigation

Lets dive into what was happening in the database. The following was the original slow SQL query:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND sessions.referrer_id = recordings_urls.id 
    AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   ) 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0 ;

Here are the timings:

Planning time: 1.480 ms
Execution time: 1431924.650 ms

It ran on around 150k rows. The query planner showed some interesting details, but no immediate bottlenecks jumped out.

Lets examine the query a bit further. As it can be seen, the query does a JOIN on three tables:

  1. sessions: To display the session information: browser, user-agent, country, and so on.
  2. recording_data: Recording urls, pages, duration
  3. urls: Since we’d like to avoid duplication of extremely large url strings across the table, we normalize it in a dedicated table.

Also notice that we already have all of our tables partitioned by account_id. So there is no possible scenario of one large account overwhelming other accounts and causing slowdown.

Hunting for Clues

Upon close inspection, we see that there is something different in this particular query. The following line needs to be examined:

urls && array(
	select id from acc_{account_id}.urls 
	where url  ILIKE  '%enterprise_customer.com/jobs%'
)::text[]

Initially I thought that perhaps the ‘ILIKE’ operation on all those long URLs - we have more than 1.4 million unique URLs captured for this account - might be causing the slowdown.

But that isn’t the case at all!

SELECT id FROM urls WHERE url ILIKE '%enterprise_customer.com/jobs%';
  id
--------
 ...
(198661 rows)

Time: 5231.765 ms

The pattern matching query itself is taking only 5 seconds independently. Matching millions of unique URLs is clearly not a problem.

The next suspect on the list is the multiple JOIN statements. Perhaps the excessive JOINs have contributed to the slowdown? They are usually the most obvious contender when it comes to query slowdown, but I’ve not really believed that to be the typical case.

analytics_db=# SELECT
    count(*)
FROM
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data_0 as recording_data,
    acc_{account_id}.sessions_0 as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND sessions.referrer_id = recordings_urls.id
    AND r_time > to_timestamp(1542585600)
    AND r_time < to_timestamp(1545177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
-------
  8086
(1 row)

Time: 147.851 ms

And it wasn’t the case. The JOINs are quite fast.

Narrowing down the Suspects

I was ready to start tweaking the query to achieve any possible perf gains possible. My team and I brainstormed 2 tactics to try out:

  • Use EXISTS on the URLs subquery: We wanted to double check that the problem was with the URLs subquery. One way to do this is to simply use EXISTS. EXISTS [1] can be a big performance booster since it returns as soon as it finds even a single row that matches the internal condition.
SELECT
	count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data as recording_data,
    acc_{account_id}.sessions as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND  (  1 = 1  )
    AND sessions.referrer_id = recordings_urls.id
    AND  (exists(select id from acc_{account_id}.urls where url  ILIKE '%enterprise_customer.com/jobs%'))
    AND r_time > to_timestamp(1547585600)
    AND r_time < to_timestamp(1549177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
 32519
(1 row)
Time: 1636.637 ms

Yep. The subquery, when wrapped with EXISTS, makes the entire thing super-fast. The interesting question to ask therefore, is that Why were both the overall query with JOINs and the subquery itself, independently fast, but dramatically slow when combined?

  • Move the sub-query into a CTE: If the query is fast on its own, then we can just pre-compute the fast result and then supply that to the main query
WITH matching_urls AS (
    select id::text from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%'
)

SELECT 
    count(*) FROM acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions,
    matching_urls
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id
    AND (urls && array(SELECT id from matching_urls)::text[])
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545107599)
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0;

But this was still extremely slow.

Finding the Culprit

There was this one little thing that I’d been looking at and dismissing all along. Since nothing else was working, I decided to take a look at it anyway. It was the && operator. While EXISTS was the performance booster that made things faster, it seemed like the && was the only remaining common factor in all versions of the slow query.

Looking at the docs [2], we see that && is used when we need to find common elements between two arrays.

The clause in the original query is:

AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   )

Which means that we do pattern matching on our URLs, then take an intersection with all the urls common with the recordings. It is a bit confusing since the “urls” here does not refer to the table containing all URLs, but the “urls” column in the recording_data table.

Becoming suspicious of &&, I tried to confirm it from the query plan generated from EXPLAIN ANALYZE (I’d already saved the plan, but I’m usually more comfortable with experimenting in SQL than trying to understand the opaqueness of query planners).

Filter: ((urls && ($0)::text[]) AND (r_time > '2018-12-17 12:17:23+00'::timestamp with time zone) AND (r_time < '2018-12-18 23:59:59+00'::timestamp with time zone) AND (duration >= '5'::double precision) AND (num_of_pages > 0))
                           Rows Removed by Filter: 52710

There were multiple rows of just && Filters. Which means that not only was this operation expensive, it ran multiple times.

I then verified it by isolating the clause

SELECT 1
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data_30 as recording_data_30, 
    acc_{account_id}.sessions_30 as sessions_30 
WHERE 
	urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]

This query is slow. Since JOINs are fast, and the subquery is fast, the only thing left is the && operator.

Except, this is a crucial operation. We always need to search throughout the primary urls table to do pattern match, and we always need to find the intersection. We can’t search on the recording urls directly because those are just ids referring to urls.

Towards a Solution

&& is slow because both sets are extremely large. The operation is reasonably fast if I replace urls with { "http://google.com/", "http://wingify.com/" }.

I started googling for a way to do set-intersection in Postgres without using the && and it wasn’t going anywhere.

In the end, we decided to simply solve the problem in isolation: Get me all the urls rows for which the url matches the pattern. Given nothing else, I would simply do -

SELECT urls.url
FROM 
	acc_{account_id}.urls as urls,
	(SELECT unnest(recording_data.urls) AS id) AS unrolled_urls
WHERE
	urls.id = unrolled_urls.id AND
	urls.url  ILIKE  '%jobs%'

Instead of the JOIN syntax, I just used a sub-query, and unrolled the recording_data.urls array, so that I can directly apply the condition in the where clause.

The key insight here is that && is used in order to test whether a given recording contains a matching URL. If we squint our eyes a bit, we can look at this operation as traversing items in an array (or rows in a table) and stopping as soon as a condition (match) is met. Does that remind you of something? Yep, EXISTS.

Since recording_data.urls can be referred from outside the subquery’s context, and when that happens, we can back to our old friend EXISTS, and wrap the subquery with it.

Combining all together, we get the final optimized query as the following:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0
    AND EXISTS(
        SELECT urls.url
        FROM 
            acc_{account_id}.urls as urls,
            (SELECT unnest(urls) AS rec_url_id FROM acc_{account_id}.recording_data) 
            AS unrolled_urls
        WHERE
            urls.id = unrolled_urls.rec_url_id AND
            urls.url  ILIKE  '%enterprise_customer.com/jobs%'
    );

And the final query execution time: Time: 1898.717 ms Should we start celebrating?!?

Not so fast! We still had to verify correctness. I was extremely suspicious of the EXISTS optimization, since it changes the logic to exit early. We had to be sure that we hadn’t added a subtle bug back into the query.

Simple check was to simple do a count(*) on both the slow and fast queries on many different sets of data. Then, for a small subset of data, I manually verified that all results were correct.

All verifications returned consistently positive results. We had fixed it!

Lessons Learned

There were quite a lot of lessons learned from this episode:

  1. Query plans don’t tell the entire story, but do give hints
  2. The obvious suspects aren’t always the true culprits
  3. Slow queries can be broken down to isolate bottlenecks
  4. Not all optimizations are reductive in nature
  5. Using EXIST where possible can provide drastic perf gains

Conclusion

We went from a query time of ~24 mins to 2 seconds, an extremely dramatic performance improvement! While this blog post is long, all of the experiments we did happened on the same day and I’d say it took around 1.5 to 2 hrs for us to do the optimization and test it.

SQL is quite an amazing language once you let go of the fear and embrace it. Having a good mental model of the way SQL queries are executed, the query plans generated by the DB, the way indexes are used, and just the sheer size of the data gets you quite far when doing query optimizations. Equally important, however, is to keep trying different approaches and slowly break the problem down and find the bottlenecks.

The best part about achieving such results is the tangible improvement seen, when the report that would not even load earlier on was now loading in an extremely snappy fashion.

Special Thanks to my teammates Aditya Mishra, Aditya Gaur and Varun Malhotra for brainstorming and to Dinkar Pundir, for finding a crucial bug in our final query before we got ahead of ourselves!

Tweet this article

References:

[1] https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY-EXISTS

[2] https://www.postgresql.org/docs/9.1/functions-array.html