How to Optimize BigQuery Queries with Partition Skew

Christian Baghai
4 min readNov 18, 2023

--

Partition skew is a common problem that affects the performance and cost of BigQuery queries. It occurs when the data is unevenly distributed across partitions, resulting in some partitions having more data than others. This can cause BigQuery to scan more data than necessary, use more resources, and take longer to execute the query.

In this blog post, we will explain what causes partition skew, how to detect it, and how to fix it using the new Query Performance Insight feature that Google launched recently.

What causes partition skew?

BigQuery supports two types of partitioning: time-unit column partitioning and integer range partitioning. Time-unit column partitioning divides the data into partitions based on a date or timestamp column. Integer range partitioning divides the data into partitions based on an integer column.

Partition skew can happen for various reasons, such as:

  • The data is naturally skewed, meaning that some values of the partitioning column are more frequent than others. For example, if you partition a table by country, some countries may have more rows than others.
  • The data is artificially skewed, meaning that some values of the partitioning column are generated or inserted in a way that creates imbalance. For example, if you partition a table by a hash function of a column, some hash values may have more rows than others.
  • The data is temporally skewed, meaning that some values of the partitioning column are more recent than others. For example, if you partition a table by date, the most recent partitions may have more rows than the older ones.

How to detect partition skew?

Partition skew can affect the performance and cost of BigQuery queries in different ways, depending on the type of query and the type of partitioning. Some common symptoms of partition skew are:

  • High query slot time, meaning that the query uses more CPU resources than expected.
  • High query bytes scanned, meaning that the query reads more data than expected.
  • High query execution time, meaning that the query takes longer to complete than expected.
  • High query cost, meaning that the query incurs more charges than expected.

To detect partition skew, you can use the new Query Performance Insight feature that Google launched recently. This feature allows you to analyze the performance and cost of your BigQuery queries, and identify potential issues and opportunities for optimization.

To use Query Performance Insight, you need to enable the BigQuery Reservations API and create a reservation. A reservation is a way to allocate a fixed amount of slots (CPU resources) for your BigQuery queries. You can assign projects, folders, or organizations to a reservation, and monitor their usage and performance.

Once you have a reservation, you can access the Query Performance Insight dashboard from the BigQuery console. The dashboard shows you various metrics and charts that help you understand how your queries are performing and costing. You can filter the dashboard by date range, reservation, project, folder, or organization.

One of the charts that the dashboard shows is the Partition Skew chart. This chart shows you the distribution of bytes scanned across partitions for a given query. You can see the number of partitions, the minimum, maximum, average, and median bytes scanned per partition, and the standard deviation of bytes scanned per partition. You can also see the partition skew ratio, which is the ratio of the maximum bytes scanned per partition to the average bytes scanned per partition. A high partition skew ratio indicates a high degree of partition skew.

How to fix partition skew?

There are different ways to fix partition skew, depending on the type of partitioning and the cause of the skew. Some general strategies are:

  • Choose a partitioning column that has a high cardinality and a uniform distribution. For example, instead of partitioning by country, you can partition by a combination of country and city, or by a hash function of the country column.
  • Choose a partitioning granularity that matches the query frequency and filter criteria. For example, instead of partitioning by day, you can partition by hour, minute, or second, depending on how often and how precisely you query the data.
  • Use clustering to further divide the data within each partition based on one or more columns. Clustering can help reduce the amount of data scanned by a query, and improve the query performance and cost. For example, you can cluster a table by customer ID, product ID, or category.
  • Use partition filters to limit the number of partitions scanned by a query. Partition filters can help prune the partitions that do not match the filter criteria, and reduce the query performance and cost. For example, you can use a WHERE clause to filter by date, timestamp, or integer range.
  • Use partition expiration to automatically delete old partitions that are no longer needed. Partition expiration can help reduce the storage cost and the query performance and cost. For example, you can set a partition expiration time of 30 days, 90 days, or 365 days, depending on your data retention policy.

Conclusion

Partition skew is a common problem that affects the performance and cost of BigQuery queries. It occurs when the data is unevenly distributed across partitions, resulting in some partitions having more data than others. To detect partition skew, you can use the new Query Performance Insight feature that Google launched recently. This feature allows you to analyze the performance and cost of your BigQuery queries, and identify potential issues and opportunities for optimization. To fix partition skew, you can use different strategies, such as choosing a better partitioning column, granularity, or clustering, using partition filters or expiration, or re-partitioning the data. By optimizing your BigQuery queries with partition skew, you can improve the query performance and cost, and get more value from your data.

--

--