How to Optimize BigQuery Queries with Partition Skew

Christian Baghai
4 min readNov 18, 2023

Partition skew is a common problem that affects the performance and cost of BigQuery queries. It occurs when the data is unevenly distributed across partitions, resulting in some partitions having more data than others. This can cause BigQuery to scan more data than necessary, use more resources, and take longer to execute the query.

In this blog post, we will explain what causes partition skew, how to detect it, and how to fix it using the new Query Performance Insight feature that Google launched recently.

What causes partition skew?

BigQuery supports two types of partitioning: time-unit column partitioning and integer range partitioning. Time-unit column partitioning divides the data into partitions based on a date or timestamp column. Integer range partitioning divides the data into partitions based on an integer column.

Partition skew can happen for various reasons, such as:

  • The data is naturally skewed, meaning that some values of the partitioning column are more frequent than others. For example, if you partition a table by country, some countries may have more rows than others.
  • The data is artificially skewed, meaning that some values of the partitioning column are generated or inserted in a way that creates imbalance. For example, if you partition a table by a hash function of a column, some hash values may have more rows than others.
  • The data is temporally skewed, meaning that some values of the partitioning column are more recent than others. For example, if you partition a table by date, the most recent partitions may have more rows than the older ones.

How to detect partition skew?

Partition skew can affect the performance and cost of BigQuery queries in different ways, depending on the type of query and the type of partitioning. Some common symptoms of partition skew are:

  • High query slot time, meaning that the query uses more CPU resources than expected.
  • High query bytes scanned, meaning that the query reads more data than expected.
  • High query execution time, meaning that the query takes longer to complete…

--

--