Solutions for Azure SQL Database Throttle Log Wait of HADR THROTTLE LOG RATE GOVERNOR
We recently experienced errors with an ETL load into an Azure SQL database and identified the wait as HADR_THROTTLE_LOG_RATE_GOVERNOR, but couldn't find any information from Microsoft about this, or any possible work-arounds. What's the best way to handle an ETL load if experiencing this wait?
Let's look at what might be causing this wait. Microsoft does not provide documentation on it, but in coming across it multiple times there is a strong correlation with the log usage.
We can use the below query to confirm that this is the wait:
SELECT wait_type , last_wait_type , * FROM sys.dm_exec_requests CROSS APPLY sys.dm_exec_sql_text(sql_handle)
A couple of quick points to make is that while Azure-SQL sits on SQL Server, it does not always behave the same way and you will find the nuances as you work with it. When we run across the HADR_THROTTLE_LOG_RATE_GOVERNOR wait, we want to look at our current log usage using sys.dm_db_resource_stats:
SELECT TOP 10 end_time , avg_log_write_percent , * FROM sys.dm_db_resource_stats ORDER BY end_time DESC
According to multiple Azure-SQL databases that I reviewed, the end_time field is reported every minute and Microsoft states that this reports the UTC time of the minute interval and the avg_log_write_percent column is what I've found correlated with this wait (the usage is over 98% when the wait appears). More than likely, the log use is being throttled by our Azure service tier. I have not seen a situation where this wait arises and the log usage is low; we should make sure that querying the above dm_db_resource_stats is near the time we received the wait.
We can also look at the average DTU use from the query that Microsoft provides on the same page - I've added some columns to it to make it more meaningful, such as our DTU limit and average DTUs actually used:
SELECT end_time AS [EndTime] , (SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v)) AS [AvgDTU_Percent] , ((dtu_limit)*((SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v))/100.00)) AS [AvgDTUsUsed] , dtu_limit AS [DTULimit] FROM sys.dm_db_resource_stats
In the odd situation where the log usage is low (which I have not seen), we would want to know what our overall DTU usage is. What's especially helpful about the above query is that if we know the times we see close to the limit (in general), we can develop around these possible bottlenecks. Using the example of an ETL process, we might want to scale up the tier before the load, then scale down after the load. Likewise, if we're running ETLs throughout the day from APIs, we might want to consider delineating some of them from each other.
These are some other situations to consider as well:
- In OLTP use cases we will want to tune queries further, or scale them for the database tier and needs.
- Consider how we batch our ETL\replication\data writing and loading, along with how it's scheduled.
- We can move around Azure-SQL service tiers where we start at S3 and scale up to P2, then scale back down to S3.
- Related to the above item, it may be more efficient to scale out with smaller Azure-SQL databases at lower service tiers that can handle micro-loads over for a macro-load. For example, current costs at the time of this article, P2 is over five multiples more expensive than S3 ($150 a month vs. $930 a month), so scaling out three S3 instances might be more cost efficient if the loads are throughout the day and cannot be scheduled.
- When running into this wait, check the current avg_log_write_percent.
- I would suggest monitoring the DTU usage and waits to determine if increasing scale on a schedule reduces issues.
- Consider that in some cases, tuning queries further, adding indexes, etc. may be helpful.
About the author
View all my tips