I'm dealing with a frustrating situation where I have a query that runs for about 30 minutes, retrieving around 50 million rows from an Azure SQL database. It's currently executing on a single thread and does an index seek on a clustered index, filtering for the current year. Unfortunately, while this query runs, the database becomes unusable for others because it's consuming all the available Data IO. I've noticed that Azure SQL seems to throttle me at just over 2,000 IOPS on my General Purpose SKU with 8 vCores.
I've been told that upgrading to a business-critical SKU could get me three times the IOPS, but that's not an option for me right now. I'm looking for ways to throttle this query even further to ensure other users can complete their tasks without running into timeout issues. Ideally, I wouldn't mind if this specific query took 100 minutes instead of 30, as long as it allows for smoother concurrent access.
Another hurdle is that this query is generated by an Apache Spark connector, so I can't use query hints, only table and join hints. However, I do have some control over initializing the SQL session with preparations for this query.
2 Answers
You have a few strategies you might want to consider. First, creating a clustered columnstore index can significantly compress the table, so the same rows need far less IO to read, though note it may not be ideal if your table frequently gets updates. Secondly, if you can, try page compression on the table; this is a bit friendlier for inserts and updates and could shrink the table, and therefore the IO to scan it, by roughly 30%, depending on your data types.
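Both options are a single DDL statement. A minimal sketch, assuming a hypothetical table `dbo.FactSales` (substitute your own table and index names):

```sql
-- Option 1: convert the rowstore clustered index to a clustered columnstore index.
-- The index name must match the existing clustered index when using DROP_EXISTING.
-- Columnstore compression often shrinks large fact tables dramatically, cutting scan IO.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
    ON dbo.FactSales
    WITH (DROP_EXISTING = ON);

-- Option 2: keep the rowstore layout but enable page compression,
-- which is gentler on frequent inserts and updates.
ALTER TABLE dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = PAGE);
```

Either rebuild is itself IO-heavy, so run it in a quiet window.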
Additionally, have you looked into Query Store hints? You can attach a hint to a specific query in the Query Store without touching the application that generates it. If needed, also test running your process on Hyperscale before considering the leap to a Business Critical SKU; it's more cost-effective and supports read replicas.
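A minimal sketch of attaching a Query Store hint, assuming you've identified the statement by a fragment of its text (the table name and `query_id` below are illustrative):

```sql
-- Find the query_id of the offending statement in Query Store.
SELECT q.query_id, qt.query_sql_text
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt
    ON q.query_text_id = qt.query_text_id
WHERE qt.query_sql_text LIKE N'%FactSales%';

-- Pin a hint to that query_id (42 is an example) without changing the application.
EXEC sys.sp_query_store_set_hints
    @query_id = 42,
    @query_hints = N'OPTION (MAXDOP 1)';
```

The hint persists across plan recompiles until you remove it with `sys.sp_query_store_clear_hints`.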
Unfortunately, Azure SQL Database doesn't expose Resource Governor to throttle queries directly like you're hoping, so you may also want to think about batching your query into smaller parts and using a loop to control the flow of data being pulled in.
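The batching idea can be sketched as a key-range loop with a deliberate pause between batches so other sessions get IO headroom. All table and column names here are illustrative, and the batch size and delay are tuning knobs:

```sql
-- Pull the year's rows in key-range batches, pausing between batches.
DECLARE @from bigint = 0, @to bigint, @batch bigint = 500000;

SELECT @to = MAX(id) FROM dbo.FactSales WHERE sale_year = 2024;

WHILE @from <= @to
BEGIN
    SELECT *
    FROM dbo.FactSales
    WHERE sale_year = 2024
      AND id >  @from
      AND id <= @from + @batch;

    SET @from += @batch;

    WAITFOR DELAY '00:00:02';  -- artificial pause to leave IOPS for other sessions
END;
```

Ranging on the clustered key keeps each batch an index seek rather than repeated scans, which is what makes this cheaper than `OFFSET`/`FETCH` paging at 50 million rows.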
Another option could be to use an ETL tool to process the data in batches. That way, you spread out the load over time rather than hitting the database all at once. Mirroring the data from SQL into Microsoft Fabric could also be beneficial, depending on your needs and pricing.
Thanks for these suggestions! For the hints, is there any specific one you recommend that might help slow down the query to preserve IOPS? I thought about MAXDOP, but since it’s running single-threaded already, I'm leaning towards batching with some artificial delays to help manage the load. What do you think?