I'm building a product using Azure OpenAI for legal and compliance document reviews. Unfortunately, due to regulatory requirements, I have to stick with Azure OpenAI and can't switch to OpenAI directly. As a small startup with funding, I'm willing to explore more serious or costly contract options if it helps.
The challenge is that my workload can be heavy, leading to spikes in token usage. To run effectively in production, I need somewhere between 1.5M to 2M tokens per minute using the o4-mini model.
Currently, with a regular pay-as-you-go subscription, my o4-mini deployments max out at about 200k tokens per minute. I've seen mention in Microsoft documentation about reaching 1M tokens per minute for certain contracts, but I can't seem to get even close to that.
I've attempted several routes: submitting quota increase forms multiple times, logging support tickets (where support tries to close them rather than help), and speaking to Microsoft reps who have offered apologies but no clear solutions.
I'm reaching out to anyone who has successfully scaled Azure OpenAI to ask:
1. Are you able to achieve 1M+ TPM on any model? If so, how?
2. Did you have to switch to a specific contract type like MCA or Enterprise?
3. Was there a particular team or person at Microsoft that was instrumental in helping you, such as an account manager or a special Azure OpenAI team?
4. Did you need to commit to a specific monthly spending amount or contract duration to unlock higher limits?
5. Are the tokens-per-minute figures presented in the docs achievable for smaller companies, or are they primarily for larger enterprises?
6 Answers
If your tenant is based in Europe, I might be able to help with your quota issues. Feel free to reach out!
Consider the Global Standard deployment type; it can provide up to 30M tokens per minute on some models, which might fit your needs better than regional standard deployments.
First, check the region where your deployment is. Some regions are more congested than others. You might also want a PTU (Provisioned Throughput Unit) deployment, which reserves dedicated capacity. There's a chance more capacity will become available after the Black Friday rush ends.
My main hurdle has been with the embedding model—I'm limited to just 350K TPM in East US. Plus, it's frustrating that there's no batch API available for the embedding models, while there's one for inference models!
What region are you in? We actually switched to Sweden specifically for OpenAI and managed to get 10M TPM on GPT-4.1 and GPT-5. It's all about the region!
One way to boost your effective limits is to put Azure API Management (APIM) in front of multiple backends in different regions or subscriptions and distribute the load across them. This lets you scale horizontally across regions and vertically with additional subscriptions in the same region. If you're interested, I can share some Terraform code for that when I have the chance!
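Until then, here's a minimal client-side sketch of the same spillover idea. All names below are hypothetical; each "backend" stands in for a thin wrapper around an Azure OpenAI deployment in a different region or subscription, and APIM would do the equivalent server-side with a policy that retries the next backend on HTTP 429.

```python
class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from a deployment whose TPM quota is exhausted."""


def call_with_spillover(backends, prompt):
    """Try each backend in order and return the first successful response.

    `backends` is a list of callables, e.g. wrappers around AzureOpenAI
    clients pointed at deployments in different regions or subscriptions.
    On a rate-limit error, spill over to the next backend instead of failing.
    """
    last_exc = None
    for backend in backends:
        try:
            return backend(prompt)
        except RateLimitError as exc:
            last_exc = exc  # this deployment is throttled; try the next one
    # Every backend was throttled (or none were configured).
    raise last_exc if last_exc else RuntimeError("no backends configured")
```

In production you'd also want weighted routing and cooldown tracking per backend, which is exactly what the APIM load-balancer policy gives you without client changes.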

I'm really interested in that code as well. It sounds like a solid approach!