Is Azure OpenAI’s Rate Limiting Working as Advertised?

Asked By TechyNinja93 On

I've been exploring Azure OpenAI's rate limiting, and it seems there's a big gap between what's documented and what's actually happening. After running some tests and tracking my usage closely, I noticed that the tokens-per-minute (TPM) limits don't match up with what Azure claims. Here's what I've got:

### My Setup
- **Token Management**: I have a system set to replenish 15,000 tokens every 250ms, which should mean 3.6M TPM.
- **API Calls**: I reserve 11,000 tokens for each call, but I usually only use about 9,000.
- **Buffer**: I keep a buffer of 1,500 tokens to avoid going over the limit.
- **Processing Method**: I'm processing documents one after the other while controlling token usage.
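For anyone wanting to reproduce this setup, here is a minimal sketch of the token-bucket scheme described above (15,000 tokens replenished every 250ms, 11,000 reserved per call, 1,500-token safety buffer). The class and method names are my own, not from any Azure SDK; it's client-side throttling only and assumes a single process:

```python
import threading
import time

class TokenBucket:
    """Client-side token bucket mirroring the setup above:
    refill 15,000 tokens every 250ms -> 3.6M tokens/minute."""
    REFILL_AMOUNT = 15_000    # tokens added per tick
    REFILL_INTERVAL = 0.25    # seconds between ticks
    CAPACITY = 3_600_000      # cap at one minute's worth of tokens
    SAFETY_BUFFER = 1_500     # headroom kept to avoid overshooting the quota

    def __init__(self):
        self.tokens = self.CAPACITY
        self.lock = threading.Lock()
        self.last_refill = time.monotonic()

    def _refill(self):
        # Credit any whole ticks that elapsed since the last refill.
        now = time.monotonic()
        ticks = int((now - self.last_refill) / self.REFILL_INTERVAL)
        if ticks:
            self.tokens = min(self.CAPACITY,
                              self.tokens + ticks * self.REFILL_AMOUNT)
            self.last_refill += ticks * self.REFILL_INTERVAL

    def try_reserve(self, amount=11_000):
        """Reserve tokens before an API call; False means back off."""
        with self.lock:
            self._refill()
            if self.tokens - amount >= self.SAFETY_BUFFER:
                self.tokens -= amount
                return True
            return False

    def release_unused(self, reserved, used):
        """Return the difference when a call used fewer tokens than reserved
        (e.g. reserved 11,000 but the response only consumed ~9,000)."""
        with self.lock:
            self.tokens = min(self.CAPACITY,
                              self.tokens + (reserved - used))
```

The reserve-then-release pattern matters: if you only counted actual usage after the fact, a burst of in-flight requests could collectively exceed the quota before any of them completed.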

### What I Expected
According to Azure's docs, I should be able to manage:
- 4M tokens per minute
- About 4 requests per second
- A stable processing rate that fits within their service limits

I'm on the S0 tier; I'm wondering if the actual quota depends on the specific deployment.

### Issues I've Found
I've discovered a few troubling points:
1. The effective TPM appears far lower than documented, possibly under 20% of the advertised limit.
2. There also seem to be additional throttling constraints that don't correlate with how many tokens I'm actually using.

### Seeking Help
I'm sharing this to:
1. Help others who might be facing the same issues.
2. Ask for clarity from Azure about their rate limiting.
3. Suggest they update their docs to reflect what's really happening.

My setup seems fine, so I think the problem lies with Azure's rate limiting rather than my code. Has anyone else noticed this mismatch? I'd love to hear from other devs or get official insight from Microsoft!

3 Answers

Answered By CloudExplorer88 On

What model are you using for this deployment? It can affect how the rate limits are applied. Just curious if the location of the model impacts your experience too.

Answered By TokenTamer47 On

I’ve seen discrepancies like this before! My guess is that additional factors, like the model location and tier, could be playing a role in your case.

Answered By DebuggerDude32 On

Yeah, I think the rate limits can vary based on regions and specific resources. Have you checked if any settings in your subscription might be affecting this? Sometimes there are underlying constraints that aren’t documented well.
