How to Manage Token Limits with AWS Bedrock and Cohere v4 Multimodal?

Asked By CloudyExplorer37 On

I'm part of a team building an internal tool that uses Cohere's v4 multimodal model for embeddings, served through AWS Bedrock. Our documents are long and often contain many drawings. Each drawing consumes around 5,000 tokens, so a single document can burn through a lot of tokens quickly. We frequently hit the rate limit of 300k tokens per second, which is slowing down our testing phase. We could request a limit increase, but that feels like a temporary fix, especially since our user volume is still low. For anyone with experience using this model: is this a fundamental technical limitation at the moment, or are there strategies for handling token-heavy documents better? I'd really appreciate any information or reading material that could help. Thanks!

3 Answers

Answered By CohereCurious14 On

Is it essential for you to use Cohere? Since this is internal, have you thought about running a model locally? You might find that BGE-M3 served on Infinity gives you a better context window.

CloudyExplorer37 -

Going local isn't feasible for us right now. As far as I know, Cohere is the latest multimodal embedding model available on AWS in Europe.

Answered By BuilderAdvice101 On

Check out this resource on managing traffic spikes: https://builder.aws.com/content/34CVjaGLlDJXGBUv15vR3dLnoy2/managing-traffic-spikes-with-amazon-bedrock-why-traditional-retry-patterns-fail-part-1

It has some solid info, and the approach seems to have been tested heavily with positive results. Part 2, with sample code, is coming soon if you can be patient.

Answered By TokenTamer98 On

Have you thought about whether this needs to be real-time? Instead of processing everything instantly, batch the requests and manage the rate on your end to stay under the token limit. It could help ease the pressure.
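To make the "manage the rate on your end" part concrete, here is a minimal sketch of a client-side token bucket in Python. It is not Bedrock-specific and makes no API calls. The names `TokenBudget` and `estimate_tokens` are made up for illustration, the 5,000 tokens per drawing figure is the estimate from the question, and you would need to supply your own count for the text portion of each document.

```python
import time


def estimate_tokens(num_drawings, text_tokens, tokens_per_drawing=5000):
    """Rough cost of one document. 5000 per drawing is the asker's estimate."""
    return num_drawings * tokens_per_drawing + text_tokens


class TokenBudget:
    """Token bucket: refills `rate` tokens per second, holds at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.available = capacity  # start with a full bucket
        self.clock = clock         # injectable so the limiter can be tested
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Credit the tokens earned since the last check, capped at capacity.
        now = self.clock()
        earned = (now - self.last) * self.rate
        self.available = min(self.capacity, self.available + earned)
        self.last = now

    def acquire(self, cost):
        """Block until `cost` tokens are available; return seconds waited."""
        if cost > self.capacity:
            # A request larger than the bucket can never be satisfied.
            raise ValueError("request exceeds bucket capacity; split it up")
        waited = 0.0
        while True:
            self._refill()
            if self.available >= cost:
                self.available -= cost
                return waited
            # Sleep just long enough for the shortfall to refill.
            delay = (cost - self.available) / self.rate
            self.sleep(delay)
            waited += delay
```

You would call `budget.acquire(estimate_tokens(...))` right before each embedding request, so a backlog of documents drains at a steady pace instead of tripping throttling. Setting `rate` somewhat below your actual quota leaves headroom for estimation error; how much is a judgment call. Note that this only coordinates a single process, so several workers would need a shared limiter.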

DocumentHandler22 -

That's definitely something we're considering. Our users expect a service that's nearly real-time, similar to tools like Claude. But I agree, educating them on how it works could help a lot!
