I've managed to run Llama 3.2 on AWS Lambda with cold starts consistently under 500ms. I was hesitant to pay for a dedicated SageMaker endpoint or a reserved EC2 instance for an internal chatbot that only gets about 40-60 hits a day; it felt like wasted money. At first, Lambda's cold starts were unbearable: loading the ~4GB model took around 40 seconds, which made it useless for chat. After a weekend of experimenting, though, I found a workable setup.
Two changes made the difference. First, Lambda allocates CPU in proportion to memory. My initial 4GB configuration gave sluggish inference, but maxing the function out at 10,240MB, which unlocks 6 vCPUs, pushed throughput to 18-22 tokens per second. Second, I took the disk out of the cold start entirely. SnapStart would be the obvious fix, but for Python it doesn't support container images, and a zip deployment can't hold a ~4GB model anyway. Instead, I used `memfd_create` to stream the model bytes from S3 straight into RAM; `llama.cpp` opens the resulting descriptor as if it were a regular file without ever touching disk, which effectively killed the cold start problem.
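For anyone curious, here's a minimal sketch of the in-RAM loading trick. It assumes `llama-cpp-python` inside a container image, and the bucket/key names are placeholders:

```python
import os
import boto3
from llama_cpp import Llama

BUCKET = "my-model-bucket"       # placeholder
KEY = "llama-3.2-3b-q4.gguf"     # placeholder

def load_model_into_ram() -> Llama:
    # Anonymous RAM-backed file descriptor; nothing ever hits /tmp
    fd = os.memfd_create("model")
    body = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)["Body"]
    # closefd=False keeps the fd alive for llama.cpp after the writer exits
    with os.fdopen(fd, "wb", closefd=False) as f:
        for chunk in body.iter_chunks(chunk_size=16 * 1024 * 1024):
            f.write(chunk)
    # /proc/self/fd/<fd> looks like an ordinary path but is backed by RAM
    return Llama(model_path=f"/proc/self/fd/{fd}", n_threads=6)

# Load at module scope so it runs once per sandbox, during the init phase
llm = load_model_into_ram()
```

The memory itself is just the function setting (`aws lambda update-function-configuration --memory-size 10240`). Two caveats: the memfd counts against the function's 10GB memory cap alongside the KV cache, and the single-stream S3 GET can become the bottleneck; `download_fileobj` may be worth trying if you need parallel part downloads.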
I've written up the full architecture and code snippets separately; if anyone's interested, I can share the link in the comments. Also, has anyone tried Durable Functions instead of Step Functions for orchestration? The simpler state management is tempting, but I'm wary of the cost.
5 Answers
Is this setup suitable for something like a city tour chatbot or an image recognition app? And how do the costs compare with Bedrock?
Have you thought about using the AgentCore runtime? I've found it tends to be cheaper once you need more than 3GB of memory.
I have noticed that too! I chose to stick with Lambda for this PoC because it keeps the memory needs low—only using 512MB. It's all about minimizing overhead for now, but I'll definitely keep agentcore in mind for more complex applications.
Why not use Bedrock instead? It has far more model options, would likely be faster, and might even work out cheaper.
Good question! I did consider Bedrock. For this use case I wanted cost predictability and more control over the runtime. Even at low volume, per-token costs add up quickly, especially with retries during testing. For a customer-facing product or higher traffic, I'd probably lean toward Bedrock.
Just a heads-up: CPU scales with RAM, but so does your GB-second rate, so each millisecond of duration costs more. If the invocation finishes proportionally faster, it can come out a wash.
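To put rough, illustrative numbers on it (the x86 rate is about $0.0000166667 per GB-second): 4GB × 30s = 120 GB-seconds, and 10GB × 12s = 120 GB-seconds, i.e. identical cost. If 2.5× the memory speeds the invocation up by more than 2.5×, the bigger function is actually cheaper.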
Exactly! Cold start latency was a bigger concern for us than shaving cost per millisecond, and at this usage level the monthly spend stayed reasonable either way.
Sounds fascinating! I'd definitely like to see more details on your architecture and the implementation. Can you share the link?
Absolutely! I can drop the link in the comments for anyone interested.

This specific project may not map directly onto your needs, since I ended up using Bedrock for some of the models. For a city tour chatbot you'd probably want a larger model for better conversational flow, and for image recognition take a look at Amazon Rekognition or the multi-modal models in Bedrock.