I'm relatively new to development, and I've just experienced a shocking increase in my server bill due to heavy traffic from scraper bots. While legitimate user traffic seems normal, my logs show that these bots have been hitting my site non-stop. Thankfully, they didn't compromise any data, but the spike in costs is tough to handle. I've set up a web application firewall (WAF) and added the bots' IPs to a blocklist, but I'm wondering if that's enough. How do you all keep your setups secure to avoid surprise billing like this?
4 Answers
Using a WAF with an IP blocklist is a good start, but remember that scraper bots often rotate their IPs, so a static blocklist quickly becomes ineffective. I suggest implementing rate limiting at the edge, whether through Cloudflare or AWS WAF rate-based rules. That way you stop those requests before they ever hit your infrastructure. Also, if you have any API endpoints that don't require authentication, be sure to secure those too, since scrapers often target them. It's also a smart move to set up AWS Budgets alerts; they can notify you at, say, 50% or 80% of your expected spend, so you can react before the bill gets out of hand. If you're using a load balancer or NAT gateway, that's likely where a lot of these costs are coming from, so keep an eye on those as well.
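Edge providers implement rate limiting for you, so you'd normally just configure a rate-based rule rather than write code. But as a rough illustration of what such a rule does under the hood, here's a minimal per-IP token-bucket sketch (the class name, capacity, and refill rate are all made up for the example):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Minimal per-IP token bucket: each client starts with `capacity`
    tokens, refilled at `rate` tokens per second; a request spends one."""
    def __init__(self, capacity=100, rate=10.0):
        self.capacity = capacity
        self.rate = rate
        # Each IP maps to (remaining tokens, timestamp of last check).
        self.buckets = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, ip):
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[ip] = (tokens, now)
            return False  # over the limit: respond with HTTP 429
        self.buckets[ip] = (tokens - 1, now)
        return True
```

A managed rate-based rule does essentially this at the edge, per source IP, before the request reaches your origin.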
We employ custom regex rulesets within our WAF to filter out unwanted user agents, which keeps most scrapers at bay. A lot of scrapers use tools like curl, wget, or Python scripts, so we include those user agents in our rules. Managed firewall rules are also quite effective against known malicious IPs, and rate limiting has been useful for us too; we fill our WAF rule quota with a mix of custom and managed rules, and that combination has been doing a great job. For better monitoring, consider enabling Bot Control to get visibility into the non-human traffic hitting your site. Finally, regularly audit the IP addresses you block, since some scrapers spoof their user agents. Use Athena queries on your WAF logs to analyze which IPs are consuming your resources, and don't forget to set up billing alerts and CloudWatch alarms for anything unusual.
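The user-agent matching described above amounts to a case-insensitive substring rule. Here's a hedged sketch of the idea in Python (the pattern list is illustrative, not our actual ruleset); note the caveat in the answer that UAs are trivially spoofed, so this only catches the lazy scrapers:

```python
import re

# Illustrative deny-list mirroring a WAF string-match rule:
# block common scraping clients by User-Agent substring.
SCRAPER_UA = re.compile(r"(curl|wget|python-requests|scrapy|httpclient)",
                        re.IGNORECASE)

def is_scraper(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known scraping tool."""
    if not user_agent:  # many bots send no User-Agent at all
        return True
    return bool(SCRAPER_UA.search(user_agent))
```

Pair a rule like this with rate limiting and log auditing, since anything it blocks can reappear under a browser-like UA.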
Consider using Cloudflare for caching and additional bot protection. Their caching and bot-mitigation features can significantly reduce unwanted bot traffic and lighten the load on your origin, which in turn saves on bandwidth costs.
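Getting the CDN to absorb repeat hits mostly comes down to sending cacheable response headers. As a rough sketch (the helper name, paths, and TTL values are made-up examples, not a recommendation for your app):

```python
# Illustrative Cache-Control policy: long-lived caching for versioned
# static assets, a short edge-only TTL for HTML so the CDN absorbs
# repeat bot hits without serving stale pages to browsers.
def cache_headers(path: str) -> dict:
    if path.endswith((".css", ".js", ".png", ".jpg", ".woff2")):
        # Versioned assets: cache for a year at both browser and edge.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # HTML: browsers revalidate (max-age=0), edge caches 5 min (s-maxage).
    return {"Cache-Control": "public, max-age=0, s-maxage=300"}
```

With headers like these, most scraper requests terminate at the CDN edge instead of generating origin bandwidth.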
It's crucial to provide more context: what kind of server are you running? For instance, an AWS EC2 instance itself costs the same whether it's serving one user or a million, since you're charged by the hour rather than per request; it's usually the outbound data transfer that scales with traffic. Clarifying exactly which line items spiked will get you more tailored advice.
Yeah, it's probably the load balancer and/or NAT traffic that's racking up those costs.
Here's my current setup: AWS Amplify for web hosting, AWS CloudFront as a CDN, AWS WAF for firewall and bot protection, and AWS Lightsail for my database.

The bandwidth transfer is what's really driving the cost up for us.