I'm working on a service where users can set up webhooks to be notified about certain events. However, I'm unsure how to handle failures when sending these requests. For instance, if the receiving server is down for several hours, what should I do? Would it be better to just log the failure, retry later, use a task queue or message broker like Celery or RabbitMQ, or some combination of these options?
2 Answers
When it comes to handling webhook failures in a SaaS, I typically return a unique error code to the user and log that error in a dedicated database. This lets me track how often each error occurs over time. Troubleshooting is much easier when I can refer back to the error code, since it gives a clear indication of what went wrong.
A robust approach is to send webhooks from a background job queue such as BullMQ for Node.js or Celery for Python. This keeps slow or failing deliveries from blocking your main application. If a request fails, retry with increasing delays (exponential backoff), for example waiting 1 second, then 5 seconds, then 30 seconds, up to a maximum number of attempts or a time limit of around 24 hours. Be sure to log every attempt so users can investigate if needed. If delivery still fails after all retries, you can either mark it as failed and notify the user or let them retry manually. This setup gives you a reliable and scalable webhook system, even when the receiving server is down for extended periods.
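To make the retry schedule concrete, here is a minimal sketch in plain Python rather than any particular queue's API. The names `backoff_delays` and `deliver_with_retries`, the growth factor of 5, and the one-hour cap per wait are all assumptions for illustration; the schedule comes out as 1 s, 5 s, 25 s, which only approximates the 1/5/30 example above. `send` is a hypothetical callable that returns True when the receiver accepted the request.

```python
import time
from typing import Callable, List, Tuple


def backoff_delays(
    base: float = 1.0,
    factor: float = 5.0,
    max_delay: float = 3600.0,
    max_total: float = 24 * 3600.0,
) -> List[float]:
    """Return the wait in seconds before each retry.

    Each wait grows by `factor`, is capped at `max_delay`, and the schedule
    stops once the cumulative wait would exceed `max_total` (about 24 hours).
    """
    delays: List[float] = []
    total = 0.0
    delay = base
    while total + delay <= max_total:
        delays.append(delay)
        total += delay
        delay = min(delay * factor, max_delay)
    return delays


def deliver_with_retries(
    send: Callable[[], bool],
    delays: List[float],
    attempt_log: List[Tuple[int, bool]],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try once immediately, then once after each delay; log every attempt."""
    for attempt, wait in enumerate([0.0] + list(delays), start=1):
        if wait > 0:
            sleep(wait)
        try:
            ok = bool(send())
        except Exception:
            # Treat any error raised by the sender (timeout, refused
            # connection, ...) as a failed attempt.
            ok = False
        attempt_log.append((attempt, ok))
        if ok:
            return True
    return False  # Caller can now mark the delivery as failed.
```

In a real job queue you would normally reschedule the job with the next delay instead of sleeping inside a worker; the sleep here just keeps the sketch self-contained.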
I'm aiming for a clean implementation as well. I used Celery earlier this year and got criticized for overcomplicating things, so I'm cautious now. I'd prefer to just store failures and retry later until I either succeed or give up. By the way, why not opt for RabbitMQ?