I've been testing signup flows that involve OTP and email verification, and I've run into flakiness in CI. The tests pass locally but fail intermittently in CI: emails take 3-5 seconds to arrive, the wrong OTP gets picked up, and multiple retry emails get sent. Instead of using mocks, I decided to run my tests with real emails and trace the entire flow, logging when the email is sent, when it arrives, and when the OTP is extracted. That has made it much easier to see what's actually going wrong. How do others manage email and OTP testing in their setups?
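To make the failures diagnosable, the idea is to record a timestamp at each stage and compare them afterwards. A minimal sketch of that kind of flow logging (the stage names and helper are illustrative, not from any particular framework):

```typescript
// Sketch: timestamped stage logging for an email/OTP signup test.
// Each mark() records when a stage happened; elapsed() measures latency
// between two stages (e.g. email sent -> email arrived).
type FlowEvent = { label: string; at: number };

function createFlowLog() {
  const events: FlowEvent[] = [];
  return {
    mark(label: string) {
      events.push({ label, at: Date.now() });
    },
    // Milliseconds between two recorded stages.
    elapsed(from: string, to: string): number {
      const a = events.find((e) => e.label === from);
      const b = events.find((e) => e.label === to);
      if (!a || !b) throw new Error(`missing stage: ${!a ? from : to}`);
      return b.at - a.at;
    },
    // Human-readable timeline for the CI log.
    dump(): string[] {
      return events.map((e) => `${new Date(e.at).toISOString()} ${e.label}`);
    },
  };
}
```

In a failing CI run, the dumped timeline shows immediately whether the email was slow, never arrived, or arrived but the extraction step failed.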
5 Answers
Debugging OTP flows in CI is definitely a test of patience! The passes-locally-but-fails-in-CI pattern usually comes down to the delay you mentioned. Once you switch from mocks to real emails, you do have to account for the propagation delay of the email service. Here are some methods I've found effective:
1. **Unique aliases**: Always generate a unique test address per run by adding a timestamp or UUID (e.g. `user+<timestamp>@yourdomain.com`). That way each test only ever collects the OTP from its own run, which eliminates the wrong-OTP problem.
2. **Polling with exponential backoff**: Instead of a hard-coded sleep, poll the inbox every 1-2 seconds under a generous overall timeout (30-45 seconds). Playwright's `expect.poll` works well here, since it keeps retrying while the mail server catches up without blocking the test on a fixed wait.
3. **Dedicated inbox services**: Tools like Mailosaur or Mailtrap are usually more reliable than a regular inbox; they return messages as clean JSON, which makes extracting the OTP with a regex straightforward.
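Points 1 and 2 above can be sketched together. The `fetchLatestOtp` parameter is a stand-in for whatever call your inbox provider exposes; everything here is illustrative, not a specific library's API:

```typescript
// (1) Unique per-run alias via plus-addressing, e.g. qa+1712345678-x9k2@example.com
function uniqueAlias(base: string): string {
  const [local, domain] = base.split("@");
  const tag = `${Date.now()}-${Math.random().toString(36).slice(2, 6)}`;
  return `${local}+${tag}@${domain}`;
}

// (2) Poll for the OTP with exponential backoff instead of a hard sleep.
async function pollForOtp(
  fetchLatestOtp: () => Promise<string | null>, // hypothetical inbox query
  timeoutMs = 45_000,
  initialDelayMs = 1_000,
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;
  while (Date.now() < deadline) {
    const otp = await fetchLatestOtp();
    if (otp) return otp;
    await new Promise((r) => setTimeout(r, delay));
    delay = Math.min(delay * 2, 8_000); // back off, but cap the interval
  }
  throw new Error(`No OTP within ${timeoutMs}ms`);
}
```

With Playwright specifically, `expect.poll(fetchLatestOtp).toBeTruthy()` with a `timeout` option covers the same ground as the polling loop.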
For the extraction itself I usually reach for a regex (I scaffold it in Cursor), since matching the code directly is far simpler than parsing the email's HTML. And tracking the full flow the way you described is exactly how you catch the timing bugs that keep ruining these tests.
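A minimal sketch of that regex extraction. The anchor words ("code" / "OTP") are an assumption; match them to your own email template:

```typescript
// Pull a 6-digit OTP out of an email body without parsing the HTML.
// Anchoring on nearby wording keeps stray 6-digit numbers (dates,
// order IDs in the footer) from being picked up by accident.
function extractOtp(body: string): string {
  const match = body.match(/(?:code|OTP)\D{0,20}(\d{6})/i);
  if (!match) throw new Error("No OTP found in email body");
  return match[1];
}
```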
Right? The polling with exponential backoff really saved me from tons of headaches with timing issues in CI!
I hit a similar scenario last year with magic links. Real Gmail in CI can become a trap: once tests run frequently enough, Google may rate-limit or delay you, and the failures look like either rate issues or plain latency. I moved to a dedicated inbox with a webhook plus a polling fallback, which cut my flaky rate to under 1%. Using a correlation ID and filtering by message timestamp fixed the wrong-OTP problem for good, and polling with `expect.poll` instead of a hard sleep helped as well. I'd still avoid mocks here: mocking is fine for unit tests, but end-to-end tests need to go through the real path to catch the regressions that would otherwise only show up in production.
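The correlation-ID-plus-timestamp filtering can be sketched like this. The `Message` shape is an assumption standing in for whatever JSON your inbox provider returns:

```typescript
// Pick the right message out of an inbox so a retry email or a leftover
// from an earlier run can never supply the wrong OTP.
type Message = { to: string; subject: string; receivedAt: number; body: string };

function pickMessage(
  messages: Message[],
  correlationId: string, // e.g. embedded in the plus-addressed alias
  sentAfter: number,     // only accept mail newer than when we triggered it
): Message {
  const candidates = messages
    .filter((m) => m.to.includes(correlationId) && m.receivedAt >= sentAfter)
    .sort((a, b) => b.receivedAt - a.receivedAt); // newest first
  if (candidates.length === 0) throw new Error("no matching message yet");
  return candidates[0];
}
```

Taking the newest match also handles the multiple-retry-emails case: a resent OTP invalidates the old one on most backends, so the latest message is the one you want.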
Setting up GreenMail as a full SMTP/IMAP/POP3 server is a solid approach: it lets you test without the complications of real email accounts. You can also run Maildev to capture outgoing mail, which keeps everything organized. A complete email integration in the development environment avoids the need for throwaway accounts and gives you a clean view of both sent and received messages. The caveat is that while this works great locally, real-world providers still introduce delivery delays and retries that you won't see in a controlled environment.
That setup seems great for local testing! I struggled when going from local to CI because real providers introduce a lot of variability like delivery delays and retries. Have you ever attempted to run this setup in CI with actual email providers, or mostly kept it for local dev?
Totally get that! It’s crucial to have reliable setups that mirror real-world usage.
Using real email can be a pain, but it's essential for spotting timing issues. What surprised me is how much delivery time varies even with the same provider: sometimes instant, sometimes 3-5 seconds. Logging the complete flow is what tells you whether you're looking at a delivery delay or an OTP parsing issue. Are you running your tests in CI or mostly in a staging environment?

I totally relate to that! The propagation delay is a killer: everything looks fine until you hit CI. I was polling earlier too, but often couldn't tell why a test failed, whether it was a delivery delay, an email that was never sent, or a parsing issue. Tracking the entire flow really clarified it for me. Do you still rely on polling, or have you considered an event-driven approach?