Email testing in CI without burning real inboxes
How to wire programmable email into GitHub Actions, CircleCI, and GitLab — with parallelization, secret scoping, and concurrency that does not blow up at 50 workers.
Most teams do not have a CI email-testing problem so much as an accidental CI email-testing architecture. The first test that needs an OTP gets pointed at a personal Gmail. The second one points at qa@yourcompany.com. By the time the third arrives, someone has stood up Mailpit in a sidecar container, and the team writing the fourth gives up and mocks. None of these are good.
This post walks through the actual shape of email testing in CI in 2026, with concrete configs for GitHub Actions, CircleCI, and GitLab CI, plus the parallelism and secret-management patterns that hold up when your test suite grows from 5 email tests to 500.
The four CI failure modes for email tests
Before the configs, the failure modes — because the configs only matter to the extent that they avoid these.
Failure 1: shared inbox contention
Two tests run in parallel, both wait for “the latest email at qa@yourcompany.com,” and the second one reads the OTP intended for the first. Test A passes. Test B fails on a verification-code-mismatch error that has nothing to do with the change under test.
Failure 2: stale fixtures from yesterday’s run
A regression test reads “the most recent email at the QA address” and gets one from yesterday’s CI run that nobody cleaned up. The test passes against stale data. Real bug ships.
Failure 3: secret leaks via fixture data
Someone hard-codes an API key into a test fixture, the fixture gets logged at debug level, the log gets shipped to Sentry, and the API key now lives in three places it should not. Rotating the key does not scrub any of those copies, and until someone rotates it, each copy is a working credential.
Failure 4: provider rate limits caused by parallelism
Your test suite hits a single email provider with 50 parallel workers. The provider’s per-second rate limit is 30. Tests fail intermittently for “no email arrived” reasons, which look like flakes, which get retried, which makes the rate limit worse.
Every CI pattern below is shaped to avoid all four.
The right shape
Three properties make a CI email-testing setup reliable.
- Per-test inbox. Every test gets a fresh address. No sharing, no contention, automatic cleanup via TTL.
- Long-poll waiter. The test blocks on a single HTTP request that resolves the moment the email arrives. No sleep(), no polling loop, no race with stale fixtures.
- Per-pipeline scoped credentials. Each CI pipeline has its own scoped API key. A leak in staging E2E does not blast-radius into production.
What this looks like in practice:
import { CatchOTP } from '@catchotp/sdk';
const otp = new CatchOTP({ apiKey: process.env.CATCHOTP_KEY! });
const inbox = await otp.inboxes.create({ mode: 'ephemeral', ttlMinutes: 10 });
const code = await otp.inboxes.waitForOtp(inbox.id, { timeoutSeconds: 30 });
Three lines after the import. The CI integration is mostly about wiring the secret in and respecting concurrency caps.
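The same calls can be wrapped in a small helper so that every test gets its own inbox without repeating the setup. What follows is a minimal sketch, not the SDK's own API: `OtpClient` is a hand-written structural type that mirrors only the two calls shown above, the `address` field and the string return type of `waitForOtp` are assumptions, and `signUpAndReadOtp` and `triggerSignup` are hypothetical names.

```typescript
// Structural type mirroring only the SDK calls used in this post.
// The return shapes (id/address, string code) are assumptions.
interface Inbox {
  id: string;
  address: string;
}

interface OtpClient {
  inboxes: {
    create(opts: { mode: 'ephemeral'; ttlMinutes: number }): Promise<Inbox>;
    waitForOtp(inboxId: string, opts: { timeoutSeconds: number }): Promise<string>;
  };
}

// Hypothetical helper: one fresh inbox per call, then a single blocking wait.
async function signUpAndReadOtp(
  client: OtpClient,
  triggerSignup: (address: string) => Promise<void>,
): Promise<{ inbox: Inbox; code: string }> {
  const inbox = await client.inboxes.create({ mode: 'ephemeral', ttlMinutes: 10 });
  // The application under test sends its OTP email to the fresh address here.
  await triggerSignup(inbox.address);
  const code = await client.inboxes.waitForOtp(inbox.id, { timeoutSeconds: 30 });
  return { inbox, code };
}
```

Because the client is passed in rather than imported, the helper can be exercised against a fake in unit tests and against the real SDK instance in E2E runs.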
GitHub Actions
The canonical config for a Playwright test suite that uses email OTP.
name: e2e
on: [pull_request, push]
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      # pnpm must be on PATH before setup-node can resolve its cache directory.
      # Assumes package.json pins pnpm via the "packageManager" field.
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps chromium
      - name: Run tests
        env:
          CATCHOTP_KEY: ${{ secrets.CATCHOTP_KEY_CI }}
        run: pnpm test:e2e --shard=${{ matrix.shard }}/4
Three things to notice.
Shard across runners and keep per-runner workers low. GitHub’s matrix gives you four parallel runners. Each runner runs Playwright with workers=1 or workers=2 (Playwright defaults to half the logical CPU count, which is small on shared CI runners). Total concurrent inboxes ≈ shards × workers, usually 4 × 2 = 8. That is comfortably under the Pro plan’s 50-inbox cap.
Use a scoped key. CATCHOTP_KEY_CI is a separate API key from CATCHOTP_KEY_LOCAL and CATCHOTP_KEY_PROD. Rotate them independently. A leak in CI logs does not require rotating the production key.
Don’t fail-fast. When one shard fails because of a real bug, you want the other shards to keep running so you can see the full picture in one CI cycle. fail-fast: false is the right default here.
The wrong way (don’t do this)
# DO NOT COPY THIS
strategy:
  matrix:
    shard: [1, 2, ..., 50] # 50 parallel runners
env:
  CATCHOTP_KEY: ${{ secrets.CATCHOTP_KEY }} # one key everywhere
  TEST_EMAIL: qa@yourcompany.com # shared inbox
This is the four-failure-modes config in three lines.
CircleCI
CircleCI’s parallelism is per-job, not per-matrix. The shape is similar but the syntax is different.
version: 2.1
jobs:
  e2e:
    docker:
      - image: cimg/node:20.18-browsers
    parallelism: 4
    steps:
      - checkout
      - run: corepack enable
      - run: pnpm install --frozen-lockfile
      - run: pnpm exec playwright install --with-deps chromium
      - run:
          name: e2e
          command: |
            # CircleCI does not interpolate variables inside `environment:`,
            # so map the context secret to the SDK's variable in the shell.
            export CATCHOTP_KEY="$CATCHOTP_KEY_CI"
            pnpm test:e2e \
              --shard=$((CIRCLE_NODE_INDEX + 1))/$CIRCLE_NODE_TOTAL
workflows:
  test:
    jobs:
      - e2e:
          context: catchotp-ci
Two CircleCI-specific notes:
Use a context, not a per-job env var. CircleCI Contexts give you per-pipeline scoped secret management with audit. The catchotp-ci context can be locked down to specific branches (e.g., only main and PRs from contributors with write access).
Mind the parallelism semantics. CircleCI’s parallelism: 4 runs four copies of the same job. Inside each copy, CIRCLE_NODE_INDEX is 0..3 and CIRCLE_NODE_TOTAL is 4. Pass them through to your test runner’s sharding flag.
GitLab CI
GitLab’s parallelism is parallel: N. The variable management is different again.
e2e:
  image: node:20-bullseye
  parallel: 4
  before_script:
    - corepack enable
    - pnpm install --frozen-lockfile
    - pnpm exec playwright install --with-deps chromium
  script:
    - pnpm test:e2e --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL
  variables:
    CATCHOTP_KEY: $CATCHOTP_KEY_CI
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"
Inside the runner, $CI_NODE_INDEX is 1..N (note: 1-indexed, unlike CircleCI). Pass it through accordingly.
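The off-by-one between the two providers is easy to get wrong when a test script has to run on both. One way to keep it in a single place is a small function that turns the CI variables into the `k/N` value the sharding flag expects. This is a sketch under assumptions: `shardArg` and `formatShard` are hypothetical names, and the fallback to `1/1` when no CI variables are present is a design choice rather than anything the providers mandate.

```typescript
type Env = Record<string, string | undefined>;

// Validates a 1-based shard index against the total and formats it as "k/N".
function formatShard(index: number, total: number): string {
  if (!Number.isInteger(index) || !Number.isInteger(total) || index < 1 || index > total) {
    throw new Error(`invalid shard ${index}/${total}`);
  }
  return `${index}/${total}`;
}

// Builds the value for --shard from CI-provided variables.
// CircleCI: CIRCLE_NODE_INDEX is 0-based. GitLab: CI_NODE_INDEX is 1-based.
function shardArg(env: Env): string {
  if (env.CIRCLE_NODE_TOTAL !== undefined) {
    const zeroBased = Number(env.CIRCLE_NODE_INDEX ?? '0');
    return formatShard(zeroBased + 1, Number(env.CIRCLE_NODE_TOTAL));
  }
  if (env.CI_NODE_TOTAL !== undefined) {
    const oneBased = Number(env.CI_NODE_INDEX ?? '1');
    return formatShard(oneBased, Number(env.CI_NODE_TOTAL));
  }
  // No CI parallelism detected: run everything as a single shard.
  return formatShard(1, 1);
}
```

A wrapper script could call `shardArg(process.env)` and pass the result to the test runner, so the YAML on each provider stays free of index arithmetic.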
GitLab’s masked variables protect against the secret accidentally appearing in logs. Mark CATCHOTP_KEY_CI as masked at the project or group level. This is essentially free defense-in-depth.
Concurrency: the math you need to do once
Inboxes are the resource that gets capped. Plans typically expose:
| Plan | Concurrent inboxes |
|---|---|
| Free | 5 |
| Pro | 50 |
| Team | 500 |
The math: total parallel CI workers × inboxes per worker ≤ your cap.
The most common shape:
- Solo developer / small project on Free: 4 shards × 1 inbox per shard = 4. Comfortably under 5.
- Mid-size project on Pro: 8 shards × 4 workers per shard × 1 inbox = 32. Fits under 50 with headroom.
- Large org on Team: dozens of pipelines, each with its own scoped key, contributing to a 500-inbox shared pool. Per-pipeline alerts on approaching the cap.
If your math says “we’d need 60 concurrent inboxes on Pro,” the answer is either upgrading to Team or throttling the email-using tests: tag them @otp in Playwright and run that subset (--grep @otp) with a lower worker count, such as --workers=4, so that shards × workers stays under the cap.
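The arithmetic is simple enough to encode as a pre-flight check. The sketch below uses the caps from the table above; the function and type names are hypothetical. It also assumes an inbox stops counting against the cap once its test finishes, which the arithmetic in this post implies but does not state.

```typescript
// Plan caps copied from the table above.
const PLAN_CAPS = { free: 5, pro: 50, team: 500 } as const;
type Plan = keyof typeof PLAN_CAPS;

interface Layout {
  shards: number;          // parallel CI runners
  workersPerShard: number; // test-runner workers inside each runner
  inboxesPerTest: number;  // usually 1
}

// Peak inboxes in flight: tests inside a worker run one at a time,
// so each worker holds at most `inboxesPerTest` inboxes.
function inboxesInFlight(layout: Layout): number {
  return layout.shards * layout.workersPerShard * layout.inboxesPerTest;
}

// Positive: spare capacity. Negative: the layout exceeds the plan's cap.
function headroom(plan: Plan, layout: Layout): number {
  return PLAN_CAPS[plan] - inboxesInFlight(layout);
}
```

For the mid-size example above, `headroom('pro', { shards: 8, workersPerShard: 4, inboxesPerTest: 1 })` is 50 − 32 = 18. A layout that needs 60 inboxes on Pro comes out at −10, which is the signal to throttle or upgrade.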
Secrets management
Three rules cover most teams.
Rule 1: per-pipeline scoped keys
Create a separate API key for each pipeline that needs one. Naming convention: <service>-<env>-<purpose>. Examples: signup-staging-e2e, billing-prod-smoke. The key is scoped to that pipeline’s CI variables and nowhere else.
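A naming convention only helps if something enforces it. As a rough sketch, a check like the one below could run wherever keys are provisioned. `parseKeyName` is a hypothetical helper, and the pattern assumes each of the three segments is lowercase alphanumeric with no internal hyphens, which is stricter than the convention itself requires.

```typescript
interface KeyName {
  service: string;
  env: string;
  purpose: string;
}

// Assumption: exactly three lowercase alphanumeric segments joined by hyphens.
const KEY_NAME_PATTERN = /^([a-z0-9]+)-([a-z0-9]+)-([a-z0-9]+)$/;

// Returns the parsed segments, or null when the name breaks the convention.
function parseKeyName(name: string): KeyName | null {
  const match = KEY_NAME_PATTERN.exec(name);
  if (match === null) return null;
  const [, service, env, purpose] = match;
  return { service, env, purpose };
}
```

With this pattern, `signup-staging-e2e` and `billing-prod-smoke` both parse, while a catch-all name such as `CATCHOTP_KEY` is rejected.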
Rule 2: rotate on the same cadence as everything else
Add the catchotp keys to the same rotation list as your AWS access keys, your Stripe keys, and your provider tokens. Quarterly is the most common cadence.
Rule 3: never log the key
This sounds obvious; in practice, every team has at least one place where it gets logged. The two common patterns:
// BAD — logs the entire env including secrets
console.error('test failed', { env: process.env });
// BAD — logs the SDK config which includes the key
console.error('client config', otp.config);
// GOOD — log specific safe values
console.error('test failed', { inboxId: inbox.id, address: inbox.address });
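The single most useful prevention: a CI step that scans test output for likely secret patterns and fails the build if it finds any. gitleaks can scan captured log files as well as the repository. GitHub’s native secret scanning is a useful backstop, but it covers repository content rather than job logs.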
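A cheaper complement to pattern-based scanners is to check captured output for the exact secret values the pipeline already knows. The sketch below is illustrative only: `findLeakedSecrets` and `redact` are hypothetical helpers, and the eight-character minimum is an arbitrary guard against matching short or empty values.

```typescript
type Secrets = Record<string, string | undefined>;

// Arbitrary guard: ignore values too short to be a real key.
const MIN_SECRET_LENGTH = 8;

// Returns the names of secrets whose exact values appear in the output.
function findLeakedSecrets(output: string, secrets: Secrets): string[] {
  return Object.entries(secrets)
    .filter(([, value]) =>
      value !== undefined && value.length >= MIN_SECRET_LENGTH && output.includes(value))
    .map(([name]) => name);
}

// Replaces every occurrence of each known secret value with a labeled placeholder.
function redact(output: string, secrets: Secrets): string {
  let result = output;
  for (const [name, value] of Object.entries(secrets)) {
    if (value !== undefined && value.length >= MIN_SECRET_LENGTH) {
      result = result.split(value).join(`[REDACTED:${name}]`);
    }
  }
  return result;
}
```

A post-test step could run `findLeakedSecrets` over the collected log and fail the job when the returned list is non-empty. Unlike a pattern scanner, it only catches secrets it was handed, so it supplements a tool like gitleaks and does not replace it.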
Parallelization patterns
The framework-level parallelism layered with the CI-level parallelism is where teams get confused. The mental model:
CI (sharding) → 1 of 4 shards
Test runner (workers) → 2 of 2 workers in this shard
Test → 1 inbox per test
Total inboxes in flight = shards × workers. Tests inside a worker run sequentially.
Two anti-patterns to avoid:
- All sharding, no workers. 50 shards, 1 worker each. You pay 50× CI minutes for what could be 8 shards × 6 workers.
- All workers, no sharding. 1 shard, 50 workers. You hit one CI runner’s CPU cap and the runner thrashes.
The sweet spot is usually 4-8 shards with 2-6 workers each.
What about Mailpit, MailHog, Mailcatcher?
These tools are great for testing your application code. They are not great for testing your email integration.
The thing they cannot do: exercise the real DNS path. Your DKIM record, your SPF record, your transactional provider’s per-recipient reputation, your sending IP’s reputation — none of that is exercised when you point at localhost:1025. The bug “email lands in spam in production” stays uncaught.
The right pattern is to use both:
- Unit and component tests: Mailpit. Fast, free, no network.
- Integration and E2E tests: real DNS path via programmable email. Catches the things Mailpit cannot.
Different tools for different jobs.
How catchotp helps
We are the receive side that the configs above point at. Every plan, including Free, gives you per-test inbox isolation, sub-second long-poll waiters, and per-pipeline scoped API keys. The Pro tier covers most teams under the 50-inbox concurrency cap. The Team tier covers most engineering orgs.
Free tier: 5 inboxes, 1,000 messages a month, no credit card. Start free or view pricing for the full tier breakdown.
Related reading
- How to Test OTP Flows in 2026 — the full guide to OTP testing patterns and anti-patterns.
- How to E2E Test Sign-Up Flows With Real Emails — Playwright and Cypress walkthroughs.
- The E2E testing use case covers fixtures, retries, and parallelism in more depth.
The shorter version: per-test inbox, scoped key, four to eight shards, and a long-poll waiter. Do that and email tests stop being the flaky part of your CI.