Building a programmable email API: lessons from catchotp
How we designed the catchotp inbound pipeline on AWS — SES quirks, why we chose EventBridge, and the trade-offs we'd make differently.
This is a system-design post. If you are evaluating catchotp from the integration side, the OTP testing guide is probably more useful. If you are building something similar — an inbound email service, a webhook fan-out, a low-latency long-poll API — the trade-offs below are the ones that mattered most.
The thesis in one line: the hard part of a programmable email API is not the API; it is the seven things that have to happen in 600ms between the SMTP DATA ack and the long-poll waiter resolving on a developer’s laptop.
The constraints
Three constraints shaped almost every decision.
- Sub-second p95 latency from email send to API client. The OTP-test use case is dead in the water at 5-second latency. Tests that do not run faster than a developer’s “cmd-K, run” reflex do not get adopted.
- Per-customer cost under 1 cent per OTP at the free tier. Free tier has to be actually free. That meant no per-message Lambda billing on the inbound side, which ruled out the obvious “SES → Lambda → DynamoDB” path.
- No moving infrastructure parts under high traffic. A small team cannot run a 24/7 SMTP cluster. We needed a managed receive side that does not require capacity planning.
The architecture below is what falls out of those three.
The pipeline at a glance
sender@example.com
        |
        v
AWS SES inbound
        |
        v
S3 (raw MIME)
        |
        v
SES → SNS topic
        |
        v
Lambda: parse + classify ----------> EventBridge bus
        |                                  |
        v                                  v
DynamoDB (messages, inboxes)        Long-poll API
        |                           (waiter resolves on EventBridge)
        v
HTTP API (read side)
Six components. SES handles the SMTP receive. S3 stores the raw MIME (90-day retention on Pro). A single Lambda parses, OTP-extracts, writes to DynamoDB, and emits to EventBridge. The HTTP API serves reads, and the long-poll waiter blocks on EventBridge.
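The seam between the parse Lambda and the bus is a single PutEvents call. A sketch of what that emit step might look like (AWS SDK v3; the detail-type and the fields beyond inboxId are illustrative, not the actual catchotp event schema):

// Emit side of the pipeline: the parse Lambda publishes one event per message.
// inboxId and messageId come from the parsed message; other fields are placeholders.
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const bus = new EventBridgeClient({});

await bus.send(new PutEventsCommand({
  Entries: [{
    EventBusName: 'catchotp-bus',
    Source: 'catchotp.inbound',
    DetailType: 'message.received',
    // Detail is a JSON string; the long-poll waiter's rule matches on detail.inboxId
    Detail: JSON.stringify({ inboxId, messageId, receivedAt: new Date().toISOString() }),
  }],
}));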
Why SES inbound (and what it costs you)
SES inbound was the only realistic choice for “managed SMTP receive on AWS.”
The alternatives we considered:
- Self-hosted Postfix. Real cost: an on-call rotation. We are a small team. Hard pass.
- Cloudflare Email Routing. Forwards-only at the time we built this. No native API hook into the message body.
- A third-party SMTP provider. Two services in the critical path; latency budget gone.
SES inbound is fine. Three quirks to know about.
Quirk 1: domain verification is per-region, and you cannot move it
We pinned to us-east-1 because SES inbound’s 14-region limit was the binding constraint when we started. Moving regions later means re-verifying the domain, re-publishing MX records, and an irreversible migration window where mail can be lost. Pick the region carefully on day one.
Quirk 2: inbound is sandboxed differently than outbound
You can be out of the SES sandbox for outbound and still have inbound capacity limits. We hit a soft “50 inbound recipients per second” limit at one point that took two days of support tickets to lift.
Quirk 3: rule sets are global per region, and you can only have one active
There is no concept of “production rule set” and “staging rule set” running in parallel. The active rule set is the active rule set. Staging and prod must run in different AWS accounts if you want them isolated. (We figured this out the hard way.)
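For orientation, the receive side of the pipeline is one receipt rule in that single active rule set: store the raw MIME to S3, then notify SNS. A sketch with placeholder resource and rule-set names (not our actual buckets or topics):

// One receipt rule: write the raw message to S3, then publish a notification to SNS.
// Bucket, topic, and rule-set names below are placeholders.
import { SESClient, CreateReceiptRuleCommand } from '@aws-sdk/client-ses';

const ses = new SESClient({ region: 'us-east-1' });

await ses.send(new CreateReceiptRuleCommand({
  RuleSetName: 'prod-inbound',
  Rule: {
    Name: 'store-and-notify',
    Enabled: true,
    ScanEnabled: true,                   // SES spam/virus verdicts on the stored message
    Recipients: ['inbox.catchotp.com'],  // match every address at the receiving domain
    Actions: [{
      S3Action: {
        BucketName: 'catchotp-raw-mime',
        ObjectKeyPrefix: 'inbound/',
        TopicArn: 'arn:aws:sns:us-east-1:123456789012:catchotp-inbound',
      },
    }],
  },
}));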
Why EventBridge (and not SQS)
The choice that mattered most for latency was the long-poll architecture, and the choice within that was EventBridge over SQS.
The shape we needed:
- A waiter on a single inbox needs to resolve when a specific message arrives.
- A second waiter on the same inbox should resolve independently — they are different test runs.
- The waiter SHOULD NOT see messages that arrived before the wait started (unless explicitly asked to).
SQS does not naturally do this. SQS gives you “next available message,” not “the next message that matches a predicate posted after this point in time.” You either build a per-waiter consumer group (heavyweight) or you scan-and-filter on read (slow).
EventBridge does. The waiter creates an ephemeral rule with a pattern matching the inbox ID, listens for matching events, returns when one fires or the timeout hits. The rule is torn down on completion.
// Waiter setup: create an ephemeral, inbox-scoped rule (AWS SDK v3).
// inboxId, requestId, and longPollLambda come from the incoming wait request.
import { EventBridgeClient, PutRuleCommand, PutTargetsCommand } from '@aws-sdk/client-eventbridge';

const eventbridge = new EventBridgeClient({});
const ruleName = `waiter-${inboxId}-${requestId}`;

await eventbridge.send(new PutRuleCommand({
  Name: ruleName,
  EventBusName: 'catchotp-bus',
  // EventPattern is a JSON string, not an object
  EventPattern: JSON.stringify({ source: ['catchotp.inbound'], detail: { inboxId: [inboxId] } }),
}));

await eventbridge.send(new PutTargetsCommand({
  Rule: ruleName,
  EventBusName: 'catchotp-bus',
  Targets: [{ Arn: longPollLambda.arn, Id: requestId }],
}));
// long-poll Lambda holds the connection until a matching event fires or the timeout hits
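The teardown on completion is the mirror image: remove the target, then delete the rule. A sketch under the same assumptions as the block above:

// Cleanup when the wait resolves or times out, reusing the client and names from above.
import { RemoveTargetsCommand, DeleteRuleCommand } from '@aws-sdk/client-eventbridge';

await eventbridge.send(new RemoveTargetsCommand({
  Rule: ruleName,
  EventBusName: 'catchotp-bus',
  Ids: [requestId],
}));
await eventbridge.send(new DeleteRuleCommand({
  Name: ruleName,
  EventBusName: 'catchotp-bus',
}));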
This was the single biggest latency win. p95 from S3 PUT to waiter resolution dropped from 1.4s to 480ms when we moved off SQS.
The EventBridge cost: $1 per million matched events. At our scale that is invisible. At 100M+ messages a month it would bear another look.
Parsing the body
This is the part everyone underestimates. “Extract a 6-digit code from an email” is one of those sentences that breaks on contact with reality.
A non-exhaustive list of formats we have seen in the first 90 days of inbound:
- Your code is 123456
- Your code is 123-456
- Your code is 1 2 3 4 5 6 (one digit per cell in a styled HTML table)
- Verification: 123456 (no whitespace before the colon)
- Code: 123456 (expires in 5 minutes)
- 123456 (alone on a line, no preamble)
- Your code: <strong>123456</strong> in HTML, with the digits split by zero-width spaces
- A 6x1 table with each digit in its own <td> and CSS that hides the surrounding tracking pixel
- Your verification code is 123456, in Markdown
- A QR code as an image with the OTP rendered alongside in HTML
Our default extractor covers 95%+ of the services we have measured against. Almost all of the patterns we miss need a custom override, which is a one-line opt-in:
const code = await otp.inboxes.waitForOtp(inbox.id, {
  pattern: /verification\s+code:?\s*([a-z0-9]{8})/i,
  timeoutSeconds: 30,
});
The lesson: do not ship “OTP extraction” as a clever-regex feature. Ship it as a prioritized pipeline of (1) fast common patterns, (2) HTML-stripping then pattern, (3) text-only then pattern, (4) per-sender override hints. Plan for the override hook from day one because someone will need it on day two.
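A compressed sketch of that pipeline, with stages 2 and 3 folded into one HTML-stripping pass; the patterns and helpers here are illustrative, not our production extractor:

// Illustrative extractor pipeline: common patterns first, then an HTML-stripped
// retry, then a per-sender override hint. Not the production pattern list.
const OTP_PATTERNS: RegExp[] = [
  /\b(\d{3})[\s-]?(\d{3})\b/,       // 123456, 123-456, 123 456
  /code\D{0,20}(\d{4,8})/i,         // "Your code is 123456", "Code: 123456 (expires in 5 minutes)"
];

// Drop tags and zero-width characters before retrying the same patterns.
function stripHtml(body: string): string {
  return body.replace(/<[^>]+>/g, ' ').replace(/[\u200b\u200c\u200d]/g, '');
}

function matchFirst(text: string): string | null {
  for (const pattern of OTP_PATTERNS) {
    const m = text.match(pattern);
    if (m) return m.slice(1).join(''); // join capture groups: "123-456" -> "123456"
  }
  return null;
}

// Priority: raw text, then HTML-stripped text, then the per-sender override hint.
function extractOtp(body: string, senderOverride?: RegExp): string | null {
  return (
    matchFirst(body) ??
    matchFirst(stripHtml(body)) ??
    (senderOverride ? (body.match(senderOverride)?.[1] ?? null) : null)
  );
}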
Persistence and retention
Three retention windows live in the system, and they are all different on purpose.
- Raw MIME in S3: 90 days on Pro, 24 hours on Free, custom on Enterprise. Lifecycle transition to S3-IA after 30 days to keep storage cost down.
- Parsed messages in DynamoDB: the same windows as the raw MIME, but enforced by a TTL attribute that DynamoDB expires automatically.
- Audit log metadata: 365 days for everyone. Retains sender, timestamp, inbox ID — never body. Compliance use only.
The split matters because regulators want metadata for longer than customers want bodies. Conflating them gets you either GDPR pain (bodies kept too long) or compliance pain (no audit trail).
DynamoDB TTL is a soft delete — items become unqueryable but the rows linger up to 48 hours before physical deletion. We separately run a daily sweep that hard-deletes anything past TTL, because “soft” is a confusing word in a privacy context.
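Concretely, the TTL is just an epoch-seconds attribute written with the parsed message. A sketch with placeholder table, attribute, and plan names:

// Write the parsed message with a plan-dependent expiry; DynamoDB's TTL process
// deletes it after expiresAt (best-effort, so the daily sweep backstops it).
// inboxId, messageId, extractedCode, and plan come from the parse step.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

const RETENTION_SECONDS: Record<string, number> = {
  free: 24 * 3600,       // 24 hours
  pro: 90 * 24 * 3600,   // 90 days
};

await ddb.send(new PutCommand({
  TableName: 'catchotp-messages',
  Item: {
    inboxId,
    messageId,
    otp: extractedCode,
    receivedAt: Date.now(),
    // The TTL attribute is epoch seconds, not milliseconds
    expiresAt: Math.floor(Date.now() / 1000) + RETENTION_SECONDS[plan],
  },
}));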
The long-poll API
Long-poll is two things in a trench coat: an HTTP request that holds open until a server-side condition fires, and a tiny state machine that decides what condition counts.
Our shape:
GET /v1/inboxes/{inboxId}/otp/wait?timeoutSeconds=30
What happens server-side:
- The Lambda starts a wait against the EventBridge rule we created above.
- If a message has already arrived in DynamoDB matching this inbox and not yet acknowledged, return immediately.
- Otherwise, hold the request. The Lambda has a 29-second internal timeout (one second under the client’s 30) so it can return a clean 204 No Match rather than a connection drop.
- On match, parse the message, extract the OTP, return.
The non-obvious part: the matching predicate. By default, “the next message that arrives after the wait started.” If you pass ?since=<messageId>, “the next message after that ID, including any that already arrived.” This is the difference between “test starts a wait, then triggers the email” (default) and “test triggers the email, then starts a wait” (race).
Most clients want the default. The since mode exists for clients that genuinely cannot reorder.
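Put together, the server-side shape is check-then-race. A sketch where the three helpers are hypothetical stand-ins for the real DynamoDB query, the EventBridge-backed waiter, and the parser:

// Hypothetical helpers standing in for the DynamoDB query, the EventBridge-backed
// waiter, and the OTP parser; names and response shape are illustrative.
declare function findExistingMatch(inboxId: string, sinceId?: string): Promise<{ otp: string } | null>;
declare function waitForInboundEvent(inboxId: string): Promise<{ messageId: string }>;
declare function extractOtpFromMessage(messageId: string): Promise<string>;

async function handleWait(inboxId: string, sinceId?: string): Promise<{ status: number; otp?: string }> {
  // Fast path: a message matching this inbox (and the optional ?since= cursor) has already arrived.
  const existing = await findExistingMatch(inboxId, sinceId);
  if (existing) return { status: 200, otp: existing.otp };

  // Otherwise hold the request: race the EventBridge-delivered event against a
  // 29-second internal timeout, one second under the client's 30-second budget.
  const event = await Promise.race([
    waitForInboundEvent(inboxId),
    new Promise<null>((resolve) => setTimeout(() => resolve(null), 29_000)),
  ]);

  if (!event) return { status: 204 }; // clean "no match" instead of a dropped connection
  return { status: 200, otp: await extractOtpFromMessage(event.messageId) };
}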
What we would do differently
Three things we got wrong.
1. We started with API Gateway WebSockets
We thought “long-poll” should be “WebSocket.” It should not. WebSockets add a connection lifecycle that you do not need for a one-shot wait. They also do not survive load-balancer-level idle timeouts as well as HTTP. We migrated to plain HTTP long-poll on a Lambda Function URL behind CloudFront and the latency improved.
2. We over-indexed on schemas early
We picked a messages.v1 schema before we had real customers. Two of the fields turned out to be useless and one needed to grow. We are now on messages.v2 and writing both versions until customers migrate. Should have shipped a leaner v1 and iterated.
3. We underestimated abuse on day one
The day after we publicly announced the free tier, the receive side was getting newsletter signup attempts targeted at the catchotp domain — not abuse exactly, more like “we are a directory of disposable inbox sites and we are scraping the address space.” The fix is rate-limiting per source IP, but we should have had it on launch instead of week two.
What is next
Three things on the roadmap that are direct consequences of the above.
- EU region. Pinning to us-east-1 is becoming a problem for European customers. eu-west-1 inbound, with regional data residency, is in design.
- Custom domains on Team. The “all addresses are at *@inbox.catchotp.com” constraint blocks some Team-tier use cases. Bring-your-own-domain with managed DKIM is in design.
- MCP server. Agents want to consume the long-poll waiter natively, not via HTTP. The AI agent email handling post covers the shape we are working toward.
How catchotp helps
If you are building something similar — an inbound email service, a high-fanout webhook system, a low-latency long-poll API — the post above is what we know. If you want to use the result rather than rebuild it, we are easier than the architecture suggests.
Free tier: 5 inboxes, 1,000 messages a month, no credit card. Start free or read the comparison page for how we differ from disposable-inbox sites.
Related reading
- Programmable Email vs Disposable Email — the user-facing version of this story.
- How to Test OTP Flows in 2026 — what the architecture above is optimized for.
- The AI agents use case covers the MCP server we are working toward.
The shorter version: SES inbound + EventBridge + long-poll Lambda + DynamoDB. None of the parts are exotic. The work is in the seams between them.