I spent roughly six weeks putting an LLM in a loop that emails real strangers, at volume, unattended, and the whole job was making sure it never did something stupid or illegal. The client was a Dallas-Fort Worth oil-and-lubricants distributor. The system sourced ~12,900 prospects, classified and quarantined them with LLMs, generated grounded per-prospect email copy, scheduled multi-touch sequences inside hard deliverability and CAN-SPAM guardrails, ran unattended on a daily scheduled-job lifecycle with a dead-man's switch, and monitored its own reputation. When the engagement ended, it shut itself down cleanly: every queued send cancelled, every job unloaded, full audit trail preserved.
This is a writeup of how it actually works. The headline isn't a conversion number. It's that a fully autonomous system ran a real outbound campaign end to end, in production, with no human in the loop, and stayed inside the law and the deliverability envelope with one exception: a single CAN-SPAM miss, root-caused and fixed, covered below. That is the actual job of an applied-AI or forward-deployed engineer: not the demo, the production envelope around the demo.
I have anonymized the client and abstracted the data sources. Everything else (architecture, code, numbers) is real and pulled directly from the repository.
At a glance
| Metric | Value |
|---|---|
| Prospects sourced & deduplicated | 12,876 (from 4 independent data sources, merged on normalized phone) |
| Quarantined by the classifier | ~11,100 shop-classifications routed to "do not pitch" (out-of-scope / too ambiguous) |
| Prospects enriched with verified emails | ~2,030 unique businesses (scrape → Hunter.io → verify waterfall) |
| Unique businesses emailed | 1,227, across up to 4 sequenced waves (~6,000 send records) |
| Bounce rate | ~1.1% (13 bounces / 1,227), against a 2% kill threshold |
| Opt-outs honored | 78 all-time, enforced across every future-scheduled wave |
| LLM calls | ~25,700 cached decisions across 3 model tiers (classification, angle-routing, draft-vetting), plus live calls |
| Autonomous jobs | 4 scheduled agents (15-min, hourly, nightly, morning) running unattended |
| Compliance incidents | 1 real CAN-SPAM miss, root-caused and fixed with a self-auditing watchdog |
| Stack | Python, Node/Playwright, Claude (Opus / Sonnet / Haiku), Resend, Hunter.io, ZeroBounce, Google Sheets, macOS launchd |
The one-line version: I treated "an LLM that emails strangers" as a safety-critical system. A single 2,600-line compliance module is most of what made it one.
The problem
The client is the owner-operator of a regional distributor that stocks several hundred SKUs of motor oil, heavy-duty diesel oil, transmission and hydraulic fluids, with same-region delivery and trade credit terms. Auto shops, dealerships, and fleets in the metro were his ideal customers. He had no list, no outbound motion, and no time. He answers his own phone.
The brief was simple to state and hard to do well: find every relevant shop in the metro, reach the right person, say something true and specific enough that a busy shop owner would reply, and do it without torching the domain's email reputation or breaking anti-spam law. Then keep doing it every day without me babysitting it.
Two constraints shaped every design decision:
- The client is non-technical and busy. The system had to run itself and only surface a human when something genuinely needed a human. No dashboards he'd never open.
- One careless send is unrecoverable. Emailing a shop that already said "stop" is a federal CAN-SPAM violation. Blasting identical subject lines gets the whole domain flagged by Gmail. There is no undo on a sent email. The cost of a mistake is asymmetric, so the system had to be biased toward not acting when uncertain.
That second constraint is the whole story. Everything below is downstream of "the cheap failure is sending nothing; the expensive failure is sending the wrong thing."
System architecture
The design principle that runs through all of it: the LLM
does the reasoning, deterministic code holds the guarantees. An
LLM decides whether a shop is a diesel fleet or a body shop, and which
pitch fits. Plain Python decides whether someone opted out, whether
we've hit today's send cap, and whether a draft leaked a
{merge_field}. The model is allowed to be wrong; the
guarantees are not allowed to depend on the model being right.
Part 1: Data acquisition at scale
The prospect universe was assembled from four independent sources, each contributing something the others couldn't:
- A licensed B2B firmographic database (the richest source): owner and executive names, revenue and employee estimates, years in business, SIC/NAICS codes, roughly 395 columns per record. ~12,765 records.
- A maps/places API, queried as a grid of ~230 metro ZIP codes × 6 search terms to defeat the API's hard 60-results-per-query cap. Ratings, review counts, business types, websites. 7,872 records.
- A government dealer registry: 5,269 licensed auto dealers with phone and license type.
- A government used-oil-handler registry: the most oil-relevant source of all, 746 raw → 315 cleaned facilities licensed to handle used oil.
These merged into 12,876 unique prospects, deduplicated primarily on normalized phone number (the only identifier that survived across heterogeneous sources), with cross-source fields enriching matched records.
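A minimal sketch of phone-keyed merging, assuming US 10-digit numbers and dict-shaped records. The function names and the "earlier source wins on conflicts" rule are illustrative assumptions, not the repository's code.

```python
import re

def normalize_phone(raw):
    """Reduce a US phone number to 10 digits, or return None if that fails."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits if len(digits) == 10 else None

def merge_on_phone(records):
    """Merge records that share a normalized phone; earlier records win conflicts."""
    merged = {}
    unmatched = []
    for rec in records:
        key = normalize_phone(rec.get("phone"))
        if key is None:
            unmatched.append(dict(rec))  # no usable phone: keep as its own prospect
            continue
        target = merged.setdefault(key, {"phone": key})
        for field, value in rec.items():
            if field != "phone" and value and not target.get(field):
                target[field] = value  # fill gaps only, never overwrite
    return list(merged.values()) + unmatched
```

Filling gaps without overwriting is what lets a matched record pick up, say, a rating from one source and an owner name from another.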
The engineering that's worth showing
The firmographic source was the hard one. It sat behind an authenticated session and a stateful, JavaScript-heavy search form, and it capped exports at 500 records per query, far below the thousands a single SIC-code × county combination returned. Getting a complete extraction took eight iterations, and the evolution is the story:
The hybrid pattern. I used Playwright (headless Chromium with anti-automation hardening: navigator.webdriver spoofed false, desktop UA, matching timezone/locale) only for the part that needed a real browser, getting through auth and building the search. Once the search existed, I extracted the 32-character session key from the results URL and switched to direct authenticated HTTP calls against the reverse-engineered internal endpoints. Browser robustness for the handshake, raw-HTTP throughput for the data. When the HTTP client eventually lost auth after many calls, I fell back to running fetch inside the page context so requests inherited the live session cookies.
Defeating the 500-record export cap. Rather than accept the limit, I collected every record ID for a result set, then looped: tag 500 → download → untag those 500 → tag the next 500. A capped export became a complete one.
Defeating session-depth limits with multi-sort harvesting. For result sets larger than ~800, a single session couldn't paginate deep enough to even see all the IDs. The fix: query the same result set under multiple sort orders (name ascending and descending, ZIP ascending/descending), re-authenticating between each, and union the IDs into a set. Ascending gets the first ~1,000, descending gets the last ~1,000; overlap them and you've exfiltrated a result set the server will not paginate fully. I kept going until I'd captured ≥95% of the expected count.
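The union-across-sort-orders idea can be sketched as follows. Here `fetch_ids` is a hypothetical callable standing in for one authenticated session paging as deep as the server allows; the 95% target is the one quoted above.

```python
def harvest_ids(fetch_ids, sort_orders, expected_count, target=0.95):
    """Union record IDs across sort orders until coverage reaches the target.

    fetch_ids(sort_order) is a hypothetical callable returning whatever IDs
    a single session could page through under that sort order.
    """
    seen = set()
    for order in sort_orders:
        seen.update(fetch_ids(order))
        if len(seen) >= target * expected_count:
            break  # enough coverage; skip the remaining sort orders
    coverage = len(seen) / expected_count if expected_count else 1.0
    return seen, coverage
```

Because the IDs land in a set, overlap between the ascending and descending passes costs nothing; only the union size matters.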
A failure-mode sentinel. When the search criteria silently failed to apply, the form would default to returning the entire national database (I caught this in the logs as a result count of 18,522,401). So I added a guard: any count over 100,000 is treated as proof the criteria didn't stick, triggering a re-auth-and-retry instead of a doomed pagination through 18 million rows.
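A sketch of that guard, assuming hypothetical `run_search` and `reauth` callables. Only the 100,000 threshold comes from the writeup; the attempt limit and exception name are illustrative.

```python
MAX_PLAUSIBLE_COUNT = 100_000  # above this, assume the criteria never applied

class CriteriaNotApplied(RuntimeError):
    """Raised when the result count stays implausibly large after retries."""

def search_with_sentinel(run_search, reauth, max_attempts=3):
    """Run the search, re-authenticating and retrying on an implausible count."""
    for _ in range(max_attempts):
        count = run_search()
        if count <= MAX_PLAUSIBLE_COUNT:
            return count
        reauth()  # criteria silently dropped: reset the session and try again
    raise CriteriaNotApplied(
        f"result count stayed above {MAX_PLAUSIBLE_COUNT} after {max_attempts} attempts"
    )
```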
Crash-safe, resumable extraction. A 110-task, multi-hour scrape checkpointed after every task, with per-task retry → full-reauth → retry → mark-failed escalation, and preemptive re-auth every 15 tasks to survive session expiry. It could be killed and restarted without re-downloading anything.
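One way to get that checkpoint-and-escalate behavior, as a sketch under assumptions: task IDs are strings, the checkpoint is a JSON file, and `run_one` / `reauth` are hypothetical callables. The escalation order (retry, then re-auth and retry, then mark failed) and the every-15-tasks re-auth follow the description above.

```python
import json
import os

def load_done(path):
    """Return the set of task IDs already completed, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def save_done(path, done):
    """Write the checkpoint atomically so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)

def attempt(task, run_one, reauth):
    """First try, plain retry, then re-auth and a final retry."""
    for step in ("first", "retry", "reauth_retry"):
        if step == "reauth_retry":
            reauth()
        try:
            run_one(task)
            return True
        except Exception:
            pass
    return False

def run_tasks(tasks, run_one, reauth, checkpoint_path, reauth_every=15):
    done = load_done(checkpoint_path)
    failed = []
    attempted = 0
    for task in tasks:
        if task in done:
            continue  # finished in an earlier run: nothing is re-downloaded
        if attempted and attempted % reauth_every == 0:
            reauth()  # preemptive re-auth to survive session expiry
        attempted += 1
        if attempt(task, run_one, reauth):
            done.add(task)
            save_done(checkpoint_path, done)  # checkpoint after every task
        else:
            failed.append(task)
    return done, failed
```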
The lesson I'd put on a whiteboard: when a system fights you, the win is usually not brute force, it's finding the cheap orthogonal angle (a second sort order, a session key you can replay) that turns an impossible request into a possible one.
Part 2: The classification brain
Sourcing 12,876 prospects is easy. Knowing which ones to leave alone is the hard, valuable part, and it's where the LLMs earned their keep.
Classification ran in two LLM stages, both on Claude Haiku 4.5 (the cheap, fast tier, chosen deliberately for a job that runs across tens of thousands of records):
- Segment + subsegment (passenger_lube / hd_diesel / dealer_fleet / out-of-scope), which routes a prospect to the right product taxonomy.
- Archetype (one of 10: euro specialist, hybrid specialist, lube chain, transmission specialist, body shop, heavy-duty diesel, independent general, mobile mechanic, out-of-scope, unclear), which routes to the right pitch.
Three of those archetypes (mobile mechanic, out-of-scope, unclear), plus a distinct needs-retry state, form a quarantine set. They never get an email. Over the campaign, the classifier routed roughly 11,100 shop-classifications into quarantine (out-of-scope dominated at ~9,200), concentrating the entire outbound effort on the qualified minority.
The design rule, written into the code as a comment, was "quarantine over guess. No proximity-fallback. No 'when in doubt → safe pick' bias." This matters more than it sounds:
- When Haiku returns malformed JSON, or an answer that isn't on the allowed list, the system does not coerce it into a default. An off-list answer is treated as the model being lazy or wrong, and the prospect is dropped or retried.
- A model failure (needs-retry, from a timeout or parse error) is kept semantically distinct from a model decision (unclear, meaning Haiku looked and genuinely couldn't tell). The first gets retried next run; the second is a real verdict. Collapsing those two is how pipelines silently drop good leads on transient errors, and I refused to.
- The decision cache never persists failures. Only stable decisions (a real angle pick, or a final quarantine verdict) are cached, so one timeout doesn't poison every future run.
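The three rules above can be sketched in a few lines. The snake_case labels, the JSON reply shape, and the `call_model` callable are assumptions for illustration; treating an off-list answer as a retry (rather than a drop) is one of the two options the text mentions.

```python
import json

# Hypothetical label spellings for the ten archetypes named above.
ARCHETYPES = {
    "euro_specialist", "hybrid_specialist", "lube_chain",
    "transmission_specialist", "body_shop", "heavy_duty_diesel",
    "independent_general", "mobile_mechanic", "out_of_scope", "unclear",
}
NEEDS_RETRY = "needs_retry"  # a model failure, not a model decision

def parse_verdict(raw_reply):
    """Return an on-list archetype, or NEEDS_RETRY. Never coerce to a default."""
    try:
        label = json.loads(raw_reply)["archetype"]
    except (ValueError, KeyError, TypeError):
        return NEEDS_RETRY  # malformed JSON or wrong shape
    if isinstance(label, str) and label in ARCHETYPES:
        return label
    return NEEDS_RETRY  # off-list answer: treated as a failure, not a guess

def classify(key, call_model, cache):
    """Classify one prospect, caching only stable decisions."""
    if key in cache:
        return cache[key]
    try:
        verdict = parse_verdict(call_model(key))
    except TimeoutError:
        verdict = NEEDS_RETRY
    if verdict != NEEDS_RETRY:
        cache[key] = verdict  # "unclear" is a real verdict and is cached
    return verdict
```

Note that `unclear` is cached while `needs_retry` is not: the first is something the model said, the second is something that happened to the call.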
This is the difference between a classifier that's accurate on average and one that's safe to put in front of a stranger's inbox. Average accuracy is fine when a wrong answer costs nothing. Here a wrong "this body shop wants diesel oil" costs a real person a stupid email and costs the client a reputation. The system is built to abstain.
Part 3: Prompt and context engineering for copy
Personalized copy at this scale is a cost and consistency problem. My answer was to split writing from routing:
Writing happened offline, once, with the best model. I used Opus 4.7 at maximum effort to generate 24 hand-grade body templates (3 segments × 8 "angles"), plus angle-matched subject lines and wave-2/3/4 follow-up sequences. An "angle" is a named value proposition: next-day-delivery, net-30-terms, bulk-tanker, 0w-8-unicorn-sku, euroline-specialty, def-in-stock, and so on.
Routing happened per-prospect, with the cheap model. At runtime, Haiku's only job was to pick the best angle for each shop. The actual copy was a deterministic string merge. Opus writes the prose; Haiku picks which prose each prospect gets; Python fills in the name and city.
This is a deliberate cost architecture. Generating bespoke prose for 12,000 prospects with a frontier model is slow and expensive and inconsistent. Generating 24 excellent templates once and routing them intelligently is fast, cheap, auditable, and you can actually read and approve every possible email a prospect might receive.
The angle pick is a constrained decision, not a free one
Rather than ask Haiku "pick 1 of 24 templates" (a wide, error-prone choice), I made it a two-stage funnel:
- Classify into one of 10 archetypes.
- Each archetype maps to a curated shortlist of 2-6 candidate angles. Intersect that shortlist with the segment's available templates. If zero candidates remain, quarantine. If exactly one, pick it deterministically with no LLM call at all. If two or more, a second Haiku call chooses among them.
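A sketch of the funnel's second stage. The data shapes (`shortlists` as a dict of lists, `templates` as a set of `(segment, angle)` pairs) and the `choose_with_model` callable are assumptions; quarantining an off-list model answer is in the spirit of "quarantine over guess" but is not confirmed by the source.

```python
QUARANTINE = None  # sentinel meaning "do not email this prospect"

def pick_angle(archetype, segment, shortlists, templates, choose_with_model):
    """Zero candidates: quarantine. One: deterministic. Two or more: ask the model."""
    candidates = [
        angle for angle in shortlists.get(archetype, ())
        if (segment, angle) in templates  # intersect with the segment's templates
    ]
    if not candidates:
        return QUARANTINE
    if len(candidates) == 1:
        return candidates[0]  # no LLM call at all
    choice = choose_with_model(candidates)
    return choice if choice in candidates else QUARANTINE  # never coerce
```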
Shrinking the decision space makes wrong answers structurally harder, and lets the common cases skip the model entirely. When a batched angle-pick call timed out, the code recursively split the batch in half and retried each half, salvaging work while still honoring "every answer is a real model pick, never a fallback."
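The split-and-retry behavior might look like this. `call_model` is a hypothetical batched call returning `{prospect_id: angle}` and raising `TimeoutError` on a timeout; prospects that still time out alone are returned as unresolved rather than given a fallback.

```python
def pick_with_split(batch, call_model):
    """Return (picks, unresolved), halving the batch whenever a call times out."""
    if not batch:
        return {}, []
    try:
        return dict(call_model(batch)), []
    except TimeoutError:
        if len(batch) == 1:
            return {}, list(batch)  # left for a later run, never defaulted
        mid = len(batch) // 2
        left, left_unresolved = pick_with_split(batch[:mid], call_model)
        right, right_unresolved = pick_with_split(batch[mid:], call_model)
        return {**left, **right}, left_unresolved + right_unresolved
```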
Context engineering: grounding, not dumping
Each prospect's prompt was assembled from three joined sources: the
firmographic row (name, SIC description, city, employee count), a
maps-API join on normalized phone (business type, rating, review count),
and the client's own existing-customer list joined by city (to produce a
same-city name-drop for social proof, gated on an explicit
CanNameDrop flag).
Critically, the prompt didn't just dump these fields. It included interpretation instructions: "1-2 employees = solo/single-bay; 16+ = chain"; "primary_type 'Car dealer' → very likely dealer_fleet"; "100+ reviews at 4.5+ = serious volume, a real buyer." And the templates themselves were grounded in verified domain facts, exact OEM specifications (BMW LL-01, MB 229.5, VW 502/505/507), real SKUs, with an explicit forbidden list of claims the client wasn't confident about. Hallucinating a certification the distributor doesn't hold is exactly the failure that destroys B2B trust, so the unsafe claims were removed from the model's reach rather than left to its judgment.
Caching as a first-class concern
Across the campaign, the system accumulated ~21,000 cached
classifications, ~1,700 cached angle picks, ~2,800 cached
email-legitimacy verdicts, and ~2,100 cached Hunter.io lookups. The
cache key for classification was
name | SIC | sha1(city+website)[:8], hashing city and
website specifically so two same-named shops in different cities don't
collide. A per-night classification budget meant cached rows were free
and only fresh calls counted against the cap; the comment in the code
notes that over ~13 nights the cache would cover all 12,876 rows, and
that slow-but-complete beats fast-but-lossy.
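A sketch of the key format and the budget rule. The exact concatenation inside the hash, the row field names, and `classify_fresh` are assumptions; `classify_fresh` is presumed to return only stable decisions, so nothing transient is cached here.

```python
import hashlib

def classification_cache_key(name, sic, city, website):
    """Build 'name | SIC | sha1(city+website)[:8]' as described above."""
    digest = hashlib.sha1(f"{city}{website}".encode("utf-8")).hexdigest()[:8]
    return f"{name} | {sic} | {digest}"

def classify_with_budget(rows, cache, classify_fresh, nightly_budget):
    """Cached rows are free; only fresh model calls count against the budget."""
    fresh_calls = 0
    deferred = []
    for row in rows:
        key = classification_cache_key(row["name"], row["sic"], row["city"], row["website"])
        if key in cache:
            continue  # already decided on an earlier night
        if fresh_calls >= nightly_budget:
            deferred.append(key)  # picked up on a later night
            continue
        cache[key] = classify_fresh(row)
        fresh_calls += 1
    return fresh_calls, deferred
```

Run night after night against the same rows, the deferred list shrinks to zero without any row being skipped.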
Part 4: The compliance and safety engine
This is the centerpiece, a 2,600-line module that everything else routes through. It implements defense-in-depth, because the cost of a single bad send is asymmetric.
Pre-send draft gates (deterministic, first-hit-wins, no LLM):
- Merge-field leak detection (a {shop_name} that survived rendering).
- Spam-marker detection (all-caps FREE/GUARANTEED, $$, !!, >70% uppercase ratio in a subject).
- Forbidden-brand-per-segment gates: heavy-duty diesel brands are forbidden in passenger-lube copy and vice versa, so the system structurally cannot pitch the wrong product taxonomy.
- An anti-AI-slop gate: em dashes plus a ~30-word blocklist (delve, crucial, robust, seamless, leverage, elevate, …). Cold email that reads like ChatGPT gets deleted by humans and flagged by filters.
- High-bounce-risk address detection: generic info@/sales@ local-parts on custom domains, since role-address mailboxes on small-shop custom domains disproportionately hard-bounce.
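A compressed sketch of a first-hit-wins gate chain. The gate order, the returned reason strings, the free-domain list, and the `forbidden_terms` parameter are illustrative assumptions; the blocklist shows only the six words quoted above, not the full ~30.

```python
import re

SLOP_WORDS = {"delve", "crucial", "robust", "seamless", "leverage", "elevate"}
ROLE_LOCAL_PARTS = {"info", "sales"}
FREE_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}  # illustrative list

def first_failed_gate(subject, body, to_address, forbidden_terms=()):
    """Return the name of the first gate the draft fails, or None if it passes."""
    text = f"{subject}\n{body}"
    if re.search(r"\{[A-Za-z_]+\}", text):
        return "merge_field_leak"
    if re.search(r"\b(FREE|GUARANTEED)\b|\$\$|!!", text):
        return "spam_marker"
    letters = [c for c in subject if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.70:
        return "spam_uppercase_subject"
    lowered = text.lower()
    if any(term.lower() in lowered for term in forbidden_terms):
        return "forbidden_brand"  # per-segment list supplied by the caller
    if "\u2014" in text:
        return "ai_slop_em_dash"
    if set(re.findall(r"[a-z]+", lowered)) & SLOP_WORDS:
        return "ai_slop_word"
    local, _, domain = to_address.lower().partition("@")
    if local in ROLE_LOCAL_PARTS and domain not in FREE_DOMAINS:
        return "high_bounce_risk_address"
    return None
```

Returning a reason string rather than a boolean keeps every rejection auditable.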
The LLM watchdog (the most expensive model, at the last gate): drafts that pass the deterministic gates and whose recipient isn't on a free email domain go to Opus 4.7 at extra-high effort, batched, to catch scraper-misfire and wrong-entity contamination. The cheapest model does the high-volume classification; the most expensive model guards the final step before sends go out. And this gate fails closed: an earlier version let an Opus blip ship unvalidated drafts (fail-open), so I rewrote it to route failed batches to a separate needs-retry file that is never approved and never sent, and to page me if more than 20% of drafts end up there.
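The fail-closed routing can be sketched like this. `vet_batch` is a hypothetical model call returning `{draft_id: bool}`; the three-way split and the treatment of a missing verdict as "not approved" are assumptions, while the 20% paging threshold is the one stated above.

```python
def vet_drafts(batches, vet_batch, page_threshold=0.20):
    """Sort drafts into approved / rejected / needs_retry, failing closed."""
    approved, rejected, needs_retry = [], [], []
    for batch in batches:
        try:
            verdicts = vet_batch(batch)
        except Exception:
            needs_retry.extend(batch)  # whole batch unvalidated: never approve
            continue
        if not isinstance(verdicts, dict):
            needs_retry.extend(batch)  # malformed reply is a failure too
            continue
        for draft_id in batch:
            verdict = verdicts.get(draft_id)
            if verdict is True:
                approved.append(draft_id)
            elif verdict is False:
                rejected.append(draft_id)
            else:
                needs_retry.append(draft_id)  # a missing verdict is not approval
    total = len(approved) + len(rejected) + len(needs_retry)
    should_page = total > 0 and len(needs_retry) / total > page_threshold
    return approved, rejected, needs_retry, should_page
```

The only path into `approved` is an explicit `True` from the model; every other outcome, including silence, keeps the draft out of the send queue.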
Opt-out enforcement (three layers, deterministic-first):
- A conservative regex pre-pass runs before the LLM inbox triage. A bare "stop" reply matches; "stop by anytime" deliberately does not.
- The LLM triage semantically classifies replies (opt-out / bounce / prospect-reply / urgent).
- An independent drift auditor re-scans the inbox against the reply log to catch anything triage missed, and fails the nightly pipeline with a specific exit code if it finds a high-severity miss containing a stop-keyword.
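A sketch of a conservative pre-pass in the spirit of the first layer. The phrase list beyond "stop" and the rule of inspecting only the first non-quoted line are assumptions; anything this function declines to match is left for the LLM triage.

```python
import re

# Matches a line that is nothing but a stop phrase plus punctuation.
_BARE_STOP = re.compile(r"^\W*(stop|unsubscribe|remove me|opt out)\W*$", re.IGNORECASE)

def is_obvious_opt_out(reply_text):
    """True only when the first non-quoted line is a bare stop phrase."""
    for line in reply_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(">"):
            continue  # skip blank lines and quoted original text
        return bool(_BARE_STOP.match(stripped))
    return False
```

The anchors do the work here: `stop by anytime` contains the keyword but fails the end-of-line anchor, so it falls through to semantic triage instead of suppressing a live prospect.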
When an opt-out is detected, the kill-chain is: append to the suppression list → cancel every future-scheduled wave for that address in the email provider → append to the dedup log → mark the thread handled. If the pipeline lock is held, the cancellation is deferred and retried, never silently dropped.
Suppression at send time: before any wave goes out, a separate auditor loads every suppression source (opt-outs, bounces, "replied elsewhere," and real replies), scans for any future-scheduled send to a suppressed address, cancels it via the provider's API, then re-verifies with a second scan and exits non-zero if any violation remains.
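The cancel-then-re-verify loop, sketched against a hypothetical provider interface: `list_scheduled()` returns dicts with `id` and `to`, and `cancel(send_id)` cancels one send. These are stand-ins, not the email provider's actual API.

```python
def audit_scheduled_sends(list_scheduled, cancel, suppressed_addresses):
    """Cancel future sends to suppressed addresses, then scan again to verify."""
    suppressed = {a.strip().lower() for a in suppressed_addresses}

    def violations():
        return [s for s in list_scheduled() if s["to"].strip().lower() in suppressed]

    cancelled = []
    for send in violations():
        cancel(send["id"])
        cancelled.append(send["id"])
    remaining = [s["id"] for s in violations()]  # second scan: trust nothing
    exit_code = 1 if remaining else 0  # non-zero fails the pipeline
    return cancelled, remaining, exit_code
```

The second scan matters because a cancel call can return without error and still leave the send queued; only re-reading the provider's state proves the violation is gone.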
The incident that proves the design
On 2026-04-30, the system committed a real CAN-SPAM violation: it sent a follow-up wave to a prospect who had replied "stop." I want to be straight about this, because how it was handled is the actual portfolio piece.
Root cause: opt-out deduplication keyed on Gmail's read/unread state. But read/unread state is owned by the human inbox: a person reading a "STOP" reply could silently flip a thread out of the "needs processing" set, un-gating it before the suppression ran. The state that the safety check depended on was mutable by someone outside the system.
The fix was a re-architecture, not a patch:
- Dedup now keys on a threadId recorded in an append-only reply log the system fully owns, never on inbox read-state.
- The deterministic regex pre-pass was added ahead of the LLM, so a missed-unread thread can't re-trigger sends even if triage never sees it.
- The independent drift auditor was added to fail the pipeline on any future miss.
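A sketch of dedup against a system-owned log. The JSON-lines file format and the function names are assumptions; the point illustrated is that the inbox's `unread` flag is never consulted.

```python
import json

def load_handled_thread_ids(log_path):
    """Read every threadId ever recorded in the append-only reply log."""
    handled = set()
    try:
        with open(log_path) as f:
            for line in f:
                if line.strip():
                    handled.add(json.loads(line)["threadId"])
    except FileNotFoundError:
        pass  # no log yet means nothing has been handled
    return handled

def record_reply(log_path, thread_id, verdict):
    """Append one record; existing lines are never rewritten."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"threadId": thread_id, "verdict": verdict}) + "\n")

def threads_needing_processing(inbox_threads, log_path):
    """Select threads by log membership only; read/unread state is ignored."""
    handled = load_handled_thread_ids(log_path)
    return [t for t in inbox_threads if t["threadId"] not in handled]
```

A human opening the message changes nothing in this scheme, because the only thing that can remove a thread from the work set is the system's own write to the log.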
That's the shape of responsible production AI: a real failure, root-caused to a state-ownership bug (not "the model got it wrong"), fixed with redundant deterministic layers and a self-auditing watchdog so the same class of bug can't recur silently. Out of the entire campaign, this was the one real violation, and the system was rebuilt so it would catch itself next time.
Part 5: Autonomous operation
The system ran itself on four scheduled macOS launchd agents (chosen over cron so the jobs run in the GUI session and can reach Keychain OAuth tokens):
| Job | Cadence | Role |
|---|---|---|
| poll-inbox | every 15 min | opt-out pre-pass → LLM triage → bounce canary → CRM sync |
| nightly-pipeline | 21:00 daily | enrich → generate → validate → schedule → sync, then deterministic self-audit |
| heartbeat | hourly | dead-man's switch |
| morning-snapshot | 08:00 daily | one status email to the operator |
A few decisions made this safe to leave unattended:
Sends are decoupled from my machine being awake.
Emails are queued into the provider (Resend) with absolute
scheduled_at timestamps and idempotency keys; the
provider fires them in an 11:00-11:30 UTC window. The
orchestration layer only ever plans. It never has to be awake
to send. This matters when the operator's machine is a laptop
being SSH'd into from a cafe.
The nightly orchestrator is itself an LLM agent, an Opus call given a prompt to run the five-step pipeline via Bash and decide volumes against the caps, wrapped in a 90-minute hard timeout so a hung model call can't block tomorrow's run. But every guarantee it relies on (opt-out suppression, cap enforcement, missed-opt-out detection) runs as deterministic code wired into the shell script, after the agent, "so it's deterministic even if the LLM agent skips a step or times out." The agent is allowed to be lazy. The audit is not.
Drip scheduling protects deliverability. A warmup ramp (50 → 100 → 200 → weekend-0 → 400 → 500/day, then steady) eased the domain into volume. Multi-touch waves dripped over 14 days (T+0, T+3, T+7, T+14), each placed on the earliest open weekday slot under the cap, staggered across the send window with ±20s jitter for organic pacing. The cap logic falls through to sane weekday/weekend defaults past the explicit table, specifically so the schedule never hits a future date with "no cap → silently send zero."
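A sketch of slot placement with the fall-through cap. The default caps (500 on weekdays, 0 on weekends) are read off the ramp's steady state and are an assumption, as is the rule that each wave lands at least a day after the previous one. Intra-day staggering and jitter are omitted.

```python
from datetime import date, timedelta

WAVE_OFFSETS = (0, 3, 7, 14)  # T+0, T+3, T+7, T+14
DEFAULT_WEEKDAY_CAP = 500     # assumed steady-state cap
DEFAULT_WEEKEND_CAP = 0

def cap_for(day, cap_table):
    """Use the explicit ramp table, else fall through to sane defaults."""
    if day in cap_table:
        return cap_table[day]
    return DEFAULT_WEEKEND_CAP if day.weekday() >= 5 else DEFAULT_WEEKDAY_CAP

def schedule_waves(start, cap_table, booked):
    """Place each wave on the earliest day with room under that day's cap."""
    placed = []
    for offset in WAVE_OFFSETS:
        day = start + timedelta(days=offset)
        if placed:
            day = max(day, placed[-1] + timedelta(days=1))  # keep waves ordered
        while booked.get(day, 0) >= cap_for(day, cap_table):
            day += timedelta(days=1)  # full or zero-cap day: slide forward
        booked[day] = booked.get(day, 0) + 1
        placed.append(day)
    return placed
```

Because `cap_for` always returns a number, a date past the end of the ramp table gets the default cap instead of silently scheduling nothing.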
Observability is a dead-man's switch, not a
dashboard. The heartbeat distinguishes never ran
(missing log) from stopped running (stale log), has a
weekday-midday zero-send tripwire that catches silent under-fills the
staleness checks would miss, and even checks
pmset/caffeinate to catch the root cause (the
Mac asleep) of scheduling drift. A mid-day bounce canary on the
15-minute cadence emergency-cancels the next 6 hours of sends if bounces
spike past threshold, closing the gap between the nightly and morning
reputation checks. The operator got exactly one email a day unless
something was wrong.
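The never-ran versus stopped-running distinction reduces to a small check on the log file. The status strings and the use of file modification time are assumptions; the zero-send tripwire and the pmset/caffeinate checks are not shown.

```python
import os
import time

def heartbeat_status(log_path, max_age_seconds, now=None):
    """Distinguish a job that never ran from one that stopped running."""
    now = time.time() if now is None else now
    if not os.path.exists(log_path):
        return "never_ran"  # missing log: the job has not produced output at all
    age = now - os.path.getmtime(log_path)
    return "stale" if age > max_age_seconds else "ok"
```

The two failure states point at different causes: a missing log usually means the job was never loaded, while a stale one means it was running and then stopped.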
A read-only supervisor agent on top. A separate watchdog-agent prompt (designed for an external GPT-5-class runtime) monitored files, the email provider, the inbox, and the CRM on a tiered cadence, strictly read-and-alert: it could page me, but it was forbidden from mutating state or pushing code without a human owning the change. When it found something it couldn't diagnose, it could escalate by spawning a more capable model for root-cause analysis. Agents watching agents, with humans on the only write path that matters.
Part 6: Knowledge base and grounded inbound
For handling inbound replies, I built a retrieval-grounded responder (designed, tested, and deliberately held back from live auto-send, which I'll get to).
The knowledge base combined hand-authored facts with
a document-extraction pipeline. 26 vendor PDFs (spec
sheets, brand line cards, certifications, price lists) were each run
through a pipeline: raw PDF → standardized PDF → markdown summary +
structured JSON → master index. The standout detail: each JSON record
carries not just summary and extracted_text
and key_facts, but a send_when /
never_send_when block. The KB doesn't just store what a
document says, it stores when an agent should attach it and
when it shouldn't. That is the line between a document store and an
agent-ready knowledge base.
Pricing was reverse-engineered from real artifacts (one accounting estimate plus four real quote replies the owner had sent), with per-number provenance tracked and unknowns explicitly flagged for human escalation rather than guessed.
The inbound responder loaded the "safe" KB files into context (stuffed-context grounding) and emitted strict JSON with a route: auto-draft, escalate-to-account-manager, escalate-to-owner, opt-out, bounce, or noise. Nine hard rules enforced groundedness, the first being never quote a price. The pricing file was deliberately excluded from the model's context, removing the temptation rather than trusting the model to resist it. Name-drops were constrained to an explicit allowlist and emitted as an auditable field, so the model couldn't invent fake social proof.
Human-in-the-loop by construction: the responder only ever wrote a draft. Nothing sent until a human ran an approve command (with edit, escalate, and reject as the only other options, and no default that sends). Idempotency keys prevented double-sends.
The honest part: this inbound auto-drafter was built and smoke-tested but never wired into the live loop. In production, the owner handled prospect replies by phone, which was the right call for a high-touch B2B relationship. The live inbound system was classify-notify-cancel only: it detected replies, alerted the operator, and cancelled follow-up waves. I'm including the responder here because the engineering (the PDF-to-decision-metadata pipeline, the grounded-reply design, the HITL gate) is real and reusable, but I won't claim it ran when it didn't.
Part 7: Deliverability as an engineered system
Deliverability was treated as a measured, monitored property, not an afterthought:
- Authentication: the sending domain was set up as a verified sender (DKIM/SPF) with the email provider, while keeping the existing inbound mail routing intact. (A misconfigured SPF record that failed to authorize the provider was later diagnosed as the cause of a reply-rate decline, exactly the kind of invisible deliverability bug that silently kills a campaign.)
- Compliance headers on every send: List-Unsubscribe and List-Unsubscribe-Post headers (one-click-unsubscribe intent), a physical-address footer, and a plain-text "reply STOP" line. CAN-SPAM's structural requirements, enforced at the template layer.
- Gmail-clustering mitigation: the dashboard actively flagged when any single subject line exceeded 35% of a day's batch, because Gmail clusters identical subjects from one sender into spam buckets. This is why the subject generator was explicitly instructed to vary sentence structure across angles.
- Result: a ~1.1% bounce rate (13 of 1,227 recipients) against a 2% kill threshold, and 78 opt-outs honored across the campaign (~1.3% of all ~6,000 send attempts). Clean, by cold-email standards.
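The subject-share check from the Gmail-clustering bullet is a few lines. The function name and the exact-string comparison (no case or whitespace normalization) are assumptions; the 35% threshold is the one stated above.

```python
from collections import Counter

def overused_subjects(subjects, max_share=0.35):
    """Return {subject: share} for any subject above max_share of the batch."""
    if not subjects:
        return {}
    total = len(subjects)
    counts = Counter(subjects)
    return {s: n / total for s, n in counts.items() if n / total > max_share}
```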
Results, honestly
The verifiable campaign numbers:
- 12,876 prospects sourced and deduplicated across 4 sources.
- ~11,100 classifications routed to quarantine (the system's most important output was knowing who not to email).
- ~2,030 unique businesses enriched with verified emails via the scrape → Hunter → verify waterfall (~16% of the master list, by design, since most of the list was correctly out-of-scope).
- 1,227 unique businesses actually emailed, across up to 4 sequenced waves (~6,000 total send records over 2026-04-22 to 2026-05-15).
- ~1.1% bounce rate, 78 opt-outs honored across all future waves, 16 genuine prospect replies, 1 real compliance incident (caught and permanently fixed).
The headline result is the autonomy itself: an agent ran a real B2B outbound campaign end to end, unattended, and, apart from the one compliance incident described above, stayed inside the law and the deliverability envelope the entire time. The reply volume was real but small, and a late dip traced to a deliverability bug I'd diagnosed but not yet fixed when the engagement ended. These are the real numbers, not a vanity metric. In 2026 the thing worth weighing is not how many replies a human-run campaign would have pulled, it's that a system did the whole thing on its own without doing harm.
What the system unambiguously delivered: a complete, deduplicated metro prospect database; a precision-targeting classifier that quarantined ~87% of the list instead of spraying it; thousands of grounded, on-brand, individually-validated emails sent without a deliverability blowup; full CAN-SPAM compliance enforcement with one caught-and-fixed miss; and unattended daily operation with self-monitoring. For an owner-operator who started with nothing, that's a real outbound motion that ran while he answered his phone.
The clean shutdown
When the engagement ended, I shut the system down properly: all four scheduled jobs unloaded, all remaining future-scheduled sends cancelled (roughly 190 still-pending waves, each logged with a reason and timestamp, never deleted), the job definitions moved aside (recoverable, not destroyed), and every record (send log, reply log, CRM, knowledge base) preserved for the client's offboarding. A system that can't be cleanly and safely turned off isn't really production-grade. This one could.
War stories (the parts I'm proudest of)
- The 500-record export cap → batch tag/untag + multi-sort harvesting. Turning a crippled export into a complete one with two orthogonal tricks.
- The 18-million-row sentinel. A count over 100,000 as proof the search criteria silently failed, born directly from watching the form return the entire national database in the logs.
- The 9.78-million-cell spreadsheet. A CRM sync using append + INSERT_ROWS silently grew a Google Sheet to 375,978 rows over ~75 runs, closing in on Google's 10-million-cell hard limit. The non-obvious fix (clear then append + OVERWRITE, because plain update can't grow a fresh grid) is exactly the kind of bug you only learn by hitting it.
- Fail-closed, twice. Both the Opus draft-watchdog and the opt-out auditor were rewritten from fail-open to fail-closed after near-misses. The instinct to make safety checks fail loud and closed rather than silent and open is the whole job.
Principles that generalized
These are the things I'd carry to any production-LLM system, and they're what I'd want a hiring manager to take away:
- Put the LLM where being wrong is cheap; put deterministic code where being wrong is expensive. Reasoning and classification: model. Compliance, caps, suppression, idempotency: code. Never let a guarantee depend on inference.
- Quarantine over guess. A system that abstains under uncertainty beats one that's accurate-on-average, whenever a wrong action is costly and irreversible. Make abstention a first-class output.
- Distinguish "the model decided X" from "the model failed." Caching or acting on transient failures as if they were verdicts is how pipelines silently rot.
- Fail closed, and page a human. Every safety gate should fail in the safe direction and make noise. The expensive failure is the silent one.
- Design the off-switch first. Cancellable, reversible, fully audited. If you can't safely stop it, you shouldn't have started it.
- Cost is an architecture, not a setting. Tiered models, deterministic-first funnels, aggressive caching, and budget gates are what make LLM-in-the-loop affordable at five-figure volumes.
Stack
- Languages/runtime: Python, Node.js, Bash, macOS launchd
- Models: Claude Opus 4.7 (offline copy generation, nightly orchestration agent, final draft watchdog), Sonnet 4.6 (inbox triage, inbound responder), Haiku 4.5 (high-volume classification, angle routing, email vetting)
- Browser automation: Playwright (Chromium, anti-automation hardening, hybrid browser-auth + raw-HTTP extraction)
- Services: Resend (scheduled sends, idempotency keys, RFC 8058 compliance), Hunter.io (email discovery), ZeroBounce (verification), Google Sheets (CRM), Gmail (inbox)
- Patterns: retrieval-grounded generation, LLM-as-judge gating, multi-tier model routing, prompt caching, two-stage constrained classification, deterministic safety envelopes, autonomous scheduled agents with dead-man's-switch observability, human-in-the-loop send gates
I built this end to end: data acquisition, the LLM pipeline, the compliance engine, the autonomous operations layer, and the knowledge base. The work I care about isn't getting a model to produce a good email once. It's building the envelope that lets a model do that ten thousand times, unattended, against real people's inboxes, without ever doing harm, and that knows how to stop. That's the job I want to keep doing.