Case study

Shipping an Autonomous Cold-Outreach Agent to Production

How I built, operated, and safely shut down a self-driving LLM outreach system for a regional B2B distributor.

I spent roughly six weeks putting an LLM in a loop that emails real strangers, at volume, unattended, and the whole job was making sure it never did something stupid or illegal. The client was a Dallas-Fort Worth oil-and-lubricants distributor. The system sourced ~12,900 prospects, classified and quarantined them with LLMs, generated grounded per-prospect email copy, scheduled multi-touch sequences inside hard deliverability and CAN-SPAM guardrails, ran unattended on a daily cron lifecycle with a dead-man's switch, and monitored its own reputation. When the engagement ended, it shut itself down cleanly: every queued send cancelled, every job unloaded, full audit trail preserved.

This is a writeup of how it actually works. The headline isn't a conversion number. It's that a fully autonomous system ran a real outbound campaign end to end, in production, with no human in the loop, and never once stepped outside the law or the deliverability envelope. That is the actual job of an applied-AI or forward-deployed engineer: not the demo, the production envelope around the demo.

I have anonymized the client and abstracted the data sources. Everything else (architecture, code, numbers) is real and pulled directly from the repository.


At a glance

Prospects sourced & deduplicated: 12,876 (from 4 independent data sources, merged on normalized phone)
Quarantined by the classifier: ~11,100 shop-classifications routed to "do not pitch" (out-of-scope / too ambiguous)
Prospects enriched with verified emails: ~2,030 unique businesses (scrape → Hunter.io → verify waterfall)
Unique businesses emailed: 1,227, across up to 4 sequenced waves (~6,000 send records)
Bounce rate: ~1.1% (13 bounces / 1,227), against a 2% kill threshold
Opt-outs honored: 78 all-time, enforced across every future-scheduled wave
LLM calls: ~25,700 cached decisions across 3 model tiers (classification, angle-routing, draft-vetting), plus live calls
Autonomous jobs: 4 scheduled agents (15-min, hourly, nightly, morning) running unattended
Compliance incidents: 1 real CAN-SPAM miss, root-caused and fixed with a self-auditing watchdog
Stack: Python, Node/Playwright, Claude (Opus / Sonnet / Haiku), Resend, Hunter.io, ZeroBounce, Google Sheets, macOS launchd

The one-line version: I treated "an LLM that emails strangers" as a safety-critical system. A single 2,600-line compliance module is most of what made it one.


The problem

The client is the owner-operator of a regional distributor that stocks several hundred SKUs of motor oil, heavy-duty diesel oil, transmission and hydraulic fluids, with same-region delivery and trade credit terms. Auto shops, dealerships, and fleets in the metro were his ideal customers. He had no list, no outbound motion, and no time. He answers his own phone.

The brief was simple to state and hard to do well: find every relevant shop in the metro, reach the right person, say something true and specific enough that a busy shop owner would reply, and do it without torching the domain's email reputation or breaking anti-spam law. Then keep doing it every day without me babysitting it.

Two constraints shaped every design decision:

  1. The client is non-technical and busy. The system had to run itself and only surface a human when something genuinely needed a human. No dashboards he'd never open.
  2. One careless send is unrecoverable. Emailing a shop that already said "stop" is a federal CAN-SPAM violation. Blasting identical subject lines gets the whole domain flagged by Gmail. There is no undo on a sent email. The cost of a mistake is asymmetric, so the system had to be biased toward not acting when uncertain.

That second constraint is the whole story. Everything below is downstream of "the cheap failure is sending nothing; the expensive failure is sending the wrong thing."


System architecture

[Figure: pipeline diagram. Main path: STEP 01 Acquire + dedupe (4 data sources · 12,876 prospects) → LLM · CLASSIFY Segment + archetype (Haiku · quarantine over guess) → LLM · PICK ANGLE Constrained shortlist (no fitting angle drops out) → ENRICH Find + verify email (discovery API + verification) → GATE · SUPPRESS Opt-out + bounce check (deterministic, before any send) → GENERATE Grounded per-prospect copy (templates · four waves) → GATE · SAFETY Draft gate (merge-leak · spam · slop · brand) → SEND Schedule + send (caps · waves · idempotency). Side exits: QUARANTINE (~11,100 held, out of scope, never pitched) and DROPPED (blocked drafts fail the gate, never sent).]
Reasoning runs in the LLM stages (blue). Guarantees run in the deterministic gates (rust). A prospect drops at any branch it fails.

The design principle that runs through all of it: the LLM does the reasoning, deterministic code holds the guarantees. An LLM decides whether a shop is a diesel fleet or a body shop, and which pitch fits. Plain Python decides whether someone opted out, whether we've hit today's send cap, and whether a draft leaked a {merge_field}. The model is allowed to be wrong; the guarantees are not allowed to depend on the model being right.
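As an illustration of what a guarantee in plain Python can look like, here is a minimal sketch of a merge-field leak check. The regex and function names are hypothetical and not taken from the repository; the only thing assumed from the text is that a rendered draft must never contain an unfilled {merge_field}.

```python
import re

# Matches a leftover {placeholder} or {{placeholder}} in rendered copy.
# The exact placeholder syntax is an assumption for this sketch.
_MERGE_FIELD = re.compile(r"\{\{?\s*[\w.]+\s*\}?\}")


def leaked_merge_fields(rendered: str) -> list:
    """Return every unfilled template placeholder found in a rendered draft."""
    return _MERGE_FIELD.findall(rendered)


def passes_merge_gate(rendered: str) -> bool:
    """Deterministic gate: a draft with any leftover placeholder is blocked."""
    return not leaked_merge_fields(rendered)
```

No model is consulted here, so the check behaves the same way on every run regardless of what the LLM stages produced.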


Part 1: Data acquisition at scale

The prospect universe was assembled from four independent sources, each contributing something the others couldn't:

These merged into 12,876 unique prospects, deduplicated primarily on normalized phone number (the only identifier that survived across heterogeneous sources), with cross-source fields enriching matched records.
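A minimal sketch of phone-keyed deduplication, assuming US ten-digit numbers and records represented as plain dicts with a "phone" field. The field names and the "earlier source wins, later sources fill gaps" merge policy are illustrative assumptions, not the repository's actual rules.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Reduce a US phone number to 10 digits, or None if it is unusable."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    return digits if len(digits) == 10 else None


def merge_on_phone(records: list) -> list:
    """Merge records that share a normalized phone number.

    Earlier records win on conflicting fields; later records only fill gaps.
    Records without a usable phone are kept separately and never matched.
    """
    merged = {}
    unmatched = []
    for rec in records:
        key = normalize_phone(rec.get("phone", ""))
        if key is None:
            unmatched.append(dict(rec))
            continue
        target = merged.setdefault(key, {"phone": key})
        for field, value in rec.items():
            if field != "phone" and value and not target.get(field):
                target[field] = value
    return list(merged.values()) + unmatched
```

Because dicts preserve insertion order, the output order follows the first appearance of each phone number, which keeps repeated runs over the same input stable.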

The engineering that's worth showing

The firmographic source was the hard one. It sat behind an authenticated session and a stateful, JavaScript-heavy search form, and it capped exports at 500 records per query, far below the thousands a single SIC-code × county combination returned. Getting a complete extraction took eight iterations, and the evolution is the story:

The lesson I'd put on a whiteboard: when a system fights you, the win is usually not brute force, it's finding the cheap orthogonal angle (a second sort order, a session key you can replay) that turns an impossible request into a possible one.


Part 2: The classification brain

Sourcing 12,876 prospects is easy. Knowing which ones to leave alone is the hard, valuable part, and it's where the LLMs earned their keep.

Classification ran in two LLM stages, both on Claude Haiku 4.5 (the cheap, fast tier, chosen deliberately for a job that runs across tens of thousands of records):

  1. Segment + subsegment (passenger_lube / hd_diesel / dealer_fleet / out-of-scope), which routes a prospect to the right product taxonomy.
  2. Archetype (one of 10: euro specialist, hybrid specialist, lube chain, transmission specialist, body shop, heavy-duty diesel, independent general, mobile mechanic, out-of-scope, unclear), which routes to the right pitch.

Three of those archetypes (mobile mechanic, out-of-scope, unclear), plus a distinct needs-retry state, form a four-member quarantine set. They never get an email. Over the campaign, the classifier routed roughly 11,100 shop-classifications into quarantine (out-of-scope dominated at ~9,200), concentrating the entire outbound effort on the qualified minority.

The design rule, written into the code as a comment, was "quarantine over guess. No proximity-fallback. No 'when in doubt → safe pick' bias." This matters more than it sounds:

This is the difference between a classifier that's accurate on average and one that's safe to put in front of a stranger's inbox. Average accuracy is fine when a wrong answer costs nothing. Here a wrong "this body shop wants diesel oil" costs a real person a stupid email and costs the client a reputation. The system is built to abstain.


Part 3: Prompt and context engineering for copy

Personalized copy at this scale is a cost and consistency problem. My answer was to split writing from routing:

This is a deliberate cost architecture. Generating bespoke prose for 12,000 prospects with a frontier model is slow and expensive and inconsistent. Generating 24 excellent templates once and routing them intelligently is fast, cheap, auditable, and you can actually read and approve every possible email a prospect might receive.

The angle pick is a constrained decision, not a free one

Rather than ask Haiku "pick 1 of 24 templates" (a wide, error-prone choice), I made it a two-stage funnel:

  1. Classify into one of 10 archetypes.
  2. Each archetype maps to a curated shortlist of 2-6 candidate angles. Intersect that shortlist with the segment's available templates. If zero candidates remain, quarantine. If exactly one, pick it deterministically with no LLM call at all. If two or more, a second Haiku call chooses among them.
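The funnel above can be sketched as follows. The shortlist contents, angle names, and the llm_choose callable are hypothetical stand-ins; the real archetype-to-angle mapping is not reproduced in this writeup.

```python
from typing import Callable, Optional

# Hypothetical shortlists, for illustration only.
ARCHETYPE_SHORTLIST = {
    "euro_specialist": ["oem_spec", "same_day_delivery", "credit_terms"],
    "body_shop": ["same_day_delivery"],
    "unclear": [],
}


def pick_angle(archetype: str,
               segment_templates: set,
               llm_choose: Callable[[list], str]) -> Optional[str]:
    """Return an angle name, or None to quarantine the prospect.

    The model is only consulted when two or more candidates survive
    the intersection of shortlist and available templates.
    """
    shortlist = ARCHETYPE_SHORTLIST.get(archetype, [])
    candidates = [a for a in shortlist if a in segment_templates]
    if not candidates:
        return None              # zero candidates: quarantine
    if len(candidates) == 1:
        return candidates[0]     # deterministic, no LLM call
    choice = llm_choose(candidates)
    # An answer outside the shortlist is treated as no answer, not repaired.
    return choice if choice in candidates else None
```

Returning None for an off-list model answer is a design choice in this sketch: it keeps the "never a fallback" rule intact by refusing to substitute a default angle.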

Shrinking the decision space makes wrong answers structurally harder, and lets the common cases skip the model entirely. When a batched angle-pick call timed out, the code recursively split the batch in half and retried each half, salvaging work while still honoring "every answer is a real model pick, never a fallback."
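A minimal sketch of the split-and-retry idea, assuming the batched call raises TimeoutError on a timeout and returns a mapping of prospect to angle on success. call_model is a hypothetical stand-in for the batched Haiku call.

```python
from typing import Callable, Sequence


def pick_with_split(batch: Sequence,
                    call_model: Callable[[Sequence], dict]) -> tuple:
    """Return (picks, unresolved).

    On a timeout the batch is halved and each half retried. A single item
    that still times out is left unresolved rather than given a default pick.
    """
    if not batch:
        return {}, []
    try:
        return dict(call_model(batch)), []
    except TimeoutError:
        if len(batch) == 1:
            return {}, list(batch)
        mid = len(batch) // 2
        left_picks, left_unresolved = pick_with_split(batch[:mid], call_model)
        right_picks, right_unresolved = pick_with_split(batch[mid:], call_model)
        return {**left_picks, **right_picks}, left_unresolved + right_unresolved
```

Recursion depth is bounded by log2 of the batch size, so even a batch of several hundred prospects stays far from Python's recursion limit.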

Context engineering: grounding, not dumping

Each prospect's prompt was assembled from three joined sources: the firmographic row (name, SIC description, city, employee count), a maps-API join on normalized phone (business type, rating, review count), and the client's own existing-customer list joined by city (to produce a same-city name-drop for social proof, gated on an explicit CanNameDrop flag).

Critically, the prompt didn't just dump these fields. It included interpretation instructions: "1-2 employees = solo/single-bay; 16+ = chain"; "primary_type 'Car dealer' → very likely dealer_fleet"; "100+ reviews at 4.5+ = serious volume, a real buyer." And the templates themselves were grounded in verified domain facts, exact OEM specifications (BMW LL-01, MB 229.5, VW 502/505/507), real SKUs, with an explicit forbidden list of claims the client wasn't confident about. Hallucinating a certification the distributor doesn't hold is exactly the failure that destroys B2B trust, so the unsafe claims were removed from the model's reach rather than left to its judgment.

Caching as a first-class concern

Across the campaign, the system accumulated ~21,000 cached classifications, ~1,700 cached angle picks, ~2,800 cached email-legitimacy verdicts, and ~2,100 cached Hunter.io lookups. The cache key for classification was name | SIC | sha1(city+website)[:8], hashing city and website specifically so two same-named shops in different cities don't collide. A per-night classification budget meant cached rows were free and only fresh calls counted against the cap; the comment in the code notes that over ~13 nights the cache would cover all 12,876 rows, and that slow-but-complete beats fast-but-lossy.
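A sketch of the cache key and the nightly budget, built from the key shape quoted above. The exact separator, any string normalization, the row field names, and the classify callable are assumptions for illustration.

```python
import hashlib


def classification_cache_key(name: str, sic: str, city: str, website: str) -> str:
    """Build name | SIC | first 8 hex chars of sha1(city + website)."""
    digest = hashlib.sha1((city + website).encode("utf-8")).hexdigest()[:8]
    return f"{name} | {sic} | {digest}"


def classify_with_budget(rows, cache: dict, classify, nightly_budget: int):
    """Return (results, deferred).

    Cache hits are free. Only fresh model calls count against the budget,
    and rows that do not fit tonight are deferred to a later run.
    """
    fresh_calls = 0
    results, deferred = {}, []
    for row in rows:
        key = classification_cache_key(
            row["name"], row["sic"], row["city"], row["website"])
        if key in cache:
            results[key] = cache[key]
        elif fresh_calls < nightly_budget:
            cache[key] = results[key] = classify(row)
            fresh_calls += 1
        else:
            deferred.append(row)
    return results, deferred
```

Running the same rows on consecutive nights with the same cache advances through the list by at most nightly_budget fresh calls each time, which is the slow-but-complete behavior described above.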


Part 4: The compliance and safety engine

This is the centerpiece, a 2,600-line module that everything else routes through. It implements defense-in-depth, because the cost of a single bad send is asymmetric.

Pre-send draft gates (deterministic, first-hit-wins, no LLM):

The LLM watchdog (the most expensive model, at the last gate): drafts that pass the deterministic gates and aren't from free email domains go to Opus 4.7 at extra-high effort, batched, to catch scraper-misfire and wrong-entity contamination. The cheapest model does the high-volume classification; the most expensive model guards the final step before sends go out. And this gate fails closed: an earlier version let an Opus blip ship unvalidated drafts (fail-open), so I rewrote it to route failed batches to a separate needs-retry file that is never approved and never sent, and to page me if more than 20% of drafts hit that failure path.
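A minimal sketch of a fail-closed final gate. validate_batch and alert are hypothetical callables, and the 20% figure is read here as the share of drafts whose validation call failed, which is an interpretation of the text rather than a confirmed detail.

```python
from typing import Callable, Iterable


def run_final_gate(batches: Iterable,
                   validate_batch: Callable[[list], dict],
                   alert: Callable[[str], None],
                   max_failed_share: float = 0.20) -> tuple:
    """Return (approved, needs_retry).

    A batch whose validation call raises is never approved: its drafts go
    to needs_retry. Drafts with a missing or False verdict are not approved.
    """
    approved, needs_retry, total = [], [], 0
    for batch in batches:
        total += len(batch)
        try:
            verdicts = validate_batch(batch)  # assumed shape: {draft_id: bool}
        except Exception:
            needs_retry.extend(batch)  # fail closed: an error is not a pass
            continue
        approved.extend(d for d in batch if verdicts.get(d) is True)
    if total and len(needs_retry) / total > max_failed_share:
        alert(f"{len(needs_retry)}/{total} drafts could not be validated")
    return approved, needs_retry
```

The `is True` comparison is deliberate: a verdict of None, a missing key, or any non-boolean value is treated the same as a rejection.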

Opt-out enforcement (three layers, deterministic-first):

  1. A conservative regex pre-pass runs before the LLM inbox triage. A bare "stop" reply matches; "stop by anytime" deliberately does not.
  2. The LLM triage semantically classifies replies (opt-out / bounce / prospect-reply / urgent).
  3. An independent drift auditor re-scans the inbox against the reply log to catch anything triage missed, and fails the nightly pipeline with a specific exit code if it finds a high-severity miss containing a stop-keyword.
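The first layer can be sketched as a deliberately narrow regex. The keyword list below is an assumption; the only behavior taken from the text is that a bare "stop" matches and "stop by anytime" does not.

```python
import re

# Matches only when the whole line is a stop keyword plus optional punctuation.
_OPT_OUT = re.compile(
    r"^\W*(stop|unsubscribe|remove me|opt out|opt-out)\W*$",
    re.IGNORECASE,
)


def is_bare_opt_out(reply_text: str) -> bool:
    """True only when the first non-empty line of the reply is a bare stop keyword.

    Anything longer or more ambiguous returns False and is left for the
    later, semantic layers to classify.
    """
    for line in reply_text.splitlines():
        stripped = line.strip()
        if stripped:
            return bool(_OPT_OUT.match(stripped))
    return False
```

A False result here does not mean "not an opt-out". It only means the cheap deterministic check could not be sure, so a reply like "please stop emailing me" still reaches the LLM triage and the drift auditor.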

When an opt-out is detected, the kill-chain is: append to the suppression list → cancel every future-scheduled wave for that address in the email provider → append to the dedup log → mark the thread handled. If the pipeline lock is held, the cancellation is deferred and retried, never silently dropped.

Suppression at send time: before any wave goes out, a separate auditor loads every suppression source (opt-outs, bounces, "replied elsewhere," and real replies), scans for any future-scheduled send to a suppressed address, cancels it via the provider's API, then re-verifies with a second scan and exits non-zero if any violation remains.
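A sketch of the cancel-then-re-verify pattern. list_scheduled and cancel are hypothetical wrappers around the email provider, and the record shape ({"id", "to"}) is an assumption.

```python
def audit_scheduled_sends(list_scheduled, cancel, suppression_sources) -> int:
    """Cancel future sends to suppressed addresses, then verify.

    Returns 0 when no violation remains after cancellation, 1 otherwise,
    so the caller can use the value directly as a process exit code.
    """
    suppressed = set()
    for source in suppression_sources:
        suppressed |= {addr.strip().lower() for addr in source}

    def violations():
        return [s for s in list_scheduled()
                if s["to"].strip().lower() in suppressed]

    for send in violations():
        cancel(send["id"])

    # Second scan: trust the provider's current state, not our own cancel calls.
    return 1 if violations() else 0
```

The second scan is the point of the pattern. A cancel call that silently fails still leaves the send visible in the provider's list, so the audit exits non-zero instead of assuming success.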

The incident that proves the design

On 2026-04-30, the system committed a real CAN-SPAM violation: it sent a follow-up wave to a prospect who had replied "stop." I want to be straight about this, because how it was handled is the actual portfolio piece.

Root cause: opt-out deduplication keyed on Gmail's read/unread state. But read/unread state is owned by the human inbox: a person reading a "STOP" reply could silently flip a thread out of the "needs processing" set, un-gating it before the suppression ran. The state that the safety check depended on was mutable by someone outside the system.

The fix was a re-architecture, not a patch:

That's the shape of responsible production AI: a real failure, root-caused to a state-ownership bug (not "the model got it wrong"), fixed with redundant deterministic layers and a self-auditing watchdog so the same class of bug can't recur silently. Out of the entire campaign, this was the one real violation, and the system was rebuilt so it would catch itself next time.


Part 5: Autonomous operation

The system ran itself on four scheduled macOS launchd agents (chosen over cron so the jobs run in the GUI session and can reach Keychain OAuth tokens):

Job               Cadence        Role
poll-inbox        every 15 min   opt-out pre-pass → LLM triage → bounce canary → CRM sync
nightly-pipeline  21:00 daily    enrich → generate → validate → schedule → sync, then deterministic self-audit
heartbeat         hourly         dead-man's switch
morning-snapshot  08:00 daily    one status email to the operator

A few decisions made this safe to leave unattended:

Sends are decoupled from my machine being awake. Emails are queued into the provider (Resend) with absolute scheduled_at timestamps and idempotency keys; the provider fires them in an 11:00-11:30 UTC window. The orchestration layer only ever plans. It never has to be awake to send. This matters when the operator's machine is a laptop being SSH'd into from a cafe.
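A sketch of the planning step only, with no network call. The key derivation and the payload shape are assumptions for illustration; the actual request fields and header name should be taken from the provider's documentation.

```python
import hashlib
from datetime import datetime, timezone


def idempotency_key(campaign: str, recipient: str, wave: int) -> str:
    """Derive a stable key so re-planning the same wave cannot double-send."""
    raw = f"{campaign}|{recipient.strip().lower()}|wave-{wave}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def plan_send(campaign: str, recipient: str, wave: int, send_at: datetime) -> dict:
    """Build a planned send with an absolute UTC timestamp.

    Naive datetimes are rejected: an absolute schedule must not depend on
    the timezone of whichever machine happened to do the planning.
    """
    if send_at.tzinfo is None:
        raise ValueError("send_at must be timezone-aware")
    return {
        "idempotency_key": idempotency_key(campaign, recipient, wave),
        "payload": {
            "to": recipient,
            "scheduled_at": send_at.astimezone(timezone.utc).isoformat(),
        },
    }
```

Because the key depends only on campaign, recipient, and wave, running the planner twice for the same night produces identical keys, and the provider can reject the duplicate.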

The nightly orchestrator is itself an LLM agent, an Opus call given a prompt to run the five-step pipeline via Bash and decide volumes against the caps, wrapped in a 90-minute hard timeout so a hung model call can't block tomorrow's run. But every guarantee it relies on (opt-out suppression, cap enforcement, missed-opt-out detection) runs as deterministic code wired into the shell script, after the agent, "so it's deterministic even if the LLM agent skips a step or times out." The agent is allowed to be lazy. The audit is not.

Drip scheduling protects deliverability. A warmup ramp (50 → 100 → 200 → weekend-0 → 400 → 500/day, then steady) eased the domain into volume. Multi-touch waves dripped over 14 days (T+0, T+3, T+7, T+14), each placed on the earliest open weekday slot under the cap, staggered across the send window with ±20s jitter for organic pacing. The cap logic falls through to sane weekday/weekend defaults past the explicit table, specifically so the schedule never hits a future date with "no cap → silently send zero."
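A minimal sketch of cap-aware wave placement with a default fall-through. The default values (500 on weekdays, 0 on weekends), the per-day booked counter, and the rule that each wave lands strictly after the previous one are assumptions; jitter within the send window is left out.

```python
from datetime import date, timedelta

WAVE_OFFSETS = (0, 3, 7, 14)   # T+0, T+3, T+7, T+14
DEFAULT_WEEKDAY_CAP = 500      # assumed steady-state cap past the explicit table
DEFAULT_WEEKEND_CAP = 0


def cap_for(day: date, explicit_caps: dict) -> int:
    """Explicit table first, then weekday/weekend defaults. Never 'no cap'."""
    if day in explicit_caps:
        return explicit_caps[day]
    return DEFAULT_WEEKEND_CAP if day.weekday() >= 5 else DEFAULT_WEEKDAY_CAP


def schedule_waves(start: date, explicit_caps: dict, booked: dict,
                   horizon_days: int = 60) -> list:
    """Place each wave on the earliest day, at or after its target, with spare cap."""
    plan = []
    earliest = start
    for offset in WAVE_OFFSETS:
        day = max(start + timedelta(days=offset), earliest)
        for _ in range(horizon_days):
            if booked.get(day, 0) < cap_for(day, explicit_caps):
                booked[day] = booked.get(day, 0) + 1
                plan.append(day)
                earliest = day + timedelta(days=1)
                break
            day += timedelta(days=1)
        else:
            # Loud failure instead of silently scheduling nothing.
            raise RuntimeError(f"no open slot for wave T+{offset}")
    return plan
```

With a weekend cap of 0, weekends are skipped without a separate weekday check, and a date beyond the explicit ramp table still resolves to a real cap rather than to nothing.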

Observability is a dead-man's switch, not a dashboard. The heartbeat distinguishes never ran (missing log) from stopped running (stale log), has a weekday-midday zero-send tripwire that catches silent under-fills the staleness checks would miss, and even checks pmset/caffeinate to catch the root cause (the Mac asleep) of scheduling drift. A mid-day bounce canary on the 15-minute cadence emergency-cancels the next 6 hours of sends if bounces spike past threshold, closing the gap between the nightly and morning reputation checks. The operator got exactly one email a day unless something was wrong.
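Two of these checks can be sketched in a few lines. The function names, the log-file convention, and the minimum sample size are assumptions; the 2% threshold is the kill threshold quoted in the summary.

```python
import os
import time


def heartbeat_status(log_path: str, max_age_seconds: float, now=None) -> str:
    """Distinguish 'never_ran' (missing log) from 'stale' (old log) from 'ok'."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(log_path)
    except FileNotFoundError:
        return "never_ran"
    return "stale" if now - mtime > max_age_seconds else "ok"


def bounce_canary_tripped(bounces: int, sends: int,
                          threshold: float = 0.02,
                          min_sample: int = 50) -> bool:
    """True when the bounce rate exceeds the threshold on a meaningful sample.

    min_sample is an assumed guard so one bounce in a handful of sends
    does not trigger an emergency cancel.
    """
    if sends < min_sample:
        return False
    return bounces / sends > threshold
```

Keeping "never_ran" and "stale" as separate states matters operationally: the first points at a job that was never loaded, the second at one that was running and then stopped.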

A read-only supervisor agent on top. A separate watchdog-agent prompt (designed for an external GPT-5-class runtime) monitored files, the email provider, the inbox, and the CRM on a tiered cadence, strictly read-and-alert: it could page me, but it was forbidden from mutating state or pushing code without a human owning the change. When it found something it couldn't diagnose, it could escalate by spawning a more capable model for root-cause analysis. Agents watching agents, with humans on the only write path that matters.


Part 6: Knowledge base and grounded inbound

For handling inbound replies, I built a retrieval-grounded responder (designed, tested, and deliberately held back from live auto-send, which I'll get to).

The knowledge base combined hand-authored facts with a document-extraction pipeline. 26 vendor PDFs (spec sheets, brand line cards, certifications, price lists) were each run through a pipeline: raw PDF → standardized PDF → markdown summary + structured JSON → master index. The standout detail: each JSON record carries not just summary and extracted_text and key_facts, but a send_when / never_send_when block. The KB doesn't just store what a document says, it stores when an agent should attach it and when it shouldn't. That is the line between a document store and an agent-ready knowledge base.
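A sketch of how such a record might drive an attach decision. Only the field names summary, key_facts, send_when, and never_send_when come from the text; modeling the two blocks as sets of topic tags, and the tag values themselves, are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class KBDocument:
    doc_id: str
    summary: str
    key_facts: list = field(default_factory=list)
    send_when: set = field(default_factory=set)        # assumed: topic tags
    never_send_when: set = field(default_factory=set)  # assumed: topic tags


def should_attach(doc: KBDocument, reply_tags: set) -> bool:
    """Attach only on a positive match, and let any veto win."""
    if doc.never_send_when & reply_tags:
        return False
    return bool(doc.send_when & reply_tags)


spec_sheet = KBDocument(
    doc_id="euro-spec-sheet",
    summary="OEM approvals for European synthetic oils",
    send_when={"asks_about_oem_spec"},
    never_send_when={"asks_about_price"},
)
```

Checking the veto first means a reply that touches both an allowed and a forbidden topic gets no attachment, which matches the bias toward not acting when uncertain.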

Pricing was reverse-engineered from real artifacts (one accounting estimate plus four real quote replies the owner had sent), with per-number provenance tracked and unknowns explicitly flagged for human escalation rather than guessed.

The inbound responder loaded the "safe" KB files into context (stuffed-context grounding) and emitted strict JSON with a route: auto-draft, escalate-to-account-manager, escalate-to-owner, opt-out, bounce, or noise. Nine hard rules enforced groundedness, the first being never quote a price. The pricing file was deliberately excluded from the model's context, removing the temptation rather than trusting the model to resist it. Name-drops were constrained to an explicit allowlist and emitted as an auditable field, so the model couldn't invent fake social proof.
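A minimal sketch of validating the responder's output before anything acts on it. The route names are the six listed above; the name_drops field name and the choice to escalate to the owner on any invalid output are assumptions.

```python
import json

ALLOWED_ROUTES = {
    "auto-draft", "escalate-to-account-manager", "escalate-to-owner",
    "opt-out", "bounce", "noise",
}


def parse_responder_output(raw, name_drop_allowlist: set) -> dict:
    """Validate model output; anything malformed escalates to a human."""
    fallback = {"route": "escalate-to-owner", "reason": "invalid model output"}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(data, dict) or data.get("route") not in ALLOWED_ROUTES:
        return fallback
    name_drops = data.get("name_drops", [])
    if not isinstance(name_drops, list) or any(
        not isinstance(n, str) or n not in name_drop_allowlist
        for n in name_drops
    ):
        return {"route": "escalate-to-owner",
                "reason": "name-drop not on allowlist"}
    return data
```

The validator never repairs a bad response. An unknown route or an off-list name-drop is handed to a person instead of being coerced into the nearest valid value.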

Human-in-the-loop by construction: the responder only ever wrote a draft. Nothing sent until a human ran an approve command (with edit, escalate, and reject as the only other options, and no default that sends). Idempotency keys prevented double-sends.

The honest part: this inbound auto-drafter was built and smoke-tested but never wired into the live loop. In production, the owner handled prospect replies by phone, which was the right call for a high-touch B2B relationship. The live inbound system was classify-notify-cancel only: it detected replies, alerted the operator, and cancelled follow-up waves. I'm including the responder here because the engineering (the PDF-to-decision-metadata pipeline, the grounded-reply design, the HITL gate) is real and reusable, but I won't claim it ran when it didn't.


Part 7: Deliverability as an engineered system

Deliverability was treated as a measured, monitored property, not an afterthought:


Results, honestly

The verifiable campaign numbers:

The headline result is the autonomy itself: an agent ran a real B2B outbound campaign end to end, unattended, and stayed inside the law and the deliverability envelope the entire time. The reply volume was real but small, and a late dip traced to a deliverability bug I'd diagnosed but not yet fixed when the engagement ended. These are the real numbers, not a vanity metric. In 2026 the thing worth weighing is not how many replies a human-run campaign would have pulled, it's that a system did the whole thing on its own without doing harm.

What the system unambiguously delivered: a complete, deduplicated metro prospect database; a precision-targeting classifier that quarantined ~87% of the list instead of spraying it; thousands of grounded, on-brand, individually-validated emails sent without a deliverability blowup; full CAN-SPAM compliance enforcement with one caught-and-fixed miss; and unattended daily operation with self-monitoring. For an owner-operator who started with nothing, that's a real outbound motion that ran while he answered his phone.

The clean shutdown

When the engagement ended, I shut the system down properly: all four scheduled jobs unloaded, all remaining future-scheduled sends cancelled (roughly 190 still-pending waves, each logged with a reason and timestamp, never deleted), the job definitions moved aside (recoverable, not destroyed), and every record (send log, reply log, CRM, knowledge base) preserved for the client's offboarding. A system that can't be cleanly and safely turned off isn't really production-grade. This one could.


War stories (the parts I'm proudest of)


Principles that generalized

These are the things I'd carry to any production-LLM system, and they're what I'd want a hiring manager to take away:

  1. Put the LLM where being wrong is cheap; put deterministic code where being wrong is expensive. Reasoning and classification: model. Compliance, caps, suppression, idempotency: code. Never let a guarantee depend on inference.
  2. Quarantine over guess. A system that abstains under uncertainty beats one that's accurate-on-average, whenever a wrong action is costly and irreversible. Make abstention a first-class output.
  3. Distinguish "the model decided X" from "the model failed." Caching or acting on transient failures as if they were verdicts is how pipelines silently rot.
  4. Fail closed, and page a human. Every safety gate should fail in the safe direction and make noise. The expensive failure is the silent one.
  5. Design the off-switch first. Cancellable, reversible, fully audited. If you can't safely stop it, you shouldn't have started it.
  6. Cost is an architecture, not a setting. Tiered models, deterministic-first funnels, aggressive caching, and budget gates are what make LLM-in-the-loop affordable at five-figure volumes.

Stack

Languages/runtime: Python, Node.js, Bash, macOS launchd
Models: Claude Opus 4.7 (offline copy generation, nightly orchestration agent, final draft watchdog), Sonnet 4.6 (inbox triage, inbound responder), Haiku 4.5 (high-volume classification, angle routing, email vetting)
Browser automation: Playwright (Chromium, anti-automation hardening, hybrid browser-auth + raw-HTTP extraction)
Services: Resend (scheduled sends, idempotency keys, RFC 8058 compliance), Hunter.io (email discovery), ZeroBounce (verification), Google Sheets (CRM), Gmail (inbox)
Patterns: retrieval-grounded generation, LLM-as-judge gating, multi-tier model routing, prompt caching, two-stage constrained classification, deterministic safety envelopes, autonomous scheduled agents with dead-man's-switch observability, human-in-the-loop send gates


I built this end to end: data acquisition, the LLM pipeline, the compliance engine, the autonomous operations layer, and the knowledge base. The work I care about isn't getting a model to produce a good email once. It's building the envelope that lets a model do that ten thousand times, unattended, against real people's inboxes, without ever doing harm, and that knows how to stop. That's the job I want to keep doing.