
B2B Sales Outreach System - From Scraping to Autonomous Agents

September 22, 2025
sales·outreach·ai-agents·scraping·b2b·automation·lead-generation·crawlee·typescript

An end-to-end sales infrastructure that evolved from manual lead research to fully autonomous AI agent processing.

Built for AITax.sg once the product was ready. Now a system that could work for any B2B company.

The Evolution

Phase 1: Lead Generation at Scale (September 2025)

Manual research was the bottleneck. Hours spent browsing directories, copying contact info, organizing spreadsheets. Built Ultra-Man-Scraper to fix this.

The System:

  • Multi-vertical scraping (accounting firms, tuition centres, etc.)
  • Adaptive rate limiting with Bottleneck - auto-scales based on 429 response rates
  • Smart contact extraction with libphonenumber-js for Singapore (+65) phone validation
  • WhatsApp-capability detection for mobile numbers
  • Country-aware parsing (SG, MY, HK, AU domains)
  • Structured JSON/CSV output ready for pipeline ingestion
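A minimal sketch of the Singapore phone check and WhatsApp heuristic from the list above. The scraper itself relies on libphonenumber-js; this stand-in uses plain regular expressions so it runs without dependencies. It assumes the usual Singapore numbering convention (8-digit national numbers, mobiles starting with 8 or 9, fixed lines with 6, VoIP with 3). The names `parseSgPhone` and `whatsappCapable` are hypothetical, and treating every mobile number as WhatsApp-capable is only a heuristic.

```typescript
// Simplified stand-in for libphonenumber-js: Singapore (+65) numbers only.
type SgLineType = "mobile" | "fixed" | "voip";

interface SgPhone {
  e164: string; // normalized form, e.g. "+6591234567"
  type: SgLineType;
}

// Returns a normalized number, or null if the input is not a plausible SG number.
function parseSgPhone(raw: string): SgPhone | null {
  // Drop spaces, dashes, dots and parentheses that directories often include.
  let digits = raw.replace(/[\s\-().]/g, "");
  // Strip an explicit country code ("+65" or "0065").
  digits = digits.replace(/^(\+65|0065)/, "");
  // A bare "65" prefix is only treated as a country code on a 10-digit input.
  if (digits.length === 10) digits = digits.replace(/^65/, "");

  // Assumption: national numbers are 8 digits starting with 3, 6, 8 or 9.
  if (!/^[3689]\d{7}$/.test(digits)) return null;

  const first = digits.charAt(0);
  const type: SgLineType =
    first === "8" || first === "9" ? "mobile" : first === "6" ? "fixed" : "voip";
  return { e164: `+65${digits}`, type };
}

// Heuristic only: a mobile number can host WhatsApp, but may not have an account.
function whatsappCapable(phone: SgPhone): boolean {
  return phone.type === "mobile";
}
```

Numbers that fail the check return `null` rather than a best guess, which keeps unverified contacts out of the structured output.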

Technical Stack:

  • TypeScript + Crawlee (CheerioCrawler) for fast concurrent crawling
  • Cheerio for HTML parsing with intelligent content cleaning
  • libphonenumber-js for international phone number validation
  • Bottleneck for per-host rate limiting with adaptive adjustment
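The adaptive adjustment could look roughly like the sketch below. It is not the scraper's actual code: it is dependency-free, and the class name, window size, thresholds and factors are all illustrative assumptions. The idea is to count 429 responses over a fixed window of requests per host, double the delay when their share is too high, and relax it slowly after a clean window. The returned delay is the kind of value one might feed into a limiter's minimum spacing between requests (Bottleneck calls this option `minTime`).

```typescript
// Adaptive per-host delay driven by the share of HTTP 429 responses.
// All thresholds and factors here are illustrative assumptions.
interface AdaptiveOptions {
  minDelayMs: number;   // fastest allowed spacing between requests
  maxDelayMs: number;   // slowest allowed spacing
  windowSize: number;   // responses observed before each adjustment
  backoffAbove: number; // 429 share above which the delay doubles
}

class AdaptiveDelay {
  private readonly opts: AdaptiveOptions;
  private delayMs: number;
  private total = 0;     // responses seen in the current window
  private throttled = 0; // 429 responses seen in the current window

  constructor(opts: AdaptiveOptions) {
    this.opts = opts;
    this.delayMs = opts.minDelayMs;
  }

  // Record one response status; returns the delay to use before the next request.
  record(status: number): number {
    this.total += 1;
    if (status === 429) this.throttled += 1;
    if (this.total < this.opts.windowSize) return this.delayMs;

    const rate = this.throttled / this.total;
    if (rate > this.opts.backoffAbove) {
      // Too many 429s in this window: back off multiplicatively.
      this.delayMs = Math.min(this.delayMs * 2, this.opts.maxDelayMs);
    } else if (this.throttled === 0) {
      // Clean window: speed up gently, never below the floor.
      this.delayMs = Math.max(Math.round(this.delayMs * 0.9), this.opts.minDelayMs);
    }
    this.total = 0;
    this.throttled = 0;
    return this.delayMs;
  }
}

// One tracker per host, so a strict site does not slow down the others.
const perHost: Record<string, AdaptiveDelay> = Object.create(null);

function nextDelayFor(host: string, status: number): number {
  let tracker = perHost[host];
  if (!tracker) {
    tracker = new AdaptiveDelay({ minDelayMs: 500, maxDelayMs: 30000, windowSize: 20, backoffAbove: 0.1 });
    perHost[host] = tracker;
  }
  return tracker.record(status);
}
```

Adjusting once per window, rather than on every response, keeps a short burst of 429s from pushing the delay straight to its ceiling.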

Key Insight: The best automation tools integrate seamlessly into larger systems. The scraper outputs structured data that feeds directly into the AI research pipeline.


Phase 2: Folder-Based Pipeline (September 2025)

The folder IS the database. The package IS the CRM.

18-Stage Pipeline Architecture (11 of the stage folders shown):

01-discovered/       → New companies from CSV
02-researched/       → Comprehensive research completed
02c-no-email-track/  → Companies without verified emails
03-outreach-ready/   → Drafts created and approved
04-outreach-sent/    → Messages sent via Gmail/WhatsApp
05-replied/          → Prospects who responded
06-follow-up/        → In follow-up sequences
07-meeting/          → Meetings scheduled
08-pilot/            → Running pilot programs
09-negotiation/      → Contract discussions
10-closed-won/       → Deals closed
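Since the folder a package sits in is its state, a stage change is just a path change. The sketch below shows that idea with pure string handling; on disk the move would be a file rename. It covers only the folders listed above, and the `pipeline/` root, the file name and the function names are hypothetical.

```typescript
// Stage folders as listed above, in pipeline order.
// Only the folders shown in the listing are included here.
const STAGES = [
  "01-discovered",
  "02-researched",
  "02c-no-email-track",
  "03-outreach-ready",
  "04-outreach-sent",
  "05-replied",
  "06-follow-up",
  "07-meeting",
  "08-pilot",
  "09-negotiation",
  "10-closed-won",
] as const;

type Stage = (typeof STAGES)[number];

// Reads the current stage from a package path, or null if none is present.
function stageOf(packagePath: string): Stage | null {
  const parts = packagePath.split("/");
  for (const stage of STAGES) {
    if (parts.indexOf(stage) !== -1) return stage;
  }
  return null;
}

// Computes where the package would live after moving to the target stage.
function pathInStage(packagePath: string, target: Stage): string {
  const current = stageOf(packagePath);
  if (current === null) {
    throw new Error(`Not inside a stage folder: ${packagePath}`);
  }
  return packagePath
    .split("/")
    .map((part) => (part === current ? target : part))
    .join("/");
}
```

For example, `pathInStage("pipeline/02-researched/acme-pte-ltd.md", "03-outreach-ready")` yields `pipeline/03-outreach-ready/acme-pte-ltd.md`.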

Each prospect is a comprehensive markdown "outreach package" containing:

  • Company intelligence and research
  • Decision maker profiles with verified contacts
  • Pain points and opportunities identified
  • Personalized outreach drafts (email, WhatsApp, LinkedIn)
  • Complete communication history
  • Pipeline stage tracking

Current Scale: 843+ outreach packages across all stages.


Phase 3: Autonomous AI Processing (October 2025)

AI agents designed to run 24/7 on a DigitalOcean VPS, processing 96 companies per day (one every 15 minutes) without human intervention.

The Architecture:

  • Orchestrator: Main Bash script coordinating daily operations
  • Follow-up Engine: 30-stage follow-up sequence over 600 days
  • Daily Summary: Automatic reporting and git backups

The Innovation - Unix Primitives as AI Infrastructure:

  • systemd service: Automatic daily restart at 6 AM SGT
  • tmux sessions: Each agent runs visibly and debuggably
  • Folders as state machines: ls pipeline/ shows your entire CRM
  • Git as backup: Every change committed, hourly pushes to GitHub
  • Markdown as database: cat any prospect, grep your pipeline
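To illustrate "folders as state machines", the sketch below turns a flat list of file paths (for instance the output of a recursive listing of `pipeline/`) into per-stage counts. The function name, the `pipeline` root and the sample file names are assumptions for illustration.

```typescript
// Counts markdown packages per stage folder from a list of relative paths.
// Expects paths shaped like "<root>/<stage>/<file>.md"; anything else is ignored.
function countByStage(paths: string[], root: string = "pipeline"): Record<string, number> {
  // Null-prototype object so folder names never collide with built-in keys.
  const counts: Record<string, number> = Object.create(null);
  for (const p of paths) {
    const parts = p.split("/");
    const top = parts[0];
    const stage = parts[1];
    if (parts.length < 3 || top !== root || stage === undefined) continue;
    if (!/\.md$/.test(p)) continue;
    counts[stage] = (counts[stage] || 0) + 1;
  }
  return counts;
}

// Hypothetical sample listing.
const sample = [
  "pipeline/01-discovered/acme.md",
  "pipeline/01-discovered/bravo.md",
  "pipeline/05-replied/charlie.md",
  "pipeline/README.md",
];
const stageCounts = countByStage(sample);
// stageCounts now holds 2 for "01-discovered" and 1 for "05-replied".
```

The whole "query layer" is a directory listing plus a counter, which is the point of keeping state in folders.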

The 15-Minute Cycle:

  1. Pick next company from CSV
  2. Run fast confidence check (>70% threshold)
  3. Research with GLM 4.6 (find decision makers, verify emails)
  4. Generate personalized outreach
  5. Send via Gmail API
  6. Update CSV status
  7. Git commit changes
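The steps above can be sketched as one function with the side-effecting parts injected. This is a reading of the list, not the orchestrator's code (which is described as Bash and Python). All names are hypothetical, the calls are synchronous for brevity where real API calls would be asynchronous, and ">70%" is interpreted as "continue only when confidence is strictly above 0.7".

```typescript
interface Research {
  email: string | null; // verified address, or null if none was found
  notes: string;
}

// Injected steps, so the control flow can be read without a model, Gmail, or git.
interface CycleSteps {
  confidence(company: string): number;                 // fast check, 0..1
  research(company: string): Research;                 // model-backed research
  draft(company: string, research: Research): string;  // personalized message
  send(to: string, body: string): void;                // e.g. via the Gmail API
  updateStatus(company: string, status: string): void; // CSV row update
  commit(message: string): void;                       // git commit
}

type Outcome = "skipped" | "no-email" | "sent";

const CONFIDENCE_THRESHOLD = 0.7;

function runCycle(company: string, steps: CycleSteps): Outcome {
  let outcome: Outcome;
  if (steps.confidence(company) <= CONFIDENCE_THRESHOLD) {
    outcome = "skipped"; // not worth a full research pass
  } else {
    const research = steps.research(company);
    const email = research.email;
    if (email === null) {
      outcome = "no-email"; // handled on the LinkedIn/WhatsApp track instead
    } else {
      steps.send(email, steps.draft(company, research));
      outcome = "sent";
    }
  }
  // Status update and commit happen on every path, so each cycle leaves a trace.
  steps.updateStatus(company, outcome);
  steps.commit(`cycle: ${company} -> ${outcome}`);
  return outcome;
}

// 24 hours of 15-minute cycles gives the daily throughput: 96.
const CYCLES_PER_DAY = (24 * 60) / 15;
```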

Email Verification Protocol - CRITICAL:

  • NEVER guess email patterns
  • ALWAYS require source URL for every email found
  • Better to have "NO EMAIL FOUND" than wrong email
  • Companies without verified emails go to the no-email track (LinkedIn/WhatsApp only)
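One way to enforce this protocol in code is to make the source URL part of the data shape, so an address without one can never reach the send step. The sketch below is an assumption about how such a gate might look; the type names, the `inferredFromPattern` flag and the deliberately loose syntax checks are illustrative and not taken from the actual system.

```typescript
interface EmailFinding {
  email: string;
  sourceUrl?: string;            // page where the address was actually seen
  inferredFromPattern?: boolean; // true if the address was guessed, not observed
}

type EmailDecision =
  | { track: "email"; email: string; sourceUrl: string }
  | { track: "no-email" };

// Accepts the first finding that has a source URL and was not guessed.
// Falls back to the no-email track rather than using an unverified address.
function decideEmailTrack(findings: EmailFinding[]): EmailDecision {
  for (const finding of findings) {
    const source = finding.sourceUrl;
    if (source === undefined || !/^https?:\/\/\S+$/.test(source)) continue;
    if (finding.inferredFromPattern) continue;
    // Loose shape check only; this does not prove the mailbox exists.
    if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(finding.email)) continue;
    return { track: "email", email: finding.email, sourceUrl: source };
  }
  return { track: "no-email" };
}
```

Returning `{ track: "no-email" }` is treated as a normal outcome here, matching the rule that "NO EMAIL FOUND" is preferable to a wrong address.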

The Complete Stack

Lead Sources (directories, databases)
    ↓
Ultra-Man-Scraper (TypeScript, Crawlee, adaptive rate limiting)
    ↓
Pipeline Folders (18 stages, markdown packages)
    ↓
AI Orchestrator (GLM 4.6, 96/day processing rate)
    ↓
Multi-Channel Outreach (Gmail API, WhatsApp MCP)
    ↓
Follow-up Engine (30-stage, 600-day sequence)

Technical Components

  • Ultra-Man-Scraper: Lead extraction (TypeScript, Crawlee, Cheerio, libphonenumber-js)
  • Pipeline: State management (Folders, Markdown files, Git)
  • Orchestrator: 24/7 automation (Bash, Python, systemd, DigitalOcean VPS)
  • AI Research: Prospect intelligence (GLM 4.6, Claude orchestration)
  • Outreach: Execution (Gmail API, WhatsApp MCP, LinkedIn)

Lessons Learned

1. Manual First, Automate Second: Started with founder-led sales to understand what actually works. Only then automated the repetitive parts.

2. Folders > Databases: No Postgres. No Redis. No message queues. Just folders and markdown files. When you can ls your pipeline and cat your CRM, debugging becomes trivial.

3. AI Needs Structure, Not Intelligence: The breakthrough wasn't smarter agents - it was clear boundaries, simple interfaces, and foolproof state management. The 18-stage folder system gives AI agents unambiguous instructions.

4. Never Guess Emails: Pattern-based email guessing destroys sender reputation. Every email must have a source URL. Companies without verified emails get the alternative track (LinkedIn/phone).

5. Follow-up is Everything: 30-stage follow-up sequence over 600 days. Days 3-7: value focus. Days 10-90: industry news hooks. Days 90+: break-up style. Most deals close after touchpoint 7+.
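The follow-up phases in lesson 5 map naturally to a small lookup. The sketch below encodes only the day ranges stated there. Two things are assumptions: days 8-9 are not covered by the text, so they return `null`, and day 90 itself is assigned to the industry-news phase with break-up messages starting on day 91. The exact days of the 30 touchpoints are not given, so they are not modeled.

```typescript
type FollowUpStyle = "value" | "industry-news" | "break-up";

const SEQUENCE_LENGTH_DAYS = 600;

// Maps days since the first outreach to the message style for that phase.
// Returns null for days that fall outside the ranges described in the text.
function followUpStyle(daysSinceFirstOutreach: number): FollowUpStyle | null {
  const day = daysSinceFirstOutreach;
  if (day >= 3 && day <= 7) return "value";
  if (day >= 10 && day <= 90) return "industry-news";
  if (day > 90 && day <= SEQUENCE_LENGTH_DAYS) return "break-up";
  return null;
}
```

Keeping the phase boundaries in one function means a change to the cadence touches one place rather than every message template.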

Current Pipeline Status

  • Discovered: 57
  • Researched: 55
  • No Email Track: 202
  • Outreach Ready: 23
  • Outreach Sent: 131
  • Replied: 5
  • Follow-up: 1
  • Meeting: 4
  • Pilot: 1
  • Closed Won: 1

Total Active Pipeline: 843+ prospects (the ten stages listed above account for 480 of them)

Venture Potential

This system was built for AITax.sg, but the components are generalizable:

  • Lead scraping works for any B2B vertical (just change the extraction patterns)
  • Folder pipeline can manage any sales process
  • AI research agents can deep-dive any prospect list
  • Follow-up engine applies to any high-touch B2B sale

The question: is there a standalone business in B2B sales infrastructure?

Current Status

VPS orchestrator currently paused. The infrastructure is built and proven: one real fulfillment has been completed through the pipeline (a tax filing service was delivered; the payment structure still needs to be formalized). Nine more partnerships like that are needed to validate the model.

The system is designed to handle:

  • Initial outreach at scale
  • 30-stage follow-up sequences over 600 days
  • Channel rotation (email → WhatsApp → LinkedIn)
  • Git backup of all operations
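Channel rotation can be sketched as cycling through the fixed order email → WhatsApp → LinkedIn while skipping channels a prospect cannot be reached on (for example, no verified email). Whether the real system advances the channel on every touch is not stated, so the per-touch rotation here is an assumption, as are the names.

```typescript
type Channel = "email" | "whatsapp" | "linkedin";

// Rotation order as described: email, then WhatsApp, then LinkedIn.
const ROTATION: Channel[] = ["email", "whatsapp", "linkedin"];

// Picks the channel for the nth touch (0-based), cycling through the rotation
// and skipping channels that are not available for this prospect.
// Returns null when no channel is available or the touch index is invalid.
function channelForTouch(touch: number, available: Channel[]): Channel | null {
  const usable = ROTATION.filter((channel) => available.indexOf(channel) !== -1);
  if (usable.length === 0) return null;
  return usable[touch % usable.length] ?? null;
}
```

A prospect on the no-email track would simply pass `["whatsapp", "linkedin"]` as the available channels and alternate between the two.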

Next steps:

  1. Get the VPS orchestrator running again
  2. Formalize the partnership payment model
  3. Scale to 10 active partnerships