From Cockiness to 8 AI Agents: Building a Production Tax System in 3 Months

The Humbling Beginning: When All AIs Failed

It started with brotherly cockiness. My brother - the smartest person I knew growing up, #1 in tax at university, fresh off graduating top of his class at Oxford Tax - was "dying with work" while I was living my best life automating everything with AI.

"Give me a valuable task of what you are doing, and I will easily 1-shot prompt it into existence," I boasted.

Reality hit hard. o1 pro failed. Claude 3.7 failed. DeepSeek failed. Grok failed. Gemini 2.5 failed. Every single AI model I threw at Singapore tax computation produced nonsense.

The Core Problem: Teaching AI complex tax law is like teaching someone to drive by only letting them touch the steering wheel once every 5 minutes and only telling them "you crashed" without explaining why. The feedback loops are slow, visibility is limited, and the system is inherently fragmented.

What followed was a humbling two-week deep dive, reverse-engineering tax papers, studying IRAS ITA guides, and rebuilding my understanding from first principles. The breakthrough: AI won't replace tax professionals, but it can be a powerful assistant with the right architecture and guidance.

The Journey: 3 Months, 20+ Projects, 8 Final Agents

Phase 1: The Foundation (March 2025)

Project: tax-annihilator-v1

I started with traditional programming - a Python/Flask application with ML-powered expense categorization using scikit-learn. It could process transactions from JSON, CSV, and Excel, apply Singapore tax rules, and generate reports.

# The naive beginning - ML but not LLM
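class TaxCalculator:
    def __init__(self):
        self.income_tax_rate = 0.17  # Singapore corporate tax
        self.partial_exemption_threshold = 10000
        self.partial_exemption_rate = 0.75
        # Second band, taken from the rule comments below
        self.second_band_size = 190000
        self.second_band_rate = 0.50

    def calculate_tax(self, chargeable_income):
        income = max(chargeable_income, 0)
        # First $10,000 at 75% exemption
        first = min(income, self.partial_exemption_threshold)
        # Next $190,000 at 50% exemption
        second = min(income - first, self.second_band_size)
        exempt = first * self.partial_exemption_rate + second * self.second_band_rate
        # Complex but deterministic rules
        return (income - exempt) * self.income_tax_rate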
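
To make the two exemption bands concrete, here is a minimal worked sketch. It assumes only what the comments above state (75% exemption on the first $10,000, 50% on the next $190,000, then the flat 17% rate). The function name `partial_exemption_tax` is hypothetical, and rebates and other schemes are ignored.

```python
def partial_exemption_tax(chargeable_income: float) -> float:
    """Tax payable under the two partial-exemption bands described above."""
    rate = 0.17
    income = max(chargeable_income, 0)
    first_band = min(income, 10_000)
    second_band = min(max(income - 10_000, 0), 190_000)
    exempt = 0.75 * first_band + 0.50 * second_band
    return round((income - exempt) * rate, 2)


# Worked example: $250,000 of chargeable income
#   exempt  = 0.75 * 10,000 + 0.50 * 190,000 = 102,500
#   taxable = 250,000 - 102,500              = 147,500
#   tax     = 147,500 * 0.17                 = 25,075
print(partial_exemption_tax(250_000))
```

This is the "deterministic" half of the problem. The hard part, as the rest of the post shows, is deciding what counts as chargeable income in the first place.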

Key Learning: Rule-based systems work but don't scale. Every edge case requires new code.

The Research Phase: I built python-dl-iras-etax-guides to download every IRAS guide, realizing I needed deep domain knowledge to make this work.

Early Business Development: Created tax-linkedin-email - a sophisticated automation tool that scraped LinkedIn, researched companies via Perplexity API, and generated personalized outreach. This wasn't just about building tech; it was about understanding the market.

Phase 2: The LLM Awakening (April 2025)

Project: tax-annihilator-v2

The first real LLM integration. I built a modular system with swappable engines - rule-based vs LLM-based - to benchmark approaches.

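# First LLM integration - the excitement was real
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class LLMExpenseEngine:
    def tag_expense(self, transaction):
        prompt = f"""
        Analyze this Singapore business expense:
        Vendor: {transaction.vendor}
        Amount: ${transaction.amount}
        Description: {transaction.description}
        
        Determine tax treatment under Singapore tax law.
        """
        
        # openai>=1.0 client call (the older openai.ChatCompletion.create was removed)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        # Return the model's free-text answer
        return response.choices[0].message.content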

Security Revelation: Financial data + LLMs = privacy concerns. Built privacy measures and local processing options.

The UI Challenge: Created multiple frontend versions:

taxtagger-frontend - React showcase for demos
animejs-taxtagger - Animated explanations of tax flow (because tax is complex!)

RAG Implementation: langchain-tax introduced vector search over Singapore tax documents:

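# The game-changer - RAG for tax knowledge
# Newer LangChain releases ship these integrations in langchain_community
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# tax_documents: list of LangChain Document objects loaded from the IRAS guides
vectorstore = FAISS.from_documents(tax_documents, embeddings)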

Phase 3: The Agent Revolution (May 2025)

Project: tax-agent-3

This is where things got serious. I abandoned monolithic LLM calls for specialized agents:

# The breakthrough - specialized agents
agents = {
    "orchestrator": OrchestratorAgent(model="gemini-2.0-flash"),
    "expense_classifier": ExpenseClassificationAgent(model="gpt-4o-mini"),
    "capital_allowance": CapitalAllowanceAgent(model="gpt-4o"),
    "evaluator": EvaluatorAgent(model="gpt-4o-mini")
}

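# Each agent had specific prompts and tools
class ExpenseClassificationAgent:
    tools = [CalculatorTool(), KnowledgeRetrievalTool(), LookupTool()]

    def __init__(self, model):
        self.model = model  # e.g. "gpt-4o-mini", as in the agents dict above

    def process(self, transaction):
        # Specialized logic for expense classification
        # Can retrieve specific tax rules when needed
        raise NotImplementedError  # body omitted in this excerpt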

Multi-Pass Validation: Built systems that use different models in sequence, escalating to more powerful (expensive) models only when needed.
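
The escalation idea can be sketched roughly as follows. This is an illustration under assumptions, not the code from the project: a "model" is any callable that returns an answer plus a confidence score, and the names `classify_with_escalation`, `cheap_model`, `strong_model` and the 0.8 threshold are all hypothetical.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def classify_with_escalation(
    task: str, tiers: List[Tuple[str, Model]], min_confidence: float = 0.8
) -> Tuple[str, str]:
    """Try models from cheapest to most expensive; stop at the first confident answer."""
    if not tiers:
        raise ValueError("at least one model tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer, confidence = model(task)
        if confidence >= min_confidence:
            break  # confident enough, no need to pay for a bigger model
    # If no tier was confident, the last (most capable) tier's answer is returned.
    return name, answer


# Stub models standing in for real API calls
def cheap_model(task: str) -> Tuple[str, float]:
    return ("deductible", 0.55 if "vehicle" in task else 0.95)


def strong_model(task: str) -> Tuple[str, float]:
    return ("non-deductible", 0.9)


tiers = [("cheap", cheap_model), ("strong", strong_model)]
print(classify_with_escalation("office stationery", tiers))  # answered by the cheap tier
print(classify_with_escalation("vehicle lease", tiers))      # escalated to the strong tier
```

How the confidence score is produced (self-reported by the model, or judged by a separate evaluator agent) is left open here.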

Phase 4: The Crown Jewel (June 2025)

Project: tax-compute-mvp - The 8-Agent System

After 20+ iterations, I finally cracked it. Eight specialized agents, each mastering one aspect of Singapore tax law:

# The final architecture - 8 specialized agents
# Between them, the agents cover these aspects of Singapore tax law:
# - Income classification and exemptions
# - Expense breakdown and categorization  
# - Source mapping and deduction rules
# - Special deductions (R&D, training)
# - Capital allowances and depreciation
# - Final tax computation with exemptions
# 
# Combined system accuracy: 90%+
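
A rough sketch of how such agents could be chained, assuming each agent is a function that takes the shared state and returns an updated copy. The stage functions and the retry policy are hypothetical placeholders, not the real agents.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Agent = Callable[[State], State]


def run_pipeline(stages: List[Tuple[str, Agent]], state: State, max_retries: int = 1) -> State:
    """Run agents in order; each receives the state produced by the previous one."""
    for name, agent in stages:
        for attempt in range(max_retries + 1):
            try:
                # Pass a shallow copy so a failed attempt cannot add or replace keys in the state
                state = agent(dict(state))
                break
            except Exception as exc:
                if attempt == max_retries:
                    raise RuntimeError(f"stage '{name}' failed") from exc
    return state


# Two stub stages standing in for LLM-backed agents
def classify_income(state: State) -> State:
    state["taxable_income"] = state["revenue"]
    return state


def break_down_expenses(state: State) -> State:
    state["deductible_expenses"] = sum(state["expenses"])
    return state


result = run_pipeline(
    [("income", classify_income), ("expenses", break_down_expenses)],
    {"revenue": 500_000, "expenses": [120_000, 30_000]},
)
print(result["taxable_income"], result["deductible_expenses"])
```

Keeping every stage behind the same signature is what makes it cheap to swap a model, or replace an agent with plain code, without touching the rest of the chain.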

The Results:

Over 90% accuracy matching tax professionals
Under $1 cost vs hundreds for manual process (orders of magnitude cheaper)
Minutes vs hours of processing time (dramatically faster)

Technical Innovations That Made It Work

1. Schema-First Architecture

The biggest breakthrough: validating every LLM output against a schema and automatically repairing the most common mistakes.

// Every agent output is validated and auto-corrected
import { z } from "zod";

const schema = z.object({
  items: z.array(z.object({
    id: z.string().uuid().default(() => crypto.randomUUID()),  // Generated if missing
    amount: z.coerce.number(),  // Coerces numeric strings to numbers
    description: z.string()
  })),
  total: z.number()
});

// Automatic fixes for common LLM errors, applied before validation
// (`output` is the parsed LLM response)
for (const item of output.items) {
  if (item.id === "FAKE-ID-123") {
    item.id = crypto.randomUUID();
  }
}

2. Arithmetic Post-Processing

LLMs are unreliable at arithmetic, so we recompute every total in code after the model responds:

// LLM identifies items correctly but fails addition
// Before: { items: [100, 200, 300], total: 500 }  ❌
// After:  { items: [100, 200, 300], total: 600 }  ✅

function fixArithmetic(output) {
  const calculatedTotal = output.items.reduce((sum, item) =>
    sum + item.amount, 0
  );
  // Compare with a small tolerance so float rounding noise is not "fixed"
  if (Math.abs(output.total - calculatedTotal) > 0.005) {
    console.log(`🧮 Fixed: ${output.total} → ${calculatedTotal}`);
    output.total = calculatedTotal;
  }
  return output;
}
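
Since these are money amounts, a variant worth considering is exact decimal arithmetic. The sketch below is a hypothetical Python take on the same post-processing step (the name `fix_total` is mine), using the standard `decimal` module so that sums like 0.1 + 0.2 compare cleanly.

```python
from decimal import Decimal
from typing import Any, Dict


def fix_total(output: Dict[str, Any]) -> Dict[str, Any]:
    """Recompute 'total' from item amounts using exact decimal arithmetic."""
    # Going through str() avoids binary floating-point artifacts (0.1 + 0.2 != 0.3)
    computed = sum(
        (Decimal(str(item["amount"])) for item in output["items"]), Decimal("0")
    )
    if Decimal(str(output["total"])) != computed:
        # Return a corrected copy rather than mutating the caller's dict
        output = {**output, "total": float(computed)}
    return output


fixed = fix_total({"items": [{"amount": 0.1}, {"amount": 0.2}], "total": 0.5})
print(fixed["total"])
```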

3. Hybrid Model Strategy

Not all tasks need expensive models:

// Different tasks require different model capabilities
// Complex reasoning tasks use more powerful models
// Pattern matching and basic calculations use efficient models
// Result: high accuracy at a fraction of the cost
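
One simple way to express this routing is a lookup table from task kind to model. The sketch below is illustrative only: `TaskKind` and `select_model` are hypothetical names, the model names are ones mentioned elsewhere in this post, and sending calculations to plain code instead of an LLM is my assumption, based on the arithmetic post-processing section above.

```python
from enum import Enum
from typing import Dict, Optional


class TaskKind(Enum):
    COMPLEX_REASONING = "complex_reasoning"
    PATTERN_MATCHING = "pattern_matching"
    CALCULATION = "calculation"


# Illustrative mapping, not the production configuration
MODEL_FOR_TASK: Dict[TaskKind, Optional[str]] = {
    TaskKind.COMPLEX_REASONING: "gpt-4o",
    TaskKind.PATTERN_MATCHING: "gpt-4o-mini",
    TaskKind.CALCULATION: None,  # handled in code, no LLM call
}


def select_model(kind: TaskKind) -> Optional[str]:
    """Return the model name for a task kind, or None when plain code should handle it."""
    return MODEL_FOR_TASK[kind]


print(select_model(TaskKind.PATTERN_MATCHING))
```

Keeping the mapping as data means a model swap is a one-line change that can be benchmarked in isolation.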

4. YAML-Based Prompt Management

Tax rules as data, not code:

# Version-controlled tax logic
critical_rules:
  - rule: "S-PLATED VEHICLES ALWAYS NON-DEDUCTIBLE"
    section: "s15(1)(o)"
    examples:
      - "SBA1234A - private car"
      - "Any S-plate registration"
    test_cases: ["uber_car_expense", "grab_vehicle_lease"]
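
Once parsed, rules like these can be rendered straight into an agent's prompt. The sketch below assumes the YAML has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`, which is not used here to keep the block dependency-free). The `render_rules` helper and the prompt layout are hypothetical.

```python
from typing import Any, Dict

# Roughly what a YAML parser would return for the snippet above
config: Dict[str, Any] = {
    "critical_rules": [
        {
            "rule": "S-PLATED VEHICLES ALWAYS NON-DEDUCTIBLE",
            "section": "s15(1)(o)",
            "examples": ["SBA1234A - private car", "Any S-plate registration"],
            "test_cases": ["uber_car_expense", "grab_vehicle_lease"],
        }
    ]
}


def render_rules(config: Dict[str, Any]) -> str:
    """Format the critical rules as a section of an agent prompt."""
    lines = ["CRITICAL RULES:"]
    for rule in config["critical_rules"]:
        lines.append(f"- {rule['rule']} ({rule['section']})")
        lines.extend(f"    e.g. {example}" for example in rule.get("examples", []))
    return "\n".join(lines)


print(render_rules(config))
```

The `test_cases` field is deliberately left out of the prompt. It is there so each rule can be tied to regression tests when the wording changes.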

5. RAG with Retrieval Gating

Not every query needs document retrieval:

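def should_retrieve(query):
    q = query.lower()  # normalize once so every check is case-insensitive

    # Simple queries don't need RAG
    if "calculate total" in q:
        return False

    # Complex tax rules need documentation
    if any(term in q for term in ["s15", "capital allowance", "exemption"]):
        return True

    # Default: skip retrieval when nothing signals a need for it
    return False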

Business Model Evolution

The Market Journey

Started (March): "We'll help SMEs with tax filing!"

Reality: SMEs already pay an accountant every year, trust that accountant, and aren't interested in switching

Pivot 1 (April): "We'll be cheaper than accountants!"

Reality: Trust > Price for tax matters

Pivot 2 (May): "We'll target growing companies!"

Reality: The mythical "middle market" doesn't exist

Final Position (June): "AI Tax Brain for Modern Businesses"

Freemium for small businesses
Affordable monthly subscriptions for growing companies
Enterprise deals for large corporations
Channel partnerships with accountants

The Numbers That Matter

# Key Performance Metrics:
  Accuracy: 90%+
  Cost_Reduction: 99%  
  Time_Savings: 98%
  

Key Lessons for AI Builders

1. Domain Expertise Is Irreplaceable

My brother's tax knowledge was the secret weapon. AI amplifies expertise; it doesn't replace it.

2. Evolution, Not Revolution

March: Basic ML categorization
April: First LLM integration
May: Multi-agent architecture
June: Production-ready system

Each iteration built on previous learnings.

3. Schema-First for Production

// This saved the project
const validateOutput = (output: unknown): ValidatedOutput => {
  return outputSchema.parse(output);  // Throws if invalid
};

4. Hybrid Models Are The Future

Don't use a sledgehammer for every nail:

Complex reasoning: Expensive models (o3-mini)
Pattern matching: Cheap models (gpt-4o-mini)
Calculations: Post-processing (not LLM)

5. The Business Model Takes More Iteration Than The Tech

4 different brandings (including Deductly → TaxEase → TaxTag)
3 pricing models
2 market segments explored
1 final positioning that worked

6. Building In Public Accelerates Learning

20+ repositories in 3 months seems chaotic, but each was a learning experiment. Fast iteration beats perfect planning.

Technical Assets Created (All Reusable)

# 1. Schema Validation System
class SchemaValidator:
    """Force LLMs to output valid, consistent JSON"""
    
# 2. Arithmetic Post-Processor  
class MathFixer:
    """Automatically fix LLM calculation errors"""
    
# 3. Multi-Agent Pipeline
class AgentPipeline:
    """Orchestrate specialized agents with error recovery"""
    
# 4. Hybrid Model Router
class ModelSelector:
    """Choose optimal model based on task complexity"""
    
# 5. YAML Prompt Manager
class PromptVersionControl:
    """Manage prompts as configuration, not code"""
    
# 6. RAG with Gating
class SmartRetrieval:
    """Retrieve documents only when necessary"""

What's Next?

The tax compliance journey provided invaluable lessons, but the real insight is bigger: production AI systems require specialized architectures, not just API calls to LLMs.

The 8-agent system we built for tax can be adapted to any complex domain:

Legal document analysis
Medical diagnosis assistance
Financial planning
Compliance automation

The key is understanding that AI agents should be specialists, not generalists. Just like my brother specializes in tax, each AI agent should master one thing exceptionally well.

The Real Success Metric

Not the high accuracy. Not the minimal cost. Not even the dramatic speed improvement.

The real success? My brother now uses the system daily. The smartest tax person I know trusts AI to help with his work. That's when I knew we'd built something real.

Building AI for complex domains? Let's connect. The journey from "AI can't do this" to "AI does this better than humans" is shorter than you think - with the right architecture.

Appendix: The Full Project Timeline

March 2025:
- tax-annihilator-v1: ML-based categorization
- python-dl-iras-etax-guides: Domain research
- tax-linkedin-email: Market validation

April 2025:
- tax-annihilator-v2: First LLM integration
- tax-adjustment-analyzer: P&L analysis
- taxtagger-frontend: Demo UI
- animejs-taxtagger: Visual explanations
- langchain-tax: RAG implementation

May 2025:
- tax-adjustment-analyzer-v2: Multi-pass validation
- tax-agent-3: Agent architecture
- taxtagger-mcp: Integration attempts

June 2025:
- tax-agent-sdk: Productization
- sg-corp-tax: Specialization
- tax-agent-v4-js: JavaScript port
- tax-ai-with-fe: Full-stack app
- tax-tagger-fe-10jun: Final UI
- tax-compute-mvp: The 8-agent system

20+ projects. 3 months. 1 working system. Countless lessons learned.