Normalized for Mintlify from
knowledge-base/aiconnected-business-platform/ai-services-business-directory.mdx.AI Services Business Directory
Developer Product Requirements Document (PRD)
Version: 1.0Last Updated: October 2, 2025
Target Audience: Junior Developer (Beginner Level)
Project Goal: Build an automated system to collect, validate, and maintain 50,000 AI services businesses across the United States
Table of Contents
- Project Overview
- System Architecture
- Technical Stack
- Database Design
- n8n Workflow Implementation
- API Integration Details
- Data Quality & Validation
- Automatic Update System
- Implementation Roadmap
- Testing Procedures
- Deployment Guide
- Troubleshooting
- Glossary
1. Project Overview
1.1 What We’re Building
We’re creating an automated business directory containing 50,000 companies that provide AI services (artificial intelligence consulting, machine learning development, chatbot creation, etc.) across the United States. Key Features:- Organized by State → County → City hierarchy
- Top 100 highest-income cities per state (5,000 cities total)
- Average 10 businesses per city
- Automatically collects business information from APIs
- Validates data quality (checks emails, phones, addresses)
- Updates information automatically every 90 days
1.2 Success Criteria
- Quantity: 50,000 verified AI services businesses
- Quality Standards:
- 95%+ valid email addresses
- 90%+ valid phone numbers
- 95%+ accurate addresses
- 80%+ overall data completeness
- Organization: Properly categorized by location hierarchy
- Maintenance: Automatic quarterly updates
1.3 Timeline
- Week 1: Setup and configuration
- Week 2-3: Pilot test (1,000 businesses)
- Week 4-6: Full production (49,000 remaining businesses)
- Week 7: Quality assurance
- Week 8: Finalization and documentation
- Ongoing: Automatic updates every 90 days
2. System Architecture
2.1 High-Level Overview
Think of our system like a factory assembly line:- Raw Materials (Input): City names and locations
- Machines (n8n Workflows): Automated processes that collect data
- Quality Control (Validation): Check that data is correct
- Storage (Database): Keep all the verified information
- Maintenance (Auto-Update): Refresh old information regularly
2.2 Component Diagram
2.3 Why This Architecture?
Orchestrator Pattern Benefits:- Parallel Processing: Multiple workers run simultaneously (faster)
- Resilience: If one worker fails, others continue
- Memory Efficient: Process small batches instead of all 50,000 at once
- Resumable: Can restart from failure point, not from beginning
3. Technical Stack
3.1 Core Technologies
| Component | Technology | Cost | Why We Choose It |
|---|---|---|---|
| Automation Platform | n8n (self-hosted) | $10/month VPS | Visual workflow builder, 400+ integrations, no code required |
| Database | PostgreSQL | $20-50/month | Handles millions of records easily, free and open-source |
| Primary Data Source | Apollo.io API | $79/month + credits | Best AI/tech company coverage, 275M contacts |
| Geographic Data | SimpleMaps | $199 one-time | Pre-calculated income rankings, saves weeks of work |
| Email Validation | ZeroBounce API | ~$200 for 50k | Ensures emails are deliverable |
| Phone Validation | Twilio Lookup API | ~$250 for 50k | Verifies phone numbers are real |
| Hosting | DigitalOcean Droplet | $10/month | Simple VPS for n8n |
Monthly Cost After Collection: ~$30/month (database + hosting)
3.2 Free Alternatives (If Budget Constraints)
- Database: SQLite (free, but slower)
- Email Validation: Hunter.io free tier (50 checks/month)
- Geographic Data: US Census API (free, but requires more processing)
- Hosting: Oracle Cloud Free Tier (limited resources)
3.3 Development Tools
- Code Editor: VS Code (free)
- API Testing: Postman (free tier)
- Database Management: DBeaver (free)
- Version Control: Git + GitHub (free)
4. Database Design
4.1 Understanding Databases (Beginner Explanation)
A database is like a giant Excel spreadsheet that computers can read extremely fast. Instead of one giant table, we organize data into multiple related tables. Key Concepts:- Table: Like one sheet in Excel
- Row: One record (e.g., one business)
- Column: One piece of information (e.g., business name)
- Primary Key: Unique ID for each row (like a social security number)
- Foreign Key: Links to another table (like a reference)
4.2 Database Schema
We’ll create 4 main tables:Table 1: states
Stores information about all 50 US states.
Table 2: cities
Stores the top 100 cities per state (5,000 cities total).
Table 3: businesses
Stores all business information (our main table with 50,000 records).
Table 4: collection_logs
Tracks our progress and errors.
4.3 Setting Up the Database
Step-by-Step Instructions:- Install PostgreSQL:
- Create Database:
- Run Schema Creation:
- Verify Setup:
5. n8n Workflow Implementation
5.1 Understanding n8n (Beginner Explanation)
n8n is a visual automation tool where you connect “nodes” (boxes) together to create workflows. Think of it like Lego blocks:- Each block (node) does one specific task
- You connect blocks in sequence
- Data flows from one block to the next
- No programming required (mostly)
5.2 Installing n8n
Option A: Docker (Recommended for Beginners)5.3 Workflow Architecture
We’ll build 3 main workflows:- Orchestrator Workflow (Main controller)
- Data Collection Workflow (Worker)
- Validation Workflow (Quality checker)
- Update Workflow (Automatic refresh)
5.4 WORKFLOW #1: Orchestrator
Purpose: Controls the entire collection process, manages batches, tracks progress. Nodes in Order:Node 1: Schedule Trigger
Node 2: PostgreSQL - Get Cities
Node 3: Split in Batches
Node 4: Loop Through Each City
Node 5: HTTP Request - Trigger Worker
Node 6: Wait Between Batches
Node 7: Log Progress
5.5 WORKFLOW #2: Data Collection Worker
Purpose: Collects business data for one city from Apollo.io API. Nodes in Order:Node 1: Webhook Trigger
Node 2: Apollo.io API Search
Node 3: Loop Through Results
Node 4: Transform Data
Node 5: Generate Record Hash
Node 6: Check for Duplicates
Node 7: IF Duplicate Exists
Node 8: Email Validation
Node 9: Phone Validation
Node 10: Calculate Completeness Score
Node 11: Insert to Database
Node 12: Respond to Orchestrator
5.6 WORKFLOW #3: Validation & Quality Check
Purpose: Performs deeper validation on collected data. Trigger: Runs once per day after collection. Nodes:5.7 WORKFLOW #4: Automatic Update System
Purpose: Refreshes business data every 90 days to keep directory current. Nodes:6. API Integration Details
6.1 Apollo.io Setup
Step 1: Create Account- Visit https://www.apollo.io/
- Sign up for Professional plan ($79/month)
- Navigate to Settings → API
- Generate API key (keep this secret!)
- Free Tier: 50 searches/month
- Professional: Unlimited searches, 12,000 email credits/year
- Best Practice: Add 500ms delay between requests
6.2 SimpleMaps Setup
Step 1: Purchase Database- Visit https://simplemaps.com/data/us-cities
- Purchase “Comprehensive” version ($199)
- Download CSV file
6.3 Validation APIs
ZeroBounce (Email Validation)
Setup:- Create account at https://www.zerobounce.net/
- Purchase credits ($16 per 1,000 validations)
- Get API key from dashboard
Twilio Lookup (Phone Validation)
Setup:- Create account at https://www.twilio.com/
- Purchase credits ($0.005 per lookup)
- Get Account SID and Auth Token
7. Data Quality & Validation
7.1 Multi-Level Deduplication Strategy
Level 1: Exact Match (100% confidence)7.2 Data Completeness Scoring
Formula:7.3 Automated Quality Reports
SQL Query for Daily Quality Report:8. Automatic Update System
8.1 Update Strategy
Principle: Keep data fresh without excessive API costs. Update Schedule:- Critical fields (email, phone, website): Every 90 days
- Nice-to-have fields (funding, employees): Every 180 days
- Static fields (founded_year): Never update
8.2 Update Workflow Logic
8.3 Detecting Closed Businesses
8.4 Incremental Update vs Full Refresh
Incremental (Recommended):- Update ~700 businesses per day
- Spreads API costs over time
- Completes full refresh in ~72 days
- Less disruptive to live directory
- Update all 50,000 in 1-2 weeks
- High API costs
- Risk of hitting rate limits
- Use only when data quality degrades severely
9. Implementation Roadmap
Week 1: Setup & Configuration
Day 1-2: Infrastructure Setup- [ ] Provision DigitalOcean Droplet ($10/month)
- [ ] Install Docker and n8n
- [ ] Install PostgreSQL
- [ ] Set up database backups (automated daily)
- [ ] Configure firewall rules
- [ ] Create database schema (run all CREATE TABLE commands)
- [ ] Import SimpleMaps data
- [ ] Calculate city rankings
- [ ] Verify data integrity (check row counts)
- [ ] Create database indexes
- [ ] Create Apollo.io account, verify API access
- [ ] Create ZeroBounce account, purchase credits
- [ ] Create Twilio account, purchase credits
- [ ] Test each API with Postman
- [ ] Document API keys in secure location (password manager)
- [ ] Create simple n8n workflow: Get 1 city → Call Apollo → Save to DB
- [ ] Test with San Francisco (should return ~10-20 businesses)
- [ ] Verify data saves correctly to PostgreSQL
- [ ] Check for errors in execution logs
Weeks 2-3: Pilot Testing
Week 2: Build Core Workflows- [ ] Build Orchestrator Workflow (Days 1-2)
- [ ] Build Data Collection Worker (Days 3-4)
- [ ] Build Validation Workflow (Day 5)
- [ ] Connect workflows via webhooks (Day 6-7)
- [ ] Select 100 cities for pilot (top 2 per state)
- [ ] Run collection workflow (Days 1-3)
- [ ] Expected: ~10 businesses × 100 cities = 1,000 records
- [ ] Monitor execution logs for errors
- [ ] Run validation workflow (Day 4)
- [ ] Analyze results (Day 5):
- Email deliverability rate
- Phone validity rate
- Average completeness score
- Data quality tier distribution
- [ ] Calculate actual costs (Day 6)
- [ ] Optimize workflow based on findings (Day 7)
- ✓ At least 800 valid businesses collected
- ✓ >90% email deliverability
- ✓ >85% phone validity
- ✓ Average completeness >75%
- ✓ Actual cost <$100 for 1,000 businesses
Weeks 4-6: Full Production
Strategy: Collect 10,000 businesses per week Daily Routine:- Morning (9 AM): Check overnight execution logs
- Review quality metrics from previous day
- Address any errors or failures
- Monitor API credit usage
- Run validation on yesterday’s new records
- Expected: ~16,500 businesses
- Apollo searches: ~1,650
- Monitor rate limiting
- Expected: ~16,500 businesses
- Check running costs vs budget
- Adjust batch sizes if needed
- Expected: ~16,500 businesses
- Total should reach ~49,500
- Prepare for final validation
Week 7: Quality Assurance
Day 1: Automated Validation- [ ] Run comprehensive validation workflow on all 50,000 records
- [ ] Re-verify emails (sample 1,000)
- [ ] Re-verify phones (sample 1,000)
- [ ] Update quality scores
- [ ] Randomly select 200 businesses across states
- [ ] Visit websites to verify businesses exist
- [ ] Check if AI services are actually offered
- [ ] Document findings
- [ ] Identify cities with <10 businesses
- [ ] Run supplemental searches for low-count cities
- [ ] Consider alternative data sources for gaps
- [ ] Fill missing LinkedIn URLs (sample)
- [ ] Add missing descriptions
- [ ] Improve categorization of AI services
- [ ] Present quality report
- [ ] Document known limitations
- [ ] Get approval to proceed
Week 8: Finalization
Day 1: Final Deduplication- [ ] Run aggressive deduplication
- [ ] Review flagged duplicates manually
- [ ] Merge or remove duplicates
- [ ] Export to CSV for backup
- [ ] Export to JSON for API
- [ ] Create state-specific exports
- [ ] Build automatic update workflow
- [ ] Schedule weekly execution
- [ ] Test update on 100 businesses
- [ ] Document all workflows
- [ ] Create operations manual
- [ ] Write troubleshooting guide
- [ ] Move to production environment (if applicable)
- [ ] Set up monitoring alerts
- [ ] Configure backup schedule
- [ ] Launch! 🚀
10. Testing Procedures
10.1 Unit Tests (Test Individual Nodes)
Test 1: Apollo API Connection10.2 Integration Tests (Test Full Workflows)
Test 1: End-to-End Collection10.3 Load Testing
Test 1: Batch Processing10.4 Data Quality Tests
Test 1: Completeness Distribution11. Deployment Guide
11.1 Production Environment Setup
Server Requirements:- CPU: 2 cores minimum (4 recommended)
- RAM: 4GB minimum (8GB recommended)
- Storage: 50GB SSD
- Network: 100Mbps
- OS: Ubuntu 22.04 LTS
11.2 SSL Certificate (HTTPS)
11.3 Backup Configuration
11.4 Monitoring Setup
12. Troubleshooting
12.1 Common Issues & Solutions
Issue 1: “API Rate Limit Exceeded”12.2 Debugging Tips
Enable Debug Mode:- In n8n editor, click on a node
- Click “Execute Node” button
- View output in right panel
- Check for errors in JSON data
13. Glossary
API (Application Programming Interface): A way for different software programs to talk to each other. Like a waiter taking your order to the kitchen. Batch Processing: Processing data in groups (batches) instead of one at a time. Faster and more efficient. Database: Organized collection of data stored electronically. Like a giant, super-fast filing cabinet. Deduplication: Removing duplicate (identical or similar) records from a database. Docker: Software that packages applications in “containers” so they run the same everywhere. Foreign Key: A field in a database table that links to the primary key of another table. Fuzzy Matching: Finding records that are similar but not exactly identical (e.g., “IBM Corp” vs “IBM Corporation”). Hash: A unique fingerprint for data. Same input always produces same hash.550e8400-e29b-41d4-a716-446655440000.
VPS (Virtual Private Server): A virtual computer running in the cloud that you can rent.
Webhook: A URL that receives data when an event occurs. Like a mailbox that programs can send messages to.
Workflow: A sequence of automated steps that accomplish a task.
Appendix A: Cost Breakdown (Detailed)
One-Time Setup Costs
| Item | Cost | Notes |
|---|---|---|
| SimpleMaps Database | $199 | One-time purchase |
| Initial Apollo.io Credits | $79 | First month subscription |
| ZeroBounce Credits (50k) | $200 | ~$4 per 1,000 validations |
| Twilio Credits (50k) | $250 | $0.005 per lookup |
| Total One-Time | $728 |
Monthly Recurring Costs (During Collection)
| Item | Cost | Duration |
|---|---|---|
| DigitalOcean Droplet | $24 | 2 months |
| PostgreSQL (managed) | $15 | 2 months |
| Apollo.io Subscription | $79 | 1 month |
| Total Monthly | $118 | |
| Total for 2 Months | $236 |
Monthly Recurring Costs (After Collection)
| Item | Cost |
|---|---|
| DigitalOcean Droplet | $24 |
| PostgreSQL | $15 |
| Total Monthly | $39 |
Total Project Cost
- Setup Phase: 236 = $964
- Annual Maintenance: 39 × 12 \= **468**
- Total First Year: $1,432
Appendix B: Example API Responses
Apollo.io Search Response
Appendix C: Sample SQL Queries
Query 1: Find AI Companies in Top 10 Cities by State
Query 2: Identify Data Gaps
Query 3: Best Quality Businesses
Query 4: Technology Distribution
Final Checklist
Before launching to production, verify:- [ ] All API keys are configured and tested
- [ ] Database schema is created with all indexes
- [ ] SimpleMaps data is imported and ranked
- [ ] Test workflow successfully collects 10 businesses
- [ ] Pilot test of 1,000 businesses meets quality targets
- [ ] Orchestrator workflow handles batches correctly
- [ ] Worker workflows don’t hit rate limits
- [ ] Validation workflows verify data accurately
- [ ] Update workflow runs without errors
- [ ] Backups are automated and tested
- [ ] Monitoring alerts are configured
- [ ] Documentation is complete and clear
- [ ] Junior developer can follow all instructions
End of Product Requirements Document Version 1.0 - October 2, 2025 For questions or clarifications, refer to the troubleshooting section or contact the technical lead.