Voice by aiConnected - Junior Developer PRD
Comprehensive Outline
Purpose: This outline defines a PRD detailed enough that a junior developer with no prior context could build the entire system. Every decision is documented. Nothing is assumed.
PART 1: Foundation & Context
Estimated: 15-20 pages
1. Project Overview
- 1.1 What We’re Building (plain English)
- 1.2 Why We’re Building It (business problem)
- 1.3 Who It’s For (target users)
- 1.4 Success Looks Like (measurable outcomes)
2. Glossary of Terms
- 2.1 Telephony Terms (PSTN, SIP, DTMF, IVR, PBX, etc.)
- 2.2 WebRTC Terms (ICE, STUN, TURN, SDP, etc.)
- 2.3 AI/ML Terms (LLM, STT, TTS, VAD, embeddings, etc.)
- 2.4 Platform Terms (tenant, agency, knowledge base, etc.)
- 2.5 Infrastructure Terms (container, webhook, WebSocket, etc.)
3. Architecture Overview
- 3.1 System Diagram (with explanation of each box)
- 3.2 Data Flow Narrative (step-by-step what happens on a call)
- 3.3 Technology Choices (what we’re using and WHY)
- 3.4 What We’re NOT Building (explicit scope boundaries)
4. Development Environment Setup
- 4.1 Required Accounts & API Keys
- 4.2 Local Development Tools
- 4.3 Repository Structure
- 4.4 Environment Variables Reference
- 4.5 How to Run Locally
PART 2: Database Design
Estimated: 20-25 pages
5. Database Architecture
- 5.1 Why PostgreSQL
- 5.2 Database Naming Conventions
- 5.3 Common Patterns Used (UUIDs, timestamps, soft deletes)
6. Schema: Core Entities
- 6.1 `agencies` table (full DDL + field explanations)
- 6.2 `tenants` table (full DDL + field explanations)
- 6.3 `users` table (full DDL + field explanations)
- 6.4 `user_roles` and `permissions` tables
7. Schema: Telephony Entities
- 7.1 `phone_numbers` table
- 7.2 `calls` table
- 7.3 `call_events` table (state machine history)
- 7.4 `call_transfers` table
8. Schema: AI & Content Entities
- 8.1 `knowledge_bases` table
- 8.2 `knowledge_documents` table
- 8.3 `knowledge_chunks` table (with embeddings)
- 8.4 `transcripts` table
- 8.5 `recordings` table
9. Schema: Configuration Entities
- 9.1 `voice_configurations` table
- 9.2 `agent_personalities` table
- 9.3 `greetings` table
- 9.4 `business_hours` table
10. Schema: Billing & Analytics
- 10.1 `usage_records` table
- 10.2 `billing_events` table
- 10.3 `call_analytics` table
11. Indexes & Performance
- 11.1 Required Indexes (with explanations)
- 11.2 Partitioning Strategy
- 11.3 Query Patterns to Optimize For
12. Migrations
- 12.1 Migration File Naming Convention
- 12.2 Initial Migration Script
- 12.3 How to Add New Migrations
PART 3: API Design
Estimated: 25-30 pages
13. API Architecture
- 13.1 REST vs GraphQL Decision
- 13.2 URL Structure & Naming
- 13.3 Authentication (JWT implementation)
- 13.4 Authorization (RBAC implementation)
- 13.5 Error Response Format
- 13.6 Pagination Standard
- 13.7 Rate Limiting
14. Agency Management APIs
- 14.1 `POST /api/v1/agencies` - Create agency
- 14.2 `GET /api/v1/agencies/{id}` - Get agency
- 14.3 `PUT /api/v1/agencies/{id}` - Update agency
- 14.4 `GET /api/v1/agencies/{id}/tenants` - List tenants
- 14.5 `GET /api/v1/agencies/{id}/usage` - Get usage
15. Tenant Management APIs
- 15.1 `POST /api/v1/tenants` - Create tenant
- 15.2 `GET /api/v1/tenants/{id}` - Get tenant
- 15.3 `PUT /api/v1/tenants/{id}` - Update tenant
- 15.4 `DELETE /api/v1/tenants/{id}` - Deactivate tenant
- 15.5 `GET /api/v1/tenants/{id}/config` - Get configuration
- 15.6 `PUT /api/v1/tenants/{id}/config` - Update configuration
16. Phone Number APIs
- 16.1 `GET /api/v1/phone-numbers/available` - Search available
- 16.2 `POST /api/v1/phone-numbers` - Provision number
- 16.3 `GET /api/v1/tenants/{id}/phone-numbers` - List tenant numbers
- 16.4 `PUT /api/v1/phone-numbers/{id}` - Configure number
- 16.5 `DELETE /api/v1/phone-numbers/{id}` - Release number
17. Call Control APIs
- 17.1 `POST /api/v1/calls/outbound` - Initiate call
- 17.2 `GET /api/v1/calls/{id}` - Get call status
- 17.3 `POST /api/v1/calls/{id}/transfer` - Transfer call
- 17.4 `POST /api/v1/calls/{id}/hold` - Hold call
- 17.5 `POST /api/v1/calls/{id}/resume` - Resume call
- 17.6 `POST /api/v1/calls/{id}/hangup` - End call
- 17.7 `GET /api/v1/tenants/{id}/calls` - List calls
18. Knowledge Base APIs
- 18.1 `POST /api/v1/tenants/{id}/knowledge-base` - Create KB
- 18.2 `GET /api/v1/tenants/{id}/knowledge-base` - Get KB
- 18.3 `POST /api/v1/knowledge-bases/{id}/documents` - Add document
- 18.4 `DELETE /api/v1/knowledge-bases/{id}/documents/{did}` - Remove
- 18.5 `POST /api/v1/knowledge-bases/{id}/query` - Query KB
19. Recording & Transcript APIs
- 19.1 `GET /api/v1/calls/{id}/recording` - Get recording URL
- 19.2 `GET /api/v1/calls/{id}/transcript` - Get transcript
- 19.3 `GET /api/v1/tenants/{id}/recordings` - List recordings
20. Analytics APIs
- 20.1 `GET /api/v1/tenants/{id}/analytics/summary` - Dashboard data
- 20.2 `GET /api/v1/tenants/{id}/analytics/calls` - Call metrics
- 20.3 `GET /api/v1/tenants/{id}/analytics/usage` - Usage metrics
21. Webhook Endpoints (Inbound)
- 21.1 `POST /webhooks/gotoconnect` - Call events
- 21.2 `POST /webhooks/livekit` - Room events
- 21.3 `POST /webhooks/deepgram` - Transcription events
- 21.4 Webhook signature validation
PART 4: GoToConnect Integration
Estimated: 20-25 pages
22. GoToConnect Account Setup
- 22.1 Required Account Type
- 22.2 API Credentials Location
- 22.3 Webhook Configuration Steps
- 22.4 Phone Number Provisioning
23. GoToConnect Authentication
- 23.1 OAuth 2.0 Flow (step-by-step)
- 23.2 Token Storage
- 23.3 Token Refresh Logic
- 23.4 Error Handling
24. Webhook Events from GoToConnect
- 24.1 `call.ringing` - Inbound call arriving
- 24.2 `call.answered` - Call connected
- 24.3 `call.ended` - Call terminated
- 24.4 Event Payload Schemas
- 24.5 Event Processing Logic
25. GoToConnect API Calls
- 25.1 Answer Call
- 25.2 Transfer Call
- 25.3 Hold/Resume
- 25.4 Hangup
- 25.5 Get Call Status
- 25.6 List Lines/Extensions
26. Phone Number Management
- 26.1 Search Available Numbers
- 26.2 Provision Number
- 26.3 Configure Number Routing
- 26.4 Release Number
27. Ooma WebRTC Softphone Integration
- 27.1 What is Ooma Softphone
- 27.2 Why We Need It
- 27.3 Auto-Answer Configuration
- 27.4 Audio Stream Access
PART 5: WebRTC Bridge Service
Estimated: 20-25 pages
28. Bridge Architecture
- 28.1 Purpose of the Bridge
- 28.2 Component Diagram
- 28.3 Threading Model
- 28.4 State Machine
29. Browser Automation Layer
- 29.1 Puppeteer/Playwright Setup
- 29.2 Ooma Login Automation
- 29.3 Session Management
- 29.4 Health Monitoring
- 29.5 Crash Recovery
30. Audio Capture
- 30.1 Capturing Browser Audio
- 30.2 Audio Format (sample rate, channels, encoding)
- 30.3 Buffer Management
- 30.4 Latency Considerations
31. LiveKit Connection
- 31.1 Creating LiveKit Room
- 31.2 Publishing Audio Track
- 31.3 Subscribing to Agent Audio
- 31.4 Track Management
32. Audio Routing
- 32.1 Caller → Agent Flow
- 32.2 Agent → Caller Flow
- 32.3 Mixing (if needed)
- 32.4 Volume Normalization
33. Bridge Lifecycle
- 33.1 Initialization Sequence
- 33.2 Call Setup Sequence
- 33.3 Active Call Management
- 33.4 Call Teardown Sequence
- 33.5 Error Recovery
PART 6: LiveKit Integration
Estimated: 20-25 pages
34. LiveKit Cloud Setup
- 34.1 Account Creation
- 34.2 Project Configuration
- 34.3 API Credentials
- 34.4 Webhook Configuration
35. Room Management
- 35.1 Room Naming Convention
- 35.2 Room Creation Logic
- 35.3 Room Configuration Options
- 35.4 Room Deletion/Cleanup
36. Participant Management
- 36.1 Participant Types (caller, agent, supervisor)
- 36.2 Participant Identity Format
- 36.3 Permissions by Role
- 36.4 Participant Lifecycle
37. Token Generation
- 37.1 JWT Structure
- 37.2 Claims & Grants
- 37.3 Token Service Implementation
- 37.4 Token Refresh Strategy
38. Audio Track Handling
- 38.1 Track Publication
- 38.2 Track Subscription
- 38.3 Track Quality Settings
- 38.4 Mute/Unmute
39. LiveKit Webhooks
- 39.1 Room Started
- 39.2 Room Finished
- 39.3 Participant Joined
- 39.4 Participant Left
- 39.5 Track Published/Unpublished
40. Recording with Egress
- 40.1 Egress Types
- 40.2 Starting Recording
- 40.3 Stopping Recording
- 40.4 Storage Configuration
- 40.5 Recording Retrieval
PART 7: Voice AI Pipeline
Estimated: 25-30 pages
41. Pipeline Architecture
- 41.1 Component Diagram
- 41.2 Data Flow (audio in → text → response → audio out)
- 41.3 Latency Budget Breakdown
- 41.4 Error Handling Strategy
42. Deepgram STT Integration
- 42.1 Account Setup
- 42.2 WebSocket Connection
- 42.3 Audio Streaming Format
- 42.4 Transcription Options (model, language, punctuation)
- 42.5 Handling Interim Results
- 42.6 Handling Final Results
- 42.7 Error Recovery
43. Voice Activity Detection (VAD)
- 43.1 What VAD Does
- 43.2 Silero VAD Setup
- 43.3 Configuration Parameters
- 43.4 Speech Start Detection
- 43.5 Speech End Detection
- 43.6 Barge-In Handling
44. Claude LLM Integration
- 44.1 API Setup
- 44.2 System Prompt Design
- 44.3 Conversation History Management
- 44.4 Streaming Responses
- 44.5 Function Calling (tools)
- 44.6 Token Management
- 44.7 Error Handling
45. Knowledge Base Retrieval (RAG)
- 45.1 When to Query KB
- 45.2 Query Construction
- 45.3 Embedding Generation
- 45.4 Vector Search
- 45.5 Context Injection into Prompt
- 45.6 Citation Handling
46. Chatterbox TTS Integration
- 46.1 RunPod Setup
- 46.2 API Endpoint Configuration
- 46.3 Voice Selection
- 46.4 Text Preprocessing
- 46.5 Audio Generation
- 46.6 Streaming Audio Output
- 46.7 Error Handling
47. Pipeline Orchestration
- 47.1 Turn-Taking Logic
- 47.2 Interruption Handling
- 47.3 Silence Handling
- 47.4 Timeout Handling
- 47.5 Graceful Degradation
PART 8: Agent Service
Estimated: 20-25 pages
48. LiveKit Agents Framework
- 48.1 What is LiveKit Agents
- 48.2 Agent Architecture
- 48.3 Worker Setup
- 48.4 Agent Dispatch
49. Agent Lifecycle
- 49.1 Agent Pool Management
- 49.2 Agent Assignment
- 49.3 Agent State Machine
- 49.4 Agent Cleanup
50. Conversation State
- 50.1 State Structure
- 50.2 State Persistence
- 50.3 State Transitions
- 50.4 State Recovery
51. Intent Handling
- 51.1 Intent Detection Approach
- 51.2 Common Intents
- 51.3 Intent → Action Mapping
- 51.4 Fallback Handling
52. Call Actions
- 52.1 Transfer to Human
- 52.2 Transfer to Another AI
- 52.3 Place on Hold
- 52.4 Schedule Callback
- 52.5 End Call
53. Multi-Tenant Agent Configuration
- 53.1 Loading Tenant Config
- 53.2 Personality Injection
- 53.3 Voice Selection
- 53.4 Knowledge Base Binding
PART 9: Multi-Tenancy & Security
Estimated: 15-20 pages
54. Multi-Tenant Architecture
- 54.1 Tenant Isolation Model
- 54.2 Data Segregation
- 54.3 Resource Quotas
- 54.4 Tenant Context Propagation
55. Authentication
- 55.1 JWT Implementation Details
- 55.2 Token Claims
- 55.3 Token Validation
- 55.4 Session Management
56. Authorization
- 56.1 Role Definitions
- 56.2 Permission Matrix
- 56.3 RBAC Implementation
- 56.4 Resource-Level Permissions
57. Data Security
- 57.1 Encryption at Rest
- 57.2 Encryption in Transit
- 57.3 PII Handling
- 57.4 Data Retention Policies
58. API Security
- 58.1 Rate Limiting Implementation
- 58.2 Input Validation
- 58.3 SQL Injection Prevention
- 58.4 CORS Configuration
59. Audit Logging
- 59.1 What to Log
- 59.2 Log Format
- 59.3 Log Storage
- 59.4 Log Retention
PART 10: Deployment & Operations
Estimated: 20-25 pages
60. Infrastructure Setup
- 60.1 DigitalOcean Configuration
- 60.2 Dokploy Setup
- 60.3 Network Architecture
- 60.4 SSL/TLS Configuration
61. Container Configuration
- 61.1 Dockerfile for Each Service
- 61.2 Docker Compose (local dev)
- 61.3 Resource Limits
- 61.4 Health Checks
62. Environment Configuration
- 62.1 Environment Variables Reference
- 62.2 Secrets Management
- 62.3 Configuration by Environment
63. CI/CD Pipeline
- 63.1 GitHub Actions Setup
- 63.2 Build Process
- 63.3 Test Process
- 63.4 Deploy Process
64. Monitoring
- 64.1 Health Check Endpoints
- 64.2 Metrics Collection
- 64.3 Log Aggregation
- 64.4 Alerting Rules
65. Scaling
- 65.1 Horizontal Scaling Strategy
- 65.2 Auto-Scaling Configuration
- 65.3 Load Balancing
- 65.4 Database Scaling
66. Disaster Recovery
- 66.1 Backup Strategy
- 66.2 Recovery Procedures
- 66.3 Failover Configuration
67. Runbooks
- 67.1 Common Issues & Fixes
- 67.2 Escalation Procedures
- 67.3 Incident Response
68. Cost Management
- 68.1 Cost Breakdown by Component
- 68.2 Cost Monitoring
- 68.3 Optimization Strategies
Summary: 10 Parts
| Part | Sections | Focus Area | Est. Pages |
|---|---|---|---|
| 1 | 1-4 | Foundation & Context | 15-20 |
| 2 | 5-12 | Database Design | 20-25 |
| 3 | 13-21 | API Design | 25-30 |
| 4 | 22-27 | GoToConnect Integration | 20-25 |
| 5 | 28-33 | WebRTC Bridge Service | 20-25 |
| 6 | 34-40 | LiveKit Integration | 20-25 |
| 7 | 41-47 | Voice AI Pipeline | 25-30 |
| 8 | 48-53 | Agent Service | 20-25 |
| 9 | 54-59 | Multi-Tenancy & Security | 15-20 |
| 10 | 60-68 | Deployment & Operations | 20-25 |
Part 1: Foundation & Context
Document Version: 1.0
Last Updated: January 25, 2026
Part: 1 of 10
Sections: 1-4
Audience: Junior developers with no prior context
Section 1: Project Overview
1.1 What We’re Building (Plain English)
Voice by aiConnected is a white-label Voice AI contact center platform. Let’s break down what each of those words means:

White-label: The platform is designed to be rebranded. When Agency X uses our platform to serve their client (a dental office), the dental office never sees “aiConnected” anywhere. They see Agency X’s branding. We’re invisible. We’re the infrastructure behind the scenes.

Voice AI: The core product is an artificial intelligence that talks on the phone. Real phone calls. A human calls a phone number, and an AI answers. The AI can:
- Understand what the human is saying (speech-to-text)
- Figure out what they want (intent recognition)
- Look up information to answer questions (knowledge base)
- Generate appropriate responses (large language model)
- Speak the response out loud (text-to-speech)
- Take actions like transferring to a human, scheduling callbacks, etc.
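The capabilities above chain together into a single turn-taking loop: audio in, text, response, audio out. A minimal sketch of that loop - all function bodies here are hypothetical stubs standing in for the real STT, LLM, and TTS integrations specified later in this PRD:

```python
# Illustrative sketch of one conversational turn. The transcribe/think/speak
# helpers are placeholder stubs, NOT the real Deepgram/Claude/Chatterbox calls.

def transcribe(audio: bytes) -> str:
    """Speech-to-text stub: the real system streams audio to an STT service."""
    return audio.decode("utf-8")  # pretend the audio is already words

def think(text: str, history: list[dict]) -> str:
    """LLM stub: the real system sends history + knowledge-base context to an LLM."""
    history.append({"role": "user", "content": text})
    reply = f"You said: {text}"
    history.append({"role": "assistant", "content": reply})
    return reply

def speak(text: str) -> bytes:
    """Text-to-speech stub: the real system streams text to a TTS service."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    """One turn: caller audio -> transcript -> LLM response -> agent audio."""
    return speak(think(transcribe(audio_in), history))

history: list[dict] = []
audio_out = handle_turn(b"I need to schedule a cleaning", history)
```

The real pipeline streams every stage rather than running them sequentially (see Part 7), but the data flow is exactly this shape.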
The Product in One Sentence
Voice by aiConnected lets marketing agencies offer their clients AI-powered phone systems that answer calls, help customers, and sound completely natural - all under the agency’s own brand.
A Concrete Example
- Oxford Pierpont (an agency) signs up for Voice by aiConnected
- Oxford Pierpont onboards their client, Smile Dental (a dental office)
- We provision a phone number: (555) 123-4567
- Smile Dental advertises this number for appointments
- Sarah (a patient) calls (555) 123-4567
- Our AI answers: “Thank you for calling Smile Dental, this is Dr. Smith’s office. How can I help you today?”
- Sarah says: “I need to schedule a teeth cleaning”
- The AI accesses Smile Dental’s scheduling information and helps Sarah book an appointment
- The entire call is recorded and transcribed
- Oxford Pierpont can see analytics across all their clients
- Smile Dental can see their own call history and transcripts
- Sarah never knows she talked to an AI - it sounded that natural
What Makes This Different
| Traditional IVR | Competitors | Voice by aiConnected |
|---|---|---|
| “Press 1 for sales, press 2 for support…” | AI voice, but single-tenant | AI voice with full multi-tenant architecture |
| Frustrating, limited | No white-label option | Complete white-label - agencies can resell |
| No intelligence | Expensive ($0.15+/min) | Cost-effective (~$0.05-0.08/min to customer) |
| Can’t understand natural speech | Limited customization | Per-tenant knowledge bases, voices, personalities |
1.2 Why We’re Building It (Business Problem)
The Pain Points We’re Solving
For Businesses (End Customers like Smile Dental):
- Phone calls go unanswered. Small businesses miss 40-60% of calls because staff are busy with in-person customers. Each missed call is a missed opportunity - potentially $500+ in lost revenue for a dental office.
- Hiring is expensive and unreliable. A receptionist costs $35,000-50,000/year plus benefits. They call in sick. They quit. They need training. They can only work 8 hours a day.
- After-hours coverage is nearly impossible. Answering services cost $1-3 per call and often provide poor experiences. The business loses customers who call at 7 PM.
- Consistency is a challenge. Human staff have good days and bad days. The customer experience varies wildly.
For Agencies (Our Direct Customers):
- Agencies want to offer AI solutions but can’t build them. They see the opportunity but lack technical expertise.
- Existing solutions don’t allow reselling. Most AI voice products are direct-to-business. Agencies can’t white-label them.
- Agencies need recurring revenue. One-time website builds are feast-or-famine. Voice AI is a monthly subscription model.
Market Timing
Why build this now? Because all the pieces finally exist:
- LLMs are good enough. Claude, GPT-4, and others can now hold genuinely helpful conversations. Two years ago, they couldn’t.
- Speech technology has matured. Deepgram’s Nova-2 model achieves >95% accuracy. Text-to-speech voices (like Chatterbox) are nearly indistinguishable from humans.
- Real-time infrastructure exists. LiveKit provides sub-100ms audio routing. WebRTC is battle-tested.
- Costs have plummeted. What would have cost ~$1/minute in 2022 now costs ~$0.025/minute.
- Businesses are actively seeking automation. Post-pandemic labor shortages have made every business owner aware of the need to automate.
1.3 Who It’s For (Target Users)
Primary Users: Agencies
Profile:
- Marketing agencies with 10-100 clients
- Digital agencies expanding into AI services
- Call center operators looking to add AI options
- Managed service providers (MSPs)
What they need:
- Zero technical expertise required to deploy
- Ability to brand as their own
- Management dashboard for all their clients
- Competitive pricing to mark up and profit
Example persona:
- Runs a 5-person marketing agency
- Has 30 small business clients
- Offers websites, SEO, social media
- Wants to add “AI services” to his offerings
- Needs to be able to set up a new client in under an hour
- Wants to charge clients $300-500/month for the service
Secondary Users: Tenants (Agency’s Clients)
Profile:
- Small to medium businesses
- Service-based businesses (dental, legal, HVAC, etc.)
- High call volume but can’t staff phones adequately
- Value customer experience
What they need:
- Calls answered professionally 24/7
- Accurate information about their business
- Easy access to call recordings and transcripts
- Simple setup (they’re not technical)
Example persona:
- 3-dentist practice
- Receives 50-100 calls/day
- Front desk staff overwhelmed
- Misses 30% of calls
- Loses an estimated $10,000/month in missed appointments
- Willing to pay $400/month to never miss a call again
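The persona’s numbers imply a straightforward value calculation. A back-of-envelope sketch using only the figures listed above (the midpoint call volume is our assumption):

```python
# Back-of-envelope ROI for the dental-practice persona, using the figures above.
calls_per_day = 75             # assumed midpoint of the 50-100 calls/day range
missed_rate = 0.30             # misses 30% of calls
lost_revenue_monthly = 10_000  # estimated missed-appointment revenue ($/month)
service_cost_monthly = 400     # what the tenant pays for the service ($/month)

missed_calls_monthly = calls_per_day * 30 * missed_rate
revenue_per_missed_call = lost_revenue_monthly / missed_calls_monthly
roi_multiple = lost_revenue_monthly / service_cost_monthly

print(f"~{missed_calls_monthly:.0f} missed calls/month, "
      f"~${revenue_per_missed_call:.2f} lost per missed call, "
      f"{roi_multiple:.0f}x potential return on the subscription")
```

Even if the AI recovers only a fraction of those missed calls, the $400/month price clears easily - which is why this persona converts.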
Tertiary Users: Platform Admins (Us - aiConnected)
What we need:
- Visibility into all agencies and tenants
- Ability to manage billing
- System health monitoring
- Support access when agencies need help
1.4 Success Looks Like (Measurable Outcomes)
Technical Success Metrics
| Metric | Target | How We’ll Measure |
|---|---|---|
| Call Answer Rate | 99.9% | Calls answered / calls received |
| First Response Latency | <2 seconds | Time from call connect to AI speaking |
| Response Latency | <1000ms | Time from human stops speaking to AI starts |
| Speech Recognition Accuracy | >95% | Deepgram reported confidence scores |
| Call Completion Rate | >85% | Calls that end normally vs. dropped/failed |
| System Uptime | 99.9% | Total uptime / total time |
| Concurrent Call Capacity | 100/tenant | Load tested maximum |
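The <1000ms response-latency target only holds if every pipeline stage stays inside a budget. A sketch of the kind of budget check Part 7 formalizes - the per-stage numbers below are illustrative assumptions, not measured values:

```python
# Illustrative latency budget for the <1000ms response-latency target.
# Stage estimates are ASSUMPTIONS for illustration; Part 7 defines the real budget.
BUDGET_MS = 1000

stage_estimates_ms = {
    "vad_end_of_speech": 200,   # silence needed to decide the caller stopped
    "stt_final_result": 150,    # STT finalizing the last words
    "llm_first_token": 350,     # LLM time-to-first-token
    "tts_first_audio": 200,     # TTS time-to-first-audio-chunk
    "network_overhead": 50,     # WebSocket/WebRTC hops between services
}

total_ms = sum(stage_estimates_ms.values())
headroom_ms = BUDGET_MS - total_ms
print(f"total {total_ms}ms, headroom {headroom_ms}ms")
assert total_ms <= BUDGET_MS, "pipeline blows the latency budget"
```

The point of writing the budget down per stage: when measured latency exceeds 1000ms, you can attribute the overrun to a specific component instead of guessing.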
Business Success Metrics (Year 1)
| Metric | Target | Notes |
|---|---|---|
| Agency Partners | 25 | Paying agencies |
| Total Tenants | 250 | Across all agencies |
| Monthly Call Minutes | 500,000 | Billable minutes |
| Monthly Recurring Revenue | $50,000 | From agency subscriptions |
| Gross Margin | >60% | Revenue minus direct costs |
| Net Promoter Score | >40 | Customer satisfaction |
| Churn Rate | <5%/month | Agencies leaving |
What “Done” Looks Like for MVP
The MVP is complete when:
- ✅ An agency can sign up and create their account
- ✅ The agency can create a tenant (their client)
- ✅ A phone number can be provisioned for the tenant
- ✅ The tenant can upload documents to create a knowledge base
- ✅ Inbound calls to that number are answered by AI
- ✅ The AI can answer questions using the knowledge base
- ✅ The AI can transfer calls to a human number
- ✅ All calls are recorded and transcribed
- ✅ The agency can view calls across all tenants
- ✅ The tenant can view their own calls
- ✅ Response latency is consistently under 1 second
- ✅ The system handles 10 concurrent calls without degradation
Section 2: Glossary of Terms
This glossary exists so you never have to Google a term. Every technical word used in this document is defined here. Read this section once, then use it as a reference.
2.1 Telephony Terms
PSTN (Public Switched Telephone Network)
The traditional phone system. When you pick up a landline or make a cell phone call, you’re using PSTN. It’s the global network of telephone lines, fiber optic cables, switching centers, and cellular networks that allow any phone to call any other phone. Why it matters: Our AI needs to receive calls from PSTN. Regular people dial regular phone numbers. We need to bridge PSTN to our internet-based AI system.
VoIP (Voice over Internet Protocol)
Phone calls transmitted over the internet instead of traditional phone lines. Skype, Zoom, and WhatsApp calls are VoIP. The audio is converted to data packets and sent over the internet. Why it matters: Once we receive a call from PSTN, we convert it to VoIP to route through our system.
SIP (Session Initiation Protocol)
A signaling protocol for starting, maintaining, and ending VoIP calls. SIP handles the “who’s calling whom” and “call is ending” messages - but not the actual audio. Why it matters: GoToConnect and many telephony systems use SIP. Understanding SIP helps debug call connection issues.
WebRTC (Web Real-Time Communication)
A technology that enables real-time audio/video communication directly in web browsers. Unlike SIP, WebRTC is designed for the modern web and handles both signaling and media. Why it matters: Our WebRTC bridge converts between the telephony world (SIP/PSTN) and the AI world (LiveKit). WebRTC is how audio gets from the phone call into our processing pipeline.
DTMF (Dual-Tone Multi-Frequency)
The tones generated when you press buttons on a phone keypad. Each button produces a unique combination of two frequencies. “Press 1 for sales” systems use DTMF. Why it matters: Some callers may try to press buttons to navigate. Our system needs to detect and handle DTMF input appropriately.
IVR (Interactive Voice Response)
Those automated phone systems that say “Press 1 for sales, press 2 for support.” Traditional IVRs are frustrating and limited because they can’t understand natural speech. Why it matters: We’re replacing IVR with conversational AI. Understanding IVR helps explain our value proposition.
PBX (Private Branch Exchange)
A private telephone network within an organization. Think of the phone system inside a corporate office where everyone has extensions. Why it matters: GoToConnect provides cloud PBX functionality. We integrate with their system.
Trunk / SIP Trunk
A connection between phone systems. A SIP trunk is a virtual connection that allows VoIP calls to flow between two systems over the internet. Why it matters: Telephony providers charge based on trunks and concurrent call capacity.
DID (Direct Inward Dialing)
A phone number that routes directly to a specific endpoint without requiring the caller to dial an extension. When you call a business’s main number, that’s a DID. Why it matters: Each tenant gets one or more DIDs. These are the phone numbers customers actually call.
ANI (Automatic Number Identification)
The caller’s phone number, transmitted with the call. This is how caller ID works. Why it matters: We capture ANI to identify repeat callers and log call metadata.
CDR (Call Detail Record)
A record of a phone call containing metadata: who called, who answered, when, how long, etc. Every call generates a CDR. Why it matters: CDRs are essential for billing, analytics, and compliance.
E.164
The international standard format for phone numbers: +[country code][number]. Example: +15551234567 for a US number. Why it matters: We store all phone numbers in E.164 format for consistency. Always convert to E.164 before storing or comparing.
2.2 WebRTC Terms
ICE (Interactive Connectivity Establishment)
A framework for establishing peer-to-peer connections through NATs and firewalls. ICE tries multiple connection methods and picks the best one that works. Why it matters: WebRTC connections can be tricky because of network configurations. ICE handles the complexity of actually connecting two endpoints.
STUN (Session Traversal Utilities for NAT)
A protocol that helps a client discover its public IP address and what type of NAT (network address translation) is between it and the public internet. Why it matters: STUN servers help establish direct connections when possible.
TURN (Traversal Using Relays around NAT)
A protocol that relays traffic through an intermediary server when direct connections aren’t possible. It’s a fallback when STUN fails. Why it matters: TURN servers cost money (bandwidth) but ensure connections work in restrictive network environments.
SDP (Session Description Protocol)
A format for describing multimedia communication sessions. When two WebRTC endpoints connect, they exchange SDP messages describing what codecs they support, what media they want to send/receive, etc. Why it matters: SDP is how WebRTC endpoints negotiate connection parameters.
Peer-to-Peer (P2P)
Direct communication between two endpoints without an intermediary server. WebRTC prefers P2P for lowest latency. Why it matters: P2P is ideal but not always possible. We use LiveKit as an SFU when P2P isn’t feasible.
SFU (Selective Forwarding Unit)
A server that receives media streams from multiple participants and selectively forwards them to other participants. Unlike an MCU (which mixes streams), an SFU just routes streams without processing them. Why it matters: LiveKit is an SFU. It receives audio from the caller and forwards it to the AI, and vice versa.
Media Track
A single stream of audio or video. An audio track carries sound; a video track carries images. WebRTC connections can have multiple tracks. Why it matters: We work exclusively with audio tracks. The caller publishes an audio track; the AI publishes an audio track.
Codec
An algorithm that encodes and decodes audio or video. Different codecs have different trade-offs between quality, latency, and bandwidth. Why it matters: We use the Opus codec for audio because it’s designed for real-time voice communication with low latency.
Opus
An audio codec specifically designed for interactive real-time applications. It handles everything from low-bandwidth voice to high-quality music. It’s the default codec for WebRTC audio. Why it matters: All our audio is encoded with Opus. Sample rate is typically 48kHz with 20ms frames.
Sample Rate
How many audio samples are captured per second. 48000 Hz (48 kHz) means 48,000 samples per second. Higher sample rates = better quality but more data. Why it matters: Different components expect different sample rates. We standardize on 48kHz for LiveKit but may need 16kHz for some STT services.
Frame
A chunk of audio samples. Audio is processed in frames, not individual samples. A 20ms frame at 48kHz contains 960 samples. Why it matters: Audio processing is frame-based. Understanding frame size helps with buffer management and latency calculations.
2.3 AI/ML Terms
LLM (Large Language Model)
An AI model trained on massive amounts of text that can understand and generate human-like text. Examples: Claude (Anthropic), GPT-4 (OpenAI), Llama (Meta). Why it matters: The LLM is the “brain” of our AI agent. It understands what the caller wants and generates appropriate responses.
STT (Speech-to-Text)
The process of converting spoken audio into written text. Also called ASR (Automatic Speech Recognition). Why it matters: We must convert the caller’s speech to text before the LLM can process it. Deepgram Nova-2 is our STT provider.
TTS (Text-to-Speech)
The process of converting written text into spoken audio. Also called speech synthesis. Why it matters: After the LLM generates a text response, we must convert it to audio for the caller to hear. Chatterbox is our TTS provider.
VAD (Voice Activity Detection)
Detecting when someone is speaking versus when there’s silence or background noise. Why it matters: VAD tells us when the caller starts and stops speaking. This is critical for turn-taking in conversation.
Barge-In
When a caller interrupts the AI while it’s speaking. The AI should stop talking and listen. Why it matters: Natural conversations include interruptions. Our AI must handle barge-in gracefully.
Turn-Taking
The conversational pattern of one party speaking, then the other, back and forth. Humans do this naturally; AI must be programmed to do it. Why it matters: Poor turn-taking makes conversations awkward. The AI shouldn’t talk over the caller or leave long silences.
Latency
The delay between cause and effect. In our context: the time between when the caller stops speaking and when the AI starts responding. Why it matters: High latency feels unnatural. We target <1000ms total latency.
Streaming
Processing data as it arrives rather than waiting for all of it. Streaming STT transcribes words as they’re spoken; streaming TTS generates audio as text is produced. Why it matters: Streaming is essential for low latency. We can’t wait for the caller to finish a complete sentence before starting to process.
Embeddings
Numerical representations of text that capture semantic meaning. Similar texts have similar embeddings. Why it matters: We use embeddings to search the knowledge base. When a caller asks a question, we embed the question and find knowledge chunks with similar embeddings.
Vector Database
A database optimized for storing and searching embeddings. Regular databases search by exact match; vector databases search by similarity. Why it matters: Knowledge base search uses vector similarity. We store document embeddings and query by similarity.
RAG (Retrieval-Augmented Generation)
A technique where the LLM is given relevant information retrieved from a knowledge base before generating a response. This grounds the AI’s responses in actual facts. Why it matters: RAG is how our AI answers questions about a specific business. We retrieve relevant knowledge and inject it into the LLM prompt.
Prompt
The input given to an LLM. This includes system instructions, context, and the user’s message. Why it matters: Prompt design significantly affects AI quality. We carefully craft prompts to make the AI behave appropriately for each tenant.
System Prompt
Instructions given to the LLM that set its behavior, personality, and constraints. The system prompt is typically hidden from the end user. Why it matters: Each tenant has a customized system prompt that defines their AI’s personality and knowledge.
Context Window
The maximum amount of text an LLM can process at once. Measured in tokens. Claude Sonnet has a 200K token context window. Why it matters: Conversation history must fit in the context window. Long calls may require summarization.
Token
A unit of text processing for LLMs. Roughly 4 characters or 0.75 words in English. LLMs charge by token and have token limits. Why it matters: Token usage affects cost and context limits. We track tokens for billing and to avoid exceeding limits.
Function Calling / Tool Use
The ability of an LLM to request execution of external functions. The AI says “I need to check the calendar” and we execute that function and return results. Why it matters: Function calling lets our AI take actions - transfer calls, look up information, schedule appointments, etc.
Hallucination
When an LLM generates plausible-sounding but false information. The AI confidently states something that isn’t true. Why it matters: Hallucinations are dangerous in business contexts. RAG and careful prompting reduce but don’t eliminate hallucinations.
2.4 Platform Terms
Agency
In our platform, an agency is a business partner who resells Voice by aiConnected to their clients. The agency is our direct customer. Example: Oxford Pierpont is an agency with 30 client tenants.
Tenant
An end-customer business that uses the platform through an agency. The tenant is the agency’s customer. Example: Smile Dental is a tenant under Oxford Pierpont.
Platform Admin
An aiConnected employee who manages the overall platform. Can see all agencies and tenants.
Agency Admin
A user who manages an agency account. Can create/manage tenants, view agency-wide analytics, etc.
Tenant Admin
A user who manages a single tenant account. Can configure their knowledge base, view their call history, etc.
Knowledge Base
A collection of information about a tenant’s business that the AI uses to answer questions. Can include documents, FAQs, and structured data. Example: Smile Dental’s knowledge base includes their service list, pricing, hours, and policies.
Voice Configuration
Settings that define how the AI sounds and behaves for a tenant. Includes voice selection, speaking rate, and personality traits.
Personality
The behavioral characteristics of the AI agent - formal vs casual, concise vs verbose, etc.
2.5 Infrastructure Terms
Container
A lightweight, standalone package that includes everything needed to run a piece of software. Containers are consistent across development and production. Why it matters: We deploy our services as Docker containers. This ensures consistency across environments.
Docker
The most popular containerization platform. We write Dockerfiles that define how to build containers.
Kubernetes / K8s
A system for orchestrating containers at scale - handling deployment, scaling, and management. We use Dokploy (which uses Docker Swarm) instead of Kubernetes for simplicity.
Dokploy
An open-source platform for deploying Docker applications. Simpler than Kubernetes. This is our deployment platform on DigitalOcean.
Webhook
An HTTP callback - a way for one service to notify another when something happens. Instead of polling “did anything happen?”, the service pushes notifications. Why it matters: GoToConnect sends webhooks when calls arrive. LiveKit sends webhooks when participants join/leave. Our system is event-driven via webhooks.
WebSocket
A protocol for persistent, bidirectional communication between client and server. Unlike HTTP (request/response), WebSocket connections stay open for real-time data flow. Why it matters: Deepgram STT uses WebSocket for streaming audio in and transcriptions out. Real-time communication requires WebSocket.
REST API
A standard way to build web APIs using HTTP methods (GET, POST, PUT, DELETE) and JSON data. Why it matters: Our management APIs are REST. Agencies and tenants interact with the platform via REST API (and a UI built on it).
JWT (JSON Web Token)
A compact, self-contained token for securely transmitting information. Used for authentication - proving who a user is. Why it matters: Our authentication system uses JWT. Users log in and receive a token that proves their identity.
UUID (Universally Unique Identifier)
A 128-bit identifier that’s practically guaranteed to be unique. Example: 550e8400-e29b-41d4-a716-446655440000
Why it matters: We use UUIDs as primary keys for most database records. They’re generated client-side without coordination.
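As a quick illustration (not from the project codebase), client-side generation looks like this in Python: no database round-trip or central coordination is needed.

```python
import uuid

# A v4 UUID is random-based; collisions are practically impossible,
# so records can be created with their IDs already assigned.
call_id = uuid.uuid4()

print(call_id)          # e.g. 550e8400-e29b-41d4-a716-446655440000
print(call_id.version)  # 4
```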
Environment Variable
A configuration value set outside the code. Allows the same code to run differently in development vs production. Why it matters: API keys, database URLs, and feature flags are environment variables. Never hardcode secrets.
Redis
An in-memory data store used for caching, session storage, and pub/sub messaging. Very fast because data is in RAM. Why it matters: We use Redis for real-time state (active calls), caching, and as a message queue.
PostgreSQL
A powerful open-source relational database. Our primary data store for all persistent data.
n8n
An open-source workflow automation tool. Think “Zapier but self-hosted.” We use n8n for orchestrating webhooks and automations.
Section 3: Architecture Overview
3.1 System Diagram (With Explanation)
Component-by-Component Explanation
PSTN Network
The traditional phone network. When Sarah picks up her phone and dials (555) 123-4567, her call travels through the PSTN. This is outside our control - it’s the global telephone infrastructure.
GoToConnect
Our telephony provider. They give us:
- Phone numbers (DIDs) that customers call
- The ability to answer and control calls programmatically
- Webhook notifications when calls arrive
- APIs to transfer, hold, and hangup calls
n8n (Orchestration)
- Receives the webhook
- Looks up the phone number to find the tenant
- Triggers the WebRTC bridge to answer
- Initiates LiveKit room creation
- Dispatches an AI agent
WebRTC Bridge
- Runs a headless browser with Ooma’s WebRTC softphone
- Auto-answers incoming calls
- Captures the audio stream from the browser
- Forwards that audio to LiveKit
- Receives audio from LiveKit (the AI speaking)
- Plays that audio through the browser to the caller
LiveKit Cloud
- Creates “rooms” for each call
- Routes audio between participants (caller, AI agent, supervisors)
- Records calls (via Egress)
- Handles all the WebRTC complexity
AI Agent Service
- Subscribes to the caller’s audio from LiveKit
- Streams audio to Deepgram for transcription
- Sends transcriptions to Claude for response generation
- Streams Claude’s response to Chatterbox for speech synthesis
- Publishes synthesized audio back to LiveKit
AI Services
- Deepgram Nova-2: Converts caller’s speech to text. Streaming, real-time.
- Claude Sonnet: Generates intelligent responses. Understands context, follows instructions.
- Knowledge Base (RAG): Vector database with tenant-specific information. Grounds Claude’s responses in facts.
- Chatterbox-Turbo: Converts Claude’s text responses to natural-sounding speech. Runs on RunPod GPU.
Data Stores
- PostgreSQL: All persistent data - users, tenants, calls, transcripts, etc.
- Redis: Fast, temporary data - active call state, caching, pub/sub messaging
- S3/DigitalOcean Spaces: Object storage for call recordings (audio files)
Management Layer
- REST API: Backend service that powers all management operations
- Web UI: React-based dashboard for agencies and tenants
- Webhooks (Out): Notify external systems when events occur (call completed, etc.)
3.2 Data Flow Narrative (Step-by-Step What Happens on a Call)
Let’s follow a complete call from start to finish. Sarah is calling Smile Dental.
Phase 1: Call Initiation (0-3 seconds)
T+0.0s: Sarah dials (555) 123-4567
- Her phone connects to PSTN
- PSTN routes to GoToConnect (which owns that number)
- Looks up routing for (555) 123-4567
- Finds it’s configured to ring the Ooma softphone extension
- Sends HTTP POST webhook to our n8n endpoint:
- Workflow triggers
- Looks up phone number +15551234567 in database
- Finds: tenant_id = “smile-dental”, agency_id = “oxford-pierpont”
- Loads tenant configuration: voice settings, greeting, personality
- Creates a call record in PostgreSQL with status = “ringing”
- Sends command: “Answer call on line X”
- Bridge’s browser-based softphone picks up
- GoToConnect sees the softphone answered
- Audio path established: Sarah ↔ GoToConnect ↔ Softphone in Browser
- GoToConnect sends “call.answered” webhook
- Room name: “call-smile-dental-call-123456”
- Generates access tokens for bridge (as “caller”) and agent
- Opens WebSocket connection to LiveKit
- Starts publishing caller audio as an audio track
- Subscribes to receive agent audio track
- n8n notifies Agent Service: “Join room call-smile-dental-call-123456”
- Agent Service assigns an available agent worker
- Agent loads Smile Dental’s configuration and knowledge base
- Subscribes to caller’s audio track
- Ready to publish agent audio track
- Initializes STT connection to Deepgram
- Prepares Claude conversation with system prompt
- Retrieves greeting from tenant config: “Thank you for calling Smile Dental, this is Dr. Smith’s office. How can I help you today?”
- Sends greeting to Chatterbox TTS
- Receives audio stream back
- Publishes to LiveKit
- Sarah hears the greeting through her phone
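The webhook-to-tenant lookup in Phase 1 can be sketched as follows. This is an illustrative stand-in for logic that actually runs inside n8n; the function, field, and directory names are assumptions, not the real schema.

```python
import uuid
from dataclasses import dataclass

@dataclass
class Tenant:
    tenant_id: str
    agency_id: str
    greeting: str

# Stand-in for the phone_numbers lookup in PostgreSQL.
PHONE_DIRECTORY = {
    "+15551234567": Tenant("smile-dental", "oxford-pierpont",
                           "Thank you for calling Smile Dental..."),
}

def handle_incoming_call(webhook: dict) -> dict:
    """Map an incoming-call webhook to a tenant and build the call record."""
    to_number = webhook["to_number"]
    tenant = PHONE_DIRECTORY.get(to_number)
    if tenant is None:
        raise LookupError(f"No tenant configured for {to_number}")
    return {
        "call_id": str(uuid.uuid4()),
        "tenant_id": tenant.tenant_id,
        "agency_id": tenant.agency_id,
        "status": "ringing",   # initial state in the call state machine
        "from_number": webhook["from_number"],
        "to_number": to_number,
    }

record = handle_incoming_call(
    {"from_number": "+15559876543", "to_number": "+15551234567"}
)
print(record["tenant_id"])  # smile-dental
```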
Phase 2: Conversation (Duration varies)
T+3.0s: Sarah starts speaking
- “Yeah, hi, I need to schedule a teeth cleaning”
- Audio flows: Sarah’s phone → PSTN → GoToConnect → Softphone → Bridge → LiveKit → Agent
- Agent streams audio to Deepgram via WebSocket
- Deepgram sends interim results as Sarah speaks:
- T+3.2s: “Yeah”
- T+3.5s: “Yeah hi”
- T+3.8s: “Yeah hi I need to”
- T+4.2s: “Yeah hi I need to schedule”
- T+4.8s: “Yeah hi I need to schedule a teeth cleaning”
- VAD (Voice Activity Detection) detects Sarah stopped speaking at T+5.0s
- Deepgram sends final transcript: “Yeah, hi, I need to schedule a teeth cleaning.”
- Recognizes intent: appointment scheduling
- Queries knowledge base: “teeth cleaning appointment scheduling”
- Retrieves relevant chunks:
- “Teeth cleaning appointments are 45 minutes”
- “Available Monday-Friday 8am-5pm, Saturday 9am-2pm”
- “New patient cleaning: $100”
- Constructs prompt with:
- System prompt (personality, instructions)
- Knowledge base context (retrieved chunks)
- Conversation history (just the greeting so far)
- User message: “Yeah, hi, I need to schedule a teeth cleaning.”
- Claude processes and generates response (streaming)
- As tokens stream back, agent buffers them into sentence chunks
- First sentence ready: “I’d be happy to help you schedule a cleaning!”
- Sends “I’d be happy to help you schedule a cleaning!” to Chatterbox
- Chatterbox generates audio and streams back
- Publishes audio to LiveKit
- Audio flows: Agent → LiveKit → Bridge → Softphone → GoToConnect → PSTN → Sarah’s phone
- Sarah hears: “I’d be happy to help you schedule a cleaning!”
- Meanwhile, Claude has generated more: “Are you an existing patient with us, or will this be your first visit?”
- TTS generates audio, agent publishes
- Sarah hears the complete response
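The sentence-buffering step described above can be sketched like this. `SENTENCE_END` and `flush_to_tts` are illustrative names, not from the actual agent code; the point is that complete sentences are flushed to TTS as soon as they are ready, rather than waiting for the full LLM response.

```python
import re

# Ends a chunk on ., !, or ? (optionally followed by a closing quote).
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def stream_to_sentences(token_stream, flush_to_tts):
    """Buffer streamed LLM tokens and flush complete sentences to TTS."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            flush_to_tts(buffer.strip())
            buffer = ""
    if buffer.strip():              # flush any trailing partial sentence
        flush_to_tts(buffer.strip())

sentences = []
tokens = ["I'd be happy to help you schedule a cleaning! ",
          "Are you an existing patient with us, ",
          "or will this be your first visit?"]
stream_to_sentences(iter(tokens), sentences.append)
print(sentences)  # two sentences, flushed independently
```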
Phase 3: Continued Conversation
This back-and-forth continues:
- Sarah: “I’ve been there before, maybe two years ago?”
- Agent: (checks if it matters, decides to proceed) “Great, let me check our availability. What days work best for you?”
- Sarah: “Anytime Thursday or Friday afternoon”
- Agent: “I have openings Thursday at 2pm, 3:30pm, or Friday at 1pm and 4pm. Which works best?”
- Sarah: “Thursday at 3:30 works”
- Agent: “Perfect! I have you down for Thursday at 3:30pm for a teeth cleaning. Can I confirm your phone number for appointment reminders?”
- …and so on
Throughout the conversation:
- Every utterance is transcribed and stored
- Conversation history grows, sent to Claude each turn
- Agent can access tenant knowledge base as needed
- Full audio is being recorded via LiveKit Egress
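The per-turn loop above (history grows, and the full context is resent to Claude each turn) can be sketched as follows. The message shapes are illustrative rather than the exact Anthropic API payload, and `build_request` is a hypothetical helper.

```python
def build_request(system_prompt, kb_chunks, history, user_message):
    """Assemble one turn's LLM request from the standing pieces."""
    context = "\n".join(f"- {chunk}" for chunk in kb_chunks)
    return {
        # System prompt plus retrieved knowledge grounds the response (RAG).
        "system": f"{system_prompt}\n\nRelevant knowledge:\n{context}",
        "messages": history + [{"role": "user", "content": user_message}],
    }

history = [
    {"role": "assistant", "content": "Thank you for calling Smile Dental..."},
]
request = build_request(
    system_prompt="You are a friendly dental office receptionist.",
    kb_chunks=["Teeth cleaning appointments are 45 minutes"],
    history=history,
    user_message="Yeah, hi, I need to schedule a teeth cleaning.",
)
# After the model replies, both turns are appended so the next
# request carries the full conversation.
history.append(request["messages"][-1])
history.append({"role": "assistant", "content": "I'd be happy to help..."})
print(len(history))  # 3
```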
Phase 4: Call Completion
Sarah: “That’s all I needed, thanks!”
Agent recognizes the call is ending:
- Intent: end conversation
- Response: “You’re all set! We’ll see you Thursday at 3:30. Have a great day!”
- Sends command to n8n: “End call call-123456”
- n8n tells GoToConnect to hang up
- GoToConnect terminates the call
- LiveKit room closes (all participants left)
- LiveKit Egress finalizes recording, uploads to storage
- n8n workflow triggers:
- Updates call record: status = “completed”, duration = 180 seconds
- Triggers transcript finalization
- Generates call summary (optional Claude call)
- Calculates usage for billing
- Sends webhook to tenant (if configured)
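The “calculates usage for billing” step might look like the sketch below. The $0.005/minute PSTN rate comes from the GoToConnect pricing noted in Section 3.3; the per-minute rounding and the output shape are illustrative assumptions, not the real billing model.

```python
from datetime import datetime, timezone
import math

PSTN_RATE_CENTS_PER_MIN = 0.5  # ~$0.005/minute (Section 3.3)

def finalize_usage(answered_at: datetime, ended_at: datetime) -> dict:
    """Compute duration and PSTN cost for a completed call."""
    duration_seconds = int((ended_at - answered_at).total_seconds())
    billed_minutes = math.ceil(duration_seconds / 60)  # assume round up per minute
    return {
        "duration_seconds": duration_seconds,
        "pstn_cost_cents": round(billed_minutes * PSTN_RATE_CENTS_PER_MIN, 2),
    }

usage = finalize_usage(
    datetime(2026, 1, 25, 14, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 1, 25, 14, 3, 0, tzinfo=timezone.utc),
)
print(usage)  # {'duration_seconds': 180, 'pstn_cost_cents': 1.5}
```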
3.3 Technology Choices (What We’re Using and WHY)
Every technology choice has a reason. Here’s why we chose each component:
Telephony: GoToConnect
What it is: Cloud-based business phone system with API access.
Why we chose it:
- Ooma WebRTC Softphone - Critical. GoToConnect offers a browser-based softphone through Ooma, which lets us capture audio without specialized telephony hardware.
- Webhook support - Sends real-time notifications for call events.
- Call control API - Programmatic transfer, hold, hangup.
- Reasonable pricing - ~$0.005/minute for PSTN usage.
- Existing relationship - Bob’s company already uses GoToConnect.
Alternatives considered:
- Twilio - More developer-friendly but more expensive, no Ooma equivalent
- Vonage - Similar capabilities but less familiar
- Direct SIP - Would require significant telephony expertise
Real-Time Media: LiveKit Cloud
What it is: Managed WebRTC infrastructure for real-time audio/video.
Why we chose it:
- LiveKit Agents Framework - Purpose-built for AI voice agents. Handles VAD, turn-taking, pipeline orchestration.
- Cloud-hosted - No infrastructure to manage.
- Low latency - Sub-100ms audio routing.
- Recording built-in - Egress feature for call recording.
- Scalable - Handles thousands of concurrent rooms.
Alternatives considered:
- Self-hosted LiveKit - More control but operational burden
- Twilio Video - Less AI-focused, no agents framework
- Daily.co - Good but less mature agent tooling
- Custom WebRTC - Too much complexity
Speech-to-Text: Deepgram Nova-2
What it is: Real-time speech recognition API.
Why we chose it:
- Accuracy - Nova-2 is best-in-class for conversational speech.
- Streaming - Real-time results as speech happens.
- Latency - Designed for real-time use cases.
- Pricing - $0.0043/minute is competitive.
- LiveKit integration - Works well with LiveKit Agents.
Alternatives considered:
- Google Speech-to-Text - Good but more expensive
- AWS Transcribe - Higher latency
- Whisper - Not designed for real-time streaming
- AssemblyAI - Good but Deepgram has edge on latency
Language Model: Claude Sonnet (Anthropic)
What it is: Large language model for generating responses.
Why we chose it:
- Quality - Claude produces natural, helpful responses.
- Instruction following - Excellent at staying in character.
- Function calling - Reliable tool use for actions.
- Context window - 200K tokens handles long conversations.
- Safety - Built-in refusal of harmful requests.
Alternatives considered:
- GPT-4 - Comparable but OpenAI has reliability concerns
- Llama - Would need to self-host, more complexity
- Claude Opus - Overkill for this use case, more expensive
Text-to-Speech: Chatterbox-Turbo on RunPod
What it is: Open-source TTS model running on GPU cloud.
Why we chose it:
- Quality - Natural-sounding voice synthesis.
- Cost - Much cheaper than commercial TTS at scale.
- Customization - Can fine-tune for specific voices.
- Latency - Fast enough for real-time with GPU acceleration.
- No per-character fees - Just GPU time.
Alternatives considered:
- ElevenLabs - Excellent quality but $0.30/1000 chars adds up
- Amazon Polly - Robotic sounding
- Google TTS - Better than Polly but not great
- Play.ht - Good but expensive for volume
Database: PostgreSQL
What it is: Open-source relational database.
Why we chose it:
- Reliability - Battle-tested, ACID compliant.
- pgvector extension - Native vector similarity search for RAG.
- JSON support - Flexible for varied data shapes.
- Familiar - Team knows it well.
- Managed options - DigitalOcean, AWS RDS, etc.
Alternatives considered:
- MySQL - No native vector support
- MongoDB - Less suited for relational data
- Separate vector DB - Added complexity
Cache/State: Redis
What it is: In-memory data store.
Why we chose it:
- Speed - Sub-millisecond operations.
- Pub/Sub - Real-time messaging between services.
- TTL support - Automatic expiration for temporary data.
- Familiar - Industry standard.
Alternatives considered:
- Memcached - Less feature-rich
- KeyDB - Compatible but less proven
Orchestration: n8n
What it is: Open-source workflow automation.
Why we chose it:
- Visual workflows - Easy to build and debug.
- Webhook handling - First-class support.
- Self-hosted - No per-execution fees.
- Extensible - Custom code nodes when needed.
- Bob’s familiarity - Already using it.
Alternatives considered:
- Zapier - Too expensive at scale
- Custom code - More flexibility but slower to develop
- Temporal - Overkill for our needs
Deployment: Dokploy on DigitalOcean
What it is: Container orchestration platform on cloud VMs.
Why we chose it:
- Simplicity - Easier than Kubernetes.
- Cost - DigitalOcean is affordable.
- Control - Self-managed but not too complex.
- Docker-native - Standard containerization.
Alternatives considered:
- Kubernetes - Overkill for initial scale
- AWS ECS - More complex, vendor lock-in
- Heroku - Expensive at scale
- Render - Good but less control
3.4 What We’re NOT Building (Explicit Scope Boundaries)
Clear boundaries prevent scope creep. Here’s what’s explicitly out of scope:
Not Building: Outbound Dialer (MVP)
We will support outbound calls eventually, but MVP is inbound-only. Outbound dialers require:
- Campaign management
- Do-not-call list compliance
- Predictive dialing algorithms
- Different conversation patterns
Not Building: Video Calls
Voice only. No video support. Video would require:
- Different pipeline (video processing)
- Higher bandwidth
- Different use cases entirely
Not Building: SMS/Chat
Voice only. No text messaging or web chat. These would require:
- Different interaction patterns
- Different latency expectations
- Different UI
Not Building: Custom Voice Cloning
We use pre-trained voices. We won’t clone customer voices or create fully custom voices. This would require:
- Voice recording sessions
- Fine-tuning pipelines
- Legal consent frameworks
Not Building: On-Premise Deployment
Cloud only. No on-premise option. On-prem would require:
- Different deployment models
- Customer-managed infrastructure
- Support complexity
Not Building: Direct Consumer Sales
Agencies only. We don’t sell directly to end businesses. Direct sales would require:
- Different sales motion
- Support infrastructure
- Competing with our own customers
Not Building: Full CRM
We capture call data but we’re not a CRM. Integrations with Salesforce, HubSpot, etc. are planned, but we won’t replicate CRM functionality. Why not: Focus. Others do CRM well. We do voice AI well.
Not Building: Appointment Scheduling Backend
The AI can help schedule appointments conversationally, but we won’t build a full scheduling system (calendar management, availability, etc.). We’ll integrate with existing systems. Why not: Reinventing the wheel. Calendly, Acuity, etc. exist.
Section 4: Development Environment Setup
This section tells you exactly how to set up your development machine to work on this project. Follow these steps in order.
4.1 Required Accounts & API Keys
Before writing any code, you need accounts with these services. Create accounts and gather API keys.
4.1.1 GoToConnect (Telephony)
What you need:
- GoToConnect account with API access
- OAuth 2.0 credentials (Client ID and Client Secret)
- At least one phone number provisioned
- Webhook endpoint configured
How to get it:
- Contact GoToConnect sales for a developer/partner account
- Access the admin portal at admin.goto.com
- Navigate to Integrations → API Credentials
- Create new OAuth 2.0 application
- Note the Client ID and Client Secret
- Configure redirect URI for OAuth flow
4.1.2 LiveKit Cloud
What you need:
- LiveKit Cloud account
- API Key and Secret
- WebSocket URL for your project
How to get it:
- Sign up at cloud.livekit.io
- Create a new project
- Go to Settings → Keys
- Note the API Key and Secret
- Note the WebSocket URL (wss://your-project.livekit.cloud)
4.1.3 Deepgram (STT)
What you need:
- Deepgram account
- API key
How to get it:
- Sign up at console.deepgram.com
- Create a new project
- Go to API Keys
- Create new key with appropriate permissions
4.1.4 Anthropic (LLM)
What you need:
- Anthropic API account
- API key
How to get it:
- Sign up at console.anthropic.com
- Go to API Keys
- Create new key
4.1.5 RunPod (TTS Hosting)
What you need:
- RunPod account
- API key
- GPU endpoint URL (after deploying Chatterbox)
How to get it:
- Sign up at runpod.io
- Add payment method
- Go to Settings → API Keys
- Create new key
- Deploy Chatterbox template (instructions in Part 7)
4.1.6 DigitalOcean
What you need:
- DigitalOcean account
- API token
- Spaces access keys (for object storage)
How to get it:
- Sign up at digitalocean.com
- Go to API → Tokens → Generate New Token
- Go to Spaces → Manage Keys → Generate New Key
4.1.7 Database Connection
For local development:
4.2 Local Development Tools
Install these tools on your development machine.
4.2.1 Required Software
Node.js (v20 LTS)
4.2.2 Recommended IDE Setup
VS Code Extensions:
- Python (Microsoft)
- Pylance
- ESLint
- Prettier
- Docker
- GitLens
- Thunder Client (API testing)
- PostgreSQL (ckolkman)
4.2.3 Helpful CLI Tools
4.3 Repository Structure
The project is organized as a monorepo with the following structure:
Service Responsibilities
api/ - REST API Service
- Handles all HTTP requests from frontend and external systems
- Manages authentication and authorization
- CRUD operations for all entities
- Exposes webhooks for external systems
AI Agent Service
- LiveKit Agents worker
- Voice pipeline (STT → LLM → TTS)
- Knowledge base queries
- Conversation management
WebRTC Bridge Service
- Browser automation (Puppeteer)
- Ooma softphone control
- Audio capture and routing
- LiveKit media publishing
Background Worker Service
- Async job processing
- Post-call processing
- Transcript finalization
- Usage aggregation
- Scheduled tasks
Frontend (Web UI)
- Agency dashboard
- Tenant dashboard
- Admin dashboard
- Configuration interfaces
4.4 Environment Variables Reference
Complete list of all environment variables used by the system:
4.5 How to Run Locally
Step-by-step instructions to get the system running on your machine.
Step 1: Clone the Repository
Step 2: Copy Environment Variables
Copy the example environment file to .env and fill in all the API keys from Section 4.1.
Step 3: Start Infrastructure Services
Step 4: Initialize Database
Step 5: Start Backend Services
Option A: Using Docker Compose (Recommended)
Step 6: Start Frontend
Step 7: Start n8n (Workflow Automation)
Step 8: Expose Webhooks (For Testing)
GoToConnect needs to reach your local machine with webhooks.
Step 9: Verify Everything Works
Step 10: Make a Test Call
- Log into the frontend at http://localhost:3000
- Create a test tenant with a phone number
- Call the phone number
- You should hear the AI greeting!
Troubleshooting Common Issues
Issue: Database connection refused
Check that the postgres container is running with docker compose ps. Start it with docker compose up -d postgres.
Issue: Redis connection refused
Start it with docker compose up -d redis.
Issue: API key errors
Verify the keys in your .env file. Check for trailing whitespace or quotes.
Issue: Webhook not received
Confirm your tunnel from Step 8 is running and that GoToConnect is configured with the public URL.
Issue: Chromium/Puppeteer fails to launch
Install a browser with sudo apt-get install chromium-browser. You may need to configure Puppeteer to use the installed browser.
Issue: Port already in use
Find the process with lsof -i :8000, then kill -9 <PID>. Or change the port in .env.
End of Part 1
You now have:
- ✅ Complete understanding of what we’re building and why
- ✅ Full glossary of every technical term
- ✅ Detailed architecture with component explanations
- ✅ Complete development environment setup
Coming in Part 2:
- Complete database schema with DDL
- Every table, column, index explained
- Migration strategy
- Query patterns
Document End - Part 1 of 10
Junior Developer PRD - Part 2: Database Design
Document Version: 1.0
Last Updated: January 25, 2026
Part: 2 of 10
Sections: 5-12
Audience: Junior developers with no prior context
Section 5: Database Architecture
5.1 Why PostgreSQL
We use PostgreSQL as our primary database. Here’s why:
Reasons for Choosing PostgreSQL
1. Relational Data Model Fits Our Domain
Our data is inherently relational:
- Agencies have many Tenants
- Tenants have many Phone Numbers
- Phone Numbers receive many Calls
- Calls have Transcripts and Recordings
2. Native Vector Search
PostgreSQL has the pgvector extension that adds:
- Vector data type for storing embeddings
- Vector similarity search operators
- Indexes for fast nearest-neighbor queries
3. Flexible JSON Support
- Tenant configuration varies by tenant
- Call metadata varies by call type
- Webhook payloads from external systems
4. Reliability and Maturity
- ACID compliance (data integrity guaranteed)
- Excellent crash recovery
- Mature replication for high availability
- Decades of production use
5. Managed Hosting Options
- DigitalOcean Managed Databases
- AWS RDS
- Supabase
- Neon
What We’re NOT Using
MongoDB - Document databases are great for some use cases, but our relational data benefits from joins and foreign key constraints.
MySQL - Good database, but lacks native vector support. We’d need a separate vector database.
SQLite - Not suitable for concurrent access from multiple services.
Separate Vector Database - Adding Pinecone/Weaviate/Milvus would mean another service to manage, another point of failure, and data synchronization challenges.
5.2 Database Naming Conventions
Consistency makes code easier to read and write. Follow these conventions exactly.
Table Names
- Plural nouns: users, calls, tenants (not user, call, tenant)
- Snake_case: phone_numbers, knowledge_bases (not phoneNumbers, KnowledgeBases)
- Lowercase only: call_events (not Call_Events or CALL_EVENTS)
Column Names
- Snake_case: created_at, tenant_id, phone_number
- Lowercase only: Always
- Descriptive: started_at not start, duration_seconds not dur
- Boolean columns: Prefix with is_ or has_: is_active, has_voicemail
- Timestamps: Suffix with _at: created_at, updated_at, deleted_at, started_at, ended_at
Index Names
ix_calls_tenant_id, ix_calls_started_at, ix_users_email, ix_phone_numbers_tenant_id_number
Constraint Names
Primary keys: pk_users, pk_calls
Foreign keys: fk_tenants_agencies, fk_calls_tenants
Unique constraints: uq_users_email, uq_phone_numbers_number
Check constraints: ck_calls_duration_positive, ck_users_email_format
Enum Types
call_status_enum, call_direction_enum, user_role_enum
5.3 Common Patterns Used
These patterns appear throughout the schema. Understand them once, recognize them everywhere.
Pattern 1: UUID Primary Keys
Every table uses UUID as the primary key, not auto-incrementing integers.
Why UUIDs:
- Can be generated client-side without database round-trip
- No sequential guessing (security)
- Easy to merge data from multiple sources
- Works well with distributed systems
Problems with auto-incrementing integers:
- Requires database to generate ID
- Sequential IDs leak information (how many records exist)
- Merging data from multiple sources causes conflicts
Pattern 2: Timestamp Columns
Every table has created_at and updated_at timestamp columns. A database trigger automatically sets updated_at when a row changes.
Pattern 3: Soft Deletes
We don’t actually delete records. We mark them as deleted:
- If deleted_at is NULL: Record is active
- If deleted_at has a value: Record was deleted at that time
Why soft deletes:
- Data recovery is possible
- Audit trail preserved
- Foreign key relationships don’t break
- Billing and analytics remain accurate
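Patterns 1-3 can be seen together in a miniature example. SQLite stands in for PostgreSQL here purely for illustration (the real schema uses native UUID and TIMESTAMPTZ types, and a trigger maintains updated_at).

```python
import sqlite3
import uuid
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tenants (
        id         TEXT PRIMARY KEY,   -- UUID, generated client-side
        name       TEXT NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL,
        deleted_at TEXT                -- NULL = active (soft delete)
    )
""")

now = datetime.now(timezone.utc).isoformat()
tenant_id = str(uuid.uuid4())
db.execute("INSERT INTO tenants VALUES (?, ?, ?, ?, NULL)",
           (tenant_id, "Smile Dental", now, now))

# "Deleting" just stamps deleted_at; the row survives for audits/billing.
db.execute("UPDATE tenants SET deleted_at = ?, updated_at = ? WHERE id = ?",
           (now, now, tenant_id))

active = db.execute(
    "SELECT COUNT(*) FROM tenants WHERE deleted_at IS NULL").fetchone()[0]
total = db.execute("SELECT COUNT(*) FROM tenants").fetchone()[0]
print(active, total)  # 0 1
```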
Pattern 4: JSONB Configuration Columns
For flexible, schema-less data within a record.
Use JSONB for:
- Data structure varies between records
- External system payloads
- User-configurable settings
- Data you don’t query by frequently
Do not use JSONB for:
- Data you query/filter by frequently (use columns)
- Relationships to other tables (use foreign keys)
- Data with strict schema requirements
Pattern 5: Enum Types for Status Fields
For fields with a fixed set of values:
- Database enforces valid values
- Typos caught at insert time
- Self-documenting schema
- More efficient storage than strings
Use enums when:
- Fixed set of values that rarely changes
- Values are known at schema design time
Avoid enums when:
- Values added/removed frequently
- User-defined values
- Hundreds of possible values
Pattern 6: Tenant Isolation
Most tables include a tenant_id foreign key:
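One way to enforce that scope at the query layer is a helper that always bakes tenant_id into the WHERE clause, so one tenant can never read another tenant's rows. The function and column names below are illustrative, not from the actual codebase.

```python
def list_calls_sql(tenant_id: str, limit: int = 50):
    """Return a parameterized query that is always scoped to one tenant."""
    sql = (
        "SELECT * FROM calls "
        "WHERE tenant_id = %s AND deleted_at IS NULL "  # tenant + soft-delete filters
        "ORDER BY created_at DESC LIMIT %s"
    )
    return sql, (tenant_id, limit)

sql, params = list_calls_sql("550e8400-e29b-41d4-a716-446655440000")
print(params)  # ('550e8400-e29b-41d4-a716-446655440000', 50)
```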
Pattern 7: Audit Columns for Sensitive Operations
For tables where we need to track who did what:
Section 6: Schema - Core Entities
6.1 agencies Table
Agencies are our direct customers - businesses that resell Voice by aiConnected to their clients.
Column Explanations
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Unique identifier, auto-generated |
| name | VARCHAR(255) | Display name: “Oxford Pierpont” |
| slug | VARCHAR(100) | URL-safe identifier: “oxford-pierpont” |
| contact_email | VARCHAR(255) | Primary contact email |
| contact_phone | VARCHAR(50) | Primary contact phone |
| contact_name | VARCHAR(255) | Primary contact person’s name |
| address_* | Various | Physical/billing address |
| company_name | VARCHAR(255) | Legal entity name |
| tax_id | VARCHAR(50) | Tax identification number |
| status | VARCHAR(50) | Account status: active/suspended/cancelled |
| is_verified | BOOLEAN | Has the agency verified their identity |
| max_tenants | INTEGER | Quota: how many tenants allowed |
| max_concurrent_calls | INTEGER | Quota: simultaneous calls across all tenants |
| billing_email | VARCHAR(255) | Where to send invoices |
| stripe_customer_id | VARCHAR(255) | Reference to Stripe customer |
| billing_plan | VARCHAR(50) | Pricing tier |
| settings | JSONB | Flexible configuration |
| metadata | JSONB | Internal tracking data |
| created_at | TIMESTAMPTZ | When record was created |
| updated_at | TIMESTAMPTZ | When record was last modified |
| deleted_at | TIMESTAMPTZ | When record was soft-deleted (NULL if active) |
6.2 tenants Table
Tenants are the end-customer businesses (agency’s clients) that use the voice AI.
Column Explanations
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Unique identifier |
| agency_id | UUID | Which agency owns this tenant |
| name | VARCHAR(255) | Display name: “Smile Dental” |
| slug | VARCHAR(100) | URL-safe identifier, unique within agency |
| business_type | VARCHAR(100) | Industry category for analytics |
| timezone | VARCHAR(50) | IANA timezone (America/New_York) |
| contact_* | Various | Business contact information |
| website_url | VARCHAR(500) | Business website |
| status | VARCHAR(50) | Account status |
| max_concurrent_calls | INTEGER | How many simultaneous calls allowed |
| max_monthly_minutes | INTEGER | Monthly minute quota (NULL = unlimited) |
| settings | JSONB | All tenant configuration |
| metadata | JSONB | Internal tracking |
| created_at | TIMESTAMPTZ | Creation timestamp |
| updated_at | TIMESTAMPTZ | Last modification |
| deleted_at | TIMESTAMPTZ | Soft delete timestamp |
6.3 users Table
Users are humans who log into the platform - agency admins, tenant admins, etc.
Column Explanations
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Unique identifier |
| agency_id | UUID | Agency this user belongs to (NULL if tenant user or platform admin) |
| tenant_id | UUID | Tenant this user belongs to (NULL if agency user or platform admin) |
| email | VARCHAR(255) | Login email, must be unique |
| password_hash | VARCHAR(255) | bcrypt hash of password |
| first_name | VARCHAR(100) | User’s first name |
| last_name | VARCHAR(100) | User’s last name |
| phone | VARCHAR(50) | Contact phone number |
| avatar_url | VARCHAR(500) | Profile picture URL |
| role | VARCHAR(50) | Permission level |
| status | VARCHAR(50) | Account status |
| is_verified | BOOLEAN | Has email been verified |
| last_login_at | TIMESTAMPTZ | When user last logged in |
| last_login_ip | VARCHAR(45) | IP address of last login |
| failed_login_attempts | INTEGER | Count of failed logins (for lockout) |
| locked_until | TIMESTAMPTZ | Account locked until this time |
| password_reset_* | Various | Password reset flow fields |
| email_verification_* | Various | Email verification flow fields |
| preferences | JSONB | User preferences and settings |
6.4 user_roles and permissions Tables
Fine-grained permission control for advanced use cases.
Section 7: Schema - Telephony Entities
7.1 phone_numbers Table
Phone numbers provisioned through GoToConnect and assigned to tenants.
7.2 calls Table
The central table tracking all phone calls.
Column Explanations
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Unique call identifier |
| tenant_id | UUID | Which tenant this call belongs to |
| phone_number_id | UUID | Which phone number received/made the call |
| external_call_id | VARCHAR(255) | GoToConnect’s ID for correlation |
| livekit_room_name | VARCHAR(255) | LiveKit room for audio routing |
| direction | ENUM | inbound or outbound |
| from_number | VARCHAR(20) | Caller’s phone number |
| to_number | VARCHAR(20) | Recipient’s phone number |
| status | ENUM | Current call state |
| initiated_at | TIMESTAMPTZ | When call started |
| ringing_at | TIMESTAMPTZ | When ringing began |
| answered_at | TIMESTAMPTZ | When call was answered |
| ended_at | TIMESTAMPTZ | When call ended |
| duration_seconds | INTEGER | Length of conversation |
| outcome | VARCHAR(100) | Result classification |
| sentiment_score | DECIMAL | Caller sentiment (-1 to 1) |
| cost_cents | INTEGER | Total cost for billing |
| recording_url | VARCHAR(500) | URL to access recording |
| transcript_id | UUID | Link to transcript record |
| error_* | Various | Error details if call failed |
| metadata | JSONB | Additional call data |
7.3 call_events Table
State machine history for every call - tracks every status change.
7.4 call_transfers Table
Tracks when calls are transferred to humans or other destinations.
Section 8: Schema - AI & Content Entities
8.1 knowledge_bases Table
Container for tenant knowledge - documents, FAQs, etc.
8.2 knowledge_documents Table
Individual documents uploaded to knowledge bases.
8.3 knowledge_chunks Table
Chunked document content with embeddings for vector search.
Vector Search Query Example
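The original query example was lost in conversion. A representative pgvector similarity query might look like the following (the column names `embedding`/`content`, the join path through `knowledge_documents`, and the cosine-distance operator `<=>` are assumptions, not confirmed by this document):

```python
def knowledge_search_sql(limit: int = 5) -> str:
    """Parameterized pgvector similarity query (sketch; schema details assumed)."""
    return (
        "SELECT kc.id, kc.content, "
        "       kc.embedding <=> %(query_embedding)s AS distance "
        "FROM knowledge_chunks kc "
        "JOIN knowledge_documents kd ON kd.id = kc.document_id "
        "WHERE kd.tenant_id = %(tenant_id)s "
        f"ORDER BY distance LIMIT {int(limit)}"
    )
```

Filtering by `tenant_id` before ordering by distance matches Pattern 3 in Section 11.3.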
8.4 transcripts Table
Full transcripts of conversations.
8.5 recordings Table
Call recording metadata and storage references.
Section 9: Schema - Configuration Entities
9.1 voice_configurations Table
TTS voice settings for tenants.
9.2 agent_personalities Table
AI personality and behavior configuration.
9.3 greetings Table
Pre-configured greeting messages.
9.4 business_hours Table
Business hours for controlling AI behavior by time.
Section 10: Schema - Billing & Analytics
10.1 usage_records Table
Granular usage tracking for billing.
10.2 billing_events Table
Billing-related events (invoices, payments, etc.).
10.3 call_analytics Table
Pre-aggregated analytics for dashboards.
Section 11: Indexes & Performance
11.1 Required Indexes (With Explanations)
All indexes are defined inline with the table definitions above. Here’s a summary of the indexing strategy.
Primary Access Patterns
1. List calls for a tenant (most common):
SELECT * FROM calls WHERE tenant_id = $1 ORDER BY created_at DESC LIMIT 50
2. Find active records (soft delete filter):
SELECT * FROM calls WHERE tenant_id = $1 AND status = 'answered'
3. Time-range queries
4. Full-text transcript search:
WHERE to_tsvector('english', full_text) @@ to_tsquery('appointment')
Index Maintenance
11.2 Partitioning Strategy
For tables that grow very large, we use table partitioning.
Calls Table Partitioning (Future)
When the calls table exceeds ~10 million rows, partition by month.
Benefits of Partitioning
- Query Performance: Queries filtering by date only scan relevant partitions
- Maintenance: Can vacuum/reindex individual partitions
- Data Retention: Can drop old partitions instead of DELETE
- Parallel Query: PostgreSQL can scan partitions in parallel
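Monthly partition DDL can be generated mechanically. A sketch, assuming the table is declared `PARTITION BY RANGE (created_at)` (the real partition key and naming are decided when partitioning is actually introduced):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """DDL for one monthly range partition of calls (partition key assumed)."""
    start = date(year, month, 1)
    end = date(year + (1 if month == 12 else 0), 1 if month == 12 else month + 1, 1)
    name = f"calls_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF calls "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )
```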
When to Partition
- calls: When > 10M rows
- call_events: When > 50M rows
- transcript_turns: When > 100M rows
- usage_records: When > 50M rows
11.3 Query Patterns to Optimize For
Pattern 1: Tenant Call List
Optimizations:
- Index: ix_calls_tenant_id_created_at
- Limit result set size
- Consider cursor-based pagination for large offsets
Pattern 2: Analytics Dashboard
Optimizations:
- Pre-aggregated table (call_analytics)
- Index: ix_call_analytics_tenant_date
Pattern 3: Knowledge Base Search
Optimizations:
- IVFFlat index on embeddings
- Filter by tenant_id first (reduces vector search scope)
Pattern 4: Transcript Search
Optimizations:
- GIN index on tsvector
- Tenant filter combined with full-text search
Section 12: Migrations
12.1 Migration File Naming Convention
We use Alembic for database migrations. Migration files follow this naming pattern:
- 20260125_1000_initial_schema.py
- 20260125_1100_add_phone_numbers.py
- 20260126_0900_add_call_sentiment.py
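The convention can be enforced with a tiny helper (a hypothetical utility for illustration, not part of Alembic itself):

```python
from datetime import datetime

def migration_filename(slug: str, when: datetime) -> str:
    """Build a migration filename following the YYYYMMDD_HHMM_<slug>.py convention."""
    return f"{when:%Y%m%d_%H%M}_{slug}.py"
```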
Migration File Structure
12.2 Initial Migration Script
The initial migration creates all tables defined in this document.
12.3 How to Add New Migrations
Step 1: Generate Migration File
Alembic writes the generated file to migrations/versions/.
Step 2: Edit the Migration
Step 3: Test Migration Locally
Step 4: Commit Migration
Migration Best Practices
- Always test rollback - Every upgrade() must have a working downgrade()
- Avoid data loss - Don’t drop columns without migrating data first
- Use transactions - Alembic wraps migrations in transactions by default
- Handle large tables carefully - Adding indexes to large tables can lock them (prefer CREATE INDEX CONCURRENTLY)
- Don’t modify old migrations - Once deployed, migrations are immutable
- Test with production-like data - A migration that works on empty tables might fail on real data
End of Part 2
You now have:
- ✅ Complete understanding of database architecture decisions
- ✅ Full DDL for all 25+ tables
- ✅ Comprehensive column documentation
- ✅ Index strategy with explanations
- ✅ Migration workflow
Coming in Part 3:
- REST API architecture
- Authentication and authorization
- Complete endpoint specifications
- Request/response schemas
- Error handling
Document End - Part 2 of 10
Junior Developer PRD - Part 3: API Design
Document Version: 1.0
Last Updated: January 25, 2026
Part: 3 of 10
Sections: 13-22
Audience: Junior developers with no prior context
Section 13: REST API Architecture
13.1 What is REST (Quick Refresher)
REST (Representational State Transfer) is an architectural style for designing web APIs. Our API follows REST principles:
1. Resources are nouns, not verbs
- Good: /api/v1/calls (noun)
- Bad: /api/v1/getCalls (verb)
2. HTTP methods express the action
- GET - Read (retrieve data)
- POST - Create (new resource)
- PUT - Update (replace entire resource)
- PATCH - Partial Update (modify specific fields)
- DELETE - Delete (remove resource)
3. Nested resources express relationships
- /api/v1/tenants/{tenant_id}/calls - Calls belonging to a tenant
- /api/v1/agencies/{agency_id}/tenants - Tenants belonging to an agency
4. Requests are stateless
- Each request contains all information needed
- Server doesn’t store client session state
- Authentication token sent with every request
5. Status codes communicate outcomes
- 2xx - Success
- 4xx - Client error (your fault)
- 5xx - Server error (our fault)
13.2 API URL Structure
Base URL
URL Pattern
Versioning Strategy
We use URL path versioning (/api/v1/, /api/v2/).
Why URL versioning:
- Explicit and visible
- Easy to route at load balancer
- Clear which version client is using
- Can run multiple versions simultaneously
- v1 - Current stable version
- v2 - Next version (when breaking changes needed)
- Old versions deprecated with 6-month warning
- Deprecated versions return a Warning header
13.3 Request Format
Headers
Every request must include the standard headers.
Request Body (POST/PUT/PATCH)
Always JSON.
Query Parameters (GET)
Used for filtering, pagination, and sorting.
13.4 Response Format
Successful Response (Single Resource)
Successful Response (Collection)
Error Response
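The error payload example was lost in conversion. The envelope shape implied by the error codes in Section 13.6 might look like this (the exact field names are an assumption):

```python
from typing import Optional

def error_response(code: str, message: str, field: Optional[str] = None) -> dict:
    """Build the JSON error envelope returned with 4xx/5xx statuses (sketch)."""
    error: dict = {"code": code, "message": message}
    if field is not None:
        error["field"] = field          # which request field failed validation
    return {"error": error}
```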
13.5 HTTP Status Codes
Success Codes (2xx)
| Code | Name | When to Use |
|---|---|---|
| 200 | OK | Successful GET, PUT, PATCH |
| 201 | Created | Successful POST (resource created) |
| 202 | Accepted | Request accepted for async processing |
| 204 | No Content | Successful DELETE (no body returned) |
Client Error Codes (4xx)
| Code | Name | When to Use |
|---|---|---|
| 400 | Bad Request | Invalid JSON, missing required fields |
| 401 | Unauthorized | Missing or invalid authentication token |
| 403 | Forbidden | Valid token but insufficient permissions |
| 404 | Not Found | Resource doesn’t exist |
| 409 | Conflict | Resource already exists, state conflict |
| 422 | Unprocessable Entity | Valid JSON but business logic error |
| 429 | Too Many Requests | Rate limit exceeded |
Server Error Codes (5xx)
| Code | Name | When to Use |
|---|---|---|
| 500 | Internal Server Error | Unexpected server error |
| 502 | Bad Gateway | Upstream service error |
| 503 | Service Unavailable | Server overloaded or maintenance |
| 504 | Gateway Timeout | Upstream service timeout |
13.6 Error Code Reference
Standardized error codes for programmatic handling:
Authentication Errors (AUTH_*)
| Code | HTTP Status | Description |
|---|---|---|
| AUTH_TOKEN_MISSING | 401 | No Authorization header |
| AUTH_TOKEN_INVALID | 401 | Malformed or expired token |
| AUTH_TOKEN_EXPIRED | 401 | Token has expired |
| AUTH_REFRESH_REQUIRED | 401 | Access token expired, use refresh token |
| AUTH_INVALID_CREDENTIALS | 401 | Wrong email or password |
| AUTH_ACCOUNT_LOCKED | 403 | Account locked due to failed attempts |
| AUTH_ACCOUNT_SUSPENDED | 403 | Account has been suspended |
| AUTH_EMAIL_NOT_VERIFIED | 403 | Email verification required |
Authorization Errors (AUTHZ_*)
| Code | HTTP Status | Description |
|---|---|---|
| AUTHZ_PERMISSION_DENIED | 403 | Lacks required permission |
| AUTHZ_RESOURCE_ACCESS_DENIED | 403 | Can’t access this specific resource |
| AUTHZ_ROLE_REQUIRED | 403 | Specific role required |
| AUTHZ_TENANT_MISMATCH | 403 | Resource belongs to different tenant |
| AUTHZ_AGENCY_MISMATCH | 403 | Resource belongs to different agency |
Validation Errors (VAL_*)
| Code | HTTP Status | Description |
|---|---|---|
| VAL_REQUIRED | 400 | Required field missing |
| VAL_INVALID_FORMAT | 400 | Field format invalid |
| VAL_INVALID_TYPE | 400 | Wrong data type |
| VAL_OUT_OF_RANGE | 400 | Value outside allowed range |
| VAL_TOO_LONG | 400 | String exceeds max length |
| VAL_TOO_SHORT | 400 | String below min length |
| VAL_INVALID_ENUM | 400 | Value not in allowed set |
| VAL_INVALID_EMAIL | 400 | Invalid email format |
| VAL_INVALID_PHONE | 400 | Invalid phone number format |
| VAL_INVALID_URL | 400 | Invalid URL format |
| VAL_INVALID_UUID | 400 | Invalid UUID format |
Resource Errors (RES_*)
| Code | HTTP Status | Description |
|---|---|---|
| RES_NOT_FOUND | 404 | Resource doesn’t exist |
| RES_ALREADY_EXISTS | 409 | Resource already exists |
| RES_CONFLICT | 409 | State conflict |
| RES_DELETED | 410 | Resource was deleted |
| RES_LOCKED | 423 | Resource is locked |
Business Logic Errors (BIZ_*)
| Code | HTTP Status | Description |
|---|---|---|
| BIZ_QUOTA_EXCEEDED | 422 | Quota limit reached |
| BIZ_SUBSCRIPTION_REQUIRED | 422 | Feature requires subscription |
| BIZ_INVALID_STATE | 422 | Invalid state transition |
| BIZ_DEPENDENCY_EXISTS | 422 | Can’t delete, has dependencies |
| BIZ_OPERATION_FAILED | 422 | Business operation failed |
Rate Limiting Errors (RATE_*)
| Code | HTTP Status | Description |
|---|---|---|
| RATE_LIMIT_EXCEEDED | 429 | Too many requests |
| RATE_LIMIT_MINUTE | 429 | Per-minute limit exceeded |
| RATE_LIMIT_HOUR | 429 | Per-hour limit exceeded |
| RATE_LIMIT_DAY | 429 | Per-day limit exceeded |
Server Errors (SRV_*)
| Code | HTTP Status | Description |
|---|---|---|
| SRV_INTERNAL_ERROR | 500 | Unexpected server error |
| SRV_DATABASE_ERROR | 500 | Database operation failed |
| SRV_EXTERNAL_SERVICE | 502 | External service error |
| SRV_TIMEOUT | 504 | Operation timed out |
| SRV_MAINTENANCE | 503 | Server in maintenance mode |
Section 14: Authentication
14.1 Authentication Flow Overview
We use JWT (JSON Web Tokens) for authentication.
14.2 JWT Token Structure
Access Token
The access token payload contains these claims:
| Field | Description |
|---|---|
| sub | Subject - User ID |
| email | User’s email address |
| role | User’s role (platform_admin, agency_admin, etc.) |
| agency_id | Agency the user belongs to (null for platform admins) |
| tenant_id | Tenant the user belongs to (null for agency users) |
| permissions | Array of permission codes |
| iat | Issued at (Unix timestamp) |
| exp | Expiration time (Unix timestamp) |
| jti | JWT ID - unique identifier for this token |
Token lifetimes:
- Access token: 24 hours
- Refresh token: 30 days
Refresh Token
Refresh tokens are opaque strings stored in the database.
14.3 Authentication Endpoints
POST /api/v1/auth/login
Authenticate user and receive tokens.
POST /api/v1/auth/refresh
Get a new access token using the refresh token.
POST /api/v1/auth/logout
Revoke the refresh token.
POST /api/v1/auth/password/forgot
Request a password reset.
POST /api/v1/auth/password/reset
Reset password with a token.
POST /api/v1/auth/email/verify
Verify email address.
GET /api/v1/auth/me
Get current user information. Requires the Authorization: Bearer header.
14.4 Token Validation Implementation
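The implementation body was lost in conversion. Here is a stdlib-only sketch of HS256 validation consistent with the claims table and error codes above (production code would normally use a library such as PyJWT rather than hand-rolling this):

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(claims: dict, secret: str) -> str:
    """Sign an HS256 JWT (helper for tests and examples)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"

def validate_token(token: str, secret: str) -> dict:
    """Verify signature and expiry; return claims or raise with an AUTH_* code."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("AUTH_TOKEN_INVALID")
    expected = hmac.new(secret.encode(), f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("AUTH_TOKEN_INVALID")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("AUTH_TOKEN_EXPIRED")
    return claims
```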
Section 15: Authorization (RBAC)
15.1 Role-Based Access Control Overview
Authorization determines what an authenticated user can do.
Role Hierarchy
Scope Isolation
15.2 Permission Matrix
What Each Role Can Do
| Permission | Platform Admin | Agency Admin | Agency User | Tenant Admin | Tenant User |
|---|---|---|---|---|---|
| Agencies | |||||
| agencies.view | ✅ | Own only | ❌ | ❌ | ❌ |
| agencies.create | ✅ | ❌ | ❌ | ❌ | ❌ |
| agencies.edit | ✅ | Own only | ❌ | ❌ | ❌ |
| agencies.delete | ✅ | ❌ | ❌ | ❌ | ❌ |
| Tenants | |||||
| tenants.view | ✅ | ✅ | ✅ | Own only | Own only |
| tenants.create | ✅ | ✅ | ❌ | ❌ | ❌ |
| tenants.edit | ✅ | ✅ | ❌ | Own only | ❌ |
| tenants.delete | ✅ | ✅ | ❌ | ❌ | ❌ |
| Phone Numbers | |||||
| phone_numbers.view | ✅ | ✅ | ✅ | ✅ | ✅ |
| phone_numbers.provision | ✅ | ✅ | ❌ | ❌ | ❌ |
| phone_numbers.configure | ✅ | ✅ | ❌ | ✅ | ❌ |
| phone_numbers.release | ✅ | ✅ | ❌ | ❌ | ❌ |
| Calls | |||||
| calls.view | ✅ | ✅ | ✅ | ✅ | ✅ |
| calls.listen | ✅ | ✅ | ✅ | ✅ | ✅ |
| calls.export | ✅ | ✅ | ❌ | ✅ | ❌ |
| calls.delete | ✅ | ✅ | ❌ | ❌ | ❌ |
| Knowledge Base | |||||
| knowledge.view | ✅ | ✅ | ✅ | ✅ | ✅ |
| knowledge.create | ✅ | ✅ | ❌ | ✅ | ❌ |
| knowledge.edit | ✅ | ✅ | ❌ | ✅ | ❌ |
| knowledge.delete | ✅ | ✅ | ❌ | ✅ | ❌ |
| Analytics | |||||
| analytics.view | ✅ | ✅ | ✅ | ✅ | ✅ |
| analytics.export | ✅ | ✅ | ❌ | ✅ | ❌ |
| analytics.advanced | ✅ | ✅ | ❌ | ❌ | ❌ |
| Users | |||||
| users.view | ✅ | ✅ | Own only | ✅ | Own only |
| users.create | ✅ | ✅ | ❌ | ✅ | ❌ |
| users.edit | ✅ | ✅ | Own only | ✅ | Own only |
| users.delete | ✅ | ✅ | ❌ | ✅ | ❌ |
| Settings | |||||
| settings.view | ✅ | ✅ | ✅ | ✅ | ✅ |
| settings.edit | ✅ | ✅ | ❌ | ✅ | ❌ |
| Billing | |||||
| billing.view | ✅ | ✅ | ❌ | ❌ | ❌ |
| billing.manage | ✅ | ✅ | ❌ | ❌ | ❌ |
15.3 Authorization Implementation
Permission Check Decorator
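The decorator body was lost in conversion. A minimal sketch consistent with the permission codes in 15.2 (framework wiring and the real user object are omitted; `list_calls` is a hypothetical handler):

```python
import functools

class PermissionDenied(Exception):
    """Maps to 403 with code AUTHZ_PERMISSION_DENIED."""

def require_permission(permission: str):
    """Reject the call unless the authenticated user's claims grant `permission`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user: dict, *args, **kwargs):
            if permission not in user.get("permissions", []):
                raise PermissionDenied(permission)
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@require_permission("calls.view")
def list_calls(user: dict) -> str:
    # Hypothetical handler used only to demonstrate the decorator.
    return f"calls for tenant {user['tenant_id']}"
```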
Resource Access Check
Query Filtering by Scope
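The query-filtering body was lost in conversion. A sketch of translating the token's role into query constraints (role names come from the matrix in 15.2; the returned dict stands in for WHERE-clause parameters):

```python
def scope_filter(user: dict) -> dict:
    """WHERE-clause parameters implied by the caller's role (sketch)."""
    role = user["role"]
    if role == "platform_admin":
        return {}                                  # sees everything
    if role in ("agency_admin", "agency_user"):
        return {"agency_id": user["agency_id"]}    # agency-wide scope
    return {"tenant_id": user["tenant_id"]}        # tenant-level scope
```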
Section 16: Pagination, Filtering & Sorting
16.1 Pagination
Offset-Based Pagination
For most list endpoints, we use offset-based pagination:
- limit - Number of items per page (default: 50, max: 100)
- offset - Number of items to skip (default: 0)
Cursor-Based Pagination (For Large Datasets)
For real-time or very large datasets, use cursor-based pagination:
- limit - Number of items per page
- cursor - Opaque cursor from previous response
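One common way to implement the opaque cursor is base64-encoded keyset values (a sketch; the real cursor format is an internal implementation detail):

```python
import base64
import json

def encode_cursor(created_at: str, row_id: str) -> str:
    """Pack the last row's sort key into an opaque string for the client."""
    raw = json.dumps({"created_at": created_at, "id": row_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> dict:
    """Recover the keyset values to resume after in the next page's WHERE clause."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))
```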
16.2 Filtering
Filter Syntax
Filters are query parameters with the field name and value.
Filter Operators
For advanced filtering, use operators:
| Operator | Syntax | Example | Meaning |
|---|---|---|---|
| Equals | field=value | status=completed | Exact match |
| Not equals | field[ne]=value | status[ne]=failed | Not equal |
| Greater than | field[gt]=value | duration[gt]=60 | Greater than |
| Greater or equal | field[gte]=value | duration[gte]=60 | Greater or equal |
| Less than | field[lt]=value | duration[lt]=300 | Less than |
| Less or equal | field[lte]=value | duration[lte]=300 | Less or equal |
| In list | field[in]=a,b,c | status[in]=completed,transferred | In list |
| Not in list | field[nin]=a,b,c | status[nin]=failed,cancelled | Not in list |
| Contains | field[contains]=value | from_number[contains]=555 | Contains substring |
| Starts with | field[starts]=value | from_number[starts]=+1 | Starts with |
| Is null | field[null]=true | ended_at[null]=true | Is null |
| Is not null | field[null]=false | ended_at[null]=false | Is not null |
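Parsing the `field[op]=value` syntax from the table above can be sketched as:

```python
import re

_FILTER_KEY = re.compile(r"^(\w+)(?:\[(\w+)\])?$")
_ALLOWED_OPS = {"eq", "ne", "gt", "gte", "lt", "lte", "in", "nin",
                "contains", "starts", "null"}

def parse_filter(key: str, value: str):
    """Split a query-string key like 'duration[gte]' into (field, op, value)."""
    match = _FILTER_KEY.match(key)
    if not match:
        raise ValueError(f"VAL_INVALID_FORMAT: {key}")
    field, op = match.group(1), match.group(2) or "eq"   # bare field means equals
    if op not in _ALLOWED_OPS:
        raise ValueError(f"VAL_INVALID_FORMAT: unknown operator {op}")
    return field, op, value
```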
Date Range Filtering
Filterable Fields by Endpoint
Calls:
- status - Call status
- direction - inbound/outbound
- from_number - Caller number
- to_number - Destination number
- phone_number_id - Phone number UUID
- initiated_at - Call start time
- answered_at - Answer time
- ended_at - End time
- duration - Duration in seconds
- outcome - Call outcome
- sentiment_label - positive/neutral/negative
Tenants:
- status - Tenant status
- business_type - Business category
- created_at - Creation date
Phone Numbers:
- status - Number status
- provider - Telephony provider
16.3 Sorting
Sort Syntax
Use the sort parameter with field names; prefix a field with - for descending order.
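A sort string such as `-initiated_at,status` parses naturally into (field, direction) pairs:

```python
def parse_sort(sort: str):
    """'-initiated_at,status' -> [('initiated_at', 'desc'), ('status', 'asc')]"""
    pairs = []
    for token in sort.split(","):
        token = token.strip()
        if token.startswith("-"):
            pairs.append((token[1:], "desc"))
        elif token:
            pairs.append((token, "asc"))
    return pairs
```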
Sortable Fields by Endpoint
Calls:
- initiated_at (default: -initiated_at)
- answered_at
- ended_at
- duration
- status
- from_number
Tenants:
- name
- created_at (default: -created_at)
- status
Users:
- email
- first_name
- last_name
- created_at
- last_login_at
16.4 Search
Full-text search on specific endpoints covers:
- Transcript content
- Caller number
- Outcome description
- Document name
- Document content
Section 17: Agency Management API
17.1 List Agencies
Endpoint: GET /api/v1/agencies
Authorization: platform_admin only
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| limit | integer | Items per page (default: 50, max: 100) |
| offset | integer | Items to skip |
| status | string | Filter by status |
| billing_plan | string | Filter by billing plan |
| sort | string | Sort field (default: -created_at) |
| q | string | Search name, contact_email |
Request:
17.2 Get Agency
Authorization: platform_admin or agency_admin/agency_user (own agency only)
Request:
17.3 Create Agency
Endpoint: POST /api/v1/agencies
Authorization: platform_admin only
Request:
17.4 Update Agency
Authorization: platform_admin or agency_admin (own agency, limited fields)
Request (Platform Admin - Full Access):
- name
- contact_email, contact_phone, contact_name
- address fields
- settings (branding, defaults)
- slug
- status
- billing_plan
- max_tenants, max_concurrent_calls
17.5 Delete Agency
Authorization: platform_admin only
Request:
Section 18: Tenant Management API
18.1 List Tenants
Endpoint: GET /api/v1/tenants
18.2 Get Tenant
18.3 Create Tenant
Endpoint: POST /api/v1/tenants
Authorization: platform_admin or agency_admin
Request:
18.4 Update Tenant
Partial update; omitted fields are left unchanged, and a field can be cleared by setting it to null.
Response (200):
18.5 Delete Tenant
Authorization: platform_admin or agency_admin
Response (204): No content
Error (422):
18.6 Get Tenant Settings
Authorization: tenant_admin or higher
Response (200):
Section 19: Phone Number Management API
19.1 List Phone Numbers
Endpoint: GET /api/v1/phone-numbers
19.2 Search Available Numbers
Endpoint: GET /api/v1/phone-numbers/available
Authorization: agency_admin or higher
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| area_code | string | Filter by area code (e.g., “404”) |
| contains | string | Number contains pattern |
| state | string | US state code (e.g., “GA”) |
| country | string | Country code (default: “US”) |
| limit | integer | Results to return (default: 20) |
Request:
19.3 Provision Phone Number
Endpoint: POST /api/v1/phone-numbers
Authorization: agency_admin or higher
Request:
19.4 Update Phone Number
Authorization: agency_admin or tenant_admin (own tenant)
Request:
19.5 Release Phone Number
Authorization: agency_admin or higher
Response (204): No content
Note: This releases the number back to the provider. The number may be reassigned to someone else. This action cannot be undone.
Confirmation Required:
For safety, this endpoint requires a confirmation header.
Section 20: Call Management API
20.1 List Calls
Endpoint: GET /api/v1/calls
20.2 Get Call
20.3 Get Call Transcript
20.4 Get Call Recording
Authorization: calls.listen permission
Response (200):
20.5 Get Call Events
20.6 Initiate Call Transfer
20.7 End Call
Section 21: Knowledge Base API
21.1 Get Knowledge Base
21.2 List Documents
Authorization: knowledge.view permission
Response (200):
21.3 Create Document (Text)
Authorization: knowledge.create permission
Request (Text Document):
21.4 Create Document (File Upload)
Authorization: knowledge.create permission
Request: multipart/form-data
- PDF (.pdf)
- Word (.docx)
- Text (.txt)
- Markdown (.md)
21.5 Create Document (URL)
21.6 Get Document
Authorization: knowledge.view permission
Response (200):
21.7 Update Document
Authorization: knowledge.edit permission
Request:
21.8 Delete Document
Authorization: knowledge.delete permission
Response (204): No content
21.9 Search Knowledge Base
Authorization: knowledge.view permission
Request:
Section 22: Webhook Endpoints (Inbound)
These endpoints receive webhook notifications from external services.
22.1 GoToConnect Webhooks
Endpoint: POST /api/v1/webhooks/gotoconnect
Authentication: Webhook signature validation
Signature Validation
GoToConnect signs webhooks with HMAC-SHA256; compute the digest over the raw request body and compare it to the signature header before processing any event.
Event: call.ringing
Received when an incoming call starts ringing.
Payload:
Processing steps:
- Look up phone number +15551234567 → find tenant
- Create call record in database
- Trigger WebRTC bridge to answer
- Create LiveKit room
- Dispatch AI agent
Event: call.answered
Received when the call is answered.
Payload:
Processing steps:
- Update call record: status = answered, answered_at = timestamp
- Record call_event
Event: call.ended
Received when the call ends.
Payload:
Processing steps:
- Update call record: status = completed, ended_at, duration_seconds
- Close LiveKit room
- Trigger post-call processing (transcript finalization, analytics)
22.2 LiveKit Webhooks
Endpoint: POST /api/v1/webhooks/livekit
Authentication: Webhook signature validation (JWT)
Signature Validation
LiveKit signs webhooks with the API secret, delivered as a JWT.
Event: room_started
Payload:
Event: room_finished
Payload:
Processing steps:
- Mark LiveKit room as closed
- If call still marked as active, end it
Event: participant_joined
Payload:
Processing steps:
- Record call_event: participant_joined
- Update participant count
Event: participant_left
Payload:
Processing steps:
- Record call_event: participant_left
- If caller left, initiate call ending
Event: track_published
Payload:
Event: egress_ended
Payload:
Processing steps:
- Update recording record with file info
- Mark recording as ready
- Trigger transcript processing if enabled
22.3 Deepgram Webhooks (If Using Callback Mode)
Endpoint: POST /api/v1/webhooks/deepgram
Note: We primarily use streaming WebSocket, but callback mode is used for batch transcription.
Payload:
22.4 Webhook Security Best Practices
1. Always Validate Signatures
2. Check Timestamp Freshness
3. Use HTTPS Only
4. Idempotency
5. Respond Quickly
6. Retry Logic on Failures
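Practices 1, 2, and 4 above can be sketched together (header names, digest encoding, and the freshness tolerance are assumptions; always follow each provider's documented scheme):

```python
import hashlib
import hmac
import time

def signature_valid(secret: str, raw_body: bytes, signature_hex: str) -> bool:
    """Practice 1: constant-time comparison of the body's HMAC-SHA256 digest."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

class WebhookGuard:
    """Practices 2 and 4: reject stale timestamps and duplicate event IDs.
    In-memory for illustration; production would back this with Redis."""

    def __init__(self, max_age_seconds: int = 300):
        self.max_age = max_age_seconds
        self._seen: set = set()

    def accept(self, event_id: str, event_timestamp: float) -> bool:
        if abs(time.time() - event_timestamp) > self.max_age:
            return False                  # too old: possible replay
        if event_id in self._seen:
            return False                  # duplicate delivery
        self._seen.add(event_id)
        return True
```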
External services retry failed webhooks; handlers must detect and tolerate duplicate deliveries.
End of Part 3
You now have:
- ✅ Complete REST API architecture and conventions
- ✅ Full authentication system with JWT
- ✅ Comprehensive RBAC authorization
- ✅ Pagination, filtering, and sorting patterns
- ✅ Complete endpoint specifications for all resources
- ✅ Inbound webhook handling
Coming in Part 4:
- GoToConnect account setup
- OAuth 2.0 implementation
- Webhook event processing
- Call control API usage
- Phone number management
- Ooma WebRTC Softphone automation
Document End - Part 3 of 10
Junior Developer PRD - Part 4: GoToConnect Integration
Document Version: 1.0
Last Updated: January 25, 2026
Part: 4 of 10
Sections: 23-30
Audience: Junior developers with no prior context
Section 23: GoToConnect Overview
23.1 What is GoToConnect
GoToConnect (formerly Jive) is a cloud-based business phone system owned by GoTo (formerly LogMeIn). It provides:
- VoIP Phone Service - Cloud-hosted phone lines
- Phone Numbers - Provision and manage DIDs
- Call Routing - PBX functionality
- WebRTC Calling - Browser-based soft phone
- APIs - Programmatic control of calls and data
23.2 GoToConnect API Architecture
GoToConnect provides several API domains:
| API | Base URL | Purpose |
|---|---|---|
| Authentication | https://authentication.logmeininc.com | OAuth 2.0 token management |
| Admin | https://api.goto.com | Account & user management |
| Voice Admin | https://api.goto.com/voice-admin/v1 | Phone system configuration |
| Web Calls | https://webrtc.jive.com/web-calls-v1 | WebRTC call control |
| Call Events | https://api.goto.com/call-events-report/v1 | Call event reporting & webhooks |
| Recording | https://api.goto.com/recording/v1 | Call recording access |
| Notification Channel | https://api.goto.com/notification-channel/v1 | Webhook subscriptions |
23.3 Authentication Scopes
GoToConnect uses OAuth 2.0 scopes to control API access:
| Scope | Permission |
|---|---|
| identity:scim.me | Read user identity |
| voice-admin.v1.read | Read phone system config |
| voice-admin.v1.write | Modify phone system config |
| call-events.v1.notifications.manage | Manage call event webhooks |
| call-events.v1.reads.read | Read call history |
| calls.v2.initiate | Initiate outbound calls |
| cr.v1.read | Read call recordings |
| users.v1.lines.read | Read user line assignments |
23.4 Key Concepts
Account Key
Every GoToConnect organization has an accountKey - a unique identifier for the account. This is required for most API calls.
Line
A “line” represents a phone line/extension in the system. Each user can have multiple lines assigned. Lines have:
- Extension number (e.g., “1001”)
- Phone numbers (DIDs) associated
- Call forwarding rules
Device
A device is an endpoint that can make/receive calls:
- Physical desk phone
- Mobile app
- WebRTC softphone (browser)
Session
For WebRTC calls, a “session” represents an active connection between a client and GoToConnect’s WebRTC infrastructure.
Section 24: OAuth 2.0 Authentication
24.1 OAuth Flow Overview
GoToConnect uses OAuth 2.0 Authorization Code flow.
24.2 Step 1: Redirect to Authorization
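Building the authorization URL follows standard OAuth 2.0 (the base URL is from Section 23.2, but the exact `/oauth/authorize` path is an assumption; confirm against GoTo's developer documentation):

```python
from urllib.parse import urlencode

AUTH_BASE = "https://authentication.logmeininc.com"   # from Section 23.2

def build_authorize_url(client_id: str, redirect_uri: str, state: str, scopes: list) -> str:
    """Authorization-code request URL; `state` guards against CSRF."""
    params = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,
    })
    return f"{AUTH_BASE}/oauth/authorize?{params}"   # path assumed
```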
Build the authorization URL and redirect the user to it.
24.3 Step 2: Handle Callback
When the user authorizes, GoToConnect redirects to your callback URL with an authorization code.
24.4 Step 3: Exchange Code for Tokens
24.5 Token Refresh
Access tokens expire (typically after 1 hour). Use the refresh token to obtain a new access token.
24.6 Token Storage Schema
Section 25: Voice Admin API
The Voice Admin API manages phone system configuration.
25.1 Get Account Information
Retrieve account configuration details. Required Scope: voice-admin.v1.read
25.2 List Lines (Extensions)
Get all phone lines in the account. Endpoint: GET /voice-admin/v1/lines
Required Scope: voice-admin.v1.read
25.3 Get Line Details
Get details for a specific line.
25.4 Search Available Phone Numbers
Search for phone numbers available for purchase. Endpoint: GET /voice-admin/v1/phone-numbers/available
Required Scope: voice-admin.v1.read
25.5 Create Phone Number Order
Order a new phone number. Endpoint: POST /voice-admin/v1/phone-number-orders
Required Scope: voice-admin.v1.write
25.6 Get Phone Number Order Status
Check the status of a phone number order.
25.7 Assign Phone Number to Line
Assign a provisioned number to a specific line.
Section 26: WebRTC Call Control API
The Web Calls API (webrtc.jive.com) provides programmatic control of WebRTC calls. This is how we answer incoming calls and route audio to our AI system.
26.1 Understanding WebRTC Sessions
A WebRTC session represents an authenticated connection to GoToConnect’s real-time infrastructure.
26.2 Create WebRTC Session
Create a session to receive and control calls. Endpoint: POST /web-calls-v1/sessions
Base URL: https://webrtc.jive.com
26.3 Session WebSocket Events
Connect to the WebSocket URL to receive real-time events.
Incoming Call Event
Call State Changed Event
26.4 Answer Incoming Call
Answer a ringing call.
26.5 Place Call on Hold
Put an active call on hold.
26.6 Resume Call from Hold
Take a call off hold.
26.7 Mute/Unmute
Control audio muting.
26.8 Send DTMF Tones
Send touch-tone digits during a call.
26.9 Blind Transfer
Transfer a call directly to another number without consultation.
26.10 Warm Transfer
Transfer with consultation - speak to the recipient before completing the transfer.
26.11 End Call (Hangup)
Terminate an active call.
26.12 Refresh Session
Keep the session alive (call periodically).
Section 27: Notification Channel API (Webhooks)
The Notification Channel API lets us subscribe to events via webhooks.
27.1 Create Notification Channel
Register a webhook URL to receive events. Endpoint: POST /notification-channel/v1/channels
27.2 Extend Channel Lifetime
Prevent the channel from expiring.
27.3 Delete Notification Channel
Remove a webhook subscription.
Section 28: Call Events API
Subscribe to and receive call events via webhooks.28.1 Create Call Events Subscription
Subscribe to call events for specific lines. Endpoint: POST /call-events-report/v1/subscriptions
28.2 Webhook Event Payloads
Call Started Event
Call Answered Event
Call Ended Event
Call Transferred Event
28.3 Process Webhook Events
Section 29: Recording API
Access call recordings from GoToConnect.
29.1 Get Recording Content
Download the audio file for a recording.
29.2 Get Recording Token URL
Get a time-limited URL to access a recording.
29.3 Subscribe to Recording Events
Get notified when recordings are available. Endpoint: POST /recording/v1/subscriptions
Recording Available Event
Section 30: Complete Integration Flow
30.1 Initial Setup Flow
When an agency connects GoToConnect, the initial setup flow runs.
30.2 Phone Number Provisioning Flow
Runs when provisioning a new phone number for a tenant.
30.3 Call Handling Flow
The complete flow when an inbound call arrives.
30.4 Background Jobs
Channel Lifetime Renewal
Token Refresh
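The job's body was lost in conversion. A sketch of the expiry check and the standard OAuth 2.0 refresh-grant request body (the stored-token shape and the 120-second skew are assumptions):

```python
import time
from dataclasses import dataclass
from urllib.parse import urlencode

@dataclass
class StoredTokens:
    access_token: str
    refresh_token: str
    expires_at: float            # Unix timestamp

    def needs_refresh(self, skew_seconds: int = 120) -> bool:
        """Refresh proactively, shortly before the real expiry."""
        return time.time() >= self.expires_at - skew_seconds

def refresh_request_body(client_id: str, client_secret: str, refresh_token: str) -> str:
    """Form-encoded body for the token endpoint's refresh_token grant."""
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })
```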
30.5 Error Handling Reference
GoToConnect Error Codes
| Error Code | HTTP Status | Meaning | Action |
|---|---|---|---|
| AUTHN_INVALID_TOKEN | 401 | Token invalid | Refresh or reauthorize |
| AUTHN_EXPIRED_TOKEN | 401 | Token expired | Refresh token |
| AUTHN_MALFORMED_TOKEN | 401 | Token malformed | Reauthorize |
| AUTHZ_INSUFFICIENT_SCOPE | 403 | Missing scope | Request additional scopes |
| NOT_FOUND | 404 | Resource not found | Check IDs |
| INVALID_ACCOUNT_KEY | 400 | Bad account key | Verify account key |
| INVALID_AREA_CODE | 400 | Invalid area code | Use valid area code |
| NO_PHONE_NUMBER_FOUND | 400 | No numbers available | Try different area code |
| TOO_MANY_REQUESTS | 429 | Rate limited | Back off and retry |
| UNKNOWN_ERROR | 500 | Server error | Retry with backoff |
Error Handling Strategy
End of Part 4
You now have:
- ✅ Complete OAuth 2.0 authentication implementation
- ✅ Voice Admin API for account and phone number management
- ✅ WebRTC Call Control API for programmatic call handling
- ✅ Notification Channel API for webhooks
- ✅ Call Events API for event subscriptions
- ✅ Recording API for call recording access
- ✅ Complete integration flows with code examples
Coming in Part 5:
- Bridge architecture connecting GoToConnect WebRTC to LiveKit
- Audio capture and forwarding
- SDP negotiation handling
- Bridge state management
- Error recovery and reconnection
Document End - Part 4 of 10
Junior Developer PRD - Part 5: WebRTC Bridge Service
Document Version: 1.0
Last Updated: January 25, 2026
Sections: 28-33
Estimated Reading Time: 45 minutes
How to Use This Document
This is Part 5 of a 10-part PRD series. Each part is designed to be read in order, building on concepts from previous parts.
Prerequisites: Before reading this document, you should have completed:
- Part 1: Foundation & Context (understanding of the overall system)
- Part 2: Database Design (understanding of data models)
- Part 3: API Design (understanding of REST endpoints)
- Part 4: GoToConnect Integration (understanding of telephony layer)
What you’ll learn in this part:
- What the WebRTC Bridge does and why it exists
- How audio flows between phone callers and AI agents
- How to implement bidirectional audio streaming
- How to manage WebRTC connections with aiortc
- How to integrate with LiveKit for real-time communication
- How to handle connection lifecycle and error recovery
Table of Contents
- Section 28: Bridge Architecture
- Section 29: aiortc WebRTC Implementation
- Section 30: Audio Capture & Processing
- Section 31: LiveKit Connection
- Section 32: Audio Routing
- Section 33: Bridge Lifecycle
Section 28: Bridge Architecture
28.1 What is the WebRTC Bridge?
The WebRTC Bridge is the most critical component in our voice infrastructure. It’s the “glue” that connects telephone callers to our AI processing pipeline. Without it, there’s no way to get audio from a phone call into our AI system.
The Problem It Solves
When someone calls our platform:
- Their phone call arrives via PSTN (public phone network)
- GoToConnect receives the call and converts it to WebRTC audio
- But how do we get that audio to our AI agent?
The WebRTC Bridge solves this. It:
- Establishes a WebRTC connection with GoToConnect to receive caller audio
- Establishes a separate connection with LiveKit to send audio to AI agents
- Routes audio bidirectionally between these two connections
- Handles codec conversion, resampling, and buffering
Why Can’t We Just Connect GoToConnect Directly to LiveKit?
Good question! In theory, both use WebRTC. But there are several problems:
- Different Authentication: GoToConnect uses OAuth + proprietary signaling; LiveKit uses JWT tokens
- Different Signaling: GoToConnect controls SDP exchange through their API; LiveKit uses their own protocol
- Codec Negotiation: GoToConnect may offer different codecs than LiveKit expects
- Participant Management: LiveKit needs to track participants in rooms; GoToConnect doesn’t know about rooms
- Processing Opportunity: We need to capture audio for our AI pipeline anyway
28.2 High-Level Architecture
28.3 Component Responsibilities
GoTo Connection Handler
- Receives SDP offers from GoToConnect
- Negotiates audio codecs (prefers Opus, falls back to G.711)
- Manages ICE candidate exchange
- Receives caller audio from GoToConnect WebRTC
- Sends AI response audio back to GoToConnect
Audio Bridge
- Decodes incoming audio (Opus or G.711 to PCM)
- Resamples audio between different sample rates (8kHz, 16kHz, 48kHz)
- Buffers audio to handle timing variations
- Encodes outgoing audio (PCM to Opus)
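Resampling between 8 kHz (G.711), 16 kHz (STT), and 48 kHz (Opus) is central to the bridge. A naive linear-interpolation version for illustration only (production code would use a proper polyphase resampler such as soxr or libsamplerate):

```python
def resample_linear(samples, src_rate: int, dst_rate: int):
    """Resample a mono float signal by linear interpolation (illustration only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate           # source samples per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1.0 - frac) + frac * nxt)
    return out
```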
LiveKit Connection Handler
- Creates and joins LiveKit rooms
- Publishes caller audio as a track
- Subscribes to AI agent audio tracks
- Manages participant lifecycle
28.4 Design Goals
| Goal | Target | Why It Matters |
|---|---|---|
| Audio latency | < 50ms bridge overhead | Users notice delays > 150ms total |
| Connection setup | < 2 seconds | Callers expect fast answers |
| Audio quality | No degradation | Poor quality = poor user experience |
| Reliability | 99.9% call completion | Dropped calls lose customers |
| Scalability | 1000 concurrent calls | Support growth |
| Resource efficiency | < 50MB RAM per call | Keep infrastructure costs low |
28.5 Technology Choice: aiortc
We use aiortc (a Python asyncio WebRTC implementation) for the bridge. Here’s why:

What is aiortc?
aiortc is a Python library that implements the WebRTC specification. It provides:
- Full WebRTC stack in pure Python
- asyncio-based for non-blocking I/O
- Built-in codecs (Opus, G.711, VP8, H.264)
- Support for audio/video/data channels
Why Not Browser-Based?
An alternative approach would be to run a headless browser (Playwright/Puppeteer) and use the browser’s WebRTC. We don’t do this because:
- Resource Usage: Each browser instance uses 200-500MB RAM; aiortc uses ~50MB
- Startup Time: Browsers take 2-5 seconds to start; aiortc is instant
- Direct Control: With aiortc, we have direct access to audio frames; browsers add abstraction
- Simpler Deployment: No need to install Chrome/Chromium in containers
- Better Debugging: Python code is easier to debug than browser internals
When Browser Automation IS Used
We do use Playwright in Part 4 for the Ooma softphone login automation. That’s a different use case - we need to authenticate to GoToConnect’s web interface. Once authenticated, we hand off to aiortc for the actual WebRTC connection.

28.6 Threading Model
The bridge uses multiple threads/tasks for performance:

Why Multiple Threads?
- Asyncio Event Loop: Handles all I/O operations (network, API calls)
- Media Thread: aiortc runs RTP/RTCP processing in a dedicated thread for timing accuracy
- Audio Processing Thread: Heavy operations like resampling don’t block the event loop
28.7 State Machine
The bridge follows a strict state machine to ensure consistent behavior.

28.8 Environment Variables
The bridge requires several environment variables; see the Environment Variables Reference (Section 4.4) for the full list.

Section 29: aiortc WebRTC Implementation
29.1 What is WebRTC?
Before diving into code, let’s understand WebRTC (Web Real-Time Communication):

Core Concepts
Peer Connection: A connection between two endpoints that can carry audio/video/data.
SDP (Session Description Protocol): A text format describing what media capabilities each peer has.

Offer/Answer Model
WebRTC uses an “offer/answer” model:
- Offerer creates an SDP offer listing their capabilities
- Answerer receives offer, creates answer with compatible settings
- Both exchange ICE candidates (network paths)
- Connection established when media flows
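To make step 1 concrete, here is an illustrative audio-only SDP offer (the session ID, port, and IP addresses are placeholder values, not from our system):

```text
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111 0
c=IN IP4 0.0.0.0
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
a=sendrecv
```

The m= line lists the payload types the offerer supports; the a=rtpmap lines map payload type 111 to Opus (48kHz, 2 channels) and 0 to G.711 µ-law.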
29.2 aiortc Basics
Installation
Core Classes
29.3 WebRTC Connection Implementation
Here’s our WebRTC connection wrapper:

29.4 SDP Negotiation
SDP parsing and manipulation is complex. Here’s our negotiator:

29.5 ICE Candidate Handling
ICE candidates are exchanged asynchronously (trickle ICE):

Section 30: Audio Capture & Processing
30.1 Audio Fundamentals
Before processing audio, understand these concepts:

Sample Rate
How many audio samples per second:
- 8000 Hz: Telephone quality (G.711)
- 16000 Hz: Wideband telephony
- 48000 Hz: High-quality audio (Opus default)
Bit Depth
How many bits per sample:
- 16-bit: Standard for voice (-32768 to 32767)
- 32-bit float: Used in processing
Frame Size
Audio is processed in chunks called “frames”:

| Duration | 8kHz | 16kHz | 48kHz |
|---|---|---|---|
| 10ms | 80 samples | 160 samples | 480 samples |
| 20ms | 160 samples | 320 samples | 960 samples |
| 40ms | 320 samples | 640 samples | 1920 samples |
Byte Size
For 16-bit mono audio, bytes = samples × 2:

| Duration | 8kHz | 16kHz | 48kHz |
|---|---|---|---|
| 10ms | 160 bytes | 320 bytes | 960 bytes |
| 20ms | 320 bytes | 640 bytes | 1920 bytes |
| 40ms | 640 bytes | 1280 bytes | 3840 bytes |
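Both tables follow from two formulas: samples = sample_rate × duration, and bytes = samples × 2 for 16-bit mono audio. A quick sanity check in Python:

```python
def frame_samples(sample_rate_hz: int, duration_ms: int) -> int:
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * duration_ms // 1000

def frame_bytes(sample_rate_hz: int, duration_ms: int) -> int:
    """Byte size of one 16-bit mono frame (2 bytes per sample)."""
    return frame_samples(sample_rate_hz, duration_ms) * 2

# 20ms frames, the size used throughout the pipeline
print(frame_samples(48000, 20))  # 960 samples
print(frame_bytes(16000, 20))    # 640 bytes
```
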
30.2 Audio Frame Processing with PyAV
aiortc uses PyAV (FFmpeg bindings) for audio frames:

30.3 Audio Buffering
Audio streams need buffering to handle timing variations:

30.4 Audio Resampling
Different parts of the pipeline use different sample rates:

30.5 Custom Audio Tracks
We need custom track implementations for both receiving and sending audio:

Section 31: LiveKit Connection
31.1 LiveKit Overview
LiveKit is an open-source WebRTC SFU (Selective Forwarding Unit) that provides:
- Room-based architecture for real-time communication
- Server-side SDKs for Python, Go, Node.js
- Low-latency media routing
- Automatic scaling
Why LiveKit?
| Feature | Benefit |
|---|---|
| Room model | Logical grouping for calls |
| Participant management | Track who’s in each call |
| Server-side API | Create rooms, manage participants |
| Recording (Egress) | Built-in recording service |
| Agents framework | Dispatch AI agents to rooms |
Room Architecture for Calls
31.2 LiveKit Python SDK
Installation
The livekit package contains the real-time client SDK; the livekit-api package contains the server-side API client.
31.3 LiveKit Connection Handler
Here’s our LiveKit integration:

31.4 Token Generation
Tokens are generated for different participant types:

Section 32: Audio Routing
32.1 Bidirectional Audio Flow
The bridge routes audio in two directions:

32.2 Audio Bridge Implementation
Here’s the complete audio bridge that coordinates both directions:

32.3 Volume Normalization
Optional volume normalization to ensure consistent levels:

Section 33: Bridge Lifecycle
33.1 Complete Call Lifecycle
The bridge goes through distinct phases during a call:

33.2 Lifecycle Manager
The lifecycle manager coordinates all phases:

33.3 Bridge Manager (Service Level)
Manages all active bridges across the service:

33.4 Error Recovery
Handling various failure modes:

33.5 Testing the Bridge
Unit Tests
Integration Tests
Part 5 Summary
In this part, you learned about the WebRTC Bridge Service:

Section 28: Bridge Architecture
- The bridge connects GoToConnect (phone calls) to LiveKit (AI processing)
- Uses aiortc (Python WebRTC) for direct control over audio
- Multi-threaded design for performance
- State machine ensures consistent behavior
Section 29: aiortc WebRTC Implementation
- WebRTC uses offer/answer model for negotiation
- SDP describes media capabilities
- ICE handles NAT traversal for connectivity
- Custom connection wrapper for simplified usage
Section 30: Audio Capture & Processing
- Audio frames are processed at 20ms intervals
- Sample rates vary: 8kHz (G.711), 16kHz (wideband), 48kHz (Opus)
- Ring buffers handle timing variations
- Resampling converts between rates
Section 31: LiveKit Connection
- LiveKit provides room-based real-time communication
- Token-based authentication with scoped permissions
- Participants publish and subscribe to tracks
- Audio flows from bridge to agent and back
Section 32: Audio Routing
- Bidirectional audio: inbound (caller→agent) and outbound (agent→caller)
- Buffers smooth out timing jitter
- Optional volume normalization for consistency
Section 33: Bridge Lifecycle
- Five phases: Setup → Negotiation → Connection → Active → Teardown
- Lifecycle manager coordinates all components
- Metrics collected for monitoring
- Error recovery handles common failures
What’s Next
In Part 6: LiveKit Integration, you’ll learn:
- LiveKit Cloud setup and configuration
- Room management for calls
- AI Agent framework integration
- Recording with LiveKit Egress
- Real-time events and monitoring
End of Part 5
Junior Developer PRD - Part 6: LiveKit Integration
Comprehensive Implementation Guide for Junior Developers
Document Information
| Field | Value |
|---|---|
| Document Title | Junior Developer PRD - Part 6: LiveKit Integration |
| Version | 1.0.0 |
| Last Updated | January 2026 |
| Author | Voice by aiConnected Technical Team |
| Status | Draft |
| Audience | Junior Developers |
| Prerequisites | Parts 1-5 of this PRD |
| Estimated Reading Time | 45 minutes |
Table of Contents
- Section 34: LiveKit Cloud Setup
- Section 35: Room Management
- Section 36: Participant Management
- Section 37: Token Generation
- Section 38: Audio Track Handling
- Section 39: LiveKit Webhooks
- Section 40: Recording with Egress
Section 34: LiveKit Cloud Setup
34.1 Account Creation
What is LiveKit?
LiveKit is an open-source platform for real-time audio and video communication. Think of it as the infrastructure that enables multiple people (or AI agents and phone callers) to talk to each other in real time, similar to how Zoom or Google Meet works behind the scenes. For Voice by aiConnected, LiveKit serves as the central hub where:
- Phone callers (via GoToConnect) connect
- AI agents join to process speech
- Human supervisors can monitor calls
- All audio is routed between participants
Why LiveKit Cloud?
LiveKit offers two deployment options:

| Option | Description | When to Use |
|---|---|---|
| Self-hosted | You run LiveKit servers yourself | Large scale, strict data requirements |
| LiveKit Cloud | LiveKit manages servers for you | Faster setup, automatic scaling, global reach |
We use LiveKit Cloud because it:
- Eliminates server management overhead
- Provides automatic global distribution
- Scales automatically with call volume
- Reduces operational complexity
Creating a LiveKit Cloud Account
Step 1: Sign Up
Go to https://cloud.livekit.io and create an account.

Step 2: Create a Project
After signing in:
- Click “Create Project”
- Name it something like “voice-aiconnected-prod” or “voice-aiconnected-dev”
- Select a primary region (we use us-west-2)
Step 3: Get Your API Credentials
From your project settings, copy these two credentials:

| Credential | What It Is | Example Format |
|---|---|---|
| API Key | Public identifier for your project | APIxxxxxxxx |
| API Secret | Private key for signing tokens | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
LiveKit URLs
Your LiveKit Cloud project provides these URLs:

34.2 Project Configuration
Environment Variables
Create these environment variables for your LiveKit configuration:

Configuration Data Class
In Python, we create a configuration class to manage LiveKit settings:

Why These Settings Matter
| Setting | Purpose | Impact if Wrong |
|---|---|---|
| api_key | Identifies your project | Can’t authenticate |
| api_secret | Signs tokens | Tokens rejected |
| ws_url | Where clients connect | Connection fails |
| api_url | Where server calls go | Can’t create rooms |
| reconnect_attempts | Retry limit | Drops calls too easily OR hangs |
| room_empty_timeout | Room cleanup | Wastes resources OR drops calls |
34.3 API Credentials
Understanding API Keys vs API Secrets
Think of these like a username and password:

| Credential | Public/Private | Where Used | Can Be Shared? |
|---|---|---|---|
| API Key | Public | In tokens, logs, debugging | Yes |
| API Secret | PRIVATE | Only on server | NEVER |
How Credentials Are Used
Secure Credential Storage
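One safe pattern (a sketch - the variable names follow §34.2, and your deployment may use a secrets manager instead of raw environment variables) is to load credentials from the environment and fail fast at startup:

```python
import os

def require_env(name: str) -> str:
    """Read a required credential from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# BAD:  api_secret = "xxxxxxxxxxxxxxxxxx"   # hardcoded secret - never do this
# GOOD: load at startup so a misconfigured service fails before taking calls
# api_key = require_env("LIVEKIT_API_KEY")
# api_secret = require_env("LIVEKIT_API_SECRET")
```
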
DO NOT hardcode credentials in source code or commit them to version control; load them from environment variables or a secrets manager instead.

Credential Rotation
If you suspect your API secret has been compromised:
- Go to LiveKit Cloud dashboard
- Navigate to your project settings
- Click “Rotate API Secret”
- Update your environment variables immediately
- Restart all services
34.4 Webhook Configuration
What Are Webhooks?
Webhooks are HTTP callbacks that LiveKit sends to your server when events happen. Instead of constantly asking LiveKit “Did anything happen?”, LiveKit tells you when something happens.

Events LiveKit Can Notify You About
| Event | When It Fires | What You Might Do |
|---|---|---|
| room_started | Room is created | Start billing timer |
| room_finished | Room closes | Stop billing, save analytics |
| participant_joined | Someone joins | Update dashboard, log |
| participant_left | Someone leaves | Check if call ended |
| track_published | Audio/video starts | Verify connection working |
| egress_started | Recording begins | Log for compliance |
| egress_ended | Recording ends | Process/store recording |
Setting Up Webhooks in LiveKit Cloud
Step 1: Configure Your Webhook Endpoint
In your LiveKit Cloud project settings:
- Go to “Webhooks” section
- Add your endpoint URL: https://api.yourdomain.com/webhooks/livekit
- Select which events you want to receive
Webhook Security Best Practices
- Always validate signatures - Never process webhooks without checking the JWT
- Use HTTPS - LiveKit won’t send webhooks to HTTP endpoints
- Respond quickly - Return 200 OK within a few seconds
- Process asynchronously - Queue events for background processing
- Handle duplicates - Webhooks might be sent more than once
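Practices 3-5 combine naturally into one pattern: the HTTP handler records the event and returns immediately, while a background worker does the slow work. A minimal asyncio sketch (the event dictionary shape and its id field are assumptions for illustration, not LiveKit's exact payload):

```python
import asyncio

class WebhookQueue:
    """Accept webhook events quickly, process them in the background,
    and drop duplicate deliveries (webhooks may be sent more than once)."""

    def __init__(self):
        self.queue = asyncio.Queue()
        self.seen_ids = set()
        self.processed = []

    def accept(self, event):
        """Called from the HTTP handler; returns immediately so we can send 200 OK."""
        event_id = event.get("id", "")
        if event_id in self.seen_ids:
            return False              # duplicate delivery - ignore
        self.seen_ids.add(event_id)
        self.queue.put_nowait(event)  # hand off to the background worker
        return True

    async def worker(self):
        """Background task that does the slow work (DB writes, analytics, ...)."""
        while True:
            event = await self.queue.get()
            self.processed.append(event)  # placeholder for real processing
            self.queue.task_done()

async def demo():
    wq = WebhookQueue()
    task = asyncio.create_task(wq.worker())
    wq.accept({"id": "evt_1", "event": "room_started"})
    wq.accept({"id": "evt_1", "event": "room_started"})  # duplicate - dropped
    wq.accept({"id": "evt_2", "event": "participant_joined"})
    await wq.queue.join()   # wait until the worker has drained the queue
    task.cancel()
    return wq.processed
```
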
Section 35: Room Management
35.1 Room Naming Convention
Why Naming Conventions Matter
In LiveKit, rooms are identified by name. A good naming convention:
- Makes debugging easier (you can tell what a room is for)
- Enables filtering (find all rooms for a specific tenant)
- Prevents collisions (two different calls won’t share a room name)
- Supports multi-tenancy (isolate tenants from each other)
Voice by aiConnected Room Naming Format
Breaking Down the Format
| Component | Purpose | Rules | Example |
|---|---|---|---|
| type | What kind of call | call, outbound, transfer, conference, test | call |
| tenant_id | Which customer | Lowercase, alphanumeric with hyphens | acme-corp |
| call_id | Unique call identifier | UUID format | 550e8400-e29b-41d4... |
| suffix | Optional variant | Lowercase, alphanumeric with hyphens | warm |
Room Types
Room Naming Implementation
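A minimal sketch of the naming helper. The exact format string was defined above; this version assumes hyphen-joined components in the table's order (type, tenant_id, call_id, optional suffix):

```python
import re

# Room types from the table in Section 35.1
ROOM_TYPES = {"call", "outbound", "transfer", "conference", "test"}

def build_room_name(room_type, tenant_id, call_id, suffix=""):
    """Build a room name following the Section 35.1 convention (assumed hyphen-joined)."""
    if room_type not in ROOM_TYPES:
        raise ValueError(f"Unknown room type: {room_type}")
    # tenant_id rule from the table: lowercase, alphanumeric with hyphens
    if not re.fullmatch(r"[a-z0-9-]+", tenant_id):
        raise ValueError(f"Invalid tenant_id: {tenant_id}")
    name = f"{room_type}-{tenant_id}-{call_id}"
    if suffix:
        name += f"-{suffix}"
    return name
```
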
35.2 Room Creation Logic
When Rooms Are Created
Rooms are created at the start of a call:

Room Configuration
Room Service Implementation
35.3 Room Configuration Options
Configuration Options Explained
| Option | Type | Default | Purpose |
|---|---|---|---|
| empty_timeout | int (seconds) | 300 | How long room stays open with no participants |
| max_participants | int | 10 | Maximum concurrent participants |
| metadata | JSON string | - | Custom data stored with the room |
Different Configurations for Different Call Types
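As an illustration (these specific values are examples, not mandated settings), per-call-type configurations can live in a simple lookup keyed by room type, falling back to the §35.3 defaults:

```python
# Illustrative per-call-type room settings; values are examples only.
ROOM_CONFIGS = {
    "call":       {"empty_timeout": 300, "max_participants": 10},
    "conference": {"empty_timeout": 600, "max_participants": 50},
    "test":       {"empty_timeout": 60,  "max_participants": 2},
}

# Defaults from the options table in Section 35.3
DEFAULT_CONFIG = {"empty_timeout": 300, "max_participants": 10}

def room_config(room_type):
    """Return settings for a room type, falling back to the defaults."""
    return ROOM_CONFIGS.get(room_type, DEFAULT_CONFIG)
```
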
35.4 Room Deletion/Cleanup
When Rooms Are Deleted
Rooms are deleted:
- Automatically - when empty_timeout expires with no participants
- Explicitly - when we call delete_room after a call ends
- On Error - when something goes wrong and we need to clean up
Room Lifecycle States
Room Lifecycle Manager
Section 36: Participant Management
36.1 Participant Types
Who Joins LiveKit Rooms?
In Voice by aiConnected, several types of participants can join a call room:

Participant Type Implementation
36.2 Participant Identity Format
Structured Identities
We use a structured format for participant identities that encodes:
- What type of participant they are
- Which tenant they belong to
- A unique identifier
Why Structured Identities?
| Benefit | Explanation |
|---|---|
| Debugging | Instantly see who’s who in logs |
| Filtering | Find all agents, all callers for a tenant |
| Security | Verify participant belongs to correct tenant |
| Analytics | Track metrics by participant type |
Identity Implementation
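A minimal implementation, using the colon-separated form seen in the example identity ai_agent:acme:001 from §37.2 (the exact delimiter in the original spec is assumed):

```python
from dataclasses import dataclass

# Participant types from the permission matrix in Section 36.3
PARTICIPANT_TYPES = {"caller", "ai_agent", "supervisor", "observer", "recorder", "human_agent"}

@dataclass(frozen=True)
class ParticipantIdentity:
    participant_type: str
    tenant_id: str
    unique_id: str

    def __str__(self):
        # Matches the example identity "ai_agent:acme:001" from Section 37.2
        return f"{self.participant_type}:{self.tenant_id}:{self.unique_id}"

    @classmethod
    def parse(cls, identity):
        """Parse 'type:tenant:id' back into its components, validating the type."""
        participant_type, tenant_id, unique_id = identity.split(":", 2)
        if participant_type not in PARTICIPANT_TYPES:
            raise ValueError(f"Unknown participant type: {participant_type}")
        return cls(participant_type, tenant_id, unique_id)
```
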
36.3 Permissions by Role
Permission Matrix
| Permission | Caller | AI Agent | Supervisor | Observer | Recorder | Human Agent |
|---|---|---|---|---|---|---|
| can_publish | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| can_subscribe | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| can_publish_data | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ |
| can_update_metadata | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ |
| hidden | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
What Each Permission Means
can_publish
- Allows publishing audio (and video if supported)
- If false, participant can only listen
- Callers and agents need this to speak

can_subscribe
- Allows receiving others’ audio
- Almost always true - everyone needs to hear
- Could be false for one-way broadcast

can_publish_data
- Allows sending data channel messages
- Agents use this to send transcription updates
- Not needed for basic voice calls

can_update_metadata
- Allows changing own metadata
- Agents update status (processing, responding)
- Not needed for callers

hidden
- Other participants don’t know you’re there
- Perfect for supervisors monitoring calls
- Observers and recorders are always hidden
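The permission matrix above translates directly into a lookup table the token service can consult:

```python
# Permission sets mirroring the matrix in Section 36.3
PERMISSIONS = {
    "caller":      {"can_publish": True,  "can_subscribe": True, "can_publish_data": False, "can_update_metadata": False, "hidden": False},
    "ai_agent":    {"can_publish": True,  "can_subscribe": True, "can_publish_data": True,  "can_update_metadata": True,  "hidden": False},
    "supervisor":  {"can_publish": True,  "can_subscribe": True, "can_publish_data": True,  "can_update_metadata": False, "hidden": True},
    "observer":    {"can_publish": False, "can_subscribe": True, "can_publish_data": False, "can_update_metadata": False, "hidden": True},
    "recorder":    {"can_publish": False, "can_subscribe": True, "can_publish_data": False, "can_update_metadata": False, "hidden": True},
    "human_agent": {"can_publish": True,  "can_subscribe": True, "can_publish_data": True,  "can_update_metadata": True,  "hidden": False},
}

def permissions_for(participant_type):
    """Look up the permission set for a participant type."""
    return PERMISSIONS[participant_type]
```
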
36.4 Participant Lifecycle
States a Participant Goes Through
Participant Manager Implementation
Section 37: Token Generation
37.1 JWT Structure
What is a JWT?
JWT (JSON Web Token) is a standard for securely transmitting information. LiveKit uses JWTs to authenticate participants and authorize their actions. A JWT has three parts:
- Header - Says it’s a JWT and which algorithm signed it
- Payload - Contains the actual data (claims)
- Signature - Proves the token wasn’t tampered with
37.2 Claims & Grants
Standard JWT Claims
| Claim | Full Name | Purpose | Example |
|---|---|---|---|
| sub | Subject | Who the token is for | ai_agent:acme:001 |
| iss | Issuer | Who created the token | APIxxxxxxxxx |
| nbf | Not Before | When token becomes valid | 1704067200 |
| exp | Expiration | When token expires | 1704070800 |
| name | Name | Display name | AI Assistant |
LiveKit Video Grants
The video claim contains LiveKit-specific permissions:
| Grant | Type | Purpose |
|---|---|---|
| room | string | Which room this token is for |
| roomJoin | bool | Can join the room |
| canPublish | bool | Can publish tracks (audio/video) |
| canSubscribe | bool | Can receive others’ tracks |
| canPublishData | bool | Can send data messages |
| hidden | bool | Invisible to other participants |
| recorder | bool | Special recording permissions |
| roomCreate | bool | Can create new rooms |
| roomList | bool | Can list rooms |
| roomAdmin | bool | Full admin access |
37.3 Token Service Implementation
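In production you would mint tokens with LiveKit's server SDK; the stdlib sketch below builds an equivalent HS256 JWT by hand purely to show what ends up inside the token (standard claims from §37.1, video grants from §37.2):

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data):
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_access_token(api_key, api_secret, identity, room, ttl_seconds=3600):
    """Build an HS256 JWT with the claims and grants described in Sections 37.1-37.2.
    Sketch only - use the LiveKit server SDK in production."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": api_key,              # who created the token (your API key)
        "sub": identity,             # who the token is for
        "nbf": now,                  # not valid before now
        "exp": now + ttl_seconds,    # limited lifetime
        "video": {                   # LiveKit-specific grants
            "room": room,
            "roomJoin": True,
            "canPublish": True,
            "canSubscribe": True,
        },
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(payload).encode())}"
    signature = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"
```

Because only the server knows the API secret, no one else can forge a valid signature; LiveKit recomputes it to verify the token.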
37.4 Token Refresh Strategy
Why Token Refresh Matters
Tokens have limited lifetimes for security. But what happens if a call lasts longer than the token’s TTL? We need to refresh tokens before they expire.

Token Refresh Manager
Section 38: Audio Track Handling
38.1 Track Publication
What is Track Publication?
When a participant wants to share audio (or video), they “publish” a track to the room. Other participants can then “subscribe” to that track to receive the audio.

Audio Track Options
Audio Track Publisher
38.2 Track Subscription
Subscribing to Audio
When you join a room, you can subscribe to audio tracks published by other participants. This is how the AI agent hears the caller.

38.3 Track Quality Settings
Quality vs Bandwidth Tradeoff
| Setting | Voice Call | Music/Ads | Low Bandwidth |
|---|---|---|---|
| Sample Rate | 48000 Hz | 48000 Hz | 16000 Hz |
| Channels | 1 (mono) | 2 (stereo) | 1 (mono) |
| Bitrate | 32 kbps | 96 kbps | 16 kbps |
| DTX | On | Off | On |
| FEC | On | On | Off |
| Latency | ~50ms | ~50ms | ~30ms |
Adaptive Quality
LiveKit automatically adjusts quality based on network conditions:

38.4 Mute/Unmute
Muting Tracks
Muting prevents audio from being transmitted without disconnecting:

Section 39: LiveKit Webhooks
39.1 Room Started
When Room Started Fires
The room_started webhook fires when a new LiveKit room is created and ready for participants.
39.2 Room Finished
When Room Finished Fires
The room_finished webhook fires when a room closes (all participants left and timeout expired, or explicitly deleted).
39.3 Participant Joined
Webhook Payload Structure
Handler Implementation
39.4 Participant Left
39.5 Track Published/Unpublished
Track Events
Section 40: Recording with Egress
40.1 Egress Types
What is Egress?
Egress is LiveKit’s term for extracting media from a room for recording, streaming, or other processing. There are several types:

| Type | Purpose | Output |
|---|---|---|
| Room Composite | Record entire room as single file | MP4/WebM video or audio-only |
| Track Composite | Record specific tracks | Audio/video file |
| Participant Egress | Record specific participant | Audio/video file |
| Web Egress | Render a web page with room content | Video file |
Egress Configuration
40.2 Starting Recording
Recording Service Implementation
40.3 Stopping Recording
40.4 Storage Configuration
S3 Configuration
LiveKit Egress can output directly to Amazon S3 (or S3-compatible storage like DigitalOcean Spaces, MinIO, etc.).

Bucket Structure
40.5 Recording Retrieval
Retrieving Recordings
Part 6 Summary
What You Learned
In this part, you learned about LiveKit integration for Voice by aiConnected:

| Section | Key Concepts |
|---|---|
| 34. LiveKit Cloud Setup | Account creation, API credentials, webhooks |
| 35. Room Management | Naming conventions, creation, lifecycle |
| 36. Participant Management | Types, permissions, identity format |
| 37. Token Generation | JWT structure, claims, grants, refresh |
| 38. Audio Track Handling | Publication, subscription, quality, muting |
| 39. LiveKit Webhooks | Room/participant/track events |
| 40. Recording with Egress | Recording types, storage, retrieval |
Key Takeaways
- LiveKit is the central hub where all participants meet
- Tokens control access - they encode identity and permissions
- Webhooks provide real-time updates about what’s happening
- Rooms have lifecycles - create, use, destroy
- Audio tracks are the core - publishing and subscribing is how voice flows
- Egress enables recording - important for compliance and training
Next Steps
In Part 7, you’ll learn about the Voice AI Pipeline:
- Deepgram speech-to-text integration
- Voice Activity Detection
- Claude LLM integration
- Chatterbox text-to-speech
Quick Reference
Environment Variables
Common Operations
Continue to Part 7 for Voice AI Pipeline details…
Junior Developer PRD — Part 7A: Pipeline Architecture & Deepgram STT
Document Version: 1.0
Last Updated: January 25, 2026
Part: 7A of 10 (Sub-part 1 of 3)
Sections: 41-42
Audience: Junior developers with no prior context
Estimated Reading Time: 20 minutes
How to Use This Document
This is Part 7A of the PRD series—the first of three sub-parts covering the Voice AI Pipeline. Part 7 was divided into sub-parts due to its comprehensive nature:
- Part 7A (this document): Pipeline Architecture + Deepgram STT
- Part 7B: Voice Activity Detection + Claude LLM Integration
- Part 7C: Chatterbox TTS + Barge-In Handling + State Management
Table of Contents
Section 41: Pipeline Architecture
41.1 What is the Voice Pipeline?
The voice pipeline is the heart of Voice by aiConnected. It’s the processing chain that transforms a caller’s spoken words into AI responses and back to synthesized speech. Think of it like a relay race with four runners:
- VAD (Voice Activity Detection) — Detects when someone is speaking
- STT (Speech-to-Text) — Converts speech to text
- LLM (Large Language Model) — Generates a response
- TTS (Text-to-Speech) — Converts the response back to speech
Why Latency Matters
In human conversation, we naturally expect responses within 200-400ms. Here’s how different latencies feel:

| Latency | User Perception |
|---|---|
| < 500ms | Feels instant, like talking to a human |
| 500-1000ms | Feels responsive, acceptable |
| 1000-1500ms | Noticeable delay, still usable |
| 1500-2000ms | Awkward pause, frustrating |
| > 2000ms | Feels broken, users hang up |
41.2 High-Level Architecture
Detailed Flow
41.3 Component Summary
| Component | Technology | Purpose | Latency |
|---|---|---|---|
| Audio Transport | LiveKit | Real-time audio streaming | ~40ms |
| VAD | Silero VAD | Detect speech activity | ~10ms |
| STT | Deepgram Nova-2 | Transcribe speech | ~200-350ms |
| LLM | Claude Sonnet | Generate responses | ~300-500ms |
| TTS | Chatterbox | Synthesize speech | ~100-200ms |
| State Manager | Redis | Track conversation state | ~5ms |
41.4 Latency Budget
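Using the typical figures from the component table in §41.3 (taking midpoints for the ranged values, so this is illustrative rather than a measured budget), the stages add up comfortably under the 1-second target:

```python
# Typical per-stage latencies (ms), from the component table in Section 41.3.
# Ranged values are represented by their midpoints; real budgets vary per call.
LATENCY_MS = {
    "audio_transport": 40,   # LiveKit
    "vad": 10,               # Silero VAD
    "stt": 275,              # Deepgram Nova-2, midpoint of 200-350ms
    "llm": 400,              # Claude Sonnet, midpoint of 300-500ms
    "tts": 150,              # Chatterbox, midpoint of 100-200ms
    "state_manager": 5,      # Redis
}

TARGET_MS = 1000  # mouth-to-ear target

total = sum(LATENCY_MS.values())
print(f"Typical pipeline latency: {total}ms (budget: {TARGET_MS}ms)")
```

The headroom between the typical total and the 1000ms budget is what absorbs network jitter and slow responses from any single stage.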
41.5 Latency Optimization Strategies
Strategy 1: Streaming Everything
Instead of waiting for complete results, we stream at every stage:

Strategy 2: Sentence-Level TTS
We don’t wait for the entire LLM response. As soon as we have a complete sentence, we send it to TTS:

Strategy 3: Warm Connections
Keep connections to external services pre-established:

41.6 Pipeline States
The pipeline operates as a state machine:

State Definitions
| State | Description | Entry Condition |
|---|---|---|
| IDLE | Pipeline initialized, waiting | Call connected |
| LISTENING | Waiting for speech | Ready for input |
| CAPTURING | Recording utterance | VAD detected speech |
| PROCESSING | Generating response | User finished speaking |
| SPEAKING | Playing AI response | TTS audio ready |
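A sketch of this state machine (the transition set is inferred from the table above; the SPEAKING → CAPTURING edge is the barge-in path covered in Section 46):

```python
from enum import Enum, auto

class PipelineState(Enum):
    IDLE = auto()
    LISTENING = auto()
    CAPTURING = auto()
    PROCESSING = auto()
    SPEAKING = auto()

# Allowed transitions (assumed from the state table)
TRANSITIONS = {
    PipelineState.IDLE: {PipelineState.LISTENING},
    PipelineState.LISTENING: {PipelineState.CAPTURING},
    PipelineState.CAPTURING: {PipelineState.PROCESSING},
    PipelineState.PROCESSING: {PipelineState.SPEAKING},
    # After speaking, go back to listening - or straight to capturing on barge-in
    PipelineState.SPEAKING: {PipelineState.LISTENING, PipelineState.CAPTURING},
}

class Pipeline:
    def __init__(self):
        self.state = PipelineState.IDLE

    def transition(self, new_state):
        """Move to a new state, rejecting transitions the table does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"Illegal transition: {self.state.name} -> {new_state.name}")
        self.state = new_state
```
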
41.7 Data Flow Example
When a caller says “What are your business hours?”:

41.8 Error Handling Strategy
Error Categories
| Category | Example | Recovery |
|---|---|---|
| Transient | Network timeout | Retry with backoff |
| Provider | Deepgram API error | Failover to backup |
| Fatal | Invalid configuration | End call gracefully |
Fallback Chains
Section 42: Deepgram STT Integration
42.1 What is Deepgram?
Deepgram is a speech-to-text (STT) service that converts spoken audio into written text. We chose Deepgram because:
- Low Latency: ~200ms for streaming transcription
- High Accuracy: 95%+ word accuracy
- Streaming API: Real-time results as the user speaks
- Interim Results: Preview of transcription before final result
- Automatic Punctuation: Adds periods, commas, question marks
Provider Comparison
| Provider | Latency | Streaming | Cost/min | Notes |
|---|---|---|---|---|
| Deepgram Nova-2 | ~200ms | ✓ | $0.0043 | Our choice |
| Google Speech | ~300ms | ✓ | $0.006 | Higher latency |
| AWS Transcribe | ~500ms | ✓ | $0.024 | Too slow |
| Whisper (OpenAI) | ~1000ms | ✗ | $0.006 | No streaming |
| AssemblyAI | ~300ms | ✓ | $0.0065 | Backup option |
42.2 Account Setup
Step 1: Create Deepgram Account
- Go to https://console.deepgram.com
- Sign up with email or Google
- Verify your email
Step 2: Create API Key
- In the console, go to API Keys
- Click Create New Key
- Name it: voice-aiconnected-production
- Select permissions: usage:read, keys:read, transcription:read
- Copy the key (you won’t see it again)
Step 3: Configure Environment
42.3 Configuration
42.4 WebSocket Connection Flow
Deepgram uses WebSocket for real-time streaming:

42.5 Deepgram Client Implementation
42.6 Interim vs Final Results
Deepgram sends two types of results:

Interim Results (is_final=False)
- Sent while the user is still speaking
- May change as more audio is processed
- Use for displaying live transcription
- Don’t send to LLM
Final Results (is_final=True)
- Sent when Deepgram is confident the text won’t change
- May still be mid-utterance
- Use for building the complete transcript
Speech Final (speech_final=True)
- Indicates the user has stopped speaking
- Time to send to LLM
Transcript Accumulator
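A minimal accumulator following the rules above: interim results are preview only, final results accumulate, and speech_final flushes the complete utterance to the LLM (this is an illustrative sketch, not the production class):

```python
class TranscriptAccumulator:
    """Assemble Deepgram results into a complete utterance (see Section 42.6)."""

    def __init__(self):
        self.finalized = []   # final result segments for the current utterance
        self.interim = ""     # live preview text - never sent to the LLM

    def on_result(self, text, is_final, speech_final):
        """Feed one Deepgram result. Returns the full utterance when the
        user stops speaking (speech_final), otherwise None."""
        if not is_final:
            self.interim = text       # may still change - display only
            return None
        self.finalized.append(text)   # confident text - accumulate
        self.interim = ""
        if speech_final:              # user stopped speaking - time for the LLM
            utterance = " ".join(self.finalized).strip()
            self.finalized = []
            return utterance
        return None
```
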
42.7 Audio Format Requirements
Deepgram expects audio in a specific format:

| Parameter | Value | Notes |
|---|---|---|
| Sample Rate | 16000 Hz | Optimal for speech |
| Channels | 1 (mono) | Stereo wastes bandwidth |
| Encoding | linear16 | 16-bit signed PCM |
| Byte Order | Little-endian | Standard |
Audio Conversion
LiveKit outputs 48kHz stereo audio, so we must downmix to mono and resample to 16kHz before streaming it to Deepgram:

42.8 Error Handling and Retry
Common Errors
| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid API key | Check DEEPGRAM_API_KEY |
| 429 Too Many Requests | Rate limited | Backoff, check plan |
| Connection dropped | Network issue | Reconnect |
| Empty transcript | Silence or noise | Check audio input |
Retry Logic
42.9 Integration with Pipeline
Here’s how Deepgram integrates with the voice pipeline:

Summary: What You’ve Learned in Part 7A
Section 41: Pipeline Architecture
- The voice pipeline transforms speech → text → AI response → speech
- Target latency: <1000ms mouth-to-ear
- Key optimization: streaming at every stage
- Pipeline states: IDLE → LISTENING → CAPTURING → PROCESSING → SPEAKING
Section 42: Deepgram STT Integration
- Deepgram Nova-2 provides low-latency streaming transcription
- WebSocket connection for real-time audio streaming
- Interim results for preview, final results for processing
- Audio format: 16kHz, mono, 16-bit PCM
What’s Next
In Part 7B, you’ll learn:
- Voice Activity Detection (VAD) with Silero
- Claude LLM integration for conversation AI
- System prompt design for voice
- Function calling (tools) in voice context
Document Metadata
| Field | Value |
|---|---|
| Document ID | PRD-007A |
| Title | Junior Developer PRD — Part 7A |
| Version | 1.0 |
| Status | Complete |
End of Part 7A — Continue to Part 7B
Junior Developer PRD — Part 7B: VAD & Claude LLM Integration
Document Version: 1.0
Last Updated: January 25, 2026
Part: 7B of 10 (Sub-part 2 of 3)
Sections: 43-44
Audience: Junior developers with no prior context
Estimated Reading Time: 20 minutes
How to Use This Document
This is Part 7B—the second of three sub-parts covering the Voice AI Pipeline:
- Part 7A: Pipeline Architecture + Deepgram STT ✓
- Part 7B (this document): VAD + Claude LLM Integration
- Part 7C: Chatterbox TTS + Barge-In + State Management
Table of Contents
Section 43: Voice Activity Detection (VAD)
43.1 What is VAD?
Voice Activity Detection (VAD) determines when someone is speaking versus when there’s silence or background noise. It’s crucial for:
- Knowing when to transcribe: Don’t waste resources on silence
- Endpointing: Detecting when the user finished speaking
- Interruption handling: Detecting barge-in during TTS playback
Why Use Separate VAD (Not Just Deepgram)?
Deepgram has built-in VAD, but we use our own because:
- Lower latency: Local VAD is faster than waiting for Deepgram
- Interruption detection: Need VAD running during TTS playback
- More control: Tune sensitivity for our use case
- Redundancy: Don’t rely on single provider
43.2 Silero VAD
We use Silero VAD, a lightweight neural network that runs locally:
- Fast: ~10ms per frame on CPU
- Accurate: 95%+ accuracy
- Small: ~2MB model size
- Open source: MIT license
How It Works
Silero VAD outputs a probability from 0 to 1 for each audio frame:

| Probability | Meaning |
|---|---|
| 0.0 - 0.3 | Definitely not speech (silence, noise) |
| 0.3 - 0.5 | Uncertain (background noise, breathing) |
| 0.5 - 0.7 | Likely speech |
| 0.7 - 1.0 | Definitely speech |
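Raw per-frame probabilities are noisy, so implementations debounce them with minimum speech/silence durations (the threshold, min_speech_duration, and min_silence_duration parameters described in this section). An illustrative sketch, assuming the pipeline's 20ms frames:

```python
FRAME_MS = 20  # the pipeline processes audio in 20ms frames

class SpeechDetector:
    """Debounce raw Silero probabilities into speaking/not-speaking decisions.
    Parameters mirror Section 43.3: threshold, min speech/silence durations."""

    def __init__(self, threshold=0.5, min_speech_ms=250, min_silence_ms=300):
        self.threshold = threshold
        self.min_speech_frames = min_speech_ms // FRAME_MS
        self.min_silence_frames = min_silence_ms // FRAME_MS
        self.speaking = False
        self.run = 0  # consecutive frames disagreeing with the current state

    def on_frame(self, probability):
        """Feed one frame's speech probability; returns the current speaking state."""
        is_speech = probability >= self.threshold
        if is_speech != self.speaking:
            self.run += 1
            needed = self.min_speech_frames if is_speech else self.min_silence_frames
            if self.run >= needed:
                self.speaking = is_speech  # enough evidence - flip state
                self.run = 0
        else:
            self.run = 0  # evidence agrees with current state - reset counter
        return self.speaking
```

The asymmetric durations are what the scenario table below tunes: a longer min_silence keeps slow speakers from being cut off, while a longer min_speech keeps coughs and noise from triggering capture.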
43.3 VAD Configuration
43.4 VAD Implementation
43.5 Endpointing Strategies
“Endpointing” means detecting when the user finished their utterance.

Strategy 1: Fixed Silence Timeout
Simple: wait for N milliseconds of silence.
Cons: Cuts off slow speakers, waits too long for fast speakers
Strategy 2: Adaptive Endpointing
Adjust timeout based on speech patterns:

Strategy 3: Semantic Endpointing (Advanced)
Use transcript content to help determine completion:

43.6 VAD Settings by Scenario
| Scenario | Threshold | Min Speech | Min Silence |
|---|---|---|---|
| Quiet office | 0.5 | 250ms | 300ms |
| Noisy environment | 0.7 | 300ms | 400ms |
| Fast conversation | 0.4 | 150ms | 200ms |
| Elderly callers | 0.4 | 200ms | 500ms |
| Call center | 0.5 | 250ms | 350ms |
Section 44: Claude LLM Integration
44.1 What is Claude?
Claude is Anthropic’s large language model—the “brain” of our voice AI. It understands the caller’s request and generates appropriate responses. We use Claude Sonnet for the best balance of:
- Speed: Fast enough for real-time conversation
- Quality: Intelligent, coherent responses
- Cost: Reasonable per-token pricing
Model Comparison
| Model | TTFB | Quality | Cost (in/out per 1M) | Use Case |
|---|---|---|---|---|
| Claude Sonnet | ~300ms | Excellent | 15 | Primary |
| Claude Haiku | ~150ms | Good | 1.25 | Fallback |
| GPT-4o | ~350ms | Excellent | 15 | Alternative |
| GPT-4o-mini | ~200ms | Good | 0.60 | Backup |
44.2 API Setup
Step 1: Create Anthropic Account
- Go to https://console.anthropic.com
- Sign up with email
- Verify email
Step 2: Create API Key
- Go to API Keys in console
- Click Create Key
- Name it: voice-aiconnected-production
- Copy the key
Step 3: Configure Environment
44.3 LLM Configuration
44.4 System Prompt Design for Voice
Voice conversations need special system prompt considerations:

44.5 Conversation History Management
Claude needs conversation history for context, but we must manage it carefully:

44.6 Streaming Responses
For voice, we must stream LLM responses:

44.7 Function Calling (Tools)
Claude can execute functions during conversations:

Tool Executor
44.8 Sentence Accumulator
We send text to TTS sentence-by-sentence for lowest latency:

44.9 Token Tracking
Summary: What You’ve Learned in Part 7B
Section 43: Voice Activity Detection
- Silero VAD provides local, low-latency speech detection
- Key parameters: threshold, min_speech_duration, min_silence_duration
- Endpointing strategies: fixed timeout, adaptive, semantic
Section 44: Claude LLM Integration
- Claude Sonnet is primary model for voice conversations
- Voice-optimized system prompts: concise, conversational, no formatting
- Streaming responses essential for low latency
- Sentence accumulation for TTS optimization
- Function calling enables actions during conversation
What’s Next
In Part 7C, you’ll learn:
- Chatterbox TTS integration (self-hosted on RunPod)
- Barge-in handling (interruption detection)
- Conversation state management with Redis
Document Metadata
| Field | Value |
|---|---|
| Document ID | PRD-007B |
| Title | Junior Developer PRD — Part 7B |
| Version | 1.0 |
| Status | Complete |
End of Part 7B — Continue to Part 7C
Junior Developer PRD — Part 7C: TTS, Barge-In & State Management
Document Version: 1.0
Last Updated: January 25, 2026
Part: 7C of 10 (Sub-part 3 of 3)
Sections: 45-47
Table of Contents
- Section 45: Chatterbox TTS Integration
- Section 46: Barge-In Handling
- Section 47: Conversation State Management
Section 45: Chatterbox TTS Integration
45.1 What is Chatterbox?
Chatterbox is an open-source TTS that produces remarkably natural speech:
- Voice Quality: Nearly indistinguishable from human
- Self-Hosted: No per-minute API costs
- Open Source: MIT license
TTS Comparison
| Provider | TTFB | Quality | Cost | Notes |
|---|---|---|---|---|
| Chatterbox | ~150ms | Excellent | Self-hosted | Primary |
| Cartesia Sonic | ~40ms | Excellent | $30/1M chars | Backup |
| Deepgram Aura | ~100ms | Good | $30/1M chars | Backup |
45.2 RunPod Deployment
Chatterbox requires a GPU, so we deploy it on RunPod.
45.3 TTS Configuration
45.4 TTS Client
45.5 TTS Fallback
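If the primary provider fails, synthesis should fall through to the next in the chain (Chatterbox, then Cartesia or Deepgram per the table above); a minimal sketch, assuming each provider object exposes a name and an async synthesize method (the interface is illustrative):

```python
import asyncio  # callers drive the coroutine with an event loop

class TTSError(Exception):
    """Raised when a provider cannot synthesize audio."""

async def synthesize_with_fallback(text: str, providers: list) -> tuple[str, bytes]:
    """Try each TTS provider in order; return (provider_name, audio)."""
    last_error = None
    for provider in providers:
        try:
            audio = await provider.synthesize(text)
            return provider.name, audio
        except TTSError as exc:
            last_error = exc  # log and fall through to the next provider
    raise TTSError(f"all TTS providers failed: {last_error}")
```

Ordering the list as primary-then-backups keeps the fallback policy in one place instead of scattered through the pipeline.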
Section 46: Barge-In Handling
46.1 What is Barge-In?
Barge-in is when the caller interrupts the AI while it’s speaking, which is natural in conversation.
| Scenario | Example | Response |
|---|---|---|
| Correction | AI: “Tuesday—” Caller: “No, Wednesday” | Stop, process |
| Agreement | AI: “Would you—” Caller: “Yes” | Stop, continue |
| Frustration | AI: long explanation Caller: “Transfer me” | Stop, transfer |
46.2 Barge-In Detection
46.3 Barge-In Handler
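A handler typically debounces VAD hits so a cough doesn't cancel playback, then stops the AI once sustained speech is detected; a minimal sketch (the class and frame-count threshold are illustrative, and the actual playback cancellation is only indicated by a comment):

```python
class BargeInHandler:
    """Stop AI playback when the caller starts speaking mid-response."""

    def __init__(self, min_speech_frames: int = 3):
        self.min_speech_frames = min_speech_frames  # debounce spurious VAD hits
        self._speech_frames = 0
        self.agent_speaking = False
        self.interrupted = False

    def on_vad_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True if a barge-in fired."""
        if not self.agent_speaking:
            self._speech_frames = 0
            return False
        # Count consecutive speech frames; any silence resets the streak
        self._speech_frames = self._speech_frames + 1 if is_speech else 0
        if self._speech_frames >= self.min_speech_frames:
            self.interrupted = True
            self.agent_speaking = False  # real handler would cancel TTS playback here
            return True
        return False
```

Requiring a few consecutive speech frames trades a few tens of milliseconds of reaction time for far fewer false interruptions.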
Section 47: Conversation State Management
47.1 Why State Management?
We track pipeline state, conversation history, call context, and tool results.
47.2 Redis Data Model
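One common layout is a small family of per-call keys with a shared TTL; a minimal sketch (the key names and TTL are illustrative, not the project's actual schema):

```python
# Redis key layout for call state (key templates are illustrative)
CALL_STATE_KEY = "call:{call_id}:state"      # hash: pipeline state fields
CALL_HISTORY_KEY = "call:{call_id}:history"  # list: conversation turns as JSON
CALL_CONTEXT_KEY = "call:{call_id}:context"  # hash: tenant, caller, metadata
CALL_TTL_SECONDS = 3600                      # expire keys an hour after call end

def call_keys(call_id: str) -> dict:
    """Render all Redis keys for one call."""
    return {
        "state": CALL_STATE_KEY.format(call_id=call_id),
        "history": CALL_HISTORY_KEY.format(call_id=call_id),
        "context": CALL_CONTEXT_KEY.format(call_id=call_id),
    }
```

Prefixing every key with the call_id keeps all of a call's data discoverable with one SCAN pattern and lets EXPIRE clean up the whole set.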
47.3 State Models
47.4 Redis State Manager
47.5 State Machine
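The pipeline states and their legal transitions can be made explicit so illegal moves fail loudly; a minimal sketch (the state names and transition table are illustrative of the pattern, not the project's exact machine):

```python
from enum import Enum

class PipelineState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    PROCESSING = "processing"
    SPEAKING = "speaking"
    ENDED = "ended"

# Allowed transitions; barge-in is the SPEAKING -> LISTENING edge
TRANSITIONS = {
    PipelineState.IDLE: {PipelineState.LISTENING, PipelineState.ENDED},
    PipelineState.LISTENING: {PipelineState.PROCESSING, PipelineState.ENDED},
    PipelineState.PROCESSING: {PipelineState.SPEAKING, PipelineState.ENDED},
    PipelineState.SPEAKING: {PipelineState.LISTENING, PipelineState.ENDED},
    PipelineState.ENDED: set(),
}

def transition(current: PipelineState, nxt: PipelineState) -> PipelineState:
    """Validate and apply a state transition."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```

Storing the current state string in the Redis call hash lets any service validate a transition before applying it.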
Part 7 Summary
| Sub-Part | Content |
|---|---|
| 7A | Pipeline architecture, latency budget, Deepgram STT |
| 7B | Silero VAD, Claude LLM, function calling |
| 7C | Chatterbox TTS, barge-in, Redis state |
Key numbers to remember:
- Target latency: <1000ms mouth-to-ear
- STT: Deepgram Nova-2 (~200ms)
- LLM: Claude Sonnet (~300ms TTFB)
- TTS: Chatterbox (~150ms TTFB)
What’s Next
Part 8: Knowledge Base & RAG covers:
- Document processing and chunking
- Vector embeddings with pgvector
- Retrieval-Augmented Generation
- Context injection
End of Part 7C
Junior Developer PRD — Part 8A: Document Processing & Chunking
Document Version: 1.0
Last Updated: January 25, 2026
Part: 8A of 10 (Sub-part 1 of 3)
Sections: 48-49
Audience: Junior developers with no prior context
Estimated Reading Time: 20 minutes
How to Use This Document
This is Part 8A—the first of three sub-parts covering Knowledge Base & RAG:
- Part 8A (this document): Document Processing & Chunking
- Part 8B: Vector Embeddings & pgvector
- Part 8C: RAG Pipeline & Context Injection
Table of Contents
Section 48: Knowledge Base Overview
48.1 What is a Knowledge Base?
A knowledge base is a collection of information that the AI can reference when answering questions. Without it, the AI only knows:
- What’s in its training data (general knowledge)
- What’s in the current conversation
A knowledge base adds:
- Business-specific information (hours, services, policies)
- Product details and pricing
- FAQs and common procedures
- Historical data and records
Real-World Example
Without Knowledge Base:
48.2 RAG: Retrieval-Augmented Generation
RAG is the technique that connects the knowledge base to the AI.
Why RAG Instead of Fine-Tuning?
| Approach | Pros | Cons |
|---|---|---|
| RAG | Easy to update, no training needed, cites sources | Retrieval can fail, adds latency |
| Fine-Tuning | Fast inference, deeply learned | Expensive, hard to update, can hallucinate |
We chose RAG because:
- Tenants can update their knowledge anytime
- No GPU training required
- AI can cite specific sources
- Multi-tenant isolation is straightforward
48.3 Knowledge Base Architecture
48.4 Database Schema
48.5 Supported Document Types
| File Type | Extension | Parser | Notes |
|---|---|---|---|
| PDF | .pdf | PyMuPDF | Text + tables, OCR optional |
| Word | .docx | python-docx | Preserves structure |
| Text | .txt | Native | Direct read |
| Markdown | .md | markdown-it | Preserves headers |
| HTML | .html | BeautifulSoup | Strips tags |
| CSV | .csv | pandas | Row-based chunks |
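The table above maps naturally to an extension-based dispatch; a minimal sketch (the parser labels are the names from the table, used here as lookup values rather than real imports):

```python
from pathlib import Path

# Map file extensions to the parsers listed in the table above
PARSER_BY_EXTENSION = {
    ".pdf": "pymupdf",
    ".docx": "python-docx",
    ".txt": "native",
    ".md": "markdown-it",
    ".html": "beautifulsoup",
    ".csv": "pandas",
}

def pick_parser(filename: str) -> str:
    """Choose a parser for an uploaded file, or raise for unsupported types."""
    ext = Path(filename).suffix.lower()
    try:
        return PARSER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext or filename}") from None
```

Rejecting unknown extensions at upload time gives tenants an immediate error instead of a silently failed ingestion job.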
Section 49: Document Processing & Chunking
49.1 The Ingestion Pipeline
When a tenant uploads a document:
49.2 Document Parsers
49.3 Chunking Strategies
Chunking is how we split documents into smaller pieces for embedding and retrieval.
Why Chunk?
- Embedding models have limits: Most handle ~8000 tokens max
- Precise retrieval: Smaller chunks = more specific matches
- Context efficiency: Don’t waste LLM context on irrelevant text
- Cost: Embedding fewer tokens is cheaper
Chunking Strategies
Overlap: Why It Matters
Overlap ensures context isn’t lost at chunk boundaries.
49.4 Chunking Implementation
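A sliding window with overlap is the core of the implementation; a minimal sketch that counts words as a stand-in for tokens (a real implementation would use the embedding model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    Sizes are in words here as an approximation of tokens.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each window starts `overlap` words before the last ended
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already reached the end of the text
    return chunks
```

With chunk_size=512 and overlap=50, consecutive chunks share 50 words, so a sentence straddling a boundary appears whole in at least one chunk.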
49.5 Chunking Configuration by Document Type
Different document types benefit from different chunking strategies:
| Document Type | Chunk Size | Overlap | Notes |
|---|---|---|---|
| FAQ | 256 tokens | 25 | Small, focused answers |
| Policy docs | 512 tokens | 50 | Balanced |
| Technical docs | 768 tokens | 100 | Preserve context |
| Conversations | 256 tokens | 50 | Turn-based |
| Product specs | 512 tokens | 75 | Detailed info |
49.6 Processing Pipeline
Summary: What You’ve Learned in Part 8A
Section 48: Knowledge Base Overview
- Knowledge bases store business-specific information
- RAG = Retrieval + Augmentation + Generation
- Architecture: Documents → Chunks → Embeddings → Vector DB
Section 49: Document Processing & Chunking
- Parsers extract text from PDF, DOCX, TXT, HTML, MD
- Chunking splits documents into embeddable segments
- Semantic chunking respects sentence/paragraph boundaries
- Overlap prevents context loss at chunk boundaries
- Different document types need different chunk sizes
What’s Next
In Part 8B, you’ll learn:
- Vector embeddings and embedding models
- pgvector extension for PostgreSQL
- Similarity search algorithms
- Index optimization
Document Metadata
| Field | Value |
|---|---|
| Document ID | PRD-008A |
| Title | Junior Developer PRD — Part 8A |
| Version | 1.0 |
| Status | Complete |
End of Part 8A — Continue to Part 8B
Junior Developer PRD — Part 8B: Vector Embeddings & pgvector
Document Version: 1.0
Last Updated: January 25, 2026
Part: 8B of 10 (Sub-part 2 of 3)
Sections: 50-51
Audience: Junior developers with no prior context
Estimated Reading Time: 20 minutes
How to Use This Document
This is Part 8B—the second of three sub-parts covering Knowledge Base & RAG:
- Part 8A: Document Processing & Chunking ✓
- Part 8B (this document): Vector Embeddings & pgvector
- Part 8C: RAG Pipeline & Context Injection
Table of Contents
Section 50: Vector Embeddings
50.1 What Are Embeddings?
Embeddings are numerical representations of text that capture semantic meaning. They convert words and sentences into vectors (lists of numbers) that computers can compare mathematically.
The Key Insight
Similar meanings → Similar vectors → Close in vector space
50.2 How Embeddings Enable Search
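Closeness in vector space is measured with cosine similarity; a toy sketch with 3-dimensional vectors standing in for real 1536-dimensional embeddings (the example vectors are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (real ones have 1536 dimensions)
hours_question = [0.9, 0.1, 0.0]  # "what time do you open?"
hours_fact     = [0.8, 0.2, 0.1]  # "our hours are 9-5"
pricing_fact   = [0.1, 0.1, 0.9]  # "plans start at $29"
```

Even though "what time do you open" shares no keywords with "our hours are 9-5", their vectors point in nearly the same direction, which is exactly why embedding search beats keyword search here.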
Traditional keyword search fails when words don’t match exactly; embedding search matches on meaning instead.
50.3 Embedding Models
| Model | Dimensions | Max Tokens | Cost/1M | Speed | Quality |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | $0.02 | Fast | Good |
| text-embedding-3-large | 3072 | 8191 | $0.13 | Medium | Excellent |
| text-embedding-ada-002 | 1536 | 8191 | $0.10 | Fast | Good |
| Cohere embed-v3 | 1024 | 512 | $0.10 | Fast | Good |
| Voyage-2 | 1024 | 4000 | $0.10 | Medium | Excellent |
We chose text-embedding-3-small because:
- Best price/performance ratio
- 1536 dimensions (good balance)
- Fast inference
- Works well with pgvector
50.4 Embedding Service Implementation
50.5 Similarity Metrics
How do we measure if two vectors are similar?
Cosine Similarity (Our Choice)
Measures the angle between vectors. Ignores magnitude, focuses on direction.
Other Metrics
| Metric | Formula | Best For |
|---|---|---|
| Cosine | angle between vectors | Semantic similarity |
| Euclidean | straight-line distance | Dense vectors |
| Dot Product | raw multiplication | Normalized vectors |
| Manhattan | sum of differences | Sparse vectors |
Why Cosine for RAG?
- Normalized: Length doesn’t matter (short and long texts comparable)
- Intuitive: Higher = more similar
- Fast: Optimized in pgvector
- Standard: Most embedding models are trained for cosine
Section 51: pgvector Integration
51.1 What is pgvector?
pgvector is a PostgreSQL extension that adds vector data types and similarity search. It lets us store embeddings directly in our database and perform efficient similarity queries.
Why pgvector?
| Option | Pros | Cons |
|---|---|---|
| pgvector | Integrated with PostgreSQL, ACID, familiar | Scaling limits |
| Pinecone | Managed, scalable | Separate service, cost |
| Weaviate | Feature-rich | Complex setup |
| Qdrant | Fast, open source | Another database |
| Milvus | Highly scalable | Operational overhead |
We chose pgvector because:
- We already use PostgreSQL
- No additional infrastructure
- Joins with other data
- Simpler architecture
- Good enough for our scale
51.2 pgvector Setup
Install Extension
Create Vector Column
51.3 Vector Indexes
Without an index, similarity search scans all rows (slow). pgvector offers two index types:
IVFFlat Index
Inverted File Flat: clusters vectors, then searches only the relevant clusters.
HNSW Index
Hierarchical Navigable Small World: graph-based approximate search.
Index Comparison
| Index | Build Time | Query Time | Accuracy | Memory |
|---|---|---|---|---|
| None | - | O(n) | 100% | Low |
| IVFFlat | Fast | ~10ms | 95%+ | Medium |
| HNSW | Slow | ~5ms | 99%+ | High |
51.4 Similarity Search Queries
Basic Similarity Search
Distance Operators
| Operator | Metric | Usage |
|---|---|---|
| <=> | Cosine distance | ORDER BY embedding <=> query |
| <-> | Euclidean (L2) | ORDER BY embedding <-> query |
| <#> | Inner product (negated) | ORDER BY embedding <#> query |
Filtered Search
Similarity Threshold
51.5 Vector Repository Implementation
51.6 Index Maintenance
51.7 Performance Tuning
Index Parameters
Query Tuning
Memory Configuration
Summary: What You’ve Learned in Part 8B
Section 50: Vector Embeddings
- Embeddings convert text to numerical vectors
- Similar meanings → similar vectors
- We use OpenAI text-embedding-3-small (1536 dimensions)
- Cosine similarity measures vector closeness
Section 51: pgvector Integration
- pgvector adds vector support to PostgreSQL
- IVFFlat index for fast approximate search
- Distance operators: <=> (cosine), <-> (L2)
- Hybrid search combines vector + keyword
- Index tuning critical for performance
What’s Next
In Part 8C, you’ll learn:
- Complete RAG pipeline
- Query processing and reranking
- Context assembly for LLM
- Prompt injection with retrieved context
Document Metadata
| Field | Value |
|---|---|
| Document ID | PRD-008B |
| Title | Junior Developer PRD — Part 8B |
| Version | 1.0 |
| Status | Complete |
End of Part 8B — Continue to Part 8C
Junior Developer PRD — Part 8C: RAG Pipeline & Context Injection
Document Version: 1.0
Last Updated: January 25, 2026
Part: 8C of 10 (Sub-part 3 of 3)
Sections: 52-53
Table of Contents
Section 52: RAG Pipeline
52.1 Complete RAG Flow
52.2 Query Processing
52.3 Reranking
Reranking improves relevance beyond vector similarity.
52.4 Complete RAG Service
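The complete service chains embed, search, rerank, and assemble; a minimal sketch with the collaborators injected (their interfaces here are illustrative, not the actual project classes):

```python
class RAGService:
    """Sketch of the retrieve -> rerank -> assemble flow."""

    def __init__(self, embed, search, rerank, top_k: int = 10, final_k: int = 3):
        # embed(query) -> vector; search(vector, k) -> [(score, chunk)];
        # rerank(query, candidates) -> candidates sorted best-first
        self.embed, self.search, self.rerank = embed, search, rerank
        self.top_k, self.final_k = top_k, final_k

    def retrieve_context(self, query: str) -> str:
        query_vec = self.embed(query)                    # 1. embed the query
        candidates = self.search(query_vec, self.top_k)  # 2. vector search
        ranked = self.rerank(query, candidates)          # 3. rerank by relevance
        top = ranked[: self.final_k]                     # 4. keep the best few
        return "\n\n".join(chunk for _, chunk in top)    # 5. assemble context
```

Fetching a wide top_k and keeping a narrow final_k is the usual pattern: the cheap vector search over-retrieves, and the reranker spends its budget only on those candidates.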
Section 53: Context Injection
53.1 Injecting Context into Prompts
53.2 Voice Pipeline with RAG
53.3 Caching RAG Results
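Retrieval results for repeated questions can be cached per tenant; a minimal in-process sketch showing the keying and TTL logic (production would use Redis, and the class name is illustrative):

```python
import hashlib
import time

class RAGCache:
    """In-process TTL cache keyed by (tenant, normalized query)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(tenant_id: str, query: str) -> str:
        # Normalize whitespace/case so trivially different phrasings share a key
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(f"{tenant_id}:{normalized}".encode()).hexdigest()

    def get(self, tenant_id: str, query: str):
        entry = self._store.get(self._key(tenant_id, query))
        if entry is None:
            return None
        stored_at, context = entry
        if time.monotonic() - stored_at > self.ttl:
            return None  # expired
        return context

    def put(self, tenant_id: str, query: str, context: str):
        self._store[self._key(tenant_id, query)] = (time.monotonic(), context)
```

Including the tenant_id in the key preserves multi-tenant isolation: two tenants asking the same question never see each other's context.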
53.4 Handling Edge Cases
Part 8 Complete Summary
| Sub-Part | Sections | Key Topics |
|---|---|---|
| 8A | 48-49 | Document parsing, chunking, ingestion |
| 8B | 50-51 | Embeddings, pgvector, similarity search |
| 8C | 52-53 | RAG pipeline, reranking, context injection |
The end-to-end flow:
- Parse documents into text
- Chunk into ~512 token segments
- Embed chunks using text-embedding-3-small
- Store in pgvector
- Search with vector similarity
- Rerank for better relevance
- Inject context into LLM prompt
- Generate grounded response
What’s Next
Part 9: Testing & Deployment will cover:
- Unit and integration testing
- End-to-end voice testing
- CI/CD pipelines
- Production deployment
End of Part 8C
Junior Developer PRD — Part 9A: Unit Testing & Integration Testing
Document Version: 1.0
Last Updated: January 25, 2026
Part: 9A of 10 (Sub-part 1 of 3)
Sections: 54-55
Audience: Junior developers with no prior context
Estimated Reading Time: 25 minutes
How to Use This Document
This is Part 9A—the first of three sub-parts covering Testing & Quality Assurance:
- Part 9A (this document): Unit Testing & Integration Testing
- Part 9B: End-to-End Testing & Voice Testing
- Part 9C: Performance Testing & CI/CD
Table of Contents
Section 54: Unit Testing Fundamentals
54.1 Why Testing Matters
Testing isn’t optional—it’s how we ensure the system works correctly and continues to work as we make changes.
The Cost of Bugs
| Stage Found | Cost to Fix | Example |
|---|---|---|
| During coding | 1x | Developer catches typo |
| Unit test | 2x | Test fails, fix immediately |
| Integration test | 5x | Multiple components involved |
| QA/Staging | 10x | Full deployment needed |
| Production | 100x | Customer impact, urgent fix |
Testing Philosophy for Voice AI
Voice AI has unique testing challenges:
- Real-time constraints: Latency matters
- Non-deterministic: AI responses vary
- External dependencies: STT, TTS, LLM APIs
- Audio processing: Binary data, timing
- State management: Conversation context
Our testing strategy:
- Unit tests: Fast, isolated, deterministic
- Integration tests: Component interactions
- E2E tests: Full pipeline with mocks
- Voice tests: Audio-specific scenarios
54.2 Testing Stack
| Tool | Purpose | Why We Chose It |
|---|---|---|
| pytest | Test framework | Industry standard, powerful fixtures |
| pytest-asyncio | Async testing | Our code is async |
| pytest-cov | Coverage | Track test completeness |
| factory_boy | Test data | Generate realistic fixtures |
| faker | Fake data | Random but realistic values |
| respx | HTTP mocking | Mock external APIs |
| pytest-mock | Mocking | Flexible mock/patch |
| testcontainers | Database testing | Real PostgreSQL in Docker |
Installation
54.3 Project Test Structure
54.4 pytest Configuration
54.5 Shared Fixtures (conftest.py)
54.6 Writing Effective Unit Tests
Test Structure: Arrange-Act-Assert (AAA)
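The AAA pattern separates setup, execution, and verification; a minimal sketch using an invented helper function (apply_discount is illustrative, not project code):

```python
def apply_discount(price: float, percent: float) -> float:
    """Tiny function under test (illustrative)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be 0-100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # Arrange: set up inputs and expectations
    price, percent = 200.0, 15.0

    # Act: call the code under test
    result = apply_discount(price, percent)

    # Assert: verify the outcome
    assert result == 170.0
```

Keeping the three phases visually distinct makes a failing test readable at a glance: you can see what was set up, what ran, and which expectation broke.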
Testing Async Code
54.7 Test Factories
54.8 Mocking External Services
54.9 Running Unit Tests
Section 55: Integration Testing
55.1 What is Integration Testing?
Integration tests verify that components work together correctly. They test:
- Database operations with real PostgreSQL
- Redis operations with real Redis
- API endpoints with real HTTP
- Multiple services communicating
Unit vs Integration Tests
| Aspect | Unit Tests | Integration Tests |
|---|---|---|
| Scope | Single function/class | Multiple components |
| Dependencies | Mocked | Real (or containers) |
| Speed | Fast (ms) | Slower (seconds) |
| Isolation | Complete | Partial |
| Flakiness | Low | Higher |
| Purpose | Logic correctness | System correctness |
55.2 Database Integration Tests
55.3 Redis Integration Tests
55.4 API Integration Tests
Summary: What You’ve Learned in Part 9A
Section 54: Unit Testing Fundamentals
- Testing pyramid: unit → integration → E2E
- pytest with async support (pytest-asyncio)
- AAA pattern: Arrange, Act, Assert
- Fixtures for test setup and teardown
- Factory Boy for test data generation
- Mocking external services
Section 55: Integration Testing
- Testcontainers for real databases
- PostgreSQL with pgvector testing
- Redis integration testing
- API endpoint testing with httpx
- Test isolation and cleanup
What’s Next
In Part 9B, you’ll learn:
- End-to-end testing strategies
- Voice-specific testing
- Audio simulation
- Pipeline testing
Document Metadata
| Field | Value |
|---|---|
| Document ID | PRD-009A |
| Title | Junior Developer PRD — Part 9A |
| Version | 1.0 |
| Status | Complete |
End of Part 9A — Continue to Part 9B
Junior Developer PRD — Part 9B: End-to-End Testing & Voice Testing
Document Version: 1.0
Last Updated: January 25, 2026
Part: 9B of 10 (Sub-part 2 of 3)
Sections: 56-57
Table of Contents
Section 56: End-to-End Testing
56.1 E2E Test Environment
56.2 E2E Test Client
56.3 E2E Test Scenarios
Section 57: Voice-Specific Testing
57.1 Audio Test Utilities
57.2 VAD Testing
57.3 Pipeline Testing
57.4 Audio Quality Testing
Summary
Section 56: End-to-End Testing
- Docker Compose environment for full system testing
- E2E client with retry/wait logic
- Critical user journey tests
Section 57: Voice-Specific Testing
- Audio utilities for test data generation
- VAD accuracy and timing tests
- Pipeline integration tests
- Barge-in and latency verification
What’s Next
Part 9C: Performance Testing & CI/CD covers:
- Load testing with Locust
- GitHub Actions pipelines
- Deployment strategies
End of Part 9B
Junior Developer PRD — Part 9C: Performance Testing & CI/CD
Document Version: 1.0
Last Updated: January 25, 2026
Part: 9C of 10 (Sub-part 3 of 3)
Sections: 58-59
Audience: Junior developers with no prior context
Estimated Reading Time: 25 minutes
Table of Contents
Section 58: Performance Testing
58.1 Why Performance Testing?
Voice AI has strict performance requirements:
| Metric | Target | Impact if Exceeded |
|---|---|---|
| Latency (P50) | < 1000ms | Unnatural conversation |
| Latency (P95) | < 1500ms | User frustration |
| Latency (P99) | < 2000ms | Call abandonment |
| Concurrent calls | 100+ per instance | Service degradation |
| Error rate | < 0.1% | Trust loss |
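Checking measured latencies against these targets means computing percentiles over a sample set; a minimal nearest-rank sketch (the helper names are illustrative):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, as used for latency targets."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def check_latency_slo(samples_ms: list[float]) -> dict:
    """Compare measured latencies against the targets in the table above."""
    return {
        "p50_ok": percentile(samples_ms, 50) < 1000,
        "p95_ok": percentile(samples_ms, 95) < 1500,
        "p99_ok": percentile(samples_ms, 99) < 2000,
    }
```

Percentiles rather than averages matter here: a mean of 400ms can hide a P99 of 3 seconds, and it's the tail that makes callers hang up.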
58.2 Load Testing with Locust
58.3 Running Load Tests
58.4 Latency Testing
58.5 Database Performance Testing
58.6 Stress Testing
Section 59: CI/CD Pipelines
59.1 GitHub Actions Workflow
59.2 Deployment Scripts
59.3 Rollback Procedures
59.4 Database Migrations
Part 9 Summary
| Sub-Part | Sections | Key Topics |
|---|---|---|
| 9A | 54-55 | Unit tests, integration tests, fixtures, mocking |
| 9B | 56-57 | E2E tests, voice testing, audio utilities |
| 9C | 58-59 | Load testing, latency testing, CI/CD pipelines |
Testing pyramid:
- Many unit tests (fast, isolated)
- Some integration tests (component interaction)
- Few E2E tests (critical paths)
- Voice-specific tests (audio, latency, barge-in)
CI/CD pipeline:
- Lint → Unit Tests → Integration Tests → Build → Deploy
- Rolling deployments with health checks
- Automatic rollback on failure
What’s Next
Part 10: Operations & Monitoring will cover:
- Logging and observability
- Metrics and alerting
- Incident response
- Scaling strategies
End of Part 9C
Junior Developer PRD — Part 10A: Logging & Observability
Document Version: 1.0
Last Updated: January 25, 2026
Part: 10A of 10 (Sub-part 1 of 3)
Sections: 60-61
Audience: Junior developers with no prior context
Estimated Reading Time: 25 minutes
How to Use This Document
This is Part 10A—the first of three sub-parts covering Operations & Monitoring:
- Part 10A (this document): Logging & Observability
- Part 10B: Metrics & Alerting
- Part 10C: Operations & Scaling
Prerequisites: Parts 1-9 of the PRD series.
Table of Contents
- Section 60: Structured Logging
- Section 61: Distributed Tracing
Section 60: Structured Logging
60.1 Why Logging Matters
In production, you can’t attach a debugger. Logs are your primary tool for understanding what’s happening:
| Scenario | Without Good Logging | With Good Logging |
|---|---|---|
| Call dropped | “Something failed” | “STT timeout after 5s, call_id=abc123, tenant=xyz” |
| Slow response | “It’s slow sometimes” | “LLM TTFB P95 = 2.3s, model=sonnet, prompt_tokens=4521” |
| Customer complaint | Hours of investigation | Query by call_id, see full timeline |
60.2 Structured Logging Principles
Structured logs use key-value pairs instead of free-form text:
Bad: Unstructured

```python
logger.info(f"Processing call {call_id} for tenant {tenant_id}")
```

Good: Structured

```python
logger.info("processing_call", call_id=call_id, tenant_id=tenant_id)
```

The logging helpers are used like this:

```python
# Context variable for log context
# Convenience function to set call context

# Configure logging at service startup
configure_logging("agent-service", level="INFO")

# Create contextual logger
logger = ContextualLogger("agent-service.pipeline")

async def handle_incoming_call(call_context):
    """Handle an incoming call with proper logging."""
    ...
```

Configure at startup
```python
tracer = configure_tracing("agent-service")
```

FastAPI middleware for automatic propagation
```python
import time
from contextlib import contextmanager
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, Info

# Create custom registry
REGISTRY = CollectorRegistry()

# --- SERVICE INFO ---
SERVICE_INFO = Info("voiceai_service", "Service information", registry=REGISTRY)

# --- CALL METRICS ---
CALLS_TOTAL = Counter(
    "voiceai_calls_total", "Total number of calls",
    ["tenant_id", "direction", "status"], registry=REGISTRY,
)
CALLS_ACTIVE = Gauge(
    "voiceai_calls_active", "Currently active calls",
    ["tenant_id"], registry=REGISTRY,
)
CALL_DURATION_SECONDS = Histogram(
    "voiceai_call_duration_seconds", "Call duration in seconds",
    ["tenant_id", "direction"],
    buckets=[30, 60, 120, 300, 600, 1200, 1800, 3600], registry=REGISTRY,
)

# --- LATENCY METRICS ---
# Latency buckets optimized for voice (in seconds)
LATENCY_BUCKETS = [0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 5.0]
STT_LATENCY = Histogram(
    "voiceai_stt_latency_seconds", "Speech-to-text latency",
    ["provider"], buckets=LATENCY_BUCKETS, registry=REGISTRY,
)
LLM_TTFB = Histogram(
    "voiceai_llm_ttfb_seconds", "LLM time to first byte",
    ["model"], buckets=LATENCY_BUCKETS, registry=REGISTRY,
)
LLM_TOTAL_LATENCY = Histogram(
    "voiceai_llm_total_latency_seconds", "LLM total generation time",
    ["model"], buckets=[0.5, 1, 2, 3, 5, 10, 15, 20, 30], registry=REGISTRY,
)
TTS_LATENCY = Histogram(
    "voiceai_tts_latency_seconds", "Text-to-speech latency",
    ["provider"], buckets=LATENCY_BUCKETS, registry=REGISTRY,
)
E2E_LATENCY = Histogram(
    "voiceai_e2e_latency_seconds", "End-to-end turn latency (mouth to ear)",
    ["tenant_id"], buckets=LATENCY_BUCKETS, registry=REGISTRY,
)
RAG_LATENCY = Histogram(
    "voiceai_rag_latency_seconds", "RAG retrieval latency",
    buckets=[0.05, 0.1, 0.2, 0.3, 0.5, 1.0], registry=REGISTRY,
)

# --- ERROR METRICS ---
ERRORS_TOTAL = Counter(
    "voiceai_errors_total", "Total errors by type",
    ["service", "error_type"], registry=REGISTRY,
)
STT_ERRORS = Counter(
    "voiceai_stt_errors_total", "STT errors",
    ["provider", "error_type"], registry=REGISTRY,
)
LLM_ERRORS = Counter(
    "voiceai_llm_errors_total", "LLM errors",
    ["model", "error_type"], registry=REGISTRY,
)
TTS_ERRORS = Counter(
    "voiceai_tts_errors_total", "TTS errors",
    ["provider", "error_type"], registry=REGISTRY,
)

# --- PIPELINE METRICS ---
PIPELINE_STATE = Gauge(
    "voiceai_pipeline_state", "Current pipeline state (encoded)",
    ["call_id"], registry=REGISTRY,
)
VAD_DETECTIONS = Counter(
    "voiceai_vad_detections_total", "VAD detection events",
    ["event_type"],  # speech_start, speech_end
    registry=REGISTRY,
)
BARGE_INS = Counter(
    "voiceai_barge_ins_total", "Barge-in events",
    ["tenant_id"], registry=REGISTRY,
)
TURNS_TOTAL = Counter(
    "voiceai_turns_total", "Conversation turns",
    ["tenant_id", "role"],  # user, assistant
    registry=REGISTRY,
)

# --- RESOURCE METRICS ---
DB_CONNECTIONS_ACTIVE = Gauge(
    "voiceai_db_connections_active", "Active database connections",
    registry=REGISTRY,
)
REDIS_CONNECTIONS_ACTIVE = Gauge(
    "voiceai_redis_connections_active", "Active Redis connections",
    registry=REGISTRY,
)
QUEUE_SIZE = Gauge(
    "voiceai_queue_size", "Queue size by queue name",
    ["queue_name"], registry=REGISTRY,
)

# --- TOKEN METRICS ---
LLM_TOKENS = Counter(
    "voiceai_llm_tokens_total", "LLM tokens used",
    ["tenant_id", "model", "type"],  # type: input, output
    registry=REGISTRY,
)
EMBEDDING_TOKENS = Counter(
    "voiceai_embedding_tokens_total", "Embedding tokens used",
    ["tenant_id"], registry=REGISTRY,
)

# --- HELPER FUNCTIONS ---
@contextmanager
def measure_latency(histogram, labels: dict = None):
    """Context manager to measure latency."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if labels:
            histogram.labels(**labels).observe(elapsed)
        else:
            histogram.observe(elapsed)
```

Usage in pipeline
```python
async def process_turn(audio, metrics: MetricsRecorder):
    """Process a conversation turn with metrics."""
    ...
```

alerts/voice-ai-alerts.yaml
```yaml
groups:
  - name: voice-ai-critical
    rules:
      # --- LATENCY ALERTS ---
      # --- ERROR ALERTS ---
      # --- AVAILABILITY ALERTS ---
      - alert: NoActiveCalls
        expr: |
          sum(voiceai_calls_active) == 0
          and hour() >= 9 and hour() <= 17
          and day_of_week() >= 1 and day_of_week() <= 5
        for: 30m
        labels:
          severity: warning
          team: voice-platform
        annotations:
          summary: "No active calls during business hours"
          description: "No calls for 30 minutes during peak hours"
      # --- CAPACITY ALERTS ---
      # --- COST ALERTS ---
  - name: voice-ai-slo
    rules:
      # --- SLO ALERTS ---
```
alertmanager/config.yaml

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

receivers:
  - name: 'default'
    slack_configs:
  - name: 'pagerduty-critical'
    pagerduty_configs:
  - name: 'slack-critical'
    slack_configs:
      - actions:
          - type: button
            text: 'View in Grafana'
            url: 'https://grafana.internal/d/voice-ai'
  - name: 'slack-warning'
    slack_configs:

inhibit_rules:
  # Don't alert on warnings if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']
```

63.4 Alert Response Procedures
Alert response automation.
Example actions
Runbook: High E2E Latency
Alert
HighE2ELatency: P95 end-to-end latency exceeds 1.5 seconds
Impact
- Users experience delayed AI responses
- Conversation feels unnatural
- Potential call abandonment