Voice by aiConnected — A Developer’s Introduction
What This Document Is
This document explains Voice by aiConnected for developers who are new to voice AI, real-time systems, or this specific project. It covers what we’re building, how the pieces fit together, and what you need to understand to contribute effectively. If you’re joining this project and want to get up to speed quickly, start here.
What We’re Building
The One-Sentence Version
A platform that connects phone calls to AI agents that can have natural conversations, take actions, and transfer to humans when needed.
The One-Paragraph Version
Voice by aiConnected is a multi-tenant Voice AI platform. When someone calls a business using our service, the call is answered by an AI that listens (using speech-to-text), thinks (using a large language model), and responds (using text-to-speech). The AI has access to information about the business and can perform actions like scheduling appointments or updating CRM records. When the AI can’t handle something, it seamlessly transfers the call to a human.
The Visual Version
Core Concepts You Need to Understand
1. The Voice Pipeline
Voice AI is essentially a pipeline that transforms audio → text → response → audio:
| Stage | What Happens | Target Latency |
|---|---|---|
| STT (Speech-to-Text) | Converts caller’s voice to text | ~300ms |
| LLM (Large Language Model) | Generates a response | ~350ms |
| TTS (Text-to-Speech) | Converts response to speech | ~150ms |
| Network/Audio | Moving audio around | ~100ms |
| Total | | ~900ms |
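The budget in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary keys and the `over_budget` helper are hypothetical names, not part of the codebase.

```python
# Target latency per pipeline stage, in milliseconds (values from the table).
STAGE_BUDGET_MS = {
    "stt": 300,
    "llm": 350,
    "tts": 150,
    "network_audio": 100,
}

TOTAL_BUDGET_MS = sum(STAGE_BUDGET_MS.values())


def over_budget(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return each stage that exceeded its target, with the overshoot in ms."""
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in STAGE_BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > budget
    }
```

A helper like this is handy when profiling: feed it per-stage timings from a test call and it shows which stage is eating the budget.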
2. Streaming vs. Batch Processing
The key to achieving low latency is streaming. Instead of waiting for each stage to complete fully before starting the next, we stream data through the pipeline, so a later stage can start working on partial output from the stage before it. Each provider in our stack supports this:
- Deepgram provides interim transcription results
- Claude streams tokens as they’re generated
- Chatterbox synthesizes audio incrementally
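To make the difference concrete, here is a minimal sketch using async generators. `fake_llm` and `fake_tts` are stand-ins invented for this example; they are not the Deepgram, Claude, or Chatterbox APIs. The event log shows when the first audio chunk is produced in each mode.

```python
import asyncio
from typing import AsyncIterator

TOKENS = ["Sure,", "I", "can", "help."]


async def fake_llm(log: list[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields one token at a time."""
    for token in TOKENS:
        await asyncio.sleep(0)  # simulate waiting on the network
        log.append(f"llm:{token}")
        yield token


async def fake_tts(tokens: AsyncIterator[str], log: list[str]) -> AsyncIterator[bytes]:
    """Stand-in for incremental TTS: emits audio as soon as a token arrives."""
    async for token in tokens:
        log.append(f"tts:{token}")
        yield token.encode()


async def run_streaming() -> list[str]:
    """Each token flows into TTS as soon as it is generated."""
    log: list[str] = []
    async for _audio in fake_tts(fake_llm(log), log):
        pass
    return log


async def run_batch() -> list[str]:
    """The full LLM response is collected before TTS starts."""
    log: list[str] = []
    tokens = [token async for token in fake_llm(log)]

    async def replay() -> AsyncIterator[str]:
        for token in tokens:
            yield token

    async for _audio in fake_tts(replay(), log):
        pass
    return log
```

In the streaming log, the first `tts:` entry appears right after the first `llm:` entry. In the batch log, it appears only after every token has been generated, which is where the extra latency comes from.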
3. WebRTC
WebRTC (Web Real-Time Communication) is a set of protocols and APIs for real-time audio/video over the internet. It’s what powers video calls in your browser. Key concepts:
- Peer-to-peer: Direct connections between participants
- SDP (Session Description Protocol): How peers negotiate connection parameters
- ICE (Interactive Connectivity Establishment): How peers find a path to connect
- Tracks: Individual audio or video streams
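If you have never seen SDP, it is plain text made of `key=value` lines. The sketch below parses a hand-written, heavily simplified audio section to list the codecs on offer. The sample is an assumption for illustration; real offers carry many more attributes (ICE candidates, fingerprints, and so on), and in practice a WebRTC library does this parsing for you.

```python
# Simplified, hand-written SDP fragment (not captured from a real call).
SAMPLE_SDP = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111 0
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
"""


def audio_codecs(sdp: str) -> dict[int, tuple[str, int]]:
    """Map RTP payload type -> (codec name, clock rate) from a=rtpmap lines."""
    codecs: dict[int, tuple[str, int]] = {}
    for line in sdp.splitlines():
        if not line.startswith("a=rtpmap:"):
            continue
        payload, encoding = line[len("a=rtpmap:"):].split(" ", 1)
        # The encoding is "name/clock-rate" with an optional "/channels" suffix.
        name, clock_rate, *_channels = encoding.split("/")
        codecs[int(payload)] = (name, int(clock_rate))
    return codecs
```

Note the two clock rates: Opus at 48 kHz and PCMU (G.711 μ-law) at 8 kHz. The gap between them is one reason resampling comes up so often in this project.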
4. LiveKit
LiveKit is an open-source platform for real-time audio/video. Think of it as “Zoom/WebRTC infrastructure as a service.” Key concepts:
- Rooms: Virtual spaces where participants connect
- Participants: Entities in a room (could be users or AI agents)
- Tracks: Audio or video streams published by participants
- LiveKit Agents SDK: Framework for building AI agents that participate in rooms
How we use it:
- Each phone call creates a LiveKit room
- The WebRTC bridge joins as one participant (representing the caller)
- The AI agent joins as another participant
- Audio flows between them through the room
5. State Machines
Phone calls have states: ringing, connected, on hold, transferred, ended. Managing these transitions correctly is crucial. We keep call state in Redis because:
- It’s fast (in-memory)
- It’s ephemeral (call state doesn’t need to persist forever)
- It supports pub/sub (for real-time notifications)
6. Multi-Tenancy
Multiple businesses (tenants) use the same platform. Each tenant has:
- Their own phone numbers
- Their own AI configuration (personality, knowledge base, tools)
- Their own usage tracking and billing
- Isolated data (Tenant A can’t see Tenant B’s calls)
Isolation is enforced through:
- Tenant IDs on all database records
- Scoped API keys
- Request-level tenant context
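As a rough illustration of request-level tenant context, the sketch below uses Python’s `contextvars` so that every query made while handling a request is filtered by tenant ID. The names `current_tenant`, `CallRecord`, `list_calls`, and `handle_request` are hypothetical, and the in-memory list stands in for a real database table.

```python
from contextvars import ContextVar
from dataclasses import dataclass

# Holds the tenant for the request currently being handled.
current_tenant: ContextVar[str] = ContextVar("current_tenant")


@dataclass(frozen=True)
class CallRecord:
    tenant_id: str
    call_id: str


# Hypothetical in-memory stand-in for a database table.
CALLS = [
    CallRecord("tenant-a", "call-1"),
    CallRecord("tenant-b", "call-2"),
    CallRecord("tenant-a", "call-3"),
]


def list_calls() -> list[CallRecord]:
    """Return only the calls that belong to the tenant in the current context."""
    tenant_id = current_tenant.get()  # raises LookupError if no tenant is set
    return [call for call in CALLS if call.tenant_id == tenant_id]


def handle_request(tenant_id: str) -> list[str]:
    """Set the tenant for the duration of one request, then reset it."""
    token = current_tenant.set(tenant_id)
    try:
        return [call.call_id for call in list_calls()]
    finally:
        current_tenant.reset(token)
```

The useful property is that `list_calls` fails loudly when no tenant is set, so a query cannot accidentally run unscoped.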
The Technology Stack
Languages & Frameworks
| Component | Language | Framework | Why |
|---|---|---|---|
| WebRTC Bridge | Python | aiortc | Best WebRTC library for Python |
| AI Agent | Python | LiveKit Agents SDK | Native Python SDK |
| API Gateway | Python | FastAPI | Async, fast, great docs |
| Background Workers | Python | Celery or native async | Job processing |
External Services
| Service | Purpose | Why This One |
|---|---|---|
| GoToConnect | Phone system | Existing, full API, unlimited plan |
| LiveKit Cloud | Real-time audio | Industry standard, Agents SDK |
| Deepgram | Speech-to-text | Low latency, streaming, accurate |
| Anthropic Claude | AI reasoning | Best reasoning, streaming |
| RunPod | GPU hosting | Cost-effective A5000 |
| Chatterbox | Text-to-speech | Free, MIT license, paralinguistics |
Databases & Storage
| Service | Purpose | What It Stores |
|---|---|---|
| PostgreSQL | Relational data | Tenants, configs, call logs |
| Redis | Cache & state | Call state machine, sessions |
| DO Spaces | Object storage | Voice samples, recordings |
Infrastructure
| Service | Purpose |
|---|---|
| DigitalOcean | Cloud hosting |
| Dokploy | Container orchestration |
| RunPod | GPU instances for TTS |
Project Structure
The codebase is organized by service, with shared code kept separately. The directories below are the ones you’ll touch most.
Key Files You’ll Work With
WebRTC Bridge (services/webrtc-bridge/)
This is the trickiest part of the system. It:
- Receives calls from GoToConnect via WebRTC
- Extracts audio frames
- Publishes them to a LiveKit room
- Receives AI audio from LiveKit
- Sends it back to GoToConnect
AI Agent (services/agent-service/)
This is where the magic happens. The agent:
- Joins a LiveKit room
- Listens to the audio track from the bridge
- Transcribes it with Deepgram
- Generates a response with Claude
- Synthesizes speech with Chatterbox
- Publishes the audio back to the room
Call State Machine (shared/state/)
Manages the lifecycle of each call, from ringing through connected, on hold, and transferred to ended.
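A call state machine can be sketched as an enum plus a table of allowed transitions. The state names come from this document; the transition table itself is an assumption made for illustration, and the real rules in shared/state/ may differ.

```python
from enum import Enum


class CallState(Enum):
    RINGING = "ringing"
    CONNECTED = "connected"
    ON_HOLD = "on_hold"
    TRANSFERRED = "transferred"
    ENDED = "ended"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED = {
    CallState.RINGING: {CallState.CONNECTED, CallState.ENDED},
    CallState.CONNECTED: {CallState.ON_HOLD, CallState.TRANSFERRED, CallState.ENDED},
    CallState.ON_HOLD: {CallState.CONNECTED, CallState.TRANSFERRED, CallState.ENDED},
    CallState.TRANSFERRED: {CallState.ENDED},
    CallState.ENDED: set(),
}


class InvalidTransition(Exception):
    """Raised when a call is asked to move to a state it cannot reach."""


def transition(current: CallState, target: CallState) -> CallState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the rules in one table makes illegal moves (for example, reconnecting an ended call) fail in one obvious place instead of being scattered across handlers.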
Development Workflow
Setting Up Your Environment
- Clone the repository
- Copy environment template
- Start local services
- Run migrations
- Start development servers
Testing a Call Locally
For local development, you’ll use test audio files instead of real phone calls.
Running Tests
Common Patterns You’ll See
1. Async Everything
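A minimal sketch of the async style used throughout: two I/O-bound lookups awaited concurrently with `asyncio.gather` rather than one after the other. The function names are hypothetical, and `asyncio.sleep` stands in for real network calls.

```python
import asyncio


async def fetch_tenant_config(tenant_id: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a database query
    return {"tenant_id": tenant_id, "voice": "default"}


async def fetch_caller_history(phone: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for a CRM lookup
    return [f"previous call from {phone}"]


async def prepare_call(tenant_id: str, phone: str) -> tuple[dict, list[str]]:
    # Both lookups run concurrently; the wait is roughly the slower of the two.
    config, history = await asyncio.gather(
        fetch_tenant_config(tenant_id),
        fetch_caller_history(phone),
    )
    return config, history
```

From synchronous code you would run this with `asyncio.run(prepare_call("tenant-a", "+15550100"))`. Inside a service, you simply `await` it.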
Almost all our code is async because we’re dealing with I/O-bound operations (network calls, audio streaming). Get comfortable with async/await and Python’s asyncio.
2. Dependency Injection
We use FastAPI’s dependency injection for database sessions, authentication, and tenant context.
3. Event-Driven Communication
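The idea behind event-driven communication can be sketched with a tiny in-memory publish/subscribe bus. This `EventBus` class and the `call.ended` event name are hypothetical; they stand in for whatever broker the services actually use.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class EventBus:
    """In-memory publish/subscribe, used here as a stand-in for a real broker."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    async def publish(self, event_type: str, payload: dict) -> None:
        # The publisher does not know (or care) who is listening.
        await asyncio.gather(*(h(payload) for h in self._handlers[event_type]))


async def demo() -> list[str]:
    bus = EventBus()
    seen: list[str] = []

    async def log_call(payload: dict) -> None:
        seen.append(f"logged {payload['call_id']}")

    async def bill_call(payload: dict) -> None:
        seen.append(f"billed {payload['call_id']}")

    bus.subscribe("call.ended", log_call)
    bus.subscribe("call.ended", bill_call)
    await bus.publish("call.ended", {"call_id": "call-1"})
    return seen
```

Adding a third consumer means one more `subscribe` call; the code that publishes the event does not change.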
Services communicate through events, not direct calls.
4. Circuit Breakers
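A circuit breaker counts consecutive failures and, past a threshold, rejects calls immediately for a cooling-off period instead of waiting on a service that is known to be down. The class below is a minimal sketch with assumed defaults (3 failures, 30 seconds). It is not the project’s implementation, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

You would typically keep one breaker per external dependency, so an outage in one provider does not trip calls to the others.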
External services can fail. We use circuit breakers to fail fast.
5. Graceful Degradation
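One common form of graceful degradation is a fallback chain: try the preferred provider first, then fall back to the next one. The sketch below assumes each provider is a plain callable that turns text into audio bytes; the `first_successful` helper and the provider names in the usage note are hypothetical.

```python
from typing import Callable, Sequence

Synthesizer = Callable[[str], bytes]


def first_successful(
    providers: Sequence[tuple[str, Synthesizer]],
    text: str,
) -> tuple[str, bytes]:
    """Try each provider in order; return (name, audio) from the first that works."""
    errors: list[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # sketch only; real code should catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

For example, `first_successful([("primary", primary_tts), ("backup", backup_tts)], "Hello")` returns the backup’s audio when the primary raises, and the caller can log which provider actually served the request.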
When components fail, we degrade gracefully.
Key Challenges You’ll Face
1. Latency Optimization
Every millisecond matters. You’ll need to:
- Profile everything
- Avoid blocking operations
- Stream wherever possible
- Cache aggressively
- Minimize network hops
2. Audio Quality
Telephone audio is 8kHz, muddy, and often has background noise. You’ll need to:
- Handle different audio formats
- Resample correctly
- Understand codec differences
- Deal with packet loss
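A lot of audio bugs come down to frame-size arithmetic, so it helps to be able to compute how many samples and bytes a frame should contain. The helpers below are a small sketch. The 20 ms frame length used in the usage note is a common packetization interval, assumed here for illustration rather than taken from the project’s configuration.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of samples per channel in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


def frame_bytes(
    sample_rate_hz: int,
    frame_ms: int,
    bytes_per_sample: int,
    channels: int = 1,
) -> int:
    """Size in bytes of one uncompressed frame."""
    return samples_per_frame(sample_rate_hz, frame_ms) * bytes_per_sample * channels
```

For 8 kHz telephone audio, a 20 ms frame is 160 samples: 320 bytes as 16-bit PCM, or 160 bytes as 8-bit G.711. At 48 kHz, the same 20 ms is 960 samples, so a buffer that is the “wrong size” usually means a sample-rate mismatch.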
3. Conversation Flow
Natural conversations have:
- Interruptions (barge-in)
- Pauses
- Overlapping speech
- Misunderstandings
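Barge-in is the easiest of these to sketch: when the caller starts talking, the agent cancels whatever it is currently saying. The example below models playback as an asyncio task and cancels it partway through. `play_response` is a hypothetical stand-in for streaming TTS audio, and in a real agent the cancellation would be triggered by voice activity detection.

```python
import asyncio


async def play_response(chunks: list[str], played: list[str]) -> None:
    """Stand-in for streaming TTS audio to the caller, one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0)  # yield control, as real audio I/O would


async def barge_in_demo() -> list[str]:
    played: list[str] = []
    chunks = [f"chunk-{i}" for i in range(5)]
    playback = asyncio.create_task(play_response(chunks, played))

    await asyncio.sleep(0)  # let playback start
    playback.cancel()  # caller started speaking: stop talking
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected: we cancelled the playback ourselves
    return played
```

Only part of the response is played before the cancellation takes effect, which is the behavior you want: the AI stops mid-sentence and goes back to listening.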
4. Error Recovery
Lots of things can fail:
- Network issues
- API rate limits
- Audio dropout
- Service outages
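For transient failures such as a dropped connection, a common recovery strategy is to retry with exponential backoff. The helper below is a sketch under assumed defaults (3 attempts, 0.1 s base delay). It retries only on `ConnectionError`, and the `sleep` parameter is injectable so tests do not have to wait.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry a failing async operation, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Keep the total retry time well inside the latency budget: on a live call, a fast fallback is often better than a third retry.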
5. Concurrency
Multiple calls happen simultaneously. You need to:
- Avoid race conditions
- Manage connection pools
- Handle resource contention
- Scale horizontally
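One simple tool for resource contention is a semaphore that caps how many calls use a shared resource at once. The sketch below is illustrative: `handle_call` and `run_calls` are hypothetical names, and the `asyncio.sleep(0)` stands in for work on something limited, such as a connection pool.

```python
import asyncio


async def handle_call(call_id: int, slots: asyncio.Semaphore, stats: dict[str, int]) -> None:
    async with slots:  # wait for a free slot before using the shared resource
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0)  # stand-in for work on the shared resource
        stats["active"] -= 1
        stats["handled"] += 1


async def run_calls(total: int, limit: int) -> dict[str, int]:
    """Run `total` calls concurrently, with at most `limit` inside the guarded section."""
    slots = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0, "handled": 0}
    await asyncio.gather(*(handle_call(i, slots, stats) for i in range(total)))
    return stats
```

Running `asyncio.run(run_calls(10, 3))` handles all ten calls while never letting more than three into the guarded section at the same time.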
Glossary
| Term | Definition |
|---|---|
| STT | Speech-to-Text. Converting audio to text. |
| TTS | Text-to-Speech. Converting text to audio. |
| LLM | Large Language Model. AI that generates text (like Claude). |
| WebRTC | Web Real-Time Communication. Protocols and APIs for real-time audio/video. |
| SDP | Session Description Protocol. How WebRTC peers negotiate. |
| PBX | Private Branch Exchange. A business phone system. |
| PSTN | Public Switched Telephone Network. The traditional phone system. |
| SIP | Session Initiation Protocol. Protocol for VoIP calls. |
| VAD | Voice Activity Detection. Detecting when someone is speaking. |
| TTFB | Time to First Byte. How long until the first response arrives. |
| RTF | Real-Time Factor. Synthesis time divided by the duration of the audio produced. If RTF < 1, synthesis is faster than real time. |
| Barge-in | When the caller interrupts the AI while it’s speaking. |
| Warm Transfer | Transferring a call after briefing the recipient. |
| Blind Transfer | Transferring a call without briefing the recipient. |
| Multi-tenant | One system serving multiple separate customers. |
How to Get Help
Documentation
- This document — Start here for overview
- Master Project Task List — Overall project plan
- Individual specification documents — Deep dives into each component
- Skills folder — API reference for each provider
Code
- Read the tests — Tests show how things are supposed to work
- Read the types — Type hints document expected inputs/outputs
- Read the docstrings — Functions should explain what they do
People
- Ask questions — No question is too basic
- Review PRs — See how others solve problems
- Pair program — Learn by doing together
Your First Tasks
If you’re new to the project, here are good starting points:
Beginner
- Set up your local development environment
- Run the test suite and make sure everything passes
- Read through the API gateway routes to understand the API surface
- Add a simple new endpoint (e.g., health check with more details)
Intermediate
- Add a new tool that the AI can call (e.g., check business hours)
- Improve error messages in a service
- Add metrics/logging to an existing component
- Write integration tests for an existing feature
Advanced
- Implement a new call feature (e.g., call recording)
- Optimize latency in the voice pipeline
- Add a new STT/TTS provider as a fallback
- Implement a complex state machine transition
Final Thoughts
Voice AI is a fascinating intersection of several technologies:
- Real-time systems
- AI/ML
- Telecommunications
- Distributed systems
Last updated: 2026-01-16