VoxCoach

100% Local Voice Sales Training Platform
Project Overview
VoxCoach is a voice-driven sales training simulator built for AI-First Consulting. It simulates realistic discovery calls where users practice consultative selling with LLM-driven buyer personas — all running 100% offline on Apple Silicon with sub-800ms end-to-end voice latency.
The Problem: New consultants need to practice discovery calls before engaging real prospects, but role-playing with colleagues doesn't scale and doesn't provide consistent, objective feedback.
What We Built: A real-time voice conversation with a simulated buyer who follows a realistic 6-phase discovery call flow, scored against Pollard's 7-step consultative selling framework with AI-First Consulting-specific criteria.
Technical Architecture
Voice Processing Pipeline (<800ms end-to-end)
Browser mic (16kHz PCM Int16)
→ WebSocket frame
→ VAD (Silero ONNX, <20ms)
→ STT (voxtral.c Metal GPU, <200ms)
→ LLM (Qwen 3 8B streaming, <400ms first token)
→ Sentence buffer (stream to TTS while LLM continues)
→ TTS (Kokoro MLX, <150ms first audio chunk)
→ Audio queue with cancellation (barge-in support)
→ WebSocket → Browser playback
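As a quick check on the sub-800ms target, the per-stage ceilings listed above can be added up. The sketch below assumes the four timed stages run strictly one after another up to the first audio chunk; the `stage` type and `totalBudget` helper are illustrative names, not part of VoxCoach.

```go
package main

import (
	"fmt"
	"time"
)

// stage pairs a pipeline step with the latency ceiling listed for it.
type stage struct {
	name   string
	budget time.Duration
}

// Ceilings copied from the pipeline description above.
var stages = []stage{
	{"VAD", 20 * time.Millisecond},
	{"STT", 200 * time.Millisecond},
	{"LLM first token", 400 * time.Millisecond},
	{"TTS first audio chunk", 150 * time.Millisecond},
}

// target is the stated end-to-end goal.
const target = 800 * time.Millisecond

// totalBudget sums the per-stage ceilings.
func totalBudget(ss []stage) time.Duration {
	var sum time.Duration
	for _, s := range ss {
		sum += s.budget
	}
	return sum
}

func main() {
	sum := totalBudget(stages)
	fmt.Printf("stage ceilings: %v, headroom: %v\n", sum, target-sum)
	// prints "stage ceilings: 770ms, headroom: 30ms"
}
```

Under that assumption the ceilings total 770ms, which leaves about 30ms for everything without a stated budget, such as WebSocket transport and sentence buffering.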
LLM and TTS overlap — while TTS renders one sentence, the LLM generates the next. Barge-in instantly flushes the queue and cancels in-progress synthesis.
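The overlap and barge-in behavior described here can be sketched with Go channels and a cancellable context. This is a minimal illustration and not VoxCoach's actual code: `splitSentences`, `endsSentence`, and `playReply` are hypothetical names, the sentence detection is deliberately naive (a token ending in ".", "!" or "?"), and a slice of strings stands in for the streaming LLM.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// endsSentence reports whether a token ends with a sentence terminator.
// A naive rule: it would also split on abbreviations such as "e.g.".
func endsSentence(tok string) bool {
	t := strings.TrimSpace(tok)
	return t != "" && strings.ContainsAny(t[len(t)-1:], ".!?")
}

// splitSentences reads streamed tokens and emits each sentence as soon as it
// is complete, so synthesis can start while generation continues.
// Cancelling ctx (barge-in) stops it and drops any buffered partial sentence.
func splitSentences(ctx context.Context, tokens <-chan string, sentences chan<- string) {
	defer close(sentences)
	var buf strings.Builder
	flush := func() bool {
		s := strings.TrimSpace(buf.String())
		buf.Reset()
		if s == "" {
			return true
		}
		select {
		case sentences <- s:
			return true
		case <-ctx.Done():
			return false
		}
	}
	for {
		select {
		case <-ctx.Done():
			return
		case tok, ok := <-tokens:
			if !ok {
				flush() // end of reply: emit the trailing fragment
				return
			}
			buf.WriteString(tok)
			if endsSentence(tok) && !flush() {
				return
			}
		}
	}
}

// playReply streams toks through the splitter and "plays" each sentence.
// If interruptAfter > 0, the user barges in after that many sentences.
func playReply(toks []string, interruptAfter int) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	tokens := make(chan string)
	sentences := make(chan string)
	go splitSentences(ctx, tokens, sentences)
	go func() { // stands in for the streaming LLM
		defer close(tokens)
		for _, t := range toks {
			select {
			case tokens <- t:
			case <-ctx.Done():
				return
			}
		}
	}()
	var played []string
	for s := range sentences { // stands in for TTS and playback
		played = append(played, s)
		if len(played) == interruptAfter {
			cancel() // barge-in: stop generation and pending synthesis
			break
		}
	}
	return played
}

func main() {
	toks := []string{"Hi", " there.", " What", " brings", " you", " in", "?"}
	fmt.Println(playReply(toks, 0)) // full reply, no interruption
	fmt.Println(playReply(toks, 1)) // user interrupts after the first sentence
}
```

A single context shared by every stage is one simple way to get the "instant flush": each goroutine selects on `ctx.Done()`, so one `cancel()` call unblocks all of them.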
6-Phase Discovery Call Simulation
The simulated buyer follows a realistic consultative sales flow:
- Greeting — Buyer speaks first, casual opening
- Context Sharing — Buyer explains why they're calling
- Consultant Intro — Brief AI-First Consulting overview
- Discovery — Questions about workflows, tools, pain points
- Solution Alignment — Propose bounded next step
- Close — Mutual commitments and friendly goodbye
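One way to model the six phases above is a small enumerated type with a linear transition. The document does not say how VoxCoach decides when to advance, so the strictly sequential `Next` method below is an assumption, and the `Phase` type is an illustrative name.

```go
package main

import "fmt"

// Phase is one step of the simulated discovery call.
type Phase int

const (
	Greeting Phase = iota
	ContextSharing
	ConsultantIntro
	Discovery
	SolutionAlignment
	Close
)

var phaseNames = [...]string{
	"Greeting", "Context Sharing", "Consultant Intro",
	"Discovery", "Solution Alignment", "Close",
}

// String returns the phase name used in the list above.
func (p Phase) String() string {
	if p < Greeting || p > Close {
		return fmt.Sprintf("Phase(%d)", int(p))
	}
	return phaseNames[p]
}

// Next returns the following phase; Close is terminal and returns itself.
func (p Phase) Next() Phase {
	if p >= Close {
		return Close
	}
	return p + 1
}

func main() {
	// Walk the call from the buyer's greeting to the close.
	for p := Greeting; ; p = p.Next() {
		fmt.Println(p)
		if p == Close {
			break
		}
	}
}
```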
Scoring System
Post-call evaluation on 7 AI-First criteria using Qwen 3 14B:
- Identified AI readiness level
- Scoped narrow workflow (not broad transformation)
- Understood buyer's current tech stack
- Quantified business impact (time/cost saved)
- Matched to right service (Accelerator, Custom Dev, or Fractional CTO/CAIO)
- Clear next step proposed
- Professional tone maintained
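Because the evaluation comes from an LLM, the reply is worth validating before it is shown to the user. The sketch below assumes the scoring model is asked to answer with a JSON array holding one entry per criterion. That JSON shape, the met / not-met scale, and the names `CriterionResult` and `parseScorecard` are all hypothetical, since the document does not describe the real output format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// criteria are the seven scoring criteria listed above.
var criteria = []string{
	"Identified AI readiness level",
	"Scoped narrow workflow (not broad transformation)",
	"Understood buyer's current tech stack",
	"Quantified business impact (time/cost saved)",
	"Matched to right service (Accelerator, Custom Dev, or Fractional CTO/CAIO)",
	"Clear next step proposed",
	"Professional tone maintained",
}

// CriterionResult is a hypothetical shape for one entry of the model's reply.
type CriterionResult struct {
	Criterion string `json:"criterion"`
	Met       bool   `json:"met"`
	Feedback  string `json:"feedback"`
}

// parseScorecard decodes the reply, checks that every known criterion
// appears exactly once, and returns the results plus the number met.
func parseScorecard(raw []byte) ([]CriterionResult, int, error) {
	var results []CriterionResult
	if err := json.Unmarshal(raw, &results); err != nil {
		return nil, 0, fmt.Errorf("decode scorecard: %w", err)
	}
	pending := make(map[string]bool, len(criteria))
	for _, c := range criteria {
		pending[c] = true
	}
	met := 0
	for _, r := range results {
		if !pending[r.Criterion] {
			return nil, 0, fmt.Errorf("unknown or duplicate criterion %q", r.Criterion)
		}
		delete(pending, r.Criterion)
		if r.Met {
			met++
		}
	}
	if len(pending) > 0 {
		return nil, 0, fmt.Errorf("%d criteria missing from reply", len(pending))
	}
	return results, met, nil
}

func main() {
	// An incomplete reply is rejected rather than partially scored.
	_, _, err := parseScorecard([]byte(`[{"criterion":"Clear next step proposed","met":true}]`))
	fmt.Println(err)
}
```

Rejecting incomplete or malformed replies makes it possible to retry the scoring request instead of storing a partial scorecard.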
5 Buyer Archetypes
Each persona has distinct pain points, technical literacy, and buying authority — CTO, Head of Ops, VP Engineering, and more.
Technology Stack
- Language: Go 1.24 with chi v5 HTTP router
- STT: voxtral.c (Metal GPU via CGo) with whisper.cpp fallback
- VAD: Silero VAD (ONNX Runtime)
- LLM: Qwen 3 8B (real-time conversation) / 14B (post-call scoring) via Ollama
- TTS: Kokoro-82M (MLX) with Piper TTS fallback
- Database: Supabase local stack (PostgreSQL 15 + PostgREST + GoTrue auth)
- Frontend: Vanilla JS embedded via go:embed (single binary distribution)
- WebSocket: coder/websocket with compression
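The STT and TTS entries above each name a primary engine and a fallback. A minimal sketch of that pattern for STT follows, assuming the fallback is tried whenever the primary returns an error; the document does not state the actual trigger. The `Transcriber` interface, `withFallback`, and `stubEngine` are hypothetical and stand in for the real voxtral.c and whisper.cpp bindings.

```go
package main

import (
	"errors"
	"fmt"
)

// Transcriber is a hypothetical interface over a speech-to-text engine.
type Transcriber interface {
	Transcribe(pcm []int16) (string, error)
}

// withFallback tries the primary engine and, if it returns an error,
// retries the same audio on the secondary engine.
type withFallback struct {
	primary, secondary Transcriber
}

func (f withFallback) Transcribe(pcm []int16) (string, error) {
	text, err := f.primary.Transcribe(pcm)
	if err == nil {
		return text, nil
	}
	text, err2 := f.secondary.Transcribe(pcm)
	if err2 != nil {
		// Report both failures so neither cause is lost.
		return "", errors.Join(err, err2)
	}
	return text, nil
}

// stubEngine stands in for a real engine binding in this sketch.
type stubEngine struct {
	text string
	fail bool
}

func (s stubEngine) Transcribe([]int16) (string, error) {
	if s.fail {
		return "", errors.New("engine unavailable")
	}
	return s.text, nil
}

func main() {
	stt := withFallback{
		primary:   stubEngine{fail: true},          // stands in for voxtral.c
		secondary: stubEngine{text: "hello there"}, // stands in for whisper.cpp
	}
	text, err := stt.Transcribe(make([]int16, 1600)) // 100 ms of 16 kHz audio
	fmt.Println(text, err)
}
```

The same wrapper shape would apply to TTS, with Kokoro as the primary and Piper as the secondary.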
Key Features
- 100% Offline — All audio processing, LLM inference, and scoring run locally on Apple Silicon
- Real-Time Voice — Sub-800ms end-to-end latency with overlapped LLM/TTS streaming
- Barge-In Support — Interrupt the simulated buyer mid-sentence, as in a natural conversation
- Live Coaching — Hints during the call to guide technique
- Post-Call Feedback — Detailed breakdown on each scoring criterion
- Progress Tracking — Session history and improvement trends in PostgreSQL
- Single Binary — Frontend embedded in Go binary, no separate JS build needed
