Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences. Use when: voice ai, voice agent, speech to text, text to speech, realtime voice.
Add this skill:

```shell
npx mdskills install sickn33/voice-ai-development
```

Comprehensive voice AI guide with detailed examples across multiple providers and latency best practices.
---
name: voice-ai-development
description: "Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences. Use when: voice ai, voice agent, speech to text, text to speech, realtime voice."
source: vibeship-spawner-skills (Apache 2.0)
---

# Voice AI Development

**Role**: Voice AI Architect

You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.

## Capabilities

- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization

## Requirements

- Python or Node.js
- API keys for providers
- Audio handling knowledge

## Patterns

### OpenAI Realtime API

Native voice-to-voice with GPT-4o.

**When to use**: When you want integrated voice AI without separate STT/TTS

```python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events (the original example was truncated here;
        # a minimal handler for streamed audio is shown)
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                audio_chunk = base64.b64decode(event["delta"])
                # Feed audio_chunk to your playback device here
            elif event["type"] == "response.done":
                pass  # Assistant turn complete
```

### Vapi Voice Agent

Build voice agents with the Vapi platform.

**When to use**: Phone-based agents, quick deployment

```python
# Vapi provides hosted voice agents with webhooks

from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns a URL for the WebRTC connection
```

### Deepgram STT + ElevenLabs TTS

Best-in-class transcription and synthesis.

**When to use**: High quality voice, custom pipeline

```python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # Fastest
        text=text,
        output_format="pcm_24000"  # Raw PCM for low latency
    )

    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio

        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
```

## Anti-Patterns

### ❌ Non-streaming Pipeline

**Why bad**: Waiting for each stage to finish adds seconds of latency.
The user perceives the app as slow, and the conversational flow breaks.

**Instead**: Stream everything:
- STT: interim results
- LLM: token streaming
- TTS: chunk streaming

Start TTS before the LLM finishes.

### ❌ Ignoring Interruptions

**Why bad**: A frustrating user experience.
It feels like talking at a machine and wastes the user's time.

**Instead**: Implement barge-in detection.
Use VAD to detect user speech, stop TTS immediately, and clear the audio queue.

### ❌ Single Provider Lock-in

**Why bad**: No single provider is best at everything.
It creates a single point of failure and makes optimization harder.

**Instead**: Mix best-in-class providers:
- Deepgram for STT (speed + accuracy)
- ElevenLabs for TTS (voice quality)
- OpenAI/Anthropic for the LLM

## Limitations

- Latency varies by provider
- Cost per minute adds up
- Quality depends on network conditions
- Complex debugging

## Related Skills

Works well with: `langgraph`, `structured-output`, `langfuse`
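The "stream everything" advice above can be sketched as a small piece of pipeline glue: group LLM tokens into sentence-sized chunks and hand each to TTS as soon as it completes, instead of waiting for the full reply. This is a provider-agnostic sketch; `sentence_chunks` and the token list are illustrative, not part of any SDK.

```python
def sentence_chunks(token_stream):
    """Group an LLM token stream into sentence-sized chunks so TTS
    can start on the first sentence while later ones still generate.
    Purely illustrative; a real pipeline would also flush on length
    and handle abbreviations."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever the buffer ends at a sentence boundary
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing partial sentence
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hello ", "there.", " How ", "can I ", "help?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help?']
```

Each yielded chunk would be passed to something like `text_to_speech_stream` from the pattern above, so the first sentence is already playing while the model generates the rest.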
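The barge-in advice under Anti-Patterns can likewise be sketched provider-agnostically: a minimal playback controller that stops TTS and discards queued audio the moment VAD reports user speech. All names here (`PlaybackController`, `on_vad_speech_start`) are illustrative assumptions, not any provider's API.

```python
import queue

class PlaybackController:
    """Minimal barge-in handling: buffer TTS chunks so they can be
    discarded, and flush everything when the user starts talking."""

    def __init__(self):
        self.audio_queue = queue.Queue()
        self.speaking = False

    def enqueue_tts(self, chunk: bytes):
        # TTS chunks are buffered rather than played blindly,
        # so they can be dropped on interruption.
        self.speaking = True
        self.audio_queue.put(chunk)

    def on_vad_speech_start(self):
        # User started talking: stop playback immediately and drop
        # everything still queued, or stale audio keeps playing.
        if self.speaking:
            self.speaking = False
            while not self.audio_queue.empty():
                self.audio_queue.get_nowait()

ctrl = PlaybackController()
ctrl.enqueue_tts(b"chunk-1")
ctrl.enqueue_tts(b"chunk-2")
ctrl.on_vad_speech_start()       # barge-in
print(ctrl.audio_queue.qsize())  # → 0
print(ctrl.speaking)             # → False
```

The hook that calls `on_vad_speech_start` maps onto whichever VAD signal your stack exposes, e.g. Deepgram's `vad_events` or the Realtime API's `server_vad` turn detection.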