Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences. Use when: voice ai, voice agent, speech to text, text to speech, realtime voice.
Add this skill:

```shell
npx mdskills install sickn33/voice-ai-development
```

Comprehensive voice AI guide with detailed examples across multiple providers and latency best practices.
---
name: voice-ai-development
description: "Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences. Use when: voice ai, voice agent, speech to text, text to speech, realtime voice."
source: vibeship-spawner-skills (Apache 2.0)
---

# Voice AI Development

**Role**: Voice AI Architect

You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.

## Capabilities

- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization

## Requirements

- Python or Node.js
- API keys for providers
- Audio handling knowledge

## Patterns

### OpenAI Realtime API

Native voice-to-voice with GPT-4o.

**When to use**: When you want integrated voice AI without separate STT/TTS

```python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events (the original example was truncated here;
        # a minimal handler for streamed audio is shown)
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                audio_chunk = base64.b64decode(event["delta"])
                # Feed audio_chunk to your playback device here
            elif event["type"] == "response.done":
                pass  # Assistant turn complete
```

### Vapi Voice Agent

Build voice agents with the Vapi platform.

**When to use**: Phone-based agents, quick deployment

```python
# Vapi provides hosted voice agents with webhooks

from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns a URL for the WebRTC connection
```

### Deepgram STT + ElevenLabs TTS

Best-in-class transcription and synthesis.

**When to use**: High quality voice, custom pipeline

```python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # Fastest
        text=text,
        output_format="pcm_24000"  # Raw PCM for low latency
    )

    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio

        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
```

## Anti-Patterns

### ❌ Non-streaming Pipeline

**Why bad**: Waiting for each stage to finish adds seconds of latency.
The user perceives the app as slow, and the conversational flow breaks.

**Instead**: Stream everything:
- STT: interim results
- LLM: token streaming
- TTS: chunk streaming

Start TTS before the LLM finishes.

### ❌ Ignoring Interruptions

**Why bad**: A frustrating user experience.
It feels like talking at a machine and wastes the user's time.

**Instead**: Implement barge-in detection.
Use VAD to detect user speech, stop TTS immediately, and clear the audio queue.

### ❌ Single Provider Lock-in

**Why bad**: No single provider is best at everything.
It creates a single point of failure and makes optimization harder.

**Instead**: Mix best-in-class providers:
- Deepgram for STT (speed + accuracy)
- ElevenLabs for TTS (voice quality)
- OpenAI/Anthropic for the LLM

## Limitations

- Latency varies by provider
- Cost per minute adds up
- Quality depends on network conditions
- Complex debugging

## Related Skills

Works well with: `langgraph`, `structured-output`, `langfuse`
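The "stream everything" advice above can be sketched as a small piece of pipeline glue: group LLM tokens into sentence-sized chunks and hand each to TTS as soon as it completes, instead of waiting for the full reply. This is a provider-agnostic sketch; `sentence_chunks` and the token list are illustrative, not part of any SDK.

```python
def sentence_chunks(token_stream):
    """Group an LLM token stream into sentence-sized chunks so TTS
    can start on the first sentence while later ones still generate.
    Purely illustrative; a real pipeline would also flush on length
    and handle abbreviations."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever the buffer ends at a sentence boundary
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing partial sentence
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hello ", "there.", " How ", "can I ", "help?"]
print(list(sentence_chunks(tokens)))
# → ['Hello there.', 'How can I help?']
```

Each yielded chunk would be passed to something like `text_to_speech_stream` from the pattern above, so the first sentence is already playing while the model generates the rest.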
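The barge-in advice under Anti-Patterns can likewise be sketched provider-agnostically: a minimal playback controller that stops TTS and discards queued audio the moment VAD reports user speech. All names here (`PlaybackController`, `on_vad_speech_start`) are illustrative assumptions, not any provider's API.

```python
import queue

class PlaybackController:
    """Minimal barge-in handling: buffer TTS chunks so they can be
    discarded, and flush everything when the user starts talking."""

    def __init__(self):
        self.audio_queue = queue.Queue()
        self.speaking = False

    def enqueue_tts(self, chunk: bytes):
        # TTS chunks are buffered rather than played blindly,
        # so they can be dropped on interruption.
        self.speaking = True
        self.audio_queue.put(chunk)

    def on_vad_speech_start(self):
        # User started talking: stop playback immediately and drop
        # everything still queued, or stale audio keeps playing.
        if self.speaking:
            self.speaking = False
            while not self.audio_queue.empty():
                self.audio_queue.get_nowait()

ctrl = PlaybackController()
ctrl.enqueue_tts(b"chunk-1")
ctrl.enqueue_tts(b"chunk-2")
ctrl.on_vad_speech_start()       # barge-in
print(ctrl.audio_queue.qsize())  # → 0
print(ctrl.speaking)             # → False
```

The hook that calls `on_vad_speech_start` maps onto whichever VAD signal your stack exposes, e.g. Deepgram's `vad_events` or the Realtime API's `server_vad` turn detection.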