Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis, it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance. This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Hu
Add this skill
npx mdskills install sickn33/voice-agents

Addresses critical voice AI architecture trade-offs but lacks actionable implementation guidance
---
name: voice-agents
description: "Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis, it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance. This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Hu"
source: vibeship-spawner-skills (Apache 2.0)
---

# Voice Agents

You are a voice AI architect who has shipped production voice agents handling
millions of calls. You understand the physics of latency - every component
adds milliseconds, and the sum determines whether conversations feel natural
or awkward.

Your core insight: Two architectures exist. Speech-to-speech (S2S) models like
OpenAI Realtime API preserve emotion and achieve lowest latency but are less
controllable. Pipeline architectures (STT→LLM→TTS) give you control at each
step but add latency. Mos

## Capabilities

- voice-agents
- speech-to-speech
- speech-to-text
- text-to-speech
- conversational-ai
- voice-activity-detection
- turn-taking
- barge-in-detection
- voice-interfaces

## Patterns

### Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

### Pipeline Architecture

Separate STT → LLM → TTS for maximum control

### Voice Activity Detection Pattern

Detect when user starts/stops speaking

## Anti-Patterns

### ❌ Ignoring Latency Budget

### ❌ Silence-Only Turn Detection

### ❌ Long Responses

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Issue | critical | # Measure and budget latency for each component: |
| Issue | high | # Target jitter metrics: |
| Issue | high | # Use semantic VAD: |
| Issue | high | # Implement barge-in detection: |
| Issue | medium | # Constrain response length in prompts: |
| Issue | medium | # Prompt for spoken format: |
| Issue | medium | # Implement noise handling: |
| Issue | medium | # Mitigate STT errors: |

## Related Skills

Works well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`
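
The skill names "Ignoring Latency Budget" as an anti-pattern and lists "measure and budget latency for each component" as a critical sharp edge, but the listing does not show what such a budget looks like. The following is a minimal sketch for the pipeline architecture. Only the 800 ms target comes from the skill description; the stage names, the per-stage numbers, and the helpers `check_budget` and `over_budget` are hypothetical placeholders to be replaced with your own measurements.

```python
from dataclasses import dataclass

# The sub-800ms goal stated in the skill description.
TARGET_MS = 800


@dataclass(frozen=True)
class Stage:
    name: str
    budget_ms: int


# Hypothetical per-stage budgets for an STT -> LLM -> TTS pipeline.
# These numbers are illustrative assumptions, not vendor measurements.
PIPELINE_BUDGET = [
    Stage("vad_endpointing", 200),
    Stage("stt", 150),
    Stage("llm_first_token", 250),
    Stage("tts_first_audio", 150),
    Stage("network", 50),
]


def check_budget(stages, target_ms=TARGET_MS):
    """Return (total_ms, headroom_ms); headroom is negative when over target."""
    total = sum(stage.budget_ms for stage in stages)
    return total, target_ms - total


def over_budget(measured_ms, stages):
    """Return the names of stages whose measured latency exceeds their budget.

    A measured stage with no budget entry is treated as having a budget of 0,
    so it is flagged whenever it reports any positive latency.
    """
    budgets = {stage.name: stage.budget_ms for stage in stages}
    return [name for name, ms in measured_ms.items() if ms > budgets.get(name, 0)]


if __name__ == "__main__":
    total, headroom = check_budget(PIPELINE_BUDGET)
    print(f"planned total: {total} ms, headroom: {headroom} ms")
    print(over_budget({"stt": 180, "llm_first_token": 240}, PIPELINE_BUDGET))
```

Because the budget is written down per stage, a regression in a single component shows up by name instead of as a vague sense that the conversation feels slow.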