---
name: azure-ai-voicelive-py
description: Build real-time voice AI applications using Azure AI Voice Live SDK (azure-ai-voicelive). Use this skill when creating Python applications that need real-time bidirectional audio communication with Azure AI, including voice assistants, voice-enabled chatbots, real-time speech-to-speech translation, voice-driven avatars, or any WebSocket-based audio streaming with AI models. Supports Server VAD (Voice Activity Detection), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
package: azure-ai-voicelive
---

# Azure AI Voice Live SDK

Build real-time voice AI applications with bidirectional WebSocket communication.

## Installation

```bash
pip install azure-ai-voicelive aiohttp azure-identity
```

## Environment Variables

```bash
AZURE_COGNITIVE_SERVICES_ENDPOINT=https://<region>.api.cognitive.microsoft.com
# For API key auth (not recommended for production)
AZURE_COGNITIVE_SERVICES_KEY=<api-key>
```

## Authentication

**DefaultAzureCredential (preferred)**:
```python
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
```

**API Key**:
```python
import os

from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async with connect(
    endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_COGNITIVE_SERVICES_KEY"]),
    model="gpt-4o-realtime-preview"
) as conn:
    ...
```

## Quick Start

```python
import asyncio
import os

from azure.ai.voicelive.aio import connect
from azure.identity.aio import DefaultAzureCredential

async def main():
    async with connect(
        endpoint=os.environ["AZURE_COGNITIVE_SERVICES_ENDPOINT"],
        credential=DefaultAzureCredential(),
        model="gpt-4o-realtime-preview",
        credential_scopes=["https://cognitiveservices.azure.com/.default"]
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })

        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
```

## Core Architecture

### Connection Resources

The `VoiceLiveConnection` exposes these resources:

| Resource | Purpose | Key Methods |
|----------|---------|-------------|
| `conn.session` | Session configuration | `update(session=...)` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
| `conn.transcription_session` | Transcription config | `update(session=...)` |

## Session Configuration

```python
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
```

## Audio Streaming

### Send Audio (Base64 PCM16)

```python
import base64

# Read audio chunk (16-bit PCM, 24kHz mono); read_audio_from_microphone()
# is a placeholder for your audio-capture code
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()

await conn.input_audio_buffer.append(audio=b64_audio)
```

### Receive Audio

```python
import base64

async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)  # play_audio() is a placeholder
    elif event.type == "response.audio.done":
        print("Audio complete")
```

## Event Handling

```python
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")

        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")

        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")

        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")

        # Function calls
        case "response.function_call_arguments.done":
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()

        # Errors
        case "error":
            print(f"Error: {event.error.message}")
```

## Common Patterns

### Manual Turn Mode (No VAD)

```python
await conn.session.update(session={"turn_detection": None})

# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # End of user turn
await conn.response.create()  # Trigger response
```

### Interrupt Handling

```python
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
```

### Conversation History

```python
# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user",
    "content": [{"type": "input_text", "text": "Hello!"}]
})

await conn.response.create()
```

## Voice Options

| Voice | Description |
|-------|-------------|
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `shimmer` | Clear, professional |
| `sage` | Calm, authoritative |
| `coral` | Friendly, upbeat |
| `ash` | Deep, measured |
| `ballad` | Expressive |
| `verse` | Storytelling |

Azure voices: Use `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.

## Audio Formats

| Format | Sample Rate | Use Case |
|--------|-------------|----------|
| `pcm16` | 24kHz | Default, high quality |
| `pcm16-8000hz` | 8kHz | Telephony |
| `pcm16-16000hz` | 16kHz | Voice assistants |
| `g711_ulaw` | 8kHz | Telephony (US) |
| `g711_alaw` | 8kHz | Telephony (EU) |

## Turn Detection Options

```python
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English optimized
{"type": "azure_semantic_vad_multilingual"}
```

## Error Handling

```python
from azure.ai.voicelive.aio import ConnectionError, ConnectionClosed

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
```

## References

- **Detailed API Reference**: See [references/api-reference.md](references/api-reference.md)
- **Complete Examples**: See [references/examples.md](references/examples.md)
- **All Models & Types**: See [references/models.md](references/models.md)
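As a supplement to the audio-streaming patterns above: `input_audio_buffer.append` takes base64-encoded PCM16, so a small pure-Python helper for chunking captured audio (and decoding `response.audio.delta` payloads back to raw bytes) can be tested independently of any connection. This is a sketch; the function names and the 4800-byte (100 ms at 24 kHz mono) default are illustrative choices, not part of the SDK.

```python
import base64

def pcm16_to_base64_chunks(pcm: bytes, chunk_bytes: int = 4800) -> list[str]:
    """Split raw 16-bit PCM audio into base64 strings for
    input_audio_buffer.append(audio=...). chunk_bytes must be even so a
    2-byte sample is never split across chunks."""
    if chunk_bytes % 2 != 0:
        raise ValueError("chunk_bytes must be even for 16-bit samples")
    return [
        base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(pcm), chunk_bytes)
    ]

def base64_to_pcm16(b64_delta: str) -> bytes:
    """Decode a response.audio.delta payload back to raw PCM bytes."""
    return base64.b64decode(b64_delta)
```

Decoding every chunk and concatenating reproduces the original PCM stream, which makes the round trip easy to verify before wiring in real microphone capture.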
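The function-call branch of the event handler above can also be factored into a connection-free helper that parses the model's JSON-encoded arguments and builds the `function_call_output` item dict. A minimal sketch, assuming a hypothetical handler registry (`HANDLERS` and the `get_weather` stub are illustrative, not SDK names):

```python
import json

# Illustrative registry mapping tool names to plain-Python handlers.
HANDLERS = {
    "get_weather": lambda location: {"location": location, "forecast": "unknown"},
}

def build_function_output(name: str, call_id: str, arguments_json: str) -> dict:
    """Build the conversation item that answers a
    response.function_call_arguments.done event."""
    args = json.loads(arguments_json)   # the model sends arguments as a JSON string
    result = HANDLERS[name](**args)     # dispatch to the registered handler
    return {
        "type": "function_call_output",
        "call_id": call_id,
        "output": json.dumps(result),   # output must itself be a JSON string
    }
```

The returned dict is what you would pass to `conn.conversation.item.create(item=...)` before calling `conn.response.create()`.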