Add this skill:

```bash
npx mdskills install sickn33/azure-speech-to-text-rest-py
```

Comprehensive REST API skill with clear examples, error handling, and proper scope limitations.
---
name: azure-speech-to-text-rest-py
description: |
  Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
  Triggers: "speech to text REST", "short audio transcription", "speech recognition REST API", "STT REST", "recognize speech REST".
  DO NOT USE FOR: Long audio (>60 seconds), real-time streaming, batch transcription, custom speech models, speech translation. Use Speech SDK or Batch Transcription API instead.
---

# Azure Speech to Text REST API for Short Audio

Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.

## Prerequisites

1. **Azure subscription** - [Create one free](https://azure.microsoft.com/free/)
2. **Speech resource** - Create in [Azure Portal](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)
3. **Get credentials** - After deployment, go to resource > Keys and Endpoint

## Environment Variables

```bash
# Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope

# Alternative: Use endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
```

## Installation

```bash
pip install requests
```

## Quick Start

```python
import os
import requests

def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
    """Transcribe a short audio file (max 60 seconds) using the REST API."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {
        "language": language,
        "format": "detailed"  # or "simple"
    }

    with open(audio_file_path, "rb") as audio_file:
        response = requests.post(url, headers=headers, params=params, data=audio_file)

    response.raise_for_status()
    return response.json()

# Usage
result = transcribe_audio("audio.wav", "en-US")
print(result["DisplayText"])
```

## Audio Requirements

| Format | Codec | Sample Rate | Notes |
|--------|-------|-------------|-------|
| WAV | PCM | 16 kHz, mono | **Recommended** |
| OGG | OPUS | 16 kHz, mono | Smaller file size |

**Limitations:**
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)

## Content-Type Headers

```python
# WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"

# OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"
```

## Response Formats

### Simple Format (default)

```python
params = {"language": "en-US", "format": "simple"}
```

```json
{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": "1236645672289",
  "Duration": "1236645672289"
}
```

### Detailed Format

```python
params = {"language": "en-US", "format": "detailed"}
```

```json
{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
    {
      "Confidence": 0.9052885,
      "Display": "What's the weather like?",
      "ITN": "what's the weather like",
      "Lexical": "what's the weather like",
      "MaskedITN": "what's the weather like"
    }
  ]
}
```

## Chunked Transfer (Recommended)

For lower latency, stream audio in chunks:

```python
import os
import requests

def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
    """Stream audio in chunks for lower latency."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue"
    }

    params = {"language": language, "format": "detailed"}

    def generate_chunks(file_path: str, chunk_size: int = 1024):
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    response = requests.post(
        url,
        headers=headers,
        params=params,
        data=generate_chunks(audio_file_path)
    )

    response.raise_for_status()
    return response.json()
```

## Authentication Options

### Option 1: Subscription Key (Simple)

```python
headers = {
    "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
}
```

### Option 2: Bearer Token

```python
import os
import requests

def get_access_token() -> str:
    """Get an access token from the token endpoint."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

    response = requests.post(
        token_url,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/x-www-form-urlencoded",
            "Content-Length": "0"
        }
    )
    response.raise_for_status()
    return response.text

# Use the token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json"
}
```

## Query Parameters

| Parameter | Required | Values | Description |
|-----------|----------|--------|-------------|
| `language` | **Yes** | `en-US`, `de-DE`, etc. | Language of the speech |
| `format` | No | `simple`, `detailed` | Result format (default: `simple`) |
| `profanity` | No | `masked`, `removed`, `raw` | Profanity handling (default: `masked`) |

## Recognition Status Values

| Status | Description |
|--------|-------------|
| `Success` | Recognition succeeded |
| `NoMatch` | Speech was detected but no words matched |
| `InitialSilenceTimeout` | Only silence was detected |
| `BabbleTimeout` | Only noise was detected |
| `Error` | Internal service error |

## Profanity Handling

```python
# Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}

# Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}

# Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}
```

## Error Handling

```python
import os
import requests

def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
    """Transcribe with proper error handling."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    try:
        with open(audio_path, "rb") as audio_file:
            response = requests.post(
                url,
                headers={
                    "Ocp-Apim-Subscription-Key": api_key,
                    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                    "Accept": "application/json"
                },
                params={"language": language, "format": "detailed"},
                data=audio_file
            )

        if response.status_code == 200:
            result = response.json()
            if result.get("RecognitionStatus") == "Success":
                return result
            else:
                print(f"Recognition failed: {result.get('RecognitionStatus')}")
                return None
        elif response.status_code == 400:
            print("Bad request: check the language code or audio format")
        elif response.status_code == 401:
            print("Unauthorized: check the API key or token")
        elif response.status_code == 403:
            print("Forbidden: missing authorization header")
        else:
            print(f"Error {response.status_code}: {response.text}")

        return None

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
```

## Async Version

```python
import asyncio
import os

import aiohttp

async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
    """Async version using aiohttp."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {"language": language, "format": "detailed"}

    async with aiohttp.ClientSession() as session:
        with open(audio_file_path, "rb") as f:
            audio_data = f.read()

        async with session.post(url, headers=headers, params=params, data=audio_data) as response:
            response.raise_for_status()
            return await response.json()

# Usage
result = asyncio.run(transcribe_async("audio.wav", "en-US"))
print(result["DisplayText"])
```

## Supported Languages

Common language codes (see [full list](https://learn.microsoft.com/azure/ai-services/speech-service/language-support)):

| Code | Language |
|------|----------|
| `en-US` | English (US) |
| `en-GB` | English (UK) |
| `de-DE` | German |
| `fr-FR` | French |
| `es-ES` | Spanish (Spain) |
| `es-MX` | Spanish (Mexico) |
| `zh-CN` | Chinese (Mandarin) |
| `ja-JP` | Japanese |
| `ko-KR` | Korean |
| `pt-BR` | Portuguese (Brazil) |

## Best Practices

1. **Use WAV PCM 16 kHz mono** for best compatibility
2. **Enable chunked transfer** for lower latency
3. **Cache access tokens** for 9 minutes (tokens are valid for 10)
4. **Specify the correct language** for accurate recognition
5. **Use the detailed format** when you need confidence scores
6. **Handle all RecognitionStatus values** in production code

## When NOT to Use This API

Use the Speech SDK or the Batch Transcription API instead when you need:

- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files

## Reference Files

| File | Contents |
|------|----------|
| [references/pronunciation-assessment.md](references/pronunciation-assessment.md) | Pronunciation assessment parameters and scoring |
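The token-caching best practice can be sketched as a thin wrapper around the token endpoint. This is a minimal sketch, not part of the Azure API: `fetch_speech_token` and `make_token_cache` are illustrative names, and the 540-second TTL leaves a one-minute safety margin on the 10-minute token lifetime.

```python
import os
import time

import requests


def fetch_speech_token() -> str:
    """Request a fresh token from the issueToken endpoint (valid for 10 minutes)."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    response = requests.post(url, headers={"Ocp-Apim-Subscription-Key": api_key})
    response.raise_for_status()
    return response.text


def make_token_cache(fetch_token, ttl_seconds: float = 540.0, clock=time.monotonic):
    """Wrap a token fetcher so each token is reused for ttl_seconds (~9 minutes)."""
    state = {"token": None, "expires_at": 0.0}

    def get_token() -> str:
        now = clock()
        # Refresh only when there is no token yet or the TTL has elapsed
        if state["token"] is None or now >= state["expires_at"]:
            state["token"] = fetch_token()
            state["expires_at"] = now + ttl_seconds
        return state["token"]

    return get_token


# get_token() hits the endpoint at most once per ~9 minutes
get_token = make_token_cache(fetch_speech_token)
```

Injecting the fetcher and clock keeps the cache logic testable without network access; in production code you would call `get_token()` wherever a `Bearer` header is built.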
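Since rejected requests often trace back to audio-format mismatches, a quick local check against the audio requirements can save a round trip. This sketch uses only the standard-library `wave` module and assumes uncompressed WAV/PCM input; `check_wav_requirements` is a hypothetical helper, not part of the API.

```python
import wave


def check_wav_requirements(path: str, max_seconds: float = 60.0) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable.

    Checks the WAV/PCM constraints for this API: 16 kHz sample rate,
    mono audio, and the 60-second duration limit.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            problems.append(f"duration {duration:.1f}s exceeds {max_seconds:.0f}s limit")
    return problems
```

A call such as `check_wav_requirements("audio.wav")` before posting lets you report format problems locally instead of decoding a 400 response from the service.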