Add this skill:

```bash
npx mdskills install sickn33/azure-speech-to-text-rest-py
```

Comprehensive REST API skill with clear examples, error handling, and proper scope limitations.
---
name: azure-speech-to-text-rest-py
description: |
  Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK.
  Triggers: "speech to text REST", "short audio transcription", "speech recognition REST API", "STT REST", "recognize speech REST".
  DO NOT USE FOR: Long audio (>60 seconds), real-time streaming, batch transcription, custom speech models, speech translation. Use Speech SDK or Batch Transcription API instead.
---

# Azure Speech to Text REST API for Short Audio

Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.

## Prerequisites

1. **Azure subscription** - [Create one free](https://azure.microsoft.com/free/)
2. **Speech resource** - Create in [Azure Portal](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)
3. **Get credentials** - After deployment, go to resource > Keys and Endpoint

## Environment Variables

```bash
# Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region>  # e.g., eastus, westus2, westeurope

# Alternative: Use endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
```

## Installation

```bash
pip install requests
```

## Quick Start

```python
import os
import requests

def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
    """Transcribe a short audio file (max 60 seconds) using the REST API."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {
        "language": language,
        "format": "detailed"  # or "simple"
    }

    with open(audio_file_path, "rb") as audio_file:
        response = requests.post(url, headers=headers, params=params, data=audio_file)

    response.raise_for_status()
    return response.json()

# Usage
result = transcribe_audio("audio.wav", "en-US")
print(result["DisplayText"])
```

## Audio Requirements

| Format | Codec | Sample Rate | Notes |
|--------|-------|-------------|-------|
| WAV | PCM | 16 kHz, mono | **Recommended** |
| OGG | OPUS | 16 kHz, mono | Smaller file size |

**Limitations:**
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)

## Content-Type Headers

```python
# WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"

# OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"
```

## Response Formats

### Simple Format (default)

```python
params = {"language": "en-US", "format": "simple"}
```

```json
{
  "RecognitionStatus": "Success",
  "DisplayText": "Remind me to buy 5 pencils.",
  "Offset": "1236645672289",
  "Duration": "1236645672289"
}
```

### Detailed Format

```python
params = {"language": "en-US", "format": "detailed"}
```

```json
{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
    {
      "Confidence": 0.9052885,
      "Display": "What's the weather like?",
      "ITN": "what's the weather like",
      "Lexical": "what's the weather like",
      "MaskedITN": "what's the weather like"
    }
  ]
}
```

## Chunked Transfer (Recommended)

For lower latency, stream audio in chunks:

```python
import os
import requests

def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
    """Stream audio in chunks for lower latency."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
        "Transfer-Encoding": "chunked",
        "Expect": "100-continue"
    }

    params = {"language": language, "format": "detailed"}

    def generate_chunks(file_path: str, chunk_size: int = 1024):
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    response = requests.post(
        url,
        headers=headers,
        params=params,
        data=generate_chunks(audio_file_path)
    )

    response.raise_for_status()
    return response.json()
```

## Authentication Options

### Option 1: Subscription Key (Simple)

```python
headers = {
    "Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
}
```

### Option 2: Bearer Token

```python
import os
import requests

def get_access_token() -> str:
    """Get an access token from the token endpoint."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"

    response = requests.post(
        token_url,
        headers={
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/x-www-form-urlencoded",
            "Content-Length": "0"
        }
    )
    response.raise_for_status()
    return response.text

# Use the token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    "Accept": "application/json"
}
```

## Query Parameters

| Parameter | Required | Values | Description |
|-----------|----------|--------|-------------|
| `language` | **Yes** | `en-US`, `de-DE`, etc. | Language of the speech |
| `format` | No | `simple`, `detailed` | Result format (default: `simple`) |
| `profanity` | No | `masked`, `removed`, `raw` | Profanity handling (default: `masked`) |

## Recognition Status Values

| Status | Description |
|--------|-------------|
| `Success` | Recognition succeeded |
| `NoMatch` | Speech was detected but no words matched |
| `InitialSilenceTimeout` | Only silence was detected |
| `BabbleTimeout` | Only noise was detected |
| `Error` | Internal service error |

## Profanity Handling

```python
# Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}

# Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}

# Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}
```

## Error Handling

```python
import os
import requests

def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
    """Transcribe with proper error handling."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    try:
        with open(audio_path, "rb") as audio_file:
            response = requests.post(
                url,
                headers={
                    "Ocp-Apim-Subscription-Key": api_key,
                    "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
                    "Accept": "application/json"
                },
                params={"language": language, "format": "detailed"},
                data=audio_file
            )

        if response.status_code == 200:
            result = response.json()
            if result.get("RecognitionStatus") == "Success":
                return result
            else:
                print(f"Recognition failed: {result.get('RecognitionStatus')}")
                return None
        elif response.status_code == 400:
            print("Bad request: check the language code or audio format")
        elif response.status_code == 401:
            print("Unauthorized: check the API key or token")
        elif response.status_code == 403:
            print("Forbidden: missing authorization header")
        else:
            print(f"Error {response.status_code}: {response.text}")

        return None

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
```

## Async Version

```python
import asyncio
import os

import aiohttp

async def transcribe_async(audio_file_path: str, language: str = "en-US") -> dict:
    """Async version using aiohttp."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]

    url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"

    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json"
    }

    params = {"language": language, "format": "detailed"}

    async with aiohttp.ClientSession() as session:
        with open(audio_file_path, "rb") as f:
            audio_data = f.read()

        async with session.post(url, headers=headers, params=params, data=audio_data) as response:
            response.raise_for_status()
            return await response.json()

# Usage
result = asyncio.run(transcribe_async("audio.wav", "en-US"))
print(result["DisplayText"])
```

## Supported Languages

Common language codes (see [full list](https://learn.microsoft.com/azure/ai-services/speech-service/language-support)):

| Code | Language |
|------|----------|
| `en-US` | English (US) |
| `en-GB` | English (UK) |
| `de-DE` | German |
| `fr-FR` | French |
| `es-ES` | Spanish (Spain) |
| `es-MX` | Spanish (Mexico) |
| `zh-CN` | Chinese (Mandarin) |
| `ja-JP` | Japanese |
| `ko-KR` | Korean |
| `pt-BR` | Portuguese (Brazil) |

## Best Practices

1. **Use WAV PCM 16 kHz mono** for best compatibility
2. **Enable chunked transfer** for lower latency
3. **Cache access tokens** for 9 minutes (tokens are valid for 10)
4. **Specify the correct language** for accurate recognition
5. **Use the detailed format** when you need confidence scores
6. **Handle all RecognitionStatus values** in production code

## When NOT to Use This API

Use the Speech SDK or the Batch Transcription API instead when you need:

- Audio longer than 60 seconds
- Real-time streaming transcription
- Partial/interim results
- Speech translation
- Custom speech models
- Batch transcription of many files

## Reference Files

| File | Contents |
|------|----------|
| [references/pronunciation-assessment.md](references/pronunciation-assessment.md) | Pronunciation assessment parameters and scoring |
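The token-caching best practice can be sketched as a thin wrapper around the token endpoint. This is a minimal sketch, not part of the Azure API: `fetch_speech_token` and `make_token_cache` are illustrative names, and the 540-second TTL leaves a one-minute safety margin on the 10-minute token lifetime.

```python
import os
import time

import requests


def fetch_speech_token() -> str:
    """Request a fresh token from the issueToken endpoint (valid for 10 minutes)."""
    region = os.environ["AZURE_SPEECH_REGION"]
    api_key = os.environ["AZURE_SPEECH_KEY"]
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    response = requests.post(url, headers={"Ocp-Apim-Subscription-Key": api_key})
    response.raise_for_status()
    return response.text


def make_token_cache(fetch_token, ttl_seconds: float = 540.0, clock=time.monotonic):
    """Wrap a token fetcher so each token is reused for ttl_seconds (~9 minutes)."""
    state = {"token": None, "expires_at": 0.0}

    def get_token() -> str:
        now = clock()
        # Refresh only when there is no token yet or the TTL has elapsed
        if state["token"] is None or now >= state["expires_at"]:
            state["token"] = fetch_token()
            state["expires_at"] = now + ttl_seconds
        return state["token"]

    return get_token


# get_token() hits the endpoint at most once per ~9 minutes
get_token = make_token_cache(fetch_speech_token)
```

Injecting the fetcher and clock keeps the cache logic testable without network access; in production code you would call `get_token()` wherever a `Bearer` header is built.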
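Since rejected requests often trace back to audio-format mismatches, a quick local check against the audio requirements can save a round trip. This sketch uses only the standard-library `wave` module and assumes uncompressed WAV/PCM input; `check_wav_requirements` is a hypothetical helper, not part of the API.

```python
import wave


def check_wav_requirements(path: str, max_seconds: float = 60.0) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable.

    Checks the WAV/PCM constraints for this API: 16 kHz sample rate,
    mono audio, and the 60-second duration limit.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            problems.append(f"duration {duration:.1f}s exceeds {max_seconds:.0f}s limit")
    return problems
```

A call such as `check_wav_requirements("audio.wav")` before posting lets you report format problems locally instead of decoding a 400 response from the service.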