Audio
SambaNova’s first speech reasoning model on SambaNova Cloud extends our multimodal AI capabilities beyond vision to advanced audio processing and understanding. The model offers OpenAI-compatible endpoints that enable real-time reasoning, transcription, and translation.
This model is currently provided as a beta model.
The Qwen2-Audio Instruct model
- Model: Qwen2-Audio Instruct
- Description: Instruction-tuned large audio language model, built on Qwen-7B with a Whisper-large-v3 audio encoder (8.2B parameters).
- Model ID: `qwen2-audio-7b-instruct`
- Supported languages: Multilingual
Core capabilities
- Transform audio into intelligence: Build GPT-4-like voice applications quickly.
- Direct question answering for any audio input.
- Comprehensive audio processing: real-time conversation, transcription, translation, and analysis through a single unified model.
Customization and control
- System-level prompts: Use the Assistant Prompt in the request to customize model behavior for specific requirements, such as:
  - Brand-specific formatting (e.g., “BrandName” vs. “brandname”).
  - Domain-specific terminology.
  - Response style and tone control.
  See the `messages` parameter in the Request parameters section for more details.
- View the Audio reasoning, Translation, and Transcription API endpoint documents for more details.
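The system-prompt customization above can be sketched as a chat-completion payload. This is a minimal illustration assembled from the Request parameters table below; the exact content-part field names (`type`, `audio_content`) are assumptions based on that table, not a verified schema, and the helper name `build_audio_chat_request` is hypothetical.

```python
import base64
import json


def build_audio_chat_request(audio_bytes, question, system_prompt=None):
    """Assemble a chat-completion payload for qwen2-audio-7b-instruct.

    The message shape follows the `messages` description in the Request
    parameters table (role, type, base64 audio_content); treat it as a
    sketch, not a confirmed wire format.
    """
    messages = []
    if system_prompt:
        # System-level prompt steering, e.g. brand-specific formatting.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({
        "role": "user",
        "content": [
            {"type": "audio_content",
             "audio_content": base64.b64encode(audio_bytes).decode("ascii")},
            {"type": "text", "text": question},
        ],
    })
    return {"model": "qwen2-audio-7b-instruct", "messages": messages}


payload = build_audio_chat_request(
    b"\x00\x01",  # stand-in for real audio bytes
    "What is said in this clip?",
    system_prompt='Always spell the product name as "BrandName".')
print(json.dumps(payload, indent=2))
```

The resulting JSON object would then be POSTed to the audio chat endpoint; see the endpoint documents referenced above for the actual URL and authentication.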
Audio processing
- Silence detection: Intelligent identification of meaningful pauses and gaps in speech.
- Noise cancellation: Advanced noise filtering and clean audio processing.
- Multilingual processing: Support for multiple languages with automatic language detection.
Analysis capabilities
- Sentiment analysis: Detects and analyzes emotional content in speech.
- Multi-speaker handling: Processes conversations with multiple participants.
- Mixed audio understanding: Comprehends speech, music, and environmental sounds.
Speech recognition performance numbers
- Metrics are taken from published Qwen2-Audio paper benchmarks.
- WER% (word error rate); lower is better.

| Language | Dataset | Qwen2-Audio | Whisper-large-v3 | Improvement |
|---|---|---|---|---|
| English | Common Voice 15 | 8.6% | 9.3% | +7.5% |
| Chinese | Common Voice 15 | 6.9% | 12.8% | +46.1% |
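The Improvement column above is the relative WER reduction versus the Whisper-large-v3 baseline. A quick check of the arithmetic:

```python
def relative_wer_improvement(baseline_wer, model_wer):
    """Percent reduction in word error rate relative to the baseline."""
    return (baseline_wer - model_wer) / baseline_wer * 100


# Figures from the table above.
print(round(relative_wer_improvement(9.3, 8.6), 1))   # English  -> 7.5
print(round(relative_wer_improvement(12.8, 6.9), 1))  # Chinese  -> 46.1
```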
Request parameters

| Parameter | Type | Default | Description | Endpoints |
|---|---|---|---|---|
| model | String | Required | The ID of the model to use. Only Qwen2-Audio-7B-Instruct is currently available. | All |
| messages | Message | Required | A list of messages containing role (user/system/assistant), type (text/audio_content), and audio_content (base64-encoded audio). | All |
| response_format | String | JSON | The output format, either JSON or text. | All |
| temperature | Number | 0 | Sampling temperature between 0 and 1. Higher values (e.g., 0.8) increase randomness, while lower values (e.g., 0.2) make output more focused. | All |
| max_tokens | Number | 1000 | The maximum number of tokens to generate. | All |
| file | File | Required | Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. Each file must not exceed 30 seconds in duration. | All |
| language | String | Optional | The target language for transcription or translation. | Transcription, Translation |
| stream | Boolean | False | Enables streaming responses. | All |
| stream_options | Object | Optional | Additional streaming configuration (e.g., {"include_usage": true}). | All |
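Because each uploaded file is limited to 30 seconds, it is worth validating clip length client-side before calling the transcription or translation endpoints. A minimal sketch for WAV input using only the Python standard library (other formats in the table would need a decoding library):

```python
import io
import wave

MAX_CLIP_SECONDS = 30  # per-file limit from the Request parameters table


def wav_duration_seconds(wav_bytes):
    """Return the duration of an in-memory WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


# Build a 2-second silent 16 kHz mono clip just to demonstrate the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 2)
clip = buf.getvalue()

duration = wav_duration_seconds(clip)
print(duration)                         # -> 2.0
print(duration <= MAX_CLIP_SECONDS)    # -> True
```

Clips longer than the limit should be split or trimmed before upload rather than sent as-is.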