SambaNova’s first speech reasoning model on SambaNova Cloud will extend our multimodal AI capabilities beyond vision to advanced audio processing and understanding. The model offers OpenAI-compatible endpoints that enable real-time reasoning, transcription, and translation.

The Whisper-Large-v3 model

  • Model: Whisper-Large-v3

  • Description: State-of-the-art automatic speech recognition (ASR) and translation model. Developed by OpenAI and trained on 5M+ hours of labeled audio. Excels in multilingual and zero-shot speech tasks across diverse domains.

  • Model ID: Whisper-Large-v3

  • Supported languages: Multilingual

Core capabilities

  • Transcribes and translates extended audio inputs (up to 25 MB).

  • Demonstrates high accuracy in speech recognition and translation tasks.

  • Provides OpenAI-compatible endpoints for transcriptions and translations.

Request parameters

| Parameter | Type | Description | Default | Endpoints |
| --- | --- | --- | --- | --- |
| model | String | The ID of the model to use. | Required | transcriptions, translations |
| file | File | Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. File size limit: 25 MB. | Required | transcriptions, translations |
| prompt | String | Prompt to influence transcription style or vocabulary. Example: "Please transcribe carefully, including pauses and hesitations." | Optional | transcriptions, translations |
| response_format | String | Output format: either json or text. | json | transcriptions, translations |
| language | String | The language of the input audio. Using ISO-639-1 format (e.g., en) improves accuracy and latency. | Optional | transcriptions, translations |
| stream | Boolean | Enables streaming responses. | false | transcriptions, translations |
| stream_options | Object | Additional streaming configuration (e.g., {"include_usage": true}). | Optional | transcriptions, translations |
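As a minimal sketch of how these parameters map onto an OpenAI-compatible multipart request (the base URL, endpoint path, and API key below are illustrative placeholders, not confirmed values):

```python
import io

API_KEY = "YOUR_API_KEY"                    # placeholder
BASE_URL = "https://api.sambanova.ai/v1"    # illustrative base URL

def build_transcription_request(audio_bytes, filename,
                                language=None, response_format="json"):
    """Assemble the form fields and file part for a transcriptions request."""
    data = {
        "model": "Whisper-Large-v3",         # required: model ID
        "response_format": response_format,  # "json" (default) or "text"
    }
    if language:
        data["language"] = language          # ISO-639-1 code, e.g. "en"
    files = {"file": (filename, io.BytesIO(audio_bytes), "audio/wav")}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    url = BASE_URL + "/audio/transcriptions"
    return url, headers, data, files

# The request itself would then be sent with, e.g.:
#   requests.post(url, headers=headers, data=data, files=files)
url, headers, data, files = build_transcription_request(
    b"\x00" * 16, "sample.wav", language="en")
```

For translations, the same form fields would be posted to the translations endpoint instead; only `model` and `file` are required.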

The Qwen2-Audio Instruct model

  • Model: Qwen2-Audio Instruct

  • Description: Instruction-tuned large audio language model. Built on Qwen-7B with the Whisper-large-v3 audio encoder (8.2B parameters in total).

  • Model ID: qwen2-audio-7b-instruct

  • Supported languages: Multilingual

This model is currently available in beta.

Core capabilities

  • Transforms audio into intelligence: lets you build GPT-4-style voice applications quickly.

  • Provides direct question-answering for any audio input.

  • Comprehensive audio processing that includes real-time conversation, transcription, translation, and analysis through a single unified model.

Customization and control

  • System-level prompts: Include a system message in the request to customize model behavior for specific requirements. See the messages parameter in the Request parameters section for more details.

    • Brand-specific formatting (e.g., BrandName vs brandname).

    • Domain-specific terminology.

    • Response style and tone control.

View the Audio reasoning, Translation, and Transcription API endpoint documents for more details.
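The system-prompt customization described above can be sketched as a messages list (a hypothetical example; the exact content layout within a message follows the messages parameter description below, but field nesting may differ in the live API):

```python
# Hypothetical system prompt enforcing brand-specific formatting and tone.
system_prompt = (
    "You are a transcription assistant. Always spell the product name as "
    "'BrandName' (never 'brandname'), and respond in a concise, formal tone."
)

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            # audio_content carries the base64-encoded audio clip
            {"type": "audio_content",
             "audio_content": {"content": "data:audio/wav;base64,<BASE64_AUDIO>"}},
            {"type": "text", "text": "Transcribe this clip."},
        ],
    },
]
```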

Audio processing

  • Silence detection: Intelligent identification of meaningful pauses and gaps in speech.

  • Noise cancellation: Advanced noise filtering and clean audio processing.

  • Multilingual processing: Support for multiple languages with automatic language detection.

Analysis capabilities

  • Sentiment analysis: Detects and analyzes emotional content in speech.

  • Multi-speaker handling: Processes conversations with multiple participants.

  • Mixed audio understanding: Comprehends speech, music, and environmental sounds.

Speech recognition performance numbers

  • Metrics are taken from the published Qwen2-Audio paper benchmarks.

  • WER (%); lower is better.

| Language | Dataset | Qwen2-Audio | Whisper-large-v3 | Improvement |
| --- | --- | --- | --- | --- |
| English | Common Voice 15 | 8.6% | 9.3% | +7.5% |
| Chinese | Common Voice 15 | 6.9% | 12.8% | +46.1% |
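For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch of the computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word out of six reference words:
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # -> 0.167
```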

Request parameters

| Parameter | Type | Description | Default | Endpoints |
| --- | --- | --- | --- | --- |
| model | String | The ID of the model to use. Only qwen2-audio-7b-instruct is currently available. | Required | All |
| messages | Message | A list of messages containing role (user, system, assistant), type (text or audio_content), and audio_content (base64-encoded audio). | Required | All |
| response_format | String | The output format: either json or text. | json | All |
| temperature | Number | Sampling temperature between 0 and 1. Higher values (e.g., 0.8) increase randomness; lower values (e.g., 0.2) make output more focused. | 0 | All |
| max_tokens | Number | The maximum number of tokens to generate. | 1000 | All |
| file | File | Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. Each file must not exceed 30 seconds in duration. | Required | All |
| language | String | The target language for transcription or translation. | Optional | transcriptions, translations |
| stream | Boolean | Enables streaming responses. | false | All |
| stream_options | Object | Additional streaming configuration (e.g., {"include_usage": true}). | Optional | All |
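Putting these parameters together, a request body for qwen2-audio-7b-instruct could be assembled as follows (a sketch only; the base64 data-URI wrapping and exact content-field nesting are illustrative assumptions):

```python
import base64
import json

def build_audio_chat_payload(audio_bytes: bytes, question: str) -> dict:
    """Assemble a chat-style request body for qwen2-audio-7b-instruct.
    The audio clip is sent base64-encoded; clips must not exceed 30 seconds."""
    b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": "qwen2-audio-7b-instruct",
        "messages": [
            {"role": "user",
             "content": [
                 {"type": "audio_content",
                  "audio_content": {"content": f"data:audio/wav;base64,{b64}"}},
                 {"type": "text", "text": question},
             ]},
        ],
        "temperature": 0,    # default: most deterministic output
        "max_tokens": 1000,  # default
        "stream": False,
    }

payload = build_audio_chat_payload(b"RIFF....WAVE",
                                   "What is the speaker's sentiment?")
body = json.dumps(payload)  # would be POSTed to the chat completions endpoint
```

Setting `"stream": True` (optionally with `stream_options`) would switch the same request to a streaming response.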