Deepgram speech-to-text and voice models now available natively on Together AI
Quick Take
Deepgram's speech-to-text (STT) and text-to-speech (TTS) models are now natively available on Together AI, enhancing real-time voice agents with improved turn detection and transcription accuracy. Models like Flux and Nova-3 ensure responsiveness and clarity in challenging environments such as contact centers and healthcare, while Aura-2 maintains consistency in enterprise applications.
Key Points
- Flux detects turn boundaries from conversational context, with 250ms end-of-turn detection, for smoother conversation flow.
- Nova-3 handles messy production audio, supporting vocabulary customization for domain-specific terms.
- Aura-2 ensures clear and reliable TTS output for structured information in enterprise settings.
- Deepgram models reduce latency and operational fragility by running natively on Together AI infrastructure.
- Nova-3 Multilingual extends transcription across multiple languages, covering callers who switch languages mid-conversation.
Real-time voice agents often fail when speech is treated as transcription rather than conversation. Getting the words right is only part of the challenge: the system also has to detect turn boundaries, handle interruptions and overlap, and respond quickly enough to keep the exchange feeling natural. When teams try to patch those gaps with endpointing logic, routing layers, and extra providers, they often add latency and operational fragility right back into the system. Deepgram’s models are purpose-built for that layer, where transcription, turn-taking, and responsiveness have to work together in real time.
Deepgram’s STT and TTS model lineup now runs natively on Together AI, the AI Native Cloud for building real-time voice agents, so teams can pair Deepgram transcription and synthesis with any LLM in the Together catalog and run the full voice pipeline on one production platform. For the broader architecture, see our real-time voice agents announcement.
“Voice agents live or die by latency, and every network hop between providers is a place where the experience breaks down. By hosting Deepgram’s STT and TTS natively on Together AI’s infrastructure, we’re giving developers production-grade transcription without the tradeoff. Fast, accurate, and co-located with the rest of the pipeline.”
- Abe Pursell, VP of Partnerships, Deepgram
Flux: Conversational STT with turn detection
Accurate transcription is only part of the job. A voice agent also has to know when the speaker is actually finished, because if it misreads the turn, it either talks over the caller or waits too long and feels unresponsive.
Flux is Deepgram’s conversational STT model for real-time agents, built not just to transcribe speech but to produce turn signals from conversational context rather than silence alone. That matters because many teams still rely on extra endpointing logic to bridge this gap, which adds complexity and makes latency harder to control. Flux simplifies that part of the stack and helps keep turn-taking more predictable in production with 250ms end-of-turn detection.
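To make the "extra endpointing logic" concrete, here is a minimal sketch of the silence-timeout approach that pipelines often bolt on. It is not Flux and not Deepgram's method; the frame size, energy threshold, timeout, and function name are all illustrative assumptions. It shows why silence alone is a weak signal: a mid-sentence pause longer than the timeout ends the turn early, while a longer timeout avoids that only by making the agent wait longer after every real turn.

```python
from typing import Optional, Sequence

FRAME_MS = 10  # assumed duration of each audio frame


def silence_end_of_turn(
    frame_energies: Sequence[float],
    energy_threshold: float = 0.1,   # assumed: below this, a frame counts as silence
    silence_timeout_ms: int = 300,   # assumed: trailing silence required to end the turn
) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None."""
    needed = silence_timeout_ms // FRAME_MS
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * FRAME_MS
    return None


# 500 ms of speech, a 400 ms thinking pause, then 500 ms more speech.
speech, pause = [0.8] * 50, [0.0] * 40
utterance = speech + pause + speech

# With a 300 ms timeout the turn "ends" inside the pause, before the speaker is done.
print(silence_end_of_turn(utterance))
# With a 500 ms timeout the pause is tolerated, but every real end of turn now costs 500 ms.
print(silence_end_of_turn(utterance, silence_timeout_ms=500))
```

A model that produces turn signals from conversational context is meant to remove this trade-off rather than tune it.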
Nova-3: Production transcription for real-world audio
Production audio is messier than benchmark audio. Calls come with background noise, overlapping speakers, accents, telephony compression, and interruptions, and the model still has to return text the rest of the pipeline can trust. Nova-3 is built for those conditions, with support for vocabulary customization so teams can improve recognition of domain-specific terms without retraining.
Nova-3 Multilingual extends that approach across multiple languages, which matters in deployments where callers switch languages mid-conversation.
Aura-2: Enterprise TTS for production voice agents
Aura-2 covers the synthesis side of the pipeline for business environments where clarity and consistency matter. Teams can use Deepgram STT and TTS together while keeping output stable for domain-specific terms and structured entities.
The difference shows up in delivery. The voice has to stay clear, direct, and reliable when it reads structured information or specialized language back to the user. A voice that sounds fine in a demo is not enough if it starts to stumble once the interaction becomes operational.
Audio sample: Deepgram Aura-2, Thalia voice in English
"Welcome to the show. Today we're exploring something truly fascinating: the power of voice. It's not just the words that matter. It's the feeling behind them, the quiet moments of reflection, and the clarity to handle the details when they count.
Like this: Dr. Sarah Chen, 450 Park Avenue, New York, 10022; your confirmation is BX-4072 with a $14.99 copay.
That's a lot of detail, and every bit of it needs to land clearly. That's what a great voice can do."
Use cases
Contact center voice agents
Contact centers are inherently messy environments. Call quality varies, speakers overlap, interruptions are constant, and latency still has to stay low enough for natural back-and-forth. Deepgram’s models help agents stay in flow through those conditions, keeping conversations responsive and intelligible instead of letting them break down into delays, missed turns, or unclear responses.
Healthcare voice agents
Healthcare voice agents need accurate transcription of medication names, procedure terms, and clinical language, along with output that stays clear when reading the same terms back to patients. A transcription error at the start of the pipeline can surface later as a fluent but incorrect response, which is exactly the kind of failure these systems cannot afford. Nova-3 helps teams adapt recognition to clinical language, while Aura-2 keeps patient-facing output clear and consistent.
Financial services
Financial voice systems depend on precision. Account numbers, routing numbers, trade confirmations, and structured financial language need to be captured correctly the first time, because a single transcription miss can create a failed transaction, compliance issue, or broken customer interaction. Deepgram’s speech models give teams a stronger foundation for these regulated workflows.
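Because structured entities have known formats, a pipeline can add a cheap downstream check that catches some transcription misses before they reach a transaction. The sketch below applies the standard ABA routing-number checksum to a transcribed value. It is an illustrative guard only, not part of Deepgram's models or Together AI's platform, and the function name is hypothetical.

```python
def is_valid_aba_routing_number(text: str) -> bool:
    """Check the 9-digit ABA checksum: weights 3, 7, 1 repeated, sum divisible by 10."""
    digits = [int(ch) for ch in text if ch in "0123456789"]  # ignore spaces and dashes
    if len(digits) != 9:
        return False
    weights = [3, 7, 1] * 3
    return sum(w * d for w, d in zip(weights, digits)) % 10 == 0


print(is_valid_aba_routing_number("021000021"))  # passes the checksum
print(is_valid_aba_routing_number("021000031"))  # one misheard digit fails it
```

The checksum detects any single wrong digit, but not every multi-digit error, so it complements accurate transcription rather than replacing it.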
Multilingual customer support
Global support teams need speech models that hold up when callers move between languages and accents in the same interaction. Nova-3 Multilingual helps teams serve those conversations without building separate STT pipelines for every market, which makes multilingual support easier to scale and easier to operate.
Production infrastructure on Together AI
Deepgram models run on Together AI Dedicated Model Inference alongside LLM and TTS workloads on isolated capacity. Keeping transcription, reasoning, and synthesis in the same production environment makes real-time systems easier to operate and gives teams tighter control over performance as they scale.
Together AI is the AI Native Cloud for production inference, and Dedicated Model Inference gives teams the control and reliability they need to run voice agents at scale.
Infrastructure
- Dedicated GPU capacity with isolated workloads
- 99.9% uptime SLA
- SOC 2 Type II and HIPAA-ready support, with PCI support where applicable
- Global regions with data residency options
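As a quick worked example of what the 99.9% uptime figure permits, the sketch below converts an availability percentage into a downtime budget. The 30-day month and 365-day year are simplifying assumptions, and how downtime is actually measured or excluded is defined by the SLA terms, not by this arithmetic.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at the given availability."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


monthly = downtime_budget_minutes(99.9, 30)   # budget for a 30-day month
yearly = downtime_budget_minutes(99.9, 365)   # budget for a 365-day year
print(f"{monthly:.1f} min/month, {yearly / 60:.2f} h/year")
# prints "43.2 min/month, 8.76 h/year"
```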
Developer experience
- Same SDKs and authentication across LLM, STT, and TTS endpoints
- Single observability and logging surface for the voice pipeline
- Model selection and swapping via configuration
- One billing surface across your stack
Together AI supports a broad voice catalog in one place, so teams can mix and match across the pipeline without adding vendors. That includes open-source and proprietary models deployed alongside the LLMs that power agent reasoning.
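One way to picture model selection and swapping via configuration is to treat the pipeline as a small immutable config object with one slot per stage. This is a sketch under assumptions: the class, field names, and model identifier strings are hypothetical placeholders, not Together AI's SDK or its real model IDs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoicePipelineConfig:
    """Hypothetical config: one model slot per pipeline stage."""
    stt_model: str
    llm_model: str
    tts_model: str


# Placeholder identifiers for illustration only; check the catalog for real IDs.
base = VoicePipelineConfig(
    stt_model="example/stt-conversational",
    llm_model="example/chat-llm",
    tts_model="example/tts-enterprise",
)

# Swapping one stage is a config change; the other stages are left untouched.
multilingual = replace(base, stt_model="example/stt-multilingual")

print(multilingual)
```

Keeping the config frozen means each variant is a separate value, which makes it straightforward to compare or roll back pipeline versions.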
See the Together AI voice solutions
Get started
- Read Deepgram’s announcement
- Read STT documentation
- Read TTS documentation
- Read the voice agents announcement
- Contact Sales for dedicated endpoint deployment and volume pricing
— Originally published at together.ai