12.0.0 - Real-Time Container

February 14th, 2025

New

Multi-session: the real-time container can now process multiple sessions in parallel, simplifying orchestration and giving a large reduction in memory usage when using GPU transcription
New transcription languages: Urdu (ur), Bengali (bn) and Swahili (sw)
New field orchestrator_version in the RecognitionStarted message, which uniquely identifies the underlying engine version
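
As an illustration of how a client might read the new orchestrator_version field, here is a minimal Python sketch. It assumes the field sits at the top level of the RecognitionStarted JSON message. The helper name extract_orchestrator_version and the sample id and version values are hypothetical placeholders, not taken from the product.

```python
import json
from typing import Optional


def extract_orchestrator_version(raw_message: str) -> Optional[str]:
    """Return orchestrator_version from a RecognitionStarted message, else None."""
    payload = json.loads(raw_message)
    if payload.get("message") != "RecognitionStarted":
        return None
    # Assumption: the field is top-level. Containers older than 12.0.0
    # will not send it, so a missing field yields None.
    return payload.get("orchestrator_version")


# Hypothetical payload: the id and version values are placeholders.
sample = json.dumps({
    "message": "RecognitionStarted",
    "id": "example-session-id",
    "orchestrator_version": "example-version",
})
print(extract_orchestrator_version(sample))
```

Logging this value alongside each session may make it easier to tell which engine version produced a given transcript.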

Faster real-time transcription with more consistent latency
- Fewer words are included in each Final (typically 1-2 words per final).
- The latency for final word transcriptions is now more consistent.
- Average latency is reduced by 300-400ms for a given max_delay, with an additional 150-200ms reduction for English.
- We recommend re-evaluating the best max_delay for your use case. In most cases, you can reduce max_delay and still get the same accuracy. See the documentation on accuracy vs. latency trade-offs to learn more.
Massive accuracy improvements for real-time Speaker Diarization – including more accurate speaker changes
- 48% reduction in speaker recognition errors at 1s (GPU operating points)
- 38% reduction in speaker change mistakes at 1s (GPU operating points)
Major efficiency improvements for real-time transcription on GPU for English, French, German and Spanish with the Enhanced operating point
- 75% increase in the number of sessions that can be processed on a GPU Transcription Inference Server (benchmarked using NVIDIA T4 on major cloud platforms)
Partial transcripts now include numeral formatting, enhanced punctuation, and better casing.
Improved accuracy for recognition of Arabic numbers and currency
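
To re-evaluate max_delay as recommended above, one approach is to run the same audio with several candidate values and compare latency and accuracy. The sketch below only builds the session start messages. It assumes the StartRecognition layout of the real-time API (transcription_config with language, max_delay and enable_partials, plus a raw PCM audio_format), so check the field names against the API reference for your version. The helper name build_start_recognition and the candidate values are illustrative, not recommendations.

```python
import json


def build_start_recognition(language: str, max_delay: float) -> str:
    """Serialize a StartRecognition message with the given max_delay (seconds)."""
    if max_delay <= 0:
        raise ValueError("max_delay must be positive")
    message = {
        "message": "StartRecognition",
        # Assumed audio format: 16 kHz, 16-bit little-endian raw PCM.
        "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
        "transcription_config": {
            "language": language,
            "max_delay": max_delay,
            "enable_partials": True,
        },
    }
    return json.dumps(message)


# Candidate values to compare; these are illustrative only.
for candidate in (2.0, 1.5, 1.0):
    print(build_start_recognition("en", candidate))
```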

"Kyiv" output consistently in English transcription
Updated English profanity tagging to remove a small number of non-profane words
Fixed an issue with Mandarin (cmn) where consecutive English words were concatenated together
Fixed a delay in emitting Final transcripts for a subset of languages when using a config containing punctuation_overrides={"permitted_marks":[],...
After the server receives an EndOfStream message, any further AddAudio messages are dropped. On the first AddAudio received after EndOfStream, the server sends an error message to the client but does not close the connection
Security fixes. A Software Bill of Materials (SBOM) is available for download from the corresponding release page in our Support Portal.
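
Given the EndOfStream behaviour described above, a client may want to guard against sending audio after it has ended the stream. The following is a minimal sketch of such a guard, not part of any official SDK. The class name StreamGuard is hypothetical, and the last_seq_no field in the EndOfStream message is assumed from the real-time API, so verify it against the API reference.

```python
import json


class StreamGuard:
    """Tracks whether EndOfStream has been sent so no audio follows it."""

    def __init__(self) -> None:
        self.ended = False
        self.chunks_sent = 0

    def add_audio(self, chunk: bytes) -> bytes:
        """Return the chunk to send, or raise if the stream has ended."""
        if self.ended:
            # The server would drop this audio and reply with an error.
            raise RuntimeError("AddAudio attempted after EndOfStream")
        self.chunks_sent += 1
        return chunk

    def end_of_stream(self) -> str:
        """Mark the stream as ended and build the EndOfStream message."""
        self.ended = True
        # Assumption: last_seq_no is the count of audio chunks sent.
        return json.dumps({"message": "EndOfStream", "last_seq_no": self.chunks_sent})


guard = StreamGuard()
guard.add_audio(b"\x00\x01")
print(guard.end_of_stream())
```

Failing locally in this way surfaces the mistake in the client rather than relying on the server's error message.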

Known issue: degraded transcription accuracy on non-speech audio in English. This issue is fixed in the 12.0.1 release