Under the hood

From your voice to the meeting, in seven stages.

Sync Speak is a pipeline, not a monolith. Each stage solves one problem and hands off to the next. Here's what happens between the moment you say something in Hindi and the moment Zoom plays it back in English.

Mic → VAD → Saarika STT → Groq LLM → Bulbul TTS → VB-Cable → Meeting

Microphone capture
sounddevice · 16 kHz · mono · WASAPI
Audio is captured as 16 kHz mono PCM. Before every stream open, the WASAPI backend is terminated and re-initialised — this eliminates the -9985 stale device error you get on Windows when a meeting app has touched the device first.
Neural Voice Activity Detection
webrtcvad-wheels · 10 ms frames · mode 3
WebRTC Neural VAD classifies each 10 ms frame as speech or silence. A 500 ms ring buffer (deque maxlen=5) sits upstream of the trigger, so when speech fires, the preceding half-second is already captured. First syllables are not lost.
Speech-to-text
Sarvam Saarika v2.5 (REST)
When silence is detected after speech, the completed utterance is POSTed to Saarika v2.5. It handles pure Hindi, Hinglish, and code-mixed speech natively. A small Hinglish correction table runs on the output to catch common phonetic mis-mappings before they reach the LLM.
Translation with context
Groq · Llama 3.3 70B · 5-utterance rolling window
The current utterance plus the previous four are sent to Groq. The system prompt instructs the model to preserve meeting register, resolve pronouns, and keep proper nouns untouched. Groq's inference latency is the key reason end-to-end feels sub-second — hundreds of tokens per second.
Text-to-speech, pipelined
Sarvam Bulbul v3 (REST) · sentence-level
The English translation is split into sentences. Each sentence is synthesised in parallel, but playback starts on sentence one as soon as it returns — sentences two, three, four arrive during playback. End-to-end perceived latency sits under 1.2 seconds.
Output routing
VB-Cable · virtual microphone
VB-Cable presents a virtual input device to the OS. Sync Speak writes the translated audio into it, and your meeting app (Meet, Zoom, Teams, Discord) selects the virtual cable as its microphone. The meeting hears English.
Self-talk guard
internal _tts_active flag · VAD reset
While Bulbul output is playing, microphone capture is gated off so the English voice is not re-transcribed back into the Hindi STT. After playback ends, VAD state is reset so no partial frames bleed into the next utterance.

Latency budget

End-to-end perceived latency sits between 800 ms and 1.2 s on typical office fibre. The breakdown:

VAD trigger~50 ms
Saarika STT~400 ms
Groq LLM~150 ms
Bulbul first sentence~300 ms
Playback start~50 ms

Microphone capture

Neural Voice Activity Detection

Speech-to-text

Translation with context

Text-to-speech, pipelined

Output routing

Self-talk guard

Latency budget