From your voice to the meeting, in seven stages.
Sync Speak is a pipeline, not a monolith. Each stage solves one problem and hands off to the next. Here's what happens between the moment you say something in Hindi and the moment Zoom plays it back in English.
-
Microphone capture
sounddevice · 16 kHz · mono · WASAPIAudio is captured as 16 kHz mono PCM. Before every stream open, the WASAPI backend is terminated and re-initialised — this eliminates the -9985 stale device error you get on Windows when a meeting app has touched the device first.
-
Neural Voice Activity Detection
webrtcvad-wheels · 10 ms frames · mode 3WebRTC Neural VAD classifies each 10 ms frame as speech or silence. A 500 ms ring buffer (deque maxlen=5) sits upstream of the trigger, so when speech fires, the preceding half-second is already captured. First syllables are not lost.
-
Speech-to-text
Sarvam Saarika v2.5 (REST)When silence is detected after speech, the completed utterance is POSTed to Saarika v2.5. It handles pure Hindi, Hinglish, and code-mixed speech natively. A small Hinglish correction table runs on the output to catch common phonetic mis-mappings before they reach the LLM.
-
Translation with context
Groq · Llama 3.3 70B · 5-utterance rolling windowThe current utterance plus the previous four are sent to Groq. The system prompt instructs the model to preserve meeting register, resolve pronouns, and keep proper nouns untouched. Groq's inference latency is the key reason end-to-end feels sub-second — hundreds of tokens per second.
-
Text-to-speech, pipelined
Sarvam Bulbul v3 (REST) · sentence-levelThe English translation is split into sentences. Each sentence is synthesised in parallel, but playback starts on sentence one as soon as it returns — sentences two, three, four arrive during playback. End-to-end perceived latency sits under 1.2 seconds.
-
Output routing
VB-Cable · virtual microphoneVB-Cable presents a virtual input device to the OS. Sync Speak writes the translated audio into it, and your meeting app (Meet, Zoom, Teams, Discord) selects the virtual cable as its microphone. The meeting hears English.
-
Self-talk guard
internal _tts_active flag · VAD resetWhile Bulbul output is playing, microphone capture is gated off so the English voice is not re-transcribed back into the Hindi STT. After playback ends, VAD state is reset so no partial frames bleed into the next utterance.
Latency budget
End-to-end perceived latency sits between 800 ms and 1.2 s on typical office fibre. The breakdown:
- VAD trigger~50 ms
- Saarika STT~400 ms
- Groq LLM~150 ms
- Bulbul first sentence~300 ms
- Playback start~50 ms