Voice Cursor
Team: GenAI consultants and data scientists from AI Talentflow/CGI, Tangerine, and CloudCosmos, with experience in LLMs, RAG/LangChain, PyTorch/TensorFlow, and AWS; Punjabi Univ. & Lambton grads; Kaggle top 0.2%.
YouTube Video
Project Description
1. Voice → Text: The user speaks; audio is streamed to Google Speech-to-Text (ASR), which returns a transcript. (Note: this is Speech-to-Text, not Text-to-Speech.)
2. PII scrubbing: The transcript is sent to the Google Cloud DLP de-identify API, which masks or removes PII and returns de-identified text (optionally with redaction metadata).
3. Safety classification: The de-identified text goes to Meta Llama Guard 4-12B, which evaluates content against 13 offense categories and returns: allowed/blocked, category, severity, and rationale.
4. Policy gate:
   - If blocked or concerning, the system tags the request with the category, logs it to the Safety Ledger, and applies mitigations or refusals.
   - If allowed, the request proceeds to orchestration.
5. Orchestration & routing: An Orchestrator/Router examines intent and safety labels, then fans out work to the Voice IDE agents as needed.
6. Voice IDE agents run:
   - coder-agent: drafts/refactors code, scaffolds tests.
   - reasoning-agent: plans steps, decomposes tasks.
   - security-agent: scans for secrets/vulns, enforces guardrails.
   - speech-agent: optimizes voice UX (barge-in, brevity).
   - validator-agent: lint/compile/sanity checks; verifies spec adherence.
   Agents may call Gemini 2.0 tools during their work.
7. Context capture (temporary): Agent outputs and conversation turns are appended to Markdown files, which act as a temporary context store to ground follow-ups. (This layer is designed to be swapped to CosmosDB later.)
8. Generation: The Orchestrator builds a prompt from the current user query plus context snippets retrieved from the context store, and calls Gemini 2.0 Flash to generate the final draft (code/answer).
9. Human validation gate: Any action that changes or creates files (e.g., committing code, modifying repos) is blocked until a human approver reviews the Gemini/agent proposal and clicks Approve. Only then does the system execute the change/output.
10. Response delivery: Approved output is returned to the user (text/markdown/code). If voice reply is enabled, it is also sent to TTS for playback.
11. Observability (end-to-end): At every stage, the platform records metrics, logs, and traces (request counts, latency, token usage, estimated cost, and safety category stats), which feed dashboards and alerts.
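
The two gates in the flow above (the policy gate after safety classification, and the human validation gate before file-changing actions) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `SafetyVerdict`, `Proposal`, `Pipeline`, and the `classify`/`generate`/`approve` callables are hypothetical stand-ins for the Llama Guard call, the orchestrator plus Gemini generation, and the human approver, and the Safety Ledger is modeled as a plain in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SafetyVerdict:
    """Hypothetical shape of the safety classification result."""
    allowed: bool
    category: Optional[str] = None
    severity: Optional[str] = None
    rationale: str = ""


@dataclass
class Proposal:
    """A generated draft; mutates_files marks actions that change or create files."""
    content: str
    mutates_files: bool = False


@dataclass
class Pipeline:
    classify: Callable[[str], SafetyVerdict]  # stand-in for the safety model call
    generate: Callable[[str], Proposal]       # stand-in for orchestration + generation
    approve: Callable[[Proposal], bool]       # stand-in for the human approver
    safety_ledger: List[dict] = field(default_factory=list)  # assumed in-memory ledger

    def handle(self, deidentified_text: str) -> str:
        verdict = self.classify(deidentified_text)
        if not verdict.allowed:
            # Policy gate: tag with the category, log to the ledger, refuse.
            self.safety_ledger.append({
                "category": verdict.category,
                "severity": verdict.severity,
                "rationale": verdict.rationale,
            })
            return f"Request refused (category: {verdict.category})."

        proposal = self.generate(deidentified_text)
        # Human validation gate: file-changing actions need explicit approval.
        if proposal.mutates_files and not self.approve(proposal):
            return "Change not applied: human approval not granted."
        return proposal.content


# Toy wiring with keyword-based stubs, purely for illustration.
demo = Pipeline(
    classify=lambda t: SafetyVerdict(
        allowed="bad" not in t,
        category="example-category" if "bad" in t else None,
    ),
    generate=lambda t: Proposal(content="draft for: " + t, mutates_files="commit" in t),
    approve=lambda p: False,  # approver declines everything in this demo
)
print(demo.handle("explain this function"))
print(demo.handle("commit this change"))
```

Passing the classifier, generator, and approver in as callables keeps the gating logic testable without network access; in a real deployment the approver would typically be an asynchronous review step instead of a synchronous function call.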