Fully-local two-way voice assistant ($0/mo)

The problem and the motivation

Commercial voice assistants carry two hidden costs: they send your audio to the cloud, and they bill you per use. For a personal workflow on Windows, on a single GPU machine, neither tradeoff is necessary. Speech recognition and speech synthesis already run well locally, and a good-enough LLM is available on a free lane. The goal was concrete: a two-way voice assistant that listens, understands, thinks and replies out loud, without sending anything off-box and without a monthly operating cost.

The cost constraint shaped the whole architecture. To keep operation at $0/mo, every stage had to be either fully local (STT, TTS) or on a free lane (the LLM via NVIDIA NIM). That decision ruled out any pay-per-request cloud service from the start and turned the work into a local-resources engineering problem: how do you get low latency and useful answers out of an entirely free stack.

The architecture and how it works

The system is a single Python loop that ties three stages together. The microphone is read through sounddevice, with no PyAudio. The captured audio runs through faster-whisper for speech-to-text. The resulting text is sent to a free LLM via NVIDIA NIM (the cc_free lane), and the reply is played back through Piper TTS using the en_US-amy-medium voice. In short: mic → STT (faster-whisper) → LLM (NVIDIA, gpt-oss) → TTS (Piper).

The interaction is designed to be short and natural. The prompt asks for replies of 1-3 sentences, so the listener is not worn out and the round-trip stays small. The assistant is wake-gated by default: you say "Claude ..." to address it, and "stop" closes it. This gating prevents an open microphone from answering every noise in the room. In open-mic mode (gating off) it answers everything, which is why headphones are recommended there.

The agent runs as a daemon in its own console window, so the prompts are visible and the microphone and speaker work correctly. On startup it pre-warms both STT and TTS (about 7 seconds), then enters the listening state. The entire behavior is tunable from environment variables, which made it possible to iterate on a daemon through quick kill, edit and relaunch cycles without changing the code each time.

Key technical decisions and why

The first important decision was the choice of audio libraries. PyAudio will not build on Python 3.14, because there is no cp314 wheel, and the RealtimeSTT stack depends on it. The fix was to replace it with sounddevice plus faster-whisper, which install cleanly and cover the same role. The general lesson, applicable to any project: when a hard dependency has no wheel for your Python version, swap it for an installed equivalent instead of forcing a build.

The second decision was pinning the LLM model. The raw NVIDIA lane rotates a pool of models, and sometimes lands on one that answers in 20-40 seconds, which is unacceptable for voice. By explicitly pinning gpt-oss (the CC_VOICE_NVIDIA_MODEL = openai/gpt-oss-120b variable), the LLM latency dropped to about 0.6s, versus 3.4s on the default lane. This is one of the single largest latency savings in the whole system, obtained from one configuration decision.

The third decision concerned the Piper API. From Piper 1.4 onward, synthesis is done with voice.synthesize_wav(text, wav_file); the old synthesize(text, wf) call raises a "channels not specified" error. Aligning to the new API was required for audio playback to work at all.

Engineering details: anti-fragility and performance

A local voice assistant has to be robust to noise and to model failure modes. The most annoying problem was silence hallucination: on segments with no speech, Whisper produces typical phantom phrases like "subscribe to my channel" or "thank you very much". The fix was a three-layer filter: a gate on microphone energy (RMS), a minimum speech duration, and a known-phrase filter. The RMS threshold is tunable (CC_VOICE_RMS = 0.030, lowered toward 0.020 for weak microphones), and the minimum-duration thresholds (CC_VOICE_MIN_BLOCKS, CC_VOICE_MIN_S) ensure that short words like "stop" are still recognized without opening the door to noise.

Performance was a fight against GPU contention. Memory recall (grounding from a project scratchpad, the Hivemind) runs embeddings on the same single GPU as Whisper, so it adds about 2.8s of latency when turned on. For that reason recall is off by default and is enabled only when usefulness matters more than speed. For STT, the base.en model is nearly as accurate as small.en on short commands but about 2.8 times faster, so it is the default choice. Amy's synthesis playing while the agent listens creates barge-in problems on speakers, where its own audio bleeds back into the microphone; on headphones barge-in works cleanly, and on speakers it is recommended to turn it off.

For cost and reliability at the LLM level, the lane behaves as a cascade. cc.auto (gpt-oss) returns empty on long prompts, so the lanes are chained (qwen_cloud → nvidia → auto), keeping useful answers without giving up the free tier. This keeps the cost at $0/mo even when an individual lane fails.

The bridge to Claude Code

Beyond ordinary conversation, the agent can optionally drive a live Claude Code session. With CC_VOICE_BRIDGE on, a spoken command of the form "Claude, " is written to a bridge file (voice_cc_command.jsonl) and announced in the console with a CC-COMMAND marker. A Claude Code session monitors that file, writes its reply to voice_cc_reply.jsonl, and the voice agent speaks it back. Important for safety: action commands still pass through all the HARD RULE gates, so voice is not a way to bypass confirmations. The bridge is off by default, so the base interaction stays purely conversational.

Outcome and current status

The system is a working build, not a prototype. When warm, the full round-trip is 400-600ms, and time to first audio is about 1s, achieved through a latency optimization pass: boot pre-warming, a pinned model, and a 0.4s end-of-speech wait. The operating cost is $0/mo, because STT and TTS are local and the LLM runs on a free lane. The full stack is Python, faster-whisper for STT, Piper for TTS, sounddevice for audio I/O, NVIDIA NIM for the LLM, plus NumPy, on Windows with CUDA.

All of these build details are documented so they are not relearned: the PyAudio incompatibility with Python 3.14, the Piper API change, the silence hallucinations, the GPU contention, and the speaker barge-in problem. The assistant began as a matured behavior from an internal instinct system and was extracted into a standalone, reusable local tool, with environment knobs for wake gating, model, audio thresholds and the bridge. In its current form, it is a concrete demonstration that a low-latency, two-way voice experience can be built entirely from local and free components.