Introduction
I built a voice assistant for my kid that runs entirely in our home. No cloud services, no API calls, no data leaving the house. The whole thing runs on a single RTX 5070 Ti sitting under a desk.
This wasn't about building a product or proving commercial viability. It was about understanding what happens when you try to run multiple AI services on hardware that wasn't really designed for it, and what architectural patterns emerge from real constraints.
The goal was simple: let kids ask questions and get spoken answers, with strong safety guarantees. Everything else followed from that and from the limitations of what I had to work with.
Constraints That Shaped the Design
The interesting part wasn't the individual technologies. It was how the constraints forced certain decisions.
Hardware: 16GB of VRAM, shared
An RTX GPU with 16GB sounds generous until you try to run speech recognition and language model inference simultaneously. The speech model (Riva Parakeet) takes about 2.6GB. A 7B language model in quantized form takes another 5-7GB depending on configuration. The key challenge is balancing memory allocation between services while leaving enough headroom for the KV cache.
With careful tuning, the final configuration runs at 9-10GB of total VRAM usage, leaving 6-7GB free for dynamic allocation, conversation context, and concurrent requests.
This meant every service needed explicit memory boundaries. The language model service got a hard cap on how much VRAM it could claim, not because any single number is optimal, but because letting it grab too much would starve the speech recognition service during startup.
Network: local access only
The system runs on a home LAN. Kids connect from tablets and laptops elsewhere in the house. This assumption simplified some things (no TLS termination, no auth complexity) and imposed others (the orchestrator is middleware on a trusted network, not an internet-facing origin server).
Safety: non-negotiable and multi-layered
Building for kids means safety can't be an afterthought. This ruled out directly exposing any models to users. Everything goes through an orchestrator that filters input, sanitizes prompts, and validates output. Three separate checks before a response reaches a child.
This added latency but it was never optional.
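Concretely, the three layers are small, boring functions that live in the orchestrator. A minimal sketch of the shape, with illustrative patterns and wording rather than the real rule set:

```python
import re

# Illustrative rules only; the real filters are more extensive.
BLOCKED_INPUT = [r"\bignore (all|previous) instructions\b"]
BLOCKED_OUTPUT = [r"\bas an ai language model\b", r"\bsystem prompt\b"]

def filter_input(text: str) -> str:
    """Layer 1: reject suspicious or unsafe user input before it reaches a prompt."""
    for pattern in BLOCKED_INPUT:
        if re.search(pattern, text, re.IGNORECASE):
            raise ValueError("input rejected by safety filter")
    return text.strip()

def build_prompt(user_text: str) -> list[dict]:
    """Layer 2: the orchestrator owns the prompt; user text is only ever data, never instructions."""
    return [
        {"role": "system", "content": "You are a friendly helper for children. Answer briefly and simply."},
        {"role": "user", "content": user_text},
    ]

def validate_output(text: str) -> str:
    """Layer 3: check the model's answer before it is synthesized and spoken."""
    if not text.strip() or any(re.search(p, text, re.IGNORECASE) for p in BLOCKED_OUTPUT):
        return "Hmm, I'm not sure about that one. Let's ask a grown-up."
    return text.strip()
```

The point is less the specific rules than where they live: every interface, text or voice, passes through the same three functions.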
Architecture Overview
The system splits into five services, each in its own container:
- Speech recognition (Riva ASR): browser audio to text, gRPC streaming
- Language model (vLLM): text to text, HTTP API
- Speech synthesis (Piper TTS): text to audio
- Orchestrator (FastAPI): coordinates everything, enforces safety
- UI (static Nginx): browser interface with microphone support
The orchestrator is middleware. It accepts HTTP requests from browsers and makes gRPC or HTTP calls to backend services. This pattern emerged naturally from the need to centralize safety logic and hide implementation details from clients.
The browser never talks directly to the AI services. It doesn't need to know about gRPC or model endpoints or audio format requirements. It just uploads audio, gets audio back.
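The voice route composes speech recognition and synthesis around the same core, but the text path shows the middleware shape most clearly. A sketch, assuming the safety helpers above and vLLM's OpenAI-compatible API on the internal Docker network (the service hostname, model name, and limits are placeholders, not the real configuration):

```python
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Safety helpers from the earlier sketch; "safety" is a hypothetical module name.
from safety import filter_input, build_prompt, validate_output

app = FastAPI()
VLLM_URL = "http://llm:8000/v1/chat/completions"  # internal Docker service name (placeholder)
MODEL_NAME = "local-7b-awq"                       # whatever name the vLLM server was launched with

class Ask(BaseModel):
    question: str

@app.post("/ask")
async def ask(body: Ask) -> dict:
    """The browser only ever sees this endpoint; the model endpoint stays internal."""
    try:
        question = filter_input(body.question)
    except ValueError:
        raise HTTPException(status_code=400, detail="Let's try a different question.")

    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(
            VLLM_URL,
            json={"model": MODEL_NAME, "messages": build_prompt(question), "max_tokens": 256},
        )
        resp.raise_for_status()

    answer = validate_output(resp.json()["choices"][0]["message"]["content"])
    return {"answer": answer}
```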
Key Design Decisions
Why quantization mattered more than model choice
The first attempt used an unquantized 7B model. It consumed 14GB of VRAM for the weights alone, leaving almost nothing for the KV cache that stores conversation context. The system ran but couldn't handle even moderate context lengths without hitting out-of-memory errors.
Switching to AWQ 4-bit quantization reduced the model from 14GB to 5GB with minimal quality loss. That freed up 9GB. Suddenly the system could handle actual conversations.
Later, when memory sharing with speech recognition became an issue, I switched to a 1.5B quantized model (2.4GB weights). With framework overhead and KV cache, vLLM consumed about 3-4GB total. Combined with Riva's 2.6GB, the system used only 6-7GB (38-44% of available VRAM). But then the quality issues appeared.
Kids asked "what's the second tallest building in the world?" and got the same answer as for the first and third tallest. The smaller model was hallucinating, repeating plausible-sounding answers without the knowledge to back them up. For basic facts, it wasn't reliable enough.
The conventional wisdom would say: you're stuck, 16GB isn't enough for quality and multiple services. But that assumed I'd exhausted the optimization space. I hadn't.
The final configuration brought back the 7B model but with tuned parameters: GPU memory utilization at 45% instead of 85%, context window at 3072 tokens instead of 4096, and max concurrent sequences capped at 4. This allocated exactly what was needed without waste.
Result: 9-10GB total VRAM usage (6.5-7.5GB for the language model including KV cache and overhead, 2.6GB for speech recognition), 6-7GB free for dynamic allocation, and factually accurate responses. The 7B model knew the actual tallest buildings. It could handle follow-up questions. It stopped hallucinating.
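In vLLM terms, the final settings look roughly like this. The snippet uses the offline `LLM` class for brevity; the deployed container passes the equivalent `--quantization`, `--gpu-memory-utilization`, `--max-model-len`, and `--max-num-seqs` flags to the OpenAI-compatible server, and the model path is a placeholder:

```python
from vllm import LLM

llm = LLM(
    model="/models/7b-awq",        # placeholder path to an AWQ-quantized 7B checkpoint
    quantization="awq",            # 4-bit weights: roughly 5GB instead of 14GB
    gpu_memory_utilization=0.45,   # claim ~45% of the 16GB card, leaving room for Riva
    max_model_len=3072,            # 3072-token context window instead of 4096
    max_num_seqs=4,                # at most 4 concurrent sequences
)
```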
The lesson: quantization enables deployment, but optimization is multidimensional. Model size, memory reservation percentage, context window, and concurrency limits all interact. Finding the right combination matters more than any single parameter.
Why the orchestrator owns safety logic
I could have embedded safety checks in each service or handled them client-side. Both seemed fragile. Client-side checks can be bypassed. Service-level checks would need to be replicated across speech, text, and any future interfaces.
Centralizing safety in the orchestrator meant one implementation, one set of rules, and one place to fix mistakes. The orchestrator sees every request and every response before they reach a user. Nothing bypasses it.
The tradeoff is latency. Every request goes through multiple hops: browser to orchestrator, orchestrator to LLM, LLM back to orchestrator, orchestrator to TTS, TTS back to orchestrator, orchestrator to browser. But the latency hit was worth the architectural clarity.
Why audio format precision forced an entire pipeline
Browsers record audio in WebM format (compressed, variable sample rate). Riva speech recognition expects WAV format (uncompressed, 16kHz, mono, 16-bit PCM). Those specifications aren't suggestions. If you send 44.1kHz stereo, it fails silently or returns garbage.
This required building a conversion pipeline: detect format from magic bytes, convert with ffmpeg through pydub, validate the output. It's not glamorous but it's critical. Format mismatches were the source of the most confusing bugs during development.
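The whole pipeline fits in one function. A sketch of the idea (the function name is mine, and the real version also logs what it detected at each step):

```python
import io
from pydub import AudioSegment  # thin wrapper around ffmpeg

def convert_to_wav(data: bytes) -> bytes:
    """Normalize browser audio to what Riva expects: 16 kHz, mono, 16-bit PCM WAV."""
    # Sniff the container from magic bytes: WebM/Matroska starts with an EBML header,
    # WAV starts with a RIFF chunk. Anything else gets rejected outright.
    if data[:4] == b"\x1a\x45\xdf\xa3":
        source_format = "webm"
    elif data[:4] == b"RIFF":
        source_format = "wav"
    else:
        raise ValueError("unsupported audio container")

    audio = AudioSegment.from_file(io.BytesIO(data), format=source_format)
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)  # 2 bytes = 16-bit

    out = io.BytesIO()
    audio.export(out, format="wav")
    wav_bytes = out.getvalue()

    # Validate instead of trusting: format bugs should fail loudly here, not show up
    # as garbage transcripts three services downstream.
    if len(wav_bytes) < 44 or wav_bytes[:4] != b"RIFF":
        raise ValueError("conversion produced an invalid WAV file")
    return wav_bytes
```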
The broader point: when systems cross boundaries (browser to server, HTTP to gRPC, one audio library to another), precision matters. Assumptions about "it should just work" don't survive contact with reality.
Why Docker mattered for GPU sharing
Each service gets its own container with explicit resource reservations. The LLM container declares that it needs GPU access, and the service inside it caps itself at the memory fraction it is configured for; the speech recognition container does the same. Between Docker's device reservations and those per-service caps, the two stop stepping on each other.
Without containerization, coordinating three GPU-accelerated services would mean managing CUDA contexts, environment variables, and library conflicts manually. Docker doesn't eliminate those problems but it compartmentalizes them.
What Worked Well
Separation of concerns proved its worth during debugging. When speech recognition failed, I could test the Riva container in isolation. When responses were unsafe, I could trace through orchestrator logs without touching the model service. Clean boundaries made problems easier to locate.
The orchestrator-as-middleware pattern scaled better than expected. Adding the text interface meant writing one new route in the orchestrator. Adding voice meant writing another route that composed speech recognition with the existing text pipeline. The underlying services didn't change.
Quantization delivered on its promise. The quality loss from full precision to AWQ 4-bit was minimal. However, the drop from 7B to 1.5B parameters was noticeable: the smaller model hallucinated on factual questions. Kids asking "what's the second tallest building" deserve accurate answers, not plausible-sounding guesses.
Simple prompts beat complex ones. Early attempts used detailed system prompts with examples and multi-part rules. The model leaked those prompts in responses or treated example conversations as real history. Simplifying to a single clear instruction produced more reliable output than elaborate prompt engineering.
What Didn't (or Was Harder Than Expected)
The Riva integration took longer than everything else combined. The documentation suggested HTTP endpoints existed, but the model I was using only supported streaming gRPC. That meant learning gRPC configuration, understanding audio encoding requirements, and debugging through layers of abstraction.
The errors were cryptic. "Audio decoder exception: encoding not specified" didn't mention that I needed to explicitly set LINEAR_PCM and sample rate in the config, even though I was sending WAV files that already contained that information in their headers.
I went through eight distinct failures before getting working transcription. Each one revealed another assumption I'd made that didn't hold.
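The configuration that finally worked is short. A condensed sketch with the nvidia-riva-client Python bindings (the service address and chunk size are placeholders, and retries and error handling are stripped out):

```python
import riva.client  # nvidia-riva-client package

def transcribe(wav_bytes: bytes, riva_uri: str = "riva:50051") -> str:
    """Streaming transcription against the Riva container. The WAV header already
    says 16 kHz LINEAR_PCM, but the config has to repeat it explicitly anyway."""
    auth = riva.client.Auth(uri=riva_uri)
    asr = riva.client.ASRService(auth)

    streaming_config = riva.client.StreamingRecognitionConfig(
        config=riva.client.RecognitionConfig(
            encoding=riva.client.AudioEncoding.LINEAR_PCM,  # must be set explicitly
            sample_rate_hertz=16000,                        # must match the converted audio
            language_code="en-US",
            max_alternatives=1,
        ),
        interim_results=False,
    )

    # Feed the audio over the gRPC stream in fixed-size chunks.
    chunks = (wav_bytes[i:i + 4096] for i in range(0, len(wav_bytes), 4096))
    responses = asr.streaming_response_generator(audio_chunks=chunks, streaming_config=streaming_config)

    return "".join(
        result.alternatives[0].transcript
        for response in responses
        for result in response.results
        if result.is_final and result.alternatives
    )
```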
Prompt engineering for reliability is harder than prompt engineering for quality. Getting the model to consistently return clean output without meta-commentary, leaked instructions, or follow-up questions required both careful prompting and defensive post-processing. I couldn't rely on prompts alone.
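The defensive layer is plain string handling, roughly in this shape (the patterns are illustrative; the real list accumulated from observed failures):

```python
import re

# Illustrative patterns for the kinds of output that slipped through prompting alone.
META_PREFIX = re.compile(r"^(sure[,.!]?\s+|certainly[,.!]?\s+|as an ai[^.]*\.\s*)", re.IGNORECASE)
LEAKED_MARKERS = ("system:", "user:", "assistant:", "instruction:")

def clean_reply(text: str) -> str:
    """Strip meta-commentary, leaked prompt scaffolding, and trailing follow-up
    questions before the reply goes to TTS."""
    text = META_PREFIX.sub("", text.strip())

    # Drop any line that looks like leaked chat scaffolding.
    kept = [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.strip().lower().startswith(LEAKED_MARKERS)]
    text = " ".join(kept)

    # Drop a trailing follow-up question the model tacked on ("Want to hear more?").
    sentences = re.split(r"(?<=[.!?])\s+", text)
    if len(sentences) > 1 and sentences[-1].endswith("?"):
        sentences = sentences[:-1]
    return " ".join(sentences).strip()
```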
VRAM optimization turned out to be multi-dimensional. Model size matters, but so do the memory utilization percentage, context window length, and concurrency limit. Changing one affects the others. Finding a stable configuration meant tuning all of them together, not sequentially. The gap between stated model size (2.4GB of weights) and actual VRAM usage (3-4GB total) also matters: framework overhead and the KV cache add significant memory on top.
Broader Lessons
Constraints improve architecture when you let them. The 16GB VRAM limit forced me to think carefully about memory reservation, service boundaries, and model selection. Without that constraint, I probably would have used larger models and deferred the hard optimization questions.
Safety requirements drove the orchestrator pattern. Privacy requirements drove the local-only deployment. Audio format requirements drove the conversion pipeline. Each constraint ruled out simpler approaches and pointed toward specific solutions.
Audio is harder than text. Format conversion, sample rate matching, bit depth requirements, and streaming protocols all add complexity that doesn't exist in text-only systems. Voice interfaces are worth it for the use case but they're not free.
Model behavior is ultimately probabilistic. Prompts help, post-processing helps, but there's no prompt that guarantees the model won't occasionally produce something unexpected. Defensive programming matters. Validation matters. Multiple layers matter.
Model size tradeoffs are real. A 1.5B model offered low latency and minimal VRAM usage (3-4GB), but hallucinated on factual questions. The 7B model (~6.5-7.5GB) provided accurate responses at the cost of higher memory usage. For this use case, factual accuracy mattered more than memory efficiency, so the larger model won. The lesson: optimize for the right metric; sometimes that's memory, sometimes it's quality.
Closing Reflection
This was a small-scale build on hardware that nobody would recommend for running multiple AI services simultaneously. That made it more interesting, not less.
When you remove the option to throw more compute at a problem, you're forced to make real tradeoffs. Model size versus quality. Latency versus safety. Context length versus concurrency. Those tradeoffs expose architectural questions that unlimited resources would hide.
The system works. Kids use it. It answers questions about space, animals, and why the sky is blue. It runs on hardware in a house, not in a datacenter.
That constraint probably taught me more about AI system architecture than working with unlimited cloud resources would have.