December 28, 2025
Notes from running speech recognition, an LLM, and TTS on a single RTX 5070 Ti. Covers memory constraints, quantization trade-offs, and architectural choices driven by 16 GB of VRAM shared across all three models.
Read more →
Coming soon
Notes on LLM Inference Optimization
Observations from tuning inference performance on consumer GPUs. Quantization experiments, memory-allocation strategies, and real-world latency measurements.
Coming soon
Experiments with RAG Systems
Lessons from building retrieval-augmented generation pipelines. What worked, what failed, and why certain architectural choices mattered more than expected.