Blog
-
The Agent Harness Pattern: What Poker Taught Me About Multi-Agent Systems
How a Texas Hold'em simulator became a blueprint for any domain where autonomous agents compete, negotiate, and adapt — turn by turn.
-
Building a Self-Improving Personal Knowledge Base Powered by an LLM
Inspired by Andrej Karpathy's post on LLM knowledge bases, I built a system where Claude Code skills manage a personal wiki end-to-end — ingesting raw content, compiling concept articles, synthesizing connections, and answering questions. You never touch the wiki. The LLM owns it.
-
Gemma 4 Explained: How One Model Family Spans Phones and Frontier-Class Reasoning
A technical deep-dive into Gemma 4's four core ideas — MatFormer elastic inference, hybrid attention with p-RoPE, parallel dense+MoE FFN, and native agentic tooling — with the Gemma 1–3 lineage as context.
-
TurboQuant Explained: How Google Compresses KV Caches to 3 Bits Without Losing the Plot
A technical breakdown of Google Research's TurboQuant stack: why KV-cache quantization is really an inner-product estimation problem, how PolarQuant removes normalization overhead, and where QJL fits into the final system.
-
Diffusion Language Models: How They Work, How They Compare to Autoregressive LLMs, and Where They're Going
A technical deep-dive into continuous and masked diffusion LLMs — full derivations, key models (LLaDA, Dream, Mercury), a head-to-head comparison with autoregressive LLMs, and an honest look at whether dLLMs can eventually replace AR.
-
Multi-Agent Patterns: Swarm, Teammates, and the Coordinator
Claude Code can run multiple Claude instances in parallel within the same process. Here's how in-process teammates, permission sync, and coordinator mode work — and what it means for building your own multi-agent system.
-
Designing for Extensibility: How Claude Code's Plugin and Skill System Works
Claude Code has two extension points: skills (slash commands that inject prompt content) and plugins (packages that contribute commands, MCP servers, and hooks). Here's how each works and why they're designed differently.
-
Security Without a Sandbox: How Claude Code Decides What It's Allowed to Do
Claude Code runs shell commands, edits files, and makes network requests on your machine — without a kernel sandbox. Here's the permission model that makes this safe enough to ship.
-
The Tool Use Loop: How Claude Code Executes Code, Edits Files, and Talks Back
A tool call is a structured JSON request from the LLM to run a named function. Here's exactly how Claude Code handles the full lifecycle — from API call to file edit to loop continuation.
-
Demystifying Claude Code: Inside the Architecture of a CLI Code Agent
Claude Code is one of the most widely used AI coding tools and one of the least understood. We read the source so you don't have to.
-
Two Bets on Generative Recommendation: Semantic IDs vs. Fine-Tuned LLMs
A head-to-head comparison of the two paradigms remaking recommendation — semantic ID autoregressive models and fine-tuned LLMs — with trade-off analysis and a look at how they're converging.
-
The Attention Bottleneck: How Modern LLMs Solved a Problem That Nearly Broke the Transformer
From vanilla multi-head attention to Flash Attention 3 — the engineering bottlenecks that drove every major attention variant and the math behind each fix.
-
The Harness Is the Moat: Why Autonomous AI Agents Live or Die by Their Architecture
Model quality is commoditizing. The durable competitive advantage in 2026 is harness architecture — the deterministic enclosures that make probabilistic agents reliable. A deep analysis of the four architectural primitives every production harness must implement, and how Autoresearch, Ralph Loop, Superpowers, and GSD each solve them differently.
-
Generative Recommendation in Production: HSTU, OneRec, and What Every Major Platform Is Building
From semantic IDs to OneRec Think — how Meta, Kuaishou, Google, Alibaba, ByteDance, and LinkedIn are replacing two-stage retrieval pipelines with generative models. What's in production and where the field is heading.
-
From Vibe Coding to Harness Engineering: How to Actually Ship AI-Assisted Software
Vibe coding gets you a working prototype in 10 minutes. Harness engineering is how you ship it to production. Here's the difference, why it matters, and how to make the transition.
-
Why LLM Inference Costs Will Keep Falling
An analysis of hardware trends, algorithmic improvements, and market forces driving down the cost of running large language models.