Blog
-
The Agent Harness Pattern: What Poker Taught Me About Multi-Agent Systems
How a Texas Hold'em simulator became a blueprint for any domain where autonomous agents compete, negotiate, and adapt — turn by turn.
-
Building a Self-Improving Personal Knowledge Base Powered by an LLM
Inspired by Andrej Karpathy's post on LLM knowledge bases, I built a system where Claude Code skills manage a personal wiki end-to-end — ingesting raw content, compiling concept articles, synthesizing connections, and answering questions. You never touch the wiki. The LLM owns it.
-
Gemma 4 Explained: How One Model Family Spans Phones and Frontier-Class Reasoning
A technical deep-dive into Gemma 4's four core ideas — MatFormer elastic inference, hybrid attention with p-RoPE, parallel dense+MoE FFN, and native agentic tooling — with the Gemma 1–3 lineage as context.
-
TurboQuant Explained: How Google Compresses KV Caches to 3 Bits Without Losing the Plot
A technical breakdown of Google Research's TurboQuant stack: why KV-cache quantization is really an inner-product estimation problem, how PolarQuant removes normalization overhead, and where QJL fits into the final system.
-
Diffusion Language Models: How They Work, How They Compare to Autoregressive LLMs, and Where They're Going
A technical deep-dive into continuous and masked diffusion LLMs — full derivations, key models (LLaDA, Dream, Mercury), a head-to-head comparison with autoregressive LLMs, and an honest look at whether dLLMs can eventually replace AR.
-
Multi-Agent Patterns: Swarm, Teammates, and the Coordinator
Claude Code can run multiple Claude instances in parallel within the same process. Here's how in-process teammates, permission sync, and coordinator mode work — and what it means for building your own multi-agent system.
-
Designing for Extensibility: How Claude Code's Plugin and Skill System Works
Claude Code has two extension points: skills (slash commands that inject prompt content) and plugins (packages that contribute commands, MCP servers, and hooks). Here's how each works and why they're designed differently.
-
Security Without a Sandbox: How Claude Code Decides What It's Allowed to Do
Claude Code runs shell commands, edits files, and makes network requests on your machine — without a kernel sandbox. Here's the permission model that makes this safe enough to ship.
-
The Tool Use Loop: How Claude Code Executes Code, Edits Files, and Talks Back
A tool call is a structured JSON request from the LLM to run a named function. Here's exactly how Claude Code handles the full lifecycle — from API call to file edit to loop continuation.
-
Demystifying Claude Code: Inside the Architecture of a CLI Code Agent
Claude Code is one of the most widely used AI coding tools and one of the least understood. We read the source so you don't have to.
-
Two Bets on Generative Recommendation: Semantic IDs vs. Fine-Tuned LLMs
A head-to-head comparison of the two paradigms remaking recommendation — semantic ID autoregressive models and fine-tuned LLMs — with trade-off analysis and a look at how they're converging.
-
The Attention Bottleneck: How Modern LLMs Solved a Problem That Nearly Broke the Transformer
From vanilla multi-head attention to Flash Attention 3 — the engineering bottlenecks that drove every major attention variant and the math behind each fix.
-
The Harness Is the Moat: Why Autonomous AI Agents Live or Die by Their Architecture
Model quality is commoditizing. The durable competitive advantage in 2026 is harness architecture — the deterministic enclosures that make probabilistic agents reliable. A deep analysis of the four architectural primitives every production harness must implement, and how Autoresearch, Ralph Loop, Superpowers, and GSD each solve them differently.
-
Generative Recommendation in Production: HSTU, OneRec, and What Every Major Platform Is Building
From semantic IDs to OneRec Think — how Meta, Kuaishou, Google, Alibaba, ByteDance, and LinkedIn are replacing two-stage retrieval pipelines with generative models. What's in production and where the field is heading.
-
From Vibe Coding to Harness Engineering: How to Actually Ship AI-Assisted Software
Vibe coding gets you a working prototype in 10 minutes. Harness engineering is how you ship it to production. Here's the difference, why it matters, and how to make the transition.
-
Why LLM Inference Costs Will Keep Falling
An analysis of hardware trends, algorithmic improvements, and market forces driving down the cost of running large language models.