How a Bad CLAUDE.md Cost Me Two Days (And What I Built to Fix It)

I nearly quit my homelab project over a markdown file.

Not a broken deployment. Not a corrupted etcd cluster. Not a misconfigured NetworkPolicy at midnight. A markdown file. Specifically, a CLAUDE.md that described a cluster I no longer had, a roadmap I’d already executed, and a “current state” that was 10 sessions out of date. Every time I opened a new Claude Code session, I was handing my AI assistant a map of the wrong city and wondering why we kept ending up in the wrong place.

Two days of running in circles. Trying the same things. Re-explaining the same context. Hitting the same dead ends I’d already documented for myself and then lost track of. By the end of it I wasn’t frustrated at the tooling — I was frustrated at myself, which is somehow worse. That’s the kind of demotivation that makes you close the laptop and not open it again for a week.

This post is about what I built instead.


The actual problem

I’m building a 3-node K3s HA cluster on Lenovo M910q mini PCs, studying for the CKA while doing it, and managing the whole thing as a GitOps portfolio piece. The cluster is real: kube-vip, Longhorn, cert-manager, Cloudflare Tunnel, ArgoCD app-of-apps, the works. I work on it across multiple machines, pick it up and put it down across days and weeks, and travel between Prague, Greece, and Iraq along the way.

The context problem compounds fast in that situation. Claude Code’s context window doesn’t persist between sessions. Every new session starts cold. So I was doing what everyone does — maintaining a CLAUDE.md to give Claude project context. The problem: a static markdown file rots. The cluster moves forward, the file doesn’t. And when they diverge badly enough, you get something worse than no context at all: wrong context, delivered confidently.

I’d also made the classic mistake of putting everything in one place. CLAUDE.md had the architecture, the hardware, the stack, the conventions, the roadmap, my notes, my todo list. It was trying to be a brain dump AND a reference doc AND a status tracker. It was none of them well. And every session, all of that got loaded into the context window whether I needed it or not — eating tokens, going stale, eventually contradicting the actual state of the cluster.

The specific two-day incident: I spent most of it trying to debug a deployment that the static CLAUDE.md said was still planned, but that I’d already partially deployed three sessions ago and left in a broken state. There was nothing in my setup to tell me that. I just kept starting from the wrong assumption.


What I actually needed

When I stepped back and asked what would have prevented those two days, the answer was simple: something that remembered what had already been tried and failed.

Not a context window. Not a chat history. A ledger. A running record of what I did, what broke, and why — that would survive a closed terminal, a new session, a week away, a different machine. Something I could walk up to cold and ask: “where did I leave off?”

And then I realised I didn’t just need a memory for failures. I needed:

  • A place to capture what I was learning (CKA is an exam — I should be building a study archive, not rediscovering the same concepts every session)
  • A way to keep the AI assistant honest about what my cluster could actually support right now, before I went down a rabbit hole that required hardware I hadn’t built yet
  • A reviewer that would catch my own convention mistakes before they landed in my public repo

Four different jobs. Four different memory requirements. The answer was four (in practice, six) custom Claude Code subagents, each scoped to one job, each with its own persistent memory directory that survives across sessions.


The setup

Each agent is a markdown file in .claude/agents/. Claude Code loads them at startup and routes tasks to them automatically, or you invoke them explicitly with @agent-name. The key property: each agent gets its own memory directory, and its MEMORY.md is loaded into context every time it runs. So the memory persists. The context window resets; the memory doesn’t.

Here’s what I built:

it-accountability — the one that would have saved me two days. An append-only ledger of what’s been done, what failed, and what’s next. The Failed/Dead Ends section is the part that matters most. It’s explicitly instructed never to delete that section, because that’s the record that stops you retrying broken approaches. Start any session with @it-accountability where did I leave off? and it reads the ledger and tells you. Its memory is local (gitignored) because it logs command output that might contain secrets.
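
As a rough illustration of the append-only idea (not the agent's actual implementation, which is a markdown prompt rather than code), here is a minimal Python sketch. The section names, the one-line entry format, and the helper names append_entry and entries are all assumptions made for the example.

```python
from datetime import date
from pathlib import Path
from typing import List, Optional

# Section names are assumptions for this sketch, loosely following the post.
SECTIONS = ("Done", "Failed/Dead Ends", "Next")


def append_entry(ledger: Path, section: str, text: str,
                 day: Optional[date] = None) -> None:
    """Append one dated entry to the ledger without touching older lines."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section!r}")
    stamp = (day or date.today()).isoformat()
    ledger.parent.mkdir(parents=True, exist_ok=True)
    # Opening in "a" mode only ever writes at the end of the file, so
    # earlier entries (including dead ends) are never rewritten.
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(f"- {stamp} [{section}] {text}\n")


def entries(ledger: Path, section: str) -> List[str]:
    """Return the entries recorded under one section, oldest first."""
    if not ledger.exists():
        return []
    tag = f"[{section}]"
    found = []
    for line in ledger.read_text(encoding="utf-8").splitlines():
        parts = line.split(" ", 2)  # "-", date, "[Section] text"
        if len(parts) == 3 and parts[2].startswith(tag):
            found.append(line)
    return found
```

Under these assumptions, answering “where did I leave off?” amounts to reading back entries(ledger, "Failed/Dead Ends") and entries(ledger, "Next") before starting any work.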

the-student — turns what I learn while building into CKA study cases. Every case gets a “CKA angle” (how this maps to exam domains), exact commands copied verbatim, and an Anki export at the end. There’s a rollup deck at anki/cka-deck.tsv — Anki imports it directly. It also tracks a study streak and nudges me if I’ve gone three days without capturing anything. The nudge fires when I open any session, which is the only time it can actually reach me.
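
For a sense of what the Anki export and the streak nudge could boil down to, here is a small Python sketch. The function names and the exact threshold comparison are hypothetical; the only details taken from the post are the tab-separated deck and the three-day window.

```python
import csv
from datetime import date
from pathlib import Path
from typing import Iterable, Tuple

NUDGE_AFTER_DAYS = 3  # the post's three-day window; using ">=" is an assumption


def export_deck(cards: Iterable[Tuple[str, str]], path: Path) -> None:
    """Write (front, back) pairs as tab-separated rows, one card per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerows(cards)


def needs_nudge(last_capture: date, today: date) -> bool:
    """True once NUDGE_AFTER_DAYS or more days have passed since a capture."""
    return (today - last_capture).days >= NUDGE_AFTER_DAYS
```

Going through the csv module rather than joining strings by hand means a card that happens to contain a tab or a quote still ends up as one well-formed row.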

the-contrarian — the one I’m most proud of the name for. Before I start any deployment task, it reads the live cluster (kubectl get pods -A, kubectl top nodes, read-only only, no changes ever), compares it against the repo, checks the hardware todo list, and tells me whether what I’m about to do is actually sensible right now. It knows what hardware I have, what’s coming but not yet built, and the correct sequence for post-travel upgrades. If I try to plan a GPU workload deployment before the GPU box is on Tailscale, it tells me. If I try to add a 4th node before doing the RAM upgrade, it tells me. The “rabbit hole warning” format is concrete: what you think it’ll take, what it will actually take, the first unexpected thing that will appear.
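
One way to picture the read-only rule is as an allowlist of kubectl verbs. The sketch below is an assumption about how such a guard could look, not how the agent actually enforces it; the verb list and the is_read_only helper are illustrative.

```python
import shlex

# Verbs treated as read-only in this sketch; the list is an assumption.
READ_ONLY_VERBS = {"get", "describe", "top", "logs", "explain",
                   "api-resources", "version"}
SHELL_METACHARS = set(";|&$`<>\n")


def is_read_only(command: str) -> bool:
    """Return True only for a single kubectl call whose verb is allowlisted."""
    # Refuse anything that could chain or redirect commands in a shell.
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    # Require the verb directly after "kubectl"; a leading flag such as
    # "-n" is rejected, which errs on the side of caution.
    return (len(tokens) >= 2
            and tokens[0] == "kubectl"
            and tokens[1] in READ_ONLY_VERBS)
```

The guard is deliberately conservative: it will refuse some harmless commands, which seems the right trade for something that must never change the cluster.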

manifest-reviewer — read-only pre-commit gate for kubernetes/. Checks for plaintext secrets about to be committed (highest priority), missing cert-manager.io/cluster-issuer annotations, forgotten subPath on Postgres mounts (my own documented gotcha — Longhorn and lost+found), orphaned ArgoCD apps that aren’t wired into the bootstrap. Never edits, just reports.
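
Two of those checks can be sketched in a few lines of Python over an already-parsed manifest (a plain dict, so no YAML library is needed here). Applying the annotation check to Ingress objects, detecting Postgres by image name, and matching the data mount by path prefix are all assumptions for the example rather than the reviewer's real rules.

```python
from typing import List

ISSUER_ANNOTATION = "cert-manager.io/cluster-issuer"
PG_DATA_PREFIX = "/var/lib/postgresql"  # assumed data path for this check


def check_ingress(manifest: dict) -> List[str]:
    """Flag an Ingress that lacks the cluster-issuer annotation."""
    if manifest.get("kind") != "Ingress":
        return []
    meta = manifest.get("metadata") or {}
    annotations = meta.get("annotations") or {}
    if ISSUER_ANNOTATION in annotations:
        return []
    name = meta.get("name", "?")
    return [f"Ingress {name}: missing {ISSUER_ANNOTATION} annotation"]


def check_postgres_mounts(manifest: dict) -> List[str]:
    """Flag Postgres data mounts that do not set subPath."""
    spec = manifest.get("spec") or {}
    pod_spec = (spec.get("template") or {}).get("spec") or {}
    findings = []
    for container in pod_spec.get("containers") or []:
        # Heuristic: only inspect containers running a postgres image.
        if "postgres" not in container.get("image", ""):
            continue
        for mount in container.get("volumeMounts") or []:
            on_data_path = mount.get("mountPath", "").startswith(PG_DATA_PREFIX)
            if on_data_path and not mount.get("subPath"):
                findings.append(
                    f"{container.get('name', '?')}: "
                    f"mount {mount.get('name', '?')} has no subPath"
                )
    return findings
```

Both functions only return findings and never modify the manifest, which mirrors the reviewer's report-only role.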

the-blogger — banks blog-worthy moments in a local seeds directory. Low-friction: a few lines of raw material, the hook, angle ideas. Drafts posts on request. Never publishes automatically. (This post was seeded by it during the session where I built the setup. The irony is not lost on me.)

adr-writer — records architecture decisions in my existing ADR format (I have docs/adr/ in the repo). Does a risk check first: flags one-way doors, hidden day-2 costs, credible alternatives I might be dismissing. If the decision is genuinely fine, it says so and drafts. The escape hatch is “just record it” — say that and it writes without argument. Drafts land locally; I move them to docs/adr/ myself when they’re ready to commit.


The persistence story

The agents’ memory directories live inside the repo under .claude/agent-memory/. They’re gitignored — they never reach GitHub. The repo is public, so my study cases, my blog seeds, and my done/failed ledger stay local. But they’re real files on disk, which means they survive session resets, survive closing VS Code, survive switching machines.

The switching-machines part matters because the dev environment itself is part of this setup. I run VS Code Remote-SSH into a Debian VM on my Unraid server. All work happens inside that VM — kubectl, helm, the agents, everything. The working files live on an NFS-mounted share backed by the Unraid array, not on the VM’s SSD. So if the VM’s disk dies (it’s a CWWK board, I’ve made my peace with the risk), nothing of value was on it. And because every laptop SSHs into the same VM, there’s no sync problem. The session state is a single source of truth. I switch from the MacBook to the Windows machine and the session is just there, because it never moved.

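The agents’ memory moves with the repo because it lives in the repo’s working directory, which every machine reaches through the same VM. It is gitignored, so it never travels through GitHub; it simply never leaves that directory. Open Claude Code on any machine SSHed into the VM, and @it-accountability reads the same ledger. That’s the “continue where I left off” story: not session resumption, but persistent state that travels with the working directory.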


What I’d do differently

The CLAUDE.md lesson: keep it small and stable, and point the rest at agents. My current CLAUDE.md is about 60 lines — hardware, conventions, a pointer to each agent, and a note that the live status lives in it-accountability, not in the file itself. The agents handle the accumulation; the CLAUDE.md handles the durable facts that rarely change. That separation is what stops it rotting.

The agent I’d add if I were starting over: a runbook agent. Something that maintains step-by-step runbooks for recurring operations — drain a node, add a new app, rotate a SOPS key — so I’m not reconstructing the same sequence from scratch every time and inevitably missing the one step that matters. It’s close to what the-student does for concepts, but scoped to operational procedures. On the list for when I’m back from travel.

The thing I underestimated: how much the name of an agent matters for delegation. @the-contrarian is memorable and sets the right expectation immediately — this is the thing that argues with me. A generic @k8s-advisor would get used much less. If you build this kind of setup, spend time on the names. They’re the interface.


The bet I’m making

The two-day incident is why I built this. Whether the system actually prevents it — I’m about to find out.

The idea is that I open a session, ask “@it-accountability where did I leave off?”, and I’m back in context in under a minute instead of re-deriving everything from scratch. The ledger either works or it doesn’t, and if it doesn’t, at least it’ll log why.

The Contrarian already earned its keep once: it talked me out of starting the GPU box setup the night before travel by giving me an honest time estimate (three sessions minimum, not two hours) and pointing out that Zero Trust Access was still open and more important. One session in. Good sign.

the-student, manifest-reviewer, and the-blogger are still hypotheses. The idea is that study cases get logged as I solve things, without my having to sit down and write them up, and that the reviewer catches convention mistakes before they hit the cluster. Maybe they work exactly as designed. Maybe I’ll discover the real friction point is somewhere I didn’t anticipate. I’m genuinely fine with either outcome; that’s what the accountability ledger is for.

Six markdown files. The hard part was figuring out what each one’s job was and making the memory boundaries explicit. The rest I’ll know in a few weeks.

If you’re studying for CKA while building real infrastructure, I’d start with it-accountability and the-student. Even if the whole system half-works, those two are the ones I’d bet on.

The agent definitions are at github.com/Steficzko/homelab if you want to use them as a starting point — or see what I got wrong.


Built on: K3s v1.35, ArgoCD, Longhorn, cert-manager, Cloudflare Tunnel. Agents: Claude Code custom subagents with project-scoped persistent memory. Dev seat: Debian 12 VM on Unraid, VS Code Remote-SSH over Tailscale.