6

Root cause, not symptom

The Socratic method, weaponized for the moment things break. Which they will. This chapter is the difference between fixing and re-fixing.

There was a problem in one of my systems that I fixed three times. I want you to feel the rhythm of it, because I promise you have lived some version of it too.

Something misbehaved. I told my AI to fix it. It found a plausible cause, patched it, the problem went away. Victory. Two weeks later, the same problem, wearing slightly different clothes. Another plausible cause, another patch, another victory. Then a month later, again. Same problem, third outfit. Three intelligent fixes, by an intelligent machine, supervised by a person trying to be intelligent, and the problem was still alive.

The third time, instead of asking "how do we fix this," I sat back and asked a different question, the one that broke the loop: "What would have to be true for this to keep happening?"

Notice what that question does. It stops looking at where it hurts and starts looking for the shape of the thing that keeps producing the hurt. We wrote down, before touching anything, three candidate explanations. Not one. Three. The first was the thing we had already patched twice, so probably not. The second was plausible but we checked it, this is Chapter 4 again, and it held up fine. The third was structural: two parts of the system had quietly disagreed about a definition from the very beginning, and every "bug" we had fixed was just the latest place their disagreement surfaced. Each patch had sealed one crack in a wall that was fundamentally misaligned.

We fixed the definition. One change, deeper than any of the patches. The problem never came back, and, this part still gives me chills, two other "unrelated" annoyances disappeared with it. They had been the same disagreement, surfacing elsewhere, and we had never connected them.

The loop I now run on every breakage, and that I make my AI run with me, has five beats. Reproduce: see the problem actually happen, do not work from a description of it. Hypothesize: write down two or three possible causes before acting, because your first guess is a prediction machine of its own, and it is overconfident too. Isolate: check which hypothesis survives contact with evidence. Fix the cause, not the place it hurts. Verify: confirm the fix held, and look around it, because root causes travel in groups.

A teacher I respect would call this the Socratic method: you do not accept the first answer, you interrogate it until what is actually true has nowhere to hide. I just call it the only way I know to fix something once.

Stop here. Actually sit with this before you scroll on.

Why did the fix not fix it?

Try this nowUnder 30 minutes
  1. Pick something currently or recently broken. Tech or not. A device, a recurring miscommunication with someone, a process at work that keeps failing the same way.
  2. Before any fixing, write three candidate causes. Force yourself to three. The third is usually where the interesting one hides.
  3. Ask your AI to play the skeptic: "Here is the problem and my three hypotheses. What evidence would distinguish them? Argue against my favorite."
  4. Check, then fix the survivor. Aim the fix at the cause that held up, not the symptom that hurt.
  5. Note your hit rate. Was your first guess the real cause? Keep score across breakages. The humility this builds is the whole skill.

Two questions before you go

Answer, then say whether you were sure or guessing. Being honest about which is the skill being trained.