Rebuilding a Windows Bug-Hunter's Mind Inside an AI

How do you get an AI to find the security bugs that an expert finds by thinking, the bugs that do not crash anything and so cannot easily be confirmed automatically? This is a plain report on building the rules and habits for that: how the AI is made to doubt its own conclusions, keep an honest record, and come up with new ideas on purpose. It covers what worked, what broke, and what comes next.

Posted Jun 16, 2026 Updated Jun 16, 2026

By Kazuma Matsumoto


Introduction

I have been working on an unusual project: getting an AI to find security bugs the way an experienced human researcher does. Not by copying what the expert knows — facts like which Windows version or which function, which can be looked up and which change over time — but by copying how the expert works: what they look at first, how they form a guess, and how hard they try to prove their own guess wrong before they believe it.

Here is the important difference. Today there are good AI tools that find one common type of bug: the kind that makes a program crash. The AI suggests a possible bug, a separate tool runs the program, and if it crashes, the bug is confirmed. The crash is an automatic check, and it works well. This project is about the other kind of bug — the kind where the program does not crash at all. It simply does the wrong thing: it trusts the wrong person, or checks something at the wrong moment, or uses a different name than the one it checked. For these bugs there is no crash, so there is no automatic way to confirm them. And when there is no automatic check, the only thing keeping the work honest is the researcher’s own discipline.

So the hard part of this project is not finding bugs. It is getting the AI to doubt its own conclusions — because that is the part with no safety net. I will explain every Windows term in plain words as I go. You do not need to be an expert to follow it.

Where These Rules Came From

The rules in this post were not invented from scratch. A large part of the work was studying how the best Windows security researchers actually think, and turning that into instructions an AI can follow.

I read many of their public write-ups and conference talks, closely and more than once. I was not looking at the specific bugs they found. I was looking at the move their mind made just before a bug appeared: how they chose what to look at, how they turned a vague feeling that “something is off” into a clear statement that could be tested, and how they then tried to prove or disprove that statement. The same small set of moves kept coming up across different people and different targets. That repetition was the signal. I wrote each move down in plain words and made it one of the AI’s rules.

Some of those moves shape the parts that follow. Four others are worth stating on their own:

Try the obvious direct attack first. Before building a clever, indirect attack, do the simplest thing. If the simple thing already works, there was never any protection to get around, and the clever idea proves nothing. But if the simple thing fails with an “access denied” type of error, that failure is good news: it proves the protection is real, so any indirect way around it is a genuine finding. A failed direct attempt is not wasted — it tells you there is a wall worth getting past.
Do not accept the first tidy explanation. When something strange happens, the mind reaches for a neat story that makes the strangeness go away. That neat story is the trap. Treat the first explanation as a guess to attack, and ask one more question: “if that were fully true, how could the other thing I also saw ever happen?” The bug is often one question past the comfortable answer.
Turn every guess into a test that could fail. A guess on its own does nothing. Each guess has to pick one small action whose result would show it is wrong, and you decide in advance what result would count as “wrong.” Then a failure is not the end; it is a sign pointing at the next thing to try.
Write down every odd thing, even when it is not a bug. A single strange observation is often one piece of a pattern that only becomes clear much later. If you do not record it, the piece is lost. So the AI keeps a running note of every surprising behavior, with no pressure to explain it yet. (One researcher noted a small oddity while working on something else; more than a year later it grew into a whole new family of bugs.)
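
The third move, turning a guess into a test that could fail, can be pictured as a small data structure. This is a minimal Python sketch, not the project's actual code: the names `Guess` and `run_test` and the example strings are hypothetical. The only point it makes is that the "what counts as wrong" rule is fixed before the result is seen.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Guess:
    """A guess paired with the one result that would show it is wrong."""
    statement: str                       # the guess, as one clear sentence
    action: str                          # the single small action to perform
    refuted_if: Callable[[str], bool]    # decided BEFORE the test is run


def run_test(guess: Guess, observed: str) -> str:
    """Compare an observed result against the rule fixed in advance."""
    if guess.refuted_if(observed):
        return f"REFUTED: {guess.statement} (saw: {observed})"
    return f"SURVIVED: {guess.statement} (saw: {observed})"


# Hypothetical example of "try the obvious direct attack first":
# the guess is that the protection is real, so anything other than
# an access-denied result refutes it.
wall_is_real = Guess(
    statement="the direct request is blocked for an ordinary user",
    action="open the protected object directly as an ordinary user",
    refuted_if=lambda result: result != "access denied",
)

print(run_test(wall_is_real, "access denied"))
```

A refuted guess is not thrown away. Its message names what was actually seen, which is the pointer to the next thing to try.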

I did one more piece of background research: how to find contradictions on purpose. A contradiction is just two things that are supposed to agree but do not, and in security work that is usually where the bug is. I studied how people in other fields find contradictions deliberately, rather than by luck, and built those steps into the method. (More on this further down, where the AI uses it to come up with new things to test.)

Part 1: What an Expert Notices First

The first thing to copy is what an expert notices. Point a beginner and an expert at the same program, and the expert looks straight at one thing: the difference between what the programmer assumed was true and what the program actually checks.

Here is what that means. Suppose a service is supposed to give your saved data back only to you. The programmer assumed: “only the owner can read this data.” But in the actual code, the service decides who you are from a name you typed into the request — it never checks that you are really that person. So anyone who knows the name can read your data. The assumption (“only the owner can read this”) and the real check (“does the name match?”) are not the same thing. The bug lives in that difference.

Real Windows services have this problem often. One real example: Windows keeps a list of which program handles which service. In one case, that list trusted whoever signed up first, and never checked whether they were allowed to. So an attacker could sign up first and pretend to be a trusted system service. The programmer assumed “only the real service signs up here.” Nothing in the code made sure of it.

This is not magic, and that is exactly why an AI can do it. It is a simple, repeatable procedure:

Write the programmer’s hidden assumption as one short sentence. “Only the owner can read this.” “This can only be reached from the local machine.” “Whoever is calling has already proven who they are.” Forcing it into one sentence is the trick: a vague worry cannot be tested, but a clear sentence can.
Find the exact line in the code where that sentence is supposed to be enforced — the real check.
Show that, on a path you can reach, the check and the assumption do not agree.

I made this the AI’s first move on any target, instead of the vague “look for something that seems wrong.” Do not look for “a bug” in general. Find the assumption, find the check, and show they disagree.
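
To make the gap concrete, here is a toy version of the saved-data service from the example above, as a minimal Python sketch. Everything in it (`Request`, `SAVED_DATA`, the two functions) is made up for illustration and stands in for no real Windows service. It only shows the shape of "the assumption and the real check are not the same thing."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    authenticated_user: str   # who the caller really is (established by the system)
    claimed_name: str         # a name the caller typed into the request


SAVED_DATA = {"alice": "alice's saved data", "bob": "bob's saved data"}


def read_data_actual(req: Request) -> str:
    """The real check: looks the data up by the typed name and nothing else."""
    return SAVED_DATA[req.claimed_name]


def read_data_assumed(req: Request) -> str:
    """What the one-sentence assumption requires: only the owner can read this."""
    if req.authenticated_user != req.claimed_name:
        raise PermissionError("caller is not the owner")
    return SAVED_DATA[req.claimed_name]


# A reachable path where the two disagree: Bob asks for Alice's data by name.
attack = Request(authenticated_user="bob", claimed_name="alice")

print("actual: ", read_data_actual(attack))
try:
    read_data_assumed(attack)
except PermissionError as err:
    print("assumed:", err)
```

The finding is not "the code looks suspicious." It is one input on which the function that exists and the function the programmer had in mind return different answers.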

Part 2: Patience — Read the Code Before You Test It

Noticing is useless without the patience to follow it the slow way. There is an easy way to look for bugs and a hard way, and choosing the hard way is the second habit.

The easy way is called fuzzing: send a program huge amounts of random, broken input and wait for it to fail. It is a real method and it finds real bugs — but mostly one kind: crashes, which usually just stop the service from working (this is called denial of service) rather than give an attacker more power. And fuzzing finds these without understanding the program at all. That is its strength — it is fast, and anyone can run it — and also its limit: it cannot find the quiet bug where the program does exactly what it was written to do, and what it was written to do is wrong.

So the rule is: understand the program first; use fuzzing last. Read the code that makes the security decision. Write down what it is supposed to guarantee. Find where that guarantee fails. Only fall back to random input when you have run out of things to understand — and even then, send the input in a smart order that builds on the last step, not one random shot at a time, because one-shot random input finds far less.

To be fair and not oversell this: it is a trade-off, not a clear win. The quiet bugs are actually harder to find than crash bugs, and they are often one of a kind. Their real advantage is that they are harder for a defender to notice (no crash, no alarm), they get past the protections Windows builds against crash bugs, and they survive when the code is rewritten. They are not “easier to attack”; crash bugs keep their own advantage there. The point is only this: the quiet area is the one the automatic tools cannot see, which is exactly why a careful, human-style method is worth using there.

To keep the AI from jumping straight to fuzzing, I wrote down a fixed order of where to look: from the most valuable, quietest bugs at the top, down to crash bugs at the bottom, where fuzzing finally belongs.

There is good reason to bet on this slow, careful approach. One researcher spent a long time carefully reading the code of one neglected part of Windows — the registry — and found dozens of serious bugs that years of automatic fuzzing had missed. (An honest warning: you mostly hear about the long effort that paid off, not the equally long efforts that found nothing. Careful depth can pay off greatly; it is not guaranteed to.)

Two ways of working, used together

“Understand the program first” raises a question: how do you build that understanding? The rule is that the AI must do two things at once, and treat doing only one as half a job. First, reason from how the program works — work out what must be true from the way the code behaves, because often the answer is written down nowhere. Second, search for what others already know — the official documentation, the technical specifications, other researchers’ write-ups — and where sources disagree, work out which one is correct.

You need both, because they fail in different ways. Reasoning with no searching invents a confident picture that may be wrong. Searching with no reasoning collects facts without understanding which one matters. So a claim is only treated as solid when both — done separately — reach the same answer, and, where possible, the test machine agrees too. One source is an opinion. Two that did not copy each other and still agree is real evidence.
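
The agreement rule can be written down in a few lines. This is a minimal Python sketch under my own assumptions: the function name `claim_standing` and the three labels are hypothetical, and answers are compared as plain strings for simplicity.

```python
from typing import Optional


def claim_standing(reasoned: Optional[str],
                   searched: Optional[str],
                   machine: Optional[str] = None) -> str:
    """Grade a claim from two independent answers and an optional machine result.

    reasoned: the answer worked out from how the program behaves
    searched: the answer found in documentation or other write-ups
    machine:  what the test machine showed, if the test was run
    """
    if reasoned is None or searched is None:
        return "opinion"      # only one way of working was used: half a job
    if reasoned != searched:
        return "disputed"     # the two disagree: work out which one is correct
    if machine is not None and machine != reasoned:
        return "disputed"     # both agree, but the machine shows otherwise
    return "solid"


print(claim_standing("checks the typed name", None))
print(claim_standing("checks the typed name", "checks the caller's token"))
print(claim_standing("checks the typed name", "checks the typed name",
                     machine="checks the typed name"))
```

The sketch cannot enforce the part that matters most, that the two answers were reached separately and did not copy each other. That stays a matter of discipline in how the work is done.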

Part 3: Self-Doubt — the Hardest Habit to Build

This is the heart of the project, and the hardest part to build, because the crash-based tools get it for free (the crash is their honest check) and a method with no crash has to build it from nothing.

Here is a real and uncomfortable fact about today’s AI models. Take a model that has reached the correct answer. Argue against it confidently, with wrong reasoning that sounds right, and a large share of the time it will give in and change its correct answer to a wrong one. Researchers have measured this: a single round of confident, wrong criticism can sharply lower a model’s accuracy. (Models built for careful step-by-step reasoning resist this much better, which matters here.) Think about what that means, because it breaks the obvious plan.

The plan so far was: make the AI attack its own findings. But if a confident wrong argument can talk a model out of the truth — and the attacker is the same model — then careless self-criticism does not improve the work. It destroys the correct findings along with the wrong ones. So I need a review step that removes bad findings without destroying good ones. Here is the design.

The review. After the AI thinks it has found a bug, it reviews its own finding two or three times. Each time is a separate, hostile pass whose only goal is to destroy the finding. Splitting it into separate passes is on purpose: a mind checking its own work tends to defend it, so I force it to switch roles — “now your only job is to kill this finding.” The passes get stricter:

Check the logic, step by step, and throw out anything not backed by something concrete.
Try to disprove it on a real test machine, using a debugger. Design the test that would show the finding is wrong, and keep “I ran it and saw this” strictly separate from “I worked it out in my head.”
Look for boring explanations. A passing test usually has a dull cause — a saved result, a side effect, your own setup mistake — before it has an exciting one. Rule those out first.

Then three rules stop the review from destroying good findings. Each one is aimed at the confident-but-wrong critic the research measured:

(1) Every criticism must point to real evidence: a specific value, something seen in the debugger, a source actually read. A criticism that only says “this seems wrong,” with nothing behind it, counts for nothing and cannot lower a finding’s standing.
(2) When an evidence-backed criticism still disagrees with the finding, the test machine decides, not whichever side sounds more confident.
(3) Use the strongest reasoning model for the hostile pass, because those resist a confident wrong argument best.
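
Rules (1) and (2) are mechanical enough to sketch. The Python below is a minimal illustration, not the project's review code: `Criticism`, `review`, and the example strings are hypothetical, and rule (3), the choice of model, is outside what a snippet like this can show.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Criticism:
    text: str
    # A specific value, a debugger observation, or a source actually read.
    evidence: Optional[str] = None


def review(criticisms: List[Criticism],
           rerun_on_machine: Callable[[], bool]) -> str:
    """Return the finding's standing after one hostile pass."""
    backed = [c for c in criticisms if c.evidence]
    if not backed:
        # Rule 1: criticism with nothing behind it cannot lower the standing.
        return "stands"
    # Rule 2: an evidence-backed disagreement is settled by the machine,
    # not by whichever side sounds more confident.
    return "stands" if rerun_on_machine() else "rejected"


vague = [Criticism("this seems wrong")]
backed = [Criticism("the caller is checked after all",
                    evidence="debugger: access check hit before the lookup")]

print(review(vague, rerun_on_machine=lambda: True))
print(review(backed, rerun_on_machine=lambda: False))
```

Note what the function never looks at: how confident the criticism sounds. The text of a criticism has no way to change the outcome on its own.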

Rule (2) is the one that matters most, and it works only because this kind of research has something the pure-thinking fields do not: a real machine to test on. When the reasoning and a criticism disagree, you do not pick the more confident voice — you run it on the machine and look at what actually happens. The machine has no pride and no debating skill. It just shows you the result. Richard Feynman said it best, fifty years ago, and it is the rule the whole project is built around:

The first principle is that you must not fool yourself — and you are the easiest person to fool.

This is not only a rule for the AI. While working on this project, I have caught myself doing exactly what the system is built to stop. I “corrected” a fact to the answer that felt obviously right, stated it with full confidence — and I was wrong. The truth was that two reliable sources genuinely disagreed, and my “correction” had quietly erased that disagreement. What caught me was not me being clever. It was the plain rule that does not care how confident I feel: go back to the original source, read it again, and trust it over your memory. That is why the rule exists. I would rather show you the method catching its own author than ask you to trust that it works.

Part 4: Memory — a Record That Cannot Quietly Change

There is a second, quieter problem on long jobs: the AI’s own memory. An AI cannot keep everything it has learned in mind at once. As a job goes on, older notes are shortened to make room, and shortened again, and details can change in the process. The detail most likely to change is the one that matters most: how sure I am of a fact. Did I prove this, or only guess it? A guess from yesterday, after being shortened a few times, can come back today looking like a proven fact — and every later decision that trusts it is now built on a mistake.

So here is a rule I will defend: for an AI that writes its own long-term notes, the record of how it knows a fact must never be changed afterward. That is a safety rule, not just tidiness. The worst failure on a long job is not a wrong fact; it is a fact whose standing quietly gets upgraded from “guess” to “proven.” I make that impossible. The AI keeps a simple log — one short entry per fact — that records not just what it believes but how it knows it: guessed, read somewhere, or proven on the machine. The “how it knows” line can never be edited. You cannot turn a guess into a proof. If a guess is later proven, you do not change the old entry; you add a new one that points back to it.

one entry in the log (example)

claim:       this service decides who you are from a name in the request, not a real check
how I know:  PROVEN ON THE MACHINE   (guessed / read / proven)
what I saw:  asked for another account's data by name → the service handed it over
confidence:  proven

Rule: the "how I know" line can NEVER be edited.
A guess is never changed to read "proven."
If a guess is later proven, add a NEW entry that points back to this one.

(This entry is made up, to show the format. It is not a real finding from this project.)

It looks like paperwork. It is actually what stops the AI from slowly talking itself into believing its old guesses were facts.
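
For readers who think in code, the log rule looks roughly like this. It is a minimal Python sketch with hypothetical names (`Entry`, `Log`, `supersedes`), and the entries are made up like the one above. A frozen dataclass only blocks ordinary reassignment, so a real system would also need storage that refuses edits.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ALLOWED = ("guessed", "read", "proven")


@dataclass(frozen=True)   # frozen: fields cannot be reassigned after creation
class Entry:
    claim: str
    how_i_know: str                    # guessed / read / proven
    what_i_saw: str
    supersedes: Optional[int] = None   # index of the earlier entry this points back to


class Log:
    """Append-only: entries can be added, never edited or removed."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def add(self, claim: str, how_i_know: str, what_i_saw: str,
            supersedes: Optional[int] = None) -> int:
        if how_i_know not in ALLOWED:
            raise ValueError(f"how_i_know must be one of {ALLOWED}")
        if supersedes is not None and not 0 <= supersedes < len(self._entries):
            raise IndexError("supersedes must point at an existing entry")
        self._entries.append(Entry(claim, how_i_know, what_i_saw, supersedes))
        return len(self._entries) - 1

    def entries(self) -> Tuple[Entry, ...]:
        return tuple(self._entries)    # a read-only view for callers


log = Log()
guess = log.add("service trusts the name in the request", "guessed",
                "nothing yet: inferred from reading the code")
# Later the guess is proven. The old entry stays; a new one points back to it.
proof = log.add("service trusts the name in the request", "proven",
                "asked for another account's data by name, got it",
                supersedes=guess)
print(log.entries()[guess].how_i_know, "->", log.entries()[proof].how_i_know)
```

Because the old entry survives, anyone reading the log later can still see that the claim started life as a guess and when it stopped being one.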

The same rule blocks a related mistake that has a name in research: HARKing, short for “Hypothesizing After the Results are Known,” which means claiming you predicted something after you already knew the answer. It is tempting because it makes a result look stronger: a result you truly predicted and then confirmed is strong evidence, while the same result found by accident and explained afterward is weak, because you can invent a clever explanation for almost anything once you know how it ends. So anything found after seeing the result is recorded honestly as “seen first, explained after,” never rewritten as a prediction made in advance.

Part 5: Coming Up With Ideas on Purpose

Everything so far helps the AI reject a bad idea. None of it produces a new one. And here is the strongest objection to the whole project: a method that only rejects ideas is just a careful way of finding what you already expected. The real creative step, inventing a kind of bug that nobody has named, is the part everyone says cannot be automated. So how can a new idea come on purpose, instead of by luck? Three techniques try to do exactly that.

1. List the choices, then look at the gaps. Take a mechanism and write down each separate decision it makes, as a grid. For “how does this service decide who you are?”, the columns could be: does it check your real identity, a name you typed, or nothing? The rows could be: does it check every time, or once and then trust you? Now fill in the boxes with the bug patterns people already know. The boxes that are still empty, but that an attacker can reach, are predictions: “no one has found a bug of this shape here yet — go look.” This turns “I hope I notice something” into “here are the exact gaps to check.”

2. Borrow a rule from a field that already solved it. Take a rule that some well-understood field treats as essential, and ask directly whether this Windows mechanism follows the same rule. For example: when your web browser connects to a website, it checks the website’s identity (using a certificate), so a fake website cannot quietly take its place. So ask the same question of a Windows mechanism: when something replies to a request, does Windows check the identity of whatever replied — or does it just trust that the right thing replied? Published research by others asked exactly this, and the answer was: it does not check. Whatever replies fast enough is believed. That missing check is the start of a real bug. Notice where the idea came from: not from a clue in the code, but from a rule that a completely different field treats as essential.

3. Look for contradictions. A contradiction is two things that should agree but do not, and each one is a concrete thing to test. Some kinds to look for: the manual (the official description of how a part should behave) says one thing, but the code does another; two programs that are supposed to follow the same rule handle the same input differently; one part of a system assumes something that another part never actually guarantees; or a value is checked for safety once and then used later as if it cannot change (and if an attacker can change it in between, the check meant nothing). For each contradiction you find, you make the two sides disagree on the test machine and watch what happens. Most of the time nothing breaks, and you have ruled something out — that is still progress. Sometimes it breaks, and that is a real lead.
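
The first technique, listing the choices and looking at the gaps, is the easiest of the three to show in code. The Python below is a minimal sketch using the "how does this service decide who you are?" grid. The contents of `known` and `reachable` are invented placeholders, not real findings or real bug catalogs.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Cell = Tuple[str, str]

what_is_checked = ["real identity", "typed name", "nothing"]     # columns
when_is_checked = ["every time", "once, then trusted"]           # rows

# Boxes already filled by bug patterns people know (placeholder labels).
known: Dict[Cell, str] = {
    ("typed name", "every time"): "identity taken from a request field",
    ("real identity", "once, then trusted"): "checked once, used later",
}

# Boxes an attacker can actually reach on this target (placeholder set).
reachable: Set[Cell] = {
    ("typed name", "every time"),
    ("typed name", "once, then trusted"),
    ("nothing", "every time"),
}


def gaps(columns: Sequence[str], rows: Sequence[str],
         known: Dict[Cell, str], reachable: Set[Cell]) -> List[Cell]:
    """Boxes with no known bug pattern that an attacker can still reach."""
    return [cell for cell in product(columns, rows)
            if cell not in known and cell in reachable]


for cell in gaps(what_is_checked, when_is_checked, known, reachable):
    print("go look:", cell)
```

Each printed box is a prediction in the sense used above: it was written down before anyone looked, so whatever turns up there cannot be explained after the fact.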

Part 6: The Tools, and How It Runs

Most of this work is not exciting. It is hours of reading machine code that was never meant to be read, making a small guess, checking it, and being wrong often. The cycle is: read the program in a disassembler (a tool that turns machine code back into something readable) → work out what it must be doing → check that against many trusted sources and settle any disagreements → confirm the behavior on the test machine with a debugger → write the corrected understanding into the log. The most important step is the checking: when sources disagree, you do not pick the one you like — you find out which is right.

The AI does not just talk about these tools; it actually uses them, through a standard connection (called MCP) that lets an AI control real software. The point is not the connection, but what comes back through it. For example: the AI sets a breakpoint inside the running service, steps through it one instruction at a time, reads the identity it is running under directly from the live program, and writes “saw: running as the SYSTEM account” — not “should be running as SYSTEM” — into its log. The actions are real, and the difference between “ran it” and “thought about it” is set by which tool produced the note.

One more lesson, learned the hard way. When a stage needs to cover a lot of ground — check forty facts, look at a dozen services — the AI splits the work across many helper agents running at the same time. On an early run it started about fifty at once, in two big groups. The result was not fifty answers. It hit a usage limit and came back completely empty — the whole run wasted. The fix became a firm rule: run the helpers in groups of at most ten, one group at a time, never two big groups at once. A small rule, but it points at something true about running many things in parallel: the limit is usually not the work itself, but the shared resource everyone is using at once.

Part 7: Where This Stands, and What Comes Next

Let me be honest about where this is, and then about why I am hopeful.

Right now, a person still does the hardest part. A human picks the target, decides where to look, and decides when to give up on a dead end. And the rarest step — inventing a new kind of bug that nobody has named — is still the human’s. That is the step an AI is least suited to, because there is no past example of it to copy. What is built so far makes the AI a much more careful hunter of the known kinds of bug. It does not yet invent new kinds on its own. I would rather say that plainly than let the title promise too much.

But look at what is built, because it is more than it sounds. The habits that check an idea (noticing the gap, reading carefully, doubting the answer, keeping an honest record) are mostly working. What is built is the part most people skip: not raw ability, but being trustworthy when there is no automatic check. That means an AI that can tell its own proofs from its own guesses, refuses to mix them up, and accepts what the test machine shows even when its own reasoning disagrees. Trustworthiness of this kind is a real, unsolved problem in AI agents today, and it is the foundation everything else needs. You cannot trust an AI to invent until you can first trust it to check itself. That part is mostly working.

So the honest picture is not a finished result. It is a strong start: the careful, self-doubting checker is working, and the inventor is still mostly the human’s job. And that is the part I find exciting, because what comes next is the best part. The next step is to push the idea-finding habits — listing the gaps, borrowing rules from other fields — from “helps a human have an idea” toward “has an idea on its own,” and to measure it honestly across many cases, not from one good story. There will be a lot of trial and error, and I plan to keep writing about it — the failures as well as the wins, because the failures are where this method proves its worth.

What I set out to do was rebuild the way an expert thinks, written down as clear rules a machine can follow: how they notice the gap, how they read before testing, how they keep an honest record, how they come up with ideas — and, above all, how they refuse to fool themselves. It is not finished. But it works, it improves with every round, and the part that is left is the part most worth doing. I am going to keep building it, out in the open.

A few sources, if you want to read more

James Forshaw, How to secure a Windows RPC Server, and how not to — a clear look at how these “who is calling?” checks are supposed to work, by one of the best in the field; much of the “assumed vs. actually checked” idea comes from his write-ups.
Yifei Ming and others, Helpful Agent Meets Deceptive Judge (2025) — the study behind Part 3: a confident but wrong criticism can sharply lower an AI’s accuracy in a single round, and stronger reasoning models resist it better.
Richard Feynman, Cargo Cult Science (1974) — the short talk where “you must not fool yourself” comes from.

Security Research, AI

This post is licensed under CC BY 4.0 by the author.