Rebuilding a Security Researcher's Mind in an AI — to Invent Attacks, Not Just Find Them

Anyone can now point an AI at software and find zero-days of known kinds, and that capability is spreading fast. This is a report on building an AI for the part no automatic oracle can score: autonomously reverse-engineering undocumented Windows internals to invent attack techniques and bug classes nobody has named, and chaining small footholds into an attack that exists in none of its parts. It covers how the system borrows ways of discovering the genuinely new from fields far outside security, doubts its own conclusions, keeps a record that cannot rewrite a guess into a fact, and runs unattended for hours without fooling itself. It also covers what already works, what is still hard, and what comes next.

Posted Jun 16, 2026

By Kazuma Matsumoto

66 min read

Introduction

An AI that finds a zero-day is no longer a shock — that capability is spreading fast, and finding more bugs of familiar kinds is not what this project is about. What almost nobody is building is an AI that invents: one that reaches for an attack technique nobody has described yet. That is what this project is for.

The work is getting an AI to do Windows security research the way an experienced human researcher does — not by copying what the expert knows (facts like which Windows version or which function, which can be looked up and which change over time), but by copying how the expert works: what they look at first, how they form a guess, and how hard they try to prove their own guess wrong before they believe it.

And I should be clear up front about what kind of research, because it shapes every choice that follows. It is no longer just scanners that find bugs. The strongest automated bug-finders today — Google’s Big Sleep, the cyber-reasoning systems from DARPA’s AI Cyber Challenge, the agents now filing real CVEs — are genuinely impressive work, and the bugs they find are real and valuable. I am not out to diminish any of it; this project is simply aimed at a different problem, and naming that problem is the cleanest way to explain the work. Those systems are at their strongest when there is an automatic oracle — most often a crash — that can tell them, on its own, when an input has tripped a bug. Building one that does that well is genuinely hard, and as more teams get there, finding vulnerabilities of kinds that already have a name is steadily becoming something the wider field can do. What stays scarce is the part with no oracle to lean on at all.

So the goal here is that other part — the one no automatic check can score: to autonomously reverse-engineer the undocumented inner workings of Windows that nobody has mapped, raise conjectures from what that reveals and work out fresh places to attack, invent a kind of bug nobody has named, and chain small, individually-harmless footholds into a single attack technique that exists in none of them — the move that turns “found a bug” into “designed an attack.” The aim is to invent, not to recognize — and that one word is the whole project.

That territory has an awkward property, and it is the reason the rest of this post is mostly about discipline. Think about how an oracle-based finder actually confirms a bug: the AI proposes an input, a separate tool runs the program, and if the program crashes, the bug is real. That crash is an automatic, impartial check — an oracle — and it is what lets the whole loop run without a human in it. But an oracle has a hard limit: it can only confirm a bug whose shape was decided in advance. You cannot write a crash-check, or any automatic check, for a bug class that has not been named yet. The bugs this project is after have no oracle, for two compounding reasons. First, they usually do not crash — the program just does the wrong thing: it trusts the wrong person, checks at the wrong moment, or acts on a name it never made sure of. Second, for the genuinely new ones, there is no prior description of what “the wrong thing” even looks like. No crash, no signature, no oracle — so the only thing that can decide whether a finding is real is the researcher’s own disciplined judgment. That is the thing this project automates: not the search, but the judgment.

None of this means crash-based tools are lesser — they are excellent, and the deep ones reach serious bugs. The point is narrower: the easiest automation to build — a loop that throws input at a program and waits for a crash — has the lowest ceiling on how serious its findings tend to be, while the harder, higher-value work is the slow reading-and-reasoning that has no automatic signal at all. The interesting question is whether an AI can be given enough discipline to do that work without fooling itself.

That is the hard part of this project. It is not finding bugs. It is getting the AI to doubt its own conclusions — because that is the part with no safety net. I will explain every Windows term in plain words as I go. You do not need to be an expert to follow it.

Where These Rules Came From

The rules in this post were not invented from scratch. A large part of the work was studying how the best Windows security researchers actually think, and turning that into instructions an AI can follow.

I read many of their public write-ups and conference talks, closely and more than once. I was not looking at the specific bugs they found. I was looking at the move their mind made just before a bug appeared: how they chose what to look at, how they turned a vague feeling that “something is off” into a clear statement that could be tested, and how they then tried to prove or disprove that statement. The same small set of moves kept coming up across different people and different targets. That repetition was the signal. I wrote each move down in plain words and made it one of the AI’s rules.

There is an important guardrail attached to this, because copying experts is easy to get wrong. The rule the system follows is: borrow the thinking move, never the bug. When the AI looks at a famous past finding, it is allowed to ask “which assumption did the researcher decide to attack, and which idea from another field did they borrow?” — and then run that move on something new. It is forbidden from copying the shape of the old bug and scanning for a target that “looks like it.” A past case tells you how someone once thought; it never tells you what to go find. That one rule is what separates inventing something from re-discovering something. And re-discovering — taking a known bug shape and scanning for the next target that fits it — is exactly what an automated bug-finder is good at, and something today’s tools do remarkably well. Copying the move instead of the bug is the one discipline that points an AI at the unnamed shape instead of the next instance of a named one; almost everything else in this post exists to make that one rule survive contact with a real, fallible AI.

Some of those moves shape the parts that follow. Four others are worth stating on their own:

Try the obvious direct attack first. Before building a clever, indirect attack, do the simplest thing. If the simple thing already works, there was never any protection to get around, and the clever idea proves nothing. But if the simple thing fails with an “access denied” type of error, that failure is good news: it proves the protection is real, so any indirect way around it is a genuine finding. A failed direct attempt is not wasted — it tells you there is a wall worth getting past.
Do not accept the first tidy explanation. When something strange happens, the mind reaches for a neat story that makes the strangeness go away. That neat story is the trap. Treat the first explanation as a guess to attack, and ask one more question: “if that were fully true, how could the other thing I also saw ever happen?” The bug is often one question past the comfortable answer.
Turn every guess into a test that could fail. A guess on its own does nothing. Each guess has to pick one small action whose result would show it is wrong, and you decide in advance what result would count as “wrong.” Then a failure is not the end; it is a sign pointing at the next thing to try.
Write down every odd thing, even when it is not a bug. A single strange observation is often one piece of a pattern that only becomes clear much later. If you do not record it, the piece is lost. So the AI keeps a running note of every surprising behavior, with no pressure to explain it yet. (One researcher noted a small oddity while working on something else; more than a year later it grew into a whole new family of bugs.)
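
Two of these moves are mechanical enough to sketch in code. What follows is a minimal illustration, not the project's actual implementation: the names `DirectResult`, `interpret_direct_attempt`, and `FalsifiableGuess` are hypothetical. It shows how a direct attempt is read before any indirect attack is built, and how a guess carries its own refuting result, fixed before the test runs.

```python
from dataclasses import dataclass
from enum import Enum


class DirectResult(Enum):
    SUCCEEDED = "succeeded"
    DENIED = "denied"


def interpret_direct_attempt(result: DirectResult) -> str:
    """Read the outcome of the obvious direct attack."""
    if result is DirectResult.SUCCEEDED:
        # Nothing was in the way, so a clever bypass would prove nothing.
        return "no wall: the indirect path proves nothing"
    # The protection is real, so a way around it is a genuine finding.
    return "wall is real: the indirect path is worth building"


@dataclass(frozen=True)
class FalsifiableGuess:
    statement: str
    action: str           # the one small action to perform
    refuting_result: str  # decided BEFORE the action is run

    def evaluate(self, observed: str) -> str:
        """Compare what was observed with the result fixed in advance."""
        return "refuted" if observed == self.refuting_result else "survived"
```

The dataclass is frozen so that the refuting result cannot be adjusted after the observation comes in, which is the point of deciding it in advance.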

I did one more piece of background research: how to find contradictions on purpose. A contradiction is just two things that are supposed to agree but do not, and in security work that is usually where the bug is. I studied how people in other fields find contradictions deliberately, rather than by luck, and built those steps into the method. (More on this further down, where the AI uses it to come up with new things to test.)

And the deepest borrowing was not from security at all. Discovering something genuinely new without fooling yourself is not a security problem — it is the everyday problem of every serious research field, and those fields have spent a very long time on it. So I deliberately pulled ways of thinking from well outside security into the AI’s rulebook, then boiled each one down to a concrete habit it runs while reverse-engineering a Windows component:

the experimental method of the natural sciences — form a sharp guess, then design the experiment that would break it, not the one that would flatter it;
the way detectives and intelligence analysts keep several competing explanations alive at once instead of marrying the first one;
medical differential diagnosis — list every cause that could fit what you are seeing, then rule them out one at a time;
how mathematicians move from a loose conjecture to something actually proven;
engineering failure analysis, which works backward from a break to its true root cause;
systems thinking, for how a chain of small, individually-harmless steps adds up to one unsafe whole;
the psychology of insight and, just as important, of self-deception — the documented ways a mind talks itself into seeing what it already expected;
and the structured invention methods engineers use to manufacture new ideas on purpose, like laying a mechanism out as a full grid of choices, or running through a catalogue of inventive principles (TRIZ) one by one.

None of this rides in the AI as theory or footnotes; each is a habit it actually executes against a live Windows target. Folding this cross-disciplinary toolkit into a security agent is one of the real design bets of the project — it is how you give a machine a fighting chance at the genuinely new, instead of letting it drift back to the merely familiar.

Part 1: What an Expert Notices First

The first thing to copy is what an expert notices. Point a beginner and an expert at the same program, and the expert looks straight at one thing: the difference between what the programmer assumed was true and what the program actually checks.

Here is what that means. Suppose a service is supposed to give your saved data back only to you. The programmer assumed: “only the owner can read this data.” But in the actual code, the service decides who you are from a name you typed into the request — it never checks that you are really that person. So anyone who knows the name can read your data. The assumption (“only the owner can read this”) and the real check (“does the name match?”) are not the same thing. The bug lives in that difference.
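
The gap in that example can be shown in a few lines. This is a toy sketch under stated assumptions, not code from any real Windows service: `Request`, `STORE`, and both functions are made up for illustration. One function is the check the code actually runs, the other is the rule the programmer assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    authenticated_user: str  # who the caller really is
    claimed_name: str        # a name the caller typed into the request


# Hypothetical saved data, keyed by owner name.
STORE = {"alice": "alice's saved data", "bob": "bob's saved data"}


def read_data_as_written(req: Request) -> str:
    # The check the code actually runs: "does the name match a record?"
    return STORE[req.claimed_name]


def read_data_as_assumed(req: Request) -> str:
    # The rule the programmer assumed: "only the owner can read this."
    if req.authenticated_user != req.claimed_name:
        raise PermissionError("caller is not the owner")
    return STORE[req.claimed_name]
```

A request from `mallory` that claims the name `alice` gets Alice's data from the first function and a `PermissionError` from the second. The difference between the two behaviors is the bug.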

Real Windows services have this problem often. One real example: Windows keeps a list of which program handles which service. In one case, that list trusted whoever signed up first, and never checked whether they were allowed to. So an attacker could sign up first and pretend to be a trusted system service. The programmer assumed “only the real service signs up here.” Nothing in the code made sure of it.

This is not magic, and that is exactly why an AI can do it. It is a simple, repeatable procedure:

Write the programmer’s hidden assumption as one short sentence. “Only the owner can read this.” “This can only be reached from the local machine.” “Whoever is calling has already proven who they are.” Forcing it into one sentence is the trick: a vague worry cannot be tested, but a clear sentence can.
Take that sentence apart and negate each piece separately. “Only X can reach this” breaks three ways: maybe there is a second path (negate the “only”); maybe a different person can slip in (negate the “X”); maybe you can reach a smaller, related operation the check does not cover (negate the “reach”). Each negation is a separate thing to go check.
Find the exact line in the code where that sentence is supposed to be enforced, and show that on a path you can reach, the check and the assumption disagree.
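
The second step of that procedure is simple enough to write as a function. The sketch below is only an illustration of the three negations described above; the function name and the wording of the generated questions are my own assumptions.

```python
def negate_only_x_can_reach(actor: str, operation: str) -> list[str]:
    """Break 'Only <actor> can reach <operation>' into three separate checks."""
    return [
        # Negate "only": look for another route to the same operation.
        f"Is there a second path to {operation} that skips the check?",
        # Negate the actor: look for a way to be mistaken for them.
        f"Can someone other than {actor} pass as {actor}?",
        # Negate "reach": look for a nearby operation the check misses.
        f"Is there a smaller operation related to {operation} that the check does not cover?",
    ]
```

Each returned question is a separate item to go check against the code, which keeps one vague worry from hiding three distinct leads.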

I made this the AI’s first move on any target, instead of the vague “look for something that seems wrong.” Do not look for “a bug” in general. Find the assumption, find the check, and show they disagree.

Figure 1 — Where the bug lives. The bug is the gap between the rule the programmer assumed (green) and the check the code actually runs (blue) on a path you can reach.

Part 2: Read the Code Before You Test It

Noticing is useless without the patience to follow it the slow way. So the second habit is about order: read and understand before you reach for the fast, automatic tools. The whole method is a short loop the AI runs over and over, and it is worth seeing the shape of it before the details.

Figure 2 — The research loop. Seven steps, run over and over. The pivotal one is FALSIFY: trying the obvious attack first turns a blocked attempt into evidence that the indirect path is worth building.

The single most useful step in that loop is FALSIFY, and the most useful trick inside it is “try the obvious attack first” from earlier: a blocked direct attempt is evidence that the clever path is worth building. The rest of the loop is bookkeeping that keeps the AI from wandering: list the surface, throw away everything that does not cross a security boundary, draw who-trusts-whom, make one testable guess, try to break it cheaply, prove it for real, and only then step back and ask “what kind of bug was that, and where else does the same kind live?” A security boundary is a line where something less trusted can affect something more trusted, for example an ordinary user reaching code that runs with the highest power on a Windows machine, a level called SYSTEM that sits above even an administrator. When an ordinary, low-power account gains powers it was never meant to have, climbing toward that SYSTEM level, that is privilege escalation, the most serious kind of bug this project hunts.

Now, the part about reaching for automatic tools. There is an easy way to look for bugs and a hard way. The easy way is called fuzzing: send a program huge amounts of random, broken input and wait for it to fail. It is a real, valuable method and it finds real bugs. Its great strength is that it needs no understanding of the program at all — it is fast, and anyone can run it. Its matching limit is the same fact from the other side. Because it understands nothing, the cheap version of it mostly turns up shallow crashes — the kind that just stop the service from working (a denial of service) rather than hand an attacker more power. (A deeper crash can be far more dangerous, as we will see — but those are not what the cheap, understanding-free version tends to reach.) And it cannot find the quiet bug at all: the one where the program does exactly what it was written to do, and what it was written to do is wrong.
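
A toy example makes the limit concrete. The sketch below is an assumption-laden illustration, not a real fuzzer: `crashy`, `grant_access`, and `fuzz` are hypothetical, and an uncaught Python exception stands in for a crash. The same loop that finds the crash bug reports nothing for the quiet one, because the quiet one never fails.

```python
import random


def crashy(data: bytes) -> int:
    # Crash bug: fails on empty input or when the first byte is zero.
    return 100 // data[0]


def grant_access(data: bytes) -> bool:
    # Quiet bug: meant to accept only b"owner", but compares just the
    # first byte, so anything starting with "o" is accepted. Never raises.
    return data[:1] == b"o"


def fuzz(target, trials: int = 200, seed: int = 0) -> int:
    """Throw short random inputs at target and count the crashes."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(trials):
        length = rng.randrange(4)  # 0 to 3 bytes
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(data)
        except Exception:
            crashes += 1  # the only signal this loop understands
    return crashes
```

Here `fuzz(crashy)` reports crashes, while `fuzz(grant_access)` reports none even though `grant_access(b"other")` wrongly returns `True`. The loop has no way to know that a clean return was the wrong answer.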

So the rule is about sequence, not about superiority: understand the program first; reach for random input later. Read the code that makes the security decision. Write down what it is supposed to guarantee. Find where that guarantee fails. To keep the AI from skipping straight to the fast tool, I gave it a fixed order of where to look — from the quietest, highest-value bugs at the top down to crash bugs at the bottom, where the cheap, off-the-shelf kind of fuzzing finally fits best.

Figure 3 — Where to look, and in what order. The quiet, high-value bugs sit at the top, with no crash for an automatic tool to aim at; crash bugs sit at the bottom, where cheap, off-the-shelf fuzzing fits best. It is a trade-off in technique, not a ranking of worth. And every rung here is a bug class that already has a name — the work this project is really about sits off the top of the ladder: the class not on any ladder yet, and the chain that turns three low rungs into one attack no single rung describes.

Here is the bet, stated plainly: the easiest automation to build (throw input, wait for a crash) has the lowest ceiling on how serious its findings can be, and the highest-value work is the slow reading-and-reasoning that has no automatic signal at all. That is the work I chose to automate, precisely because it is the hard one.

To be fair to the field, the best automated finders today are not dumb crash loops. They pair an AI with a deterministic check (a crash, a sanitizer, a failing test) and can build up stateful, step-by-step interactions to reach deep and serious bugs. I am not competing with that; I am working on the part it is not built for. Every one of those systems still needs an automatic oracle, some impartial signal that says “this input was bad,” and the bugs this project is after have none.

And to be honest about the trade-off the other way: the quiet bugs are genuinely harder to find than crash bugs, and often one of a kind. Their advantages are narrow: no crash means no alarm, they slip past the protections Windows builds specifically against crash bugs, and they survive when the code is rewritten. They are not “easier to attack.” A memory bug often hands you a clean, reusable way to read and write the program’s memory at will, which is often the most powerful thing an attacker can get, since almost any other capability can be built from it. A logic bug does not hand you that.

Fuzzing is not junk, either: a deep fuzzer that carefully builds up a valid conversation with a program, step by step, to reach the well-guarded inner code of something privileged can absolutely reach serious bugs. Depth, not the tool, decides the payoff. The honest, narrow line is this: the cheap, off-the-shelf automation has a low ceiling, and the value is in automating the deep work above it, including the part with no oracle at all.

There is good reason to bet on that slow, careful work. One researcher spent the better part of two years carefully auditing one neglected part of Windows — the registry, a thirty-year-old component whose rigid file format largely defeats the random mutation that ordinary fuzzing relies on — and reported dozens of serious bugs there (Microsoft fixed them as more than forty separate security patches). He built a custom fuzzer too, and it did find bugs; but the format’s resistance to random input is exactly why so much of the work had to be reverse-engineering and reading the old kernel code by hand. (An honest warning: you mostly hear about the long effort that paid off, not the equally long efforts that found nothing. Careful depth can pay off greatly; it is not guaranteed to.)

Two ways of working, used together

“Understand the program first” raises a question: how do you build that understanding? The rule is that the AI must do two things at once, and treat doing only one as half a job. First, reason from how the program works — work out what must be true from the way the code behaves, because often the answer is written down nowhere. Second, search for what others already know — the official documentation, the technical specifications, other researchers’ write-ups — and where sources disagree, work out which one is correct.

You need both, because they fail in different ways. Reasoning with no searching invents a confident picture that may be wrong. Searching with no reasoning collects facts without understanding which one matters. So a claim is treated as solid only when both ways of working, done separately, reach the same answer and, where possible, the test machine agrees too. Searching has its own version of that rule, one level down: one source is an opinion; two sources that did not copy each other and still agree are real evidence. (The system is strict about that last point: two blog posts that both got their facts from the same original advisory count as one source, not two.)
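
The source-counting rule can be sketched in a few lines. This is a minimal illustration under my own assumptions: the `Source` record, its `origin` field, and the way `is_solid` combines the reasoning answer with a threshold of two independent origins are hypothetical, not the system's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str    # where the claim was read
    origin: str  # the original document its facts trace back to
    answer: str  # what the source says


def independent_agreement(sources: list[Source], answer: str) -> int:
    """Count distinct origins among the sources that give this answer."""
    return len({s.origin for s in sources if s.answer == answer})


def is_solid(reasoned_answer: str, sources: list[Source]) -> bool:
    """Solid only if reasoning agrees with at least two independent origins."""
    return independent_agreement(sources, reasoned_answer) >= 2
```

Two blog posts whose `origin` is the same advisory collapse into one entry in the set, so they count once. The hard part in practice is filling in `origin` honestly, which this sketch takes as given.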

Part 3: Self-Doubt — the Hardest Habit to Build

This is the heart of the project, and the hardest part to build, because the crash-based tools get it for free (the crash is their honest check) and a method with no crash has to build it from nothing.

Here is a real and uncomfortable fact about today’s AI models. Take a model that has reached the correct answer. Argue against it confidently, with wrong reasoning that sounds right, and it will often give ground and abandon the right answer. Researchers have measured this. When the wrong criticism came backed by confident, source-cited “evidence,” most leading models’ accuracy fell by more than half in a single round, with the most adversarial critics causing the steepest drops of all. (To be precise: that is a sharp fall in a model’s overall accuracy across many questions, not a coin-flip on every single answer — and the models built for careful, step-by-step reasoning were the clear exception, holding up far better, which is exactly why the design below leans on them.) Think about what that means, because it breaks the obvious plan.

The plan so far was: make the AI attack its own findings. But if a confident wrong argument can talk a model out of the truth — and the attacker is the same model — then careless self-criticism does not improve the work. It destroys the correct findings along with the wrong ones. So I need a review step that removes bad findings without destroying good ones. Here is the design.

The review. After the AI thinks it has found a bug, it reviews its own finding two or three times. Each time is a separate, hostile pass whose only goal is to destroy the finding. Splitting it into separate passes is on purpose: a mind checking its own work tends to defend it, so I force it to switch roles — “now your only job is to kill this finding.” The passes get stricter:

Check the logic, step by step, and throw out anything not backed by something concrete.
Try to disprove it on a real test machine, using a debugger (a tool that freezes a running program so you can inspect it step by step). Design the test that would show the finding is wrong, and keep “I ran it and saw this” strictly separate from “I worked it out in my head.”
Look for boring explanations. A passing test usually has a dull cause — a saved result, a side effect, your own setup mistake — before it has an exciting one. Rule those out first.

Then three rules stop the review from destroying good findings. Each one is aimed at the confident-but-wrong critic the research measured:

(1) Every criticism must point to real evidence — a specific value, something seen in the debugger, a source actually read. A criticism that only says “this seems wrong,” with nothing behind it, counts for nothing and cannot lower a finding’s standing. (2) When an evidence-backed criticism still disagrees with the finding, the test machine decides — not whichever side sounds more confident. (3) Use the strongest reasoning model for the hostile pass, because those resist a confident wrong argument best.

Figure 4 — Reviewing your own finding. Three hostile passes, then a verdict — fenced by three guardrails so a confident-but-wrong critique can raise a doubt but cannot quietly destroy a true finding.

Rule (2) is the one that matters most, and it works only because this kind of research has something the pure-thinking fields do not: a real machine to test on. When the reasoning and a criticism disagree, you do not pick the more confident voice — you run it on the machine and look at what actually happens. The machine has no pride and no debating skill. It just shows you the result. Later you will see the whole rig built around this idea; for now the rule is enough. Richard Feynman said it best more than fifty years ago, and it is the rule the whole project is built around:

The first principle is that you must not fool yourself—and you are the easiest person to fool.

This is not only a rule for the AI. While working on this project, I have caught myself doing exactly what the system is built to stop. I “corrected” a fact to the answer that felt obviously right, stated it with full confidence — and I was wrong. The truth was that two reliable sources genuinely disagreed, and my “correction” had quietly erased that disagreement. What caught me was not me being clever. It was the plain rule that does not care how confident I feel: go back to the original source, read it again, and trust it over your memory. That is why the rule exists. I would rather show you the method catching its own author than ask you to trust that it works.

One more guard belongs here, because the same caution is built into the machinery. When the AI runs its review as separate agents (more on that later), a critic that wants to kill a finding has to attach a concrete reason — a specific value, a line from the running program, a quoted source. A criticism with no reason attached cannot kill anything by itself; the code that collects the votes automatically downgrades a reasonless “this is wrong” to a mere “worth a second look.” Raising a doubt is always allowed. Overturning a result needs evidence.
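
A minimal sketch of such a vote collector follows. It is an illustration of the rules in this part, not the project's real code: `Vote`, `Critique`, `normalize`, and `collect` are hypothetical names, and the verdict strings are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Vote(Enum):
    UPHOLD = "uphold"
    SECOND_LOOK = "worth a second look"
    KILL = "kill"


@dataclass(frozen=True)
class Critique:
    vote: Vote
    evidence: Optional[str] = None  # a value, a debugger line, a quoted source


def normalize(c: Critique) -> Critique:
    """Downgrade a kill vote that has no evidence attached."""
    if c.vote is Vote.KILL and not (c.evidence and c.evidence.strip()):
        return Critique(Vote.SECOND_LOOK)
    return c


def collect(critiques: list[Critique], machine_confirms: Optional[bool]) -> str:
    """Combine the votes; an evidence-backed kill is settled by the machine."""
    votes = [normalize(c) for c in critiques]
    if any(v.vote is Vote.KILL for v in votes):
        if machine_confirms is None:
            return "run it on the test machine"
        return "finding stands" if machine_confirms else "finding killed"
    if any(v.vote is Vote.SECOND_LOOK for v in votes):
        return "finding stands, flagged for a second look"
    return "finding stands"
```

Note that no path in `collect` lets a critic's confidence decide the outcome. A kill vote either carries evidence and goes to the machine, or it becomes a flag.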

Part 4: Memory — a Record That Cannot Quietly Change

There is a second, quieter problem on long jobs: the AI’s own memory. An AI cannot keep everything it has learned in mind at once. As a job goes on, older notes are shortened to make room, and shortened again, and details can change in the process. The detail most likely to change is the one that matters most: how sure I am of a fact. Did I prove this, or only guess it? A guess from yesterday, after being shortened a few times, can come back today looking like a proven fact — and every later decision that trusts it is now built on a mistake.

So here is a rule I will defend: for an AI that writes its own long-term notes, the record of how it knows a fact must never be changed afterward. That is a safety rule, not just tidiness. The worst failure on a long job is not a wrong fact; it is a fact whose standing quietly gets upgraded from “guess” to “proven.” I make that impossible. The AI keeps a simple log — one short entry per fact — that records not just what it believes but how it knows it: guessed, read somewhere, or proven on the machine. The “how it knows” line can never be edited. You cannot turn a guess into a proof. If a guess is later proven, you do not change the old entry; you add a new one that points back to it.

Figure 5 — A record that cannot quietly change. How a fact is known is frozen the moment it is written. A guess is never edited into a proof; promotion means appending a new, linked entry.

one entry in the log (example)

claim:       this service decides who you are from a name in the request, not a real check
how I know:  PROVEN ON THE MACHINE   (guessed / read / proven)
what I saw:  asked for another account's data by name → the service handed it over
confidence:  proven

Rule: the "how I know" line can NEVER be edited.
A guess is never changed to read "proven."
If a guess is later proven, add a NEW entry that points back to this one.

(This entry is made up, to show the format. It is not a real finding from this project.)

This looks like paperwork. It is actually a firm stance on one of the genuinely unsolved problems in autonomous agents. An agent that writes its own long-term memory can slowly launder its own guesses into facts — quietly dressing up a guess as something proven — and researchers studying long-running agents have named the exact failures: an error that gains authority simply because it was once written down as a confident “lesson,” and old notes that get trusted and copied without re-checking. My answer is a discipline borrowed from systems that must never lose track of where a fact came from: the provenance of every fact — guessed, read, or proven on the machine — is append-only and can never be rewritten. A guess simply cannot become a proof — the system makes that impossible; promotion happens only by appending a new, linked entry, leaving the original guess visible forever. Keeping that one line frozen is the whole defense.
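
The shape of such a log can be sketched as follows. This is a simplified illustration and not the project's implementation: `HowKnown`, `Entry`, and `ProvenanceLog` are hypothetical names, and a frozen dataclass in Python only discourages edits rather than making them truly impossible, so a real system would need stronger storage guarantees.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HowKnown(Enum):
    GUESSED = "guessed"
    READ = "read"
    PROVEN = "proven on the machine"


@dataclass(frozen=True)  # fields cannot be reassigned after creation
class Entry:
    entry_id: int
    claim: str
    how_known: HowKnown
    what_i_saw: str
    supersedes: Optional[int] = None  # link back to an earlier entry


class ProvenanceLog:
    """Append-only: there is no method to edit or delete an entry."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def append(self, claim: str, how_known: HowKnown, what_i_saw: str,
               supersedes: Optional[int] = None) -> Entry:
        if supersedes is not None and not 0 <= supersedes < len(self._entries):
            raise ValueError("supersedes must point at an existing entry")
        entry = Entry(len(self._entries), claim, how_known, what_i_saw, supersedes)
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[Entry, ...]:
        return tuple(self._entries)  # a copy, so callers cannot reorder the log
```

Promoting a guess means calling `append` again with `HowKnown.PROVEN` and `supersedes` set to the guess's `entry_id`. The original entry stays in the log, still marked as a guess.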

The same rule blocks a related mistake that has a name in research: HARKing, short for “Hypothesizing After the Results are Known,” which means claiming you predicted something when you only came up with the explanation after you already knew the answer. It is tempting because it makes a result look stronger: a result you truly predicted and then confirmed is strong evidence, while the same result found by accident and explained afterward is weak, because you can invent a clever explanation for almost anything once you know how it ends. So anything found after seeing the result is recorded honestly as “seen first, explained after,” never rewritten as a prediction made in advance.
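
One way to make that label mechanical is to compare positions in the log. The sketch below assumes, hypothetically, that every prediction and every observation gets an increasing sequence number when it is written; the function name and label strings are illustrative only.

```python
from typing import Optional


def label_result(prediction_seq: Optional[int], observation_seq: int) -> str:
    """Label a result by whether its prediction was logged before it was seen."""
    if prediction_seq is not None and prediction_seq < observation_seq:
        return "predicted, then confirmed"
    # No prediction on record, or one written after the observation.
    return "seen first, explained after"
```

Because the label depends only on the order of writes, a good story told afterward cannot upgrade it.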

And one more practical benefit, which is what makes bold exploration affordable. When the AI gives up on a dead end, it writes down the exact thing that killed it — the precise error code, the security setting it measured, the access check that came back “denied.” Filed that way, a dead end is not a loss; it is a saved fact. The AI can afford to try wild, speculative targets and abandon them fast, because nothing is ever truly thrown away — and months later, when a similar idea comes up on a completely different program, it recognizes the same kind of dead end and does not waste a day re-deriving it.

Part 5: Coming Up With Ideas on Purpose

This is what the whole project is for, and it is the part almost nobody else is building. Everything up to here helps the AI reject a bad idea or prove a real one, but rejecting and proving can only ever reach what you already thought to look for. The goal here is the opposite: to manufacture an idea no one has had, whether a kind of bug nobody has named or several small weaknesses stitched into an attack nobody has described. A standard AI bug-finder has no step for this at all; it mutates inputs and waits for the oracle to fire. Inventing a shape that has no name has no such loop, so I had to build one: a deliberate invention engine that pulls structured idea-generation methods out of several mature fields (engineering design theory, analogical reasoning, the study of scientific contradiction) and runs them as a coordinated battery against the undocumented mechanism the AI has just reverse-engineered. None of these is a brainstorming prompt; each is a procedure with a defined input (the reconstructed mechanism) and a defined output (a concrete, testable conjecture about an attack that may not exist yet). Here are the main ones.

1. List the choices, then look at the gaps. Take a mechanism and write down each separate decision it makes, as a grid. For “how does this service decide who you are?”, the columns could be: does it check your real identity, a name you typed, or nothing? The rows could be: does it check every time, or once and then trust you? Now fill in the boxes with the bug patterns people already know. The boxes that are still empty, but that an attacker can reach, are predictions: “no one has found a bug of this shape here yet — go look.” This turns “I hope I notice something” into “here are the exact gaps to check.” The example grid has two questions — three columns and two rows, six boxes in all — which a person could eyeball. The real ones do not stay that small: a single trust mechanism has many independent design decisions — who is allowed to register, what identity is verified, when the check runs, whose authority the action ultimately uses — and their combinations run well past what anyone holds in their head at once. The machine fills the whole cross-product, marks every cell that already has a known bug pattern, and hands back the reachable-but-empty cells as a ranked list. The unexplored gap stops being something you hope to notice and becomes something the grid enumerates.
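
The small example grid can be enumerated directly. The sketch below is illustrative only: the axis values come from the example above, but which cells count as known or reachable is invented here, and the ranking step is left out because the text does not say how cells are ranked.

```python
from itertools import product

# The two questions from the example, as axes of the grid.
AXES = {
    "what is checked": ["real identity", "typed name", "nothing"],
    "when it is checked": ["every time", "once, then trusted"],
}

# Hypothetical: cells already covered by the safe design or a known bug pattern.
KNOWN = {
    ("real identity", "every time"),
    ("typed name", "every time"),
    ("nothing", "every time"),
}

# Hypothetical: cells an attacker can actually reach in this mechanism.
REACHABLE = {
    ("real identity", "once, then trusted"),
    ("typed name", "once, then trusted"),
}


def empty_reachable_cells(axes, known, reachable):
    """Enumerate the full cross-product and keep reachable cells with no known pattern."""
    return [cell for cell in product(*axes.values())
            if cell not in known and cell in reachable]
```

With these inputs the function returns the two "once, then trusted" cells for "real identity" and "typed name". Adding a third or fourth axis only means adding a key to `AXES`; `product` handles the growth that a person cannot hold in their head.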

Figure 6 — The empty box is the prediction. Lay out a mechanism’s choices as a grid and fill in the safe design and the known bugs; an empty cell an attacker can still reach is a place no one has looked yet. (The grid shown is a deliberately small example; a real mechanism’s grid spans many more axes than a person can hold in their head, which is exactly why a machine that fills the whole cross-product beats eyeballing it.)
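
As a rough illustration of the grid procedure, here is a minimal Python sketch. The axis names, the known-pattern labels, and the `reachable` predicate are all invented for the example, and the ranking step is left out; it only shows the cross-product being enumerated and the reachable-but-empty cells falling out.

```python
from itertools import product

# Hypothetical axes for "how does this service decide who you are?"
AXES = {
    "identity_checked": ["real_identity", "typed_name", "nothing"],
    "check_frequency": ["every_time", "once_then_trust"],
}

# Cells already covered by a named bug pattern (labels are made up).
KNOWN = {
    ("typed_name", "every_time"): "spoofable caller name",
    ("nothing", "every_time"): "missing authentication",
    ("nothing", "once_then_trust"): "missing authentication",
}


def reachable(cell):
    # Assumption for the sketch: only the safe design (real identity,
    # checked every time) gives an attacker nothing to reach.
    return not (
        cell["identity_checked"] == "real_identity"
        and cell["check_frequency"] == "every_time"
    )


def empty_reachable_cells(axes, known_patterns, is_reachable):
    """Enumerate the full cross-product and keep cells with no known pattern."""
    names = list(axes)
    gaps = []
    for combo in product(*(axes[name] for name in names)):
        if combo in known_patterns:
            continue
        if is_reachable(dict(zip(names, combo))):
            gaps.append(combo)
    return gaps


gaps = empty_reachable_cells(AXES, KNOWN, reachable)
```

With two axes this is trivial; the point of doing it mechanically is that the same loop works unchanged when there are eight axes instead of two.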

2. Borrow a rule from a field that already solved it. Take a rule that some well-understood field treats as essential, and ask directly whether this Windows mechanism follows the same rule. For example: when your web browser connects to a website, it checks the website’s identity (using a certificate), so a fake website cannot quietly take its place. So ask the same question of a Windows mechanism: when something replies to a request, does Windows check the identity of whatever replied, or does it just trust that the right thing replied? Published research by others asked exactly this, and the answer was: it does not check. If the real service is not running, an attacker can stand up a fake one on the same channel and Windows will believe it, because nothing verifies that the thing answering is the thing that is supposed to. That missing check is the start of a real bug. Notice where the idea came from: not from a clue in the code, but from a rule that a completely different field treats as essential. The system keeps a long list of such “borrowed rules” (from distributed systems, web security, cryptography, even hardware attacks) and walks them one by one against whatever it is looking at. (The example here is from others’ published research; what matters is the move, which the system runs autonomously against mechanisms nobody has mapped.)
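
A minimal sketch of what walking such a catalogue might look like. The three rules, the property names, and the shape of the `mechanism` description are assumptions made up for illustration, not the system's real catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BorrowedRule:
    field: str          # where the rule comes from
    rule: str           # the rule that field treats as essential
    property_name: str  # hypothetical key in the mechanism description


CATALOGUE = [
    BorrowedRule("web security",
                 "the client verifies the identity of whoever answers",
                 "responder_identity_verified"),
    BorrowedRule("distributed systems",
                 "a retried request is applied at most once",
                 "requests_idempotent"),
    BorrowedRule("cryptography",
                 "a value is authenticated before it is acted on",
                 "input_authenticated"),
]


def conjectures(mechanism, catalogue):
    """Return one testable conjecture per rule the mechanism is not known to follow.

    `mechanism` maps property names to True (follows the rule), False
    (observed not to), or is missing the key entirely (nobody has checked).
    """
    out = []
    for rule in catalogue:
        status = mechanism.get(rule.property_name)
        if status is True:
            continue
        kind = "violated" if status is False else "unverified"
        out.append((rule.field, rule.rule, kind))
    return out


# Example: what reading the code has established so far about one mechanism.
observed = {"responder_identity_verified": False, "requests_idempotent": True}
leads = conjectures(observed, CATALOGUE)
```

An "unverified" entry is a prompt to go and read; a "violated" entry is a conjecture to test on the running machine.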

3. Look for contradictions. A contradiction is two things that should agree but do not, and each one is a concrete thing to test. Some kinds to look for: the manual (the official description of how a part should behave) says one thing, but the code does another; two programs that are supposed to follow the same rule handle the same input differently; one part of a system assumes something that another part never actually guarantees; or a value is checked for safety once and then used later as if it cannot change (and if an attacker can change it in between, the check meant nothing). For each contradiction you find, you make the two sides disagree on the test machine and watch what happens. Most of the time nothing breaks, and you have ruled something out — that is still progress. Sometimes it breaks, and that is a real lead.
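
The last kind of contradiction (checked once, used later) is easy to show in miniature. The sketch below is a toy model, not Windows code: `Request`, `is_safe`, and both handlers are hypothetical, and the "attacker" is just a callback that runs between the check and the use.

```python
class Request:
    def __init__(self, path):
        self.path = path


def is_safe(path):
    # Toy policy for the sketch: only paths under /public/ are allowed.
    return path.startswith("/public/")


def handle_check_then_use(request, between_check_and_use=None):
    """Flawed pattern: checks the value, then reads it again when using it."""
    if not is_safe(request.path):
        return "rejected"
    if between_check_and_use is not None:
        between_check_and_use(request)  # window in which the value can change
    return f"opened {request.path}"


def handle_snapshot(request, between_check_and_use=None):
    """Safer pattern: copy the value once, then check and use the same copy."""
    path = request.path
    if not is_safe(path):
        return "rejected"
    if between_check_and_use is not None:
        between_check_and_use(request)
    return f"opened {path}"


def swap_path(request):
    request.path = "/secret/key"


flawed = handle_check_then_use(Request("/public/readme"), swap_path)
fixed = handle_snapshot(Request("/public/readme"), swap_path)
```

Making "the two sides disagree" here means forcing the checked value and the used value apart and seeing which handler notices.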

There is one failure mode that quietly defeats all of these idea-makers, and it is worth naming because it is specific to an AI. When you ask one mind to “think of more ideas,” it tends to produce the same idea reworded — over and over — and turning up the model’s “creativity” setting (its temperature) barely helps: a study of LLM creativity found it raises novelty only weakly, while more clearly making the output less coherent. So “generate more” silently stops meaning “explore more.” The system treats this as a real hazard, not a nuisance: a round of idea-making only counts if it changes the source of the ideas (a different technique, a different borrowed field), it judges two ideas to be “the same” by their underlying mechanism rather than their wording, and it stops and switches sources the moment the last few ideas are just paraphrases. You will see this same idea again in the section on how a run is organized, because it shaped the whole design.
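
The stop-and-switch rule can be sketched in a few lines, assuming each idea has already been reduced to a mechanism key by some separate step (how that key is assigned is the hard part and is not shown). The window of three is an arbitrary choice for the example.

```python
def novelty_flags(mechanism_keys):
    """For each idea in order, True if its mechanism has not been seen before."""
    seen = set()
    flags = []
    for key in mechanism_keys:
        flags.append(key not in seen)
        seen.add(key)
    return flags


def should_switch_source(mechanism_keys, window=3):
    """True when none of the last `window` ideas introduced a new mechanism."""
    flags = novelty_flags(mechanism_keys)
    return len(flags) >= window and not any(flags[-window:])


# Five ideas, differently worded, but only two underlying mechanisms.
stale = should_switch_source(
    ["check-then-use", "missing-auth", "check-then-use", "check-then-use", "missing-auth"]
)
# Still exploring: one of the last three ideas was genuinely new.
fresh = should_switch_source(["check-then-use", "missing-auth", "check-then-use"])
```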

Composing footholds into a technique that exists in none of its parts

The first three techniques invent a new single weakness. This last one invents a new technique — and it is the part I consider the real prize, because it is where an AI stops finding bugs and starts designing attacks. A single weakness is often worth little on its own: a way to write one file, read one value, or nudge a privileged program into acting one instant too early. Each such small, reliable capability is what security people call a primitive. The dangerous thing is the composition. The AI treats every small capability it finds as a building block with a shape — what it needs in order to fire, and what it hands back — and hunts for the next block whose input fits the last block’s output. The severe result lives in none of the individual bugs; it exists only in the wiring. And this is the kind of move an oracle-driven approach has no way to reach for: there is no crash to catch and no signature to match, because the danger is created by the composition — three findings, each one alone too small to trip any automatic check, joined into one technique that none of them is. That is where “find a bug” becomes “design an attack” — and designing an attack nobody has described is the part of this work I am least willing to call solved and most convinced is the right thing to chase.
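
One way to picture the "block with a shape" idea is as a search over capability sets. The sketch below is an abstract model under my own assumptions: the primitive names and capability labels are invented, and a breadth-first search stands in for whatever search the real system uses.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str
    needs: frozenset  # capabilities required before it can fire
    gives: frozenset  # capabilities it hands back


def find_chain(primitives, start, goal):
    """Breadth-first search for the shortest chain of primitives reaching `goal`."""
    start = frozenset(start)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        caps, path = queue.popleft()
        if goal in caps:
            return path
        for prim in primitives:
            if prim.needs <= caps and not prim.gives <= caps:
                nxt = caps | prim.gives
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, path + [prim.name]))
    return None


# Three hypothetical findings, none severe alone.
PRIMS = [
    Primitive("plant_file", frozenset({"user_session"}), frozenset({"file_in_watched_dir"})),
    Primitive("trigger_early", frozenset({"user_session"}), frozenset({"service_acts_early"})),
    Primitive("load_planted_file",
              frozenset({"file_in_watched_dir", "service_acts_early"}),
              frozenset({"code_as_service"})),
]

chain = find_chain(PRIMS, {"user_session"}, "code_as_service")
no_chain = find_chain(PRIMS[:2], {"user_session"}, "code_as_service")
```

The search itself is the easy part; the value is in recording each finding's `needs` and `gives` precisely enough that the wiring can be found at all.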

Part 6: The Machine It Runs On

Up to here I have described a way of thinking. The rest of the post is about the machine that lets an AI actually do it — unattended, for hours — because the habits above only mean something if the AI can really read the code, really run the experiment, and really keep the record. Here is the whole rig at a glance.

Figure 7 — The research rig. A thin orchestrating AI hands the heavy work to a swarm of helpers, which reach three tools through a common link: a disassembler to read the code, a real Windows VM to run it and watch, and the durable record to remember honestly.

The AI sits in the middle, but it deliberately stays thin: it plans and decides, and then hands the heavy lifting off to a swarm of short-lived helper programs (more on those in the next part). Out from there run three main connections — three “hands” — each reaching a different piece of real software. The connections use a common standard (called MCP) that simply lets an AI call a tool as if it were a function; the interesting part is not the wiring but what comes back through it.

The first hand reads the code. Windows ships as compiled programs with no source code attached. A kind of tool called a disassembler — or, more precisely, a decompiler — takes one of those compiled files and reconstructs something a human can read: a rough, C-like version of the original code. (Strictly, a disassembler gives you the raw machine instructions; the decompiler step lifts those into readable pseudo-code.) The AI drives this tool directly: it can turn a function into readable pseudo-code, follow every place a function is called from, and rename things as it figures them out, all without anyone clicking in a window.

But displaying pseudo-code is the easy part — any tool does that. The hard part, and the one thing a crash-waiting fuzzer fundamentally cannot do, is what the AI does with it: reconstruct the undocumented shape of Windows from the bytes up. It recovers the hidden data structures a privileged service passes around, rebuilds the table that routes an incoming request to the function that handles it, traces a value the attacker controls all the way to the dangerous operation it might reach, and works out the rule the original programmer silently relied on — a rule written down nowhere, in code Microsoft never published. There is no crash to wait for and nothing to read until you have reconstructed the meaning of the bytes yourself; until very recently, this was work only an experienced human reverse-engineer could do at all. (Getting the tool to launch reliably on its own every time was its own small saga — an off-the-shelf timing bug I had to paper over with a patient heartbeat — but that is plumbing; the reconstruction is the point.)

The second hand runs it and watches. This is the part that turns a guess into a fact. It is a full Windows 11 machine, kept separate (a “virtual machine,” or VM) so the work happens on a disposable, resettable copy of Windows rather than a real one — if an experiment breaks it, you just roll it back. Through this one connection the AI can power the machine on and off, run commands inside it, attach a debugger (a tool that freezes a running program so you can inspect it one step at a time), intercept individual running functions on the fly to watch the exact values going in and coming out (this is called hooking), and even record a stretch of the program’s execution while it handles a request, then replay that recording afterward as many times as needed — so the AI can ask, after the fact, “exactly which piece of code ran for this request, and with what inputs?” Two design choices on this machine are worth calling out because they are easy to overlook and they matter:

It can return to a clean state instantly. The machine keeps a maintained “known-good” snapshot. The AI saves a checkpoint before anything especially invasive, and after a crash or a mess it just rolls the whole machine back to that clean baseline. So an experiment that breaks the machine costs seconds, not an afternoon — which is what makes bold experiments affordable.
It has four genuine ordinary-user accounts, not just an administrator. A privilege-escalation bug — one that lets an ordinary account gain powers it was never meant to have, up toward full control of the machine — is only believable if you prove it starting from a real low-power account. So the AI logs in as a true standard user and attacks from there. Having four such accounts also lets it test a subtler thing: whether one ordinary user can reach another ordinary user’s private data — by acting as user A and reaching for user B’s. Two of this project’s confirmed findings turned on exactly that.

There is a small but important habit that lives on this machine: instead of trying to reason about Windows’ permission settings by reading them, the AI asks the machine directly. It logs in as the unprivileged user and runs the real Windows permission check on a target — and whatever comes back “allowed” maps out the attack surface actually within reach from that account: the set of things this user can even touch, which is where any attack has to start. (Being able to reach something is not the same as being able to abuse it — but you cannot abuse what you cannot even reach, so this is the honest starting map.) It is the difference between arguing about a lock from a photo and just trying the key.

The third hand remembers honestly. This is the durable record from Part 4 — an append-only journal and a one-line-per-fact ledger, kept on disk, where every fact carries its “how I know” tag and that tag is never edited. It sits beside the other two hands deliberately: the moment the running machine shows something, the result is written down with the right standing, so a later step cannot quietly upgrade it.
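
A minimal sketch of an append-only, one-line-per-fact ledger, assuming a JSON Lines file. The tag names and the `Ledger` class are illustrative, not the project's actual format. The property that matters is that a fact's standing changes only by appending a new line, never by rewriting an old one.

```python
import json
import os
import tempfile

# Hypothetical "how I know" tags, weakest to strongest.
ALLOWED_TAGS = ("guess", "inferred", "read_in_code", "proven_on_vm")


class Ledger:
    def __init__(self, path):
        self.path = path

    def record(self, fact, how_known):
        """Append one line; existing lines are never opened for writing."""
        if how_known not in ALLOWED_TAGS:
            raise ValueError(f"unknown tag: {how_known}")
        entry = {"fact": fact, "how_known": how_known}
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
        return entry

    def entries(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def standing(self, fact):
        """Most recent tag recorded for a fact, or None if never recorded."""
        tags = [e["how_known"] for e in self.entries() if e["fact"] == fact]
        return tags[-1] if tags else None


with tempfile.TemporaryDirectory() as tmp:
    ledger = Ledger(os.path.join(tmp, "ledger.jsonl"))
    ledger.record("service X skips the caller check", "inferred")
    ledger.record("service X skips the caller check", "proven_on_vm")
    history = ledger.entries()
    standing = ledger.standing("service X skips the caller check")
```

Because the earlier "inferred" line survives, a reader can always see that the fact was once only a guess and when it was upgraded.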

The throughline across all three hands is one sentence: the AI reasons and searches the web, but the bytes in the disassembler and the behaviour on the running machine are the final word. When the running machine contradicts what the AI worked out by reading, the machine wins, and the AI goes back to re-read what it misunderstood. (There is a fourth, rarely-used hand — plain control of the desktop, clicking and typing like a person — kept only as a fallback for the odd thing the precise tools cannot reach.)

Around these three hands sits infrastructure that elite human teams build by hand and that few automate. One keeps a continuously-rebuilt, searchable map of every service on the machine, compared against last month’s, so the day Microsoft ships a change the AI sees the newly-exposed surface. Another takes a Windows security patch and lines up the before-and-after versions of a program to reveal what was actually fixed — and, just as often, what was fixed incompletely. Others trace how attacker-controlled input could flow to a dangerous operation. These see the surface; they do not decide what is real — only the running machine does that. In candor, several are built and reasoned-through but not yet battle-tested, and they are deliberately support for the real work, not the work itself.
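
The month-over-month comparison reduces to a set difference once each snapshot is in a comparable form. A minimal sketch, assuming a snapshot is a mapping from service name to the endpoints it exposes; the service and endpoint names are placeholders.

```python
def newly_exposed(previous, current):
    """Return endpoints present in `current` but absent from `previous`, per service."""
    added = {}
    for service, endpoints in current.items():
        new = set(endpoints) - set(previous.get(service, ()))
        if new:
            added[service] = sorted(new)
    return added


last_month = {"svc_alpha": {"pipe/a", "rpc/1"}, "svc_beta": {"rpc/7"}}
this_month = {
    "svc_alpha": {"pipe/a", "rpc/1", "rpc/2"},  # gained one endpoint
    "svc_beta": {"rpc/7"},                      # unchanged
    "svc_gamma": {"pipe/g"},                    # entirely new service
}

surface_delta = newly_exposed(last_month, this_month)
```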

Part 7: How a Run Is Organized

A single AI carrying a research project from start to finish in one long conversation does not work well: it slows down and gets less reliable as the conversation and the data pile up. So a run is built differently — as a short assembly line of stages, where each stage spins up a fresh swarm of small, single-purpose helper programs, collects their results, and hands a short summary to the next stage.

Figure 8 — How one run is organized. Five scripted stages, each fanning out many short-lived agents; inside a stage, independent minds are merged and de-duplicated, and an independent juror panel — not the agent that found it — votes a finding through.

There are five stages. Select picks a target worth attacking and checks it is genuinely under-explored. Analyze opens the target in the disassembler and has many helpers read its inner workings, looking for a flaw and judging whether any flaw is a genuinely new kind of bug or just another example of a known one. Chain is where a new attack technique gets assembled: it takes a weak finding and tries hard to compose it with other footholds into something severe — full control of the machine, say — because a weak result is never allowed to be filed as “weak” until this stage has exhaustively tried to wire it into something bigger. (A finding already severe on its own skips this stage and goes straight to Prove; and how reliably the AI pulls off the genuinely novel chain is the open question I return to at the end.) Prove is where “it should work” becomes “I ran it and saw this,” on the real VM, starting from an ordinary user account. Report writes the finding up. A top-level autopilot can run the whole line by itself, and — this is the important part — when a target turns out to be a dead end, it does not stall: it files the dead end with its exact cause and pivots to a fresh target. It keeps going across targets until something is proven or it hits a deliberate limit.
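
The control flow of that autopilot can be sketched as a short loop. Everything here is a simplification under my own naming: the stage callables, the `DeadEnd` exception, and `max_targets` as the "deliberate limit" are assumptions for illustration.

```python
class DeadEnd(Exception):
    """Raised by any stage, carrying the exact cause of the dead end."""


def autopilot(targets, stages, max_targets=5):
    """Run the five stages per target; on a dead end, record the cause and pivot."""
    dead_ends = []
    for target in targets[:max_targets]:
        try:
            stages["select"](target)
            finding = stages["analyze"](target)
            if not finding.get("severe"):
                # A weak finding must go through Chain before it can be filed.
                finding = stages["chain"](finding)
            evidence = stages["prove"](finding)
            return stages["report"](finding, evidence), dead_ends
        except DeadEnd as exc:
            dead_ends.append((target, str(exc)))
    return None, dead_ends


def _select(target):
    if target == "svc_a":
        raise DeadEnd("already well studied")


# Stub stages so the sketch runs end to end.
STAGES = {
    "select": _select,
    "analyze": lambda target: {"target": target, "severe": False},
    "chain": lambda finding: {**finding, "chained": True},
    "prove": lambda finding: "observed on VM as standard user",
    "report": lambda finding, evidence: {"finding": finding, "evidence": evidence},
}

result, dead_ends = autopilot(["svc_a", "svc_b"], STAGES)
```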

Three design choices inside this line are worth explaining, because they are where the real engineering went.

Independent minds, then a merge. No stage that generates ideas asks a single helper for a list. Instead it launches several helpers in parallel, each started from a different angle (one reasons from the machine’s measured behavior, one fills in the grid of design choices from Part 5, one walks the borrowed-rules-from-other-fields list, one looks at how the component has been changing over time), and they do not talk to each other until their results are merged at the end. The reason is the diversity problem from Part 5: a single mind asked for “more ideas” collapses into near-duplicates. Separate minds started from separate places do not.

Figure 9 — Why many small minds beat one big brainstorm. One mind asked for more ideas collapses into near-duplicates (in one published large-scale study, only about 5% of the ideas are genuinely distinct); separate workers seeded from different starting points, then merged, actually widen the search.

This is not a hunch; I ran a controlled head-to-head against a single large brainstorm and, crucially, counted ideas by their underlying mechanism, not their wording, so reworded duplicates could not inflate the result. By that measure the independent-minds setup turned up more genuinely distinct ideas. I will state the limits plainly: the test was small, and the independent helpers’ merged output still needed heavy de-duplication before the count meant anything, so I trust the direction far more than the exact magnitude. The lesson I am fully confident in is the measurement discipline itself: count by mechanism, or you will mistake noise for progress.
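
To make "count by mechanism" concrete, here is a toy comparison of the two ways of counting. The ideas and their mechanism labels are invented, and assigning the label is assumed to happen upstream.

```python
def distinct_counts(ideas):
    """Count distinct ideas two ways: by exact wording and by mechanism label."""
    by_wording = len({idea["text"].strip().lower() for idea in ideas})
    by_mechanism = len({idea["mechanism"] for idea in ideas})
    return by_wording, by_mechanism


def merge(*helper_outputs):
    """Concatenate the independent helpers' lists before de-duplication."""
    merged = []
    for output in helper_outputs:
        merged.extend(output)
    return merged


helper_a = [
    {"text": "Swap the path after validation", "mechanism": "check-then-use"},
    {"text": "Change the file between check and open", "mechanism": "check-then-use"},
]
helper_b = [
    {"text": "Answer on the channel before the real service", "mechanism": "unverified-responder"},
    {"text": "Race the validation with a rename", "mechanism": "check-then-use"},
]

wording_count, mechanism_count = distinct_counts(merge(helper_a, helper_b))
```

Four differently worded ideas, two mechanisms: the first number flatters the run and the second describes it.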

A jury, not the author, decides. The decisions that cost the most — “is this lead really unexplored?”, “is this a new kind of bug or an old one?”, “was this actually proven?” — are never made by the helper that did the work. They go to a small panel of independent judges that each vote, and a deliberately uneven rule combines the votes — each question gets the threshold that minimizes its more expensive mistake. For “was this proven?” the panel is strict toward not believing it: one judge raising a solid, evidence-backed objection is enough to block a “proven” stamp, because a dropped lead is cheap and a false “proven” is a disaster. For “is this unexplored?” the panel leans the other way: a lone “this is already known” vote does not kill a fresh lead outright; it just flags it for another look — because killing a good lead by mistake is the costlier error there. And the same evidence rule from Part 3 is wired in as code: a judge’s vote to kill a finding is automatically downgraded if it carries no concrete reason.
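
A minimal sketch of the two uneven rules. Two details are my assumptions rather than the system's actual behaviour: an objection with no stated reason is "downgraded" here by simply not counting toward a block, and a lead is treated as already known only when a majority of judges say so with a reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    objects: bool       # True = this judge votes against the finding or lead
    evidence: str = ""  # the concrete reason, if any


def backed_objections(votes):
    """Objections that carry a concrete reason; bare objections are not counted."""
    return sum(1 for v in votes if v.objects and v.evidence.strip())


def decide_proven(votes):
    """Strict: a single evidence-backed objection blocks the 'proven' stamp."""
    return "blocked" if backed_objections(votes) >= 1 else "proven"


def decide_unexplored(votes):
    """Lenient: a lone 'already known' vote only flags the lead for another look."""
    known = backed_objections(votes)
    if known * 2 > len(votes):  # assumed threshold: majority
        return "known"
    return "recheck" if known >= 1 else "unexplored"


blocked = decide_proven([
    Vote(False), Vote(False),
    Vote(True, "proof ran from an administrator session, not a standard user"),
])
stamped = decide_proven([Vote(False), Vote(True, "")])
flagged = decide_unexplored([Vote(True, "described in a public talk"), Vote(False), Vote(False)])
```

The two functions share a vote format but not a threshold, which is the whole design: each question gets the rule that guards against its costlier mistake.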

Serial where it must be, parallel everywhere else. Two things in the rig cannot be shared: the disassembler works on one program at a time, and there is only one test machine. So any step that touches them is run strictly one-at-a-time; only the work that does not touch those two shared tools — reading the already-extracted code, searching the web, reasoning about a result — is fanned out in parallel. Getting this boundary right is most of what keeps a run from tripping over itself.
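
The boundary can be sketched with `asyncio`: one lock per shared tool, and no lock for everything else. The step names and `fake_work` are placeholders; the `peak` counter exists only so the sketch can show that the shared tools never ran two steps at once.

```python
import asyncio

SHARED_TOOLS = ("disassembler", "vm")


async def run_all(steps):
    """Run steps concurrently, serializing only those that touch a shared tool.

    Each step is (name, resource, work) where resource is a shared tool
    name or None, and work is an async callable.
    """
    locks = {tool: asyncio.Lock() for tool in SHARED_TOOLS}
    active = {tool: 0 for tool in SHARED_TOOLS}
    peak = {tool: 0 for tool in SHARED_TOOLS}
    results = {}

    async def run_step(name, resource, work):
        if resource is None:
            results[name] = await work()  # free to overlap with anything
            return
        async with locks[resource]:       # strictly one at a time per tool
            active[resource] += 1
            peak[resource] = max(peak[resource], active[resource])
            try:
                results[name] = await work()
            finally:
                active[resource] -= 1

    await asyncio.gather(*(run_step(*step) for step in steps))
    return results, peak


async def fake_work():
    await asyncio.sleep(0.01)
    return "done"


STEPS = [
    ("decompile_a", "disassembler", fake_work),
    ("decompile_b", "disassembler", fake_work),
    ("run_experiment", "vm", fake_work),
    ("web_search", None, fake_work),
    ("summarize", None, fake_work),
]

results, peak = asyncio.run(run_all(STEPS))
```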

Part 8: Keeping a Long Autonomous Run Alive

This is the part almost every story about AI agents skips — and the part I am most pleased with, because it is the difference between a demo and a system. An unattended run that lasts hours and launches a hundred-plus helpers does not usually fail with a clean error. It fails silently — it quietly carries on with garbage, or it freezes — and you only notice hours later that the whole run was wasted. Almost every piece of engineering below exists to turn a silent failure into a loud, recoverable one.

Figure 10 — Turning silent failures into loud, resumable ones. The real danger in a long unattended run is not a crash but quietly carrying on with garbage; each measure converts a silent failure into one the system halts on and can resume from.

The most common silent failure is a usage-limit storm. When too many helpers ask the AI service for work at once, they get throttled and come back empty. The naive thing — “carry on with whatever survived” — is exactly the trap: a panel of twelve judges that quietly became a panel of two still returns a verdict, and you would never know it was decided on scraps. So the firm rule is: if helpers came back empty, the run pauses instead of advancing, and a paused run is resumable — when it picks back up, the work that already finished is reused, and only the missing pieces are redone. There is an extra guard that turned out to matter: the moment a whole batch comes back empty, the run stops before launching the next batch, rather than feeding fresh helpers into a storm that is already raging.

That storm is also why helpers are launched in small batches, never more than about ten at a time, rather than all together. I learned this the embarrassing way: an early run fired off around fifty helpers in one burst, hit the usage limit, and came back completely empty. The whole run was wasted. The fix is a firm cap on batch size, one polite retry for stragglers, and never running two big jobs at the same time, because they share the same budget.
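
Put together, the batching, the pause, and the resume might look like the sketch below. It is a simplified model: helpers are called one after another inside a batch rather than concurrently, a throttled helper is represented by a `None` return, and `RunPaused`, `run_in_batches`, and `fake_helper` are names invented for the example.

```python
class RunPaused(Exception):
    """Raised when helpers came back empty; the run can be resumed later."""


def run_in_batches(tasks, call_helper, done, batch_size=10):
    """Fill `done` (task -> result) in capped batches, pausing on empty results.

    Results already in `done` from an earlier attempt are reused, not redone.
    """
    pending = [task for task in tasks if task not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        empty = []
        for task in batch:
            result = call_helper(task)
            if result is None:
                empty.append(task)
            else:
                done[task] = result
        if empty and len(empty) < len(batch):
            # One polite retry for stragglers, only if the batch was not a total loss.
            stragglers, empty = empty, []
            for task in stragglers:
                result = call_helper(task)
                if result is None:
                    empty.append(task)
                else:
                    done[task] = result
        if empty:
            # Stop before launching the next batch into the storm.
            raise RunPaused(f"{len(empty)} helpers returned nothing")
    return done


# Demo: a storm that throttles every task numbered 10 or higher.
storm = {"active": True}
calls = []


def fake_helper(task):
    calls.append(task)
    if storm["active"] and task >= 10:
        return None
    return f"result-{task}"


tasks = list(range(25))
done = {}
try:
    run_in_batches(tasks, fake_helper, done)
    paused = False
except RunPaused:
    paused = True
calls_before_pause = len(calls)

storm["active"] = False
run_in_batches(tasks, fake_helper, done)  # resume: only missing work is redone
```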

A pause only helps if the run is still running. The harder failure is a hang — a helper whose answer never comes, so the run just sits there forever with no error at all. The pause logic is blind to this, because it only triggers when a helper returns something. So there is a completely separate watchdog, running outside the main job, that does one simple thing: it watches the run’s progress files and, if nothing has moved for about twenty minutes, it sends a single alert. The twenty minutes is generous on purpose — a genuinely slow step should never be killed by mistake — but a true hang is forever, so waiting twenty minutes to be sure costs nothing.
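
A watchdog of this kind needs very little code. The sketch below assumes progress is visible as file modification times under one directory; `is_stalled` and `watch` are hypothetical names, and the clock and sleep functions are injectable only so the example can run without waiting twenty minutes.

```python
import os
import tempfile
import time

STALL_SECONDS = 20 * 60  # generous on purpose: a slow step is not a hang


def newest_mtime(progress_dir):
    """Most recent modification time of any file under the directory."""
    newest = 0.0
    for root, _dirs, files in os.walk(progress_dir):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    return newest


def is_stalled(progress_dir, now=None, limit=STALL_SECONDS):
    now = time.time() if now is None else now
    return now - newest_mtime(progress_dir) > limit


def watch(progress_dir, alert, poll_seconds=60, limit=STALL_SECONDS,
          sleep=time.sleep, clock=time.time):
    """Poll until nothing has moved for `limit` seconds, then alert exactly once."""
    while not is_stalled(progress_dir, clock(), limit):
        sleep(poll_seconds)
    alert(f"no progress in {progress_dir} for over {limit // 60} minutes")


with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "stage.json")
    open(marker, "w").close()
    now = time.time()

    os.utime(marker, (now - 5 * 60, now - 5 * 60))    # touched 5 minutes ago
    fresh = is_stalled(tmp, now)

    os.utime(marker, (now - 30 * 60, now - 30 * 60))  # touched 30 minutes ago
    stale = is_stalled(tmp, now)

    alerts = []
    watch(tmp, alerts.append, sleep=lambda seconds: None, clock=lambda: now)
```

Running it as a separate process from the main job is what matters; a watchdog inside the thing that hangs will hang with it.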

A few more, each a silent failure turned loud:

A long session corrupting its own output. There is a known defect in the tool I build on where a very long, very full session starts garbling its own internal commands and cannot recover in place. I cannot fix the defect, so I work around it: keep the working memory smaller, start a fresh session at each stage boundary so no single conversation runs long enough to be likely to trigger it, push the heavy work into the separate background jobs (which have their own recovery), and save progress to disk every stage so a fresh session can pick up exactly where a dead one left off.
Copy-pasted safety code drifting apart. The pause-and-batch logic above has to be identical across several of the assembly-line scripts, and copies that must stay identical always drift over time. So there is one source copy, a command that stamps it into the others, and an automatic check that refuses to let the copies disagree.
“More ideas” quietly becoming the same idea. The diversity rule from Part 5 is enforced here too: when fresh ideas stop being meaningfully different, the run stops asking and switches its approach, rather than burning time on reworded duplicates.
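
The one-source-copy discipline for the shared safety code can be sketched with marker comments around the stamped region. The marker strings, script names, and helper functions are assumptions for illustration, and the sketch works on strings in memory rather than real files.

```python
BEGIN = "# >>> shared-guard >>>"
END = "# <<< shared-guard <<<"


def extract(script_text):
    """Return the text between the markers, stripped of surrounding whitespace."""
    return script_text.split(BEGIN, 1)[1].split(END, 1)[0].strip()


def stamp(script_text, canonical):
    """Replace whatever sits between the markers with the canonical block."""
    head, rest = script_text.split(BEGIN, 1)
    _old, tail = rest.split(END, 1)
    return f"{head}{BEGIN}\n{canonical.strip()}\n{END}{tail}"


def drifted(scripts, canonical):
    """Names of scripts whose stamped region differs from the canonical block."""
    wanted = canonical.strip()
    return sorted(name for name, text in scripts.items() if extract(text) != wanted)


CANONICAL = "MAX_BATCH = 10\nPAUSE_ON_EMPTY = True"

scripts = {
    "select.py": f"import sys\n{BEGIN}\n{CANONICAL}\n{END}\nprint('select')\n",
    "analyze.py": f"import sys\n{BEGIN}\nMAX_BATCH = 50\n{END}\nprint('analyze')\n",
}

before = drifted(scripts, CANONICAL)
scripts = {name: stamp(text, CANONICAL) for name, text in scripts.items()}
after = drifted(scripts, CANONICAL)
```

A check like `drifted` is cheap enough to run before every launch, so a disagreement between copies stops the run instead of surfacing hours into it.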

Before moving on, there is one thing I will not overclaim, because it is the kind of thing that is easy to dress up. The throttling messages you would see in the logs come from a layer below my code that I cannot silence; all my engineering can do is stop feeding the storm sooner and recover cleanly, not make storms disappear. The honest claim is modest and, I think, the right one: none of this makes a long run fail-proof. It makes a long run fail loudly and resume cheaply, which for unattended work is most of the battle.

There is also a clean lesson buried in here that took a painful seven-hour run to learn. The thing that makes a run fragile is its total length, and length is rounds × depth. Going deeper in a single round — more independent minds, more judges — is sequential work that lengthens the round but does not spike the demand that triggers a storm. Doing more rounds multiplies that whole length. So the choice was easy once it was framed that way: keep the depth (depth is the entire point of the project), and buy stability by doing one deep round and stopping, rather than by thinking less.

Part 9: Where This Stands, and What Comes Next

Let me be honest about where this is — starting with the part I did not expect to be able to write yet.

Unattended, with no human choosing which bug to chase, the system now runs the entire select-to-prove-to-report arc this post has described: it picks a target, reverse-engineers an undocumented mechanism down to ground truth, raises its own hypothesis, proves it on a real machine starting from an ordinary user’s account, and writes it up. Run over days, it has done this for real, turning up machine-proven primitives and, in some cases, weaknesses that cross a genuine privilege boundary rather than just crashing a service. (Where a result is proven but its most severe escalation is still open, I say so plainly rather than rounding it up to a finished exploit.) A human does not do the hard part; the pipeline does, and it has moved further than I expected when I began. And now the honest part, which is the whole reason I am still building.

The very top of the ladder is the single hardest open problem in this entire field: not finding a real bug, but inventing a kind of bug that has no name — a reusable shape nobody has described — and doing it on demand rather than once in a while. It is the part almost no one is even trying to automate yet. The machinery aimed at it — reading the undocumented mechanism, raising competing hypotheses, mapping fresh surface, composing footholds into a chain no single part is severe enough to make alone — already runs on its own, and already surfaces the raw material of genuinely new ideas. What it does not yet do reliably is take the final step, from a sharper version of the familiar to the truly new. I can name that limit precisely, and the precision is the point: this is the frontier of automated security research, and pointing a working, self-doubting machine straight at it is exactly the bet worth making.

But look at what is built, because it is more than it sounds. Two things, really. First, the habits that check an idea (noticing the gap, reading before testing, doubting the answer, keeping an honest record) are mostly working. That is the part most people skip: not raw cleverness, but being trustworthy when there is no automatic check, no oracle, which is the exact gap this post opened with. It means an AI that can tell its own proofs from its own guesses, refuses to mix them up, and accepts what the test machine shows even when its own reasoning disagrees. Building that is a real, unsolved problem in AI agents today, and it is the foundation everything else needs: you cannot trust an AI to invent until you can first trust it to check itself.

Second — and this is the part I underrated when I started — the machine that keeps a long, unattended run honest is itself a real piece of work. Most of the engineering in Parts 6 through 8 is not about being clever; it is about a swarm of fallible helpers running for hours without a human watching, and making sure that when something goes wrong it goes wrong loudly and recoverably instead of quietly producing a confident, wrong answer. That problem — trustworthy autonomy without a safety net — is the same problem the self-doubt rules solve, just at the level of the whole system instead of a single finding.

So the honest picture is not a finished result, but it is well past a prototype: the careful, self-doubting checker works, the machine that runs it unattended works and turns up real, proven findings on its own, and the inventor — the part that reverse-engineers the unknown, raises conjectures, maps fresh surface, and tries to chain footholds into a new technique — runs autonomously, but does not yet reliably cross that last step into the genuinely new. And that is the part I find exciting, because what comes next is the best part. The next step is to push that invention engine — listing the gaps, borrowing rules from other fields, composing primitives — from the level where it mostly surfaces the raw material for a new idea up to the level where it reliably has the new idea itself, and to measure that honestly across many cases, not from one good story. There will be a lot of trial and error, and I plan to keep writing about it — the failures as well as the wins, because the failures are where this method proves its worth.

What I set out to do was rebuild the way an expert thinks, written down as clear rules a machine can follow: how they notice the gap, how they read before testing, how they keep an honest record, how they come up with ideas — and, above all, how they refuse to fool themselves. It is not finished. But the last step left — making a machine reliably have an idea no one has had, instead of just handing me the pieces of one — is the hardest thing in this field, and it is precisely the thing I built this project to do. That is the work most worth doing, and I am going to do it in the open.

A few sources, if you want to read more

James Forshaw, How to secure a Windows RPC Server, and how not to — a clear look at how these “who is calling?” checks are supposed to work, by one of the best in the field; much of the “assumed vs. actually checked” idea comes from his write-ups.
Yifei Ming and others, Helpful Agent Meets Deceptive Judge (2025) — the study behind Part 3: a confident but wrong criticism — especially one that backs itself with cited sources — can sharply lower an AI’s accuracy in a single round, and models built for step-by-step reasoning resist it better.
Richard Feynman, Cargo Cult Science — his 1974 Caltech commencement address, where “you must not fool yourself” comes from.
Chenglei Si, Diyi Yang & Tatsunori Hashimoto, Can LLMs Generate Novel Research Ideas? (2024) — the source for the idea-collapse behind Part 5 and Figure 9: out of thousands of ideas a model generates, only about 5% are genuinely distinct, and that count plateaus no matter how many more you ask for. (On why turning up the “temperature” does not rescue this — it raises novelty only weakly, while more clearly hurting coherence — see Peeperkorn et al., Is Temperature the Creativity Parameter of Large Language Models?, 2024.)

This post is licensed under CC BY 4.0 by the author.