4 June 2026
Everyone worried about AI making things up asks the same question: how do you know it isn’t? My answer has nothing to do with finding a smarter model. It’s a process. I run a panel of seven frontier models against each other, in turns, with an orchestrator whose only job is to disagree with them: Claude Opus 4.8, running on ultracode. Every one of the seven panel models is a free web app, not a paid API.
The panel, ranked by reasoning power on the Artificial Analysis Intelligence Index, strongest first. Every one is a free web app I open in the browser I am already signed into, with no API key and no bill:
- GLM-5.2, from Z.ai. Index 51.
- Gemini 3.5 Flash, from Google. Index 50.
- Qwen3.7 Max, from Alibaba. Index 46.
- DeepSeek V4 Pro. Index 44.
- Kimi K2.6, from Moonshot. Index 43.
- Grok 4.3, from xAI. Index 38.
- Mistral Medium 3.5. Index 30.
Seven companies, seven training sets, seven different blind spots. That diversity is doing most of the work.
Recent benchmarks back this up. OpenRouter published its Fusion results on the DRACO research benchmark. Fable 5 on its own scored 65.3%. Fused with GPT-5.5, the pair reached 69.0%, ahead of every model running on its own. A cheaper panel of Gemini 3 Flash, Kimi K2.6 and DeepSeek V4 Pro landed at 64.7% for roughly half the cost, while GPT-5.5 alone managed 60.0%. The result that stayed with me: Opus 4.8 scored 58.8% on its own, then climbed to 65.5%, a gain of 6.7 points, when it was handed two of its own answers to compare and combine. Same model, no new knowledge, just made to check itself.
Now, how they are wired. Each lane is the real consumer product, driven through the browser I am already signed into, over the Chrome DevTools Protocol. Not an API, not a screenshot. I capture each model’s own network stream, the exact bytes its interface renders, so I read what the model actually said, token for token. There is no OCR step to misread anything, and no middle layer to drift.
To be specific about the plumbing: Microsoft Edge is built on Chromium, so it speaks the Chrome DevTools Protocol. I attach a DevTools session to each lane’s tab and enable its Network domain. The moment a model starts streaming its reply, I capture that response straight off the wire with the protocol’s Network.streamResourceContent call, which hands me the exact bytes the browser is receiving, and I decode each vendor’s own format on my side. They are all different: one sends an event stream, another sends JSON one line at a time, a third sends patch frames, a fourth a binary RPC stream. The point is that I read the network, not the screen. No screenshot, no OCR, no scraping the rendered page. A vendor can redesign its whole interface tomorrow and the capture still holds, because I was never reading the pixels.
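To make the decoding step concrete, here is a minimal sketch for the simplest of those formats, the event stream. It assumes the captured chunks arrive base64-encoded, as binary data does when the DevTools Protocol passes it over JSON, and that the vendor uses standard server-sent events with "data:" lines. The class name SseDecoder is hypothetical, and a real lane would need its own handling for whatever the payloads contain.

```python
import base64
import codecs


class SseDecoder:
    """Incrementally turn base64 network chunks into server-sent event payloads.

    Hypothetical helper: assumes standard SSE framing (events separated by a
    blank line, payload carried on "data:" lines). Bare "\r" line endings,
    which the SSE format also allows, are not handled here.
    """

    def __init__(self):
        # An incremental decoder copes with a multi-byte UTF-8 character
        # that happens to be split across two network chunks.
        self._utf8 = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, b64_chunk):
        """Accept one base64 chunk; return payloads of every completed event."""
        self._buffer += self._utf8.decode(base64.b64decode(b64_chunk))
        # Normalise CRLF so a blank line is always "\n\n".
        self._buffer = self._buffer.replace("\r\n", "\n")

        events = []
        while "\n\n" in self._buffer:
            raw_event, self._buffer = self._buffer.split("\n\n", 1)
            data_lines = []
            for line in raw_event.split("\n"):
                if line.startswith("data:"):
                    value = line[len("data:"):]
                    # SSE strips exactly one leading space after the colon.
                    if value.startswith(" "):
                        value = value[1:]
                    data_lines.append(value)
            if data_lines:
                events.append("\n".join(data_lines))
        return events
```

Feeding chunks one at a time matters because the network does not respect event boundaries: an event, or even a single character, can arrive in two pieces, and the decoder only emits a payload once the blank line that closes it has been seen.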
The orchestrator, Opus 4.8, runs the panel as a debate, never a single pass. On the same Artificial Analysis index it scores 56, higher than every model it is refereeing. And I stay out of it. I do not arbitrate, pick a winner, or babysit the rounds. The orchestrator runs the whole debate and hands me one settled answer. Each round goes like this. The lane answers. The orchestrator says what it disagrees with, and why. Then it goes back to the same lane with that objection, and the lane refines. Every round closes with the same forced question: what am I missing, and what better solutions exist? It runs up to five rounds, or until a round stops adding anything.
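The round structure can be sketched in a few lines. This is an illustration under assumptions, not the orchestrator's actual code: ask_lane and find_objection are hypothetical callables standing in for a panel model and for Opus 4.8, and "a round stops adding anything" is approximated as either no objection left or a refinement identical to the previous answer.

```python
from typing import Callable, Optional, Tuple

FORCED_QUESTION = "What am I missing, and what better solutions exist?"


def run_debate(
    ask_lane: Callable[[str], str],
    find_objection: Callable[[str], Optional[str]],
    question: str,
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Run one lane through up to max_rounds of objection and refinement.

    Returns the settled answer and the number of rounds actually run.
    """
    answer = ask_lane(question)  # the lane's opening answer
    rounds_run = 0
    while rounds_run < max_rounds:
        objection = find_objection(answer)
        if objection is None:
            break  # assumption: no objection means nothing left to dispute
        rounds_run += 1
        prompt = (
            f"{question}\n\n"
            f"Objection: {objection}\n\n"
            f"{FORCED_QUESTION}"
        )
        refined = ask_lane(prompt)  # same lane, now facing the objection
        if refined == answer:
            break  # assumption: an unchanged answer means the round added nothing
        answer = refined
    return answer, rounds_run
```

A stricter stopping test, such as comparing the claims in each answer instead of the raw text, would fit the same loop without changing its shape.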
So why is it so hard for a fabrication to slip through to the final answer? Because it has to clear four separate filters, and each one is suspicious for a different reason.
First, an orchestrator that argues. It does not pass answers along, it interrogates them. Its starting assumption is that a claim is wrong until it survives, so a confident mistake gets pushed on rather than forwarded.
Second, a citation the model cannot fake. Every factual claim has to arrive with the source title, the URL the model actually opened that turn, and the figure quoted word for word. If it ran no search, it has to say so plainly: reasoning only, no sources. Attaching a link that merely looks real, one it never opened, counts as a violation, not an answer.
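The citation rule reads naturally as a small validation check. The sketch below is one way to express it, assuming the orchestrator has a set of URLs the lane genuinely opened during the turn; the Claim fields and the function name are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Claim:
    """One factual claim as a lane might report it (hypothetical shape)."""
    text: str
    source_title: Optional[str] = None
    url: Optional[str] = None
    quote: Optional[str] = None       # the figure, quoted word for word
    reasoning_only: bool = False      # lane declares: no search, no sources


def citation_violations(claim: Claim, opened_urls: Set[str]) -> List[str]:
    """Return the ways a claim breaks the citation rule; empty means it passes."""
    if claim.reasoning_only:
        # An honest "reasoning only, no sources" declaration is allowed.
        return []

    problems = []
    if not claim.source_title:
        problems.append("missing source title")
    if not claim.url:
        problems.append("missing URL")
    elif claim.url not in opened_urls:
        # A link that looks real but was never opened is a violation.
        problems.append("URL was not opened this turn")
    if not claim.quote:
        problems.append("missing verbatim quote")
    return problems
```

Passing this check says only that the citation is well formed and points at a page the lane opened; whether the page supports the claim is a separate question.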
Third, an independent check. The orchestrator re-verifies the claims the answer rests on. It opens the cited page itself, queries the repository, checks the version number, instead of taking the panel’s word. Anything that fails gets cut, or flagged as unverified.
Fourth, the other six models. A hallucination usually lives in one model’s blind spot. The rest never saw that particular ghost, so the moment they disagree, it surfaces. The only thing that earns my confidence is agreement across vendors that have nothing to do with each other.
Add it up. To reach the final answer, a fabrication has to beat an orchestrator built to doubt it, produce a citation that resolves to a page that genuinely exists, survive an independent recheck, and then somehow show up in several unrelated models at once. That is not one lock on the door. It is four, in a row.
There is one more piece, and I refuse to skip it. I will not claim that a fabrication getting through is impossible. Models can share a blind spot. A check can be done lazily. A real source can simply be wrong. So the system does not sell you certainty. It sells you a confidence label. Every claim comes out tagged as verified against a source, agreed across models, or unverified, and you always know which one you are looking at. The goal was never to be right every time. It was to never be wrong quietly. Any tool that promised a flat zero percent on hallucinations would be making the exact kind of unbacked claim this whole setup exists to catch.
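The labelling itself is simple enough to sketch. The three label strings come from the text above; everything else here is an assumption made for illustration, including the function name, the rule that a passed source check outranks cross-model agreement, and the threshold of two agreeing vendors.

```python
def confidence_label(
    source_check_passed: bool,
    vendors_agreeing: int,
    min_vendors: int = 2,  # assumption: the real threshold is not stated
) -> str:
    """Pick the single label a claim is published under."""
    if source_check_passed:
        # Assumed priority: an independent source check is the strongest signal.
        return "verified against a source"
    if vendors_agreeing >= min_vendors:
        return "agreed across models"
    # Nothing is dropped silently: a weak claim is still shown, but tagged.
    return "unverified"
```

For example, confidence_label(False, 3) gives "agreed across models", while a claim that only one vendor asserts and no source confirms comes out as "unverified".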
Seven models. One orchestrator that argues. And one rule underneath all of it: nothing reaches you as a fact until it has survived being doubted.