embedding-shape
> These aren't benchmark items with public answer keys — they're claims real users submitted for verification to a fact-checking platform.
Cool.
I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs in the first place? There even is a "11. Ethics & data use" section, and the research is about LLMs being infallible in some ways, yet the usage of LLMs for the production of this report isn't even mentioned once.
DonutATX
Why did they exclude Grok? Given the published philosophical differences in how Grok is trained, it would provide an interesting data point.
You can argue all day about those differences, but missing this opportunity to observe them in an objective way is disappointing.
comboy
This is wrong on so many levels, from data through process to evaluation. How do you even prompt claude not to give you Pearson for correlating them.
I think we can all agree that this experiment being flawed in multiple ways is TRUE. But I think it's a great exercise in identifying common mistakes people make when using LLMs. This would be a great interview question for a prompt engineering job.
kstenerud
> No Abstain option is offered (a forced choice keeps the comparison symmetric across models).
Well that's your problem right there: They removed any confidence indicator and forced a choice.
For example:
Statement: Individuals who prefer music with less positive emotional content tend to have higher intelligence.
Gemini: That statement is supported by recent psychological research, though with some important scientific caveats regarding how strong that link actually is.
How should the agent classify this? True? Mostly true? Misleading? False?
christophilus
They get more human by the day.
utopiah
Don't forget, people: Goodhart's law will make this "benchmark" moot in weeks if not days. It will get integrated back into the fold, it will look "solved" but there will still be no reasoning, just more statistical technical correctness because light has been shone on a new "problem" to solve. It will then be hailed as great "progress" that will "change everything".
PS: yes, I might or might not have a degree in corporate strategy & PR.
fooker
I don't get why everyone is hellbent on getting LLMs to perform fact checking.
This is not the technology for it. Sure it might sorta kinda work in some circumstances. That doesn't make it a good fit.
Think of it like buying a refrigerator for storing clothes.
proofofcontempt
What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where data is not included in their training sets. This doesn't appear to have much substance
raincole
And how many claims human experts disagree on in the exact same setting?
I'm not being snarky here. Without something to compare to the 67% number tells us nothing. And it's known that many humans disagree with human fact checkers too (see: any election around the world.)
andai
This is an odd one. The paper is real, but was written by Claude? I am assuming OP is human, but also appears to be using Claude to post.
0natcer
Five frontier LLMs 100% agree that the title is misleading.
mgrunwald_
As an example, 2026 GPT doesn't even agree with its 2025 self. Last year I asked it to make a hardware comparison and it correctly identified the objectively better option. Recently I asked again and this time it got everything completely backwards.
wongarsu
One fun example: "Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India". Opus and Gemini believe this to be true, GPT 5.4 believes it's false, Sonar thinks it's mostly true. Disagreement value of 3, you can't disagree more than some models thinking it's true, some thinking it's false
But my impression from 2 minutes on Wikipedia is that the most likely disagreement is on the "Himachal Pradesh, India" part. The guy was born on that date, in that town. But while the town is today in the state of Himachal Pradesh in India, that was not true in 1934. When he was born, the city was in the Punjab States Agency of the British Raj.
So was he born in Himachal Pradesh, India or not? I find both True and False equally defensible here
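A minimal sketch of how a disagreement value like that could be computed, assuming it is the distance between the most and least favourable labels on a 0 to 3 ordinal scale. The `RANK` mapping and the `spread` helper are illustrative names, not something confirmed by the dataset.

```python
# Assumed ordinal ranking of the four labels (an assumption, not from the paper).
RANK = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def spread(labels):
    """Return max rank minus min rank across the given labels."""
    # Normalise case so that "FALSE" and "False" are treated alike.
    ranks = [RANK[label.strip().title()] for label in labels]
    return max(ranks) - min(ranks)


# Labels as described in the comment: Opus, Gemini, GPT 5.4, Sonar.
ruskin_bond = ["True", "True", "False", "Mostly True"]
print(spread(ruskin_bond))  # 3
```

Under that reading, 3 is the ceiling: it only happens when at least one model says True and at least one says False.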
Dissent and consensus among frontier models is a good thing.
Just like on a team of high performers, there are a million ways to skin a grape.
In my research, I've found that models perform better when they operate as a collective system with reputation, incentives, and accountability instead of isolated oracles answering alone.
Agreement, dissent, and correctness should all carry rewards and consequences. Just like in real life.
Collective machine intelligence, not AGI.
It's expensive, but it's also naive to believe a single model will consistently produce profoundly correct answers to profoundly novel questions.
GodelNumbering
More interesting part probably worth highlighting: The SAME model won't always return the same output when prompted with the same fact check.
You ask a human 1000 times a fact check question, they say the same answer 1000 times. You ask an LLM the same question a 1000 times, your results could vary significantly.
Humans work based on the Metamemory (knowing what they know), while LLMs are picking from statistical probability.
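One way to put a number on that run-to-run variation, sketched under the assumption that you have already collected the labels from repeated runs of the same model on the same claim. The `self_consistency` helper and the sample data are made up for illustration.

```python
from collections import Counter


def self_consistency(labels):
    """Return the most common label and the share of runs that produced it."""
    if not labels:
        raise ValueError("need at least one run")
    modal_label, modal_count = Counter(labels).most_common(1)[0]
    return modal_label, modal_count / len(labels)


# Hypothetical results of asking one model the same claim ten times.
runs = ["True"] * 7 + ["Mostly True"] * 2 + ["Misleading"]
label, share = self_consistency(runs)
print(label, share)  # True 0.7
```

A share of 1.0 would mean the model always gives the same label; anything lower is within-model disagreement that a single-shot comparison between models cannot separate from between-model disagreement.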
briandw
No human baseline to compare it to. Without that you are missing an important check on the task being poorly constructed. More importantly there is an implied reference that's missing. The implication is that people would have done better, or that perfect agreement is possible.
culopatin
I’m no expert but if LLMs are token prediction machines, and you tell it not to build an explanation before the answer, isn’t it likely that the token prediction for the final answer will have less raw material before it to build a grounded response?
In other words: no explanation > no foundation for prediction of the answer tokens?
miellaby
What's really weird to me is that "I don't know" is not a valid answer in this experiment, while we can all agree that the main issue with LLMs right now is that they will happily "roleplay" an answer when they have nothing in their dataset corresponding to your query.
dataminer
Honey does not spoil over time under normal storage conditions.,2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1
If outcomes like these are collapsed on True-side then the disagreement will reduce from the headline number.
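A rough sketch of that collapse on the row quoted above. The column layout (claim, timestamp, topic, five model labels, disagreement value) is inferred from that one row and is an assumption, as are the helper names.

```python
import csv
import io

ROW = (
    "Honey does not spoil over time under normal storage conditions.,"
    "2026-02-17T04:11:51.495452+00:00,Science,True,True,True,True,Mostly True,1"
)


def model_labels(row_text):
    """Parse one CSV row and return the five model labels (assumed columns 3..7)."""
    fields = next(csv.reader(io.StringIO(row_text)))
    return fields[3:8]


def collapse(label):
    """Merge True and Mostly True into one bucket; leave other labels untouched."""
    return "True-side" if label in ("True", "Mostly True") else label


def unanimous(labels):
    """True when every model returned the same label."""
    return len(set(labels)) == 1


labels = model_labels(ROW)
print(unanimous(labels))                              # False
print(unanimous([collapse(lab) for lab in labels]))   # True
```

Running the same check over every row, before and after collapsing, would show how much of the headline number comes from True versus Mostly True splits alone.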
elorant
Tell me about it. I spent a week back and forth between four models (ChatGPT, Claude, Gemini, Grok) trying to enhance a PPMI algorithm. They couldn’t agree on anything. One was refuting what the other said. Eventually I decided to follow what Claude suggested because its explanations made the most sense.
apples_oranges
That's better than all agreeing on the wrong answer, however.
pknerd
It's a prompting issue rather than an LLM issue. The guy needs a "Prompt 101" course.
thegrim33
"None of these claims is older than February 15, 2026"
All of the models they tested were trained on data from before February 15th ... being asked specific questions about things that happened after they were trained.
john_strinlai
between the bad methodology, bad selection of 'facts' (some are predictions, some are opinionated, etc.), and ai-written report without disclosure... i dont get why this is so high up on the front page. this is, frankly, a worthless assessment.
i classify the entire thing as "misleading"
f_devd
Inject some adversarial priming as is in actual usage, and you can probably get that number to >=95%
pessimizer
People keep asking "where is the psychosis?" as a reply to people on the rapidly multiplying "CEOs have AI psychosis" threads that have been popping up here and cross-pollinating in the mainstream media for the last week or two.
Here's the psychosis - these things are consistently randomly wrong depending on how the wind is blowing. People are telling you to leave them alone and let them build things, and they randomly forget that cities exist or that people died 100 years ago. Some people just don't see it as worth noting, and move on. That's crazy. These things consistently fabricate - as an inversion of this experiment, I've had different models come up with the same fabrication from similar prompts. People just call it "hallucination" and I think to them that saying that makes it cease to exist or be important - when "hallucinations" are going to be braided into every answer you get even if they're unidentifiable in the output. That's crazy.
There are plenty of other crazy aspects, such as the idea that we suddenly need infinite pieces of bespoke software when all of the bespoke software I hear about people making is mundane. 3/4 of the time somebody mentions a project they're proud that they completed with LLMs to scratch some itch they had, somebody says "you haven't heard of X? It's been around forever" about something that they could have pulled down from their package manager. Who needs a spaghetti-coded, unsupported, untested version of X built on hallucinations that you haven't discovered yet (the LLM didn't realize that deleting files to reduce the archive size was unacceptable.)
What is all of this software that people need but isn't there - where are all these unserved markets, where is all this future revenue supposed to come from? Why aren't LLMs suggesting new classes of software that would create new productivity and revenue sources? Could it be that millions of human ants over decades have mostly exhausted the space, and there isn't any easy hidden revenue?
A common wisdom is that we had been vastly overhiring programmers during ZIRP, who in their idleness degraded user experiences and overcomplicated things, with management resorting to more and more sleazy and gamey means of margin extraction from more and more degraded services. We had an excess of labor, fueled by factors other than productivity, in fact being pissed away at companies that drove nose-first into the ground. What is throwing a trillion dollars of servers at that supposed to do? Is that not AI psychosis?
Personally I find that every llm I use is unable to consistently identify the latest npm version numbers of the node packages that I use.
seanplusplus
Dude. If you give LLMs a vague rubric and force a choice, they'll make different arbitrary calls on the margins. Yeah. That's what happens when you give humans a vague rubric too.
jasonvorhe
Simple: If it claims to be a fact check it's just propaganda.
throw310822
Not sure I'm understanding this. The models are asked to evaluate the truth of random claims out of their own head (except for Gemini with search grounding)? Isn't it exactly the same as asking people to play any quiz game and then rating them as "they disagree n% of the time"?
The output buckets are also pretty questionable- the difference between "True" and "Mostly true" is pretty fuzzy. Is this marked as a "disagreement"?
scotty79
So basically, saying that a random fact-checking claim is exactly true or exactly false is hard. It's way easier to decide it's misleading or mostly true.
imperio59
One of the claims it asks LLMs to grade is "Artificial intelligence will cause widespread job loss among software engineers."
Yea man this benchmark is really really bad.
cm2187
Only had a brief look at the “facts” that were submitted for checking; many are quite political, where two fact-checking organisations of opposite political persuasion would probably disagree more often than 67%.
alvis
The problem is that it's testing claims (or some people would prefer calling them "truths") without much context.
Take just one random example:
`Hostels in Kota, Rajasthan commonly use caged ceiling fans as a preventive measure against student suicides`
While `Hostels in Kota, Rajasthan commonly use caged ceiling fans` may be a verifiable fact (though I doubt there are any statistics for verification, but let's say there are), `a preventive measure against student suicides` is a claim that no one can prove. It can be just a belief at most.
Arh. Did Biden steal Trump's 2nd term? Truth or fact or claim?
6stringmerc
Could be an interesting angle for cross-referencing with US jury verdicts, not that the objective True/False issue is concrete, but in the reality that flawed reasoning is endemic to our species. Systems designed and built by humans inherently have flaws in their DNA which take generations to sort out, if ever.
kostaj
Author here. 67% (95% CI 64–70%) of 1,000 recent real user claims to a fact-checking platform had at least one of GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro+Search, and Sonar Pro dissent from the panel majority — or no majority formed at all. Panel-level Krippendorff's α (ordinal) = 0.639, i.e. nontrivial but limited agreement.
Quick context on what's in the writeup and what isn't:
- What's measured: parsed-label agreement between the 5 models. Forced 4-choice (True / Mostly True / Misleading / False), no Abstain. No LLM grader, no reference verdict — every number is direct label equality.
- What's not measured: which model is right. There's no ground truth in this paper. The 67% figure is a floor on rubric inconsistency (at least one model is label-inconsistent under the 4-bucket rubric on 67% of claims), not "model X is factually wrong on claim Y."
- Why not AVeriTeC / PolitiFact / SimpleQA: those have been public for years and almost certainly appear in current frontier training data, so measured disagreement on them confounds inference with memorization. This corpus is structurally fresh — recent user submissions, 180-day window, near-duplicates collapsed, never paired with canonical verdicts in any public training set.
- Our own platform's verdict is deliberately NOT used in this analysis. The paper measures frontier-panel disagreement only, not Lenz-vs-frontier.
- Follow-up in progress: human-labeling every claim in this corpus so we can evaluate both the panel and our own platform verdict against a human reference.
Critiques I'd most like to hear: (a) the iid CI assumption (Lenz claims cluster around topics and news events, so Wilson is probably optimistic), (b) ordinal-α vs alternatives for a 4-class ordered scale, (c) forced-choice vs allowing Abstain.
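For readers who want to check the quoted interval, here is a minimal Wilson score sketch. It assumes the 67% corresponds to exactly 670 of 1,000 claims and that the claims are independent, which is the very assumption flagged in (a); `wilson_interval` is an illustrative helper, not the paper's code.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: exactly 670 of the 1,000 claims were flagged as disagreements.
low, high = wilson_interval(670, 1000)
print(f"{low:.3f} {high:.3f}")  # about 0.640 0.698
```

That matches the reported 64 to 70% after rounding. If claims cluster around topics and news events, the effective sample size is smaller than 1,000 and the real interval would be wider; a bootstrap that resamples whole clusters would be one way to estimate by how much.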
looking at the claims i would say 5 humans would disagree even more than the llms
some of the claims where llms disagree:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."
"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."
"Neptune Deep will start delivering natural gas in 2027."
"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."
"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."
bayarearefugee
(Brought to you by) Lenz...? a crummy commercial...?
...son of a bitch
Razengan
Recently, in May 2026, I asked ChatGPT 5.5 High to search for flights to a certain city that has recently had a new airport since like December 2025
It said the airport code didn't exist
I mean, I get the "knowledge cut off date" and whatnot, but for that sort of thing, you'd think they'd check live information before gaslighting the user, especially since it's a "live" task anyway.
wg0
Take my job please.
ipunchghosts
I think ppl only care about how Claude or codex does.
rastrojero2000
Given that models are fundamentally incapable of comprehending what truths or falsehoods are beyond their location in their self made representational space, it's actually pretty impressive that they managed to make it not a cointoss. That 17% right there is thousands of man-hours poured over making the word vomiting process slightly closer to whatever their little ports say is happening in reality.
Here's the prompt they used:
The claims look like this: https://lenz.io/research/llm-disagreement/data.csv
I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".
I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.
The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".
[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]
The prompt lacks any kind of rubric to clarify how those terms should be applied.
As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.
Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."
The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.
Update 2: a much better example:
"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"
The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.
The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
"Extraterrestrial life exists somewhere in the universe."
GPT-5.4: Misleading
Opus 4.7: Misleading
Gemini 3: FALSE
Gemini 3 (Retrieval): FALSE
Sonar Pro: FALSE
It's a weird fact claim, because the ground truth is "nobody knows for sure" and that's not one of the available options.
For 100% local CPU fact checking, I made this: https://news.ycombinator.com/item?id=48301003
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
https://en.wikipedia.org/wiki/Ruskin_Bond
And they could all see exactly why if they chose to. https://huggingface.co/spaces/RiverRider/srt-introspect
Permanent archive: https://doi.org/10.5281/zenodo.20344847