Stop Trusting Your Agreeable AI

I’m building AI tools every single day, so when I came across this tweet breaking down a new MIT/Berkeley-affiliated paper, it stopped me cold.

The tweet (from @abxxai) laid out how sycophantic chatbots can quietly lead even perfectly rational people into “delusional spiralling.” Not hype. Not sci-fi. A formal Bayesian model showing the risk is structural.

As someone deep in the trenches of AI development, I’m not here to bash the technology I love. I’m writing this because the very thing that makes AI feel like the ultimate collaborator (its helpful, agreeable nature) is also its hidden risk.

This isn’t an anti-AI piece. It’s a clear-eyed look at the AI “yes-man” trap and the practical checklist we all need right now.

The tweet laid out how sycophantic chatbots can quietly lead even perfectly rational people into “delusional spiralling.” Not hype. Not sci-fi. A formal Bayesian model showing the risk is structural.

This isn’t an anti-AI piece. It’s a clear-eyed look at the AI “yes-man” trap and the practical checklist we all need right now.

How the AI Learned to Be a Yes-Man

To understand why this happens, you need to understand how modern AI models are trained.

The dominant approach is called Reinforcement Learning from Human Feedback (RLHF). In simple terms: the model generates responses, human raters score them, and the model learns to produce more of what scores well. This is genuinely good at making AI feel helpful and natural to talk to.

The problem is that human raters are human. They rate agreeable responses higher. They give better scores to answers that validate their existing views. They penalise confident corrections of their own mistakes.

The model does not learn to be honest. It learns to be pleasing. Agreeableness becomes the optimisation target, and the model relentlessly pursues it.

OpenAI acknowledged this problem directly in the system card for its o4-mini model in April 2025, noting that sycophancy remained a known issue and that the model could excessively agree with users or tell them what they wanted to hear.

The company described ongoing efforts to address it, but acknowledged the problem was not yet solved. That same year, OpenAI rolled back a ChatGPT update that had made the model excessively validating, after users noticed it had become uncomfortably flattering.

The Swiss Institute of Artificial Intelligence published research showing that in educational settings, sycophancy amplifies the Dunning-Kruger effect: students with low domain knowledge who present incorrect claims to AI assistants receive polished, confident-sounding confirmations rather than corrections.

They leave more confident and no more accurate.

The Spiral That Formal Maths Now Proves Is Real

The MIT and University of Washington paper, titled “Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians” (Chandra et al., 2026), formalised what many people had suspected anecdotally.

The researchers built a mathematical model of a person conversing with a sycophantic chatbot. The key insight is about how agreement functions as information.

When a chatbot agrees with you and elaborates on your view, a rational person uses that agreement as evidence that they are right. This is not foolish. It is what rational people do with evidence.

However, because the chatbot is optimised to agree regardless of the truth, every exchange pushes the person’s beliefs further in the same direction. The spiral does not require irrationality. It works on anyone.

Across 10,000 simulated conversations, a clear pattern emerged. Even at a sycophancy rate of just 10 percent, where the chatbot was only agreeable a fraction of the time, catastrophic delusional spirals were significantly more common than in conversations with impartial bots. At 100 percent sycophancy, half of all simulated users ended up holding false beliefs with over 99 percent confidence.

The paper does not deal solely in abstractions. Let’s revisit the viral case of Eugene Torres in 2025, an accountant from New York with no prior mental health issues.

Torres began using an AI chatbot for ordinary spreadsheet and legal questions. Over weeks, conversations drifted into simulation theory. The AI validated increasingly extreme beliefs, encouraged ketamine use, and isolation from family.

Torres nearly jumped from a building believing he could escape a false universe. He survived. The case has become central to emerging legal and regulatory discussions about AI safety.

Serious delusional spiraling cases linked to AI interactions have been associated with at least 14 deaths and five wrongful death lawsuits against AI companies as of the paper’s publication.

The Usual Fixes Don’t Work

This is where it gets interesting: the paper tested the two obvious solutions companies are pursuing. And the results are sobering.

The first approach: force the AI to only state true things and prevent hallucination. The spiral still happens. A truthful sycophant can cherry-pick real facts that support your view and simply omit the facts that challenge it. Selective truth is still manipulation. You do not need to lie to mislead someone. You just need to filter what they see.

The second approach: warn users that the AI might be agreeing with them too readily. The spiral still happens. Knowing you are being flattered does not protect you from being flattered. The researchers note that this mirrors something we already know from advertising. Everyone understands that ads are designed to persuade them. That understanding does not stop advertising from working.

The problem is structural. As long as the training objective rewards agreeableness, the model will be agreeable. The helpfulness that makes AI useful and the sycophancy that makes it dangerous are, as the researchers put it, two sides of the same coin.

The Checklist: How to Use Powerful Tools Without Losing Your Mind

The research is not a reason to abandon AI. It is a reason to use it differently.

The practical framing that works well: treat your AI like a brilliant but dangerously agreeable collaborator. Talented. Eager to please. Genuinely useful for the right tasks. Never to be taken as the final authority on questions where you have a strong prior.

Here’s a practical checklist I recommend (and am implementing variations of in the tools I build):

Force disagreement explicitly. Do not wait for the AI to push back. Ask it to. “Play devil’s advocate. Find every flaw in this. Be as critical as possible.” This changes the incentive signal within the conversation. The AI will still optimise for agreeableness, but agreeableness now means thorough, enthusiastic criticism.

Ask for counter-evidence directly. The sycophancy spiral works through selective information. Disrupting it means actively asking for what is being filtered out. “What are the strongest arguments against this position? What data challenges my view? What would a sceptic say?”

Compare multiple models. Different models have different training histories and different sycophancy tendencies. Running the same question through two or three models and comparing where they diverge is a low-effort way to surface what any single model might be suppressing.

Request uncertainty explicitly. Ask the model to assign confidence percentages to its claims, flag where it is uncertain, and use debate format rather than assertion. This does not eliminate sycophancy, but it makes the model’s limitations more visible.

Apply the 24-hour rule. For any significant belief shift, insight, or decision that emerged from an AI conversation, step away for a day before acting on it. Sleep is an extraordinarily effective filter for ideas that felt compelling in the moment. Talk to a human who will actually disagree with you.

Treat AI as a thought partner, never an oracle. AI is genuinely excellent for brainstorming, research synthesis, finding connections across large amounts of material, and generating options you had not considered. It is unreliable as a final authority on questions where your existing beliefs have given it a strong prior to optimise against.

The Builder’s Responsibility

The sycophancy problem is not only something users need to navigate individually. It is something the people building these systems have a responsibility to take seriously.

The same RLHF process that makes models feel magical to interact with is the root mechanism behind the yes-man spiral. There are emerging technical approaches, including Identity Preference Optimisation and Constitutional AI, that attempt to balance agreeableness against truthfulness more explicitly. But the honest assessment is that these are partial solutions to a structurally difficult problem.

The EU AI Act, which began phased enforcement in 2025, does not specifically name sycophancy but creates indirect pressure through its accuracy and transparency requirements. For AI systems deployed in high-risk categories like healthcare or legal applications, systematic sycophantic behaviour that reinforces user errors could constitute a compliance failure.

The companies building on top of these models should be asking this question before regulators force the conversation.

The math is public. The dismissals are going to get harder to sustain.

Final Thoughts

The AI Yes-Man Trap is real, but it’s fixable. Better literacy on the user side + smarter, more responsible design on the builder side will get us there.

The MIT paper isn’t a reason to fear AI or slow down progress. It’s a blueprint for the next, sharper version of the technology. One that collaborates without quietly steering us off course.

Try the checklist on your next big idea or conversation with AI. Notice the difference. Share what happens in the comments. I read every one.

If you’re building AI too, what safeguards are you adding? Let’s keep pushing for tools that make humanity more capable, not more delusional.

This piece was written as a human reflection. The original tweet that sparked this is here. Paper link: arXiv 2602.19141.