Are LLMs Really Smart? Dissecting AI's Reasoning Failures
Stanford researchers analyzed 500+ papers to systematically map LLM reasoning failures. From cognitive biases to the reversal curse, discover where and why AI reasoning breaks down.

Are LLMs Really Smart? A Complete Guide to AI Reasoning Failures
Large Language Models like ChatGPT and Claude write complex code, compose poetry, and hold philosophical conversations. Yet they occasionally produce baffling answers to remarkably simple questions.
"Why does such a smart AI make such basic mistakes?"
A survey paper from Stanford -- "Large Language Model Reasoning Failures" by Song, Han, and Goodman (TMLR 2026) -- offers the first comprehensive taxonomy of where and why LLM reasoning breaks down. Drawing on more than 500 research papers, it maps dozens of failure categories, organized by reasoning type and failure mode.
This post walks through the paper's framework and key findings. Inspired by its taxonomy, we also designed 10 hands-on experiments and ran them across 7 current models. Detailed results are in Parts 1-3; this post is the overview.
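To give a flavor of what these experiments look like, here is a minimal sketch of a reversal-curse-style probe (the failure covered in Part 1). It assumes the OpenAI Python SDK (`openai>=1.0`) and an `OPENAI_API_KEY` in your environment; the model name, the `ask()` helper, and the prompts are illustrative placeholders, not the exact setup used in Parts 1-3.

```python
# A minimal sketch of a reversal-curse probe, assuming the OpenAI Python SDK (openai>=1.0).
# The model name, prompts, and ask() helper are illustrative, not the series' exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, model: str = "gpt-4o-mini") -> str:
    """Send one question to a chat model and return its text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # The classic "A is B" fact queried in both directions:
    # models usually answer the forward question and often miss the reversed one.
    print("forward :", ask("Who is Tom Cruise's mother?"))
    print("backward:", ask("Who is Mary Lee Pfeiffer's son?"))
```

Running the same pair of questions across several models (by swapping the `model` argument or pointing the client at another provider's compatible endpoint) is the basic pattern behind the per-model comparisons in the detailed posts.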
Related Posts

Can AI Read Minds? LLM Failures in Common Sense and Cognition
Theory of Mind, Physical Common Sense, Working Memory — testing where text-only LLMs fail in common sense and cognition.

LLM Reasoning Failures Part 2: Cognitive Biases — Inherited from Human Data
Anchoring, Order Bias, Sycophancy, Confirmation Bias — cognitive biases from RLHF and training data, tested across 7 models.

LLM Reasoning Failures Part 1: Structural Limitations -- Scaling Won't Fix These
Reversal Curse, Counting, Compositional Reasoning — fundamental Transformer failures tested across 7 models.