

In January 2022, a team at OpenAI published a paper with an unusual origin story. As team leader Alethea Power explained at a conference later that year, a colleague went on vacation and forgot to stop a training run. When they returned, something unexpected had happened.
The neural network, a small transformer learning modular arithmetic, had done something it wasn't supposed to do. After memorizing its training data and showing zero ability to generalize to new problems, it suddenly achieved perfect accuracy on unseen inputs. The model had "grokked."
The term comes from Robert Heinlein's Stranger in a Strange Land, meaning to understand something so thoroughly you become part of it. The researchers chose it because something qualitatively different seemed to be happening inside these networks.
Three years later, grokking has spawned dozens of research papers, sparked debates about neural network generalization, and raised important questions for AI safety. But separating what we know from what gets amplified on social media requires looking at the research directly.
What Grokking Actually Is
Grokking describes a specific training phenomenon: a neural network first memorizes its training data (achieving near-zero training loss), appears to plateau with poor generalization, then suddenly — sometimes thousands of epochs later — develops the ability to handle unseen inputs with near-perfect accuracy.
This contradicts conventional machine learning wisdom. Standard practice says to stop training when validation performance plateaus or worsens, because continued training leads to overfitting. Grokking shows this isn't always true.
The original experiments used small transformers on "algorithmic datasets" — essentially mathematical operations like modular addition and multiplication. Given pairs of numbers (a, b) and their sum modulo 97, the network would initially memorize the specific pairs it had seen. Then, after extended training with regularization, something clicked.
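To make the setup concrete, here is a minimal sketch of this kind of algorithmic dataset: every pair (a, b) labeled with (a + b) mod 97, split into training and held-out halves. The 50% split fraction is illustrative, not the exact ratio from the paper.

```python
# Minimal sketch of an algorithmic dataset in the style of the original
# grokking experiments: all pairs (a, b) labeled (a + b) mod 97.
# The 50/50 train/test split here is illustrative, not the paper's exact ratio.
import random

P = 97  # modulus

# Every possible input pair and its answer: 97 * 97 = 9409 examples total.
examples = [((a, b), (a + b) % P) for a in range(P) for b in range(P)]

random.seed(0)
random.shuffle(examples)

split = len(examples) // 2
train, test = examples[:split], examples[split:]

# A memorizing network can hit 100% on `train` while staying near
# chance (~1/97) on `test`; grokking is the much later jump on `test`.
print(len(train), len(test))  # 4704 4705
```

Because the full input space is tiny and enumerable, researchers can watch test accuracy on literally every unseen pair, which is what makes the delayed jump so unambiguous.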
What Researchers Found Inside
Neel Nanda, then at Anthropic (now at Google DeepMind), conducted one of the most revealing investigations. He designed a simplified transformer specifically to understand what was happening during grokking.
His discovery was remarkable: the network learned to perform modular addition using discrete Fourier transforms and trigonometric identities. It represented numbers as points on a circle and used rotations to compute sums.
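The circle trick is easy to state in plain code. The sketch below illustrates the underlying math, not the learned network itself: map each residue a to the angle 2πa/p, compose the two rotations, and read off which residue the combined angle lands on.

```python
# Sketch of the "clock" idea behind the learned algorithm: represent each
# residue a as the angle 2*pi*a/p on a circle, so adding angles computes
# (a + b) mod p. Plain Python illustrating the math, not the network itself.
import math

P = 97

def to_angle(a):
    """Map residue a to its position on the circle."""
    return 2 * math.pi * a / P

def mod_add_via_rotation(a, b):
    """Compose two rotations, then recover the residue the result lands on."""
    theta = to_angle(a) + to_angle(b)
    return round(theta / (2 * math.pi / P)) % P

print(mod_add_via_rotation(50, 60))  # (50 + 60) mod 97 = 13
```

The wrap-around at 2π is what implements the "mod" for free, which is why a network that discovers this representation generalizes to pairs it never saw.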
"These weird brains work differently from our own," Nanda observed. "They have their own rules and structure. We need to learn to think how a neural network thinks."
A follow-up study from MIT found that not all grokking networks discover the same algorithm. About 40% use what researchers call "clock" algorithms (rotation on a circle) or "pizza" algorithms (angle bisection on a sliced circle). The remaining 60% use approaches that researchers haven't been able to interpret.
This matters: even in the simplest cases where we can examine grokking in detail, we can't fully explain what most networks are doing.
What Makes Grokking Happen
Researchers have identified several conditions necessary for grokking:
Regularization is essential. Weight decay or other regularization techniques create pressure toward simpler solutions. Without regularization, networks remain stuck in memorization. With too much, they fail to learn anything.
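The mechanism is simple to see in isolation. In the sketch below (illustrative values, not the paper's hyperparameters), each gradient step also shrinks every weight toward zero, so any weight the data does not keep re-justifying decays away, which is the pressure toward simpler solutions.

```python
# Sketch of why weight decay pressures networks toward simpler solutions:
# each step shrinks every weight toward zero, so weights survive only if
# the loss gradient keeps re-justifying them. Values are illustrative.
def sgd_step(weights, grads, lr=0.01, weight_decay=1.0):
    # SGD with L2 weight decay: w <- w - lr * (grad + wd * w)
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# A weight receiving no gradient signal (g = 0) decays geometrically:
w = [1.0]
for _ in range(100):
    w = sgd_step(w, [0.0])
print(round(w[0], 4))  # 0.99^100 ≈ 0.366
```

A memorizing circuit needs many such unsupported weights, one per memorized example, while a generalizing circuit reuses a few, which is one intuition for why decay favors the latter over long training runs.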
Data quantity matters precisely. Grokking occurs in a narrow band: enough data that a generalizing solution exists, but not so much that the network generalizes immediately without the delayed transition.
Hyperparameters must be "just right." Change learning rates, network size, or regularization strength beyond a narrow window and grokking disappears. This fragility is important context for extrapolating to larger systems.
The emerging explanation involves competition between two internal algorithms: a memorizing circuit that forms quickly and a generalizing circuit that develops more slowly. Regularization gradually shifts resources from the complex memorizing solution toward the simpler generalizing one. The phase transition occurs when the generalizing circuit finally dominates.
Recent work shows grokking isn't as sudden as it appears externally. "Progress measures" based on mechanistic analysis reveal gradual internal changes throughout training. The apparent discontinuity reflects a shift from one dominant algorithm to another, not spontaneous capability emergence.
The Scaling Question
The critical question for practical AI: does grokking happen in production-scale systems?
Until recently, all grokking research used tiny models on toy tasks. A June 2025 paper changed that, studying grokking during pretraining of OLMoE, a 7-billion-parameter mixture-of-experts language model.
The findings were mixed. Grokking does occur during LLM pretraining, but it looks different from the clean phase transitions in small models: different data domains (math, code, common sense) enter grokking phases at different times; the memorization-to-generalization transition happens asynchronously across the model; and unlike toy experiments with thousands of epochs on the same data, LLM pretraining typically passes through data only once.
The researchers developed metrics to track this transition by examining how data "pathways" through the model evolve — from random, instance-specific routes to more structured, shareable patterns. This offers a potential window into generalization dynamics during training.
But the clean narrative of "train longer, achieve understanding" doesn't survive contact with real-world complexity.
What We Don't Know
Several important questions remain unresolved:
Interpretability at scale. We can reverse-engineer grokking in tiny models. Whether similar analysis is possible for billion-parameter networks remains unclear.
Task generality. Grokking has been demonstrated on algorithmic tasks and some limited domains. Whether the phenomenon explains — or even relates to — the generalization capabilities of large language models across diverse tasks is speculative.
The 60% problem. Even in simple cases, most grokking algorithms remain uninterpretable. What happens when we can't understand what the network discovered?
Predictability. If models can suddenly develop capabilities after extended training, how do we anticipate what future models will do? This is a live concern in AI safety research.
Why This Matters (And Why Hype Is Dangerous)
Grokking is genuinely interesting for several reasons:
It demonstrates that neural networks can discover elegant mathematical solutions that humans wouldn't design. The Fourier transform approach to modular addition is beautiful and unexpected.
It provides a testbed for mechanistic interpretability — the project of understanding what neural networks actually compute, not just what they output.
It raises important questions about phase transitions in AI capabilities and the difficulty of predicting model behavior.
But extrapolating from these findings to claims about large language models achieving "genuine understanding" crosses into speculation unsupported by evidence.
The AI industry has a pattern of taking narrow, interesting research and inflating it to serve commercial narratives. Grokking is a fascinating phenomenon that deserves serious study. It doesn't need — and isn't helped by — breathless claims about machines finally "getting it."
The Bottom Line
What we know: Neural networks can exhibit delayed generalization, transitioning from memorization to genuine pattern recognition after extended training. In small models on algorithmic tasks, researchers have reverse-engineered what the networks learned — often elegant mathematical solutions.
What we don't know: Whether grokking explains LLM capabilities, whether the phenomenon scales in meaningful ways, and what most grokking networks are actually computing.
What to watch for: Research connecting grokking dynamics to interpretability and safety; studies examining whether grokking-like transitions occur during capability emergence in large models; work on predicting when and how phase transitions happen.
The story of AI advancement isn't a series of breakthrough moments where machines suddenly achieve understanding. It's a slow accumulation of surprising findings, partial explanations, and uncomfortable questions. Grokking adds to that story. It doesn't resolve it.
Grokking is a genuine and fascinating phenomenon — neural networks transitioning from memorization to real generalization after extended training. But what we know is narrow, what we don't know is vast, and the gap between careful research and industry hype is enormous. Understanding both sides matters.
Written by Stephen Klein, Founder/CEO of Curiouser.AI
Stephen Klein is Founder/CEO of Curiouser.AI — building AI to amplify human intelligence, not replace it. He teaches at Berkeley and is writing a book with Georgetown on post-automation strategy. Curiouser is community-funded on WeFunder.