Well-behaved Ghosts
Mainstream pundits are stuck on the willful blindness of “it’s just predicting the next word”. – Leopold Aschenbrenner.
There is a tension in the understanding of AI models, between the mathematical intuition behind them and their success in application. On one hand, the transformer architecture1 doesn’t seem that enlightening compared to existing language models such as Long Short-Term Memory. It is another neuro-inspired model that is statistically trained. And statistics is basically counting, right? On the other hand, these models can generate such coherent writing, accurate knowledge, and even intelligent remarks.
What is the intuition for the fact that these “counting algorithms” can become so smart? Here is a series of intuitions that I find helpful.
Intuition #1: Next-token prediction is a helpful framework to apply scientific method on AI
Next-token prediction is a definition of the language modeling problem. This is why next-token predictors are also called language models.
- It makes AI testable, in many ways.
- In philosophy of science, testability is a defining feature of good theories. Without it, you cannot verify the theories. If you can’t verify, you can’t make progress in understanding the problem at hand.
- Next-word prediction can be noisily tested using internet data. Every word we say every day is a test for AI models. It tests whether an AI can speak like us. It is an unforgiving test, because there is usually only one right answer. Even though the exact next word is somewhat arbitrary, getting most of them right means you understand a lot about the human world.
- And many traditional machine learning tasks can also be cast into next-word prediction (see the T5 paper by Raffel et al.). That means a next-word predictor can be applied to virtually any machine learning problem.
- It is general enough to cover many aspects of intelligence. Token generation is a highly expressive representation of human cognitive and physical activities.
- Human activities, in many ways, are also about generating the next token.
- Actions are tokens.
- Image drawings can be represented as tokens2.
- Thoughts, both verbal and visual, can thus be seen as reasoning tokens.
- Feelings are hidden vectors, maybe? When feelings are expressed, they are words, which are tokens.
- It leaves the internal structure of the system completely undefined. Anyone with any idea can implement it into a system and test it.
- Besides transformers, there are many other language modeling architectures: n-grams, naive Bayes, logistic regression (any classical classification model applied to classifying the next word within a vocabulary), recurrent neural networks, long short-term memory networks, etc.
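To make “counting” concrete before the next intuition, here is a minimal sketch of the simplest architecture on that list, a bigram model. The function names and the toy corpus are my own illustrative choices; the point is that this model literally does nothing but count which word follows which:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the most frequent follower of `word` (None if unseen)."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(predict_next(model, "sat"))  # "on", since "on" follows "sat" in every sentence
```

This is the class of model for which “it’s just counting” is a fair description; the next intuition explains why transformers are nothing like it.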
Intuition #2: Transformers are much, much, much more sophisticated than counting co-occurrences
Counting is only an appropriate description of n-gram models – a very basic class of language models.
- For visualizations of the maddening complexity of language models, watch the Deep Learning series by 3Blue1Brown.
- A neural language model (which is what a transformer is) learns not by counting, but by (1) back-propagating gradients of (2) a loss defined over (3) tons of data.
- Backpropagation gets sophisticated quickly. Think of backpropagating through a 3-layer neural network – an exercise all machine learning students have to do manually. Chain-ruling the loss-originated gradients through the network makes things complicated very fast. Progress had to wait for efficient backpropagation frameworks such as TensorFlow and PyTorch to appear.
- The loss values follow directly from the brilliant problem definition above – next-token prediction. In the simplest view, the loss is 1 if the model predicts the wrong word and 0 otherwise; in practice, a smooth version of this idea, the cross-entropy loss, is what gets minimized. Mathematically minimizing the loss qualitatively means obtaining a human-like language model.
- And the scale of data you can use to adjust the model weights is huge. That simply means you repeat the model training process a great many times.
- Given these sophisticated techniques, we cannot rule out the possibility that we have found a way to train models that truly generalize.
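The three ingredients above – backpropagation, a next-token loss, and repeated weight updates – can be sketched end-to-end at a toy scale. This is a hand-derived illustration, not how any production framework works: a one-table “model” whose weights are adjusted using the gradient of the cross-entropy loss, which for a softmax output is simply the predicted probabilities minus the one-hot target. The tiny vocabulary and all names are my own assumptions:

```python
import math

vocab = ["cat", "sat", "on"]
V = len(vocab)
# W[c][t]: logit (unnormalized score) for predicting token t after context token c
W = [[0.0] * V for _ in range(V)]

def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_step(context, target, lr=0.5):
    """One backprop step: d(cross-entropy)/d(logit_t) = prob_t - onehot_t."""
    probs = softmax(W[context])
    loss = -math.log(probs[target])
    for t in range(V):
        grad = probs[t] - (1.0 if t == target else 0.0)
        W[context][t] -= lr * grad  # gradient descent on the weights
    return loss

# Teach the model that "cat" (index 0) is followed by "sat" (index 1);
# repeating the update step drives the loss toward zero.
losses = [train_step(0, 1) for _ in range(50)]
print(round(losses[0], 3), round(losses[-1], 3))  # loss shrinks over training
```

Scaling this loop from a 3-by-3 table to billions of weights and trillions of tokens is, conceptually, what “training a language model” means.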
Intuition #3: Recursive abstraction will bring you very, very far.
- (A game-changing insight) The AI research community is doing an inductive engineering of intelligence.
- What is inductive engineering? A big impressive capability is usually composed of several smaller and less impressive capabilities. For example:
- Being good at a physical sport typically means being good at three smaller areas – stamina, footwork, and hand-eye coordination3.
- Being good at coding means being good at the programming language, algorithms and data structures, testing, and keyboard typing4.
- Being an intelligent being, thus, can also be broken into several subskills. And because the process is inductive, you continue breaking each skill down into dead-simple things that a computer can do.
- Why is this a good way to look at recent AI progress?
- Take NLP as an example. When you make sure a model understands words correctly, then sentence semantics, then pragmatics, then knowledge representation, etc., you gradually have a linguistically fluent model.
- Take mathematical reasoning as an example. Math is a composable skill. The early skills are addition, subtraction, multiplication, division, and so on. Once a model does those well, it almost never loses those skills, and can move on to more complex ones.
- Take software engineering as an example. Once a model codes simple functions correctly, it can string them into more complex functions.
- Abstraction will bring you very far.
- In the classic5 computer science textbook Structure and Interpretation of Computer Programs (SICP), there is a very impressive perspective: very complex computer programs can be built from very simple program structures through the process of abstraction.
- If an AI does the small things well, what prevents it from doing the big things as well?
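The SICP perspective can be illustrated in a few lines. The functions below are hypothetical toys of my own, but they show the shape of the argument: each layer is built only out of the layer beneath it, so a capability that looks big from the outside is never harder to get right than its dead-simple parts:

```python
def add(a, b):
    """A dead-simple primitive."""
    return a + b

def multiply(a, b):
    """Built purely from repeated addition (b must be a non-negative int)."""
    total = 0
    for _ in range(b):
        total = add(total, a)
    return total

def power(a, b):
    """Built purely from repeated multiplication (b must be a non-negative int)."""
    result = 1
    for _ in range(b):
        result = multiply(result, a)
    return result

print(power(2, 10))  # 1024, reached without ever writing anything beyond loops and add
```

If each layer is correct, correctness of the whole stack follows by induction – the same structure the essay attributes to AI capabilities.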
Remark
So what is a good way to think about AIs? AIs are well-behaved ghosts that get smarter every day. They are like the White Walkers in Game of Thrones. They are getting better at logical things. They gradually come to resemble emotional beings in many cases. They are not humans in the biological sense, and are more similar to ghosts. They are smart, omnipresent, and intangible, all at once.
Footnotes
-
Transformers are what we technically mean by “AI models”. For a non-technical person to wrap their head around this concept, think of it as a simulated brain. In recent years, computer scientists have figured out a way to utilize computer chips (CPUs and GPUs) to extract “intelligence matter” from internet data dumps and store it in a structure they call the “transformer architecture”. These chips, with intelligence matter inside, can go on like a sentient being that speaks, learns, and acts in the digital world. They are the physical representation of AI models. ↩
-
Even though AI models currently represent images by dividing them into 2D patches, each becoming a token, it feels like there could be more intuitive ways to do it. ↩
-
Source: Vincent Pham, former staff at FUV ↩
-
Source: Competitive Programming 3, Felix and Steven Halim ↩
-
…that I only read a chapter or two ↩