I just learnt about what the T in GPT is all about, and it's super interesting
Published: August 14th 2025
If you come from the Machine Learning (ML) world, this is basic, I'm sure.
For a software dev like me, it's all new, and reading about it is like learning about one of those pivotal moments in software history.
It reminds me of the time I learned about Ajax and its history, when I was first getting into software. Those moments of awe and intrigue, learning about fundamental changes in how we do things.
So, the T in GPT stands for Transformer - the core neural network architecture that underlies virtually all modern LLMs.
Transformers were not always used. They were a breakthrough.
The key innovation of transformers is the attention mechanism, which lets the model look at each word and weigh how relevant every other word in the text is to it. So it builds up a sort of mapping of how each word relates to every other word.
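To make that concrete, here is a minimal sketch of scaled dot-product attention (the core operation inside a transformer) in plain NumPy. The tiny sizes and random weight matrices are made up purely for illustration; in a real model the embeddings and projection weights are learned during training, and many attention heads and layers are stacked on top of each other.

```python
import numpy as np

# Toy setup: 4 "words", each represented by an 8-dimensional vector.
# These embeddings and weights are random stand-ins for learned values.
np.random.seed(0)
num_words, dim = 4, 8
embeddings = np.random.randn(num_words, dim)

# Each word is projected into a query, a key, and a value vector.
W_q = np.random.randn(dim, dim)
W_k = np.random.randn(dim, dim)
W_v = np.random.randn(dim, dim)
Q = embeddings @ W_q
K = embeddings @ W_k
V = embeddings @ W_v

# Every word's query is compared against every word's key, giving a
# score for how relevant each word is to each other word.
scores = Q @ K.T / np.sqrt(dim)              # shape: (4, 4)

# Softmax turns each row of scores into weights that sum to 1: the
# "mapping" of how much each word attends to every other word.
scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each word's output is a weighted mix of all the value vectors.
output = weights @ V                         # shape: (4, 8)

print(weights.round(2))  # one row per word: its attention over all words
```

Each row of `weights` sums to 1 and says how strongly that word attends to every other word, which is exactly the word-to-word mapping described above.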
Previous architectures processed text sequentially.
That is how Recurrent Neural Networks (RNNs), the dominant architecture for language tasks before transformers came about, worked.
Since RNNs can only process text sequentially, they cannot parallelize in the same way that transformers can. For an RNN to process any part of a piece of text, it must first process everything that comes before it. Transformers don't have that restriction. They can process all words together at the same time.
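Here is a rough sketch of that difference, again with made-up toy shapes in NumPy. The point is just the structure of the computation: the RNN's hidden state at each step depends on the previous step, so it has to loop through the words in order, while the transformer-style pairwise scores come out of a single matrix multiplication over all positions at once.

```python
import numpy as np

np.random.seed(0)
num_words, dim = 4, 8
embeddings = np.random.randn(num_words, dim)  # 4 toy "words"

# A bare-bones RNN cell: each hidden state depends on the previous one,
# so the words must be processed one after another, in order.
W_x = np.random.randn(dim, dim)
W_h = np.random.randn(dim, dim)

h = np.zeros(dim)
for x in embeddings:                 # sequential: step t needs step t-1
    h = np.tanh(x @ W_x + h @ W_h)

# By contrast, attention-style pairwise scores for *all* words come out
# of one matrix multiplication, with no loop over positions at all.
scores = embeddings @ embeddings.T   # shape: (4, 4)
```

The loop is the bottleneck: you cannot compute step 3 before step 2 is done, no matter how many GPU cores you have. The single matrix multiply has no such ordering, which is what lets transformers use the hardware so much more efficiently.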
This ability to parallelize makes transformer training dramatically faster on modern hardware like GPUs.
Also, RNNs often "forget" information from early in long sequences, while transformers can directly connect any two positions through attention.
The result?
RNNs basically got completely replaced by transformers.
Here is the timeline:
2017 - The Breakthrough - The "Attention is All You Need" paper introduced the transformer architecture. Initially, many researchers were skeptical that you could abandon recurrence entirely. The 8 people who wrote this paper are now known as the "Transformer 8".
2018 - Early adoption - BERT was one of the first major successes showing transformers could dramatically outperform RNN-based models on many NLP benchmarks. This was a wake-up call for the field.
2019-2020 - The tipping point - the transition really accelerated.
→ GPT-2 demonstrated impressive text generation capabilities
→ More transformer variants appeared (RoBERTa, ALBERT, etc.)
→ Research papers increasingly focused on transformer architectures
By this point, transformers became the default choice for most new NLP projects.
2020 onwards - Complete dominance - Large language models like GPT-3 made it clear that transformers were the path forward for scaling up language models.
Today - RNNs are rarely used for mainstream natural language tasks. You might still see them in very specific applications, but for general tasks, transformers reign supreme.