Transformers: A Closer Look at Encoders, Decoders, and Embeddings

Alright, this is the second lesson on transformer architecture. Last time, we kept things high level—two big parts: encoder and decoder. The encoder reads the input and turns it into embeddings. The decoder takes those embeddings and produces output text. Simple enough, right? But as with most things, the details matter. Let’s walk through them slowly.

Tokens First

Before we get lost in encoders and decoders, let’s pause on tokens. Because large language models don’t really “see” words in the way you or I do. They see tokens.

Now, a token might be a word. Sometimes it’s only part of a word. And punctuation can be a token too. For instance, “apple” is one token. But “friendship” is actually two—“friend” and “ship.”

The number of tokens per sentence isn’t fixed. For a simple phrase, you might get one token per word. For trickier words, maybe two or three tokens each. And don’t forget commas, periods, and so on—they’re tokens as well.

So if you look at a sentence broken into tokens, you’ll notice some words standing alone and others split up. It feels almost mechanical, but it’s a big deal. Because every single thing that happens next—embeddings, vector databases, even text generation—starts with tokens.
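To make the splitting concrete, here's a toy greedy subword tokenizer. This is only a sketch: real tokenizers (BPE, WordPiece, and friends) learn their vocabularies from data, while this hand-picked vocabulary just mirrors the examples above.

```python
# Toy greedy subword tokenizer. The vocabulary below is hypothetical,
# chosen by hand to reproduce the "friendship" -> "friend" + "ship" example.
VOCAB = {"friend", "ship", "apple", "they", "sent", "me", "a", ",", "."}

def tokenize(text):
    """Split text into tokens by greedily matching the longest vocab entry."""
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest possible piece first, shrinking until a match.
            for j in range(len(word), i, -1):
                if word[i:j] in VOCAB:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No vocab entry matches: emit the character on its own.
                tokens.append(word[i])
                i += 1
    return tokens

print(tokenize("friendship"))  # -> ['friend', 'ship']
print(tokenize("apple"))       # -> ['apple']
```

Production tokenizers are cleverer about how they build the vocabulary, but the basic idea is the same: known words stay whole, unknown ones get broken into familiar pieces.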

Turning Tokens into Embeddings

Once we’ve got tokens, the model needs to do something useful with them. That’s where embeddings come in. An embedding is just a number-based representation of text. Could be a word, a phrase, a sentence, even a whole paragraph.

Why numbers? Because computers don’t work with “friend” or “apple.” They work with values and patterns. Embeddings translate words into vectors—little chunks of numbers—that preserve meaning.

Take the phrase: “they sent me a.” Each word becomes a token, then each token is mapped into a vector. You end up with four vectors for the four tokens, plus sometimes an extra one that represents the whole sentence.

That vector space is where similarity lives: vectors that point in similar directions represent related meanings. It's how the model "knows" that car and truck are closer in meaning than car and banana.
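You can see the idea with a few made-up vectors and cosine similarity, the usual way of comparing embedding directions. The 3-d vectors below are hand-assigned for illustration; real models use hundreds or thousands of dimensions learned from data.

```python
import math

# Hand-assigned toy embeddings: "car" and "truck" point in similar
# directions, "banana" points elsewhere. Purely illustrative values.
embeddings = {
    "car":    [0.9, 0.8, 0.1],
    "truck":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point the same way, near 0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["car"], embeddings["truck"]))   # high
print(cosine_similarity(embeddings["car"], embeddings["banana"]))  # low
```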

Why This Matters (Semantic Search, RAG, and All That)

So what do embeddings actually do for us? One big answer: semantic search. Instead of just matching keywords, embeddings let a system search by meaning. If you ask about “car repair,” the model can pull up stuff about “vehicle maintenance” even if the exact words don’t appear.

This becomes powerful when combined with vector databases. Imagine storing embeddings of every document in a big collection. When a user asks a question, their query is also encoded into a vector. The system compares that vector with the stored ones and finds the closest matches.

Put this together with large language models and you get retrieval-augmented generation (RAG). That’s a fancy way of saying: first fetch the most relevant content, then let the model use that along with its own training knowledge to answer. Without embeddings, that workflow falls apart.
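The retrieval step above can be sketched in a few lines. This pretend "vector database" is just a list with hand-assigned embeddings; in a real system the vectors would come from an encoder model and the store would be a proper vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend vector store: each document paired with a hand-assigned embedding.
store = [
    ("Vehicle maintenance schedules and tips", [0.9, 0.7, 0.1]),
    ("Banana bread recipes",                   [0.1, 0.1, 0.9]),
    ("How engines work",                       [0.8, 0.5, 0.2]),
]

def retrieve(query_vector, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda doc: cosine(query_vector, doc[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A query like "car repair", with a hand-assigned embedding that lands near
# the vehicle documents even though the words don't literally match.
print(retrieve([0.85, 0.6, 0.15]))
```

In a RAG pipeline, whatever `retrieve` returns gets pasted into the model's prompt, so the generated answer is grounded in the fetched content.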

Decoders: Where Text Gets Generated

Now, about decoders. If encoders are readers, decoders are writers. Their main job is to look at a sequence of tokens and predict the next one.

For example, if you give it "they sent me a," the decoder figures out the most likely next token. Maybe it says "lion." Why lion? Because the model assigns a probability to every token in its vocabulary, and in its learned distribution "lion" scores higher than most alternatives in that spot.

One thing to remember: decoders output only one token at a time. So if we want a whole sentence, we have to keep looping: generate a token, feed it back in, generate the next, and so on. It's like typing one character at a time, but incredibly fast.
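That generate-and-feed-back loop can be sketched like this. The "decoder" here is a fake lookup table standing in for a neural network; the probabilities are invented to echo the "lion" example, not real model output.

```python
# Stand-in for a decoder: maps a context (tuple of tokens) to a made-up
# probability table over possible next tokens.
NEXT_TOKEN = {
    ("they", "sent", "me", "a"): {"lion": 0.4, "letter": 0.35, "gift": 0.25},
    ("they", "sent", "me", "a", "lion"): {"<end>": 1.0},
}

def next_token(context):
    """Pick the most likely next token for this context (greedy decoding)."""
    table = NEXT_TOKEN.get(tuple(context), {"<end>": 1.0})
    return max(table, key=table.get)

def generate(prompt_tokens, max_steps=10):
    """One token at a time: predict, append, feed the result back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok == "<end>":
            break
        tokens.append(tok)
    return tokens

print(generate(["they", "sent", "me", "a"]))
# -> ['they', 'sent', 'me', 'a', 'lion']
```

Real decoders don't always take the single most likely token; sampling strategies add variety. But the loop structure is exactly this.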

Encoder-Decoder Together

And then we’ve got the hybrid: encoder-decoder models. These are especially handy for sequence-to-sequence tasks like translation.

Think about translating English to French. First, the encoder reads the English sentence and turns it into embeddings. Then the decoder takes those embeddings and starts producing French tokens, one at a time. After each token is generated, it gets added back into the loop so the decoder knows what’s already been said and what might come next.

That loop continues until the full translation comes out. It’s not magic—it’s just a lot of careful steps repeated over and over.
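Here's the same loop written out as a sketch. Both the "encoder" and the "decoder" below are fake lookup-based stand-ins (in a real model both are neural networks, and the encoder's output is a set of embedding vectors), but the shape of the loop is faithful: encode once, then decode one token at a time, feeding each new token back in.

```python
def encode(source_tokens):
    """Pretend encoder: a real model would return embedding vectors here."""
    return tuple(source_tokens)

# Hypothetical decoder table: (encoded source, target so far) -> next token.
DECODER_TABLE = {
    (("i", "like", "cats"), ()): "j'aime",
    (("i", "like", "cats"), ("j'aime",)): "les",
    (("i", "like", "cats"), ("j'aime", "les")): "chats",
    (("i", "like", "cats"), ("j'aime", "les", "chats")): "<end>",
}

def translate(source_tokens, max_steps=10):
    """Encode the source once, then decode tokens until the end marker."""
    encoded = encode(source_tokens)
    output = []
    for _ in range(max_steps):
        nxt = DECODER_TABLE.get((encoded, tuple(output)), "<end>")
        if nxt == "<end>":
            break
        output.append(nxt)  # feed the new token back into the loop
    return output

print(translate(["i", "like", "cats"]))
# -> ["j'aime", 'les', 'chats']
```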

Wrapping It Up

So far, we’ve got three kinds of transformer setups:

  • Encoder-only models: They’re good at understanding text. Great for semantic search and the retrieval side of RAG.
  • Decoder-only models: These focus on generating text, token by token. Useful for writing tasks.
  • Encoder-decoder models: A mix of both, excellent for tasks like translation where input and output aren’t the same language.

All of them rely on tokens and embeddings. And once you see how they fit together, it becomes easier to follow why transformers have become the backbone of so many modern AI systems.