Contextual embeddings and Transformers

Embeddings helped NLP move from word counts to meaning vectors.

This is part 3 of a three-part introduction. Part 1 covers NLP, NLU, NLG, and core language tasks. Part 2 covers the move from counting words to embeddings.

But early word embeddings still had a problem.

They gave each word one fixed vector.

Language does not work that way.

The same word can mean different things depending on the sentence.

"The bank approved my loan."
"The boat stopped near the river bank."

In the first sentence, bank is about finance.

In the second sentence, bank is about land beside water.

This is the reason contextual embeddings and Transformers became so important.

Contextual embeddings move the same word by sentence

The bank approved my loan.
bank -> money, loan, account

The boat stopped near the river bank.
bank -> river, water, shore

The token bank starts from the same word, then context pulls it toward a different meaning.

Static embeddings are not enough

A static word embedding gives one fixed vector per word.

Example:

bank -> [0.12, -0.44, 0.91, ...]

That vector does not change between these sentences:

"I deposited money in the bank."
"I sat near the river bank."

Static embeddings are useful because they can place related words near each other.

dog close to cat
car close to vehicle
happy close to joyful

But static embeddings cannot fully answer the question:

What does this word mean here?

They mostly answer:

What does this word usually mean?

For many words, that is not enough.
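The limitation can be sketched in a few lines of Python. This is a minimal illustration, not a real embedding model: the three-dimensional vectors are invented, and `static_vectors`, `embed`, and `cosine` are hypothetical names chosen for this example.

```python
import math

# Toy static embedding table; the numbers are invented for illustration.
static_vectors = {
    "bank": [0.12, -0.44, 0.91],
    "dog": [0.80, 0.10, 0.05],
    "cat": [0.75, 0.20, 0.00],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed(sentence, word):
    """Static lookup: the sentence is ignored, only the word matters."""
    return static_vectors[word]

finance = embed("I deposited money in the bank.", "bank")
river = embed("I sat near the river bank.", "bank")

print(finance == river)  # True: same vector in both sentences
print(cosine(static_vectors["dog"], static_vectors["cat"]))   # high: related words
print(cosine(static_vectors["dog"], static_vectors["bank"]))  # low: unrelated words
```

The lookup never sees the sentence, which is exactly why it cannot tell the two senses of bank apart.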

Contextual embeddings

A contextual embedding gives a different vector for the same word depending on the sentence.

So bank can have different representations:

bank + loan  -> close to money, account, credit
bank + river -> close to water, land, shore

This is the key difference:

Static embedding:
"What does this word usually mean?"
 
Contextual embedding:
"What does this word mean in this exact sentence?"

Contextual embeddings made NLP much more powerful because the model could represent words based on surrounding words, not only based on a fixed dictionary-like representation.

Common models that produce contextual embeddings:

BERT
RoBERTa
DistilBERT
ELECTRA
T5
GPT-style models

Common libraries for working with them:

Hugging Face Transformers
SentenceTransformers
spaCy transformer pipelines

Where embeddings fit in the pipeline

Embeddings are the bridge between text and math.

A computer cannot directly understand this sentence:

"I love this movie."

So NLP systems convert text into numbers.

The pipeline looks like this:

Raw text
-> tokenization
-> token IDs
-> embeddings
-> model processing
-> output

The embedding layer converts tokens into vectors.

Example:

"I"     -> [0.12, -0.44, 0.08, ...]
"love"  -> [0.91, 0.33, -0.10, ...]
"this"  -> [0.04, -0.22, 0.76, ...]
"movie" -> [0.62, 0.18, -0.31, ...]

Now the model can process the sentence mathematically.
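The first steps of that pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the five-entry `vocab`, the whitespace `tokenize` helper, and the random `embedding_table` are all hypothetical. Real systems use subword tokenizers and learned embedding weights.

```python
import random

# Hypothetical toy vocabulary; real tokenizers use subword vocabularies.
vocab = {"<unk>": 0, "i": 1, "love": 2, "this": 3, "movie": 4}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, split on whitespace."""
    return [word.strip(".,!?").lower() for word in text.split()]

def to_ids(tokens):
    """Map each token to its vocabulary ID, falling back to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]

# Embedding table: one row per vocabulary entry.
# Random here for illustration; learned during training in practice.
random.seed(0)
embedding_dim = 4
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]

tokens = tokenize("I love this movie.")
token_ids = to_ids(tokens)
vectors = [embedding_table[i] for i in token_ids]

print(tokens)     # ['i', 'love', 'this', 'movie']
print(token_ids)  # [1, 2, 3, 4]
print(len(vectors), len(vectors[0]))  # 4 4
```

The embedding step is only a table lookup: each token ID selects one row of numbers.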

In classic NLP, embeddings could be used with traditional ML models:

Text
-> embeddings
-> classifier
-> sentiment result

In modern NLP, embeddings are part of deep learning models:

Text
-> tokens
-> token embeddings
-> self-attention
-> contextual representations
-> output

So embeddings can appear in two main places:

1. Inside a model, as the first representation layer.
2. Outside a model, for search, retrieval, clustering, and similarity.
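The second use, embeddings outside a model, can be sketched as a tiny similarity search. The two-dimensional `word_vectors` are invented, and mean pooling is only one simple way to turn word vectors into a text vector. The helper names are hypothetical, and the sketch assumes every text contains at least one known word.

```python
import math

# Toy 2-D word vectors (axis 0 ~ finance, axis 1 ~ nature); values are invented.
word_vectors = {
    "loan": [0.9, 0.1], "money": [0.8, 0.0], "credit": [0.85, 0.05],
    "river": [0.1, 0.9], "water": [0.0, 0.8], "boat": [0.1, 0.7],
}

def embed_text(text):
    """Mean-pool the vectors of known words into one text vector."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

documents = ["loan money credit", "river water boat"]
doc_vectors = [embed_text(d) for d in documents]

def search(query):
    """Return documents sorted from most to least similar to the query."""
    q = embed_text(query)
    scored = [(cosine(q, v), d) for d, v in zip(documents, doc_vectors)]
    return [d for _, d in sorted(scored, reverse=True)]

print(search("money credit"))       # finance document ranks first
print(search("boat on the river"))  # river document ranks first
```

The ranking depends only on how close the vectors are, which is why this kind of search can work without relying solely on exact keyword overlap.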

The problem Transformers helped solve

To understand bank in this sentence:

"The bank approved my loan."

the model should pay attention to:

approved
loan

To understand bank in this sentence:

"The boat stopped near the river bank."

the model should pay attention to:

boat
river
near

So the problem is:

How can the model decide which words matter for understanding each word?

The answer is self-attention.

Self-attention lets tokens look at useful context

The bank approved my loan.

bank -> approved
bank -> loan

The boat stopped near the river bank.

bank -> river
bank -> boat

The model can give more weight to the allowed context tokens that help explain the current token.

Self-attention formula

Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k)) V

Plain version: Compare queries with keys to get attention weights, then use those weights to combine value vectors.

This is the scaled dot-product attention formula used inside Transformer attention. Masking rules decide which tokens are allowed to be attended to.
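The formula can be written out directly in plain Python. This is a minimal sketch for small lists of row vectors, with masking left out for brevity. The `attention` and `softmax` helpers are hypothetical names, and the example inputs are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Compare this query with every key, then scale by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Combine the value vectors using the attention weights.
        mixed = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        all_weights.append(weights)
        outputs.append(mixed)
    return outputs, all_weights

# Invented inputs: three tokens, two dimensions each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(X, X, X)

for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each output row is a weighted average of the value vectors, so a token's new representation is built from the tokens it attends to most.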

Self-attention

Self-attention means each token can use other allowed tokens in the sequence to update its representation.

In a bidirectional encoder, that can include surrounding tokens on both sides. In a decoder-only model generating text, it usually means previous tokens, not future tokens.
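The difference between the two cases comes down to a mask. The sketch below builds both masks for one sentence and applies them to equal raw scores, so that only the mask affects the result. The helper names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; masked (-inf) scores get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "bank", "approved", "my", "loan"]
n = len(tokens)

# Bidirectional (encoder-style): every token may attend to every token.
bidirectional_mask = [[True] * n for _ in range(n)]
# Causal (decoder-style): token i may attend only to positions 0..i.
causal_mask = [[j <= i for j in range(n)] for i in range(n)]

def masked_weights(scores, allowed):
    """Set disallowed positions to -inf before the softmax."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    return softmax(masked)

bank = tokens.index("bank")
equal_scores = [0.0] * n

print(masked_weights(equal_scores, bidirectional_mask[bank]))  # [0.2, 0.2, 0.2, 0.2, 0.2]
print(masked_weights(equal_scores, causal_mask[bank]))         # [0.5, 0.5, 0.0, 0.0, 0.0]
```

Under the causal mask, bank cannot see approved or loan because they come later in the sentence. Under the bidirectional mask it can.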

At first, the model has a basic vector for each token:

The
bank
approved
my
loan

The word bank starts with a general representation.

Then self-attention updates it using other useful words in the sentence.

In this sentence:

"The bank approved my loan."

bank may pay attention to:

approved
loan

So the model can represent bank as a financial institution.

In this sentence:

"The boat stopped near the river bank."

bank may pay attention to:

boat
river
near

So the model can represent bank as the side of a river.

The core idea:

initial token embedding
-> self-attention
-> contextual embedding

Self-attention is what lets the model update each token based on context.
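A toy version of that update shows the same starting vector for bank ending up in two different places. This is a rough sketch under strong assumptions: the two-dimensional embeddings in `E` are invented, there is a single attention step, and the raw embeddings are used as queries, keys, and values. A real Transformer uses learned projection matrices for all three.

```python
import math

# Toy 2-D embeddings (axis 0 ~ finance, axis 1 ~ nature); values are invented.
E = {
    "the": [0.1, 0.1], "my": [0.1, 0.1], "near": [0.1, 0.3], "stopped": [0.2, 0.2],
    "bank": [0.5, 0.5],  # deliberately ambiguous starting point
    "approved": [0.8, 0.1], "loan": [1.0, 0.0],
    "boat": [0.1, 0.8], "river": [0.0, 1.0],
}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens, index):
    """One attention step for tokens[index], using raw embeddings as Q, K, and V."""
    vectors = [E[t] for t in tokens]
    query = vectors[index]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    weights = softmax(scores)
    mixed = [sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(len(query))]
    return mixed, weights

loan_sentence = ["the", "bank", "approved", "my", "loan"]
river_sentence = ["the", "boat", "stopped", "near", "the", "river", "bank"]

bank_finance, loan_weights = contextualize(loan_sentence, loan_sentence.index("bank"))
bank_river, river_weights = contextualize(river_sentence, river_sentence.index("bank"))

print([round(x, 3) for x in bank_finance])  # finance axis is larger
print([round(x, 3) for x in bank_river])    # nature axis is larger
```

Both calls start from the same `E["bank"]` vector. Only the surrounding tokens differ, and that is enough to pull the result toward a different sense.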

Token embeddings and positional embeddings

In Transformers, the model needs at least two kinds of information.

First, it needs to know what the token is.

That comes from token embeddings.

bank  -> meaning vector
loan  -> meaning vector
river -> meaning vector

Second, it needs to know where the token appears in the sentence.

That comes from positional embeddings or positional encoding.

Token plus position formula

x_i = token_embedding_i + position_embedding_i

Plain version: For each position i, the model combines what the token is with where it appears.

Some Transformer variants encode position differently, for example with relative or rotary position embeddings, but this sum is the simplest mental model.

Position matters because word order changes meaning.

"Dog bites man."
"Man bites dog."

The sentences contain the same words, but the meanings are different.

So the model needs both:

token meaning + token position

That helps the Transformer represent not just which words appear, but how they are arranged.
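The sum can be sketched as follows. The token vectors are invented. For the position vectors this example uses the sinusoidal encoding from the original Transformer paper, which is only one option (many models learn a position table instead). The `embed` helper and its `use_position` flag are hypothetical names for this illustration.

```python
import math

# Toy token embeddings; the numbers are invented for illustration.
token_embedding = {
    "dog": [0.9, 0.1, 0.0, 0.2],
    "bites": [0.1, 0.8, 0.3, 0.0],
    "man": [0.2, 0.1, 0.9, 0.4],
}

def position_embedding(pos, dim):
    """Sinusoidal position vector: sin/cos pairs at decreasing frequencies."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec[:dim]

def embed(tokens, use_position=True):
    """x_i = token_embedding_i + position_embedding_i for every position i."""
    rows = []
    for i, token in enumerate(tokens):
        tok = token_embedding[token]
        pos = position_embedding(i, len(tok)) if use_position else [0.0] * len(tok)
        rows.append([t + p for t, p in zip(tok, pos)])
    return rows

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions, "dog" has the same input vector in both sentences.
print(embed(a, use_position=False)[0] == embed(b, use_position=False)[2])  # True
# With positions, "dog" at position 0 differs from "dog" at position 2.
print(embed(a)[0] == embed(b)[2])  # False
```

Attention on its own treats its input as an unordered set of vectors, so the position term is what lets the model tell who bit whom.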

The Transformer flow

A simplified Transformer flow looks like this:

Text
-> tokens
-> token embeddings + positional embeddings
-> self-attention layers
-> contextual representations
-> output

The important part is that the model does not only look at isolated words.

It learns relationships between words.

That is why Transformers became useful for tasks like:

translation
summarization
question answering
chatbots
code generation
semantic search
document understanding
reasoning over long text

Most modern LLMs are built on this general idea. They do not only match keywords. They learn patterns in language, use context, and generate text based on what they have learned.

A simplified Transformer flow

tokens: The, bank, approved, my, loan
-> embeddings: token meaning + token position
-> attention
-> contextual representation for each token

The important shift is from isolated token vectors to contextual representations updated by attention.

The full evolution

Now the whole story connects.

The evolution of NLP looks like this:

One-hot encoding
-> Bag of Words
-> TF-IDF
-> static word embeddings
-> contextual embeddings
-> Transformers and LLMs

Each step keeps something useful from the previous one and fixes a major weakness.

One-hot encoding represents word identity, but it has no built-in meaning or similarity.

Bag of Words represents which words appear in a document, but it ignores order and context.

TF-IDF represents which words are important in a document, but it is still mostly keyword-based.

Static embeddings represent general word meaning, but the same word has the same vector everywhere.

Contextual embeddings represent word meaning in context, but they are more expensive and complex than older methods.

Transformers use self-attention to model relationships between tokens, which makes them much better at language understanding and generation.
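The first two weaknesses are easy to see in code. This is a small sketch with a hypothetical three-word vocabulary. The `one_hot` and `bag_of_words` helpers are written for this example and are not taken from any library.

```python
from collections import Counter

vocab = ["dog", "bites", "man"]

def one_hot(word):
    """A vector with a single 1 at the word's vocabulary index."""
    return [1 if word == v else 0 for v in vocab]

def bag_of_words(sentence):
    """Count how often each vocabulary word appears, ignoring order."""
    counts = Counter(sentence.lower().strip(".").split())
    return [counts[v] for v in vocab]

# One-hot vectors of different words never overlap, so similarity is always 0.
overlap = sum(a * b for a, b in zip(one_hot("dog"), one_hot("man")))
print(overlap)  # 0

# Bag of Words cannot tell these two sentences apart.
print(bag_of_words("Dog bites man."))  # [1, 1, 1]
print(bag_of_words("Dog bites man.") == bag_of_words("Man bites dog."))  # True
```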

Final mental model

The simplest way to understand NLP's evolution is this:

Traditional NLP counted words.
Embeddings represented meaning.
Transformers modeled meaning in context.

Older NLP methods asked:

Which words appear?
How often do they appear?
Are these exact keywords present?

Embeddings asked:

What does this word or text mean?
Which words or texts are close in meaning?

Contextual embeddings asked:

What does this word mean in this exact sentence?

Transformers added:

Which other words should each token pay attention to in order to understand the full meaning?

A simple recap:

Bag of Words = words as counts
TF-IDF = words as importance scores
Static embeddings = words as general meaning
Contextual embeddings = words as meaning in context
Transformers = language understanding through attention

That is why modern NLP feels so different from older NLP.

It is no longer only about matching keywords.

It is about representing meaning, using context, and learning relationships between words.

The field evolved from:

counting words

to:

understanding words in context

That shift is what made modern language models possible.