From counting words to embeddings

This is part 2 of a three-part introduction. Part 1 covers NLP, NLU, NLG, and core language tasks. Part 3 covers contextual embeddings and Transformers.

Before embeddings, many NLP systems represented text with frequency-based methods.

These methods did not represent meaning directly. They mostly looked at:

which words appear
how often they appear
which words appear together

The traditional flow looked like this:

Text
-> split into words
-> count words
-> use the numbers in a machine learning model

This worked for some tasks.

A model could learn that "love" often appears in positive reviews and "hate" often appears in negative reviews.

But it did not automatically know that these words are related:

car ~= automobile
cheap ~= affordable
movie ~= film

That is the problem embeddings later helped solve.

One-hot encoding

One of the earliest ways to represent words was one-hot encoding.

Imagine this vocabulary:

["cat", "dog", "car", "happy", "sad"]

Each word gets a vector with one 1 and all other positions set to 0.

cat   -> [1, 0, 0, 0, 0]
dog   -> [0, 1, 0, 0, 0]
car   -> [0, 0, 1, 0, 0]
happy -> [0, 0, 0, 1, 0]
sad   -> [0, 0, 0, 0, 1]

This tells the model which word appears, but not what the word means.

To the computer:

cat and dog = different
cat and car = different

There is no built-in idea that cat and dog are more related than cat and car.

That is the main weakness of one-hot encoding. It captures word identity, not word meaning.

Common tools for one-hot encoding:

scikit-learn
pandas
NumPy

One-hot vectors identify words, but do not explain them

       cat  dog  car  happy  sad
cat     1    0    0     0     0
dog     0    1    0     0     0
car     0    0    1     0     0
happy   0    0    0     1     0
sad     0    0    0     0     1

Each word owns one active position. Related words are still separate columns.

One-hot encoding formula

v_j = 1 if j = i
v_j = 0 if j != i

Plain version: For the selected word i, turn on exactly one position in the vector and leave every other position as zero.
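The formula can be sketched in a few lines of plain Python. This is a minimal illustration using the five-word vocabulary from the example above; the helper names one_hot and dot are made up for this sketch and are not from any library.

```python
# Vocabulary from the example above; a word's position in the list is its index.
vocabulary = ["cat", "dog", "car", "happy", "sad"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


print(one_hot("dog"))                       # [0, 1, 0, 0, 0]
# Any two different words have a dot product of 0, so "cat" looks exactly
# as unrelated to "dog" as it does to "car".
print(dot(one_hot("cat"), one_hot("dog")))  # 0
print(dot(one_hot("cat"), one_hot("car")))  # 0
```

The zero dot products are the weakness in numeric form: every pair of distinct words is equally unrelated.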

Bag of Words

One-hot encoding represents individual words.

But what if we want to represent a whole sentence or document?

That is where Bag of Words, or BoW, comes in.

Bag of Words represents a sentence or document by counting which words appear. It ignores word order, grammar, and context.

Imagine this vocabulary:

["I", "love", "hate", "this", "movie"]

Sentence:

"I love this movie"

Bag of Words vector:

I     -> 1
love  -> 1
hate  -> 0
this  -> 1
movie -> 1

As a vector:

[1, 1, 0, 1, 1]

Another sentence:

"I hate this movie"

Vector:

[1, 0, 1, 1, 1]

A machine learning model could learn:

love -> positive
hate -> negative

So Bag of Words worked reasonably well for simple tasks like spam detection and basic sentiment analysis.

But it has a major weakness. It ignores word order.

Example:

"Dog bites man."
"Man bites dog."

Once case and punctuation are normalized, Bag of Words sees exactly the same thing for both sentences:

dog   -> 1
bites -> 1
man   -> 1

Humans know the meanings are different.

That is why it is called a bag of words. The words are thrown into a bag, and the structure of the sentence is lost.
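Here is a small sketch of that loss of order. It assumes a very simple tokenizer (lowercase, letters only), which is a simplification chosen for this example rather than what any particular library does.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


bag_a = Counter(tokenize("Dog bites man."))
bag_b = Counter(tokenize("Man bites dog."))

# Both sentences produce the same word counts, so the two bags are equal.
print(bag_a == bag_b)  # True
```

Any model that only receives these counts has no way to tell which sentence it was given.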

Common tools for Bag of Words:

scikit-learn CountVectorizer
NLTK
pandas
NumPy

Bag of Words keeps counts and loses order

                    I  love  hate  this  movie
I love this movie   1   1     0     1      1
I hate this movie   1   0     1     1      1

The model can see that "love" or "hate" appears, but it does not preserve sentence structure.

Bag of Words count formula

x_j = count(w_j, document)

Plain version: For each vocabulary word w_j, store how many times that word appears in the document.
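The count formula maps directly onto code. The sketch below uses the five-word vocabulary from the example and a whitespace split as the tokenizer; both are simplifications for illustration, and bag_of_words is a made-up helper name. In practice a tool such as scikit-learn's CountVectorizer handles tokenization and vocabulary building for you.

```python
from collections import Counter

# Vocabulary from the example above. Matching is case-sensitive here so that
# "I" lines up with the vocabulary entry; real tokenizers usually normalize case.
vocabulary = ["I", "love", "hate", "this", "movie"]


def bag_of_words(document):
    """Return x where x[j] is how many times vocabulary[j] occurs in the document."""
    counts = Counter(document.split())
    # Counter returns 0 for words it has not seen; words outside the
    # vocabulary are simply ignored.
    return [counts[word] for word in vocabulary]


print(bag_of_words("I love this movie"))       # [1, 1, 0, 1, 1]
print(bag_of_words("I hate this movie"))       # [1, 0, 1, 1, 1]
print(bag_of_words("I love love this movie"))  # [1, 2, 0, 1, 1]
```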

Frequency-based methods

Bag of Words can also count how often words appear.

Example:

"good good good movie"

The vector may look like:

good  -> 3
movie -> 1

This helps the model notice that some words are more frequent than others.

For example, words like these may appear often in spam emails:

free
win
money
prize
urgent

A traditional ML model could learn:

many spam-like words -> probably spam

This was useful, but limited.

The system was not modeling the email's meaning directly. It was mostly learning statistical patterns from word frequency.
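To make "statistical patterns from word frequency" concrete, here is a deliberately crude sketch. It is a hand-written rule, not a trained model: the word set comes from the list above, and the threshold of 3 is a hypothetical value picked for the example. A real classifier would learn a weight for each word from labeled emails.

```python
from collections import Counter

# Spam-like words from the list above. A trained model would learn weights
# for these from labeled emails; this hand-written set is only a stand-in.
SPAM_WORDS = {"free", "win", "money", "prize", "urgent"}


def spam_word_count(email):
    """Count how many tokens in the email are in the spam-like word set."""
    counts = Counter(email.lower().split())
    return sum(counts[word] for word in SPAM_WORDS)


def looks_like_spam(email, threshold=3):
    """Flag the email when it has at least `threshold` spam-like tokens.

    The threshold is a hypothetical value chosen for this example.
    """
    return spam_word_count(email) >= threshold


print(spam_word_count("urgent win free money now"))  # 4
print(looks_like_spam("urgent win free money now"))  # True
print(looks_like_spam("lunch is free on friday"))    # False
```

Nothing in this code knows what the email is about. It only counts tokens, which is the limitation described above.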

TF-IDF

Simple word frequency has a problem.

Some words appear often but do not tell us much.

Examples:

the
is
and
of

These words are common, but they are usually not the most useful words for understanding what a document is about.

This is where TF-IDF helps.

TF-IDF stands for:

Term Frequency - Inverse Document Frequency

The idea is simple.

A word is important if:

it appears often in this document
but appears in few of the other documents

For example, in an article about AI, the word "transformer" may be important.

But words like "the", "is", "and", and "of" appear everywhere, so they get lower weight.

TF-IDF gives higher weight to words that are specific to a document.

"transformer" -> high weight
"the"         -> low weight

TF-IDF became useful for:

search
document classification
recommendation
information retrieval
keyword extraction

But TF-IDF still does not represent meaning directly.

Example:

Query:
"cheap laptop"
 
Document:
"affordable notebook computer"

A keyword-based system may not match them well because the exact words are different.

Humans know they are close in meaning:

cheap ~= affordable
laptop ~= notebook computer

This is one of the problems embeddings helped solve.
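The mismatch is easy to reproduce. The sketch below treats the query and the document as sets of lowercase words and checks what they share; keyword_overlap is a made-up helper for this example, not part of any search library.

```python
def keyword_overlap(query, document):
    """Return the set of lowercase words shared by the query and the document."""
    return set(query.lower().split()) & set(document.lower().split())


# No word is shared, so an exact-match system has nothing to score.
print(keyword_overlap("cheap laptop", "affordable notebook computer"))  # set()

# Changing one word in the document is enough to create a match.
print(keyword_overlap("cheap laptop", "cheap notebook computer"))       # {'cheap'}
```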

Common tools for TF-IDF:

scikit-learn TfidfVectorizer
NLTK
gensim

TF-IDF lowers common words and raises specific words

the         -> low weight
and         -> low weight
transformer -> high weight
attention   -> high weight

A word is useful when it is frequent in this document but rare across the rest of the collection.

TF-IDF formula

tfidf(t, d, D) = tf(t, d) * idf(t, D)

idf(t, D) = log(N / df(t))

Plain version: A word gets a high score when it appears often in this document but appears in fewer documents overall.

N is the number of documents in the collection D. df(t) is the number of documents in D that contain term t. Libraries often smooth this; scikit-learn's TfidfVectorizer, for example, defaults to log((1 + N) / (1 + df(t))) + 1.
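As a worked example, the sketch below applies the unsmoothed formula to a tiny corpus that was invented for this illustration. The function names tf, idf, and tfidf mirror the formula and are not a library API; library implementations usually add smoothing and normalization, so their numbers will differ.

```python
import math

# A tiny made-up corpus, already tokenized.
documents = [
    ["the", "transformer", "uses", "attention"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]


def tf(term, document):
    """Raw term frequency: how many times the term occurs in the document."""
    return document.count(term)


def idf(term, corpus):
    """Unsmoothed inverse document frequency: log(N / df(t)).

    Assumes the term occurs in at least one document, otherwise df is 0.
    """
    df = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / df)


def tfidf(term, document, corpus):
    """tf(t, d) * idf(t, D)."""
    return tf(term, document) * idf(term, corpus)


# "the" is in every document, so log(3 / 3) = 0 and its weight is zero.
print(tfidf("the", documents[0], documents))          # 0.0
# "transformer" is in one of three documents, so its weight is log(3).
print(tfidf("transformer", documents[0], documents))  # about 1.0986
```

With the unsmoothed formula, a word that appears in every document scores exactly zero no matter how often it occurs. The smoothed variants keep such words at a small positive weight instead.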

The semantic gap

Traditional methods like one-hot encoding, Bag of Words, and TF-IDF have one big weakness:

They work with word identity and word frequency, not deep meaning.

Example:

car
automobile
vehicle

Humans know these words are related.

But in one-hot encoding or Bag of Words, they are separate columns.

car        -> [1, 0, 0]
automobile -> [0, 1, 0]
vehicle    -> [0, 0, 1]

The model does not automatically know that they are close in meaning.

This is called the semantic gap.

The system can count words, but it does not directly encode what those words mean.

There is another problem too: vector size.

If your vocabulary has 100,000 words, a one-hot vector can have 100,000 dimensions.

Most values are zero.

cat -> [0, 0, 0, 0, 1, 0, 0, ...]

This is called a sparse vector.

Stored position by position, sparse vectors waste memory and computation on zeros, and their dimensions say nothing about how words relate.
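The storage side of this can be sketched in a few lines. The example assumes the 100,000-word vocabulary mentioned above and compares a position-by-position list with a dictionary that stores only the non-zero entry; both helper functions are made up for this illustration.

```python
VOCABULARY_SIZE = 100_000


def dense_one_hot(index):
    """Store every position, including all the zeros."""
    vector = [0] * VOCABULARY_SIZE
    vector[index] = 1
    return vector


def sparse_one_hot(index):
    """Store only the non-zero positions as {index: value}."""
    return {index: 1}


dense = dense_one_hot(4)
sparse = sparse_one_hot(4)

print(len(dense))                       # 100000
print(sum(1 for v in dense if v != 0))  # 1
print(len(sparse))                      # 1
```

A sparse storage format fixes the memory cost, but it does not fix the meaning problem: the one stored entry still only says which word this is.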

This is where NLP needed a better representation.

That better representation was embeddings.

What an embedding is

An embedding is a way to represent text as a vector of numbers.

The text can be:

a word
a token
a sentence
a paragraph
a document

The goal is to place similar meanings close together in vector space.

Examples:

king ~= queen
dog ~= cat
movie ~= film
happy ~= joyful
car ~= automobile

This was a major shift.

The model was no longer only working with word identity or word frequency. It started working with meaning-like representations.

Embeddings put similar meanings nearby

Plotted words: car, vehicle, automobile, dog, cat, film, movie

The axes are not literal. The useful idea is distance: close points mean related meanings.

Cosine similarity formula

cosine(a, b) = (a . b) / (||a|| * ||b||)

Plain version: Compare two embedding vectors by measuring how similar their directions are.

Cosine similarity is commonly used for semantic search, but some systems use dot product or distance metrics instead.
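Here is a minimal sketch of the cosine formula in plain Python. The three-dimensional vectors are hypothetical values invented for this example; real embeddings are learned from data and have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity: (a . b) / (||a|| * ||b||).

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, invented for this example.
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "dog": [0.1, 0.9, 0.2],
}

print(cosine(embeddings["car"], embeddings["automobile"]))  # close to 1
print(cosine(embeddings["car"], embeddings["dog"]))         # much lower

# One-hot vectors for two different words always score 0.
print(cosine([1, 0, 0], [0, 1, 0]))  # 0.0
```

The last line shows why this measure is only useful with dense vectors: with one-hot vectors, every pair of different words gets the same score.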

Sparse vectors vs dense vectors

Older methods often created sparse vectors.

Example:

cat -> [0, 0, 0, 0, 1, 0, 0, ...]

Most values are zero.

Embeddings create dense vectors.

Example:

cat -> [0.23, -0.71, 0.44, 0.09, ...]

Many dimensions contain useful values.

Those values are learned from data and can capture patterns about meaning.

Embedding lookup formula

embedding = E[token_id]

E has shape: vocabulary_size x embedding_dimension

Plain version: The model uses the token ID as a row lookup into an embedding table.
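The embedding lookup E[token_id] can be sketched as a plain table of rows. In this example the table is filled with random placeholder numbers and the dimension of 4 is a hypothetical choice; in a real model the rows are learned during training.

```python
import random

vocabulary = ["cat", "dog", "car", "happy", "sad"]
token_to_id = {token: i for i, token in enumerate(vocabulary)}

EMBEDDING_DIMENSION = 4  # hypothetical; real models typically use hundreds or more

# E has shape vocabulary_size x embedding_dimension. The values here are random
# placeholders; in a real model they are learned during training.
rng = random.Random(0)
E = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIMENSION)] for _ in vocabulary]


def embed(token):
    """Look up the row of E that belongs to this token's ID."""
    return E[token_to_id[token]]


print(len(E), len(E[0]))     # 5 4
print(embed("cat") is E[0])  # True
```

The lookup itself is just indexing. All of the useful behavior comes from how the rows of the table were trained.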

The shift was:

sparse vectors -> dense vectors
word identity  -> word meaning
exact matching -> semantic similarity

A simple mental model:

Bag of Words = words as counts
TF-IDF = words as importance scores
Embeddings = words or text as meaning vectors

Static word embeddings

One major early family of embeddings was the static word embedding.

A static word embedding gives one fixed vector per word.

Example:

bank -> [0.12, -0.44, 0.91, ...]

No matter where bank appears, it has the same vector.

"I deposited money in the bank."
"I sat near the river bank."

In static embeddings, bank has the same representation in both sentences.

This was still a big improvement over Bag of Words because similar words could now be close in vector space.

Examples:

cat close to dog
car close to vehicle
happy close to joyful

But static embeddings have one major limitation.

They capture the general meaning of a word, not the meaning of the word in a specific sentence.

So a word like bank still creates a problem:

bank = financial bank?
bank = river bank?

The vector is fixed, so it cannot fully adapt to the sentence.
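In code, the limitation looks like this. The table below is a hypothetical static embedding with invented numbers, and embed_word is a made-up helper; the point is only that the sentence never influences the result.

```python
# Hypothetical static embedding table: one fixed vector per word.
# The numbers are invented for this example.
static_embeddings = {
    "bank": [0.12, -0.44, 0.91],
    "money": [0.10, -0.52, 0.80],
    "river": [-0.63, 0.30, 0.22],
}


def embed_word(word, sentence):
    """Return the word's vector.

    The sentence is accepted but never used, which is exactly what makes
    the embedding static.
    """
    return static_embeddings[word]


financial = embed_word("bank", "I deposited money in the bank.")
riverside = embed_word("bank", "I sat near the river bank.")

print(financial == riverside)  # True
```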

Common tools and models for static embeddings:

Word2Vec
GloVe
fastText
gensim
spaCy word vectors

Word embeddings vs text embeddings

At this point, the terminology can get confusing.

There are two separate questions.

First:

What are we embedding?

The answer can be:

word
token
sentence
paragraph
document

Second:

Does the embedding change with context?

The answer can be:

static
contextual

So "word embedding" and "document embedding" describe the size of the thing being embedded.

"Static" and "contextual" describe whether the vector changes depending on the surrounding text.

A word embedding represents one word:

apple -> vector
dog   -> vector
bank  -> vector

A text embedding or document embedding represents a whole sentence, paragraph, chunk, or document:

"How do I reset my password?" -> vector

or:

"Users can reset their password by clicking Forgot Password on the login page." -> vector

Text embeddings are heavily used in semantic search and in many retrieval-augmented generation (RAG) systems.

A RAG pipeline often looks like this:

Document
-> split into chunks
-> embed each chunk
-> store vectors in a vector index or vector database

Then when the user asks a question:

Question
-> embed question
-> compare with document vectors
-> retrieve closest chunks
-> send retrieved context to the LLM

Example:

User question:
"How do I change my password?"
 
Stored document chunk:
"Users can reset their password by clicking Forgot Password on the login page."

Even though the words are not exactly the same, the meanings are close.

A semantic search system can retrieve the right chunk because the embeddings are close in vector space.
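The retrieval half of that pipeline can be sketched without any external service. Everything model-like here is a stand-in: the embed function below is a hypothetical toy that maps a few hand-picked words onto three "concept" axes, the second chunk is invented for contrast, and the index is a plain list rather than a vector database. A real system would call an embedding model and a vector index instead.

```python
import math

# Hypothetical stand-in for a real embedding model: each known word adds weight
# to one hand-picked "concept" axis. Real models learn these vectors from data.
CONCEPTS = {
    "change": 0, "reset": 0, "update": 0,
    "password": 1, "login": 1,
    "invoice": 2, "billing": 2, "refund": 2,
}


def embed(text):
    """Map text to a 3-dimensional toy vector by counting concept words."""
    vector = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        axis = CONCEPTS.get(word.strip(".,?!"))
        if axis is not None:
            vector[axis] += 1.0
    return vector


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Indexing: embed each chunk once and store the vectors.
chunks = [
    "Users can reset their password by clicking Forgot Password on the login page.",
    "Refund requests are handled by the billing team within five days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, k=1):
    """Return the k chunks whose vectors are closest to the question vector."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# The question says "change" and the chunk says "reset", but both land on the
# same toy concept axis, so the password chunk is ranked first.
print(retrieve("How do I change my password?"))
```

In a full RAG system, the retrieved chunks would then be placed into the prompt that is sent to the LLM.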

Common tools for text and document embeddings:

SentenceTransformers
Hugging Face Transformers
OpenAI embeddings
Cohere embeddings
Voyage AI embeddings
LangChain
LlamaIndex
FAISS
Chroma
Pinecone
Weaviate
Qdrant

The short version

The evolution so far looks like this:

One-hot encoding
-> Bag of Words
-> TF-IDF
-> static word embeddings
-> text embeddings

Each step fixes a weakness from the step before it.

One-hot encoding knows word identity.

Bag of Words counts words in documents.

TF-IDF measures word importance.

Embeddings represent meaning-like similarity.

But static embeddings still leave one important problem:

The same word can mean different things in different sentences.

That is the problem contextual embeddings and Transformers were designed to handle.