NLP basics: NLU, NLG, and core language tasks

Machines do not process language the same way humans do.

A word can mean different things depending on where it appears.

"The bank approved my loan."
"I sat near the river bank."

In the first sentence, bank means a financial institution.

In the second sentence, bank means the side of a river.

That is the central difficulty of language processing. Language is not just a list of words. It is meaning, context, order, tone, and relationships between words.

One word, two meanings

The bank approved my loan.

bank -> finance, loan, account

I sat near the river bank.

bank -> river, water, land

The model has to use nearby words to decide which meaning of bank is active.
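A minimal sketch of that idea, assuming each meaning of bank comes with a small set of cue words. The sense names, the cue lists, and the function name are illustrative choices, not a real lexicon or library API. Real models learn these associations from data instead of using hand-written lists.

```python
# Hypothetical sense inventory: each sense of "bank" has a few cue words.
SENSES = {
    "financial institution": {"finance", "loan", "account"},
    "side of a river": {"river", "water", "land"},
}

def guess_sense(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence."""
    # Lowercase each word and strip surrounding punctuation.
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    # Score each sense by the number of shared words; ties go to the first sense.
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(guess_sense("The bank approved my loan."))   # financial institution
print(guess_sense("I sat near the river bank."))   # side of a river
```

The sketch only works when a cue word actually appears in the sentence, which is one reason later approaches represent context numerically.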

Before embeddings or Transformers make sense, it helps to understand the broader field they come from.

This is part 1 of a three-part introduction. Part 2 goes from counting words to embeddings. Part 3 goes from contextual embeddings to Transformers.

What NLP is trying to do

NLP stands for Natural Language Processing.

It is the area of AI focused on helping computers work with human language.

NLP includes tasks like:

text classification
translation
summarization
question answering
chatbots
semantic search
speech-to-text
information extraction
sentiment analysis

A simple NLP system takes raw text, processes it, and produces a useful output.

Example:

Input:
"I love this product."
 
Output:
Sentiment = positive

Another example:

Input:
"Summarize this article."
 
Output:
A short summary written in natural language.

NLP is the broad field. Inside it, two important ideas are NLU and NLG.

NLU: understanding language

NLU means Natural Language Understanding.

It is the part of NLP focused on understanding what text means.

Example:

"Book me a flight to Paris next Friday."

An NLU system might extract:

intent: book a flight
destination: Paris
date: next Friday

The system is not only reading the words. It is trying to identify the user's goal and the important details inside the sentence.
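The extraction above can be sketched with a few hand-written rules. This is a toy under stated assumptions: the function name, the field names, and the regular expressions are invented for this example, and production NLU systems usually rely on trained models instead of rules like these.

```python
import re

def parse_request(text: str) -> dict:
    """Extract a coarse intent plus destination and date slots with regexes."""
    result = {"intent": None, "destination": None, "date": None}

    # Intent: a keyword rule standing in for a trained intent classifier.
    if re.search(r"\bbook\b.*\bflight\b", text, re.IGNORECASE):
        result["intent"] = "book a flight"

    # Destination: a capitalized word that follows "to".
    dest = re.search(r"\bto ([A-Z][a-z]+)", text)
    if dest:
        result["destination"] = dest.group(1)

    # Date: only the "next <weekday>" form is handled here.
    date = re.search(
        r"\bnext (?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b",
        text,
    )
    if date:
        result["date"] = date.group(0)

    return result

print(parse_request("Book me a flight to Paris next Friday."))
# {'intent': 'book a flight', 'destination': 'Paris', 'date': 'next Friday'}
```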

NLU is used for tasks like:

intent detection
named entity recognition
sentiment analysis
topic classification
text classification
question understanding

A chatbot needs NLU to understand what the user is asking.

A search engine needs NLU to understand what a query means.

A customer support system needs NLU to determine whether the user is asking for a refund, reporting a bug, or complaining about a product.

NLG: generating language

NLG means Natural Language Generation.

It is the part of NLP focused on producing text.

If NLU is about reading, NLG is about writing.

Examples of NLG-heavy tasks include:

writing an answer
summarizing a document
generating an email
creating chatbot replies
translating text
explaining data in natural language

Translation uses both sides: the system has to interpret the source text, then generate text in another language.

If a support system understands that a customer is upset about a late delivery, NLG helps generate a response like:

"I'm sorry your delivery was late. I can help check the status or request a refund."

A simple way to remember the relationship:

NLP = the whole field
NLU = understanding language
NLG = generating language

Modern LLM apps often combine both sides. The system interprets the user's request, retrieves or processes information, then generates a useful answer.
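The late-delivery example can be sketched as template-based generation: a structured NLU result selects a response. The intent and sentiment labels, the template texts, and the function name are all hypothetical. LLM-based systems generate text with a model, not fixed templates, but the hand-off from understanding to generation looks similar.

```python
# Hypothetical response templates keyed by (intent, sentiment).
TEMPLATES = {
    ("late_delivery", "negative"): (
        "I'm sorry your {item} was late. "
        "I can help check the status or request a refund."
    ),
    ("late_delivery", "neutral"): (
        "Your {item} is running late. I can help check the status."
    ),
}

FALLBACK = "Thanks for reaching out. Could you tell me more about the issue?"

def generate_reply(intent: str, sentiment: str, item: str = "delivery") -> str:
    """Fill the template that matches the NLU output, or use a generic fallback."""
    template = TEMPLATES.get((intent, sentiment), FALLBACK)
    return template.format(item=item)

print(generate_reply("late_delivery", "negative"))
# I'm sorry your delivery was late. I can help check the status or request a refund.
```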

NLP, NLU, and NLG

User text: "Book me a flight to Paris next Friday."

NLU: intent, destination, date, sentiment, entities

NLG: a natural-language response or summary

NLP is the full field. NLU is the reading side. NLG is the writing side.

Tokenization

Tokenization means breaking text into smaller pieces.

Those pieces are called tokens.

Tokens can be:

words
subwords
characters
punctuation

Example:

"Apple opened a new office in Paris."

Tokenized:

["Apple", "opened", "a", "new", "office", "in", "Paris", "."]
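A word-level split like this one can be reproduced with a single regular expression. This is a minimal sketch, not how NLTK or spaCy tokenize internally, and the function name is made up for the example.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores.
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Apple opened a new office in Paris."))
# ['Apple', 'opened', 'a', 'new', 'office', 'in', 'Paris', '.']
```

A rule this simple breaks on contractions, URLs, and many languages, which is part of why dedicated tokenizers exist.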

Modern language models often use subword tokenization.

One tokenizer might split a word like this:

"unbelievable"
-> ["un", "believ", "able"]

The exact split depends on the tokenizer.
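One way to get a split like that is greedy longest-match against a fixed vocabulary, similar in spirit to how WordPiece-style tokenizers apply their vocabulary. The vocabulary below is a toy chosen so the example word splits as shown. Real tokenizers learn their vocabulary from data and often mark word-internal pieces.

```python
# Toy vocabulary, picked only for this example.
VOCAB = {"un", "believ", "able", "break", "ing"}

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first split; unknown spans fall back to characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches here: emit one character.
            pieces.append(word[start])
            start += 1
    return pieces

print(subword_tokenize("unbelievable", VOCAB))  # ['un', 'believ', 'able']
print(subword_tokenize("unbreakable", VOCAB))   # ['un', 'break', 'able']
```

The second call shows the benefit: a word the vocabulary never listed as a whole still breaks into known pieces.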

Tokenization matters because models do not process raw text directly. Text has to be split into smaller units first.

The flow looks like this:

Text
-> tokens
-> token IDs
-> embeddings
-> model
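The first steps of that flow can be sketched in plain Python. The IDs are assigned in first-seen order and the embedding values are random, both assumptions made for illustration. A real model uses a fixed vocabulary and learns the vectors during training.

```python
import random

tokens = ["Apple", "opened", "a", "new", "office", "in", "Paris", "."]

# Vocabulary: give each distinct token an integer ID in first-seen order.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Tokens -> token IDs.
token_ids = [vocab[tok] for tok in tokens]

# Embedding table: one small vector per ID (random placeholder values).
rng = random.Random(0)
EMBED_DIM = 4
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]

# Token IDs -> embeddings: a simple table lookup.
embeddings = [embedding_table[i] for i in token_ids]

print(token_ids)                            # [0, 1, 2, 3, 4, 5, 6, 7]
print(len(embeddings), len(embeddings[0]))  # 8 4
```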

Common tools for tokenization:

NLTK
spaCy
Hugging Face Tokenizers
Transformers tokenizers

Tokenization sounds small, but it affects cost and context length (both are usually measured in tokens), search quality, and how the model handles unfamiliar words.

Tokenization turns text into model-sized pieces

Apple opened a new office in Paris.

Tokens: Apple | opened | a | new | office | in | Paris | .

Apple  -> id 314
opened -> id 405
a      -> id 496
new    -> id 587

After tokenization, each token can be mapped to an ID and then to an embedding vector.

Named entity recognition

Named Entity Recognition, usually called NER, means finding named things in text and labeling each one with a type, such as person or location.

Entities can include:

people
companies
locations
dates
money amounts
organizations
products

Example:

"Apple opened a new office in Paris on Monday."

NER output:

Apple  -> Organization
Paris  -> Location
Monday -> Date

Another example:

"Elon Musk bought Twitter for $44 billion in 2022."

NER output:

Elon Musk   -> Person
Twitter     -> Organization
$44 billion -> Money
2022        -> Date

NER is useful because it turns messy text into structured information.

This sentence:

"Meeting with Sarah in London next Friday."

can become:

person: Sarah
location: London
date: next Friday
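A minimal sketch of that conversion, assuming tiny hand-written name lists and a weekday pattern. The lists, the field names, and the function name are hypothetical. Tools like spaCy use trained models instead, so they can recognize names they have never been given.

```python
import re

# Hypothetical gazetteers: short hand-written lists standing in for a trained model.
PEOPLE = {"Sarah", "Elon Musk"}
LOCATIONS = {"London", "Paris"}
DATE_PATTERN = re.compile(
    r"\b(?:next |last )?"
    r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b"
)

def find_first(names: set[str], text: str):
    """Return one listed name that appears in the text as a whole word, else None."""
    for name in sorted(names):
        if re.search(rf"\b{re.escape(name)}\b", text):
            return name
    return None

def extract_entities(text: str) -> dict:
    """Build a record with at most one person, location, and date."""
    date = DATE_PATTERN.search(text)
    return {
        "person": find_first(PEOPLE, text),
        "location": find_first(LOCATIONS, text),
        "date": date.group(0) if date else None,
    }

print(extract_entities("Meeting with Sarah in London next Friday."))
# {'person': 'Sarah', 'location': 'London', 'date': 'next Friday'}
```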

Common tools for NER:

spaCy
NLTK
Stanford NLP
Hugging Face Transformers
Flair

NER is one reason NLP became useful in real software. Once text becomes structured, applications can search it, route it, validate it, and store it.

Named entity recognition highlights useful facts

Apple [org] opened a new office in Paris [loc] on Monday [date].

Organization: Apple
Location: Paris
Date: Monday

NER turns unstructured text into fields an application can store, filter, or route.

Sentiment analysis

Sentiment analysis means detecting the emotional tone or opinion in text.

The basic labels are usually:

positive
negative
neutral

Examples:

"I love this app. It is fast and easy to use."
-> positive
 
"This app keeps crashing after the update."
-> negative
 
"The app was updated yesterday."
-> neutral

Sometimes sentiment can be more detailed:

happy
angry
sad
frustrated
excited
confused

Sentiment analysis is used for:

product reviews
customer support tickets
surveys
social media monitoring
feedback analysis

Common tools for sentiment analysis:

scikit-learn
NLTK
TextBlob
spaCy
Hugging Face Transformers

Sentiment analysis is a good beginner example because the input and output are easy to understand. The hard part is that real language is subtle.

Example:

"Great, the app crashed again."

The word great looks positive, but the sentence is negative. That is why context matters.
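A small word-counting scorer makes the problem concrete. The word lists and the function name are assumptions for this sketch. Real lexicons are much larger, and model-based classifiers look at the whole sentence.

```python
import re

# Hypothetical mini-lexicon; real lexicons contain thousands of scored words.
POSITIVE = {"love", "fast", "easy", "great"}
NEGATIVE = {"crashing", "crashed", "slow", "broken"}

def lexicon_sentiment(text: str) -> str:
    """Label text by counting positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in [
    "I love this app. It is fast and easy to use.",  # positive
    "This app keeps crashing after the update.",     # negative
    "The app was updated yesterday.",                # neutral
    "Great, the app crashed again.",                 # neutral (wrong)
]:
    print(lexicon_sentiment(sentence), "<-", sentence)
```

The first three sentences get the expected labels. The sarcastic one comes out neutral because great and crashed cancel each other, even though a human reads it as clearly negative.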

Sentiment analysis maps text to opinion

positive: "I love this app. It is fast and easy to use."

negative: "This app keeps crashing after the update."

neutral: "The app was updated yesterday."

The simple version predicts positive, negative, or neutral. Real systems often need more nuance.

Why this matters before embeddings

All of these tasks share the same underlying problem:

How do we turn language into something a machine can process?

Early NLP systems often handled this by counting words.

Modern systems try to represent meaning numerically.

That is the path from traditional NLP to embeddings and Transformers:

raw text
-> tokens
-> numbers
-> meaning representations
-> useful output
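The first three steps of that path, in the word-counting style of early systems, can be sketched with the standard library. The function name and the lowercase-letters-only tokenization are simplifying assumptions.

```python
from collections import Counter
import re

def word_counts(text: str) -> Counter:
    """Lowercase the text, split it into words, and count each word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = word_counts("The bank approved my loan. I sat near the river bank.")
print(counts["bank"])   # 2
print(counts["river"])  # 1
```

Both uses of bank land in the same counter entry. Counting turns text into numbers, but it does not separate the two meanings.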

The important idea is simple:

NLP is the field.
NLU reads and interprets language.
NLG writes language.
Tokenization, NER, and sentiment analysis are common tasks inside that field.

Once that map is clear, the next question is natural:

How did older NLP systems represent text before embeddings?

That is where word counts, Bag of Words, and TF-IDF come in.