If you are building an LLM application, you are probably going to process user intents.
Your user writes something natural:
My dashboard stopped loading after this morning's deployment. I think I need to roll back.
Software cannot execute vibes.
It needs structure.
It needs to know:
- What does the user want?
- What data do we need?
- Which workflow should run?
- Is it safe to execute?
That process is called intent classification.
Intent classification turns messy human input into a clear action your system can understand.
User message -> intent -> metadata -> workflow
The basic idea
An intent is the main thing the user wants to do.
Example:
What is the status of my latest deployment?
Intent:
CheckDeploymentStatus
Another example:
Roll back the last release.
Intent:
RollbackDeployment
Another one:
Does your CLI support Windows?
Intent:
FeatureInquiry
Once the intent is known, the system can route the request to the right workflow.
CheckDeploymentStatus -> fetch pipeline logs
RollbackDeployment -> trigger rollback workflow
FeatureInquiry -> search documentation
ReportBug -> open issue in tracker
Without intent classification, the AI only replies.
With it, the AI can drive the product.
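The intent-to-workflow mapping above can be sketched as a small dispatch table. This is a minimal sketch, not a definitive implementation: the handler functions and the fallback string are hypothetical placeholders, and a real system would call its own services instead.

```python
from typing import Callable

# Hypothetical workflow handlers; a real system would call its own services here.
def fetch_pipeline_logs(message: str) -> str:
    return "fetching pipeline logs"

def trigger_rollback_workflow(message: str) -> str:
    return "triggering rollback workflow"

def search_documentation(message: str) -> str:
    return "searching documentation"

def open_issue_in_tracker(message: str) -> str:
    return "opening issue in tracker"

# Intent name -> workflow, mirroring the mapping listed above.
WORKFLOWS: dict[str, Callable[[str], str]] = {
    "CheckDeploymentStatus": fetch_pipeline_logs,
    "RollbackDeployment": trigger_rollback_workflow,
    "FeatureInquiry": search_documentation,
    "ReportBug": open_issue_in_tracker,
}

def route(intent: str, message: str) -> str:
    handler = WORKFLOWS.get(intent)
    if handler is None:
        # Unknown intents are never executed; fall back to asking the user.
        return "asking a clarifying question"
    return handler(message)

print(route("RollbackDeployment", "Roll back the last release."))
```

Keeping the mapping in one table means adding an intent is a one-line change, and anything outside the table falls through to a safe default.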
Intent vs metadata
Intent tells the system what action to take.
Metadata tells the system what details are needed.
Example:
Roll back the payment service to version 2.4.1.
Structured output:
{
"intent": "RollbackDeployment",
"service": "payment-service",
"targetVersion": "2.4.1"
}
Here:
- RollbackDeployment is the intent
- payment-service and 2.4.1 are metadata
Another example:
Assign the open auth bug to Priya and set it to high priority.
Structured output:
{
"intent": "UpdateIssue",
"issueType": "bug",
"topic": "authentication",
"assignee": "Priya",
"priority": "high"
}
The intent routes the request.
The metadata fills the parameters.
A good intent system does both.
Why plain prompts break
For a quick demo, you can ask a large language model:
Classify this message into an intent.
That works for simple inputs.
But real users do not always write clean commands.
Example:
My dashboard stopped loading after this morning's deployment. I think I need to roll back.
This could involve multiple intents:
- ReportBug
- CheckDeploymentStatus
- RollbackDeployment
A weak classifier might return:
The user is having a problem with their deployment.
That is fine as a human-readable summary, but it gives the backend nothing to act on.
Your app needs something structured:
{
"primaryIntent": "ReportBug",
"secondaryIntent": "RollbackDeployment",
"service": "dashboard",
"trigger": "deployment",
"confidence": 0.82
}
The confidence field tells downstream logic whether to proceed automatically or ask a clarifying question.
This is why production systems need more than plain prompts.
They need structured output, validation, and clear allowed intent values.
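As a rough illustration of how downstream logic might use the confidence field, here is a minimal sketch. The 0.75 threshold and the step names are assumptions made for this example, not recommended values.

```python
# Assumed cutoff for illustration; tune it against your own labeled traffic.
CONFIDENCE_THRESHOLD = 0.75

def next_step(classification: dict) -> str:
    """Decide whether to act on a classification or ask the user first."""
    # A missing confidence value is treated as zero, so it never auto-executes.
    confidence = classification.get("confidence", 0.0)
    if confidence >= CONFIDENCE_THRESHOLD:
        return "proceed"
    return "ask_clarifying_question"

classification = {
    "primaryIntent": "ReportBug",
    "secondaryIntent": "RollbackDeployment",
    "service": "dashboard",
    "trigger": "deployment",
    "confidence": 0.82,
}
print(next_step(classification))  # proceed
```

Model-reported confidence is only a signal. For destructive actions you may still want an explicit confirmation step on top of it.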
Common ways to build intent classification
There are a few ways to build this. Each one has a different tradeoff.
1. Rule-based and regex
This is the most direct approach.
You define rules and patterns manually.
Example:
If the message contains "roll back", classify it as RollbackDeployment.
Or use regex to extract values:
const match = message.match(/v?(\d+\.\d+\.\d+)/);
const version = match?.[1]; // "2.4.1"
Input:
Roll back the payment service to version 2.4.1.
Output:
{
"intent": "RollbackDeployment",
"targetVersion": "2.4.1"
}
Rule-based systems are useful because they are:
- Fast
- Predictable
- Easy to test
- Good for strict commands
- Good for safety-critical constraints
But they are rigid.
This is easy:
Roll back the deployment.
This is harder:
The release from this morning is causing problems. Maybe undo it?
The user is describing a rollback, but they did not use that word.
Rules are good when the input is predictable.
They struggle when users write naturally.
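To make the tradeoff concrete, here is a minimal Python sketch of a rule-based classifier. The keyword list is hypothetical, and the version pattern is the same one used in the regex example above. It handles the strict command and returns nothing for the natural phrasing.

```python
import re
from typing import Optional

# Hypothetical keyword rules: phrase -> intent. The first matching rule wins.
RULES = [
    ("roll back", "RollbackDeployment"),
    ("rollback", "RollbackDeployment"),
    ("status", "CheckDeploymentStatus"),
]

# Same semantic-version pattern as the regex example above.
VERSION_PATTERN = re.compile(r"v?(\d+\.\d+\.\d+)")

def classify_with_rules(message: str) -> Optional[dict]:
    text = message.lower()
    for phrase, intent in RULES:
        if phrase in text:
            result = {"intent": intent}
            match = VERSION_PATTERN.search(message)
            if match:
                result["targetVersion"] = match.group(1)
            return result
    return None  # no rule matched

print(classify_with_rules("Roll back the payment service to version 2.4.1."))
print(classify_with_rules("The release from this morning is causing problems. Maybe undo it?"))
```

The second call prints None: the message describes a rollback, but no rule fires because none of the listed keywords appear in it.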
2. NLU and machine learning
NLU means Natural Language Understanding.
Before LLMs became popular, many teams used NLU tools like:
- Rasa
- Dialogflow
- spaCy-based classifiers
The concept is straightforward: teach a model with labeled examples.
For RollbackDeployment, you might provide examples like:
Roll back the deployment.
Undo the last release.
Can you revert the pipeline to the previous build?
Go back to version 2.4.1.
The model learns that different phrases can map to the same intent.
NLU is more flexible than regex because the user does not need to use exact keywords.
It is useful when:
- Your intents are known
- Your flows are repeated
- You have enough training examples
- You want a classic chatbot-style system
The downside is maintenance.
You need to collect examples, update training data, and keep the model aligned as your product changes.
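Tools like Rasa and Dialogflow train statistical models, and this sketch does not show how they work internally. It is a deliberately tiny stand-in for the "learn from examples" idea: pick the intent whose labeled example shares the most words with the message (Jaccard similarity). The CheckDeploymentStatus examples are partly made up for illustration.

```python
import re

# Tiny labeled set. The RollbackDeployment examples come from the text above;
# the second CheckDeploymentStatus example is made up for illustration.
TRAINING_EXAMPLES = {
    "RollbackDeployment": [
        "Roll back the deployment.",
        "Undo the last release.",
        "Can you revert the pipeline to the previous build?",
        "Go back to version 2.4.1.",
    ],
    "CheckDeploymentStatus": [
        "What is the status of my latest deployment?",
        "Did the deploy finish?",
    ],
}

def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric words, with punctuation dropped.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard similarity: shared words divided by all distinct words.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(message: str) -> tuple[str, float]:
    words = tokens(message)
    best_intent, best_score = "Unknown", 0.0
    for intent, examples in TRAINING_EXAMPLES.items():
        for example in examples:
            score = similarity(words, tokens(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score

intent, score = classify("Please undo the release from this morning")
print(intent, round(score, 3))  # RollbackDeployment 0.375
```

Even this toy version shows the maintenance cost: the classifier is only as good as the examples you keep feeding it.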
3. LLM-based classification
LLMs make intent classification more flexible.
They can understand messy, longer, and more complex requests without needing thousands of labeled examples.
Example:
Deploy the recommendation service to production, but only after the staging health checks pass, and auto-rollback if the error rate exceeds 1% within the first ten minutes.
Structured output:
{
"intent": "DeployService",
"service": "recommendation-service",
"targetEnvironment": "production",
"precondition": "staging_health_checks_pass",
"rollbackThreshold": {
"metric": "error_rate",
"value": 1,
"unit": "percent",
"windowMinutes": 10
}
}
This is where LLMs are useful.
They can understand:
- Natural phrasing
- Multiple conditions
- Implied or missing context
- Long user requests
- More complex instructions
But LLMs are not automatically reliable.
They can:
- Invent fields
- Misread values
- Return inconsistent formats
- Pick the wrong intent
- Loosen important constraints
So the model should not just answer.
It should return structured data that your system can validate.
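One way to picture that validation step is a small check that runs before anything acts on the model's reply. This is a sketch under assumptions: the allowed intents and field names below are taken from the examples in this article, and your own schema will differ.

```python
import json

# Assumed allow-lists, taken from the examples in this article.
ALLOWED_INTENTS = {"CheckDeploymentStatus", "RollbackDeployment", "FeatureInquiry", "ReportBug"}
ALLOWED_FIELDS = {"intent", "service", "targetVersion"}

def validate_model_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        problems.append(f"unknown intent: {intent!r}")
    # Any key outside the schema counts as an invented field.
    for field in sorted(set(data) - ALLOWED_FIELDS):
        problems.append(f"unexpected field: {field}")
    return problems

print(validate_model_output('{"intent": "RollbackDeployment", "service": "payment-service"}'))
print(validate_model_output('{"intent": "UndoRelease", "urgency": "high"}'))
print(validate_model_output("The user wants to roll back their deployment."))
```

The first call returns an empty list. The second flags an invented intent and an invented field. The third rejects free text outright.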
Structured outputs
Structured output means the model must respond in a specific format, usually JSON.
Instead of this:
The user wants to roll back their deployment.
You want this:
{
"intent": "RollbackDeployment",
"service": "payment-service",
"targetVersion": "2.4.1"
}
This matters because software can work with structured data.
Your backend can route it:
if intent == Intent.RollbackDeployment:
    trigger_rollback(service, target_version)
Free text is hard to trust.
Structured output is easier to test, validate, and connect to real code.
LangChain and Pydantic example
LangChain can help you force structured outputs from an LLM using .with_structured_output().
With Pydantic, you define the shape of the result you expect, and LangChain asks the model to fill that schema and parses the reply into an instance of it.
from enum import Enum
from pydantic import BaseModel
from langchain_anthropic import ChatAnthropic

# Closed set of intents the model is allowed to return.
class Intent(str, Enum):
    CheckDeploymentStatus = "CheckDeploymentStatus"
    RollbackDeployment = "RollbackDeployment"
    FeatureInquiry = "FeatureInquiry"
    ReportBug = "ReportBug"

# Shape of the structured result; optional fields default to None.
class UserIntent(BaseModel):
    intent: Intent
    service: str | None = None
    target_version: str | None = None

model = ChatAnthropic(model="claude-sonnet-4-6")
classifier = model.with_structured_output(UserIntent)
result = classifier.invoke("Roll back the payment service to version 2.4.1")
print(result.intent)          # expected: Intent.RollbackDeployment
print(result.service)         # expected: "payment-service"
print(result.target_version)  # expected: "2.4.1"
The model returns data that matches your schema directly.
Example response:
{
"intent": "RollbackDeployment",
"service": "payment-service",
"target_version": "2.4.1"
}
This is much easier to use than interpreting a model's classification from plain text.
Why enums matter
Enums stop the model from inventing random intent names.
Without an enum, you might get:
rollback deployment
Rollback
DeploymentRollback
UserWantsRollback
RollbackIntent
These all mean similar things, but they create messy backend logic.
With an enum, the system must return one clean value:
RollbackDeployment
Now your intent can map directly to a workflow:
RollbackDeployment -> rollback pipeline
CheckDeploymentStatus -> fetch logs
FeatureInquiry -> search docs
ReportBug -> open issue
Enums make the system easier to maintain.
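The same guarantee can be seen with nothing but the standard library. In this small sketch, `parse_intent` is a hypothetical helper name: only exact enum values are accepted, and every invented variant is rejected.

```python
from enum import Enum
from typing import Optional

class Intent(str, Enum):
    CheckDeploymentStatus = "CheckDeploymentStatus"
    RollbackDeployment = "RollbackDeployment"
    FeatureInquiry = "FeatureInquiry"
    ReportBug = "ReportBug"

def parse_intent(raw: str) -> Optional[Intent]:
    """Return the matching Intent, or None if the value is not in the enum."""
    try:
        return Intent(raw)  # lookup by value; raises ValueError if unknown
    except ValueError:
        return None

for candidate in ["RollbackDeployment", "rollback deployment", "DeploymentRollback"]:
    print(candidate, "->", parse_intent(candidate))
```

Only the first candidate maps to an enum member. The other two return None, so the backend never has to guess what a near-miss name was supposed to mean.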
4. Hybrid approach
In production, a combination of approaches often works best.
A strong system can combine:
- Regex for obvious values
- Rules for strict constraints
- LLMs for flexible language understanding
- Schemas for structured output
- Validation for correctness
- Fallback questions when the input is unclear
Example:
Roll back the payment service to version 2.4.1.
Regex extracts:
{
"targetVersion": "2.4.1"
}
The LLM classifies:
{
"intent": "RollbackDeployment",
"service": "payment-service"
}
Validation checks that the required fields exist.
Final result:
{
"intent": "RollbackDeployment",
"service": "payment-service",
"targetVersion": "2.4.1",
"status": "ready_to_execute"
}
This is stronger than trusting one layer to do everything.
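The layering described above might look roughly like this in code. Treat it as a sketch: `classify_with_llm` is a stub standing in for a real model call, and the `needs_clarification` status and the required-fields table are assumptions added for illustration.

```python
import re
from typing import Optional

VERSION_PATTERN = re.compile(r"v?(\d+\.\d+\.\d+)")

# Assumed per-intent requirements; extend for your own intents.
REQUIRED_FIELDS = {"RollbackDeployment": ["service", "targetVersion"]}

def extract_version(message: str) -> Optional[str]:
    """Deterministic layer: pull a semantic version out of the text."""
    match = VERSION_PATTERN.search(message)
    return match.group(1) if match else None

def classify_with_llm(message: str) -> dict:
    # Stub standing in for a real model call; returns a fixed answer for this demo.
    return {"intent": "RollbackDeployment", "service": "payment-service"}

def build_request(message: str) -> dict:
    result = classify_with_llm(message)
    version = extract_version(message)
    if version is not None:
        # The regex value overrides anything the model supplied for this field.
        result["targetVersion"] = version
    required = REQUIRED_FIELDS.get(result["intent"], [])
    missing = [name for name in required if not result.get(name)]
    result["status"] = "needs_clarification" if missing else "ready_to_execute"
    return result

print(build_request("Roll back the payment service to version 2.4.1."))
print(build_request("Roll back the payment service."))
```

The second call has no version in the text, so validation marks it as needing clarification instead of letting the rollback run with a guessed target.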
Why hybrid matters for high-risk systems
If your app only answers questions, mistakes are annoying.
If your app performs actions, mistakes become dangerous.
This matters in:
- Infrastructure operations
- Financial transactions
- Healthcare scheduling
- Legal workflows
- Account permissions
- Any system that executes real actions
Example:
Schedule a follow-up scan for the patient, but only with Dr. Reyes, and only after at least three weeks have passed since the last procedure on June 20th.
The user gave a hard constraint:
{
"earliestAllowedDate": "2026-07-11",
"requiredDoctor": "Dr. Reyes"
}
The LLM should never be allowed to turn that into:
{
"earliestAllowedDate": "2026-06-22",
"requiredDoctor": null
}
For high-risk systems, deterministic rules should win over model guesses.
A better architecture looks like this:
User input
-> deterministic extraction
-> LLM classification
-> schema validation
-> safety checks
-> execution
The LLM helps understand the request.
The system still owns correctness.
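For the scheduling example, the safety check can be plain date arithmetic that the model has no say in. This is a minimal sketch: the function names and the shape of the `proposed` dict are hypothetical, and the year 2026 is assumed, as in the JSON above.

```python
from datetime import date, timedelta

def earliest_allowed(last_procedure: date, min_weeks: int) -> date:
    """Deterministic rule: the follow-up may not be earlier than this date."""
    return last_procedure + timedelta(weeks=min_weeks)

def check_proposal(proposed: dict, last_procedure: date,
                   min_weeks: int, required_doctor: str) -> list[str]:
    """Return the hard constraints that a model-proposed booking violates."""
    violations = []
    earliest = earliest_allowed(last_procedure, min_weeks)
    if date.fromisoformat(proposed["date"]) < earliest:
        violations.append(f"date is before {earliest.isoformat()}")
    if proposed.get("doctor") != required_doctor:
        violations.append(f"doctor must be {required_doctor}")
    return violations

last_procedure = date(2026, 6, 20)  # year assumed, as in the example above
print(earliest_allowed(last_procedure, 3))  # 2026-07-11
print(check_proposal({"date": "2026-06-22", "doctor": None},
                     last_procedure, 3, "Dr. Reyes"))
```

The loosened proposal from the example fails both checks, so it never reaches execution no matter how confident the model was.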
Metadata tagging
Metadata tagging is related to intent classification, but it is usually used for documents or larger text.
Intent classification asks:
What does the user want to do?
Metadata tagging asks:
What labels can we extract from this content?
Example:
The CSV export hangs on reports with more than 50,000 rows. Smaller exports work fine. We're on Enterprise and this is blocking our weekly reporting cycle.
Metadata output:
{
"category": "bug",
"severity": "high",
"feature": "csv-export",
"plan": "enterprise",
"impact": "blocking"
}
This is useful for search and retrieval.
Before storing support tickets in a vector database, you can tag them with:
- Feature area
- Severity
- Category
- Plan tier
- Impact
- Reporter type
Then search becomes more precise.
Example query:
Find high-severity bugs in the export feature from Enterprise customers.
The system can filter by metadata:
{
"severity": "high",
"feature": "csv-export",
"plan": "enterprise"
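}
Filtering on metadata before, or alongside, the embedding search usually gives more precise results than embeddings alone, because similar-sounding tickets from the wrong feature or plan are excluded outright.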
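Here is a small sketch of the filtering step on its own. A vector database would normally apply such a filter as part of the query. Here a plain list stands in for the store, and tickets 2 and 3 are made up for illustration.

```python
# In-memory stand-in for tagged tickets; tickets 2 and 3 are hypothetical.
tickets = [
    {"id": 1, "text": "CSV export hangs on reports with more than 50,000 rows.",
     "severity": "high", "feature": "csv-export", "plan": "enterprise"},
    {"id": 2, "text": "CSV export has a typo in the header row.",
     "severity": "low", "feature": "csv-export", "plan": "enterprise"},
    {"id": 3, "text": "Login page times out.",
     "severity": "high", "feature": "auth", "plan": "free"},
]

def filter_by_metadata(items: list[dict], filters: dict) -> list[dict]:
    """Keep only items whose metadata matches every key in filters."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in filters.items())
    ]

filters = {"severity": "high", "feature": "csv-export", "plan": "enterprise"}
matches = filter_by_metadata(tickets, filters)
print([ticket["id"] for ticket in matches])  # [1]
```

Ticket 2 talks about the same feature and would likely sit close to ticket 1 in embedding space, but the severity tag keeps it out of the results.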
Builder mental model
Intent classification is not just an AI feature.
It is a routing layer between natural language and software execution.
User message
-> classify intent
-> extract metadata
-> validate result
-> choose workflow
-> execute safely
The goal is not to make the model sound smart.
The goal is to turn user input into something your application can actually use.
Final summary
Intent classification turns human language into structured actions.
A simple system detects the intent:
{
"intent": "RollbackDeployment"
}
A stronger system also extracts metadata:
{
"intent": "RollbackDeployment",
"service": "payment-service",
"targetVersion": "2.4.1"
}
There are several ways to build it:
- Rule-based and regex: fast, strict, and predictable
- NLU: trained classifiers for known workflows
- LLMs: flexible understanding for messy user input
- Hybrid systems: best for production, especially when safety matters
Best practice:
Use LLMs for understanding.
Use schemas for structure.
Use rules for hard constraints.
Use validation before execution.
In other words:
Do not let the AI only reply.
Make it classify, extract, validate, and route.