Intent classification is the routing layer for LLM apps

If you are building an LLM application, you are probably going to process user intents.

Your user writes something natural:

My dashboard stopped loading after this morning's deployment. I think I need to roll back.

Software cannot execute vibes.

It needs structure.

It needs to know:

What does the user want?
What data do we need?
Which workflow should run?
Is it safe to execute?

That process is called intent classification.

Intent classification turns messy human input into a clear action your system can understand.

User message -> intent -> metadata -> workflow

The basic idea

An intent is the main thing the user wants to do.

Example:

What is the status of my latest deployment?

Intent:

CheckDeploymentStatus

Another example:

Roll back the last release.

Intent:

RollbackDeployment

Another one:

Does your CLI support Windows?

Intent:

FeatureInquiry

Once the intent is known, the system can route the request to the right workflow.

CheckDeploymentStatus -> fetch pipeline logs
RollbackDeployment    -> trigger rollback workflow
FeatureInquiry        -> search documentation
ReportBug             -> open issue in tracker
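
The routing table above can be sketched as a dictionary lookup. This is a minimal sketch, not a real deployment API: the handler functions and the strings they return are hypothetical placeholders for your own workflows.

```python
from enum import Enum
from typing import Callable


class Intent(str, Enum):
    CheckDeploymentStatus = "CheckDeploymentStatus"
    RollbackDeployment = "RollbackDeployment"
    FeatureInquiry = "FeatureInquiry"
    ReportBug = "ReportBug"


# Hypothetical handlers: each one stands in for a real workflow.
def fetch_pipeline_logs(params: dict) -> str:
    return "fetching pipeline logs"


def trigger_rollback(params: dict) -> str:
    return f"rolling back {params.get('service', 'unknown service')}"


def search_documentation(params: dict) -> str:
    return "searching documentation"


def open_issue(params: dict) -> str:
    return "opening issue in tracker"


# One entry per intent, mirroring the table in the text.
ROUTES: dict[Intent, Callable[[dict], str]] = {
    Intent.CheckDeploymentStatus: fetch_pipeline_logs,
    Intent.RollbackDeployment: trigger_rollback,
    Intent.FeatureInquiry: search_documentation,
    Intent.ReportBug: open_issue,
}


def route(intent: Intent, params: dict) -> str:
    """Look up the workflow for an intent and run it with the given parameters."""
    return ROUTES[intent](params)


print(route(Intent.RollbackDeployment, {"service": "payment-service"}))
```

Keeping the mapping in one table makes it easy to check that every intent has a workflow.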

Without intent classification, the AI only replies.

With it, the AI can drive the product.

Intent vs metadata

Intent tells the system what action to take.

Metadata tells the system what details are needed.

Example:

Roll back the payment service to version 2.4.1.

Structured output:

{
  "intent": "RollbackDeployment",
  "service": "payment-service",
  "targetVersion": "2.4.1"
}

Here:

RollbackDeployment is the intent
payment-service and 2.4.1 are metadata

Another example:

Assign the open auth bug to Priya and set it to high priority.

Structured output:

{
  "intent": "UpdateIssue",
  "issueType": "bug",
  "topic": "authentication",
  "assignee": "Priya",
  "priority": "high"
}

The intent routes the request.

The metadata fills the parameters.

A good intent system does both.

Why plain prompts break

For a quick demo, you can ask a large language model:

Classify this message into an intent.

That works for simple inputs.

But real users do not always write clean commands.

Example:

My dashboard stopped loading after this morning's deployment. I think I need to roll back.

This could involve multiple intents:

ReportBug
CheckDeploymentStatus
RollbackDeployment

A weak classifier might return:

The user is having a problem with their deployment.

That is fine as a human-readable summary, but it gives the backend nothing to route on.

Your app needs something structured:

{
  "primaryIntent": "ReportBug",
  "secondaryIntent": "RollbackDeployment",
  "service": "dashboard",
  "trigger": "deployment",
  "confidence": 0.82
}

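The confidence field tells downstream logic whether to proceed automatically or ask a clarifying question. A model-reported score is not guaranteed to be calibrated, so treat it as a rough signal and pick the threshold by testing on real messages.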

This is why production systems need more than plain prompts.

They need structured output, validation, and clear allowed intent values.
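
Here is one way the confidence check could look. It is a sketch under assumptions: the 0.75 threshold and the three outcome names ("execute", "clarify", "reject") are made up for illustration, and the field names follow the JSON example above.

```python
import json

ALLOWED_INTENTS = {"ReportBug", "CheckDeploymentStatus", "RollbackDeployment"}
CONFIDENCE_THRESHOLD = 0.75  # assumed cutoff; tune it against your own traffic


def decide(raw: str) -> str:
    """Return 'execute', 'clarify', or 'reject' for a raw classifier response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "reject"  # free text instead of JSON
    if not isinstance(data, dict):
        return "reject"

    intent = data.get("primaryIntent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        return "reject"  # intent name is missing or not in the allowed list

    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        return "reject"  # confidence is missing or out of range

    # Confident enough: run the workflow. Otherwise ask the user first.
    return "execute" if confidence >= CONFIDENCE_THRESHOLD else "clarify"


print(decide('{"primaryIntent": "ReportBug", "confidence": 0.82}'))
print(decide('{"primaryIntent": "ReportBug", "confidence": 0.40}'))
print(decide("The user is having a problem with their deployment."))
```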

Common ways to build intent classification

There are a few ways to build this. Each one has a different tradeoff.

1. Rule-based and regex

This is the most direct approach.

You define rules and patterns manually.

Example:

If the message contains "roll back", classify it as RollbackDeployment.

Or use regex to extract values:

const match = message.match(/v?(\d+\.\d+\.\d+)/);
const version = match?.[1]; // "2.4.1"

Input:

Roll back the payment service to version 2.4.1.

Output:

{
  "intent": "RollbackDeployment",
  "targetVersion": "2.4.1"
}
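
The keyword rule and the version regex can be combined into one small classifier. This is a minimal Python sketch: the rule list is illustrative, not exhaustive, and the "status" rule is an extra assumption added to show rule ordering.

```python
import re
from typing import Optional

# Keyword rules, checked in order. The first match wins.
RULES = [
    ("RollbackDeployment", re.compile(r"\broll\s*back\b", re.IGNORECASE)),
    ("CheckDeploymentStatus", re.compile(r"\bstatus\b", re.IGNORECASE)),
]

# Matches versions like 2.4.1 or v2.4.1 and captures the digits.
VERSION = re.compile(r"\bv?(\d+\.\d+\.\d+)\b")


def classify(message: str) -> Optional[dict]:
    """Return an intent dict for the first matching rule, or None if no rule fires."""
    for intent, pattern in RULES:
        if pattern.search(message):
            result = {"intent": intent}
            version = VERSION.search(message)
            if version:
                result["targetVersion"] = version.group(1)
            return result
    return None


print(classify("Roll back the payment service to version 2.4.1."))
print(classify("The release from this morning is causing problems. Maybe undo it?"))
```

The second message returns None because it never uses the words the rules look for.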

Rule-based systems are useful because they are:

Fast
Predictable
Easy to test
Good for strict commands
Good for safety-critical constraints

But they are rigid.

This is easy:

Roll back the deployment.

This is harder:

The release from this morning is causing problems. Maybe undo it?

The user is describing a rollback, but they did not use that word.

Rules are good when the input is predictable.

They struggle when users write naturally.

2. NLU and machine learning

NLU means Natural Language Understanding.

Before LLMs became popular, many teams used NLU tools like:

Rasa
Dialogflow
spaCy-based classifiers

The concept is straightforward: teach a model with labeled examples.

For RollbackDeployment, you might provide examples like:

Roll back the deployment.
Undo the last release.
Can you revert the pipeline to the previous build?
Go back to version 2.4.1.

The model learns that different phrases can map to the same intent.

NLU is more flexible than regex because the user does not need to use exact keywords.

It is useful when:

Your intents are known
Your flows are repeated
You have enough training examples
You want a classic chatbot-style system

The downside is maintenance.

You need to collect examples, update training data, and keep the model aligned as your product changes.
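
To make the "learn from examples" idea concrete, here is a toy stand-in. It is not how Rasa, Dialogflow, or spaCy work internally. It picks the most similar stored example by word overlap (Jaccard similarity), which is far weaker than a trained model but shows the same shape: labeled examples in, intent out. The CheckDeploymentStatus examples other than the first are made up for illustration.

```python
import re

# Labeled examples per intent. A real NLU tool would train a model on these.
TRAINING_EXAMPLES = {
    "RollbackDeployment": [
        "Roll back the deployment.",
        "Undo the last release.",
        "Can you revert the pipeline to the previous build?",
        "Go back to version 2.4.1.",
    ],
    "CheckDeploymentStatus": [
        "What is the status of my latest deployment?",
        "Did the deploy finish?",        # hypothetical example
        "Is the build still running?",   # hypothetical example
    ],
}


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(message: str) -> tuple[str, float]:
    """Return the intent of the most similar example and its similarity score."""
    query = tokens(message)
    best_intent, best_score = "Unknown", 0.0
    for intent, examples in TRAINING_EXAMPLES.items():
        for example in examples:
            score = jaccard(query, tokens(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score


print(classify("Please revert to the previous build"))
print(classify("Does your CLI support Windows?"))
```

Adding a new phrasing means adding an example, not writing a new rule. That is also where the maintenance cost comes from.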

3. LLM-based classification

LLMs make intent classification more flexible.

They can understand messy, longer, and more complex requests without needing thousands of labeled examples.

Example:

Deploy the recommendation service to production, but only after the staging health checks pass, and auto-rollback if the error rate exceeds 1% within the first ten minutes.

Structured output:

{
  "intent": "DeployService",
  "service": "recommendation-service",
  "targetEnvironment": "production",
  "precondition": "staging_health_checks_pass",
  "rollbackThreshold": {
    "metric": "error_rate",
    "value": 1,
    "unit": "percent",
    "windowMinutes": 10
  }
}

This is where LLMs are useful.

They can understand:

Natural phrasing
Multiple conditions
Implied context
Long user requests
More complex instructions

But LLMs are not automatically reliable.

They can:

Invent fields
Misread values
Return inconsistent formats
Pick the wrong intent
Loosen important constraints

So the model should not just answer.

It should return structured data that your system can validate.
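
A validation step for raw model output could look like the following. It is a stdlib-only sketch: the ParsedRequest fields are a simplified subset of the JSON above, and the allowed intent list is an assumption for the example.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional

ALLOWED_INTENTS = {"DeployService", "RollbackDeployment", "CheckDeploymentStatus"}


@dataclass
class ParsedRequest:
    intent: str
    service: Optional[str] = None
    targetEnvironment: Optional[str] = None


def parse_model_output(raw: str) -> ParsedRequest:
    """Parse raw model text, raising ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model did not return JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    # Reject fields the schema does not define (the model invented them).
    known = {f.name for f in fields(ParsedRequest)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"invented fields: {sorted(unknown)}")

    # Reject intent names outside the allowed list.
    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")

    # Optional fields must be strings when present.
    for name in known - {"intent"}:
        value = data.get(name)
        if value is not None and not isinstance(value, str):
            raise ValueError(f"field {name!r} must be a string")

    return ParsedRequest(**data)


good = '{"intent": "DeployService", "service": "recommendation-service", "targetEnvironment": "production"}'
print(parse_model_output(good))
```

Anything that fails here never reaches a workflow. The caller can retry the model or ask the user to rephrase.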

Structured outputs

Structured output means the model must respond in a specific format, usually JSON.

Instead of this:

The user wants to roll back their deployment.

You want this:

{
  "intent": "RollbackDeployment",
  "service": "payment-service",
  "targetVersion": "2.4.1"
}

This matters because software can work with structured data.

Your backend can route it:

if intent == Intent.RollbackDeployment:
    trigger_rollback(service, target_version)

Free text is hard to trust.

Structured output is easier to test, validate, and connect to real code.

LangChain and Pydantic example

LangChain can request schema-conforming output from an LLM using .with_structured_output().

With Pydantic, you define the shape of the result you expect, and LangChain parses the model's response into that shape.

from enum import Enum
from pydantic import BaseModel
from langchain_anthropic import ChatAnthropic
 
 
class Intent(str, Enum):
    CheckDeploymentStatus = "CheckDeploymentStatus"
    RollbackDeployment = "RollbackDeployment"
    FeatureInquiry = "FeatureInquiry"
    ReportBug = "ReportBug"
 
 
class UserIntent(BaseModel):
    intent: Intent
    service: str | None = None
    target_version: str | None = None
 
 
model = ChatAnthropic(model="claude-sonnet-4-6")
classifier = model.with_structured_output(UserIntent)
 
result = classifier.invoke("Roll back the payment service to version 2.4.1")
# Expected values; the exact strings depend on the model's response.
print(result.intent)          # Intent.RollbackDeployment
print(result.service)         # payment-service
print(result.target_version)  # 2.4.1

The call returns a UserIntent object that matches your schema, so there is no free text to parse.

Serialized, the result looks like this:

{
  "intent": "RollbackDeployment",
  "service": "payment-service",
  "target_version": "2.4.1"
}

This is much easier to use than interpreting a model's classification from plain text.

Why enums matter

Enums stop the model from inventing random intent names.

Without an enum, you might get:

rollback deployment
Rollback
DeploymentRollback
UserWantsRollback
RollbackIntent

These all mean similar things, but they create messy backend logic.

With an enum, the system must return one clean value:

RollbackDeployment

Now your intent can map directly to a workflow:

RollbackDeployment    -> rollback pipeline
CheckDeploymentStatus -> fetch logs
FeatureInquiry        -> search docs
ReportBug             -> open issue

Enums make the system easier to maintain.
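
A quick way to see this in plain Python: looking up an enum by value raises ValueError for any name that is not defined. The to_intent helper below is a hypothetical wrapper that turns that error into None.

```python
from enum import Enum
from typing import Optional


class Intent(str, Enum):
    CheckDeploymentStatus = "CheckDeploymentStatus"
    RollbackDeployment = "RollbackDeployment"
    FeatureInquiry = "FeatureInquiry"
    ReportBug = "ReportBug"


def to_intent(raw: str) -> Optional[Intent]:
    """Convert a raw string to an Intent, or None if it is not an allowed value."""
    try:
        return Intent(raw)  # exact, case-sensitive match on the enum value
    except ValueError:
        return None


for candidate in ["RollbackDeployment", "Rollback", "rollback deployment"]:
    print(candidate, "->", to_intent(candidate))
```

Only the exact value maps to an intent. The near-misses are rejected before they reach your routing code.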

4. Hybrid approach

In production, a combination of approaches often works best.

A strong system can combine:

Regex for obvious values
Rules for strict constraints
LLMs for flexible language understanding
Schemas for structured output
Validation for correctness
Fallback questions when the input is unclear

Example:

Roll back the payment service to version 2.4.1.

Regex extracts:

{
  "targetVersion": "2.4.1"
}

The LLM classifies:

{
  "intent": "RollbackDeployment",
  "service": "payment-service"
}

Validation checks that the required fields exist.

Final result:

{
  "intent": "RollbackDeployment",
  "service": "payment-service",
  "targetVersion": "2.4.1",
  "status": "ready_to_execute"
}

This is stronger than trusting one layer to do everything.
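
The layers above can be wired together roughly like this. It is a sketch with a stubbed model: fake_llm_classify returns a canned answer instead of calling a real LLM, and the "needs_clarification" status and REQUIRED_FIELDS table are assumptions for illustration.

```python
import re
from typing import Callable, Optional

VERSION = re.compile(r"\bv?(\d+\.\d+\.\d+)\b")

# Which fields each intent needs before it may run (assumed for this example).
REQUIRED_FIELDS = {"RollbackDeployment": ["service", "targetVersion"]}


def extract_version(message: str) -> Optional[str]:
    """Deterministic layer: pull a semantic version out of the message."""
    match = VERSION.search(message)
    return match.group(1) if match else None


def fake_llm_classify(message: str) -> dict:
    """Stand-in for a real model call; always returns the same answer."""
    return {"intent": "RollbackDeployment", "service": "payment-service"}


def resolve(message: str, llm: Callable[[str], dict] = fake_llm_classify) -> dict:
    """Merge the LLM result with regex extraction, then validate required fields."""
    result = dict(llm(message))

    version = extract_version(message)
    if version is not None:
        # The regex value overrides anything the model proposed for this field.
        result["targetVersion"] = version

    required = REQUIRED_FIELDS.get(result.get("intent"), [])
    missing = [name for name in required if not result.get(name)]
    if missing:
        result["status"] = "needs_clarification"
        result["missing"] = missing
    else:
        result["status"] = "ready_to_execute"
    return result


print(resolve("Roll back the payment service to version 2.4.1."))
print(resolve("Roll back the payment service."))
```

The second call has no version in the text, so the result asks for clarification instead of executing.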

Why hybrid matters for high-risk systems

If your app only answers questions, mistakes are annoying.

If your app performs actions, mistakes become dangerous.

This matters in:

Infrastructure operations
Financial transactions
Healthcare scheduling
Legal workflows
Account permissions
Any system that executes real actions

Example:

Schedule a follow-up scan for the patient, but only with Dr. Reyes, and only after at least three weeks have passed since the last procedure on June 20th.

The user gave two hard constraints: a required doctor, and an earliest date of June 20th plus three weeks, which is July 11th.

{
  "earliestAllowedDate": "2026-07-11",
  "requiredDoctor": "Dr. Reyes"
}

The LLM should never be allowed to turn that into:

{
  "earliestAllowedDate": "2026-06-22",
  "requiredDoctor": null
}

For high-risk systems, deterministic rules should win over model guesses.

A better architecture looks like this:

User input
-> deterministic extraction
-> LLM classification
-> schema validation
-> safety checks
-> execution

The LLM helps understand the request.

The system still owns correctness.
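
For the scheduling example, the safety check can be plain date arithmetic. This is a minimal sketch: the proposal dict shape and the function names are hypothetical, and the year 2026 comes from the JSON above.

```python
from datetime import date, timedelta

LAST_PROCEDURE = date(2026, 6, 20)
MIN_GAP = timedelta(weeks=3)
REQUIRED_DOCTOR = "Dr. Reyes"


def earliest_allowed(last_procedure: date, gap: timedelta = MIN_GAP) -> date:
    """Compute the earliest date deterministically instead of trusting the model."""
    return last_procedure + gap


def check_proposal(proposal: dict) -> list[str]:
    """Return the violated constraints; an empty list means safe to schedule."""
    problems = []
    earliest = earliest_allowed(LAST_PROCEDURE)
    proposed = date.fromisoformat(proposal["date"])
    if proposed < earliest:
        problems.append(f"date {proposed} is before {earliest}")
    if proposal.get("doctor") != REQUIRED_DOCTOR:
        problems.append(f"doctor must be {REQUIRED_DOCTOR}")
    return problems


print(earliest_allowed(LAST_PROCEDURE))
print(check_proposal({"date": "2026-07-11", "doctor": "Dr. Reyes"}))
print(check_proposal({"date": "2026-06-22", "doctor": None}))
```

The model can propose a date and a doctor. The check decides whether the proposal is allowed to run.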

Metadata tagging

Metadata tagging is related to intent classification, but it is usually used for documents or larger text.

Intent classification asks:

What does the user want to do?

Metadata tagging asks:

What labels can we extract from this content?

Example:

The CSV export hangs on reports with more than 50,000 rows. Smaller exports work fine. We're on Enterprise and this is blocking our weekly reporting cycle.

Metadata output:

{
  "category": "bug",
  "severity": "high",
  "feature": "csv-export",
  "plan": "enterprise",
  "impact": "blocking"
}

This is useful for search and retrieval.

Before storing support tickets in a vector database, you can tag them with:

Feature area
Severity
Category
Plan tier
Impact
Reporter type

Then search becomes more precise.

Example query:

Find high-severity bugs in the export feature from Enterprise customers.

The system can filter by metadata:

{
  "severity": "high",
  "feature": "csv-export",
  "plan": "enterprise"
}
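
Here is a small in-memory sketch of that filter. In a real system the filter would usually be passed to the vector store and similarity search would run on what remains. Tickets 2 and 3 are made up so the filter has something to exclude.

```python
TICKETS = [
    {
        "id": 1,
        "text": "The CSV export hangs on reports with more than 50,000 rows.",
        "metadata": {"category": "bug", "severity": "high",
                     "feature": "csv-export", "plan": "enterprise"},
    },
    {
        "id": 2,  # hypothetical ticket
        "text": "CSV export uses the wrong date format.",
        "metadata": {"category": "bug", "severity": "low",
                     "feature": "csv-export", "plan": "free"},
    },
    {
        "id": 3,  # hypothetical ticket
        "text": "Single sign-on fails for new users.",
        "metadata": {"category": "bug", "severity": "high",
                     "feature": "sso", "plan": "enterprise"},
    },
]


def filter_by_metadata(tickets: list[dict], filters: dict) -> list[dict]:
    """Keep tickets whose metadata matches every key/value pair in filters."""
    return [
        ticket for ticket in tickets
        if all(ticket["metadata"].get(key) == value for key, value in filters.items())
    ]


matches = filter_by_metadata(
    TICKETS,
    {"severity": "high", "feature": "csv-export", "plan": "enterprise"},
)
print([ticket["id"] for ticket in matches])
```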

Combining metadata filters with embedding search is often more precise than relying on embeddings alone, because the filter removes off-topic matches before similarity ranking.

Builder mental model

Intent classification is not just an AI feature.

It is a routing layer between natural language and software execution.

User message
-> classify intent
-> extract metadata
-> validate result
-> choose workflow
-> execute safely

The goal is not to make the model sound smart.

The goal is to turn user input into something your application can actually use.

Final summary

Intent classification turns human language into structured actions.

A simple system detects the intent:

{
  "intent": "RollbackDeployment"
}

A stronger system also extracts metadata:

{
  "intent": "RollbackDeployment",
  "service": "payment-service",
  "targetVersion": "2.4.1"
}

There are several ways to build it:

Rule-based and regex: fast, strict, and predictable
NLU: trained classifiers for known workflows
LLMs: flexible understanding for messy user input
Hybrid systems: often best for production, especially when safety matters

Best practice:

Use LLMs for understanding.
Use schemas for structure.
Use rules for hard constraints.
Use validation before execution.

In other words:

Do not let the AI only reply.
Make it classify, extract, validate, and route.