How to Use Pydantic with LLMs: Removing the Need for Manual Parsing
I was trying to build an "LLM as a Judge." The idea was simple: I wanted the LLM to look at another LLM's input prompt and output, and grade it. This would be used by another program for further processing. Sounds simple, right? Wait for the full story.
This blog is about that journey. We'll start with the chaos—the manual parsing, the breaking code, the frustration. Then I'll show you how Pydantic solved it all, and why it's become essential for anyone building production LLM applications. We'll also peek under the hood to understand what makes Pydantic so powerful, and how tools like Instructor make it even better.
Ready? Set. Go.
Why Can’t LLMs Just Follow Simple Instructions?
Going back to the LLM as a judge model. Initially, my prompt to this judge model looked something like this:
"Evaluate the answer. First, give me the confidence level (High, Medium, or Low), and then give a brief explanation."
Simple enough, right? What could go wrong?
Everything, apparently.
I expected the output to look like: High: The answer covers all key points.
So, I wrote some Python code to fetch the first word as confidence and the rest as explanation.
And that is where the nightmare began.
Sometimes the model would start with "I have high confidence that..." instead of just "high." Other times it would say "Confidence: High". And my personal favorite: it would launch into the explanation and casually mention its confidence level somewhere in the middle, like an afterthought.
My parsing code kept breaking. Error after error after error…
"Okay Fine, Let’s Use JSON" (That Didn’t Work Either)
"Fine," I thought. "I'll just ask for JSON. That's structured, right?"
So I updated my prompt: "Return your response as JSON with 'confidence' and 'explanation' fields."
Problem solved, right? Wrong.
This is what I was expecting in return:
```json
{
  "confidence": "medium",
  "explanation": "While the approach is solid..."
}
```
But the LLM started returning this:
Here's my evaluation:
```json
{
  "confidence": "medium",
  "explanation": "While the approach is solid..."
}
```
Those backticks! The text before them!
Python's json.loads() absolutely hates markdown code fences. My code crashed again.
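You can reproduce the crash in a couple of lines:

```python
import json

# A typical LLM reply: perfectly valid JSON wrapped in a markdown fence
fenced = '```json\n{"confidence": "medium"}\n```'

try:
    json.loads(fenced)
except json.JSONDecodeError:
    print("json.loads choked on the fence")
```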
And sometimes—sometimes—the model would just... forget. It would give me a perfectly crafted explanation in plain English, completely ignoring my JSON request.
Can I Build a Retry Loop to Fix This?
By this point, I was in too deep. I built what I called a "feedback loop"—a fancy name for "try again when it breaks."
The logic went something like this:
```python
import json

def get_llm_judgment(text):
    max_retries = 3
    for attempt in range(max_retries):
        response = call_llm(text)
        try:
            # Try to parse JSON
            cleaned = response.strip()
            if cleaned.startswith('```'):
                # Remove markdown code fences
                cleaned = cleaned.split('```')[1]
                if cleaned.startswith('json'):
                    cleaned = cleaned[4:]
            data = json.loads(cleaned)
            # Validate the structure
            if 'confidence' not in data:
                raise ValueError("Missing confidence field")
            if data['confidence'] not in ['high', 'medium', 'low']:
                raise ValueError("Invalid confidence value")
            if 'explanation' not in data:
                raise ValueError("Missing explanation field")
            if not isinstance(data['explanation'], str):
                raise ValueError("Explanation must be a string")
            return data
        except (json.JSONDecodeError, ValueError) as e:
            if attempt < max_retries - 1:
                # Pass the error back to the LLM
                text = f"Previous response had an error: {str(e)}. Please try again with valid JSON."
            else:
                raise Exception("Failed to get valid response after retries")
```
It worked. Mostly. But look at that code. It's a mess. And this is just for two simple fields!
What if I needed to add more fields? What if I wanted nested structures? What if I needed to validate that an email field actually contained a valid email?
I'd be writing validation code forever.
How Does Pydantic Solve LLM Output Validation?
"Have you tried Pydantic?" a colleague asked casually.
I was like, "Pie-what-now?"
But with Pydantic, my entire validation nightmare collapsed into this:
```python
from pydantic import BaseModel, Field
from typing import Literal

class LLMJudgment(BaseModel):
    confidence: Literal["high", "medium", "low"]
    explanation: str = Field(min_length=10)
```
When I get a response from the LLM, I just do:
```python
from pydantic import ValidationError

try:
    judgment = LLMJudgment.model_validate_json(llm_response)
    print(f"Confidence: {judgment.confidence}")
    print(f"Explanation: {judgment.explanation}")
except ValidationError as e:
    print(f"Invalid response: {e}")
```
If the JSON is malformed? Pydantic tells me cleanly.
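For instance, if the model invents a confidence level outside the allowed set, the error pinpoints exactly which field failed (repeating the model here so the snippet stands alone):

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class LLMJudgment(BaseModel):
    confidence: Literal["high", "medium", "low"]
    explanation: str = Field(min_length=10)

# "very high" is not one of the three allowed literals
bad = '{"confidence": "very high", "explanation": "Covers all key points."}'
try:
    LLMJudgment.model_validate_json(bad)
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('confidence',)
```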
All those manual checks? Gone. All that string parsing? Gone. All that pain? Gone.
But the story doesn't end there.
Wait, I Still Have to Parse JSON Myself?
Even with Pydantic, I was still doing this dance:
- Call the LLM
- Get a string response
- Clean up any markdown formatting
- Parse the JSON
- Pass it to my Pydantic model
- Handle errors and maybe retry
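Put together, the dance looked roughly like this (with `call_llm` stubbed out as a hypothetical API call so the sketch runs on its own):

```python
from typing import Literal
from pydantic import BaseModel, Field

class LLMJudgment(BaseModel):
    confidence: Literal["high", "medium", "low"]
    explanation: str = Field(min_length=10)

def call_llm(prompt: str) -> str:
    # Stand-in for a real API call that returns a fenced response
    return '```json\n{"confidence": "medium", "explanation": "Solid but misses edge cases."}\n```'

def get_judgment(prompt: str) -> LLMJudgment:
    raw = call_llm(prompt).strip()
    if raw.startswith("```"):
        # Clean up markdown formatting before parsing
        raw = raw.strip("`").removeprefix("json").strip()
    return LLMJudgment.model_validate_json(raw)  # parse + validate in one go

judgment = get_judgment("Evaluate this response: ...")
print(judgment.confidence)  # medium
```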
What Is Instructor and How Does It Simplify Pydantic Integration?
Instructor is like Pydantic's best friend who knows how to talk to LLMs. Instead of all those steps, I write this:
```python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI(api_key="YOUR_KEY"))

judgment = client.chat.completions.create(
    model="gpt-4",
    response_model=LLMJudgment,
    messages=[
        {"role": "user", "content": "Evaluate this response: ..."}
    ],
)

# judgment is already a validated LLMJudgment object!
print(judgment.confidence)
print(judgment.explanation)
```
That's it. No JSON parsing. No error handling. No cleanup. The judgment variable is already a fully validated Python object.
And if the LLM returns invalid data? Instructor automatically retries with the validation error fed back to the model. It's the feedback loop I built manually, except it actually works reliably.
This was my journey from chaos to structure. From brittle string parsing to type-safe, validated outputs with barely any code.
Let's zoom into Pydantic now, shall we?
What Is Pydantic and Why Does It Matter for LLMs?
Here's something that surprised me: Pydantic has been around since 2017.
That's the same year Google researchers published "Attention Is All You Need", the paper that introduced Transformers, the foundation of modern LLMs. Pydantic wasn't built for LLMs. LLMs just happened to have a problem that Pydantic solved perfectly.
So what was Pydantic built for?
Python doesn't actually enforce types at runtime.
So when we write this:
```python
def process_user(name: str, age: int):
    print(f"{name} is {age} years old")
```
Those type hints (str and int)? They're just suggestions. Friendly recommendations. Python won't stop us from doing this:
```python
process_user(["Alice", "Bob"], "twenty-five")
# Python happily runs this nonsense
```
This is fine for small scripts, but in production systems? It's a nightmare. We need to know that data is the shape we expect it to be. We need validation.
That's what Pydantic does. It takes Python's optional type hints and makes them real.
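A minimal sketch of what "making them real" looks like, reusing the `process_user` nonsense from above. Note that Pydantic also coerces the string "25" into an int, which is usually what you want:

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

print(User(name="Alice", age="25").age)  # 25 (numeric string coerced to int)

try:
    User(name=["Alice", "Bob"], age="twenty-five")
except ValidationError as e:
    print(e.error_count())  # 2 (both fields rejected)
```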
Today, Pydantic is everywhere in the Python ecosystem where data consistency matters—API servers (FastAPI is built on it), data pipelines, configuration management, and yes, now LLM applications too.
How Does Instructor Work with Multiple LLM Providers?
Instructor is much newer: it launched in 2023, right when everyone was figuring out how to make LLMs useful in production.
The creator, Jason Liu, saw everyone was writing the same boilerplate code to parse and validate LLM outputs. Why not wrap that up into a library?
Instructor is specifically designed to make Pydantic and LLMs work together seamlessly. It handles:
- Generating the JSON schema from your Pydantic model
- Injecting it into the LLM prompt
- Parsing the response
- Validating against your model
- Automatic retries when validation fails
- Support for multiple LLM providers (OpenAI, Anthropic, local models, etc.)
It's the bridge that didn't exist before.
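That first bullet, schema generation, is plain Pydantic, so you can inspect what gets injected into the prompt yourself:

```python
from typing import Literal
from pydantic import BaseModel, Field

class LLMJudgment(BaseModel):
    confidence: Literal["high", "medium", "low"]
    explanation: str = Field(min_length=10)

# This JSON schema is what Instructor hands to the LLM
schema = LLMJudgment.model_json_schema()
print(schema["properties"]["confidence"]["enum"])  # ['high', 'medium', 'low']
print(schema["required"])                          # ['confidence', 'explanation']
```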
And it provides one common API which can be used across multiple providers:
```python
# OpenAI
client = instructor.from_provider("openai/gpt-4o", api_key="sk-...")

# Anthropic
client = instructor.from_provider("anthropic/claude-3-5-sonnet", api_key="sk-ant-...")

# Google
client = instructor.from_provider("google/gemini-pro", api_key="...")

# Ollama (local)
client = instructor.from_provider("ollama/llama3.2")

# All use the same API!
user = client.chat.completions.create(
    response_model=User,
    messages=[{"role": "user", "content": "..."}],
)
```
But today's focus is not Instructor; we'll cover it in depth some other time. Just note that Pydantic also works with frameworks like LangChain and LlamaIndex to provide similar functionality. Instructor's advantage is that it's lightweight, fast, and focused on one thing: structured extraction.
Also, to clarify the difference between Pydantic and Instructor:
Pydantic enforces the schema; retries, feedback, and making the model aware of that schema are handled by tools like Instructor and LangChain. The two solve different problems and work best together.
What Advanced Features Does Pydantic Offer?
Let me show you what I mean by "more than simple fields."
Remember my LLMJudgment model? That was using BaseModel—the foundation class you inherit from to create Pydantic models. But you can build much more sophisticated structures.
How Do I Create Nested Pydantic Models?
Let's say I'm building an LLM system to analyze customer feedback:
```python
from pydantic import BaseModel, Field
from typing import List, Literal

class Sentiment(BaseModel):
    score: float = Field(ge=-1.0, le=1.0, description="Sentiment score from -1 (negative) to 1 (positive)")
    label: Literal["positive", "neutral", "negative"]

class Issue(BaseModel):
    category: str
    severity: Literal["low", "medium", "high"]
    description: str

class FeedbackAnalysis(BaseModel):
    sentiment: Sentiment
    issues: List[Issue]
    action_items: List[str]
    summary: str = Field(min_length=20, max_length=500)
```
Look at that—FeedbackAnalysis contains a Sentiment object and a list of Issue objects. Pydantic validates the entire nested structure automatically.
When the LLM returns this JSON:
```json
{
  "sentiment": {
    "score": -0.6,
    "label": "negative"
  },
  "issues": [
    {
      "category": "shipping",
      "severity": "high",
      "description": "Package arrived damaged"
    }
  ],
  "action_items": ["Contact customer", "Issue refund"],
  "summary": "Customer received damaged product and is unhappy with shipping service."
}
```
Pydantic checks:
- Is the sentiment score between -1 and 1? ✓
- Is the severity one of the allowed values? ✓
- Is the summary between 20 and 500 characters? ✓
- Are all the required fields present? ✓
And I wrote zero validation code for any of that.
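And when the LLM breaks two of those rules at once, say a score of -2.5 and a nine-character summary, Pydantic reports both failures in one shot. A trimmed, self-contained version of the model above:

```python
from typing import List, Literal
from pydantic import BaseModel, Field, ValidationError

class Sentiment(BaseModel):
    score: float = Field(ge=-1.0, le=1.0)
    label: Literal["positive", "neutral", "negative"]

class FeedbackAnalysis(BaseModel):
    sentiment: Sentiment
    action_items: List[str]
    summary: str = Field(min_length=20, max_length=500)

bad = {
    "sentiment": {"score": -2.5, "label": "negative"},  # score out of range
    "action_items": [],
    "summary": "too short",                             # under 20 characters
}
try:
    FeedbackAnalysis(**bad)
except ValidationError as e:
    for err in e.errors():
        print(err["loc"])  # ('sentiment', 'score') and ('summary',)
```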
Can I Add Custom Validation Logic in Pydantic?
But what if you have business logic that goes beyond basic type checking?
Say I'm extracting date ranges and I need to ensure the start date comes before the end date:
```python
from pydantic import BaseModel, field_validator
from datetime import datetime

class DateRange(BaseModel):
    start_date: datetime
    end_date: datetime

    @field_validator('end_date')
    @classmethod
    def end_must_be_after_start(cls, v, info):
        if 'start_date' in info.data and v < info.data['start_date']:
            raise ValueError('end_date must be after start_date')
        return v
```
The @field_validator decorator lets you write custom validation logic. You can validate individual fields, cross-check multiple fields, transform data—whatever you need.
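Feeding a reversed range into that model makes the custom check fire (repeating the model here so the snippet stands alone):

```python
from datetime import datetime
from pydantic import BaseModel, ValidationError, field_validator

class DateRange(BaseModel):
    start_date: datetime
    end_date: datetime

    @field_validator("end_date")
    @classmethod
    def end_must_be_after_start(cls, v, info):
        if "start_date" in info.data and v < info.data["start_date"]:
            raise ValueError("end_date must be after start_date")
        return v

# A valid range parses fine (ISO strings are coerced to datetime)
DateRange(start_date="2024-01-01T00:00:00", end_date="2024-06-01T00:00:00")

# A reversed range is rejected with our custom message
try:
    DateRange(start_date="2024-06-01T00:00:00", end_date="2024-01-01T00:00:00")
except ValidationError as e:
    print(e.errors()[0]["msg"])
```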
Pydantic also has built-in validators for common patterns:
```python
from pydantic import BaseModel, EmailStr, Field, HttpUrl

class UserProfile(BaseModel):
    email: EmailStr                  # Validates email format
    website: HttpUrl                 # Validates URL format
    age: int = Field(ge=0, le=120)   # Between 0 and 120
```
No regex writing. No manual format checking. It just works.
Why Is Pydantic Essential for AI Agent Workflows?
If you've worked with AI agents, you know the pain.
An agent isn't just one LLM call—it's a conversation. The agent uses tools, maintains state, makes decisions, and chains multiple steps together. At each step, you need structured data.
Without structured outputs, this becomes impossible to manage reliably.
Let me show you a quick example using LangChain, one of the most popular frameworks for building AI agents:
```python
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class SearchQuery(BaseModel):
    query: str = Field(description="The search query to execute")
    max_results: int = Field(default=5, description="Maximum number of results")

llm = ChatOpenAI(model="gpt-4")
structured_llm = llm.with_structured_output(SearchQuery)

result = structured_llm.invoke("Find me recent papers about transformers")
# result is a SearchQuery object
print(result.query)
print(result.max_results)
```
The .with_structured_output() method? That's using Pydantic under the hood.
In agentic workflows, this pattern repeats everywhere:
- Tool calling: Agents use Pydantic models to define what tools exist and what parameters they accept
- State management: Frameworks like LangGraph use Pydantic to define the state that flows through the agent's decision-making process
- Output parsing: Every step of the agent's reasoning can be structured and validated
Without Pydantic (or something like it), you'd be writing validation code for every single step. The overhead would be crushing.
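The tool-calling bullet, for instance, mostly boils down to handing the model's JSON schema to the LLM as a tool definition. A hypothetical sketch (real frameworks wrap this for you, and the exact envelope varies by provider):

```python
from pydantic import BaseModel, Field

class WebSearch(BaseModel):
    """Search the web for recent results."""
    query: str = Field(description="The search query")
    max_results: int = Field(default=5, description="How many results to return")

# Roughly the shape a framework sends as a tool definition
tool_definition = {
    "name": WebSearch.__name__,
    "description": WebSearch.__doc__,
    "parameters": WebSearch.model_json_schema(),
}
print(tool_definition["parameters"]["required"])  # ['query'] (max_results has a default)
```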
Actually, the Pydantic team was so convinced that agentic workflows need structured outputs that they launched their own agentic framework—PydanticAI—in late 2024, with Pydantic integration built in from day one. But that's a story for another time.
The Takeaway
When we're building with LLMs, we're working with something fundamentally unpredictable. The model might return the expected schema 99% of the time, but that remaining 1% will break our application in production.

Pydantic is how you catch that 1%.
Happy Building! Bye!
Thoughts? Questions?
If something here sparked a thought, or if you have feedback or questions, I'd love to hear from you.