What is DSPy? The End of Prompt Engineering (And Why GEPA Changes Everything)
Last updated: December 2025
You've spent hours crafting the perfect prompt. It works beautifully... until you switch models, change your use case slightly, or wake up tomorrow. Then it breaks, and you're back to trial-and-error.
This isn't a skill issue. It's a fundamental problem with how we've been building AI systems.
DSPy fixes this. And its newest optimizer, GEPA, might be the most important development in LLM programming since chain-of-thought prompting.
TL;DR
DSPy is a framework from Stanford NLP that lets you program language models instead of prompting them. You write Python code describing what you want, and DSPy's optimizers automatically figure out how to make it happen.
GEPA is DSPy's newest optimizer (July 2025). It uses natural language reflection to improve prompts, beating reinforcement learning approaches while using up to 35x fewer rollouts.
Together, they represent a paradigm shift: from artisanal prompt crafting to systematic AI engineering.
The Problem: Prompt Engineering is Broken
Here's what building with LLMs looks like today:
1. Write a prompt
2. Test it
3. It fails on edge cases
4. Add more instructions
5. Now it's too long and expensive
6. Simplify
7. It fails differently
8. Repeat forever
And that's just for one LLM call. Modern AI systems chain multiple calls together—retrieval, reasoning, tool use, validation. Each link in that chain has its own fragile prompt.
The core issues:
| Problem | Reality |
|---|---|
| Brittleness | Prompts break when you change models, add features, or scale up |
| No composability | You can't easily combine prompts like you combine functions |
| Manual optimization | Every improvement requires human intuition and trial-and-error |
| No portability | Prompts optimized for GPT-4 don't transfer to Claude or Llama |
What if there was a better way?
Enter DSPy: Programming, Not Prompting
DSPy (Declarative Self-improving Python) flips the script. Instead of writing prompts, you write programs.
The Key Insight
DSPy separates what you want from how to achieve it:
- You define: "Given a question and context, produce an answer"
- DSPy figures out: The exact prompt, examples, and structure to make it work
This is the same abstraction leap that made high-level languages possible. Assembly programmers once hand-crafted CPU instructions. Then compilers automated that. Now you write high-level code and trust the compiler.
DSPy is a compiler for language models.
A Simple Example
Traditional prompting:
prompt = """You are a helpful assistant. Given the following context and question,
provide a comprehensive answer. Be concise but thorough. Use the context to support
your answer. If you don't know, say so.
Context: {context}
Question: {question}
Answer:"""
response = llm.complete(prompt.format(context=ctx, question=q))
DSPy:
import dspy
qa = dspy.ChainOfThought("context, question -> answer")
response = qa(context=ctx, question=q)
That's it. No prompt template. No manual engineering. DSPy handles the rest.
How DSPy Works
DSPy has three core concepts: Signatures, Modules, and Optimizers.
1. Signatures: What You Want
Signatures declare input/output behavior without specifying implementation:
# Simple signature (inline)
"question -> answer"
# With types
"question: str -> answer: float"
# Multiple inputs and outputs
"context: list[str], question: str -> reasoning: str, answer: str"
For complex tasks, use class-based signatures:
from typing import Literal
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a given text."""
    text: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField()
2. Modules: How to Execute
Modules are building blocks that implement signatures with different strategies:
| Module | What It Does |
|---|---|
| dspy.Predict | Basic prediction |
| dspy.ChainOfThought | Adds step-by-step reasoning |
| dspy.ProgramOfThought | Generates code to solve problems |
| dspy.ReAct | Agent with tool use |
| dspy.MultiChainComparison | Compares multiple reasoning paths |
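Because every module implements a signature, you can swap strategies without touching the rest of your code. A quick sketch (assuming an LM has already been configured with dspy.configure):

import dspy

signature = "question -> answer"

basic = dspy.Predict(signature)            # one-shot prediction
reasoned = dspy.ChainOfThought(signature)  # same contract, adds a reasoning field

print(basic(question="Who wrote The Hobbit?").answer)
print(reasoned(question="Who wrote The Hobbit?").reasoning)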
Composing modules is where DSPy shines:
class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        super().__init__()
        self.num_docs = num_docs
        self.retrieve = dspy.ColBERTv2(url="your-retriever")
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question, k=self.num_docs)
        return self.generate(context=context, question=question)
This is a complete RAG pipeline in about ten lines. No prompt templates. No manual few-shot examples. Just clean, composable code.
3. Optimizers: Automatic Improvement
Here's where DSPy gets magical. Optimizers take your program and automatically tune it to maximize a metric you define.
from dspy.teleprompt import MIPROv2
# Define what "good" means
def metric(example, prediction, trace=None):
    return prediction.answer.lower() == example.answer.lower()

# Optimize
optimizer = MIPROv2(metric=metric, auto="medium")
optimized_rag = optimizer.compile(RAG(), trainset=your_examples)
# Result: your RAG pipeline typically performs 20-30% better on your metric
The optimizer:
- Runs your program on training examples
- Collects successful traces
- Uses them to generate better prompts and demonstrations
- Searches over combinations to maximize your metric
You provide examples. DSPy does the prompt engineering.
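For reference, your_examples above is just a list of dspy.Example objects with their input fields marked. A small illustrative set (the field values here are made up) might look like:

import dspy

your_examples = [
    dspy.Example(question="Who founded Acme Corp?", answer="Jane Doe").with_inputs("question"),
    dspy.Example(question="What year did Acme go public?", answer="2019").with_inputs("question"),
]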
Why GEPA Changes Everything
DSPy has had several optimizers: BootstrapFewShot, MIPROv2, COPRO. They work well. But in July 2025, the researchers behind DSPy released GEPA (Genetic-Pareto), and it's a breakthrough.
The Problem with Previous Approaches
Traditional LLM optimization uses reinforcement learning—treating the prompt as a policy and using scalar rewards to update it. This works, but:
- Requires massive numbers of rollouts (expensive)
- Learns slowly from sparse reward signals
- Doesn't leverage what LLMs are actually good at: language
GEPA's Key Insight
Language is a richer learning medium than scalar rewards.
Instead of treating optimization as an RL problem, GEPA treats it as a reflection problem:
- Sample trajectories: Run the program, collect reasoning traces
- Reflect in language: Ask the LLM to diagnose what went wrong
- Propose improvements: Generate new prompt variations
- Test and combine: Use Pareto optimization to find the best combinations
The LLM doesn't just get a "0.7 reward"—it gets to read what happened and reason about how to improve.
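Conceptually, the loop looks something like the sketch below. This is an illustration of the idea, not GEPA's actual implementation, and every helper function here is hypothetical:

def reflective_prompt_evolution(program, trainset, metric, budget):
    """Illustrative sketch of reflective prompt evolution (all helpers hypothetical)."""
    candidates = [program]  # start from the unoptimized program
    while budget > 0:
        parent = sample_from_pareto_front(candidates)             # keep diverse winners, not just one best
        traces, scores = run_and_score(parent, trainset, metric)  # 1. sample trajectories
        feedback = reflect_in_language(traces, scores)            # 2. LLM diagnoses failures in plain text
        child = propose_prompt_update(parent, feedback)           # 3. propose a revised prompt
        candidates.append(child)                                  # 4. test and fold into the Pareto front
        budget -= len(trainset)
    return best_on_validation(candidates)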
The Results
| Comparison | GEPA Improvement |
|---|---|
| vs GRPO (RL-based) | +10% on average (up to +20%), with up to 35x fewer rollouts |
| vs MIPROv2 | +10% across multiple LLMs |
Up to 35x fewer rollouts means a fraction of the optimization cost. GEPA isn't just better; it's dramatically more efficient.
Using GEPA
import dspy
# Your program
qa = dspy.ChainOfThought("question -> answer")
# Optimize with GEPA
optimizer = dspy.GEPA(
    metric=your_metric,
    auto="light",                            # optimization budget: light / medium / heavy
    reflection_lm=dspy.LM('openai/gpt-4o'),  # a strong model for the reflection step
)
optimized_qa = optimizer.compile(qa, trainset=examples)
Same interface as other optimizers. Drop-in replacement. Massive improvement.
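One practical note: GEPA learns most when the metric returns textual feedback alongside the score, since that text is what the reflection step reads. A sketch of a feedback-returning metric, assuming a recent DSPy version (the extra keyword arguments and return type may vary slightly between releases):

import dspy

def metric_with_feedback(example, prediction, trace=None, pred_name=None, pred_trace=None):
    """Score plus a plain-language explanation GEPA can reflect on."""
    correct = example.answer.lower() in prediction.answer.lower()
    score = 1.0 if correct else 0.0
    feedback = (
        "Correct answer."
        if correct
        else f"Expected something containing '{example.answer}', but got '{prediction.answer}'."
    )
    # Recent DSPy versions let GEPA metrics return a score plus feedback text.
    return dspy.Prediction(score=score, feedback=feedback)

# Pass it in place of your_metric above: dspy.GEPA(metric=metric_with_feedback, ...)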
Real-World Impact
DSPy isn't academic vaporware. It's in production at:
| Company | Use Case |
|---|---|
| JetBlue | Customer service chatbots |
| Replit | Code diff synthesis |
| Databricks | Classification, RAG, LLM judges |
| Sephora | Agent-based customer experiences |
| VMware | RAG and prompt optimization |
| Moody's | Financial document analysis |
Benchmark Improvements
When you optimize DSPy programs, the gains are significant:
| Task | Before | After | Gain (points) |
|---|---|---|---|
| RAG (SemanticF1) | 42% | 61% | +19% |
| ReAct Agent | 24% | 51% | +27% |
| Multi-hop QA | 31% | 59% | +28% |
| Classification | 66% | 87% | +21% |
These aren't cherry-picked results. This is what happens when you stop hand-tuning prompts and let algorithms optimize them.
DSPy vs. The Alternatives
DSPy vs. LangChain
| Aspect | DSPy | LangChain |
|---|---|---|
| Focus | Programming LMs | Chaining LM calls |
| Prompts | Automatically optimized | Manually written |
| Philosophy | Compilation | Orchestration |
| Learning | Core feature | Not built-in |
LangChain helps you connect LLM calls. DSPy helps you optimize them.
DSPy vs. LlamaIndex
| Aspect | DSPy | LlamaIndex |
|---|---|---|
| Focus | LM programming | Data retrieval |
| Strength | Optimization | Indexing & RAG |
| Use together? | Yes | Yes |
They're complementary. Use LlamaIndex for retrieval, DSPy for the LM layer.
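For example, you could keep LlamaIndex in charge of retrieval and hand its passages to a DSPy module. A rough sketch, assuming index is a LlamaIndex VectorStoreIndex you've built elsewhere (API details may differ by version):

import dspy

class LlamaIndexRAG(dspy.Module):
    def __init__(self, index, k=5):
        super().__init__()
        self.retriever = index.as_retriever(similarity_top_k=k)  # LlamaIndex handles retrieval
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        nodes = self.retriever.retrieve(question)
        context = [n.get_content() for n in nodes]  # DSPy handles the LM call
        return self.generate(context=context, question=question)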
DSPy vs. Raw Prompting
| Aspect | DSPy | Manual Prompts |
|---|---|---|
| Iteration speed | Fast (automatic) | Slow (trial-and-error) |
| Portability | Cross-model | Model-specific |
| Composability | Native | Fragile |
| Optimization | Algorithmic | Intuition-based |
Getting Started with DSPy
Installation
pip install -U dspy
Basic Setup
import dspy
# Configure your LM
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_KEY')
dspy.configure(lm=lm)
# Or use Anthropic
lm = dspy.LM('anthropic/claude-sonnet-4-5-20250929', api_key='YOUR_KEY')
# Or local models via Ollama
lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434')
Your First Program
# Simple QA
qa = dspy.Predict("question -> answer")
result = qa(question="What is the capital of France?")
print(result.answer) # Paris
# With reasoning
qa_cot = dspy.ChainOfThought("question -> answer")
result = qa_cot(question="What is 15% of 80?")
print(result.reasoning) # Shows step-by-step math
print(result.answer) # 12
Build a RAG Pipeline
class SimpleRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Your retrieval logic here
        context = retrieve_documents(question)
        return self.generate(context=context, question=question)
rag = SimpleRAG()
answer = rag(question="What were Q3 earnings?")
Optimize It
# Prepare training data
trainset = [
dspy.Example(question="...", answer="...").with_inputs("question"),
# ... more examples
]
# Define your metric
def accuracy(example, pred, trace=None, pred_name=None, pred_trace=None):
    # the extra keyword arguments are accepted because GEPA can pass per-predictor details
    return example.answer.lower() in pred.answer.lower()

# Optimize with GEPA
optimizer = dspy.GEPA(metric=accuracy, auto="light", reflection_lm=dspy.LM('openai/gpt-4o'))
optimized_rag = optimizer.compile(SimpleRAG(), trainset=trainset)
# Save for production
optimized_rag.save("optimized_rag.json")
Common Patterns
Multi-Hop Reasoning
class MultiHopQA(dspy.Module):
    def __init__(self, num_hops=3):
        super().__init__()
        self.num_hops = num_hops
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            query = self.generate_query(context=context, question=question)
            new_context = search(query.search_query)  # your retrieval function
            context.extend(new_context)
        return self.generate_answer(context=context, question=question)
Agent with Tools
def search_web(query: str) -> str:
    """Search the web for information."""
    return web_search(query)  # your search client

def calculate(expression: str) -> float:
    """Evaluate a math expression."""
    return eval(expression)  # fine for a demo; use a safe expression parser in production
agent = dspy.ReAct(
"question -> answer",
tools=[search_web, calculate]
)
result = agent(question="What is the population of Tokyo divided by 1000?")
Classification with Structured Output
from typing import Literal
class ClassifyIntent(dspy.Signature):
    """Classify user intent for customer support routing."""
    message: str = dspy.InputField()
    intent: Literal["billing", "technical", "sales", "other"] = dspy.OutputField()
    confidence: float = dspy.OutputField()
    reasoning: str = dspy.OutputField()
classifier = dspy.Predict(ClassifyIntent)
result = classifier(message="I can't log into my account")
# result.intent = "technical", result.confidence = 0.92
Best Practices
1. Start Simple, Optimize Later
# Start with basic Predict
v1 = dspy.Predict("question -> answer")
# Evaluate, then try ChainOfThought
v2 = dspy.ChainOfThought("question -> answer")
# Then optimize
v3 = optimizer.compile(v2, trainset=data)
2. Write Good Metrics
Your metric defines "success." Make it specific:
# Too simple
def metric(ex, pred, trace=None):
    return ex.answer == pred.answer

# Better: semantic similarity
from dspy.evaluate import SemanticF1
metric = SemanticF1(decompositional=True)

# Best: domain-specific
def metric(ex, pred, trace=None):
    correct = ex.answer.lower() in pred.answer.lower()
    concise = len(pred.answer.split()) < 100
    has_citation = "[source]" in pred.answer
    return correct and concise and has_citation
3. Use Evaluation Throughout
evaluate = dspy.Evaluate(
devset=dev_examples,
metric=your_metric,
num_threads=24,
display_progress=True
)
# Check before optimization
baseline_score = evaluate(your_program)
# Check after
optimized_score = evaluate(optimized_program)
4. Inspect What's Happening
# See the actual prompts being generated
dspy.inspect_history(n=1)
# Debug mode
import logging
logging.getLogger("dspy").setLevel(logging.DEBUG)
5. Save and Version Your Programs
# Save optimized program
optimized.save("v1_rag_2025_01.json")
# Load later
loaded = SimpleRAG()
loaded.load("v1_rag_2025_01.json")
The Bigger Picture
DSPy represents a fundamental shift in how we build with LLMs:
| Era | Approach | Analogy |
|---|---|---|
| 2022-2023 | Prompt engineering | Assembly language |
| 2024-2025 | DSPy + GEPA | High-level programming |
| Future | ??? | Compilers we haven't imagined |
The teams still hand-crafting prompts in 2025 are like developers still writing assembly in the age of Python. It works, but you're competing against people with better tools.
GEPA specifically validates a key thesis: natural language reflection is a more efficient optimization medium than scalar rewards. This has implications beyond DSPy—it suggests future AI systems will improve themselves through linguistic self-reflection, not just gradient descent.
Resources
Official
- Documentation: dspy.ai
- GitHub: github.com/stanfordnlp/dspy (30k+ stars)
- Discord: Active community with 10k+ members
Papers
- DSPy: "Compiling Declarative Language Model Calls into Self-Improving Pipelines" (ICLR 2024)
- MIPROv2: "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs"
- GEPA: "Reflective Prompt Evolution Can Outperform Reinforcement Learning" (July 2025)
Tutorials
- DSPy Intro Tutorial (official docs)
- RAG with DSPy (dspy.ai/tutorials)
- GEPA optimization guide (dspy.ai/tutorials/gepa)
Conclusion
Prompt engineering was a necessary first step. We had to learn what LLMs could do before we could systematize it.
But now we know. And DSPy + GEPA give us the tools to move beyond artisanal prompt crafting to real AI engineering.
The question isn't whether to adopt this paradigm. It's how soon.
Start with a simple program. Optimize it with GEPA. See the results. You won't go back to hand-tuning prompts.
Building something with DSPy? Found a pattern that works well? Share it with the community.