What is DSPy? The End of Prompt Engineering (And Why GEPA Changes Everything)
Last updated: December 2025
You've spent hours crafting the perfect prompt. It works beautifully... until you switch models, change your use case slightly, or wake up tomorrow. Then it breaks, and you're back to trial-and-error.
This isn't a skill issue. It's a fundamental problem with how we've been building AI systems.
DSPy fixes this. And its newest optimizer, GEPA, might be the most important development in LLM programming since chain-of-thought prompting.
TL;DR
DSPy is a framework from Stanford NLP that lets you program language models instead of prompting them. You write Python code describing what you want, and DSPy's optimizers automatically figure out how to make it happen.
GEPA is DSPy's newest optimizer (July 2025). It uses natural language reflection to improve prompts, beating reinforcement learning approaches while using up to 35x fewer rollouts.
Together, they represent a paradigm shift: from artisanal prompt crafting to systematic AI engineering.
The Problem: Prompt Engineering is Broken
Here's what building with LLMs looks like today:
1. Write a prompt
2. Test it
3. It fails on edge cases
4. Add more instructions
5. Now it's too long and expensive
6. Simplify
7. It fails differently
8. Repeat forever
And that's just for one LLM call. Modern AI systems chain multiple calls together—retrieval, reasoning, tool use, validation. Each link in that chain has its own fragile prompt.
The core issues:
| Problem | Reality |
|---|---|
| Brittleness | Prompts break when you change models, add features, or scale up |
| No composability | You can't easily combine prompts like you combine functions |
| Manual optimization | Every improvement requires human intuition and trial-and-error |
| No portability | Prompts optimized for GPT-4 don't transfer to Claude or Llama |
What if there was a better way?
Enter DSPy: Programming, Not Prompting
DSPy (Declarative Self-improving Python) flips the script. Instead of writing prompts, you write programs.
The Key Insight
DSPy separates what you want from how to achieve it:
- You define: "Given a question and context, produce an answer"
- DSPy figures out: The exact prompt, examples, and structure to make it work
This is the same abstraction leap that made high-level languages possible. Assembly programmers once hand-crafted CPU instructions. Then compilers automated that. Now you write high-level code and trust the compiler.
DSPy is a compiler for language models.
A Simple Example
Traditional prompting:
prompt = """You are a helpful assistant. Given the following context and question,
provide a comprehensive answer. Be concise but thorough. Use the context to support
your answer. If you don't know, say so.
Context: {context}
Question: {question}
Answer:"""
response = llm.complete(prompt.format(context=ctx, question=q))
DSPy:
import dspy
qa = dspy.ChainOfThought("context, question -> answer")
response = qa(context=ctx, question=q)
That's it. No prompt template. No manual engineering. DSPy handles the rest.
How DSPy Works
DSPy has three core concepts: Signatures, Modules, and Optimizers.
1. Signatures: What You Want
Signatures declare input/output behavior without specifying implementation:
# Simple signature (inline)
"question -> answer"
# With types
"question: str -> answer: float"
# Multiple inputs and outputs
"context: list[str], question: str -> reasoning: str, answer: str"
For complex tasks, use class-based signatures:
from typing import Literal
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a given text."""
    text: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField()
2. Modules: How to Execute
Modules are building blocks that implement signatures with different strategies:
| Module | What It Does |
|---|---|
| dspy.Predict | Basic prediction |
| dspy.ChainOfThought | Adds step-by-step reasoning |
| dspy.ProgramOfThought | Generates code to solve problems |
| dspy.ReAct | Agent with tool use |
| dspy.MultiChainComparison | Compares multiple reasoning paths |
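Because every module implements a signature, you can swap strategies without touching the rest of your code. A quick sketch (assuming an LM has already been configured with dspy.configure):

import dspy

signature = "question -> answer"

basic = dspy.Predict(signature)            # one-shot prediction
reasoned = dspy.ChainOfThought(signature)  # same contract, adds a reasoning field

print(basic(question="Who wrote The Hobbit?").answer)
print(reasoned(question="Who wrote The Hobbit?").reasoning)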
Composing modules is where DSPy shines:
class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        super().__init__()
        self.num_docs = num_docs
        self.retrieve = dspy.ColBERTv2(url="your-retriever")
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question, k=self.num_docs)
        return self.generate(context=context, question=question)
This is a complete RAG pipeline in about ten lines. No prompt templates. No manual few-shot examples. Just clean, composable code.
3. Optimizers: Automatic Improvement
Here's where DSPy gets magical. Optimizers take your program and automatically tune it to maximize a metric you define.
from dspy.teleprompt import MIPROv2
# Define what "good" means
def metric(example, prediction, trace=None):
    return prediction.answer.lower() == example.answer.lower()

# Optimize
optimizer = MIPROv2(metric=metric, auto="medium")
optimized_rag = optimizer.compile(RAG(), trainset=your_examples)
# Result: your RAG pipeline typically performs 20-30% better on your metric
The optimizer:
- Runs your program on training examples
- Collects successful traces
- Uses them to generate better prompts and demonstrations
- Searches over combinations to maximize your metric
You provide examples. DSPy does the prompt engineering.
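For reference, your_examples above is just a list of dspy.Example objects with their input fields marked. A small illustrative set (the field values here are made up) might look like:

import dspy

your_examples = [
    dspy.Example(question="Who founded Acme Corp?", answer="Jane Doe").with_inputs("question"),
    dspy.Example(question="What year did Acme go public?", answer="2019").with_inputs("question"),
]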
Why GEPA Changes Everything
DSPy has had several optimizers: BootstrapFewShot, MIPROv2, COPRO. They work well. But in July 2025, the researchers behind DSPy released GEPA (Genetic-Pareto), and it's a breakthrough.
The Problem with Previous Approaches
Traditional LLM optimization uses reinforcement learning—treating the prompt as a policy and using scalar rewards to update it. This works, but:
- Requires massive numbers of rollouts (expensive)
- Learns slowly from sparse reward signals
- Doesn't leverage what LLMs are actually good at: language
GEPA's Key Insight
Language is a richer learning medium than scalar rewards.
Instead of treating optimization as an RL problem, GEPA treats it as a reflection problem:
- Sample trajectories: Run the program, collect reasoning traces
- Reflect in language: Ask the LLM to diagnose what went wrong
- Propose improvements: Generate new prompt variations
- Test and combine: Use Pareto optimization to find the best combinations
The LLM doesn't just get a "0.7 reward"—it gets to read what happened and reason about how to improve.
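Conceptually, the loop looks something like the sketch below. This is an illustration of the idea, not GEPA's actual implementation, and every helper function here is hypothetical:

def reflective_prompt_evolution(program, trainset, metric, budget):
    """Illustrative sketch of reflective prompt evolution (all helpers hypothetical)."""
    candidates = [program]  # start from the unoptimized program
    while budget > 0:
        parent = sample_from_pareto_front(candidates)             # keep diverse winners, not just one best
        traces, scores = run_and_score(parent, trainset, metric)  # 1. sample trajectories
        feedback = reflect_in_language(traces, scores)            # 2. LLM diagnoses failures in plain text
        child = propose_prompt_update(parent, feedback)           # 3. propose a revised prompt
        candidates.append(child)                                  # 4. test and fold into the Pareto front
        budget -= len(trainset)
    return best_on_validation(candidates)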
The Results
| Comparison | GEPA Improvement |
|---|---|
| vs GRPO (RL-based) | +10% on average (up to +20%), with up to 35x fewer rollouts |
| vs MIPROv2 | +10% across multiple LLMs |
Up to 35x fewer rollouts means a fraction of the optimization cost. GEPA isn't just better; it's dramatically more efficient.
Using GEPA
import dspy
# Your program
qa = dspy.ChainOfThought("question -> answer")
# Optimize with GEPA
optimizer = dspy.GEPA(
    metric=your_metric,
    auto="light",                            # optimization budget: light / medium / heavy
    reflection_lm=dspy.LM('openai/gpt-4o'),  # a strong model for the reflection step
)
optimized_qa = optimizer.compile(qa, trainset=examples)
Same interface as other optimizers. Drop-in replacement. Massive improvement.
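One practical note: GEPA learns most when the metric returns textual feedback alongside the score, since that text is what the reflection step reads. A sketch of a feedback-returning metric, assuming a recent DSPy version (the extra keyword arguments and return type may vary slightly between releases):

import dspy

def metric_with_feedback(example, prediction, trace=None, pred_name=None, pred_trace=None):
    """Score plus a plain-language explanation GEPA can reflect on."""
    correct = example.answer.lower() in prediction.answer.lower()
    score = 1.0 if correct else 0.0
    feedback = (
        "Correct answer."
        if correct
        else f"Expected something containing '{example.answer}', but got '{prediction.answer}'."
    )
    # Recent DSPy versions let GEPA metrics return a score plus feedback text.
    return dspy.Prediction(score=score, feedback=feedback)

# Pass it in place of your_metric above: dspy.GEPA(metric=metric_with_feedback, ...)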
Real-World Impact
DSPy isn't academic vaporware. It's in production at:
| Company | Use Case |
|---|---|
| JetBlue | Customer service chatbots |
| Replit | Code diff synthesis |
| Databricks | Classification, RAG, LLM judges |
| Sephora | Agent-based customer experiences |
| VMware | RAG and prompt optimization |
| Moody's | Financial document analysis |
Benchmark Improvements
When you optimize DSPy programs, the gains are significant:
| Task | Before | After | Gain (points) |
|---|---|---|---|
| RAG (SemanticF1) | 42% | 61% | +19% |
| ReAct Agent | 24% | 51% | +27% |
| Multi-hop QA | 31% | 59% | +28% |
| Classification | 66% | 87% | +21% |
These aren't cherry-picked results. This is what happens when you stop hand-tuning prompts and let algorithms optimize them.
DSPy vs. The Alternatives
DSPy vs. LangChain
| Aspect | DSPy | LangChain |
|---|---|---|
| Focus | Programming LMs | Chaining LM calls |
| Prompts | Automatically optimized | Manually written |
| Philosophy | Compilation | Orchestration |
| Learning | Core feature | Not built-in |
LangChain helps you connect LLM calls. DSPy helps you optimize them.
DSPy vs. LlamaIndex
| Aspect | DSPy | LlamaIndex |
|---|---|---|
| Focus | LM programming | Data retrieval |
| Strength | Optimization | Indexing & RAG |
| Use together? | Yes | Yes |
They're complementary. Use LlamaIndex for retrieval, DSPy for the LM layer.
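For example, you could keep LlamaIndex in charge of retrieval and hand its passages to a DSPy module. A rough sketch, assuming index is a LlamaIndex VectorStoreIndex you've built elsewhere (API details may differ by version):

import dspy

class LlamaIndexRAG(dspy.Module):
    def __init__(self, index, k=5):
        super().__init__()
        self.retriever = index.as_retriever(similarity_top_k=k)  # LlamaIndex handles retrieval
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        nodes = self.retriever.retrieve(question)
        context = [n.get_content() for n in nodes]  # DSPy handles the LM call
        return self.generate(context=context, question=question)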
DSPy vs. Raw Prompting
| Aspect | DSPy | Manual Prompts |
|---|---|---|
| Iteration speed | Fast (automatic) | Slow (trial-and-error) |
| Portability | Cross-model | Model-specific |
| Composability | Native | Fragile |
| Optimization | Algorithmic | Intuition-based |
Getting Started with DSPy
Installation
pip install -U dspy
Basic Setup
import dspy
# Configure your LM
lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_KEY')
dspy.configure(lm=lm)
# Or use Anthropic
lm = dspy.LM('anthropic/claude-sonnet-4-5-20250929', api_key='YOUR_KEY')
# Or local models via Ollama
lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434')
Your First Program
# Simple QA
qa = dspy.Predict("question -> answer")
result = qa(question="What is the capital of France?")
print(result.answer) # Paris
# With reasoning
qa_cot = dspy.ChainOfThought("question -> answer")
result = qa_cot(question="What is 15% of 80?")
print(result.reasoning) # Shows step-by-step math
print(result.answer) # 12
Build a RAG Pipeline
class SimpleRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Your retrieval logic here
        context = retrieve_documents(question)
        return self.generate(context=context, question=question)
rag = SimpleRAG()
answer = rag(question="What were Q3 earnings?")
Optimize It
# Prepare training data
trainset = [
dspy.Example(question="...", answer="...").with_inputs("question"),
# ... more examples
]
# Define your metric
def accuracy(example, pred, trace=None, pred_name=None, pred_trace=None):
    # the extra keyword arguments are accepted because GEPA can pass per-predictor details
    return example.answer.lower() in pred.answer.lower()

# Optimize with GEPA
optimizer = dspy.GEPA(metric=accuracy, auto="light", reflection_lm=dspy.LM('openai/gpt-4o'))
optimized_rag = optimizer.compile(SimpleRAG(), trainset=trainset)
# Save for production
optimized_rag.save("optimized_rag.json")
Common Patterns
Multi-Hop Reasoning
class MultiHopQA(dspy.Module):
    def __init__(self, num_hops=3):
        super().__init__()
        self.num_hops = num_hops
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            query = self.generate_query(context=context, question=question)
            new_context = search(query.search_query)  # your retrieval function
            context.extend(new_context)
        return self.generate_answer(context=context, question=question)
Agent with Tools
def search_web(query: str) -> str:
    """Search the web for information."""
    return web_search(query)  # your search client

def calculate(expression: str) -> float:
    """Evaluate a math expression."""
    return eval(expression)  # fine for a demo; use a safe expression parser in production
agent = dspy.ReAct(
"question -> answer",
tools=[search_web, calculate]
)
result = agent(question="What is the population of Tokyo divided by 1000?")
Classification with Structured Output
from typing import Literal
class ClassifyIntent(dspy.Signature):
    """Classify user intent for customer support routing."""
    message: str = dspy.InputField()
    intent: Literal["billing", "technical", "sales", "other"] = dspy.OutputField()
    confidence: float = dspy.OutputField()
    reasoning: str = dspy.OutputField()
classifier = dspy.Predict(ClassifyIntent)
result = classifier(message="I can't log into my account")
# result.intent = "technical", result.confidence = 0.92
Best Practices
1. Start Simple, Optimize Later
# Start with basic Predict
v1 = dspy.Predict("question -> answer")
# Evaluate, then try ChainOfThought
v2 = dspy.ChainOfThought("question -> answer")
# Then optimize
v3 = optimizer.compile(v2, trainset=data)
2. Write Good Metrics
Your metric defines "success." Make it specific:
# Too simple
def metric(ex, pred, trace=None):
    return ex.answer == pred.answer

# Better: semantic similarity
from dspy.evaluate import SemanticF1
metric = SemanticF1(decompositional=True)

# Best: domain-specific
def metric(ex, pred, trace=None):
    correct = ex.answer.lower() in pred.answer.lower()
    concise = len(pred.answer.split()) < 100
    has_citation = "[source]" in pred.answer
    return correct and concise and has_citation
3. Use Evaluation Throughout
evaluate = dspy.Evaluate(
devset=dev_examples,
metric=your_metric,
num_threads=24,
display_progress=True
)
# Check before optimization
baseline_score = evaluate(your_program)
# Check after
optimized_score = evaluate(optimized_program)
4. Inspect What's Happening
# See the actual prompts being generated
dspy.inspect_history(n=1)
# Debug mode
import logging
logging.getLogger("dspy").setLevel(logging.DEBUG)
5. Save and Version Your Programs
# Save optimized program
optimized.save("v1_rag_2025_01.json")
# Load later
loaded = SimpleRAG()
loaded.load("v1_rag_2025_01.json")
The Bigger Picture
DSPy represents a fundamental shift in how we build with LLMs:
| Era | Approach | Analogy |
|---|---|---|
| 2022-2023 | Prompt engineering | Assembly language |
| 2024-2025 | DSPy + GEPA | High-level programming |
| Future | ??? | Compilers we haven't imagined |
The teams still hand-crafting prompts in 2025 are like developers still writing assembly in the age of Python. It works, but you're competing against people with better tools.
GEPA specifically validates a key thesis: natural language reflection is a more efficient optimization medium than scalar rewards. This has implications beyond DSPy—it suggests future AI systems will improve themselves through linguistic self-reflection, not just gradient descent.
Resources
Official
- Documentation: dspy.ai
- GitHub: github.com/stanfordnlp/dspy (30k+ stars)
- Discord: Active community with 10k+ members
Papers
- DSPy: "Compiling Declarative Language Model Calls into Self-Improving Pipelines" (ICLR 2024)
- MIPROv2: "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs"
- GEPA: "Reflective Prompt Evolution Can Outperform Reinforcement Learning" (July 2025)
Tutorials
- DSPy Intro Tutorial (official docs)
- RAG with DSPy (dspy.ai/tutorials)
- GEPA optimization guide (dspy.ai/tutorials/gepa)
Conclusion
Prompt engineering was a necessary first step. We had to learn what LLMs could do before we could systematize it.
But now we know. And DSPy + GEPA give us the tools to move beyond artisanal prompt crafting to real AI engineering.
The question isn't whether to adopt this paradigm. It's how soon.
Start with a simple program. Optimize it with GEPA. See the results. You won't go back to hand-tuning prompts.
Building something with DSPy? Found a pattern that works well? Share it with the community.