DSPy
Stanford's framework for programming LLMs with optimizable modules instead of hand-written prompts
DSPy is a Stanford research framework that treats LLM programming as a software engineering problem. Instead of writing prompt strings, you write typed Signatures and compose them into Modules. Optimizers then compile your program, generating few-shot examples and instruction text that maximize a metric you define, without you hand-tuning a single prompt template.
There is a particular frustration that comes from spending two hours tuning a prompt, getting it to work well, then watching it degrade when you switch to a different model or add a new use case. Prompt engineering done by hand is brittle. It does not transfer. It does not improve automatically. You are essentially hard-coding the reasoning strategy for a system that does not stay constant.
DSPy was built to fix that. The framework from Stanford NLP treats LLM programs like software programs: you define the structure in code, and a compiler optimizes the implementation. The prompts are not the work product. They are the output of a compilation step.
What makes DSPy different
Most LLM frameworks ask you to write prompt templates. A template might say: "You are a helpful assistant. Given the question: {question}, provide a detailed answer: {answer}." You spend time tuning that template. You add few-shot examples. You move instructions around. You test different phrasings.
DSPy says: describe what you want, define a quality metric, and let the optimizer figure out the prompt. You write a Signature that says "question -> answer" and a Module that uses it. You pick an optimizer. You run compilation over a small training set. The optimizer produces a version of your program with generated few-shot examples and instruction text that score well on your metric.
The practical result is that your source code no longer contains prompt strings. It contains declarations of intent. The prompt strings live in a compiled state file that you save separately and can regenerate if your requirements change.
Core concepts
Signatures
A Signature is a typed I/O contract. In its class-based form:
import dspy

class AnswerQuestion(dspy.Signature):
    """Answer the question accurately and concisely."""

    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
Or as a one-liner for quick use:
qa = dspy.Predict("question -> answer")
Each field carries a type annotation and an optional description that DSPy uses when generating prompts. The docstring becomes the task description in the compiled prompt template. This is how DSPy knows what you want the LLM to do without you writing the instruction text yourself.
Signatures can have multiple inputs and outputs:
class ExtractEntities(dspy.Signature):
    """Extract named entities from the text."""

    text: str = dspy.InputField(desc="Source text to analyze")
    entities: list[str] = dspy.OutputField(desc="List of named entities found")
    entity_types: list[str] = dspy.OutputField(desc="Type of each entity (person, place, org)")
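To make the idea of a Signature as a contract more concrete, here is a minimal sketch in plain Python of how a shorthand string such as "context, question -> answer" could be split into fields and turned into prompt text. ToySignature, parse_shorthand, and render_prompt are hypothetical names invented for this illustration; they are not DSPy APIs, and the prompt layout DSPy actually generates is different.

```python
from dataclasses import dataclass


@dataclass
class ToySignature:
    """Hypothetical stand-in for a signature: instructions plus field names."""
    instructions: str
    inputs: list[str]
    outputs: list[str]


def parse_shorthand(spec: str, instructions: str = "") -> ToySignature:
    """Split 'a, b -> c' into input and output field names."""
    left, sep, right = spec.partition("->")
    if not sep:
        raise ValueError(f"expected 'inputs -> outputs', got {spec!r}")
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return ToySignature(instructions, inputs, outputs)


def render_prompt(sig: ToySignature, **values: str) -> str:
    """Render one plausible prompt layout (an assumption, not DSPy's format)."""
    missing = [name for name in sig.inputs if name not in values]
    if missing:
        raise KeyError(f"missing input fields: {missing}")
    lines = [sig.instructions] if sig.instructions else []
    lines += [f"{name}: {values[name]}" for name in sig.inputs]
    # Output fields are left blank for the model to complete.
    lines += [f"{name}:" for name in sig.outputs]
    return "\n".join(lines)


sig = parse_shorthand("context, question -> answer", "Answer using the context.")
prompt = render_prompt(
    sig,
    context="Paris is the capital of France.",
    question="What is the capital of France?",
)
print(prompt)
```

The point of the sketch is that the source code only names fields and intent; the prompt text is derived from them, which is what lets an optimizer rewrite it later.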
Modules
Modules are composable program units that use Signatures. DSPy ships several built-in modules:
- dspy.Predict: Direct LLM call using the Signature
- dspy.ChainOfThought: Adds a reasoning step before the output fields, improving accuracy on complex tasks
- dspy.ReAct: Implements the Reason-Act loop for tool-using agents
- dspy.ProgramOfThought: Generates and executes code to answer questions
- dspy.Refine: Iteratively improves an output until it meets a quality threshold
You can compose these into multi-step programs:
class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        # web_search_tool is assumed to be a tool function defined elsewhere.
        self.search = dspy.ReAct("query -> findings", tools=[web_search_tool])
        self.synthesize = dspy.ChainOfThought("findings, question -> answer")

    def forward(self, question):
        findings = self.search(query=question)
        return self.synthesize(findings=findings.findings, question=question)
The forward method is where you define how data flows through the program. DSPy records the LLM calls made during forward and uses those traces during optimization.
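The idea of recording calls made during forward can be sketched without any LLM at all. The following is a toy illustration in plain Python, not DSPy's tracing mechanism: TracedStep, ToyPipeline, and echo_lm are hypothetical names, and the fake model simply returns a deterministic string so the example runs offline.

```python
from typing import Callable

# Hypothetical stand-in for an LM client: maps a prompt string to a completion.
FakeLM = Callable[[str], str]


class TracedStep:
    """A toy 'module': formats a prompt, calls the LM, and logs the call."""

    def __init__(self, name: str, template: str, lm: FakeLM, trace: list[dict]):
        self.name = name
        self.template = template
        self.lm = lm
        self.trace = trace  # shared list, so all steps log to one place

    def __call__(self, **inputs: str) -> str:
        prompt = self.template.format(**inputs)
        output = self.lm(prompt)
        self.trace.append({"step": self.name, "inputs": inputs, "output": output})
        return output


class ToyPipeline:
    """Two steps wired together in forward(), mirroring the ResearchAgent shape."""

    def __init__(self, lm: FakeLM):
        self.trace: list[dict] = []
        self.search = TracedStep("search", "Find facts about: {query}", lm, self.trace)
        self.synthesize = TracedStep(
            "synthesize", "Facts: {findings}\nQuestion: {question}", lm, self.trace
        )

    def forward(self, question: str) -> str:
        findings = self.search(query=question)
        return self.synthesize(findings=findings, question=question)


def echo_lm(prompt: str) -> str:
    """Deterministic fake model so the sketch runs without network access."""
    return f"<completion for {len(prompt)} chars>"


pipeline = ToyPipeline(echo_lm)
answer = pipeline.forward("What is DSPy?")
print([entry["step"] for entry in pipeline.trace])
```

Each trace entry pairs a step's inputs with its output, which is the raw material an optimizer needs in order to turn successful runs into few-shot examples.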
Optimizers (Teleprompters)
Optimizers are the compilation engine. You give them your program, a training set, and a metric function. They produce an optimized version.
import dspy
from dspy.teleprompt import BootstrapFewShot

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

agent = ResearchAgent()

def quality_metric(example, prediction, trace=None):
    # Score 1.0 when the predicted answer is one of the accepted answers.
    return float(prediction.answer in example.expected_answers)

# train_examples is assumed to be a list of labeled examples prepared beforehand.
optimizer = BootstrapFewShot(metric=quality_metric, max_bootstrapped_demos=4)
compiled_agent = optimizer.compile(agent, trainset=train_examples)
compiled_agent.save("research_agent_optimized.json")
BootstrapFewShot is the entry-level optimizer. It generates few-shot examples by running your program on training data and selecting the traces that score well on the metric. For more aggressive optimization, MIPROv2 also optimizes the instruction text in each module, which can produce larger quality gains at higher compilation cost.
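The selection step described above (run the program on training data, keep the runs that pass the metric) can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the BootstrapFewShot implementation: toy_program is a lookup table standing in for an LLM-backed program, and bootstrap_demos is a hypothetical helper.

```python
from typing import Callable

Example = dict  # e.g. {"question": ..., "expected_answers": [...]}


def toy_program(question: str) -> str:
    """Stand-in for an uncompiled program; a lookup table instead of an LLM."""
    canned = {
        "capital of France?": "Paris",
        "2 + 2?": "4",
        "largest planet?": "Saturn",  # deliberately wrong
    }
    return canned.get(question, "unknown")


def quality_metric(example: Example, answer: str) -> float:
    """1.0 if the answer is one of the accepted answers, else 0.0."""
    return float(answer in example["expected_answers"])


def bootstrap_demos(
    program: Callable[[str], str],
    trainset: list[Example],
    metric: Callable[[Example, str], float],
    max_demos: int = 4,
) -> list[dict]:
    """Run the program on each example and keep the runs that pass the metric."""
    demos: list[dict] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break  # enough demonstrations collected
        answer = program(example["question"])
        if metric(example, answer) >= 1.0:
            demos.append({"question": example["question"], "answer": answer})
    return demos


trainset = [
    {"question": "capital of France?", "expected_answers": ["Paris"]},
    {"question": "largest planet?", "expected_answers": ["Jupiter"]},
    {"question": "2 + 2?", "expected_answers": ["4", "four"]},
]
demos = bootstrap_demos(toy_program, trainset, quality_metric, max_demos=4)
print(demos)
```

The failed "largest planet?" run is filtered out, so only demonstrations that actually satisfied the metric would end up in the compiled prompt.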
Assertions
DSPy Assertions let you specify constraints that outputs must satisfy:
class SafeAnswer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        result = self.generate(question=question)
        dspy.Assert(
            len(result.answer) < 500,
            "Answer must be concise, under 500 characters"
        )
        # Compare in lowercase so "Unfortunately" and "i cannot" are caught too.
        lowered = result.answer.lower()
        dspy.Assert(
            not any(phrase in lowered for phrase in ["unfortunately", "i cannot"]),
            "Answer should not refuse or hedge"
        )
        return result
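Assertions run after the LLM call and can trigger a retry with modified prompts if the constraint is violated. This is different from post-processing validation: DSPy feeds the failed assertion's message back into the prompt as additional context for the retry, which tells the LLM what went wrong. Note that dspy.Assert belongs to earlier DSPy releases; as of version 2.6 the project's documentation points to dspy.Refine and dspy.BestOfN for this kind of constraint-driven retry, so check which one your installed version supports.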
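The retry-with-feedback behavior can be illustrated with a small loop in plain Python. This is a sketch of the general pattern only, under the assumption that the failure message is appended to the next prompt; generate_with_feedback and fake_lm are hypothetical names, and DSPy's own retry logic and prompt wording differ.

```python
from typing import Callable

# (predicate, message shown to the model when the predicate fails)
Check = tuple[Callable[[str], bool], str]


def generate_with_feedback(
    lm: Callable[[str], str],
    question: str,
    checks: list[Check],
    max_retries: int = 2,
) -> tuple[str, int]:
    """Call the LM; on a failed check, retry with the failure messages appended."""
    prompt = f"Question: {question}"
    failures: list[str] = []
    for attempt in range(max_retries + 1):
        answer = lm(prompt)
        failures = [message for passes, message in checks if not passes(answer)]
        if not failures:
            return answer, attempt
        # Feed the violated constraints back so the next attempt can correct them.
        prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            f"Fix these problems: {'; '.join(failures)}"
        )
    raise ValueError(f"constraints still violated after {max_retries} retries: {failures}")


def fake_lm(prompt: str) -> str:
    """Offline stand-in: refuses at first, complies once it sees feedback."""
    if "Fix these problems" in prompt:
        return "DSPy compiles LLM programs against a metric."
    return "Unfortunately, I cannot answer that."


checks: list[Check] = [
    (lambda a: len(a) < 500, "Answer must be under 500 characters"),
    (
        lambda a: not any(p in a.lower() for p in ("unfortunately", "i cannot")),
        "Answer should not refuse",
    ),
]
answer, retries = generate_with_feedback(fake_lm, "What does DSPy do?", checks)
print(retries, answer)
```

A plain validator would stop at the first failure; the difference here is that the failure message becomes part of the next prompt.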
Configuring LLM providers
DSPy uses LiteLLM under the hood, which means you configure providers with a string:
# OpenAI
dspy.configure(lm=dspy.LM("openai/gpt-4o"))
# Anthropic
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20241022"))
# Local model via Ollama
dspy.configure(lm=dspy.LM("ollama_chat/llama3.2", api_base="http://localhost:11434"))
# Per-module override
expensive_module = dspy.ChainOfThought("question -> answer")
expensive_module.set_lm(dspy.LM("openai/gpt-4o"))
Multi-model setups where different modules use different models are straightforward. You configure a default at the dspy.configure level and override per module where needed, which makes cost optimization easier.
Where DSPy wins
DSPy is the right tool when your task has a measurable quality metric and you need prompts to stay reliable as models change. Classification tasks, information extraction, factual QA, and multi-hop reasoning are all cases where compilation can find few-shot examples and instruction phrasing that a human would not arrive at through manual tuning.
The cross-model stability is another practical advantage. When you upgrade from one model to another, a hand-written prompt may need to be retuned. A DSPy program can be recompiled against the new model without changing any source code. The optimizer finds what works for the new model's behavior rather than you doing that discovery manually.
The multi-hop reasoning pattern is where DSPy shines most clearly. A ChainOfThought module with compiled few-shot examples for each reasoning step can produce better results on complex questions than a single prompt asking the model to think step by step, because the optimizer keeps the reasoning traces that score well on the metric, not the ones that merely look reasonable.
Where DSPy struggles
The learning curve is real. DSPy requires a different mental model than writing prompts. You need to understand Signatures, how Modules compose, how the optimizer uses training examples, and how Assertions interact with retries. Most developers coming from LangChain or direct API calls will spend a day just getting comfortable with the concepts before they can use them productively.
Compilation costs API calls. Running BootstrapFewShot on 50 training examples with a program that uses 3 modules can make hundreds of LLM calls, since every training example the optimizer runs triggers a call in each module. That's an upfront cost you pay once before the program is optimized, but it's a real cost, especially if you're iterating on your program design. Starting with a small training set (20-30 examples) and BootstrapFewShot before moving to heavier optimizers is the practical workflow.
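A rough back-of-the-envelope estimate helps when budgeting a compilation run. The sketch below assumes one LLM call per module per example and that the optimizer visits every training example; optimizers that stop early once they have enough demonstrations will make fewer calls. The token count and price are placeholder values chosen for illustration, not real pricing.

```python
def estimate_lm_calls(
    train_examples: int,
    modules: int,
    passes: int = 1,
    eval_examples: int = 0,
    eval_runs: int = 0,
) -> int:
    """Upper-bound estimate: one LLM call per module per example processed."""
    bootstrap_calls = train_examples * modules * passes
    eval_calls = eval_examples * modules * eval_runs
    return bootstrap_calls + eval_calls


def estimate_cost_usd(
    calls: int, tokens_per_call: int, usd_per_million_tokens: float
) -> float:
    """Convert a call count into dollars using a flat per-token price."""
    return calls * tokens_per_call / 1_000_000 * usd_per_million_tokens


# 50 training examples through a 3-module program, single pass.
single_pass = estimate_lm_calls(train_examples=50, modules=3)

# Same run plus two evaluation passes over a 50-example dev set.
with_eval = estimate_lm_calls(50, 3, eval_examples=50, eval_runs=2)

# Placeholder numbers: 1,500 tokens per call at $0.50 per million tokens.
cost = estimate_cost_usd(with_eval, tokens_per_call=1500, usd_per_million_tokens=0.50)
print(single_pass, with_eval, round(cost, 4))
```

Under these assumptions a single pass is 150 calls and the run with evaluation is 450, which is where "hundreds of calls" comes from even for a small program.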
Debugging compiled programs is harder. When something goes wrong in a compiled program, the generated prompt is spread across a state file rather than visible in your source code. You need to load the compiled state and inspect the generated examples and instructions. DSPy has improved its tracing and inspection tools, but it's still less immediate than looking at a prompt string in your code.
DSPy in production: what it actually looks like
Teams using DSPy in production generally follow the same workflow. They start with uncompiled modules, test them against manual examples to get baseline quality, then build a labeled evaluation set with 50-200 examples and a scoring function. They run compilation with BootstrapFewShot first since it's the cheapest optimizer, check the quality gain, then try MIPROv2 if they need more improvement.
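The baseline-then-compare step of that workflow can be sketched as a small evaluation harness. This is a plain-Python illustration, not DSPy's evaluation utility: the two "programs" are lookup tables standing in for an uncompiled and a compiled module, and TARGET_SCORE is a hypothetical threshold a team might set for itself.

```python
from statistics import mean
from typing import Callable

Program = Callable[[str], str]
Metric = Callable[[str, str], float]


def exact_match(expected: str, predicted: str) -> float:
    """Case-insensitive exact match, ignoring surrounding whitespace."""
    return float(predicted.strip().lower() == expected.strip().lower())


def evaluate(program: Program, devset: list[tuple[str, str]], metric: Metric) -> float:
    """Average metric score of a program over (question, expected) pairs."""
    return mean(metric(expected, program(question)) for question, expected in devset)


def baseline_program(question: str) -> str:
    """Stand-in for the uncompiled program."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(question, "unknown")


def compiled_program(question: str) -> str:
    """Stand-in for the program after a first, cheap optimization pass."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4", "largest planet?": "Jupiter"}
    return answers.get(question, "unknown")


devset = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("largest planet?", "Jupiter"),
    ("author of Hamlet?", "Shakespeare"),
]

baseline_score = evaluate(baseline_program, devset, exact_match)
compiled_score = evaluate(compiled_program, devset, exact_match)

TARGET_SCORE = 0.9  # hypothetical quality bar
needs_heavier_optimizer = compiled_score < TARGET_SCORE
print(baseline_score, compiled_score, needs_heavier_optimizer)
```

Keeping the dev set separate from the training set matters here: the comparison is only meaningful if neither program was tuned on the examples being scored.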
The compiled state gets saved as a JSON file and loaded at inference time:
compiled_agent = optimizer.compile(agent, trainset=train_examples)
compiled_agent.save("compiled_v1.json")
# In production
agent = ResearchAgent()
agent.load("compiled_v1.json")
When you change the underlying model or add new training examples, you recompile and save a new version. Treating compiled state files as build artifacts (versioned, tested, deployed separately from code) is the pattern that keeps things manageable.
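One way to treat compiled state as a build artifact is to derive the filename from its content, so that any change in model or state yields a new version. The sketch below uses only the standard library; artifact_name and write_artifact are hypothetical helpers, and the state dictionary is a made-up example rather than the actual layout of a file written by save().

```python
import hashlib
import json
import tempfile
from pathlib import Path


def artifact_name(program_name: str, model: str, state: dict) -> str:
    """Derive a filename from the content, so changed state gets a new name."""
    payload = json.dumps({"model": model, "state": state}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{program_name}-{digest}.json"


def write_artifact(directory: Path, program_name: str, model: str, state: dict) -> Path:
    """Write the state next to the model name it was compiled against."""
    path = directory / artifact_name(program_name, model, state)
    path.write_text(json.dumps({"model": model, "state": state}, sort_keys=True, indent=2))
    return path


# Made-up compiled state for illustration only.
state = {
    "generate": {
        "instructions": "Answer concisely.",
        "demos": [{"question": "2 + 2?", "answer": "4"}],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_artifact(Path(tmp), "research_agent", "openai/gpt-4o-mini", state)
    loaded = json.loads(path.read_text())

name_a = artifact_name("research_agent", "openai/gpt-4o-mini", state)
name_b = artifact_name("research_agent", "openai/gpt-4o", state)
print(name_a != name_b)
```

Recording the model name inside the artifact makes it harder to load state compiled for one model into a deployment that runs another.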
The evaluation step is the part most teams underinvest in initially. A DSPy program is only as good as the metric it was compiled against. If your metric is too lenient, the optimizer finds programs that technically score well but miss what users actually care about. If your metric is too narrow, the program scores well on the training set and poorly on real inputs. Getting the metric right is the actual hard problem in DSPy projects, not the framework usage.
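The difference between a lenient and a narrow metric is easy to see on a few hand-picked predictions. The three metric functions below are illustrative assumptions written in plain Python, not metrics shipped with DSPy.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(no_punct.split())


def lenient_metric(expected: str, predicted: str) -> float:
    """Passes if the expected answer appears anywhere in the prediction."""
    return float(normalize(expected) in normalize(predicted))


def strict_metric(expected: str, predicted: str) -> float:
    """Passes only on a character-for-character match."""
    return float(predicted == expected)


def normalized_exact_metric(expected: str, predicted: str) -> float:
    """Passes if the answers match after normalization."""
    return float(normalize(predicted) == normalize(expected))


expected = "Paris"
predictions = {
    "clean": "Paris",
    "punctuated": "Paris.",
    "hedged": "It might be Paris, or possibly Lyon or Marseille.",
}

scores = {
    label: (
        lenient_metric(expected, text),
        strict_metric(expected, text),
        normalized_exact_metric(expected, text),
    )
    for label, text in predictions.items()
}
for label, row in scores.items():
    print(label, row)
```

The lenient metric rewards the hedged answer that names three cities, while the strict one rejects a correct answer over a trailing period; an optimizer compiled against either would learn the wrong lesson.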
How it fits into a larger stack
DSPy is not a replacement for LangChain or LangGraph. It is a different tool that solves a specific problem: optimizing the quality of LLM module outputs against a metric. You can use DSPy modules inside a LangGraph workflow where certain nodes need optimized reasoning, while the overall control flow stays in LangGraph's graph model.
Similarly, DSPy pairs naturally with observability tools. LangSmith and MLflow both offer integrations that let you trace compiled programs in production with little additional instrumentation. For teams building anything non-trivial, putting one of these, Langfuse, or a similar observability layer on top of DSPy programs in production is worthwhile.
Who should use DSPy
NLP and ML teams who are used to thinking about training sets, metrics, and optimization will feel at home with DSPy's compilation model. It fits naturally into an ML workflow.
Teams with measurement discipline who already have labeled examples and a quality metric will get the most out of the optimizer. DSPy doesn't help much if you can't define what "good" means quantitatively.
Developers building RAG pipelines where retrieval quality and generation quality need to be co-optimized benefit from DSPy's ability to tune the full pipeline against an end-to-end metric, not just the generation step.
It's a harder sell for teams building conversational chatbots, simple routing agents, or applications where quality is evaluated subjectively by users. Those use cases don't have the labeled training data or objective metrics that make DSPy's optimizer useful.
Getting started
pip install dspy
A minimal RAG pipeline with optimization:
import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

class RAGSignature(dspy.Signature):
    """Answer questions using retrieved context."""

    context: str = dspy.InputField(desc="Retrieved passages from documents")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="Accurate answer based on the context")

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(RAGSignature)

    def forward(self, context, question):
        return self.generate(context=context, question=question)

rag = RAGModule()

# Test without compilation
result = rag(context="Paris is the capital of France.", question="What is the capital of France?")
print(result.answer)

# Compile with examples (requires labeled training data)
# from dspy.teleprompt import BootstrapFewShot
# optimizer = BootstrapFewShot(metric=your_metric)
# compiled_rag = optimizer.compile(rag, trainset=train_examples)
# compiled_rag.save("compiled_rag.json")
The full documentation at dspy.ai covers optimizers in depth and includes example programs for common tasks. The GitHub repository at stanfordnlp/dspy has notebooks that walk through compilation end-to-end.
The bottom line
DSPy is not for everyone. If you want to write prompt templates and iterate on them manually, it's more friction than you need. If you want to define program structure in code and let a compiler find the optimal prompts, it's one of the best tools available.
The 22,000-star community and the Stanford backing mean the framework is actively maintained and the research underneath it is credible. Version 2.6 is stable. The real question is whether your use case has the measurement infrastructure that makes compilation valuable. If it does, DSPy will save you significant time and produce more reliable programs than manual prompt tuning.
Key features
- Signatures: typed input/output contracts that replace hand-written prompt templates
- Modules: composable building blocks (ChainOfThought, ReAct, Predict, ProgramOfThought)
- Optimizers (Teleprompters) that auto-generate few-shot examples and prompt instructions
- Compile-time optimization against a metric without touching source code
- Multi-model support across OpenAI, Anthropic, Cohere, Mistral, and local models via LiteLLM
- Assertions for specifying constraints the LLM output must satisfy
- MLflow and LangSmith tracing integrations for observability