Python MIT orchestration prompt-optimization

DSPy

Stanford's framework for programming LLMs with optimizable modules instead of hand-written prompts

DSPy is a Stanford research framework that treats LLM programming as a software engineering problem. Instead of writing prompt strings, you write typed Signatures and compose them into Modules. Optimizers then compile your program, generating few-shot examples and instruction text that maximize a metric you define, without you hand-tuning a single prompt template.

There is a particular frustration that comes from spending two hours tuning a prompt, getting it to work well, then watching it degrade when you switch to a different model or add a new use case. Prompt engineering done by hand is brittle. It does not transfer. It does not improve automatically. You are essentially hard-coding the reasoning strategy for a system that does not stay constant.

DSPy was built to fix that. The framework from Stanford NLP treats LLM programs like software programs: you define the structure in code, and a compiler optimizes the implementation. The prompts are not the work product. They are the output of a compilation step.

What makes DSPy different

Most LLM frameworks ask you to write prompt templates. A template might say: "You are a helpful assistant. Given the question: {question}, provide a detailed answer." You spend time tuning that template. You add few-shot examples. You move instructions around. You test different phrasings.

DSPy says: describe what you want, define a quality metric, and let the optimizer figure out the prompt. You write a Signature that says "question -> answer" and a Module that uses it. You pick an optimizer. You run compilation over a small training set. The optimizer produces a version of your program with generated few-shot examples and instruction text that score well on your metric.

The practical result is that your source code no longer contains prompt strings. It contains declarations of intent. The prompt strings live in a compiled state file that you save separately and can regenerate if your requirements change.

Core concepts

Signatures

A Signature is a typed I/O contract. In its class-based form:

import dspy

class AnswerQuestion(dspy.Signature):
    """Answer the question accurately and concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

Or as a one-liner for quick use:

qa = dspy.Predict("question -> answer")

Each field carries a type annotation and an optional description that DSPy uses when generating prompts. The docstring becomes the task description in the compiled prompt template. This is how DSPy knows what you want the LLM to do without you writing the instruction text yourself.
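To make the shorthand concrete, here is a minimal sketch of how a string such as "question -> answer" can be read as a list of input names and a list of output names. The parse_signature helper is hypothetical and is not DSPy's parser; it assumes simple comma-separated field names with optional ": type" annotations and does not handle commas nested inside a type.

```python
def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split an 'inputs -> outputs' string into field-name lists (illustrative only)."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")

    def field_names(side: str) -> list[str]:
        names = []
        for part in side.split(","):
            # Drop an optional ': type' annotation and surrounding whitespace.
            name = part.split(":", 1)[0].strip()
            if not name:
                raise ValueError(f"empty field name in {spec!r}")
            names.append(name)
        return names

    return field_names(left), field_names(right)


inputs, outputs = parse_signature("findings, question -> answer")
print(inputs, outputs)
```

The class form carries the same information plus a docstring and per-field descriptions, which is why it is the better choice once a task needs more than field names.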

Signatures can have multiple inputs and outputs:

class ExtractEntities(dspy.Signature):
    """Extract named entities from the text."""
    text: str = dspy.InputField(desc="Source text to analyze")
    entities: list[str] = dspy.OutputField(desc="List of named entities found")
    entity_types: list[str] = dspy.OutputField(desc="Type of each entity (person, place, org)")

Modules

Modules are composable program units that use Signatures. DSPy ships several built-in modules:

dspy.Predict: Direct LLM call using the Signature
dspy.ChainOfThought: Adds a reasoning step before the output fields, improving accuracy on complex tasks
dspy.ReAct: Implements the Reason-Act loop for tool-using agents
dspy.ProgramOfThought: Generates and executes code to answer questions
dspy.Refine: Iteratively improves an output until it meets a quality threshold

You can compose these into multi-step programs:

class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        # web_search_tool is a function you define elsewhere and pass in as a tool
        self.search = dspy.ReAct("query -> findings", tools=[web_search_tool])
        self.synthesize = dspy.ChainOfThought("findings, question -> answer")

    def forward(self, question):
        findings = self.search(query=question)
        return self.synthesize(findings=findings.findings, question=question)

The forward method is where you define how data flows through the program. DSPy records the LLM calls made during forward and uses those traces during optimization.
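As a rough mental model of that tracing, the sketch below uses a hypothetical RecordingStep class (not a DSPy API) that appends every call's inputs and outputs to a shared list. DSPy's own trace format and internals differ; the point is only that each step executed inside forward leaves a record an optimizer can reuse later.

```python
from typing import Callable


class RecordingStep:
    """Stand-in for a module step that logs each call to a shared trace."""

    def __init__(self, name: str, fn: Callable[..., dict], trace: list):
        self.name = name
        self.fn = fn
        self.trace = trace

    def __call__(self, **inputs) -> dict:
        outputs = self.fn(**inputs)
        # Record which step ran, what it received, and what it produced.
        self.trace.append((self.name, inputs, outputs))
        return outputs


trace: list = []
# Deterministic placeholder functions stand in for LLM calls.
search = RecordingStep("search", lambda query: {"findings": f"notes on {query}"}, trace)
synthesize = RecordingStep(
    "synthesize",
    lambda findings, question: {"answer": f"Based on {findings}"},
    trace,
)


def forward(question: str) -> dict:
    found = search(query=question)
    return synthesize(findings=found["findings"], question=question)


result = forward("DSPy")
print([step for step, _, _ in trace])
```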

Optimizers (Teleprompters)

Optimizers are the compilation engine. You give them your program, a training set, and a metric function. They produce an optimized version.

import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShot

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

agent = ResearchAgent()

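def quality_metric(example, prediction, trace=None):
    # Score 1.0 when the predicted answer is one of the accepted answers
    return float(prediction.answer in example.expected_answers)

# train_examples is a list of dspy.Example objects carrying `question` and
# `expected_answers` fields, each marked with .with_inputs("question")
optimizer = BootstrapFewShot(metric=quality_metric, max_bootstrapped_demos=4)
compiled_agent = optimizer.compile(agent, trainset=train_examples)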

compiled_agent.save("research_agent_optimized.json")

BootstrapFewShot is the entry-level optimizer. It generates few-shot examples by running your program on training data and selecting the traces that score well on the metric. For more aggressive optimization, MIPROv2 also optimizes the instruction text in each module, which can produce larger quality gains at higher compilation cost.
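The selection idea behind bootstrapping can be sketched in plain Python. This is a simplified illustration, not BootstrapFewShot's actual implementation: bootstrap_demos, toy_program, and the dictionary-based examples are hypothetical stand-ins, and the real optimizer works with per-module traces instead of whole-program outputs.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Run the program on training examples and keep outputs that pass the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example, prediction):
            # A passing run becomes a few-shot demonstration.
            demos.append({"question": example["question"], "answer": prediction})
    return demos


def toy_program(question: str) -> str:
    # Deterministic stand-in for an LLM-backed program; it only "knows" one fact.
    return "Paris" if "France" in question else "unknown"


def exact_match(example, prediction) -> bool:
    return prediction in example["expected_answers"]


trainset = [
    {"question": "Capital of France?", "expected_answers": ["Paris"]},
    {"question": "Capital of Peru?", "expected_answers": ["Lima"]},
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
print(demos)
```

Only the example the program already answers correctly survives as a demonstration, which is why the metric ends up deciding what the compiled prompt contains.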

Assertions

DSPy Assertions let you specify constraints that outputs must satisfy:

class SafeAnswer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        result = self.generate(question=question)
        dspy.Assert(
            len(result.answer) < 500,
            "Answer must be concise, under 500 characters"
        )
        # Compare in lowercase so "Unfortunately" and "I cannot" are also caught
        dspy.Assert(
            not any(phrase in result.answer.lower() for phrase in ["unfortunately", "i cannot"]),
            "Answer should not refuse or hedge"
        )
        return result

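Assertions run after the LLM call and can trigger a retry with modified prompts if the constraint is violated. This is different from post-processing validation: DSPy feeds the failed assertion back into the prompt as additional context for the retry, which tells the LLM what went wrong. Recent releases (2.6 onward) steer users toward dspy.Refine and dspy.BestOfN in place of dspy.Assert and dspy.Suggest, so check which API your installed version supports.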
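The retry-with-feedback behaviour described above can be sketched as a small loop. The names here (generate_with_constraints, fake_generate) are hypothetical, and the generator is a deterministic placeholder, not an LLM; DSPy's own retry mechanics are more involved.

```python
def generate_with_constraints(generate, question, constraints, max_retries=2):
    """Call generate; on a failed constraint, retry with the failure messages as feedback."""
    feedback = []
    for _ in range(max_retries + 1):
        answer = generate(question, feedback)
        failures = [msg for check, msg in constraints if not check(answer)]
        if not failures:
            return answer
        # Feed the violated constraints into the next attempt.
        feedback = failures
    raise ValueError(f"constraints still failing after retries: {failures}")


def fake_generate(question, feedback):
    # Placeholder for an LLM call: it shortens its answer only when told to.
    if any("concise" in msg for msg in feedback):
        return "Paris."
    return "Unfortunately, " + "this is a very long answer. " * 30


constraints = [
    (lambda a: len(a) < 500, "Answer must be concise, under 500 characters"),
    (lambda a: "unfortunately" not in a.lower(), "Answer should not refuse or hedge"),
]
answer = generate_with_constraints(fake_generate, "Capital of France?", constraints)
print(answer)
```

The first attempt violates both constraints, and the second attempt succeeds because the failure messages were passed back in, which is the distinction from validating once and discarding the output.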

Configuring LLM providers

DSPy uses LiteLLM under the hood, which means you configure providers with a string:

# OpenAI
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

# Anthropic
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20241022"))

# Local model via Ollama
dspy.configure(lm=dspy.LM("ollama_chat/llama3.2", api_base="http://localhost:11434"))

# Per-module override
expensive_module = dspy.ChainOfThought("question -> answer")
expensive_module.set_lm(dspy.LM("openai/gpt-4o"))

Multi-model setups where different modules use different models are straightforward. You configure a default at the dspy.configure level and override per module where needed, which makes cost optimization easier.

Where DSPy wins

DSPy is the right tool when your task has a measurable quality metric and you need prompts to stay reliable as models change. Classification tasks, information extraction, factual QA, and multi-hop reasoning are all cases where compilation can find few-shot examples and instruction phrasing that a human would not arrive at through manual tuning.

The cross-model stability is another practical advantage. When you upgrade from one model to another, a hand-written prompt may need to be retuned. A DSPy program can be recompiled against the new model without changing any source code. The optimizer finds what works for the new model's behavior rather than you doing that discovery manually.

The multi-hop reasoning pattern is where DSPy shines most clearly. A ChainOfThought module with compiled few-shot examples for each reasoning step can produce better results on complex questions than a single prompt asking the model to think step by step, because the optimizer keeps the reasoning traces that score well on your metric, not the ones that merely look reasonable.

Where DSPy struggles

The learning curve is real. DSPy requires a different mental model than writing prompts. You need to understand Signatures, how Modules compose, how the optimizer uses training examples, and how Assertions interact with retries. Most developers coming from LangChain or direct API calls will spend a day just getting comfortable with the concepts before they can use them productively.

Compilation costs API calls. Running BootstrapFewShot on 50 training examples with a program that uses 3 modules will make hundreds of LLM calls. That's an upfront cost you pay once before the program is optimized, but it's a real cost, especially if you're iterating on your program design. Starting with a small training set (20-30 examples) and BootstrapFewShot before moving to heavier optimizers is the practical workflow.
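A back-of-envelope estimate makes the cost concrete. The function below is an assumption-laden upper bound, not DSPy's own accounting: it supposes each training example triggers one call per module per bootstrapping round, and optionally adds one evaluation pass over a held-out set.

```python
def estimate_llm_calls(n_train, n_modules, rounds=1, n_eval=0):
    """Rough upper bound on LLM calls for one bootstrapping run plus one evaluation pass."""
    bootstrap_calls = n_train * n_modules * rounds
    eval_calls = n_eval * n_modules
    return bootstrap_calls + eval_calls


# 50 training examples, 3 modules, one round, then scoring on 50 held-out examples.
print(estimate_llm_calls(n_train=50, n_modules=3))             # 150
print(estimate_llm_calls(n_train=50, n_modules=3, n_eval=50))  # 300
```

Each change to the program design repeats this cost, so multiplying by the number of design iterations you expect gives a more honest budget.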

Debugging compiled programs is harder. When something goes wrong in a compiled program, the generated prompt is spread across a state file rather than visible in your source code. You need to load the compiled state and inspect the generated examples and instructions. DSPy has improved its tracing and inspection tools, but it's still less immediate than looking at a prompt string in your code.

DSPy in production: what it actually looks like

Teams using DSPy in production generally follow the same workflow. They start with uncompiled modules, test them against manual examples to get baseline quality, then build a labeled evaluation set with 50-200 examples and a scoring function. They run compilation with BootstrapFewShot first since it's the cheapest optimizer, check the quality gain, then try MIPROv2 if they need more improvement.
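The baseline-then-compile comparison comes down to averaging a metric over a held-out set. A minimal sketch follows, using hypothetical placeholder programs and dictionary examples in place of real DSPy modules; in a real project the Evaluate utility in dspy.evaluate plays this role.

```python
def average_score(program, devset, metric):
    """Mean metric value of a program over a labeled dev set."""
    if not devset:
        raise ValueError("devset must not be empty")
    return sum(metric(ex, program(ex["question"])) for ex in devset) / len(devset)


def metric(example, prediction):
    return float(prediction == example["answer"])


devset = [
    {"question": "2+2", "answer": "4"},
    {"question": "3+3", "answer": "6"},
]


def baseline(question):
    # Placeholder "uncompiled" program that always gives the same answer.
    return "4"


def compiled(question):
    # Placeholder "compiled" program that answers both dev questions correctly.
    return {"2+2": "4", "3+3": "6"}[question]


print(average_score(baseline, devset, metric))  # 0.5
print(average_score(compiled, devset, metric))  # 1.0
```

Recording the baseline number before compiling is what lets you judge whether a heavier optimizer is worth its extra cost.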

The compiled state gets saved as a JSON file and loaded at inference time:

compiled_agent = optimizer.compile(agent, trainset=train_examples)
compiled_agent.save("compiled_v1.json")

# In production
agent = ResearchAgent()
agent.load("compiled_v1.json")

When you change the underlying model or add new training examples, you recompile and save a new version. Treating compiled state files as build artifacts (versioned, tested, deployed separately from code) is the pattern that keeps things manageable.
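One way to treat compiled state as a build artifact is to name each file by a hash of its contents and write a small metadata record next to it. This sketch is an assumption about workflow, not a DSPy feature: publish_artifact and the metadata fields are hypothetical, and the state dictionary stands in for whatever the save call writes.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def publish_artifact(state: dict, model: str, out_dir: Path) -> Path:
    """Write compiled state to a content-addressed JSON file with a metadata sidecar."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    artifact = out_dir / f"compiled_{digest}.json"
    artifact.write_bytes(payload)
    # The sidecar records which model the state was compiled against.
    meta = {"artifact": artifact.name, "model": model, "sha256_prefix": digest}
    artifact.with_suffix(".meta.json").write_text(json.dumps(meta, indent=2))
    return artifact


with tempfile.TemporaryDirectory() as tmp:
    state = {"demos": [{"question": "Capital of France?", "answer": "Paris"}]}
    first = publish_artifact(state, "openai/gpt-4o-mini", Path(tmp))
    second = publish_artifact(state, "openai/gpt-4o-mini", Path(tmp))
    same_name = first.name == second.name  # identical state yields the same file name
    round_trip = json.loads(first.read_text()) == state
    print(first.name, same_name, round_trip)
```

Content-addressed names make it obvious when a recompile actually changed the prompts, and the sidecar keeps the model identifier attached to the state it was tuned for.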

The evaluation step is the part most teams underinvest in initially. A DSPy program is only as good as the metric it was compiled against. If your metric is too lenient, the optimizer finds programs that technically score well but miss what users actually care about. If your metric is too narrow, the program scores well on the training set and poorly on real inputs. Getting the metric right is the actual hard problem in DSPy projects, not the framework usage.
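To illustrate how metric strictness changes what the optimizer is rewarded for, the sketch below contrasts a lenient substring check with a normalized exact match. Both follow the (example, prediction, trace=None) shape used earlier; the SimpleNamespace objects are stand-ins for DSPy examples and predictions, and the normalization rules are an assumption you would adapt to your task.

```python
import re
from types import SimpleNamespace


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def lenient_metric(example, prediction, trace=None) -> float:
    # Passes whenever the gold answer appears anywhere in the output.
    return float(normalize(example.answer) in normalize(prediction.answer))


def strict_metric(example, prediction, trace=None) -> float:
    # Passes only when the normalized output equals the gold answer.
    return float(normalize(example.answer) == normalize(prediction.answer))


example = SimpleNamespace(answer="Paris")
rambling = SimpleNamespace(answer="It might be Paris, or possibly Lyon.")
direct = SimpleNamespace(answer="paris.")

print(lenient_metric(example, rambling), strict_metric(example, rambling))  # 1.0 0.0
print(lenient_metric(example, direct), strict_metric(example, direct))      # 1.0 1.0
```

An optimizer compiled against the lenient metric has no reason to prefer the direct answer over the rambling one, which is the failure mode the paragraph above warns about.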

How it fits into a larger stack

DSPy is not a replacement for LangChain or LangGraph. It is a different tool that solves a specific problem: optimizing the quality of LLM module outputs against a metric. You can use DSPy modules inside a LangGraph workflow where certain nodes need optimized reasoning, while the overall control flow stays in LangGraph's graph model.

Similarly, DSPy pairs naturally with observability tools. LangSmith and MLflow both have native integrations that let you trace compiled programs in production without additional instrumentation. For teams building anything non-trivial, adding Langfuse or a similar observability layer on top of DSPy programs in production is worthwhile.

Who should use DSPy

NLP and ML teams who are used to thinking about training sets, metrics, and optimization will feel at home with DSPy's compilation model. It fits naturally into an ML workflow.

Teams with measurement discipline who already have labeled examples and a quality metric will get the most out of the optimizer. DSPy doesn't help much if you can't define what "good" means quantitatively.

Developers building RAG pipelines where retrieval quality and generation quality need to be co-optimized benefit from DSPy's ability to tune the full pipeline against an end-to-end metric, not just the generation step.

It's a harder sell for teams building conversational chatbots, simple routing agents, or applications where quality is evaluated subjectively by users. Those use cases don't have the labeled training data or objective metrics that make DSPy's optimizer useful.

Getting started

pip install dspy

A minimal RAG pipeline with optimization:

import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

class RAGSignature(dspy.Signature):
    """Answer questions using retrieved context."""
    context: str = dspy.InputField(desc="Retrieved passages from documents")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="Accurate answer based on the context")

class RAGModule(dspy.Module):
    def __init__(self):
        self.generate = dspy.ChainOfThought(RAGSignature)

    def forward(self, context, question):
        return self.generate(context=context, question=question)

rag = RAGModule()

# Test without compilation
result = rag(context="Paris is the capital of France.", question="What is the capital of France?")
print(result.answer)

# Compile with examples (requires labeled training data)
# from dspy.teleprompt import BootstrapFewShot
# optimizer = BootstrapFewShot(metric=your_metric)
# compiled_rag = optimizer.compile(rag, trainset=train_examples)
# compiled_rag.save("compiled_rag.json")

The full documentation at dspy.ai covers optimizers in depth and includes example programs for common tasks. The GitHub repository at stanfordnlp/dspy has notebooks that walk through compilation end-to-end.

The bottom line

DSPy is not for everyone. If you want to write prompt templates and iterate on them manually, it's more friction than you need. If you want to define program structure in code and let a compiler find the optimal prompts, it's one of the best tools available.

The 22,000-star community and the Stanford backing mean the framework is actively maintained and the research underneath it is credible. Version 2.6 is stable. The real question is whether your use case has the measurement infrastructure that makes compilation valuable. If it does, DSPy will save you significant time and produce more reliable programs than manual prompt tuning.

Key features

Signatures: typed input/output contracts that replace hand-written prompt templates
Modules: composable building blocks (ChainOfThought, ReAct, Predict, ProgramOfThought)
Optimizers (Teleprompters) that auto-generate few-shot examples and prompt instructions
Compile-time optimization against a metric without touching source code
Multi-model support across OpenAI, Anthropic, Cohere, Mistral, and local models via LiteLLM
Assertions for specifying constraints the LLM output must satisfy
MLflow and LangSmith tracing integrations for observability

Frequently Asked Questions

What is DSPy?

DSPy is a Python framework from Stanford NLP for building LLM programs using modules and signatures instead of prompt strings. You define the inputs and outputs your program needs, compose modules that implement the logic, and run an optimizer that compiles the program into effective few-shot prompts and instruction text. The key idea is that prompts are an implementation detail that an optimizer should handle, not something developers should write by hand.

What is a DSPy Signature?

A Signature is a typed declaration of what a DSPy module takes in and produces. You write it as a class or a shorthand string like "question -> answer". Each field has a type annotation and an optional description. DSPy uses Signatures to generate prompt templates and to validate that outputs match the declared format. Signatures replace hand-written prompt templates and make the expected I/O contract explicit in your code.

What is DSPy compilation?

Compilation is the process of running a DSPy optimizer (like BootstrapFewShot or MIPROv2) against your program and a training set with a quality metric. The optimizer tries different few-shot examples and instruction variations, evaluates them against your metric, and selects the combination that scores best. The result is an optimized version of your program that you save and load at inference time. Compilation costs API calls upfront but replaces the ongoing cost of manual prompt engineering.

How does DSPy compare to LangChain?

LangChain is a general orchestration framework with a large library of integrations for data sources, memory backends, and LLM providers. DSPy is specifically designed for programs where prompt quality can be systematically optimized against a metric. LangChain gives you more integration options and a larger community. DSPy gives you a compiler that makes your prompts better automatically, which matters more as your use case gets more complex. They are also not mutually exclusive: you can use DSPy modules inside a LangChain pipeline.

Is DSPy production-ready?

DSPy is used in production by teams who have specific needs it serves well: classification tasks, information extraction, multi-hop reasoning, and RAG quality improvement. It is not a general-purpose agent framework. For simple chatbot or routing tasks, it adds overhead that is not worth the benefit. For tasks where output quality is measurable and needs to stay stable across model updates, DSPy's compiled programs are more reliable than hand-tuned prompts. At version 2.6 with over 22,000 GitHub stars, the framework is stable for production use in appropriate contexts.