How to Evaluate an AI Vendor in 2026: A Real Procurement Framework
Every procurement team in 2026 is being asked to move faster on AI vendor decisions than they're comfortable with. The business side wants the tool deployed last quarter. Legal wants a six-month review cycle. Security says the questionnaire isn't even finished yet. Meanwhile, someone in sales has already put a corporate card down and started using it.
I've watched this play out at enough companies to have a framework that works. It's not a 100-question security questionnaire. It's the eight things that actually matter, in the order that matters.
Start with the data question, not the model question
The most common mistake in AI vendor evaluation is leading with model quality. "What model does it use? How does it score on MMLU? Does it use GPT-5 or Claude?" These are legitimate questions, but they're not the first ones.
The first question is: what data does this vendor see, and what do they do with it?
This sounds obvious, but the answer is often buried. Some AI vendors train on your data by default. Some use it for "service improvement," which can mean anything. Some claim zero data retention but rely on a subprocessor that operates under different terms. Some have enterprise tiers with genuine data isolation and consumer tiers where your data is fair game.
Ask specifically:
- Is my data used to train or fine-tune any model, including third-party models?
- What is your data retention policy? How long do you store prompts and completions?
- Do you have a subprocessor list? What are the data handling terms for each?
- What happens to my data if I cancel?
Get these answers in writing, not from a sales call. The contract and the Data Processing Agreement (DPA) govern, not what the AE told you on a demo.
The DPA is more important than the vendor agreement
Most procurement teams spend 80% of their legal review on the master subscription agreement and 20% on the DPA. That's backwards for AI tools.
The DPA is where the critical stuff lives: how your data is classified, whether the vendor acts as a processor or controller, what happens in a breach, how long data is retained, and what rights you have to audit or delete.
Specific DPA provisions to check:
Data residency: Is processing happening in a geography your legal team has cleared? EU-based companies with GDPR obligations need to confirm where inference is occurring. Some vendors' EU data residency is real (data never leaves EU servers). Some is nominal (your data routes through US servers "transiently" before processing in the EU).
Training opt-out: Enterprise tiers at most major vendors (OpenAI, Anthropic, Google) exclude customer data from model training by default. Make sure you're on the right tier and that the exclusion is explicit in your DPA, not just implied by your subscription level.
Subprocessor notification: You want the right to be notified before the vendor adds a new subprocessor, not just after. Some DPAs bury a 30-day notice clause under which continuing to use the service counts as your consent to the change.
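Security incident timing: Under GDPR, a controller must notify its supervisory authority within 72 hours of becoming aware of a personal data breach, and a processor must notify the controller "without undue delay." If you are the controller, that 72-hour clock is yours, and a slow vendor eats into it. Some AI vendor DPAs leave the vendor's own deadline vague. Nail down a specific number of hours if you're operating under GDPR.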
SOC 2 Type II: what it actually tells you
SOC 2 Type II is table stakes for any vendor that touches business data. If a vendor can't produce a current (less than 12 months old) SOC 2 Type II report, that's a meaningful red flag, not a technicality.
What SOC 2 actually tells you: an independent auditor reviewed the vendor's security controls over a period (typically 6-12 months) and found them consistent with their stated policies. It's not a pass/fail security grade. It's evidence that the company has organized security controls and has them audited.
What it doesn't tell you: whether those controls are sufficient for your use case, whether a zero-day vulnerability exists, or whether the vendor will be financially stable enough to maintain those controls next year.
For AI vendors specifically, also ask about:
- AI-specific security controls: Can the model be prompted to exfiltrate data through its responses? Has the vendor done adversarial testing (red-teaming) of their AI system specifically?
- Prompt injection defense: If your application sends user-controlled text to the AI, what prevents an adversarial user from overriding your system prompt?
- Audit logging: Can you retrieve complete logs of all API calls, including prompts and completions, for your security and compliance review?
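If the vendor's answer on audit logging is weak, you can keep your own record at the call site. This is a minimal sketch, not a compliance solution. `call_model` is a hypothetical stand-in for whatever client call the vendor provides, and the record fields are my own illustrative choice.

```python
import json
import time
from typing import Callable, List


def audited_call(call_model: Callable[[str], str], prompt: str,
                 log: List[dict]) -> str:
    """Call the model and append one audit record per successful request.

    `call_model` is a hypothetical stand-in for the vendor's client call.
    """
    started = time.time()
    completion = call_model(prompt)
    log.append({
        "timestamp": started,
        "latency_s": time.time() - started,
        "prompt": prompt,
        "completion": completion,
    })
    return completion


def to_jsonl(log: List[dict]) -> str:
    """Serialize the records as JSON Lines for export to a log store."""
    return "\n".join(json.dumps(record) for record in log)
```

Note that a failed call raises before any record is written. A production version would log errors too, and the log itself now holds prompts and completions, so it needs the same retention and access rules you are asking the vendor about.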
Model swap risk: the sleeper issue
Here's a risk that most procurement teams don't ask about: what happens when the vendor changes the underlying model?
In 2024 and 2025, several AI product vendors swapped their underlying models without warning. A tool that was running on GPT-4-turbo moved to GPT-4o. Another moved from Claude 2 to Claude 3 Sonnet. From a capability standpoint, these were improvements. From a production application standpoint, they broke things.
Behavior changes between model versions are real. A prompt that produced structured JSON reliably on the old model might produce prose with the new one. A classifier that had 94% accuracy might drop to 89% on the new model. An assistant that politely refused certain requests might handle them differently.
Ask vendors:
- Do you notify customers before changing the underlying model?
- Can we pin to a specific model version? For how long?
- What's your policy on deprecating model versions?
- Do you maintain backwards-compatible model aliases (like "claude-3-sonnet" staying on the same checkpoint)?
If you're building a production application on an AI vendor's product layer (not the raw API), model swap risk is higher because you have less control. At the API level, you can pin to a specific model checkpoint. At the product layer, you're dependent on the vendor's upgrade cycle.
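Whatever the vendor promises about versioning, it helps to have a regression check you can rerun the day a model changes. The sketch below assumes a classification-style task where the model should answer with a JSON object containing a "label" field. `call_model`, the case format, and the 0.9 threshold are all hypothetical choices for illustration.

```python
import json
from typing import Callable, List, Tuple


def regression_report(call_model: Callable[[str], str],
                      cases: List[Tuple[str, str]],
                      min_accuracy: float = 0.9) -> dict:
    """Run golden cases through the model and summarize the results.

    Each case is (prompt, expected_label). A reply that is not a JSON
    object with a "label" field counts as a format failure.
    """
    correct = 0
    format_failures = 0
    for prompt, expected in cases:
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            format_failures += 1  # e.g. the model answered in prose
            continue
        if not isinstance(parsed, dict) or "label" not in parsed:
            format_failures += 1
            continue
        if parsed["label"] == expected:
            correct += 1
    accuracy = correct / len(cases) if cases else 0.0
    return {
        "accuracy": accuracy,
        "format_failures": format_failures,
        "passed": accuracy >= min_accuracy and format_failures == 0,
    }
```

Run it against the pinned version to get a baseline, then against any new version before you switch. Tracking format failures separately from accuracy tells you whether the new model got worse at the task or just stopped following your output format.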
Vendor lock-in: harder to escape than it looks
AI vendor lock-in is different from regular SaaS lock-in. With a regular SaaS tool, the lock-in is in your data and workflows. With an AI tool, the lock-in is in your prompts, fine-tuned models, and the specific behavioral quirks you've learned to work around.
Every model has idiosyncrasies. Claude responds differently to instruction phrasing than GPT does. Gemini formats outputs differently. After months of prompt engineering against a specific model, migrating to a different one isn't just an API swap. You're re-tuning every prompt, re-testing every workflow, and discovering new failure modes.
If vendor lock-in is a concern:
- Prefer abstraction layers (LiteLLM, Portkey, LangChain model-agnostic interfaces) that let you swap models with config changes
- Write prompts that are as model-agnostic as possible (avoid exploiting model-specific behaviors)
- Test your critical prompts quarterly against the top two alternative models
- Evaluate fine-tuning carefully: fine-tuned models are the hardest to migrate
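To make the abstraction-layer idea concrete, here is a minimal sketch using only the standard library. It is not how LiteLLM, Portkey, or LangChain work internally. The `ModelRouter` class and the provider function shape are hypothetical, and in practice each provider function would wrap a real vendor SDK call.

```python
from typing import Callable, Dict

# A provider is any function mapping (system_prompt, user_text) to a reply.
Provider = Callable[[str, str], str]


class ModelRouter:
    """Route calls to whichever provider the config names."""

    def __init__(self, providers: Dict[str, Provider], active: str):
        if active not in providers:
            raise KeyError(f"unknown provider: {active}")
        self._providers = providers
        self.active = active

    def complete(self, system_prompt: str, user_text: str) -> str:
        """Send the request to the currently active provider."""
        return self._providers[self.active](system_prompt, user_text)

    def compare(self, system_prompt: str, user_text: str) -> Dict[str, str]:
        """Run the same prompt against every registered provider."""
        return {name: fn(system_prompt, user_text)
                for name, fn in self._providers.items()}
```

Application code only ever calls `complete`, so switching vendors is a one-line config change to `active`, and `compare` gives you a cheap way to run the quarterly check against alternative models. The prompts themselves still need re-tuning; the wrapper only removes the plumbing work.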
Cost is the other lock-in factor. Custom pricing agreements, pre-paid token commitments, and volume discount tiers all create switching friction. Read the contract before you commit to anything above 6 months, and check what happens to pre-paid credits if you leave.
Hidden token costs: the math you need to do upfront
AI pricing looks simple on the pricing page. It gets complicated in production.
The advertised price is typically for input and output tokens with no caching, at standard rate. The actual cost in a production application depends on:
Context window usage: System prompts get sent with every request. A 10,000-token system prompt that's sent 10,000 times per day is 100 million tokens daily, just for the system prompt. At $15 per million input tokens (the Claude Opus 4 list rate), that's $1,500/day in system prompt costs alone.
Prompt caching: Anthropic's prompt caching reduces repeated context costs by approximately 90%. Cache hits are billed at 10% of the base input price, while cache writes cost somewhat more than the base input price (25% more for the default five-minute cache at the time of writing, so check the current pricing page). If you have a large, stable system prompt, caching is not optional. It changes the economics entirely.
Output length variance: If you're generating variable-length outputs, your cost can spike on longer completions in ways that are hard to budget. Put explicit length constraints in your prompts and monitor output token counts separately.
Retry costs: Failed API calls that get retried still consume tokens if they fail mid-generation. Robust retry logic with appropriate backoff matters for both reliability and cost.
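For the retry point, this is roughly what "appropriate backoff" looks like: exponential delays with random jitter and a hard cap on attempts. It is a generic sketch, not any vendor SDK's built-in behavior. The function name and default values are my own assumptions, and real code should catch only transient errors (rate limits, timeouts) rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 4,
                      base_delay: float = 0.5,
                      max_delay: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on exception with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except Exception:  # narrow this to transient errors in real code
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
        # Delay ceiling doubles each attempt, capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))  # jitter avoids synchronized retries
        attempt += 1
```

The attempt cap matters for cost as much as reliability: every retry of a request that fails mid-generation can bill tokens again, so an unbounded retry loop is an unbounded spend.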
Run the actual math before signing. Take your expected daily request volume, multiply by your average input and output token counts per request, apply the input and output rates separately (output tokens usually cost several times more than input tokens), and stress-test the number. Include a 3x buffer for traffic spikes.
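Here is that math as a small function you can adapt. It is a back-of-the-envelope sketch under stated assumptions: the function and parameter names are hypothetical, input and output tokens are priced separately, and the cache multipliers (1.25x for writes, 0.10x for reads) are illustrative defaults that you should replace with your vendor's published rates.

```python
from typing import Optional


def estimate_daily_cost(requests_per_day: int,
                        system_tokens: float,
                        user_tokens: float,
                        output_tokens: float,
                        input_price_per_million: float,
                        output_price_per_million: float,
                        cache_hit_rate: Optional[float] = None,
                        cache_write_multiplier: float = 1.25,
                        cache_read_multiplier: float = 0.10,
                        spike_buffer: float = 3.0) -> dict:
    """Estimate daily spend in dollars from average per-request token counts.

    cache_hit_rate=None means caching is off and the system prompt is billed
    at the normal input rate. Otherwise hits are billed at the read
    multiplier and misses at the write multiplier.
    """
    if cache_hit_rate is None:
        system_multiplier = 1.0
    else:
        system_multiplier = (cache_hit_rate * cache_read_multiplier
                             + (1 - cache_hit_rate) * cache_write_multiplier)
    billed_input = system_tokens * system_multiplier + user_tokens
    per_request = (billed_input * input_price_per_million
                   + output_tokens * output_price_per_million) / 1_000_000
    expected = requests_per_day * per_request
    return {
        "per_request": per_request,
        "expected_daily": expected,
        "stress_daily": expected * spike_buffer,  # traffic-spike scenario
    }
```

As an example, take 10,000 requests per day with a 10,000-token system prompt, 2,000 tokens of user input, and 500 output tokens, priced at an assumed $15 and $75 per million input and output tokens. Without caching that comes to about $2,175 per day, or $6,525 with the 3x buffer. With a 97% cache hit rate on the system prompt it drops to roughly $877 per day. The hit rate is the number to pressure-test, since it depends on your traffic pattern and the cache lifetime.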
The contract clauses that bite people
Beyond the DPA, watch for these in the master agreement:
Auto-renewal terms: Many AI vendors have 60-90 day auto-renewal notice windows. Miss the window and you're locked in for another year.
Price change provisions: Some agreements allow price increases on renewal without separate notification. Make sure your contract specifies either price lock for the term or minimum notice period (90 days minimum is reasonable) for price changes.
Uptime SLA and credits: What's the SLA? 99.9% sounds good until you realize that's roughly 8.8 hours of allowable downtime per year. For a customer-facing production application, that's significant. Also check whether credits are automatic or require you to file a claim within a specific window (they usually require a claim).
Indemnification for AI output: If your application outputs AI-generated content that causes harm, who is liable? Most AI vendors disclaim responsibility for model outputs. You are generally accepting that the model will sometimes produce wrong or harmful content, and the risk management is your responsibility.
Acceptable use policy changes: AUPs can change mid-contract at most vendors with short notice. If your use case is in a gray area (certain financial advice, medical information, adult content on appropriate platforms), get explicit written carve-outs in your MSA rather than relying on the current AUP's permissiveness.
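The uptime figure in the SLA item above is simple arithmetic, and it's worth running for the exact percentage and measurement period in your contract, since many SLAs are measured monthly rather than yearly. A small helper, with a hypothetical function name:

```python
def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Hours of downtime an SLA permits over the measurement period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (1 - sla_percent / 100)


yearly = allowed_downtime_hours(99.9)                          # per 365-day year
monthly = allowed_downtime_hours(99.9, period_hours=30 * 24)   # per 30-day month
print(f"99.9% allows {yearly:.2f} h/year and {monthly * 60:.1f} min/month")
```

A 99.9% commitment works out to about 8.76 hours per year, or about 43 minutes in a 30-day month. Moving to 99.99% cuts that by a factor of ten, which is why the extra nine is often priced as a separate tier.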
A practical evaluation sequence
If you're under time pressure (and you always are), here's the order:
1. Confirm data handling and training opt-out in writing. Non-negotiable.
2. Get the DPA and have legal review it for your regulatory requirements (GDPR, HIPAA, CCPA as applicable).
3. Request and review the SOC 2 Type II report.
4. Do a technical proof of concept with your actual production prompts, not the vendor's demo prompts.
5. Run the token cost math for your actual usage volume.
6. Ask about model versioning and swap policy.
7. Check contract terms for auto-renewal, price change provisions, and SLA credit procedures.
8. Confirm subprocessor list and get notification rights.
Most AI vendor evaluations that go wrong aren't failing on model quality. They're failing on data handling surprises, unexpected token costs in production, or contract terms that created lock-in nobody intended. The framework above doesn't guarantee a good outcome, but it addresses the eight things that cause the most problems.