How to evaluate an AI vendor without falling for the demo.

Every AI demo works. That isn't a coincidence and it isn't fraud — it's craft. The data used in the demo is curated. The use case is the one the product handles best. The sales engineer running the demo has watched it succeed three hundred times. None of that tells you what happens when the same system meets your actual documents, your actual users, and your actual edge cases.

Buyers in unions and in healthcare are seeing a steady wave of vendor pitches: AI for dispatch, AI for grievance triage, AI for clinical documentation, AI for prior authorization, AI for medical coding, AI for member services. Some of these products are genuinely excellent. Most are a familiar large language model wrapped in a thin interface and a heavy markup. The buyer's job is to tell which is which.

What follows is a short framework — not a comprehensive checklist, but the questions that consistently separate real products from polished demos.

Bring your own data

The most useful question to ask in the first sales meeting is straightforward: can we run a paid pilot with our real data before we sign the full contract?

For a union, "real data" means messy dispatch records, scanned apprenticeship files with handwritten notes, grievance documentation in five different formats, training certifications going back two decades. For a clinic, it means actual EHR exports, dictation files with accents and specialty terminology, EOB PDFs that the optical character recognition tool has been mangling for years, and the billing exceptions that pile up every Friday afternoon.

If the vendor will only pilot with sanitized sample data, the answer to your question is already in front of you. If they say yes but only after the contract is signed, the answer is the same. The vendors most likely to deliver are the ones who want the messy stuff in front of them as early as possible — because they know their product handles it.

The five questions that cut through

Which model is under the hood, and what happens when it changes? Most AI products are built on top of an underlying large language model from one of four or five providers. That isn't a bad thing. But it does mean the product you buy is exposed to changes you don't control. Ask explicitly: if the vendor switches models, whether because of pricing, performance, or policy, what changes for you? Is anything retrained or re-tested? Do existing prompts still work? Who decides, and how much notice do you get?

What's proprietary, and what's resold? A useful follow-up. If you stripped away the underlying model API, the embedding service, the vector database, and the cloud infrastructure, what would the vendor have left? Sometimes the answer is a meaningful workflow, a real domain-specific dataset, or a useful fine-tune. Sometimes the answer is a logo and a Stripe account. Both can be useful products — but they should cost very different amounts of money.

What's the hallucination rate, and how do you measure it? The honest answer here is more revealing than the number. Vendors who can describe their evaluation methodology — the test sets they hold out, the human review process, the categories of error they track, how they handle regressions when the underlying model changes — are usually the ones building something real. Vendors who give you a single confident percentage and no methodology are giving you marketing.
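One way to make that conversation concrete is to ask for the per-category tallies from a held-out, human-reviewed test set and work out the rates yourself. The sketch below is illustrative only: the category names and counts are invented, and the Wilson score interval is just one common way to put an uncertainty range on a proportion, not a method any particular vendor is known to use.

```python
import math
from collections import Counter


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def error_report(labels: list[str]) -> dict[str, dict[str, float]]:
    """Per-category error rates with uncertainty ranges; 'correct' is not an error."""
    total = len(labels)
    counts = Counter(label for label in labels if label != "correct")
    report = {}
    for category, n in counts.items():
        low, high = wilson_interval(n, total)
        report[category] = {"count": n, "rate": n / total, "low": low, "high": high}
    # Overall rate: any output a reviewer flagged, regardless of category.
    all_errors = sum(counts.values())
    low, high = wilson_interval(all_errors, total)
    report["any_error"] = {
        "count": all_errors,
        "rate": all_errors / total,
        "low": low,
        "high": high,
    }
    return report


# Invented reviewer labels for 200 held-out outputs (not real vendor data).
SAMPLE_LABELS = (
    ["correct"] * 182
    + ["fabricated_fact"] * 9
    + ["wrong_citation"] * 6
    + ["omitted_detail"] * 3
)

if __name__ == "__main__":
    for category, row in error_report(SAMPLE_LABELS).items():
        print(f"{category:16s} {row['rate']:.1%}  ({row['low']:.1%} to {row['high']:.1%})")
```

With only 200 reviewed outputs, an observed 9% error rate is compatible with a true rate anywhere from roughly 6% to 14%. That range, and the breakdown by error type, is the kind of detail a single confident percentage leaves out.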

Where does our data go, and who can see it? The full chain matters. Your data goes into the application. The application sends it to a model API. The model API sits on infrastructure owned by someone else. Logs are kept somewhere. For unions handling member PII and for clinics handling PHI, every stop on that chain is a potential exposure. Ask whether your data is used for training, what the retention policy is, where the data physically resides, and what the breach notification process looks like in writing.

What does exit look like? If you decide in 18 months that this isn't working, can you get your data out? Your prompts? Any fine-tuned models you helped train? In what format, with what notice, at what cost? The vendors who have a clean answer to this question are the ones who don't need to trap you to keep you.

What unions should watch for

Vendors selling "AI dispatch optimization" or "AI grievance triage" deserve a specific look. Dispatch decisions affect member income and seniority — getting them wrong creates grievances faster than it resolves them. Ask how the model is making decisions, what data it was trained on, and whether the vendor can produce an audit trail when a member files a complaint. An AI that can't explain its reasoning to a steward in plain English is an AI that's going to lose every grievance it generates.

For AI tools handling apprenticeship records, training credentials, or member onboarding, the central question is where member PII ends up and whether the vendor will sign a data processing agreement that survives a contract dispute.

A pattern worth noticing: vendors with no labor-side experience often misunderstand the political dynamics of a union local. Software that's optimal in a corporate setting can be unworkable when business agents, stewards, and members all have a stake in the same workflow. References from comparable locals matter more than impressive enterprise logos.

What healthcare should watch for

Ambient AI scribes are one of the highest-volume categories in clinical AI right now. The questions that matter are accuracy on the accents and specialty terminology your providers actually use, the share of notes providers sign without editing because the output is already good (the real measure of utility), and what happens when the vendor's transcription gets used as the source of truth in a billing or malpractice dispute.
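The editing question is something a pilot can measure directly, provided you can export both the AI draft and the note the provider finally signed. The following is a minimal sketch under that assumption. The sample notes are invented, the function names are hypothetical, and word-level similarity from Python's standard difflib module is a rough stand-in for whatever comparison your own team prefers.

```python
from difflib import SequenceMatcher


def similarity(draft: str, signed: str) -> float:
    """Word-level similarity between the AI draft and the note the provider signed."""
    return SequenceMatcher(None, draft.split(), signed.split()).ratio()


def pilot_summary(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize how much providers changed the drafts during a pilot."""
    if not pairs:
        raise ValueError("need at least one (draft, signed) pair")
    scores = [similarity(draft, signed) for draft, signed in pairs]
    # A score of exactly 1.0 means the signed note matches the draft word for word.
    unchanged = sum(1 for score in scores if score == 1.0)
    return {
        "notes": len(scores),
        "signed_unchanged": unchanged / len(scores),
        "mean_similarity": sum(scores) / len(scores),
    }


# Invented (draft, signed) pairs for illustration; not real patient data.
SAMPLE_PAIRS = [
    ("Patient reports mild chest pain for two days.",
     "Patient reports mild chest pain for two days."),
    ("Patient denies fever. Prescribed amoxicillin 500 mg.",
     "Patient denies fever. Prescribed azithromycin 250 mg."),
    ("Follow up in six weeks.",
     "Follow up in six weeks."),
]

if __name__ == "__main__":
    print(pilot_summary(SAMPLE_PAIRS))
```

Treat the numbers as a prompt for review rather than a verdict. In the second sample pair the provider changed only two words, but those two words are the drug and the dose, so a high similarity score can still hide the edits that matter most.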

For prior authorization automation, ask how often the AI's prediction is wrong and what the appeal rate is in those cases. For coding assistants, ask explicitly who is liable when the code is wrong: the practice, the vendor, or the EHR vendor whose pipes the data flows through.

The HIPAA question is non-negotiable: will the vendor sign a Business Associate Agreement, and is the underlying model API HIPAA-eligible? Amazon Bedrock and Google Vertex AI both offer HIPAA-eligible model access through their BAA programs. Direct calls to consumer-grade LLM APIs are typically not HIPAA-eligible regardless of what the vendor's marketing page says. If the vendor can't articulate this distinction clearly, that's a signal in itself.

The FDA line on clinical decision support is the other thing to know. Tools that suggest specific diagnoses or treatments may cross into regulated medical device territory. Ask whether the vendor has done the regulatory analysis and what their documented position is.

Pricing reality

Per-token, per-seat, per-document, per-call. Each pricing model has a hidden cliff at scale. The pricing that looks reasonable in the pilot can become punitive in production. Ask explicitly what your annual cost looks like at 2x, 5x, and 10x the pilot volume. If the vendor can't model it, they don't fully understand their own pricing.
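If the vendor won't model it, you can rough it out yourself from the rate card. The sketch below assumes a hypothetical per-document plan with a flat monthly fee, an included document allowance, and a per-document overage charge. Every number in it is invented for illustration, so substitute the figures from your actual quote.

```python
def annual_cost(
    docs_per_month: int,
    base_fee: float = 1000.0,      # hypothetical flat monthly fee, in dollars
    included_docs: int = 10_000,   # hypothetical documents covered by the base fee
    overage_rate: float = 0.25,    # hypothetical charge per document over the allowance
) -> float:
    """Annual cost under a base-fee-plus-overage per-document plan."""
    overage_docs = max(0, docs_per_month - included_docs)
    return 12 * (base_fee + overage_docs * overage_rate)


def projection(pilot_docs_per_month: int, multipliers=(1, 2, 5, 10)) -> dict:
    """Annual cost at each volume multiple, plus how it compares to the pilot cost."""
    baseline = annual_cost(pilot_docs_per_month)
    rows = {}
    for m in multipliers:
        cost = annual_cost(pilot_docs_per_month * m)
        rows[m] = {"annual_cost": cost, "cost_multiple": cost / baseline}
    return rows


if __name__ == "__main__":
    # Assumed pilot volume of 8,000 documents a month, inside the allowance.
    for m, row in projection(8_000).items():
        print(f"{m:>2}x volume: ${row['annual_cost']:>10,.0f}  "
              f"({row['cost_multiple']:.1f}x the pilot cost)")
```

With these invented numbers the pilot sits comfortably inside the included allowance, so it looks cheap. At ten times the volume the annual bill is 18.5 times the pilot figure, because almost every document is now billed at the overage rate. The same exercise works for per-seat or per-token plans once you swap in the relevant thresholds.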

Watch for two specific patterns: "enterprise tier" requirements that surface only after the contract is signed, and integration costs that are described as "minor" in the demo and turn out to require six weeks of professional services at three hundred dollars an hour (roughly $72,000, assuming forty billable hours a week).

Reference calls that matter

Three filters: same size as you, same industry as you, live for at least six months. The launch-week logo is the wrong reference call. Ask the reference what they don't like about the product. Ask what they thought it would do that it doesn't. Ask whether they'd buy it again knowing what they know now. The vendors with strong products will give you a list of references who will answer honestly, because they know the honest answer is still good.

A useful default

The cost of a bad AI vendor isn't the contract value. It's the year your team spends integrating, training, and reorganizing workflows around a product that turns out not to fit. By the time you know, you've sunk enough into it that leaving feels harder than staying.

The defense against this is unglamorous: a paid pilot on real data, before the contract. The vendors who refuse aren't the ones who are going to deliver. The vendors who insist on it are usually the ones you want. Make the pilot the first item on the table, not the last.