Why Most AI Pilots Fail – And What To Do Differently

A reflection following the BioIntegrates panel: AI-Driven Alliances: Collaborating Beyond the Possible

The Number Nobody Wants to Talk About

Last week I sat on a panel with some of the most ambitious AI-driven biotech companies working in the UK today: teams building AI to screen a billion drug molecules in hours, design precision medicines for brain cancer, and democratise antibody discovery that previously took years.

But behind the excitement, a quieter and more uncomfortable truth kept surfacing: there is still a lot of friction between an AI pilot and a successful AI solution. I know that friction from first-hand experience.

Most AI pilots fail. Not because the technology isn’t good enough. But because of how we build, deploy, and govern them.

MIT’s 2025 State of AI in Business report — the “GenAI Divide” study — put a stark number on something practitioners already knew: 95% of enterprise generative AI pilots fail to deliver measurable business impact. Despite $30-40 billion poured into projects globally, the vast majority stall before they ever reach production.

The failure modes are consistent: trust breakdown when models behave unexpectedly, integration that’s more brittle than expected, runaway inference costs, governance bolted on too late, and the classic gap between an impressive demo and a useful product.

The biotech world isn’t immune. The same pattern plays out: enthusiasm, pilot, disappointment, reset. The 95% figure spans industries. The reasons are structural, not sectoral.

But here’s what the same research also found: vendor-led and partnership-based AI projects succeed roughly twice as often as internal builds. Not because buying is inherently better than building, but because partners who have done this before bring pattern recognition that internal teams simply don’t have yet. They’ve already made the expensive mistakes.

Why Do Pilots Fail?

The MIT findings identify six recurring failure modes:

Trust breakdown — models hallucinate, drift, or behave as a black box, and teams can’t see or steer why.
Integration fragility — brittle connections mean every system change becomes a hidden risk.
Cost overruns — inference costs that weren’t modelled upfront, particularly brutal for early-stage startups.
Governance gaps — compliance, audit trails, and oversight treated as afterthoughts.
The value gap — impressive demos that don’t map to real business outcomes.
Operational fragility — models that fail quietly in production, with no alerts until users have already lost trust.

I’d add a seventh, from our own experience: moving target blindness — building for where the technology is today without a plan for where it’s going tomorrow.

Foundation models are advancing faster than most product cycles. A pilot scoped in January can be overtaken by new model releases by June.

Competitors who are agile about swapping in newer capabilities will leapfrog teams that treat their AI component as fixed infrastructure. This isn’t cause for paralysis — it’s cause for designing adaptability in from the start.
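One way to picture "designing adaptability in" is to have product code depend on a small interface rather than on any one model. The sketch below is illustrative only: `TextModel`, `KeywordModel`, `ExpandedKeywordModel`, and `route_enquiry` are hypothetical names invented for this example, and the two "models" are trivial keyword rules standing in for real ones.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical interface: the only thing product code may depend on."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class KeywordModel:
    """Stand-in for the model available when the pilot was scoped."""

    name: str = "keyword-v1"

    def complete(self, prompt: str) -> str:
        return "billing" if "refund" in prompt.lower() else "general"


@dataclass
class ExpandedKeywordModel:
    """Stand-in for a newer model: different internals, same interface."""

    name: str = "keyword-v2"

    def complete(self, prompt: str) -> str:
        text = prompt.lower()
        hits = ("refund", "charged", "invoice")
        return "billing" if any(word in text for word in hits) else "general"


def route_enquiry(message: str, model: TextModel) -> dict:
    """Product logic: it never needs to know which model sits behind the interface."""
    return {"label": model.complete(message), "model": model.name}


# Swapping models is a one-argument change, not a rebuild.
print(route_enquiry("I was charged twice", KeywordModel()))
print(route_enquiry("I was charged twice", ExpandedKeywordModel()))
```

The design choice here is simply that a newer model is a new adapter behind the same interface, so the rest of the product stays untouched when it is swapped in.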

“In our experience, pilots fail for three reasons: insufficient relevant data, no infrastructure to get tools into end users’ hands, and low trust in AI outputs. Companies need to plan for targeted data generation, deployment, and trust-building from the start — not as afterthoughts.”

Dr Max Jakobs, DeepMirror

Lessons from Biotech

In drug discovery, there’s a phrase worth borrowing: fail fast, but fail small. The best biotech companies design their experimental pipelines to surface failures early, cheaply, and informatively — not to avoid failure, but to learn from it at the lowest possible cost. AI pilots should work the same way.

The most successful AI-in-biotech models aren’t the ones where a pharma giant builds everything in-house. They’re the ones where specialised AI partners work alongside domain experts, each contributing what they’re genuinely good at — Takeda and Iambic, Lilly and NVIDIA, DeepMirror and the medicinal chemists who don’t need to know ML to use it. The collaboration works because the roles are clear, the data ownership is agreed, and the problem being solved is specific and narrow. The same principles apply to any AI pilot, in biotech or anywhere else.

A Framework for Pilots That Actually Work

Define the problem before you define the solution. Don’t start with “we want to use AI.” Start with a specific, well-understood problem — and then ask whether AI is actually the right solution for it. Plenty of pilots fail because they started with a technology looking for a use case, rather than a use case that genuinely needs AI. Sometimes better data infrastructure or process redesign gets you further, faster.
Get your data in order first. Clean, structured, representative data isn’t a nice-to-have — it’s a precondition. AI doesn’t rescue poor data quality; it amplifies it. A model trained on messy inputs produces confidently wrong outputs. If you don’t have good data, the first phase of your pilot should be creating it — not skipping past it.
Define success before you start — and be honest about the timeline. Agree on metrics upfront: not vague aspirations, but specific, measurable outcomes that a sceptical stakeholder would accept as evidence. If you can’t articulate what “working” looks like within roughly six months of deployment, the pilot isn’t ready to start.
Build your evaluation pipeline on day one, not at the end. You need a way to measure whether your AI is working, and to detect when it starts working less well. This means test datasets, benchmark tasks, and a regular cadence of re-evaluation. When better models arrive, and they will, you can assess them rationally rather than starting from scratch.
Design for model evolution, not model permanence. Treat the model layer as a dependency you will update, not a fixed component. The teams winning in AI right now are the ones who can integrate next quarter’s model without rebuilding their product.
Take the jargon problem seriously. AI vendors and domain experts are often talking past each other — AI teams using model-layer language that means nothing to a biologist or operations manager, domain experts using terminology that engineers can’t map to a solution. If the conversation is full of untranslated jargon, collaboration becomes performative rather than productive. Build time into a project for translation: shared glossaries, regular check-ins where both sides explain what they’re doing in plain language. It’s slow. It’s also the difference between a tool people use and one that gets abandoned.
Think hard about inference costs before committing to infrastructure. Separate exploration from production. Cheap, local experimentation on open-source models first. Rented inference for production runs. Long-term infrastructure only when you know your workload well enough to justify it. Many pilots have been killed not by technical failure but by API bills that weren’t modelled upfront.
Plan for organisational change, not just technical deployment. A powerful tool that only a few people understand and trust will not scale. Your users need to be able to interrogate outputs, spot errors, and give meaningful feedback. Building distributed AI literacy — not just a central AI team — is what separates the 5% from the 95%.
Sort out the messy decisions before you’re knee-deep in data. Ownership, milestone-based payments, what happens when requirements shift — these need to be worked through before the build starts, not mid-project. A successful long-term relationship with an AI vendor requires the same clear agreements you’d expect from any significant partnership.
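Two of the steps above, building an evaluation pipeline early and modelling inference costs upfront, can be sketched together in a few lines. This is a minimal illustration, not a recommended implementation: the test set, candidate names, per-token prices, accuracy threshold, and traffic figures are all made-up assumptions, and the "models" are simple keyword rules standing in for real ones.

```python
from dataclasses import dataclass
from typing import Callable

# Fixed test set of (input, expected label). Illustrative data only.
TEST_SET = [
    ("I was charged twice this month", "billing"),
    ("How do I reset my password?", "account"),
    ("Please refund my last invoice", "billing"),
    ("Where can I update my email address?", "account"),
]


@dataclass
class Candidate:
    name: str
    predict: Callable[[str], str]
    cost_per_1k_tokens: float  # assumed price, in your budgeting currency


def accuracy(candidate: Candidate, test_set) -> float:
    """Share of test cases where the prediction matches the expected label."""
    correct = sum(candidate.predict(text) == expected for text, expected in test_set)
    return correct / len(test_set)


def monthly_cost(candidate: Candidate, requests_per_day: int,
                 tokens_per_request: int, days: int = 30) -> float:
    """Projected spend: total tokens per month times the assumed unit price."""
    tokens = requests_per_day * tokens_per_request * days
    return tokens / 1000 * candidate.cost_per_1k_tokens


def evaluate(candidates, test_set, min_accuracy, requests_per_day, tokens_per_request):
    """Score every candidate on the same test set and the same traffic assumptions."""
    report = []
    for c in candidates:
        acc = accuracy(c, test_set)
        report.append({
            "name": c.name,
            "accuracy": acc,
            "monthly_cost": monthly_cost(c, requests_per_day, tokens_per_request),
            "passes": acc >= min_accuracy,  # threshold agreed before the pilot starts
        })
    return report


def baseline_rule(text: str) -> str:
    return "billing" if "charged" in text.lower() else "account"


def newer_rule(text: str) -> str:
    hits = ("charged", "refund", "invoice")
    return "billing" if any(word in text.lower() for word in hits) else "account"


candidates = [
    Candidate("baseline", baseline_rule, cost_per_1k_tokens=0.0005),
    Candidate("newer", newer_rule, cost_per_1k_tokens=0.002),
]

report = evaluate(candidates, TEST_SET, min_accuracy=0.9,
                  requests_per_day=2000, tokens_per_request=800)
for row in report:
    print(row)
```

Because every candidate runs against the same test set and the same traffic assumptions, a newly released model can be added to `candidates` and judged on accuracy and projected cost side by side, rather than on the strength of a demo.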

“The big difference in the pilots that got embedded was that there was infrastructure built around it. There was a plan for the data going in, a plan to audit it, a plan to manage it and build it. It’s not just a flash in the pan.”

Ingrid Folland, Japeto

What We’ve Learned at Japeto

We work predominantly with charities, healthcare providers, and public sector organisations — clients where wasted budget isn’t just a commercial problem, it’s a failure of public trust. That sharpens the mind considerably.

We’ve had our own experiences of being caught by fast-moving foundations: a product that needed rethinking after the generative AI landscape shifted, and a client pilot overtaken by off-the-shelf model improvements before we could get it to launch. Neither failure was dramatic. Both were instructive. The common thread: we were solving for the technology as it existed at the start of the work, not designing for the technology as it would be by the end.

The question for any AI pilot shouldn’t only be “can we build this?” It should be:
Can we maintain this when it’s deployed?
Can we improve this as models advance?
Can we measure whether it’s working — honestly, with real metrics?
Can we explain how it’s working to the people it affects?
If the answer to any of those is no, you don’t have a product. You have a demo.
The good news is that these are solvable problems. They just need to be designed for from the beginning, not discovered at the end.

Ingrid Folland
