Feature | December 23, 2025

Data Quality in Drug Development: The Missing Foundation to Realize AI’s Promise in Clinical Trials

Author: Leigh Cohen

AI’s transformative potential in clinical development relies on the industry's ability to rebuild its fractured data infrastructure.

“Garbage in, garbage out” remains as true today as when it was coined in the 1950s, perhaps even more so in the era of large language models and AI-powered analytics. While artificial intelligence holds transformative promise for clinical trials, its success hinges on a foundational element that is often overlooked: clean, high-quality, real-time data.

Despite significant innovation across the life sciences industry, companies continue to struggle with fragmented data ecosystems. The burden of reconciling disparate data sources, often manually and after the fact, undermines the very outcomes AI is meant to accelerate. To unlock AI’s full potential, clinical development leaders must rethink their data strategy from the ground up.

Healthcare is now personal, with precision medicine aiming to match the right therapy to the right patient at the right time, transforming treatment strategies from population averages to individualized care. However, this vision depends on more than breakthrough science; it requires a data foundation capable of capturing and connecting every relevant signal across the clinical development lifecycle.

The Data Deluge: Opportunity and Obstacle

Over the past decade, the volume and variety of data in clinical trials have exploded. Traditional case report forms from randomized controlled trials are now supplemented by real-world data from electronic health records, insurance claims, wearables, patient registries, and other sources. Not only can these external inputs offer a richer, more holistic view of patient outcomes and treatment performance; in many cases, a trial’s primary and secondary endpoints depend on them.

At DPHARM 2025, Tufts CSDD’s Kenneth Getz reported that the average Phase III trial now generates almost six million data points, up from 3.6 million in 2020 and just one million in 2012, yet much of this data remains underutilized. Unlike structured data collected in case report forms, real-world data (RWD) is often incomplete and can be unreliable in terms of representativeness and relevance, making it difficult to integrate and analyze at scale.

RWD’s Promise Meets Persistent Barriers

Real-world data is becoming increasingly prevalent as healthcare grows more digitized and information is captured far beyond clinical trial sites. Global regulatory bodies like the U.S. Food and Drug Administration (FDA) and the European Medicines Agency have embraced the potential of RWD, publishing frameworks that encourage its responsible use in submissions. The pandemic further accelerated decentralized trials and remote monitoring, pushing RWD firmly into the mainstream.

Still, integration remains piecemeal: a sensor feed here, an EHR pipeline there, resulting in fragile mosaics of systems, interfaces, and reconciliation steps. Common challenges include:

  • Data silos and incompatibility: Trial data, claims, electronic health records, labs, and other sources often arrive in inconsistent formats with different vocabularies, coding standards, and levels of granularity.
  • Data quality and bias: Real-world sources may contain missing values, incomplete outcomes, or biases from non-random patient selection, limiting their use as standalone evidence and increasing the risk of false findings.
  • Operational drag: Many sponsors run critical workflows on stacks of point solutions stitched together with manual cleanup and reconciliation, defined by time-consuming starts and stops.
  • Interoperability gaps: Despite progress with global standards such as Fast Healthcare Interoperability Resources (FHIR) and CDISC, consistent adoption remains patchy, making integration across systems slow, costly, and error-prone.
  • Regulatory hurdles: Ensuring patient privacy, meeting data provenance requirements, and demonstrating methodological rigor remain challenges. Global trials also face data localization constraints and varying patient consent requirements.

Without rigorous, real-time data harmonization, the promise of real-world data, and ultimately AI, remains locked in silos.

The Case for Standards-Based Data Harmonization

The goal isn’t simply more data; it’s better data. Harmonizing real-world data with traditional trial data creates a high-fidelity view of treatment performance across both controlled and real-world settings. But this requires a shift in approach.

Embedding standards like CDISC and FHIR from the outset ensures that data flows consistently throughout the trial lifecycle. This proactive strategy eliminates costly, error-prone reconciliation downstream and enables real-time analytics, regulatory readiness, and AI-driven insights.
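As a concrete illustration of this standards-first approach, the short Python sketch below maps a FHIR Observation resource (a blood pressure reading as it might arrive from an EHR feed) into a CDISC SDTM-style VS record at the point of ingestion rather than during downstream reconciliation. The study and subject identifiers, the hand-rolled LOINC-to-VSTESTCD lookup, and the function name are assumptions made for illustration only; a real pipeline would rely on governed terminology mappings and validated transformation logic.

```python
# Minimal sketch: harmonize a FHIR Observation into an SDTM VS-style record at capture time.
# The mapping table and identifiers below are illustrative assumptions, not a governed standard.

LOINC_TO_VSTESTCD = {
    "8480-6": ("SYSBP", "Systolic Blood Pressure"),   # illustrative mapping only
    "8462-4": ("DIABP", "Diastolic Blood Pressure"),
}

def fhir_observation_to_sdtm_vs(obs: dict, studyid: str, usubjid: str) -> dict:
    """Map a FHIR Observation resource (as a dict) to an SDTM VS-like record."""
    loinc = obs["code"]["coding"][0]["code"]
    testcd, test = LOINC_TO_VSTESTCD[loinc]
    qty = obs["valueQuantity"]
    return {
        "STUDYID": studyid,
        "DOMAIN": "VS",
        "USUBJID": usubjid,
        "VSTESTCD": testcd,
        "VSTEST": test,
        "VSORRES": str(qty["value"]),               # result as originally collected
        "VSORRESU": qty.get("unit", ""),            # original units
        "VSDTC": obs.get("effectiveDateTime", ""),  # ISO 8601 date/time of collection
    }

# Example FHIR Observation (abbreviated) as it might arrive from an EHR feed
observation = {
    "resourceType": "Observation",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
    "valueQuantity": {"value": 128, "unit": "mm[Hg]"},
    "effectiveDateTime": "2025-06-01T09:30:00Z",
}

print(fhir_observation_to_sdtm_vs(observation, studyid="ABC-301", usubjid="ABC-301-0001"))
```

Because the record is born in a submission-ready shape, downstream teams spend their time on analysis rather than on cleaning and re-mapping.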

By contrast, today’s norm of parallel pipelines stitched together after the fact creates fragile integrations and delays. Early harmonization transforms data from liability to asset, laying the groundwork for speed, quality, and trust in an AI-powered clinical ecosystem.

AI Needs a New Data Infrastructure

AI is already reshaping life sciences, from imaging triage to risk scoring. In drug development, its potential is vast: predictive modeling, synthetic control arms, adaptive trial designs, and more.

Yet these innovations don’t address the core bottleneck of operational drag. Trials still span 10 to 15 years, with only a fraction of that time spent on evidence generation, such as dosing and measuring patients. The rest is consumed by study startup, recruitment, data wrangling, and submission packaging.

AI and harmonized data have the potential to dramatically compress clinical timelines and deliver substantial impact. For instance, Phase III trials typically cost around $36 million, and some exceed $100 million; nearly 40% of that spend, roughly $14 million for a typical trial, goes to operations and overhead.

Streamlining workflows through AI can yield immediate and significant returns on investment. Beyond efficiency, AI enables smarter trials by leveraging synthetic control arms and predictive modeling to optimize enrollment and inform study design decisions. Ultimately, real-time data further supports continuous learning and adaptive trial designs, paving the way for dramatically faster execution. What currently takes 12 years could be reduced to five or six years if the data is clean, integrated, and reliable.

Building the Ecosystem for AI-Driven Research

To realize this vision, the industry must rearchitect its data infrastructure, with key elements including:

  • Real-time, bi-directional data flow: Replace nightly batches with streaming updates so all stakeholders see the same truth simultaneously (see the sketch after this list).
  • Native integrations: Build standardization into the platform core, not as bolt-ons.
  • Open standards: Avoid vendor lock-in and enable cross-platform collaboration.
  • Privacy and provenance: Make de-identification and audit-grade lineage table stakes for responsible reuse.
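To make the real-time, bi-directional flow described above more concrete, here is a minimal, illustrative Python sketch of the underlying idea: every source publishes updates into one shared store, and every stakeholder reads the same state the moment it changes. The DataFabric class, its publish and subscribe methods, and the sample record are hypothetical names introduced for illustration; they do not correspond to any specific vendor platform or standard API.

```python
# Illustrative sketch (not a production design) of a single source of truth with
# streaming updates: one shared store, many sources, many simultaneous readers.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass
class DataFabric:
    records: dict = field(default_factory=dict)      # keyed by (domain, record id)
    subscribers: list = field(default_factory=list)  # callbacks notified on every update

    def publish(self, source: str, domain: str, record_id: str, payload: dict) -> None:
        """Apply an update from any source (EDC, EHR feed, lab, wearable) immediately."""
        stamped = {**payload, "_source": source,
                   "_updated": datetime.now(timezone.utc).isoformat()}
        self.records[(domain, record_id)] = stamped
        for notify in self.subscribers:              # all stakeholders see the same truth
            notify(domain, record_id, stamped)

    def subscribe(self, callback: Callable) -> None:
        """Register a stakeholder view that reacts to each update as it arrives."""
        self.subscribers.append(callback)

fabric = DataFabric()
fabric.subscribe(lambda d, rid, rec: print(f"[ClinOps view] {d}/{rid}: {rec}"))
fabric.subscribe(lambda d, rid, rec: print(f"[Biostats view] {d}/{rid}: {rec}"))
fabric.publish("EHR", "VS", "ABC-301-0001-001", {"VSTESTCD": "SYSBP", "VSORRES": "128"})
```

The point of the sketch is the contract, not the code: when updates propagate the moment they occur, there is no nightly batch for teams to wait on and no divergent copies to reconcile later.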

AI’s true promise is turning clinical development from a bottleneck into a throughput engine for human health. Drug discovery is already accelerating as AI identifies thousands of new treatment pathways, but every breakthrough idea must still traverse Phase I through Phase III trials.

By streamlining clinical development and reimagining data infrastructure, the industry can increase pipeline throughput and bring precision medicine to scale.

The Impact of Data Unification on Business Outcomes

Theory and conceptual thinking are aspirational by design, and most would ask: how is this helping sponsors today?

In a recent case study presented at DPHARM 2025, a top pharmaceutical sponsor that adopted a unified data strategy realized significant impact on business processes, development pipeline throughput, and business intelligence.

By eliminating parallel data pipelines and establishing a single source of truth for all stakeholders, from data management and clinical operations to biostatistics, medical, and safety teams, sponsors can achieve:

  • Faster regulatory submissions: With data already harmonized to CDISC and complete with audit-grade lineage, the sponsor can perform SDTM/ADaM assembly, TLF production, and eCTD packaging as a continuous process rather than an episodic one.
  • Real-time operational and scientific insights: By replacing batch processing with a single, bi-directional data fabric, the sponsor gained live views of operational performance and on-demand scientific analyses.
  • Expanded safety and HEOR evidence: By unifying trial data with EHR, claims, and registry data, the sponsor is strengthening pharmacovigilance and expanding the evidence base beyond the trial site.
  • Accelerated study startup: With standards‑based libraries for protocols, schedules of activities, and eCRFs, coupled with native FHIR integrations, the sponsor can compress feasibility and site activation timelines.
  • A foundation for AI innovation: With a data substrate that is explainable, auditable, and repeatable, teams can safely deploy high-value AI use cases without re-plumbing data for every project.

Conclusion: From Bottleneck to Breakthrough

The future of clinical trials is not just faster; it’s smarter. But only if we start with high-quality, harmonized data. AI can’t fix broken inputs. As the adage reminds us: garbage in, garbage out.

To transform clinical development, we must first transform how we manage data. That means standards-first design, real-time harmonization, and a unified ecosystem built for speed, scale, and scientific rigor.

Only then can AI deliver on its promise and usher in a new era of evidence generation, drug development, and human health.
