
Hypothesis - Where Value Accrues in Life Sciences Real-World Data

Overview

We've been investing across the pharma tech stack for several years, and where value accrues in the real-world data (RWD) stack is shifting. Claims data is now table stakes. The value of continual RWD purchases is declining as foundation models incorporate these insights directly into their weights. And both pharma and large foundation model buyers are now looking for broader, deeper datasets (not just small-n cohorts for translational research).

The early days (2010s) were a coverage arms race where aggregators won on breadth. DRG and Clarivate underwrote the Datavant ecosystem, enabling a generation of aggregators to resell open pharmacy claims at scale. IQVIA and other aggregators captured a majority of health analytics because the moat was simply who had the most claims. Over the past few years, data originators have changed course. Inovalon pulled its data from DRG, forced direct contracts, and prohibited reselling. CVS, Evernorth, Optum, and Express Scripts have all been launching their own life sciences divisions to sell direct to pharma. Even Datavant is moving away from its historically neutral positioning with the Aetion and Ontellus acquisitions, vertically integrating into analytics and record retrieval.

The dynamic is not just that these buyers need more data (they do) but that each iteration of this market restructures where value accrues. The prior era of data aggregation, recycling, and reselling is over. You now must deliver immediate value to the data originators and owners for the right and privilege to leverage their data for novel downstream use cases. It is a game of both data access across multiple data types (no single data originator has the multimodal data required across 'omics, imaging, claims, EHR, etc.) and superior performance across multiple use cases and sites (where foundation models thrive).

What This Means Going Forward

Let's walk through both the demand side and supply side.

On the demand side, pharma budgets and behavior are nuanced. Demand for the low-quality datasets of the prior RWD era is declining, but demand for performance and business ROI is increasing as pharma adapts to a changing world (drug pricing, LoEs, clinical trial costs, the exhaustion of low-hanging drug targets, etc.). In some cases, they are reducing spend on table-stakes claims data: multiple therapeutic area teams used to buy the same dataset separately and do their own slicing and dicing rather than simply sharing the .csv across therapeutic areas or brand teams. They are savvier buyers today. In other cases, they are moving on from the RWD vendors of a prior era that everyone piled into in an industry of fast followers. The type of demand is shifting too, with more nuance around data quality and breadth, expansion beyond oncology into new therapeutic areas, and a growing internal debate within pharma around buy vs. build.

We are also seeing an entirely new buyer: large foundation model companies spending to acquire population-level datasets to fuel their models. These purchases aren't yet at the same scale as spend on RL environments or labeling companies in the general AI space, but there is a massive appetite from model companies as they move into the application layer in life sciences. That appetite creates an opening for startups that can leverage this demand for early validation and revenue.

On the supply side (where startups we work with operate), we're thinking about a few main areas:

Enabling privacy tech / infra. This is where companies like Integral and Hermes operate. Security is entirely different in an agent-to-agent world. People want access to their own data, an AI-first world is changing how consumers engage with their health, and HIEs aren't sufficient. We need to think about how we de-identify and access different data types. It's both a technical question (which federated technique?) and a legal/social/policy one (how do we de-identify voice data from patient-provider interactions?).
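To make the technical half of that question concrete, here is a minimal sketch of one common building block: replacing direct identifiers with a keyed-hash token so that records from two sources can be linked without exchanging names. Everything in it is an assumption for illustration (the field names, the `SITE_KEY` secret, the sample rows and placeholder values), and it is not how any company named above works. Keyed hashing alone also does not amount to legal de-identification, since the fields that remain can still re-identify a patient.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real deployment would keep this in a key service.
SITE_KEY = b"example-secret-key"


def normalize(first, last, dob):
    """Canonicalize identifiers so formatting differences do not change the token."""
    return "|".join(part.strip().lower() for part in (first, last, dob))


def make_token(first, last, dob, key=SITE_KEY):
    """Return an HMAC-SHA256 hex digest of the normalized identifiers."""
    message = normalize(first, last, dob).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def deidentify(record, key=SITE_KEY):
    """Drop direct identifiers, keeping a linkage token and coarse fields only."""
    token = make_token(record["first"], record["last"], record["dob"], key)
    return {
        "token": token,
        "year_of_birth": record["dob"][:4],
        "payload": record["payload"],
    }


# Made-up rows from two hypothetical sources describing the same person.
claims_row = {"first": "Ada", "last": "Lovelace", "dob": "1985-12-10",
              "payload": {"ndc": "00000-0000-00"}}
ehr_row = {"first": " ada ", "last": "LOVELACE", "dob": "1985-12-10",
           "payload": {"crp_mg_l": 12.4}}

claims_deid = deidentify(claims_row)
ehr_deid = deidentify(ehr_row)

# The two de-identified rows can be joined on the token alone.
linked = claims_deid["token"] == ehr_deid["token"]
```

The design choice worth noting is the key: whoever holds it can regenerate tokens from raw identifiers, so who controls the key is itself one of the policy questions.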

Access to novel data. Similar to how Sam Altman recently shared that he's looking for acquisitions that are a mix of research and product, we are looking for companies that combine data generation with product. The era of pure data brokers is ending. Standard Model Bio's give-to-get model is one approach: the company builds capabilities that health systems and biopharma need, earns data access as a by-product, and trains foundation models on the resulting longitudinal multimodal data. Autoimmune disease is one therapeutic area where this gap is acute. Patients average 4+ years to diagnosis, cycling through primary care and specialists while generating fragmented clinical data that no single system captures longitudinally. Standard inflammation markers (CRP, ESR) lack the specificity to distinguish autoimmune subtypes early, so by the time patients reach methotrexate or biologics, the pre-diagnosis progression data is lost. There is a growing ecosystem of companies engaging these patients through digital health, patient communities, and DTC diagnostics to create a patient-centric biobank.

The companies we want to work with are the ones building infrastructure that originators need but structurally cannot build themselves. And ideally, they are doing it in a way that creates a defensible data asset or model as a by-product, not just a service layer that gets commoditized.

Additional Reading

Protege Data Lab — Evaluation Datasets

FDA EHR/Medical Claims Guidance

FDA Bayesian Methodology Draft Guidance (Jan 2026)

Truveta Language Model — 130M patients

FDA Removes Patient-Level Data Barrier (Dec 2025)

TriNetX Survey — 40% Rate RWD "Critical," $300M Average Savings
