R01OD039830
Project Grant
Overview
Grant Description
Reproducibility statistics and machine learning methods for systematic phenotyping and model integration across animals, organs, and technologies
Project Summary/Abstract
Modern high throughput biomedical research is collecting an ever-increasing number of variables spanning different physical scales, systems, and species.
In this era of “multi-omics”, multiple data sets are often collected on the same animal, spanning modalities that may variously include genomics, transcriptomics, proteomics, epigenomics, metabolomics, connectomics, and other “omics” domains, with each individual omics data set consisting of hundreds up to hundreds of thousands of variables.
These exciting developments in data acquisition have led animal model researchers into a new data paradigm, with many labs now relying on machine learning methods for dimensionality reduction, clustering, phenotyping, data integration, and similar techniques assembled into pipelines as part of standard practice.
Frequently in this scenario, the number of variables collected P (e.g., metabolites) becomes as large as (or larger than) the number of observations N (e.g., mice).
In statistics, this is referred to as the large dimensional limit (LDL) regime.
This dramatic increase in the number of variables relative to observations changes the statistical behavior of the data, leading to significant problems with machine learning model fitting behavior that are not widely appreciated outside certain domains of mathematical statistics and physics.
Estimates that should fundamentally describe the structure of data, such as the principal components, no longer converge to their true underlying values (as they do in the classical “small data” case when P is much smaller than N).
Further, similar failures extend far beyond principal components, impacting many other widely used machine learning approaches; multi-stage analysis pipelines and complex animal data distributions compound these problems, collectively resulting in a modern reproducibility crisis in animal research.
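The non-convergence of principal components in the large dimensional limit can be illustrated with a small simulation. This is an editorial sketch, not part of the grant: the spiked-covariance model, the `signal` strength, and all sample sizes below are illustrative assumptions.

```python
# Sketch (illustrative assumptions): a spiked-covariance simulation showing
# why sample principal components stop converging to the true direction
# when the number of variables P is comparable to the sample size N.
import numpy as np

rng = np.random.default_rng(0)

def top_pc_overlap(n, p, signal=2.0):
    """Squared overlap between the true spike direction and the
    leading sample principal component (1.0 = perfect recovery)."""
    u = np.zeros(p)
    u[0] = 1.0                              # true population direction
    z = rng.standard_normal((n, 1))         # latent factor scores
    X = np.sqrt(signal) * z @ u[None, :] + rng.standard_normal((n, p))
    cov = X.T @ X / n                       # sample covariance matrix
    _, vecs = np.linalg.eigh(cov)           # eigenvectors, ascending order
    return float((vecs[:, -1] @ u) ** 2)

# Classical "small data" regime (P << N): the estimate is accurate.
print(top_pc_overlap(n=2000, p=20))
# Large dimensional regime (P ~ N): the same estimator stays biased away
# from the truth even though N is unchanged.
print(top_pc_overlap(n=2000, p=2000))
```

In the second call the overlap plateaus well below 1 no matter how the seed is varied, which is the non-convergence the abstract describes.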
Here we show that these issues also broadly impact the test-retest reproducibility of machine learning methods on held-out data, with reproducibility on new data following a “universal reproducibility curve” in which performance changes rapidly from non-reproducible to reproducible as a function of sample size and number of variables.
A direct consequence of this phase transition is that modern animal model research studies are subject to a reproducibility transition: below a certain sample-to-variable ratio N/P, machine learning algorithms fail to yield reproducible results.
Furthermore, for ratios N/P above this value, collecting additional data will yield rapidly diminishing marginal returns in reproducibility that may not be justified by data acquisition costs or trade-offs.
Together these phenomena define what we call “universal reproducibility curves” that depend on the “aspect ratio” P/N of the data (as well as the strength of the multivariate biological signal that the data contain).
Gaining a mechanistic understanding of such curves will allow us to systematically solve important open problems in machine learning such as how to measure reproducibility across machine learning methods and pipelines, how many data samples to collect and how many variables to measure to ensure reproducibility, when to stop collecting data, and how to design better embedding algorithms for data integration and systematic phenotyping in real-world animal research models.
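One way to see such a reproducibility curve empirically is to fit PCA on two independent samples of the same population and measure how well the leading components agree as the sample size grows at fixed dimension. This sketch is an editorial illustration, not the grant's method; the model, `signal` level, and the sweep of sample sizes are assumptions chosen to make the transition visible.

```python
# Sketch (illustrative assumptions): a split-sample "reproducibility curve"
# for PCA. Agreement between leading principal components fitted on two
# independent draws rises sharply with N at fixed P, then shows
# diminishing returns as N grows further.
import numpy as np

rng = np.random.default_rng(1)

def split_half_agreement(n, p, signal=1.0, reps=20):
    """Mean squared overlap of the leading PCs from two independent
    samples of size n in p dimensions (0 = irreproducible, 1 = perfect)."""
    u = np.zeros(p)
    u[0] = 1.0                              # true population direction
    agreements = []
    for _ in range(reps):
        pcs = []
        for _half in range(2):              # two independent "studies"
            z = rng.standard_normal((n, 1))
            X = np.sqrt(signal) * z @ u[None, :] + rng.standard_normal((n, p))
            _, vecs = np.linalg.eigh(X.T @ X / n)
            pcs.append(vecs[:, -1])         # leading sample PC
        agreements.append((pcs[0] @ pcs[1]) ** 2)
    return float(np.mean(agreements))

p = 400
for n in (100, 400, 1600, 6400):            # sweep sample size at fixed P
    print(n, round(split_half_agreement(n, p), 3))
```

For small N/P the two fits barely agree; past the transition, quadrupling N buys progressively smaller gains in agreement, mirroring the diminishing marginal returns described above.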
Funding Goals
The Office of Research Infrastructure Programs (ORIP) is a program office in the Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI) dedicated to supporting research infrastructure and related research resource programs. ORIP consists of the Division of Comparative Medicine (DCM) and the Division of Construction and Instruments (DCI).
Grant Program (CFDA)
Awarding Agency
Place of Performance
New York
United States
Geographic Scope
State-Wide
Weill Medical College of Cornell University was awarded Project Grant R01OD039830, "Reproducibility in Animal Model Research," worth $3,339,048, by the NIH Office of the Director in August 2025, with work to be performed primarily in New York, United States. The grant has a duration of four years and was awarded through assistance program 93.351, Research Infrastructure Programs. The Project Grant was awarded under the grant opportunity "Development of Resources and Technologies for Enhancing Rigor, Reproducibility, and Translatability of Animal Models in Biomedical Research (R01)."
Status
(Ongoing)
Last Modified 8/6/25
Period of Performance
8/1/25
Start Date
7/31/29
End Date
Funding Split
$3.3M
Federal Obligation
$0.0
Non-Federal Obligation
$3.3M
Total Obligated
Activity Timeline
Additional Detail
Award ID FAIN
R01OD039830
SAI Number
R01OD039830-107064823
Award ID URI
SAI UNAVAILABLE
Awardee Classifications
Private Institution Of Higher Education
Awarding Office
75AGNA NIH AGGREGATE FINANCIAL ASSISTANCE DATA AWARDING OFFICE
Funding Office
75NA00 NIH OFFICE OF THE DIRECTOR
Awardee UEI
YNT8TCJH8FQ8
Awardee CAGE
1UMU6
Performance District
NY-90
Senators
Kirsten Gillibrand
Charles Schumer