R01OD039830
Project Grant
Overview
Grant Description
Reproducibility statistics and machine learning methods for systematic phenotyping and model integration across animals, organs, and technologies
Project Summary/Abstract
Modern high throughput biomedical research is collecting an ever-increasing number of variables spanning different physical scales, systems, and species.
In this era of “multi-omics”, multiple data sets are often collected on the same animal, spanning modalities that may variously include genomics, transcriptomics, proteomics, epigenomics, metabolomics, connectomics, and other “omics” domains, with each individual omics data set consisting of hundreds up to hundreds of thousands of variables.
These exciting developments in data acquisition have led animal model researchers into a new data paradigm, with many labs now relying on machine learning methods for dimensionality reduction, clustering, phenotyping, data integration, and similar techniques assembled into pipelines as part of standard practice.
Frequently in this scenario, the number of variables collected P (e.g., metabolites) becomes as large as (or larger than) the number of observations N (e.g., mice).
In statistics, this is referred to as the large dimensional limit (LDL) regime.
This dramatic increase in the number of variables relative to observations changes the statistical behavior of the data, leading to significant problems with machine learning model fitting behavior that are not widely appreciated outside certain domains of mathematical statistics and physics.
Estimates that should fundamentally describe the structure of data, such as the principal components, no longer converge to their true underlying values (as they do in the classical “small data” case when P is much smaller than N).
Further, similar failures extend far beyond principal components, impacting many other widely used machine learning approaches; multi-stage analysis pipelines and complex animal data distributions compound these problems, collectively resulting in a modern reproducibility crisis in animal research.
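The non-convergence of principal components in the large dimensional limit can be illustrated with a small simulation. This is an editorial sketch, not part of the grant: the spiked-covariance model, the `signal` strength, and all sample sizes below are illustrative assumptions.

```python
# Sketch (illustrative assumptions): a spiked-covariance simulation showing
# why sample principal components stop converging to the true direction
# when the number of variables P is comparable to the sample size N.
import numpy as np

rng = np.random.default_rng(0)

def top_pc_overlap(n, p, signal=2.0):
    """Squared overlap between the true spike direction and the
    leading sample principal component (1.0 = perfect recovery)."""
    u = np.zeros(p)
    u[0] = 1.0                              # true population direction
    z = rng.standard_normal((n, 1))         # latent factor scores
    X = np.sqrt(signal) * z @ u[None, :] + rng.standard_normal((n, p))
    cov = X.T @ X / n                       # sample covariance matrix
    _, vecs = np.linalg.eigh(cov)           # eigenvectors, ascending order
    return float((vecs[:, -1] @ u) ** 2)

# Classical "small data" regime (P << N): the estimate is accurate.
print(top_pc_overlap(n=2000, p=20))
# Large dimensional regime (P ~ N): the same estimator stays biased away
# from the truth even though N is unchanged.
print(top_pc_overlap(n=2000, p=2000))
```

In the second call the overlap plateaus well below 1 no matter how the seed is varied, which is the non-convergence the abstract describes.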
Here we show that these issues also broadly impact the test-retest reproducibility of machine learning methods on held-out data, with reproducibility on new data following a “universal reproducibility curve” in which performance changes rapidly from non-reproducible to reproducible as a function of sample size and number of variables.
A direct consequence of this phase transition is that modern animal model research studies are subject to a reproducibility transition: below a certain sample-to-variable ratio N/P, machine learning algorithms fail to yield reproducible results.
Furthermore, for ratios N/P above this value, collecting additional data will yield rapidly diminishing marginal returns in reproducibility that may not be justified by data acquisition costs or trade-offs.
Together these phenomena define what we call “universal reproducibility curves” that depend on the “aspect ratio” P/N of the data (as well as the strength of the multivariate biological signal that the data contain).
Gaining a mechanistic understanding of such curves will allow us to systematically solve important open problems in machine learning such as how to measure reproducibility across machine learning methods and pipelines, how many data samples to collect and how many variables to measure to ensure reproducibility, when to stop collecting data, and how to design better embedding algorithms for data integration and systematic phenotyping in real-world animal research models.
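One way to see such a reproducibility curve empirically is to fit PCA on two independent samples of the same population and measure how well the leading components agree as the sample size grows at fixed dimension. This sketch is an editorial illustration, not the grant's method; the model, `signal` level, and the sweep of sample sizes are assumptions chosen to make the transition visible.

```python
# Sketch (illustrative assumptions): a split-sample "reproducibility curve"
# for PCA. Agreement between leading principal components fitted on two
# independent draws rises sharply with N at fixed P, then shows
# diminishing returns as N grows further.
import numpy as np

rng = np.random.default_rng(1)

def split_half_agreement(n, p, signal=1.0, reps=20):
    """Mean squared overlap of the leading PCs from two independent
    samples of size n in p dimensions (0 = irreproducible, 1 = perfect)."""
    u = np.zeros(p)
    u[0] = 1.0                              # true population direction
    agreements = []
    for _ in range(reps):
        pcs = []
        for _half in range(2):              # two independent "studies"
            z = rng.standard_normal((n, 1))
            X = np.sqrt(signal) * z @ u[None, :] + rng.standard_normal((n, p))
            _, vecs = np.linalg.eigh(X.T @ X / n)
            pcs.append(vecs[:, -1])         # leading sample PC
        agreements.append((pcs[0] @ pcs[1]) ** 2)
    return float(np.mean(agreements))

p = 400
for n in (100, 400, 1600, 6400):            # sweep sample size at fixed P
    print(n, round(split_half_agreement(n, p), 3))
```

For small N/P the two fits barely agree; past the transition, quadrupling N buys progressively smaller gains in agreement, mirroring the diminishing marginal returns described above.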
Funding Goals
The Office of Research Infrastructure Programs (ORIP) is a program office in the Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI) dedicated to supporting research infrastructure and related research resource programs. ORIP consists of the Division of Comparative Medicine (DCM) and the Division of Construction and Instruments (DCI).
Grant Program (CFDA)
Awarding Agency
Place of Performance
New York
United States
Geographic Scope
State-Wide
Weill Medical College of Cornell University was awarded Project Grant R01OD039830, "Reproducibility in Animal Model Research," worth $3,339,048, by the NIH Office of the Director in August 2025, with work to be performed primarily in New York, United States. The grant has a duration of four years and was awarded through assistance program 93.351, Research Infrastructure Programs. The Project Grant was awarded under the grant opportunity "Development of Resources and Technologies for Enhancing Rigor, Reproducibility, and Translatability of Animal Models in Biomedical Research (R01)."
Status
(Ongoing)
Last Modified 8/6/25
Period of Performance
8/1/25
Start Date
7/31/29
End Date
Funding Split
$3.3M
Federal Obligation
$0.0
Non-Federal Obligation
$3.3M
Total Obligated
Activity Timeline
Additional Detail
Award ID FAIN
R01OD039830
SAI Number
R01OD039830-107064823
Award ID URI
SAI UNAVAILABLE
Awardee Classifications
Private Institution Of Higher Education
Awarding Office
75AGNA NIH AGGREGATE FINANCIAL ASSISTANCE DATA AWARDING OFFICE
Funding Office
75NA00 NIH OFFICE OF THE DIRECTOR
Awardee UEI
YNT8TCJH8FQ8
Awardee CAGE
1UMU6
Performance District
NY-90
Senators
Kirsten Gillibrand
Charles Schumer