2311521

Project Grant

Overview

Grant Description
Frameworks: ArXiv as an Accessible Large-Scale Open Research Platform
ArXiv is an open-access repository that has played a leading role in disciplines such as computer science, mathematics, and physics for over 30 years. It hosts more than 2 million scientific papers and has a large user community. Each month, there are approximately 5 million active users and 100 million web accesses.

Despite its size and usage, ArXiv has very limited search and recommendation functionality. To better serve the ArXiv community, this project is building a new generation of search and recommendation functionality and, at the same time, creating a research sandbox to reduce reliance on third-party commercial services.

To make ArXiv's trove of scientific content accessible to the visually impaired, support is being added for well-structured HTML as well as PDF. Improved discovery of research results provides broad multidisciplinary benefits across areas of science. These include less researcher time wasted browsing through large numbers of irrelevant papers, revelation of unknown unknowns, and acceleration of research across different subject areas through unexpected synergies.

Improved recommendation tools, which can provide unbiased and diverse sources of relevant research results and techniques, are urgently needed to break silos. ArXiv will provide improved mechanisms for scientists to find out about important advances, both in their own field of expertise and in adjacent fields.

This project includes four major focus areas: open A/B testing, neural representations of scientific text, ArXiv dynamics, and security & privacy.

(1) Open A/B testing enables ArXiv to become a platform for A/B testing of search and recommendation algorithms. In addition to online A/B testing, offline A/B testing is provided using historical data along with counterfactual estimators for policy rewards.

(2) Neural representation of scientific text provides a vector-based representation of scientific texts (documents, paragraphs, and sentences) appropriate for multiple tasks, including citation, author, title, and keyword prediction. Differentiable search indices are investigated for their potential to improve search performance further without requiring incremental re-training. Finally, this work supports the construction of a scientific question-answering system that can also be used as a context-sensitive chatbot, enabling researchers to converse with it and get a list of recent publications relevant to their interests.

(3) The ArXiv dynamics project investigates how scientific fields grow, shrink, and transform over time. A pattern-recognition system for trending and emerging ArXiv topics predicts how interesting current and historical articles are to researchers. Research is investigating methods to remove the rich-get-richer effect from this model, to correct the model for the effects of users' historical interactions with the system, and to track performance and solicit user feedback as these models change over time.

(4) Under security & privacy, ArXiv's privacy policy is updated so that users are aware of how their (meta-)data may be used and of the protections that will be deployed to safeguard their privacy. A Layer 1 API allows researchers to make coarse-grained queries on anonymized ArXiv weblogs, and a Layer 2 API allows researchers to securely experiment on ArXiv metadata and weblogs. Privacy is preserved by a combination of query restrictions and researcher usage agreements. A machine-learning API layer is being developed that supports differential privacy and allows researchers to investigate the utility of these tools for novel ML-based applications, such as free-form question answering about scientific texts and neural recommender systems.
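
Focus area (1) mentions offline A/B testing on historical data with counterfactual estimators for policy rewards. The description does not say which estimators are used; the sketch below assumes inverse propensity scoring (IPS), one standard choice, and every name in it (ips_estimate, logging_probs, target_probs, clip) is hypothetical rather than part of any ArXiv API. It reweights each logged reward by how much more (or less) likely the candidate policy was to show the same item than the policy that produced the log.

```python
from typing import Optional, Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
    clip: Optional[float] = None,
) -> float:
    """Inverse propensity scoring estimate of a target policy's mean reward.

    rewards[i]       -- reward observed for the logged action (e.g. 1.0 = click)
    logging_probs[i] -- probability the logging policy assigned to that action
    target_probs[i]  -- probability the candidate policy assigns to that action
    clip             -- optional cap on importance weights (reduces variance,
                        at the cost of some bias)
    """
    if not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("at least one logged interaction is required")

    total = 0.0
    for reward, p_log, p_tgt in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        weight = p_tgt / p_log
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(rewards)


if __name__ == "__main__":
    # Toy log, invented for illustration only.
    logged_rewards = [1.0, 0.0, 1.0, 0.0]
    logged_probs = [0.5, 0.5, 0.25, 0.25]
    candidate_probs = [0.5, 0.5, 0.5, 0.5]
    print(ips_estimate(logged_rewards, logged_probs, candidate_probs))  # 0.75
```

Clipping the weights is a common variance-reduction option; whether the project uses clipping, or a different estimator entirely, is not stated in the description.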
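
Focus area (2) describes vector-based representations of scientific text. As a minimal sketch of how such vectors can drive search, the example below ranks documents by cosine similarity to a query vector. The two-dimensional vectors and document IDs are invented for illustration; in practice the vectors would come from a trained neural encoder, and the description does not specify the model or the similarity measure.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], docs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical embeddings; real ones would have hundreds of dimensions.
    toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    ranking = rank_by_similarity([1.0, 0.1], toy_docs)
    print([doc_id for doc_id, _ in ranking])  # ['a', 'c', 'b']
```

A linear scan like this is fine for a toy example; a collection of millions of papers would normally need an approximate nearest-neighbor index instead.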
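
Focus area (4) refers to a machine-learning API layer that supports differential privacy, without naming a mechanism. As an illustration only, the sketch below applies the Laplace mechanism, a standard building block, to a counting query such as "how many weblog entries match this filter". The function names and the choice of a counting query are assumptions, not part of the described APIs.

```python
import random
from typing import Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(
    true_count: float,
    epsilon: float,
    rng: Optional[random.Random] = None,
    sensitivity: float = 1.0,
) -> float:
    """Return a noisy count satisfying epsilon-differential privacy.

    sensitivity is the most one individual can change the count; for a
    simple counting query where each user contributes one record it is 1.
    """
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if sensitivity <= 0.0:
        raise ValueError("sensitivity must be positive")
    rng = rng if rng is not None else random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(private_count(1200, epsilon=0.5, rng=rng))
```

Smaller values of epsilon add more noise and give stronger privacy. Repeated queries consume privacy budget, which is one reason a design like the one described would pair such a mechanism with query restrictions.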

This award by the Office of Advanced Cyberinfrastructure is jointly supported by the Division of Information and Intelligent Systems in the Directorate for Computer and Information Science and Engineering and the Division of Physics within the Directorate for Mathematical and Physical Sciences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria. Subawards are not planned for this award.
Funding Goals
The goal of this funding opportunity, "Cyberinfrastructure for Sustained Scientific Innovation", is identified in the link: HTTPS://WWW.NSF.GOV/PUBLICATIONS/PUB_SUMM.JSP?ODS_KEY=NSF22632
Place of Performance
Ithaca, New York 14850-2820 United States
Geographic Scope
Single Zip Code
Analysis Notes
Amendment: Since the initial award, total obligations have increased 11% from $4,466,530 to $4,966,530.
Cornell University was awarded Project Grant 2311521, "Enhancing ARXIV: Search, Recommendations & Accessibility", worth $4,966,530, by the NSF Office of Advanced Cyberinfrastructure in January 2024, with work to be completed primarily in Ithaca, New York, United States. The grant has a duration of 5 years and was awarded through assistance program 47.070 Computer and Information Science and Engineering. The Project Grant was awarded through grant opportunity Cyberinfrastructure for Sustained Scientific Innovation.

Status
(Ongoing)

Last Modified 9/22/23

Period of Performance
Start Date
1/1/24
End Date
12/31/28
38.0% Complete

Funding Split
Federal Obligation
$5.0M
Non-Federal Obligation
$0.0
Total Obligated
$5.0M
100.0% Federal Funding
0.0% Non-Federal Funding

Additional Detail

Award ID FAIN
2311521
SAI Number
None
Award ID URI
SAI EXEMPT
Awardee Classifications
Private Institution Of Higher Education
Awarding Office
490509 OFC OF ADV CYBERINFRASTRUCTURE
Funding Office
490509 OFC OF ADV CYBERINFRASTRUCTURE
Awardee UEI
G56PUALJ3KT5
Awardee CAGE
4B578
Performance District
NY-19
Senators
Kirsten Gillibrand
Charles Schumer

Budget Funding

Federal Account
Research and Related Activities, National Science Foundation (049-0100)
Budget Subfunction
General science and basic research
Object Class
Grants, subsidies, and contributions (41.0)
Total
$4,966,530
Percentage
100%