2311521
Project Grant
Overview
Grant Description
Frameworks: ArXiv as an Accessible Large-Scale Open Research Platform
ArXiv is an open-access repository that has played a leading role in disciplines such as computer science, mathematics, and physics for over 30 years. It hosts more than 2 million scientific papers and has a large user community. Each month, there are approximately 5 million active users and 100 million web accesses.
Despite its size and usage, ArXiv has very limited search and recommendation functionality. In order to better serve the ArXiv community, this project is building a new generation of search and recommendation functionality and simultaneously creating a research sandbox to reduce reliance on third-party, commercial services.
To make ArXiv's trove of scientific content accessible to the visually impaired, support is being added for well-structured HTML as well as PDF. Improved discovery of research results provides broad multidisciplinary benefits across areas of science. These include less researcher time wasted browsing large numbers of irrelevant papers, the revelation of unknown unknowns, and accelerated research across different subject areas through unexpected synergies.
Improved recommendation tools, which can provide unbiased and diverse sources of relevant research results and techniques, are urgently needed to break silos. ArXiv will provide improved mechanisms for scientists to find out about important advances, both in their own field of expertise and in adjacent fields.
This project includes 4 major focus areas: open A/B testing, neural representations of scientific text, ArXiv dynamics, and security & privacy.
(1) Open A/B testing enables ArXiv to become a platform for A/B testing of search and recommendation algorithms. In addition to online A/B testing, offline A/B testing is provided using historical data along with counterfactual estimators for policy rewards.
(2) Neural representation of scientific text provides a vector-based representation of scientific texts (documents, paragraphs, and sentences) appropriate for multiple tasks, including citation, author, title, and keyword prediction. Differentiable search indices are investigated due to their potential to provide additional search performance improvements without requiring incremental re-training. Finally, these representations support the construction of a scientific question-answering system, which can also be used as a context-sensitive chat-bot that lets researchers converse with it and get a list of recent publications relevant to their interests.
(3) The ArXiv dynamics project investigates how scientific fields grow, shrink, and transform over time. It creates a pattern recognition system for trending and emerging ArXiv topics that predicts how interesting current and historical articles are to researchers. Research is investigating methods to remove the rich-get-richer effect from this model, to correct the model for the effects of the users' historical interactions with the system, and to track performance and solicit user feedback as these models change over time.
(4) Under security & privacy, ArXiv's privacy policy is updated so that users are aware of how their (meta-)data may be used and of the safeguards that will be deployed to protect their privacy. A Layer 1 API allows researchers to make coarse-grained queries on anonymized ArXiv weblogs, and a Layer 2 API allows researchers to securely experiment on ArXiv metadata and weblogs. Privacy is preserved by a combination of query restrictions and researcher usage agreements. A machine-learning API layer is being developed which supports differential privacy and allows researchers to investigate the utility of these tools for novel ML-based applications, such as free-form question answering about scientific texts, neural recommender systems, etc.
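The offline A/B testing in focus area (1) relies on counterfactual estimators for policy rewards. The grant text does not say which estimator is used; the sketch below assumes inverse propensity scoring (IPS), a standard choice, and every name in it (`ips_estimate`, the log tuple layout, the rankers "A" and "B") is hypothetical rather than part of any ArXiv API.

```python
def ips_estimate(logs, target_policy):
    """Estimate the average reward a target policy would have earned.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability with which the historical
          (logging) policy chose `action` for `context`.
    target_policy: function (context, action) -> probability that the
          new policy would choose `action` for `context`.
    """
    if not logs:
        raise ValueError("logs must not be empty")
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Hypothetical historical data: two rankers shown with equal probability;
# a reward of 1.0 stands for a click.
example_logs = [("query", "A", 0.0, 0.5), ("query", "B", 1.0, 0.5)] * 50


def always_b(context, action):
    """Target policy that always serves ranker B."""
    return 1.0 if action == "B" else 0.0


print(ips_estimate(example_logs, always_b))
```

IPS is unbiased only when the logging policy gives non-zero probability to every action the target policy might take, which is one reason logged propensities need to be recorded alongside the interactions.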
This award by the Office of Advanced Cyberinfrastructure is jointly supported by the Division of Information and Intelligent Systems in the Directorate for Computer and Information Science and Engineering and the Division of Physics within the Directorate for Mathematical and Physical Sciences. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria. Subawards are not planned for this award.
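The machine-learning API layer in focus area (4) is described as supporting differential privacy, but the record does not name a mechanism. As an illustration only, the sketch below assumes the Laplace mechanism applied to a counting query over weblog records; the function name `private_count`, the record field `category`, and the choice of epsilon are all hypothetical.

```python
import random


def private_count(records, predicate, epsilon, rng, sensitivity=1.0):
    """Return a count of matching records with Laplace noise added.

    Adding or removing one record changes a count by at most 1, so the
    default sensitivity is 1 and the noise scale is sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    scale = sensitivity / epsilon
    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical anonymized weblog rows.
weblog = [{"category": "cs.LG"}] * 3 + [{"category": "math.AG"}] * 2
rng = random.Random(0)
noisy = private_count(weblog, lambda r: r["category"] == "cs.LG", 1.0, rng)
print(noisy)
```

Smaller values of epsilon add more noise and give stronger privacy, so coarse-grained queries over large populations tolerate the noise better than fine-grained ones.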
Awardee
Funding Goals
The goal of this funding opportunity, "Cyberinfrastructure for Sustained Scientific Innovation", is identified in the link: HTTPS://WWW.NSF.GOV/PUBLICATIONS/PUB_SUMM.JSP?ODS_KEY=NSF22632
Grant Program (CFDA)
Awarding / Funding Agency
Place of Performance
Ithaca, New York 14850-2820, United States
Geographic Scope
Single Zip Code
Related Opportunity
Analysis Notes
Amendment: Since the initial award, total obligations have increased 11%, from $4,466,530 to $4,966,530.
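The 11% figure can be checked from the two obligation amounts in the note. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
initial_obligation = 4_466_530  # dollars, initial award
current_obligation = 4_966_530  # dollars, after amendment

increase = current_obligation - initial_obligation
percent_increase = 100 * increase / initial_obligation

print(f"${increase:,} increase ({percent_increase:.1f}%)")
```

The increase is $500,000, or about 11.2% of the initial award, which rounds to the 11% reported.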
Cornell University was awarded Enhancing ARXIV: Search, Recommendations & Accessibility (Project Grant 2311521), worth $4,966,530, from the NSF Office of Advanced Cyberinfrastructure in January 2024, with work to be completed primarily in Ithaca, New York, United States. The grant has a duration of 5 years and was awarded through assistance program 47.070 Computer and Information Science and Engineering. The Project Grant was awarded through grant opportunity Cyberinfrastructure for Sustained Scientific Innovation.
Status
(Ongoing)
Last Modified 9/22/23
Period of Performance
Start Date
1/1/24
End Date
12/31/28
Funding Split
Federal Obligation
$5.0M
Non-Federal Obligation
$0.0
Total Obligated
$5.0M
Additional Detail
Award ID FAIN
2311521
SAI Number
None
Award ID URI
SAI EXEMPT
Awardee Classifications
Private Institution Of Higher Education
Awarding Office
490509 OFC OF ADV CYBERINFRASTRUCTURE
Funding Office
490509 OFC OF ADV CYBERINFRASTRUCTURE
Awardee UEI
G56PUALJ3KT5
Awardee CAGE
4B578
Performance District
NY-19
Senators
Kirsten Gillibrand
Charles Schumer
Budget Funding
| Federal Account | Budget Subfunction | Object Class | Total | Percentage |
|---|---|---|---|---|
| Research and Related Activities, National Science Foundation (049-0100) | General science and basic research | Grants, subsidies, and contributions (41.0) | $4,966,530 | 100% |
Modified: 9/22/23