In the span of a week, a unique cooperative effort between academic, government and industry researchers has produced a new, structured dataset that the worldwide machine learning community can use to advance COVID-19 research. The COVID-19 Open Research Dataset (CORD-19)[1], which comprises more than 24,000 scholarly articles (including more than 10,000 full-text articles) about coronavirus family viruses, goes live on Monday at SemanticScholar.org[2]. It's the most extensive machine-readable coronavirus literature collection available for data and text mining to date.
The effort was organized by the White House. The organizations that helped structure the data include the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University's Center for Security and Emerging Technology, Microsoft Research and the National Library of Medicine (NLM) of the National Institutes of Health (NIH).
Now that the dataset is available, the White House Office of Science and Technology Policy and the organizations involved are issuing a call to action[3] to the nation's AI experts to develop new text and data mining techniques that could help answer high-priority scientific questions related to COVID-19.
These questions relate to the virus's incubation, treatment, symptoms and prevention, according to US CTO Michael Kratsios. They were developed in coordination with the World Health Organization (WHO) and the National Academies of Sciences, Engineering, and Medicine's Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats. The key questions are available on Kaggle[4], where researchers can submit their insights.
This is a "truly all hands-on-deck approach," Kratsios said Monday.
In the face of a crisis like the COVID-19 pandemic, "the biggest challenge a researcher faces initially is understanding, 'Where can I contribute? What has already been done?,'" the Allen Institute's Doug Raymond said to ZDNet. "Without resources like the