COVID-ARC helps address the immediate need to understand the spread and impact of COVID-19 with a platform of networked and centralized archives that store, curate, visualize, and disseminate multimodal data related to the disease. The platform includes patients’ demographics, clinical evaluations, vitals, EKGs, and imaging data, such as CT, X-Ray, PET, and MRI.
COVID-ARC data, together with a variety of analytic tools, are shared broadly with the world-wide scientific community. This will help maximize the potential for research progress by uniting scientists from diverse fields, including medicine, public health, and artificial intelligence.
COVID-19 Data
COVID-ARC is a data archive that stores multimodal (i.e., demographic information, clinical outcome reports, imaging scans) and longitudinal data related to COVID-19 and provides various statistical and analytic tools for researchers. This archive provides access to data along with user-friendly tools for researchers to perform analyses to better understand COVID-19 and encourage collaboration on this research. The COVID-19 pandemic is spreading rapidly across the world, and governments are imposing travel bans, quarantine laws, business and school closings, and many other restrictions in efforts to contain the virus and limit the spread. However, much is still unknown about COVID-19. There is an urgent need for scientists around the world to work together to model the virus, study how the virus has changed and will change over time, understand how it spreads, and discover a vaccine. The work from this project can also prepare scientists for future pandemics by putting the infrastructure in place to enable researchers to aggregate data and perform analyses quickly in the event of an emergency.
The approach is to develop a platform of networked and centralized web-accessible data archives that store multimodal data related to COVID-19 and make them broadly available to the world-wide scientific community, expediting research given the urgency of the COVID-19 pandemic. By leveraging previous work in developing data repositories and archival capabilities at the Laboratory of Neuro Imaging at the USC Mark and Mary Stevens Neuroimaging and Informatics Institute, COVID-ARC aims to provide an efficient and secure data repository platform that facilitates data access and analysis. COVID-ARC provides tools for researchers to visualize and analyze various types of data as well as a website with tools for training, announcements, virtual information sessions, and a knowledgebase wherein researchers post questions and receive answers from the community.
An efficient, secure, HIPAA-compliant data repository platform.
A multi-center data review and assessment system that preserves data quality, fidelity and provenance.
Mechanisms and regular training sessions to ease data aggregation and the downloading of large datasets.
Spatially normalize data and create subject cohorts to search, compare and download data.
Integrated processing on an extensible framework of protocols that can integrate with modules from any other software suite.
COVID-ARC accommodates a wide variety of data types related to COVID-19, including clinical evaluation
(symptoms), vitals (spirometry, temperature, respiration rate, heart rate, etc.), demographic,
geolocation, EKG, EEG, CT, X-ray, PET, and MRI, in order to create a comprehensive picture of the
virus and its spread. The following table provides an overview of data types currently accepted,
but due to its adaptable design, other formats will be integrated according to end-user needs.
LONI has more than 18 PB of storage capacity.
Data Categories | Data Types | File Formats* |
---|---|---|
Imaging | Structural MRI, resting state fMRI, DTI, PET, CSM, CT | DICOM, NIFTI, NII, MGZ |
Clinical Data | Symptoms, vitals, patient history, medical history, cognitive assessments, demographic, geolocation | Python scripts, R, C code, Excel, CSV, NPY |
Temporal Recordings | EEG, ECoG, multi- or single-unit microelectrode recording, EMG, TMS | Python scripts, R, C code, Excel, MP4, AVI, WAV, CSV, NPY |
*The file formats listed provide a snapshot of COVID-ARC’s capabilities but do not represent a comprehensive list. Other formats will be integrated over time according to end-user needs.
QUALITY
Some data, such as EEG, are subject to amplifier noise, technical artifacts (e.g. poor electrode location, issues with electrode impedance), and physiological artifacts (e.g. eye movement). In order to ensure consistent data quality, COVID-ARC uses range checking, signal-to-noise ratio checks, artifact removal techniques, power spectrum analysis, and various filtering methods to inspect for noise. The LONI Quality Control System (LONI QC) is used for all modalities of imaging data and regularly reviewed by participating collaborators.
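As a rough illustration of two of the checks named above (range checking and a signal-to-noise ratio check), the following minimal Python sketch may help. It is not COVID-ARC's or LONI QC's actual implementation: the function names, the amplitude limits, and the idea of estimating noise from a separate reference segment are all assumptions made for this example.

```python
import math


def out_of_range(samples, low, high):
    """Return the indices of samples that fall outside [low, high]."""
    return [i for i, x in enumerate(samples) if not (low <= x <= high)]


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from two sample segments.

    Assumes `noise` is a segment known to contain only noise
    (a hypothetical choice; real pipelines estimate noise in other ways).
    """
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


# Hypothetical limits: flag samples beyond +/-200 units.
flagged = out_of_range([0.0, 50.0, -250.0, 10.0], -200.0, 200.0)
ratio = snr_db([10.0, -10.0, 10.0, -10.0], [1.0, -1.0, 1.0, -1.0])
```

A recording could then be marked for review when `flagged` is non-empty or when `ratio` falls below a threshold chosen for the modality.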
PROVENANCE
All information pertaining to acquisition, QC, pre-processing, and analyses is captured and retained, providing a comprehensive history and provenance to the data. When algorithms are executed within the LONI Pipeline, provenance is captured in the form of machine- and human-readable XML files.
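To make the idea of a machine- and human-readable XML provenance record concrete, here is a minimal Python sketch. The element and attribute names (`provenance`, `step`, `tool`, `version`) and the tool names are invented for illustration; they are not the schema that the LONI Pipeline actually emits.

```python
import xml.etree.ElementTree as ET


def build_provenance(subject_id, steps):
    """Serialize processing steps into a small XML provenance record.

    `steps` is a list of (step_name, tool, version) tuples.
    The layout is hypothetical, not the LONI Pipeline format.
    """
    root = ET.Element("provenance", {"subject": subject_id})
    for name, tool, version in steps:
        step = ET.SubElement(root, "step", {"name": name})
        ET.SubElement(step, "tool").text = tool
        ET.SubElement(step, "version").text = version
    return ET.tostring(root, encoding="unicode")


# Placeholder subject ID and tool names, for illustration only.
record = build_provenance(
    "sub-001",
    [("quality-control", "example-qc-tool", "1.0"),
     ("registration", "example-registration-tool", "2.3")],
)
```

Because each step is recorded in order with its tool and version, a reader (or a script) can reconstruct how a derived file was produced.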
CONTROL
COVID-ARC relies on an infrastructure comprised of fail-safe, redundant, and secure components to store both raw and processed data. In the event of a single system failure, redundant web, application, and database servers ensure service continuity, while data backup mechanisms are in place to protect the integrity of data. Investigators may choose to use centralized, federated, or cloud-based solutions to store their data. Data providers will maintain control over their data at all times; COVID-ARC simply provides a user-friendly tool to facilitate the storage, management, and sharing of those data.
Case Summaries
Johns Hopkins University Coronavirus Resource Center
Comprehensive global COVID-19 case tracker with critical trend evaluation.
Coronavirus (COVID-19) Data in the United States
The New York Times is compiling time series data of confirmed and probable COVID-19 cases and deaths at the federal, state, and county level
Coronavirus (COVID-19) Cases Worldwide
Worldwide confirmed tested cases of COVID-19 and the number of deaths and recoveries from the disease sourced from Johns Hopkins University Center for Systems Science and Engineering
Novel Coronavirus 2019 Dataset
Day-level information on COVID-19 cases worldwide extracted from Johns Hopkins Coronavirus Resource Center.
COVID-19 Case Surveillance Dataset
How to request access?
Data are made available for limited use upon completion of the registration
information and data use restrictions agreement (RIDURA).
To access the restricted dataset, please email: eocevent394@cdc.gov and include the completed RIDURA.
Access will be granted through https://github.com/cdc-data.
What variables are included in the dataset?
The public dataset includes the following variables: Initial case report date to CDC,
Date of first positive specimen collection, Symptom onset date, if symptomatic,
Case status, Sex, Age group (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80+ years),
Race and ethnicity (combined), Hospitalization status, ICU admission status, Death status and
Presence of underlying comorbidity or disease.
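For researchers who want to harmonize their own records with the ten-year age groups listed above, a small Python sketch follows. The label strings (`"0-9"`, `"80+"`) are an assumption made for this example; the exact strings used in the CDC files should be checked against the dataset itself.

```python
def age_group(age):
    """Map an age in whole years to a ten-year bin, with 80+ as the top bin."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 80:
        return "80+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


# Example: bin a few ages.
groups = [age_group(a) for a in (0, 37, 79, 80, 95)]
```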
In addition to variables in the public dataset, the restricted access
dataset also includes the following variables: State of residence,
County of residence, County FIPS code, Healthcare worker status,
Pneumonia present, Acute respiratory distress syndrome (ARDS) present,
Abnormal chest x-ray (CXR) present, Mechanical ventilation (MV)/intubation status and
Presence of each of the following symptoms: fever, subjective fever, chills, myalgia, rhinorrhea,
sore throat, cough, shortness of breath, nausea/vomiting, headache, abdominal pain, diarrhea.
Dataset was constructed for the purpose of pneumonia lesion segmentation and it contains CT scans (3D volumes) of 120 patients diagnosed with COVID-19.
Dataset of the CT images and metadata are constructed from cohorts from the China Consortium of Chest CT Image Investigation (CC-CCII). All CT images are classified into novel coronavirus pneumonia (NCP) due to SARS-CoV-2 virus infection, common pneumonia and normal controls.
Chest X-ray dataset from the Institute for Diagnostic and Interventional Radiology, Hannover Medical School, Hannover Germany. The public dataset containing 243 images for COVID-19 positive patients also includes extensive metadata for each image.
Database of chest X-ray images with 219 COVID-19 positive patients, 1345 viral pneumonia images, and 1341 normal images.
Oxford COVID-19 Government Response Tracker (OxCGRT)
Oxford COVID-19 Government Response Tracker (OxCGRT) provides a systematic way to track government responses to COVID-19 across countries and sub-national jurisdictions over time. OxCGRT can be used to describe variation in government responses, explore whether the government response affects the rate of infection, and identify correlates of more or less intense responses.
COVID-19 Image Data Collection
Public dataset of chest X-ray and CT images of COVID-19 positive patients and patients suspected of COVID-19 or other viral/bacterial pneumonias
Platform to aggregate and summarize open datasets related to COVID-19
Public Data Lake for Analysis of COVID-19 Data
Centralized repository of up-to-date and curated COVID-19 datasets hosted by Amazon Web Services.
Collection of dataverses and datasets related to COVID-19 cases in China and the United States hosted by China
Open-Access Data and Computational Resources to Address COVID-19
Synthesized open-access data and computational resources freely available to researchers aggregated by the National Institutes of Health
Collection of open datasets about COVID-19 hosted by AMiner
COVID-19 Public Datasets Program
Repository of public datasets related to COVID-19 and the spread of COVID-19 hosted on Google Cloud Platform
Public chest CT image dataset of 1,252 COVID-19 positive and 1,230 COVID-19 negative patients from Sao Paulo, Brazil
COVID-19 Collective CT Image Dataset
Chest CT dataset with 349 COVID-19 positive images and 463 COVID-19 negative images from various hospitals in China
COVID-19 CT Image and Lung Segmentation Dataset
Chest CT dataset containing 100 COVID-19 positive images and lung segmentation masks from the Italian Society of Medical and Interventional Radiology (SIRM)
COVID-19 CT Image and Segmentation Dataset
Chest CT dataset with 20 images and the corresponding segmentation masks from Wenzhou Medical University and Netanya, Israel
COVID-19 CT and CX Image Dataset
CT and CX dataset from the Valencian Region Medical ImageBank (BIMCV)
CT Scan database containing 1,110 COVID-19 positive cases and 50 lung segmentation masks from the Moscow Center of Diagnostics and Telemedicine
COVID-19 CT dataset containing 63,849 images from 377 patients from the Negin Medical Center in Sari, Iran
Pneumonia X-ray Image Collection
X-ray dataset from Guangzhou Women and Children’s Medical Center containing 5,863 images from healthy patients and patients diagnosed with bacterial/viral pneumonias
Data upload uses secure encryption and is HIPAA compliant. Once data are uploaded, they are securely stored and regularly backed up.
ASPERA is a HIPAA-compliant software utility from IBM used to transfer large datasets. It is a fast, lightweight file transfer client that is not subject to the limitations of web browsers. This upload method requires providers to install ASPERA on their local machines and request the connection and host credentials from COVID-ARC, which include the correct file paths and storage locations on COVID-ARC servers. Once ASPERA is installed, providers simply log in and select files to transfer. No file structure or naming requirements are involved, and data can be deleted from COVID-ARC at any time.
View ASPERA instructions and ASPERA HIPAA Compliance
LONI PIPELINE
The LONI Pipeline is a free workflow application primarily for computational scientists. With the LONI Pipeline, users can quickly create workflows that take advantage of all the greatest tools developed in various programming languages that can be applied to neuroimaging, genomics, bioinformatics, and other related data.
LONI QUALITY CONTROL
LONI Quality Control (LONI QC) is an imaging data review and assessment platform for human imaging research studies involving either one or multiple centers. LONI QC allows users to anonymously download imaging data from COVID-ARC and run a standardized quality control check via an automated preprocessing system using a number of metrics. Users then receive a detailed report of the image quality.
Principal Investigator
Dominique Duncan, PhD
Assistant Professor of Neurology, Biomedical Engineering, and Neuroscience, Laboratory of Neuro Imaging, USC Mark and Mary Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California.
Dr. Duncan has developed novel analytic tools for analyzing multimodal data, including imaging and electrophysiology. Her interests lie at the intersection of data analysis, signal processing, and machine learning, particularly applied to traumatic brain injury and epilepsy. She has led and co-led large-scale multimodal databasing projects that are linked with visualization and analytic tools, aiming to encourage collaboration across multiple fields. She has also developed virtual reality tools to optimize the process of analyzing neuroimaging data and to improve neuroscience education among K-12 students.
Postdoctoral Scholar
Marianna La Rocca, PhD
Dr. La Rocca received her PhD in Physics applied to Neuroscience from Bari University. Her research involves the use of neuroimaging techniques and computational methods to study biomarkers of epileptogenesis after traumatic brain injury using multimodal data. She has been developing and applying complex network-based quantitative methods and machine learning techniques to electrophysiology and imaging data.
Project Specialist
Rachael Garner, BA
Rachael Garner received her Bachelor of Arts in Cognitive Science from the University of California, Berkeley. She conducts multimodal data analysis on human and rodent imaging and electrophysiology data. She also works on multimodal databasing, including data curation and harmonization.
Project Assistant
Yujia Zhang, BS
Yujia received her B.S. in Chemical Engineering from Michigan State University in 2018 and her M.S. in Chemical Engineering from the University of Southern California in 2020. Her undergraduate research was focused on single-enzyme kinetics using the Atomic Force Microscope. She is currently working on multimodal data analysis and applying various machine learning methods on COVID-ARC data.
Instructor
Michael Sinclair, PhD
Michael Sinclair has been a technology teacher and coordinator at Bravo Medical Magnet High School since 1999 and has been in education since 1984. He has written and coordinated a number of grants and has co-created three academies. Since 2017, he has served as a California Department of Education-appointed Specialized Secondary Programs Mentor overseeing the implementation of programs throughout Southern California.
Instructor
Glendy Ramirez-De La Cruz, BS
Glendy is a Life Science and Career Technical Education (CTE) teacher at Bravo. Since 2012, she has had overall coordinating and management responsibility for the day-to-day operations of the STAR (Science Technology and Research) and EHA (Engineering Academy for Health) biotechnology programs at Bravo. She is the primary liaison between USC laboratories, principal investigators, and graduate student mentors for the STAR and EHA capstone class.
Alexis Bennett
Alexis is currently a fourth year at the University of Southern California pursuing her Bachelor of Science in Computational Neuroscience and her Master of Science in Biomedical Engineering with a focus in Neuroengineering. She works as a student research assistant and performs manual segmentations on human neuroimaging data to contribute to the findings of potential biomarkers of epileptogenesis after traumatic brain injury.
Azrin Khan
Azrin is an incoming freshman at the University of Southern California’s Viterbi School of Engineering majoring in Electrical and Computer Engineering. She analyzes human imaging data of patients with traumatic brain injury using an interactive software application to identify biomarkers of epileptogenesis.
Jiaju Liu
Jiaju is an incoming freshman at Stanford University, where he plans to study symbolic systems. His research interests include using signal processing and unsupervised learning methods on EEG data to detect and classify high frequency oscillations, putative biomarkers of epileptogenesis. He has developed an automated high frequency oscillation classifier that solves novel machine learning problems regarding fast oscillating data.
Noor Nouaili
Noor is a rising freshman at Yale University. She recently graduated from the Marlborough School in Los Angeles and plans to major in neuroscience and global affairs. Noor has been conducting research for the Epilepsy Bioinformatics Study for Antiepileptogenic Therapy (EpiBioS4Rx) for the past year, focusing on performing MRI brain segmentations.
Aubrey Martinez
Aubrey Martinez is currently pursuing a Bachelor of Science in Neuroscience degree at the University of Southern California. She works on the analysis of human and rodent neuroimaging data, using manual and automated segmentation methods, to identify potential biomarkers of post-traumatic epileptogenesis.
NEWS AND EVENTS
COVID-ARC Webinar for Francisco Bravo Medical Magnet High School Students and Teachers
Friday, November 13, 2020
COVID-19 Research Lightning Round: Flyer
COVID-19 Research Lightning Round: Webinar and Q&A
Wednesday, September 16, 2020
COVID-19 Research Lightning Round: Video Recording
The COVID Information Commons brings together a group of researchers studying wide-ranging aspects of the current pandemic, to share their research and answer questions from our community. The first monthly webinar included talks by the following researchers.
Erick Jones, University of Texas at Arlington
EAGER: AI-Enabled Optimization of the COVID-19 Therapeutics Supply Chain to Support Community Public Health.
Howard Stone, Princeton University
Flow Asymmetry in Human Breathing and the Asymptomatic Spreader.
Michael Pazzani, University of California San Diego
RAPID: Explainable Machine Learning for Analysis of COVID-19 Chest CT.
Ashok Srinivasan, University of West Florida
Collaborative: RAPID: Leveraging New Data Sources to Analyze the Risk of COVID-19 in Crowded Locations.
Dominique Duncan, University of Southern California
RAPID: COVID-ARC (COVID-19 Data Archive).
Debbie Kim, University of Chicago
RAPID: Pandemic Learning Loss in U.S. High Schools: A National Examination of Student Experiences.
Nora Garza, Laredo College
RAPID: Using real life COVID-19 Data to teach quantitative reasoning skills to undergraduate Hispanic STEM students.
Ajitesh Srivastava, University of Southern California
RAPID: ReCOVER: Accurate Predictions and Resource Allocation for COVID-19 Epidemic Response.
Scientists launch data archive to bolster research on COVID-19
August 19th, 2020
FAQ
How do the centralized and federated storage models differ?
Data providers who choose to store their data under a centralized model will transfer their data to be stored at the USC Stevens Neuroimaging and Informatics Institute. Under the federated model, data providers will keep their data stored locally at their sites. If a COVID-ARC user is granted permission by the data provider to access any of the federated data, the data will be transferred to the user from the data provider’s home institution’s database. COVID-ARC’s central server will only have a description of the federated data but will not store the data.
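The difference between the two models can be sketched in a few lines of Python. This is only an illustration of the routing described above: the class, field, and function names are hypothetical and do not reflect COVID-ARC's actual software.

```python
from dataclasses import dataclass

# Where centralized data are stored, per the description above.
CENTRAL_SITE = "USC Stevens Neuroimaging and Informatics Institute"


@dataclass
class DatasetEntry:
    """Catalog entry held by the central server (hypothetical structure)."""
    name: str
    model: str              # "centralized" or "federated"
    provider_site: str      # home institution of the data provider
    access_granted: bool = False  # set by the provider for federated data


def resolve_source(entry):
    """Return the site that would serve the data for this entry."""
    if entry.model == "centralized":
        return CENTRAL_SITE
    if entry.model == "federated":
        # The central server holds only a description of federated data;
        # the data come from the provider, and only with permission.
        if not entry.access_granted:
            raise PermissionError(f"no permission from provider for {entry.name}")
        return entry.provider_site
    raise ValueError(f"unknown storage model: {entry.model}")


source = resolve_source(DatasetEntry("example-ct", "federated", "Site B", True))
```

In this sketch the central catalog never stores federated data itself; it only records where the data live and whether the provider has approved access.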
How can I transfer data to COVID-ARC?
Data can be securely imported to COVID-ARC through the file transfer client Aspera. Since Aspera is a general file transfer client, any data format is transferable. More information can be found at https://asperasoft.com/software/client-options/desktop-client/. An Aspera tutorial is also available.
What happens if a network connection is lost during data transfer?
Both COVID-ARC and Aspera have tools in place to minimize the burden of lost connections during data upload. Aspera pauses the transfer and can resume once the network connection is reestablished.
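The general idea of resuming an interrupted transfer can be illustrated with a local file copy in Python. This is not Aspera's protocol or API, which are proprietary; the function name is hypothetical, and the sketch assumes the bytes already at the destination are a correct prefix of the source file.

```python
import os


def resume_copy(src, dst, chunk_size=1 << 20):
    """Copy src to dst, continuing from however many bytes dst already has.

    Returns the byte offset the copy resumed from. Assumes any existing
    content in dst is an intact prefix of src (no integrity check here).
    """
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)  # skip the part that was already transferred
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
    return offset
```

A production transfer tool would also verify the partial file (for example with checksums) before appending to it; that step is omitted here for brevity.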
How long will data transfers take?
File transfer time depends largely on the uploader’s network connection. Aspera uses a proprietary data transfer protocol developed by IBM that makes it faster than traditional file transfer clients.
After COVID-ARC’s funding period concludes, what will happen with the archive?
COVID-ARC is committed to the security and persistence of data shared by COVID-ARC researchers and data providers. All data uploaded to COVID-ARC will remain securely stored at the USC Stevens Neuroimaging and Informatics Institute.
What types of projects would be most suited for federated storage?
The federated model can accommodate the data needs of any project, but it is particularly effective for large projects that collect data continuously over long periods of time. That way, data can be queried as they are collected without the need for team members to upload large datasets repeatedly.
COVID-ARC is powered by:
Laboratory of Neuro Imaging
USC Stevens Neuroimaging and Informatics Institute
Keck School of Medicine of USC
University of Southern California
2025 Zonal Avenue
Los Angeles, CA 90033
Dominique Duncan, Principal Investigator
dduncan@loni.usc.edu
Tel: (323) 865-1754
Fax: (323) 442-0137