SCEC DATA MANAGEMENT PLAN

This SCEC Data Management Plan (DMP) will support the Center’s mission to gather, integrate, and communicate earthquake system science research information. The Center’s data management practices are based on Open Government Data Management [1] principles designed to ensure that the Center’s data are complete, timely, and non-proprietary, and on the FAIR (Findable, Accessible, Interoperable, Reusable) [2] principles, so that the Center’s data can be accurately and appropriately found, re-used, and cited over time by both human and automated data processing systems.

A. Types of Data, Physical Collections, and Software

Based on their different data management requirements, we divide SCEC data into three broad categories: (1) administration data (AD), (2) collaboration data (CD), and (3) research computing data (RCD). While the data types and data management requirements in these categories overlap, each category has unique technical requirements that warrant special consideration. AD includes contracts, subcontracts, project budgets, and staff personnel and other human resources data. AD will be managed using USC’s Workday [3] enterprise resource management system. The variety and volume of AD will be low, estimated at less than 100 GB per year. However, the privacy and security requirements for these data are high, so SCEC will rely on the USC Workday system to provide access, security, privacy, backups, and archives for SCEC administrative data.
CD includes products generated and managed as part of SCEC collaborative research activities, including proposals, project reports, workshop presentations and reports, meeting minutes, research posters, workforce development training materials, education and outreach materials, and related data products. CD types include text, spreadsheets, images, audio files, video files, reports, and surveys. The Center will manage CD using a flexible and secure web-based content management system (CMS) to gather information, synthesize knowledge, and communicate our understanding within and outside our community. The volume of CD is estimated at less than 1 TB per year, and CD will be stored on SCEC headquarters systems.
RCD will include observational data, earth structure models (CXMs), software, verification and validation data, simulation results, machine learning training data and models, and research publications. RCD types will include text files, image files, PDF, PowerPoint, audio and video files, databases, simulation results, and descriptive metadata.
Observational data collected by the Center may include geological samples, ground motion data, DAS, GNSS, InSAR, electromagnetic images, and other observables. SCEC will not host authoritative, persistent archives for any observational data. Instead, we will partner with existing observational data management groups, including SAGE/GAGE, USGS, NHERI, OpenTopography, and NASA, which will serve as the official long-term repositories for any observational data collected. Management of SCEC community models will emphasize the assignment of persistent digital identifiers (PDIs), such as DOIs, and the registration of models in external long-term data repositories, including the IRIS Earth Model Collaboration [4], the USGS Science Data Catalog [5], and Zenodo [6]. Scientific software will be stored in public open-source repositories, including GitHub [7]. RCD not ready for migration to external long-term data repositories will be managed using publicly accessible storage operated by USC CARC [8]. SCEC’s research computing data volume will likely exceed hundreds of terabytes of temporary and intermediate data products annually.

B. Data and Metadata Standards

SCEC data and metadata standards will vary by data category. AD formats and metadata will be integrated with the USC Workday system. CD will be managed with a CMS that manages digital objects and metadata, supports organization of objects into collections, and provides tools to discover and access digital content. CD standards include document formats such as PDF, Word, Excel, and PowerPoint, and common image, video, and audio formats (PNG, JPG, MP4, etc.). CD metadata will be extensible, with default metadata attributes defined in the Project Open Data Metadata Schema v1.1+ [9]. The CMS will support adequate metadata for CD by enabling users to submit a broad range of digital objects and requiring users to annotate the data with standardized metadata. RCD formats and metadata standards for specific data types are based on the estimated number and size of data produced, the frequency of revisions of material, existing standard data and metadata exchange formats, the frequency and scale of access, and the cost of long-term storage versus the cost to regenerate. SCEC will coordinate selection of appropriate data exchange formats and mechanisms in collaboration with NSF and USGS open data communities. RCD will use self-describing, machine-readable data formats to support data access and data exchange. Specific examples of SCEC data and metadata standards include NetCDF version 4.0+ [10] for some SCEC CXM models and HDF5 [11] for sets of simulated ground motion time series. RCD metadata for each dataset will include (a) descriptive metadata, such as a name and descriptive prose; (b) administrative metadata, including preservation metadata, provenance, and authorization policy data; and (c) structural metadata that includes information about how to read and display an item.
RCD metadata will be based on the Open Data Metadata Schema [12] v1.1+, which includes metadata attributes such as description, download URL, language, identifier, format, license, data quality, and modification date. The Center’s geographic metadata standards will be based on the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata [13]. RCD metadata will include attributes that identify whether data are provisional or approved research results. Metadata will be stored both in database tables and in metadata files formatted in JSON, XML, and similar formats.
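
As an illustration only, the following minimal Python sketch shows how a single RCD metadata record of this kind could be assembled and written as JSON. The attribute names title, description, identifier, modified, language, license, dataQuality, and distribution (with downloadURL and mediaType) follow DCAT-US v1.1. The reviewStatus field, the function names, the checked-field list, and all example values are hypothetical placeholders and are not part of the schema or of any SCEC system.

```python
import json
from datetime import date

# Illustrative subset of attributes checked before a record is stored.
# This is an assumption for the sketch, not the schema's own required list.
CHECKED_FIELDS = ("title", "description", "identifier", "modified",
                  "license", "distribution")

# Hypothetical extension values marking provisional vs. approved results.
REVIEW_STATES = ("provisional", "approved")


def build_record(title, description, identifier, download_url, media_type,
                 license_url, modified, review_status="provisional"):
    """Return a metadata record as a plain dict ready for JSON serialization."""
    if review_status not in REVIEW_STATES:
        raise ValueError(f"review_status must be one of {REVIEW_STATES}")
    return {
        "title": title,
        "description": description,
        "identifier": identifier,
        "modified": modified.isoformat(),  # ISO 8601 date string
        "language": ["en-US"],
        "license": license_url,
        "dataQuality": True,
        "distribution": [
            {"downloadURL": download_url, "mediaType": media_type}
        ],
        "reviewStatus": review_status,  # hypothetical extension attribute
    }


def missing_fields(record):
    """List the checked attributes that are absent or empty in a record."""
    return [name for name in CHECKED_FIELDS if not record.get(name)]


# Placeholder values only; none of these refer to a real dataset.
record = build_record(
    title="Example community velocity model",
    description="Placeholder entry used only to illustrate the record layout.",
    identifier="https://doi.org/10.xxxx/placeholder",
    download_url="https://example.org/data/model.nc",
    media_type="application/x-netcdf",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
    modified=date(2021, 1, 1),
)

print(json.dumps(record, indent=2))
```

Keeping the record as a plain dictionary means the same structure can be written to a JSON metadata file or mapped onto database columns, consistent with the two storage options described above.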

C. Policies for Access and Sharing and Provisions for Appropriate Protection

The Center’s data management policies default to open and immediate access by the community and public. The SCEC CPC will be responsible for reviewing, approving, or modifying the SCEC data management policy defined in this plan. AD will contain personnel, financial, and private information and is not subject to the open data policy. CD management policies will emphasize discovery of and access to information on research and activities. The SCEC CMS will be used to acquire and organize collaboration data, to control data visibility and access, and to provide data and metadata registration capabilities, persistent storage for the data, search capabilities, and retrieval mechanisms. Security, privacy, and operational redundancy of CMS information will be maintained using Amazon Web Services (AWS) [14] tools and best practices for backup, monitoring, and security, as well as support from USC Information Technology Services (ITS) [15]. RCD management policies will promote the Center’s ability to publish and share data broadly with science and engineering communities. Availability of data products will be described and announced at SCEC meetings and workshops and in project-developed outreach material prepared for distribution at other professional meetings.

D. Policies and Provisions for Re-use, Re-distribution, and Derived Products

AD is confidential, so access will be limited to Center staff, and it will not be redistributed. The Center will restrict access to some CD that may contain Personally Identifiable Information (PII) that could be used to identify specific community members. CMS-based security systems will be used to ensure that only appropriate SCEC staff have access to PII in the SCEC CMS. Most CD and RCD will be released under the Creative Commons Attribution-ShareAlike 4.0 International License [16]. Project software distributions will be released under the BSD 3-clause open-source license or a similar OSI-approved [17] open-source license. Peer-reviewed RCD will be assigned PDIs and will be registered in open-access research data repositories [18] to support findability, re-distribution, and reuse.

E. Plans for Archiving and Preservation of Access

AD archives will be managed through the USC academic and business computing system that manages institutional archives, providing robust, encrypted, multi-location backups that are compliant with campus IT security practices. CD will be archived and preserved using CMS backup capabilities and the AWS S3 durable object storage service. RCD archives will use digital storage available at low cost through USC CARC, which has expertise in archiving extremely large data sets. The Center will use an archive management process, linked to the existing annual SCEC scientific planning process, that will solicit and evaluate the digital data storage needs of the SCEC community. If SCEC community data archive requests exceed available storage, the SCEC CPC will prioritize which datasets are migrated into archival storage for long-term preservation. CARC storage will be used to preserve the Center data archives for three years after the end of the project.

References

  1. OpenDataGov. Available: https://resources.data.gov/
  2. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3: 160018. https://www.nature.com/articles/sdata201618
  3. Workday. The finance, HR, and planning system for a changing world. 2021. https://www.workday.com/ 
  4. IRIS Earth Model Collaboration (EMC). https://ds.iris.edu/ds/products/emc/
  5. USGS Science Data Catalog (SDC). https://data.usgs.gov/datacatalog/
  6. European Organization for Nuclear Research & OpenAIRE. Zenodo. 2013. https://zenodo.org/
  7. GitHub. https://github.com
  8. USC Advanced Research Computing. 2021. https://carc.usc.edu/
  9. Federal Enterprise Data Resources. DCAT-US Schema v1.1 (Project Open Data Metadata Schema). 2021. https://resources.data.gov/standards/catalog/dcat-us/
  10. UCAR Community Programs. Network Common Data Form (NetCDF). 2021. https://www.unidata.ucar.edu/software/netcdf/
  11. HDF5: The HDF Group. http://www.hdfgroup.org/HDF5/
  12. DCAT-US Schema v1.1 (Project Open Data Metadata Schema). https://resources.data.gov/standards/catalog/dcat-us/
  13. Federal Geographic Data Committee. Content Standard for Digital Geospatial Metadata (CSDGM), Vers. 2. 4/26/2021. https://www.fgdc.gov/metadata/csdgm-standard
  14. Amazon Web Services (AWS). https://aws.amazon.com/
  15. USC Information Technology Services (USC ITS). https://itservices.usc.edu/
  16. Creative Commons. Attribution 4.0 International (CC BY 4.0). 8/26/2020. https://creativecommons.org/licenses/by/4.0/
  17. The 3-Clause BSD License. In: Open Source Initiative [Internet]. 2018. https://opensource.org/licenses/BSD-3-Clause
  18. Registry of Research Data Repositories. https://www.re3data.org/