Best Practices

Data Collections Development

Accepted Data

The DDR accepts engineering datasets generated through simulation, hybrid simulation, experimental, and field research methods regarding the impacts of wind, earthquake, and storm surge hazards, as well as debris management, fire, and blast explosions. We also accept social and behavioral sciences (SBE) data encompassing the study of the human dimensions of hazards and disasters. As the field and the expertise of the community evolve, we have expanded our focus to include datasets related to COVID-19. Data reports, Jupyter notebooks, code, scripts, lectures, and learning materials are also accepted.

Accepted and Recommended File Formats

Due to the diversity of data and instruments used by our community, there are no current restrictions on the file formats users can upload to the DDR. However, for long-term preservation and interoperability purposes, we recommend and promote storing and publishing data in open formats, and we follow the Library of Congress Recommended Formats.

In addition, we suggest that users look into the Data Curation Primers, which are "peer-reviewed, living documents that detail a specific subject, disciplinary area or curation task and that can be used as a reference to curate research data." The primers include curation practices for documenting data types that, while not open or recommended formats, are well established in the academic fields surrounding natural hazards research, such as Matlab and Microsoft Excel.

Below is an adaptation of the list of recommended formats for data and documentation by Stanford Libraries. For those available, we include a link to the curation primers:

Data Size

Currently we do not impose restrictions on the volume of data users upload to and publish in the DDR. This is meant to accommodate the vast amount of data researchers in the natural hazards community can generate, especially during the course of large-scale research projects.

However, for data curation and publication purposes users need to consider the sizes of their data for its proper reuse. Publishing large amounts of data requires more curation work (organizing and describing) so that other users can understand the structure and contents of the dataset. In addition, downloading very large projects may require the use of Globus. We further discuss data selection and quality considerations in the Data Curation section.


Data Curation

Data curation involves the organization, description, quality control, preservation, accessibility, and ease of reuse of data, with the goal of making your data publication FAIR with assurance that it will be useful for generations to come.

Extensive support for data curation can be found in the Data Curation and Publication User Guides and in the Data Curation Tutorials. In addition, we strongly recommend that users follow the step-by-step onboarding instructions available in the My Projects curation interface. Virtual Office Hours are also available twice a week.

Below we highlight general curation best practices.

Managing and Sharing Data in My Projects

All data and documentation collected and generated during a research project can be uploaded to My Project from the inception of the project. Within My Project, data are kept private for sharing amongst team members and for curation until published. Using My Project to share data with your team members during the course of research facilitates the progressive curation of data and its eventual publication.

However, when conducting human subjects research, you must follow and comply with the procedures submitted to and approved by your Institutional Review Board (IRB), as well as your own ethical commitments to participants, when sharing protected data in My Project.

Researchers working at an NHERI EF will receive their bulk data files directly into an existing My Project created for the team.

For all other research performed at a non-NHERI facility, it will be the responsibility of the research team to upload their data to the DDR.

There are different ways to upload data to My Project:

  • Do not upload folders and files with special characters in their filenames. In general, keep filenames meaningful but short and without spaces. See file naming convention recommendations in the Data Organization and Description section.
  • Select the Add button, then File upload to begin uploading data from your local machine. You can browse and select files or drag and drop files into the window that appears.
  • Connect to your favorite cloud storage provider. We currently support integration with Box, Dropbox, and Google Drive.
  • You can also copy data to and from My Data.
  • You may consider zipping files for purposes of uploading; however, you should unzip them for curation and publication purposes.
  • For uploads of files larger than 2 GB or more than 25 files, consider using Globus, Cyberduck, or command line utilities. Explanations of how to use these applications are available in our Data Transfer Guide.

Downloading several individual files via our web interface could be cumbersome, so DesignSafe offers a number of alternatives. First, users may interact with data in the Workspace using any of the available tools and applications without the need to download; for this, users will need a DesignSafe account. Users needing to download a large number of files from a project may also use Globus. When feasible, to facilitate data download from their projects users may consider aggregating data into larger files.

Be aware that while you may store all of a project's files in My Project, you may not need to publish all of them. During curation and publication you will have the option to select a subset of the uploaded files that you wish to publish without the need to delete the rest.

More information about the different Workspaces in DesignSafe and how to manage data from one to the other can be found here.

Selecting a Project Type

Depending on the research method pursued, users may curate and publish data as an "Experimental," "Simulation," "Hybrid Simulation," or "Field Research" project type. The Field Research project type accommodates "Interdisciplinary Datasets" involving engineering and/or social science collections.

Based on data models designed by experts in the field, the different project types provide interactive tools and metadata forms to curate the dataset so it is complete and understandable for others to reuse. For example, users who want to publish a simulation dataset will have to include files and information about the model or software used, the input and output files, and add a readme file or a data report.

Users should select the project type that best fits their research method and dataset. If the data does not fit any of the above project types, they can select project type "Other." In project type "Other," users can curate and publish standalone reports, learning materials, white papers, conference proceedings, tools, scripts, or data that does not fit the research models mentioned above.

Working in My Project

Once the project type is selected, the interactive interface in My Project will guide users through the curation and publication steps with detailed onboarding instructions.

My Project is a space where users can work during the process of curation and publication and after publication to publish new data products or to analyze their data.

Because My Project is a shared space, it is recommended that teams select a data manager to coordinate file organization, transfers, curation, naming, etc.

After data is published, users can still work in My Project to progressively publish new experiments, missions, or simulations within the project, and to version, edit, or amend the existing publication. See amends and versions in this document.

General Research Data Best Practices

Below we include general research data best practices, but we strongly recommend reviewing the available Data Curation Primers for more specific directions on how to document and organize specific research data types.

Proprietary Formats

Excel and Matlab are two proprietary file formats widely used in this community. Instead of Excel spreadsheet files, it is best to publish data as plain CSV so it can be used by different software. However, we understand that in some cases (e.g., Matlab, Excel) conversion may distort the data structures. Always retain an original copy of any structured data before attempting conversions, and then check the two against each other for fidelity. In addition, in the DDR it is possible to upload and publish both the proprietary and the converted version, especially if you consider that publishing the proprietary format is convenient for data reuse.
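A fidelity check of this kind can be sketched with Python's standard csv module. The helper below round-trips a small table through a CSV file and compares the result with the original; the table contents are hypothetical, and this is only an illustration of the check, not a DDR tool. Note that values come back as strings, which is exactly the kind of distortion to watch for.

```python
import csv
import tempfile
from pathlib import Path

def roundtrip_check(rows):
    """Write rows to a CSV file, read them back, and report whether they match.

    csv returns every cell as text, so numeric cells are compared by their
    string representation -- a common source of conversion distortion.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "export.csv"
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(rows)
        with path.open(newline="") as f:
            readback = [row for row in csv.reader(f)]
    original = [[str(cell) for cell in row] for row in rows]
    return readback == original

# A small table mimicking a spreadsheet export (hypothetical data)
rows = [["sensor", "peak_accel_g"], ["A1", 0.42], ["A2", 0.57]]
print(roundtrip_check(rows))  # True when the text form survives intact
```

For real Excel or Matlab files, an export step (e.g., from the originating application) precedes this comparison; the principle of checking the converted copy against the original remains the same.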

Compressed Data

Users who upload data as a zip file should unzip it before curating and publishing, as zip files prevent others from directly viewing and understanding the published data. If you upload compressed files to My Data, you can unzip them using the extraction utility available in the workspace before copying the data to My Project for curation and publication.
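As a sketch, the extraction step can also be done programmatically with Python's standard zipfile module, for example inside a Jupyter session; the archive and filenames below are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

def unzip_for_curation(archive, dest):
    """Extract a zip archive so individual files are viewable after publication."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    # Return the extracted file names for a quick sanity check
    return sorted(p.name for p in dest.rglob("*") if p.is_file())

# Self-contained demo: build a small archive, then extract it
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "upload.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("readme.txt", "dataset notes")
        zf.writestr("data.csv", "t,accel\n0,0.1\n")
    print(unzip_for_curation(archive, Path(tmp) / "unzipped"))
    # ['data.csv', 'readme.txt']
```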

Simulation Data

In the Data Depot's Published directory there is a best practices document for publishing simulation datasets in DesignSafe. The topics addressed reflect the numerical modeling community needs and recommendations, and are informed by the experience of the working group members and the larger DesignSafe expert cohort while conducting and curating simulations in the context of natural hazards research. These best practices focus on attaining published datasets with precise descriptions of the simulations’ designs, references to or access to the software involved, and complete publication of inputs and if possible all outputs. Tying these pieces together requires documentation to understand the research motivation, origin, processing, and functions of the simulation dataset in line with FAIR principles. These best practices can also be used by simulation researchers in any domain to curate and publish simulation data in any repository.

Geospatial Data

We encourage the use of open geospatial data formats. Within DS Tools and Applications we provide two open-source applications for users to share and analyze geospatial data: QGIS, which can handle most open-format datasets, and HazMapper, which can visualize geo-tagged photos and GeoJSON files. To access these applications, users need a DesignSafe account.

Because ArcGIS software is widespread in this community, the DDR allows uploading both proprietary and recommended geospatial data formats. When publishing feature and raster files, it is important to make sure that all files relevant for reuse, such as the projection file and header file, are included in the publication. For example, for shapefiles it is important to publish the .shp (the file that contains the geometry for all features), .shx (the file that indexes the geometry), and .dbf (the file that stores feature attributes in tabular format) files.
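A pre-publication check for these sibling files can be sketched with Python's standard library. The extension set reflects the shapefile components named above plus the .prj projection file; the helper itself is illustrative, not a DDR feature.

```python
import tempfile
from pathlib import Path

# Shapefile components needed for reuse: .shp, .shx, and .dbf are mandatory;
# the .prj projection file is strongly recommended.
COMPONENTS = {".shp", ".shx", ".dbf", ".prj"}

def missing_shapefile_parts(shp_path):
    """Return the component extensions missing alongside a .shp file."""
    base = Path(shp_path).with_suffix("")
    return sorted(s for s in COMPONENTS if not base.with_suffix(s).exists())

# Hypothetical example: only survey.shp and survey.shx exist on disk
with tempfile.TemporaryDirectory() as tmp:
    for ext in (".shp", ".shx"):
        (Path(tmp) / f"survey{ext}").touch()
    print(missing_shapefile_parts(Path(tmp) / "survey.shp"))
    # ['.dbf', '.prj']
```

Running such a check before publication helps catch a shapefile that would be unreadable to future users because its geometry index or attribute table never made it into the project.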

Point Cloud Data

It is highly recommended to avoid publishing proprietary point cloud data extensions. Instead, users should consider publishing post-processed data in open formats such as .las or .laz files. In addition, point cloud data publications may be very large. In DS, we have Potree available for users to view point cloud datasets. Through the Potree Converter application, non-proprietary point cloud files can be converted to a Potree-readable format to be visualized in DesignSafe.

Jupyter Notebooks

More and more researchers are publishing projects that contain Jupyter Notebooks as part of their data. They can be used to provide sample queries on the published data as well as providing digital data reports. As you plan for publishing a Jupyter Notebook, please consider the following issues:

  1. The DesignSafe publication process involves copying the contents of your project at the time of publication to a read only space within the Published projects section of the Data Depot (i.e., this directory can be accessed at NHERI-Published in JupyterHub). Any future user of your notebook will access it in the read only Published projects section. Therefore, any local path you are using while developing your notebook that is accessing a file from a private space (e.g., "MyData", "MyProjects") will need to be replaced by an absolute path to the published project. Consider this example: you are developing a notebook in PRJ-0000 located in your "MyProjects" directory and you are reading a csv file living in this project at this path: /home/jupyter/MyProjects/PRJ-0000/Foo.csv. Before publishing the notebook, you need to change the path to this csv file to /home/jupyter/NHERI-Published/PRJ-0000/Foo.csv.
  2. The published area is a read-only space. In the published section, users can run notebooks, but a notebook is not allowed to write any file to this location. If the notebook needs to write a file, you as the author should make sure the notebook is robust enough to write the file to each user's own directory. Here is an example of a published notebook that writes files to user directories. Furthermore, since the published space is read-only, a user who wants to revise, enhance, or edit the published notebook will have to copy it to their My Data and continue working on the copied version there. To ensure that users understand these limitations, we require that a readme file be published within the project explaining how future users can run and take advantage of the Jupyter Notebook.
  3. Jupyter Notebooks rely on packages that are used to develop them (e.g., numpy, geopandas, ipywidgets, CartoPy, Scikit-Learn). For preservation purposes, it is important to publish a requirements file listing all packages and their versions along with the notebook as a metadata file.
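The path change described in step 1 can be sketched as a small helper. This is an illustrative function, not a DDR API; it assumes the standard /home/jupyter mount points described above and uses the PRJ-0000 example from step 1.

```python
from pathlib import PurePosixPath

def published_path(private_path):
    """Rewrite a private My Projects path to its read-only published
    location under NHERI-Published (illustrative helper, not a DDR API).

    Example: /home/jupyter/MyProjects/PRJ-0000/Foo.csv
          -> /home/jupyter/NHERI-Published/PRJ-0000/Foo.csv
    """
    parts = list(PurePosixPath(private_path).parts)
    if "MyProjects" in parts:
        i = parts.index("MyProjects")
        # Keep the project folder (e.g. PRJ-0000) and everything below it
        return str(PurePosixPath("/home/jupyter/NHERI-Published", *parts[i + 1:]))
    return str(private_path)  # leave non-project paths untouched

print(published_path("/home/jupyter/MyProjects/PRJ-0000/Foo.csv"))
# /home/jupyter/NHERI-Published/PRJ-0000/Foo.csv
```

In practice, a simpler habit achieves the same goal: define the project root once at the top of the notebook and build all file paths from it, so only one line needs to change before publication.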

Data Organization and Description

In My Projects, users may upload files and/or create folders to keep their files organized; the latter is common when projects have numerous files. However, browsing through an extensive folder hierarchy on the web may be slower than on your local computer, so users should use the smallest number of nested folders necessary, and if possible none at all, to improve the experience for all users.

Except for project type "Other," which does not have categories, users will categorize their files or folders according to the corresponding project type. Categories describe and highlight the main components of the dataset in relation to the research method used to obtain it. Each category has a form that needs to be filled with metadata explaining the methods and characteristics of the dataset, and there are onboarding instructions on what kind of information is suitable for each metadata field. Some of these fields are required, which means they are fundamental to the clarity of the project's description. The best way to approach data curation in My Project is to organize the files in relation to the data model of choice and have the files already organized and complete before categorizing and tagging. While curating data in My Project, do not move, rename, or modify files that have already been categorized. In particular, do not make changes to categorized files through an SSH connection or through Globus. If you need to, please remove the category, deselect the files, and start over.

Within the different project types, there are different layers for describing a dataset. At the project level, it is desirable to provide an overview of the research, including its general goal and outcomes, who the audience is, and how the data can be reused. For large projects we encourage users to provide an outline of the scope and contents of the data. At the category level, the descriptions need to address the technical and methodological aspects involved in obtaining the data.

In addition, users can tag individual files or groups of files for ease of data comprehension and reuse by others. While categories are required, tagging is not, though we recommend that users tag their files because it helps others efficiently identify file contents in the publication interface. For each project type, the list of tags consists of agreed-upon terms contributed by experts in the field of NH. If the available tags do not apply, feel free to add custom tags and submit a ticket informing the curation team about the need to incorporate them into the list. We have heard from our users that the list of tags per category reminds them to include certain types of documentation in their publication.

To enhance the organization and description of project type "Other," users can group files in folders when needed and use file tags. However, it is always best to avoid deep nesting and instead use file tags and descriptions to indicate the groupings.

File naming conventions are often an important part of the work of organizing and running large-scale experimental and simulation data. File names make it possible to identify files by succinctly expressing their content and their relations to other files. File naming conventions should be established during the research planning phase; they should be meaningful (to the team and to others) and they should be kept short. When establishing a file naming convention, researchers should think about the key information elements they want to convey for others to identify files and groups of files. The meaning and components of file naming conventions should be documented in a Data Report or readme file so that others can understand and identify files as well. Note that in DesignSafe you can also describe files with tags and descriptions when you curate them. File naming conventions should not include spaces or special characters, as those may cause errors within the storage systems. See the Stanford University Libraries best practices for file naming conventions.
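As a sketch, a convention like the one described (short names, no spaces or special characters) can be checked with a few lines of Python. The allowed character set and the length cap below are illustrative choices, not DDR rules.

```python
import re

# Allow letters, digits, underscore, hyphen, and dot; disallow spaces and
# other special characters that can break storage systems and scripts.
SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")

def is_safe_filename(name, max_length=60):
    """Check a filename against a simple convention: short and free of
    spaces or special characters. The 60-character cap is an illustrative
    choice, not a DDR requirement."""
    return bool(SAFE_NAME.match(name)) and len(name) <= max_length

print(is_safe_filename("wave_basin_run03_accel.csv"))  # True
print(is_safe_filename("final results (v2).csv"))      # False: spaces, parentheses
```

Running a check like this over a project before upload catches problem names early, while renaming is still cheap and no categories point at the files yet.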

The following are good examples of data organization and description of different project types:

Project Documentation

NH datasets can be very large and complex, so we require that users submit a data report or a readme file to publish along with their data, providing information that will facilitate understanding and reuse of the project. This documentation may include the structure of the data, a data dictionary, information about where everything is, an explanation of the file naming convention used, and the methodology used to check the quality of the data. The data report in this published dataset is an excellent example of documentation.

To provide connections to different types of information about the published dataset, users can use the Related Work field. We provide different types of tags that explain their relation to the dataset. To connect to information resources that provide contextual information about the dataset (events or organizations), use the tag "context". To link to other published datasets in the DDR, use the tag "link Dataset", and to connect to published papers that cite the dataset, use the tag "is cited by". Importantly, users should add the DOI of these different information types in http format (if there is no DOI, add a URL). This information is sent to DataCite, enabling the exchange of citation counts within the broader research ecosystem through permanent identifiers. Related Works can be added after the dataset is published using the amends pipeline. This is useful when a paper citing the dataset is published after the dataset itself.

When applicable, we ask users to include information about their research funding in the Awards Info fields.

Data Quality Control

Each data publication is unique; it reflects and provides evidence of the research work of individuals and teams. Due to the specificity, complexity, and scope of the research involved in each publication, the DDR cannot complete quality checks of the contents of the data published by users. It is the user's responsibility to publish data that is up to the best standards of their profession, and our commitment is to help them achieve these standards. In the DDR, data and metadata quality policies as well as the curation and publication interactive functions are geared towards ensuring excellence in data publications. In addition, below we include general data content quality recommendations:

Before publishing, use applicable methods to review the data for errors (calibration, correction, validation, normalization, completeness checks) and document the process so that others are aware of the quality control methods employed. Include an explanation of the quality control methods you used in the data report or readme file.

Include a data dictionary or a readme file to explain the meaning of data fields.

Researchers in NH generate enormous amounts of images. While we do not restrict the number of files, to communicate their research effectively users should be selective with the images they choose to publish: make sure each image has a purpose and is illustrative of a process or function, and use file tags to describe them. The same concept applies to other data formats.

It is possible to publish raw and curated data. Raw data comes directly from the recording instruments (cameras, apps, sensors, scanners, etc.). When raw data is corrected, calibrated, reviewed, edited, or post-processed in any way for publication, it is considered curated. Some researchers want to publish their raw data as well as their curated data. If you seek to publish both, consider why it is necessary and how another researcher would use each set. Always clarify whether your data is raw or curated in the description or in a data report/readme file, including the method used to post-process it.

Managing Protected Data in the DDR

Users planning to work with human subjects should have their IRB approval in place prior to storing, curating, and publishing data in the DDR. We recommend following the CONVERGE series of check sheets that outline how researchers should manage the lifecycle of data containing personal and sensitive information; these check sheets have also been published in the DDR.

When selecting a Field Research project, users are prompted to indicate whether they will be working with human subjects. If the answer is yes, the DDR curator is automatically notified and gets in touch with the project team to discuss the nature and conditions of the data and the IRB commitments.

DesignSafe My Data and My Projects are secure spaces to store raw protected data as long as it is not under HIPAA, FERPA or FISMA regulations. If data needs to comply with these regulations, researchers must contact DDR through a help ticket to evaluate the need to use TACC's Protected Data Service. Researchers with doubts are welcome to send a ticket or join curation office hours.

Projects that do not include the study of human subjects and are not under IRB purview may still contain items with Personally Identifiable Information (PII). For example, researchers conducting field observations may capture human subjects in their documentation, including work crews, passersby, or people affected by the disaster. If camera instruments incidentally capture people in the observed areas, we recommend that their faces and any Personally Identifiable Information be anonymized/blurred before publishing. In the case of images of team members, make sure they are comfortable with making their images public. Do not include roofing/remodeling records containing any form of PII. When those are public records, researchers should point to the site from which they were obtained using the Referenced Data and/or Related Work fields. In short, users should follow all other protected data policies and best practices outlined further in this document.

Metadata Requirements

Metadata is information that describes the data in the form of schemas. Metadata schemas provide a structured way for users to share information about data with other platforms and individuals. Because there is no standard schema to describe natural hazards engineering research data, the DDR developed data models containing elements and controlled terms for categorizing and describing NH data. The terms have been identified by experts in the NH community and are continuously expanded, updated, and corrected as we gather feedback and observe how researchers use them in their publications.

So that DDR metadata can be exchanged in a standard way, we map the fields and terms to widely-used, standardized schemas. The schemas are: Dublin Core for description of the research data project, DDI (Data Documentation Initiative) for social science data, and DataCite for DOI assignment and citation. We use the PROV schema to connect the different components of multi-part data publications.

Due to variations in research methods, users may not need to use all the metadata elements available to describe their data. However, for each project type we identified a required set of elements that represents the structure of the data, is useful for discovery, and allows proper citation of the data. To ensure the quality of the publications, the system automatically checks for completeness of these core elements and whether data files are associated with them. If those elements and data are not present, the publication cannot proceed. For each project type, the metadata elements, including those that are required and recommended, are shown below.
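The kind of completeness check described above can be illustrated with a minimal sketch. The field names below are examples only, not the authoritative required set for any project type, and this is not the DDR's actual implementation.

```python
# Illustrative required fields; the real set varies by project type
# (see the per-type metadata lists below).
REQUIRED_FIELDS = ["project_title", "authors", "description", "keywords"]

def missing_required(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a metadata record."""
    return [f for f in required if not metadata.get(f)]

record = {
    "project_title": "Shake table tests",
    "authors": ["A. Smith"],
    "description": "",          # empty: will be flagged
    "keywords": ["earthquake"],
}
print(missing_required(record))  # ['description']
```

A record with an empty list returned by such a check would be allowed through; anything else blocks publication until the gaps are filled, which is the behavior the system enforces for its core elements.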

Experimental Research Project
View Metadata Dictionary
  • DOI
  • Project Title
  • Author (PIs/Team Members)*
  • Participant Institution*
  • Project Type*
  • Description
  • Publisher
  • Date of Publication
  • Licenses
  • Related Works*$
  • Award*
  • Keywords
  • Experiment*
    • Report
    • DOI
    • Experiment Title
    • Author (PIs/Team Members)*
    • Experiment Description
    • Date of Publication
    • Dates of Experiment
    • Experimental Facility
    • Experiment Type
    • Equipment Type*
    • Model Configuration*
    • Sensor Information*
    • Event*
    • Experiment Report$
  • Analysis*$
    • Analysis Title
    • Description
    • Referenced Data*
Simulation Research Project
View Metadata Dictionary
  • DOI
  • Project Title
  • Author (PIs/Team Members)*
  • Participant Institution*
  • Project Type*
  • Description
  • Publisher
  • Date of Publication
  • Licenses
  • Related Works*$
  • Award*
  • Keywords
  • Simulation*
    • Report
    • Simulation Title
    • Author (PIs/Team Members)*
    • Description
    • Simulation Type
    • Simulation Model
    • Simulation Input*
    • Simulation Output*
    • Referenced Data*
    • Simulation Report$
  • Analysis*$
    • Analysis Title
    • Description
    • Referenced Data*
Hybrid Simulation Research Project
View Metadata Dictionary
  • DOI
  • Project Title
  • Author (PIs/Team Members)*
  • Participant Institution*
  • Project Type*
  • Description
  • Publisher
  • Date of Publication
  • Licenses
  • Related Works*$
  • Award*
  • Keywords
  • Hybrid Simulation*
    • Report
    • Global Model
      • Global Model Title
      • Description
    • Master Simulation Coordinator
      • Master Simulation Coordinator Title
      • Application and Version
      • Substructure Middleware
    • Simulation Substructure*
      • Simulation Substructure Title
      • Application and Version
      • Description
    • Experiment Substructure*
      • Experiment Substructure Title
      • Description
Field Research Project
View Metadata Dictionary
  • Project Title
  • PI/Co-PI(s)*
  • Project Type
  • Description
  • Related Work(s)*$
  • Award(s)*$
  • Keywords
  • Natural Hazard Event
  • Natural Hazard Date
  • Documents Collection*$
    • Author(s)*
    • Date of Publication
    • DOI
    • Publisher
    • License(s)*
    • Referenced Data*$
    • Description
  • Mission*
    • Mission Title
    • Author(s)*
    • Date(s) of Mission
    • Mission Site Location
    • Date of Publication
    • DOI
    • Publisher
    • License(s)*
    • Mission Description
    • Research Planning Collection*$
      • Collection Title
      • Data Collector(s)*
      • Referenced Data*$
      • Collection Description
    • Social Sciences Collection*
      • Collection Title
      • Unit of Analysis$
      • Mode(s) of Collection*$
      • Sampling Approach(es)*$
      • Sample Size$
      • Date(s) of Collection
      • Data Collector(s)*
      • Collection Site Location
      • Equipment*
      • Restriction$
      • Referenced Data*$
      • Collection Description
    • Engineering/Geosciences Collection*
      • Collection Title
      • Observation Type*
      • Date(s) of Collection
      • Data Collector(s)*
      • Collection Site Location
      • Equipment*
      • Referenced Data*$
      • Collection Description
Other
View Metadata Dictionary
  • DOI
  • Project Title
  • Author(s)*
  • Data Type
  • Description
  • Publisher
  • Date of Publication
  • License(s)
  • Related Works*$
  • Award*
  • Keywords

Data Publication

Protected Data

Protected data in the Data Depot Repository (DDR) are generally (but not always) included within interdisciplinary and social science research projects that study human subjects, which always need to have approval from an Institutional Review Board (IRB). We developed a data model and onboarding instructions in coordination with our CONVERGE partners to manage this type of data within our curation and publication pipelines. Additionally, CONVERGE has a series of check sheets that outline how researchers should manage data that could contain sensitive information; these check sheets have also been published in the DDR.

Natural hazards research also encompasses data with granular geographic locations and images that may capture humans who are not the focus of the research and would not fall under the purview of an IRB. See the Privacy Policy within our Terms of Use.

Data de-identification, especially for large datasets, can be demanding. Users working with the RAPID facility may discuss pre-processing steps with them before uploading to DesignSafe.

To publish protected data researchers should pursue the following steps and requirements:

  1. Do not publish HIPAA, FERPA, FISMA, PII data or sensitive information in the DDR.
  2. To publish protected data and any related documentation (reports, planning documents, field notes, etc.) it must be processed to remove identifying information. No direct identifiers and up to three indirect identifiers are allowed. Direct identifiers include items such as participant names, participant initials, facial photographs (unless expressly authorized by participants), home addresses, phone number, email, vehicle identifiers, biometric data, names of relatives, social security numbers and dates of birth or other dates specific to individuals. Indirect identifiers are identifiers that, taken together, could be used to deduce someone’s identity. Examples of indirect identifiers include gender, household and family compositions, places of birth, or year of birth/age, ethnicity, general geographic indicators like postal code, socioeconomic data such as occupation, education, workplace or annual income.
  3. Look at the de-identification resources below to find answers for processing protected data.
  4. If a researcher has obtained consent from the subjects to publish PII (images, age, address), it should be clearly stated in the publication description and with no exceptions the IRB documentation including the informed consent statement, should be also available in the documentation.
  5. If a researcher needs to restrict public access to data because it includes HIPAA, FERPA, PII or other sensitive information, or if de-identification would render the data meaningless, it is possible to publish only metadata and descriptive information about the dataset on the landing page. The dataset will show as Restricted.
  6. IRB documentation should be included in the publication in all cases so that users clearly understand the restrictions imposed for the protected data. In addition, authors may publish the dataset instrument, provided that it does not include any form of protected information.
  7. Users interested in restricted data can contact the project PI or designated point of contact through their published email address to request access to the data and to discuss the conditions for its reuse.
  8. The responsibility of maintaining and managing a restricted dataset for the long term lies with the authors, and they can use TACC's Protected Data Services if they see fit.
  9. Please contact DDR through a help ticket or join curation office hours prior to preparing this type of publication.
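The identifier-removal requirement in step 2 can be sketched in a few lines. The column names and data below are hypothetical, and real de-identification of survey data requires case-by-case review; this is only an illustration of dropping direct-identifier columns from tabular data before publication:

```python
import csv
import io

# Hypothetical direct-identifier columns; adjust to your own dataset.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "home_address", "ssn", "date_of_birth"}

def drop_direct_identifiers(rows):
    """Remove direct-identifier fields from each row (a list of dicts)."""
    return [
        {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        for row in rows
    ]

# Hypothetical sample data standing in for an uploaded CSV file.
raw = io.StringIO(
    "name,email,age,occupation\n"
    "Jane Doe,jane@example.org,34,engineer\n"
)
rows = list(csv.DictReader(raw))
clean = drop_direct_identifiers(rows)
# Each cleaned row keeps only indirect identifiers (age and occupation here);
# remember the DDR guideline of at most three indirect identifiers.
```

Note that this handles only direct identifiers in structured columns; free-text fields, images, and combinations of indirect identifiers still need manual review.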
De-identification Resources

The NISTIR 8053 publication De-Identification of Personal Information provides all the definitions and approaches to reduce privacy risk and enable research.

NIST also maintains a resource listing de-identification tools.

Johns Hopkins Libraries Data Services Applications to Assist in De-identification of Human Subjects Research Data.

Reusing Data Resources in your Publication

Researchers frequently use data, code, papers, or reports from other sources in their experiments, simulations, and field research projects as input files, to integrate with data they create, or as references, and they want to republish them. It is good practice to make sure that this data can be appropriately reused and republished, and to give credit to the data creators. Citing reused sources is also important to provide context and provenance for the project. In the DDR you can republish or reference reused data according to the following guidelines:

  1. If you use external data in a specific experiment, mission or simulation, cite it in the Referenced Data field.
  2. Use the Related Work field at project level to include citations for the data you reused as well as your own publication related to the data reuse.
  3. Include the cited resource title and corresponding DOI in https format; this way, users will be directed to the cited resource.
  4. Be aware of the reused data's original license. The license specifies if and how you can modify, distribute, and cite the reused data.
  5. If you have reused images from other sources (online, databases, publications, etc.), be aware that they may have copyrights. We recommend using these instructions for how to use and cite them.
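The "https format" mentioned in point 3 above is simply the DOI appended to the doi.org resolver. A one-line helper (the DOI shown is a placeholder) illustrates the conversion:

```python
def doi_to_url(doi: str) -> str:
    """Return the https form of a DOI, which resolves via doi.org."""
    return f"https://doi.org/{doi}"

# Example with a placeholder DOI:
doi_to_url("10.1000/example")  # -> "https://doi.org/10.1000/example"
```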

Timely Data Publication

Although no firm timeline requirements are specified for data publishing, researchers are expected to publish in a timely manner. Recommended timelines for publishing different types of research data (i.e., Experimental, Simulation, and Reconnaissance) are listed in Table 1.

Guidelines specific to RAPID reconnaissance data can be found at rapid.designsafe-ci.org/media/filer_public/b3/82/b38231fb-21c9-41f8-b658-f516dfee87c8/rapid-designsafe_curation_guidelines_v3.pdf

Table 1. Recommended Publishing Timeline for Different Data Types

| Project/Data Type | Recommended Publishing Timeline |
| --- | --- |
| Experimental | 12 months from completion of experiment |
| Simulation | 12 months from completion of simulations |
| Reconnaissance: Immediate Post-Disaster | 3 months from returning from the field |
| Reconnaissance: Follow-up Research | 6 months from returning from the field |

Public Accessibility Delay

Some journals require that researchers submitting a paper for review include the corresponding dataset DOI in the manuscript. A data accessibility delay, or embargo, refers to a period during which a dataset has a DOI but is not made broadly accessible while the associated journal paper is under review.

On August 25, 2022, the Office of Science and Technology Policy issued a memorandum with policy guidelines to “ensure free, immediate and equitable access to federally funded research.” In the spirit of this guidance, DesignSafe has ceased offering data accessibility delays or embargoes. DesignSafe continues working with users to:

  • Provide access to reviewers via My Projects before the dataset is published. There is no DOI involved and the review is not anonymous.
  • Help users curate and publish their datasets so they are publicly available for reviewers in the best possible form.
  • Provide amends and versioning so that prompt changes can be made to data and metadata upon receiving feedback from the reviewers or at any other time.

In addition, DesignSafe Data Depot does not offer capabilities for enabling single or double blind peer review.

Licensing

Within DesignSafe, you will choose a license to distribute your material. We offer licenses with few to no restrictions; by placing fewer demands on reusers, such licenses are more effective at enabling reproducible science. Because the DesignSafe Data Depot is an open repository, the following licenses are offered:

  • For datasets: ODC-PDDL and ODC-BY
  • For copyrightable materials (for example, documents, workflows, designs, etc.): CC0 and CC-BY
  • For code: GPL

You should select appropriate licenses for your publication after identifying which license best fits your needs and institutional standards. Note that datasets are not copyrightable materials, but works such as reports, instruments, presentations and learning objects are.

Please select only one license per publication with a DOI.

Available Licenses for Publishing Datasets in DesignSafe

DATASETS

If you are publishing data, such as simulation or experimental data, choose between:

Open Data Commons Attribution
Recommended for datasets
  • You allow others to freely share, reuse, and adapt your data/database.
  • You expect to be attributed for any public use of the data/database.
Please read the License Website
Open Data Commons Public Domain Dedication
Consider and read carefully
  • You allow others to freely share, modify, and use this data/database for any purpose without any restrictions.
  • You do not expect to be attributed for it.
Please read the License Website
WORKS

If you are publishing papers, presentations, learning objects, workflows, designs, etc., choose between:

Creative Commons Attribution
Recommended for reports, instruments, learning objects, etc.
  • You allow others to freely share, reuse, and adapt your work.
  • You expect to be attributed for any public use of your work.
  • You retain your copyright.
Please read the License Website
Creative Commons Public Domain Dedication
Consider and read carefully
  • You allow others to freely share, modify, and use this work for any purpose without any restrictions.
  • You do not expect to be attributed for it.
  • You give all of your rights away.
Please read the License Website
SOFTWARE

If you are publishing community software, scripts, libraries, applications, etc., choose the following:

GNU General Public License
  • You give permission to modify, copy, and redistribute the work or any derivative version.
  • The licensee is free to choose whether or not to charge a fee for services that use this work.
  • They cannot impose further restrictions on the rights granted by this license.
Please read the License Website

Subsequent Publishing

With the exception of the Other project type, which is a one-time publication, it is possible in the DDR to publish datasets or works subsequently. A project can be conceived as an umbrella under which reports or learning materials, code, and datasets from distinct experiments, simulations, hybrid simulations, or field research missions can be published at different times, whether because they happen at different periods, involve distinct authors, or need to be released more promptly. Each new product will have its own citation and DOI, and users may select a different license if that is appropriate for the material (e.g., a user may publish a data report under a Creative Commons license and the data itself under an Open Data Commons license). The subsequent publication will be linked to the umbrella project via the citation, and to the other published products in the project through metadata.

After a first publication, users can upload more data, create a new experiment/simulation/hybrid simulation or mission, and proceed to curate it. Users should be aware that, at the moment, they cannot publish the new product through the publication pipeline on their own. After curation and before advancing through the Publish My Project button, they should submit a help ticket or attend curation office hours so that the DDR team can assist and publish the new product.

Amends and Version Control

Once a dataset is published, users can do two things to improve and/or continue their data publication: amends and version control. Amends involve correcting certain metadata fields without major changes to the existing published record, while version control includes changes to the data. Once a dataset is published, however, we do not allow title or author changes. If such changes need to be made due to omission or mistake, users have to submit a help ticket and discuss the change with the data curator. If applicable, changes will be made by the curation team.

Amends include:

  • Improving descriptions: often after the curator reviews the publication, or following versioning, users need to clarify or enhance descriptions.
  • Adding related works: when papers citing a dataset are published, we encourage users to add the references in Related Works to improve data understandability, cross-referencing, and citation counts.
  • Changing the order of authors: even though the DDR has interactive tools to set the order of authors in the publication pipeline, users sometimes require changes after publication due to oversight.

Version control includes:

  • Adding or deleting files in a published dataset.
  • Documenting the nature of the changes; these descriptions are displayed publicly on the landing page and stored as metadata so users can see what changed.
  • In the citation and landing pages, different versions of a dataset have the same DOI and different version numbers.
  • The DOI always resolves to the latest version of the data publication.
  • Users can always access previous versions through the landing page.

When implementing an amend or a version, take the following into consideration:

An amend only updates the latest version of a publication (if there is only one version, that will be the target). Only the specified fields in the metadata form are updated. The order of authors must be confirmed before the amendments can be submitted.

A version creates a new published version of a project. This pipeline allows you to select a new set of files to publish; whatever is selected in the pipeline is what will be published, nothing else. Additionally, the order of authors can be updated.

Important: Any changes to the project’s metadata will also be updated (this update is limited to the same fields allowed in the Amend section), so there is no need to amend a newly versioned project unless you have made a mistake in the latest version.

Leave Data Feedback

We welcome feedback from users about published datasets. To leave feedback, users can click the "Leave Feedback" button at the top of the data publication landing page. We suggest that feedback be written in positive, constructive language. The following are examples of feedback questions and concerns:

  • Questions about the dataset that are not answered in the published metadata and/or documentation.
  • Missing documentation.
  • Questions about the method/instruments used to generate the data.
  • Questions about data validation.
  • Doubts/concerns about data organization and/or inability to find desired files.
  • Interest in bibliography about the data/related to the data.
  • Interest in reusing the data.
  • Comments about the experience of reusing the data.
  • Request to access raw data if not published.
  • Congratulations.

Marketing Datasets

Datasets take a lot of work to produce; they are important research products. By creating a complete, organized, and clearly described publication in DDR, users are inviting others to reuse and cite their data. Researchers using published data from DDR must cite it using the DOI, which relies on the DataCite schema for accurate citation. For convenience, users can retrieve a formatted citation from the published data landing page. It is recommended to insert the citations in the reference section of the paper to facilitate citation tracking and count.
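Because DataCite DOIs support standard DOI content negotiation at doi.org, a formatted citation can also be retrieved programmatically. The sketch below only constructs the request (the DOI shown is a placeholder); performing the fetch requires network access:

```python
import urllib.request

def citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build a DOI content-negotiation request for a formatted citation."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )

req = citation_request("10.1000/example")  # placeholder DOI
# To perform the fetch (network access required):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

The `style` parameter accepts CSL citation style names; retrieving the citation from the landing page, as described above, remains the simplest route.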

When using social media or any presentation platform to communicate research, it is important to include the proper citation and DOI in presentations, emails, tweets, professional blog posts, etc. A researcher does not actually need to reuse a dataset to cite it, but may cite it to point out something about the dataset (e.g., how it was collected, its uniqueness, certain facts). This is similar to citing other papers within a literature review section.


Data Preservation

In the Data Depot Repository (DDR), data preservation is achieved through the combined efforts of the natural hazards community, which submits data and metadata following policies and best practices, and the DDR's administrative responsibilities and technical capabilities. The following data preservation best practices ensure preservation of the data from the moment researchers plan their data projects and for the long term after the data is published.

Depositing your data and associated research project materials in the DDR meets NSF requirements for data management. See our Data Management Plan.

Follow the curation and publication onboarding instructions and steps, documented in the Data Curation and Publication Guides, to ensure that your data curation and publication process is smooth and that your public datasets are well organized, complete, and understandable to others.

To facilitate long-term access to your published data, we recommend using open file formats when possible. Open file formats facilitate interoperability between datasets and with applications, which in turn supports long-term access to the datasets.

DDR data is stored on high-performance storage resources deployed at the Texas Advanced Computing Center (TACC). These storage resources are reliable, secure, monitored 24/7, and under a rigorous maintenance and update schedule.

While you are uploading and working with your data in DDR, your data is safe and geographically replicated in Corral, TACC's storage and data management resource.

DDR operates a dedicated open source Fedora 5.x digital repository. Once the dataset is curated and the user has agreed to the last step in the publication process, the data and the metadata that the user has been inputting throughout the curation process are sent to Fedora, where each published dataset contains linkages between datastreams, versions, metadata, and system metadata. At ingest, Fedora metadata records are created and publication binaries are bundled with a hash (fixity) and stored on Corral in a secure location that is recorded in the metadata (see the Fedora data model). For each individual file, Fedora generates and maintains preservation metadata in the standard PREMIS format.
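Fixity checking of this kind can be reproduced with standard tools. As a minimal illustration (not the repository's actual internal code), a SHA-256 checksum of a file can be computed at ingest and recomputed later to detect corruption:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute a SHA-256 fixity checksum by streaming the file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Store the digest alongside the file's metadata at ingest; recompute and
# compare on a schedule to verify the bytes have not changed.
```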

In the case of the DDR, file system replication is automatic. Ingestion of data from the web-visible storage into Fedora takes place under automated control when the publication workflow executes. The Fedora repository and database are likewise replicated as well as backed up on an automated schedule. Metadata preservation is assured through the backup of Fedora's metadata database. In case of a failure where data is compromised, we can restore the system from the replication.

Both the front-end copies and the Fedora repositories are in systems that implement de-clustered RAID and have sufficient redundancy to manage up to 3 drive failures for a single file stripe. The file system itself is mirrored daily between two datacenters. The primary data is also periodically backed up to a tape archive for a third copy, in a third datacenter. The database that manages metadata in Fedora is also quiesced, snapshotted, and backed up to tape on a regular automated schedule.

The underlying storage systems for the DDR are managed in-house at TACC. All the storage systems used by DesignSafe are shared multi-tenant systems, hosting many projects concurrently in addition to DesignSafe – the front-end disk system currently has ~20PB of data, with the tape archive containing roughly 80PB. These systems are operated in production by a large team of professional staff, in conjunction with TACC’s supercomputing platforms. Public user guides document the capabilities and hardware, and internal configuration management is managed via Redmine, visible only to systems staff.

This preservation environment allows us to maintain data in secure conditions at all times, before and after publication; to comply with NDSA Preservation Level 1; to attain and maintain the required representation and descriptive information about each file; and to be ready at any time to transfer custody of published data and metadata in an orderly and validated fashion. Fedora has export capabilities for transferring data and metadata to another Fedora repository or to another system.

Each published dataset has a digital object identifier (DOI) that provides a persistent link to the published data. The DOI is available in the dataset landing page, along with all the required metadata and documentation.

To learn about our commitment to data preservation, please read our Digital Preservation Policy.