Research Data Management

Analysis Ready Datasets

Analysis-ready datasets have been responsibly collected and reviewed so that analysis of the data yields clear, consistent, and error-free results to the greatest extent possible. When working on a research project, take steps to ensure that your data is safe, authentic, and usable.

Because raw data is often messy, data management aims to clean it before analysis. The following concepts will help you prepare analysis-ready datasets.

Creation: Working with two-dimensional data

Spreadsheets store, view, analyze, and alter data in two dimensions: rows and columns. Best practices for creating datasets (illustrated in the sketch after this list):

  • All data should be labeled
  • Each experimental subject should have a unique study ID
  • Data should be in rectangular format (flat files)
  • Rows should represent the appropriate unit of analysis
  • Columns should represent the unique attributes of the rows
  • Data files should contain the same number of columns in each row. Problems arise when data are missing in the middle of a row
  • Data should be atomic within each column; distinct pieces of information should not be combined into a single column
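
The sketch below illustrates these practices in Python using pandas; the column names and values (study_id, age_years, site) are hypothetical placeholders for your own study variables.

```python
import pandas as pd

# Each row is one experimental subject; each column is one atomic attribute.
subjects = pd.DataFrame({
    "study_id": ["S001", "S002", "S003"],  # hypothetical unique study IDs
    "age_years": [34, 41, 29],
    "site": ["A", "B", "A"],
})

# The table is rectangular (every row has the same columns) and each
# study ID identifies exactly one subject.
assert subjects["study_id"].is_unique
print(subjects.shape)  # (3, 3): rows x columns
```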
Formatting: Tidy data principles

Tidy data has become a standard format in the sciences because it lets people easily turn a data table into graphs, analyses, and insight. Dr. Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University, coined the term “tidy data” to minimize the effort involved in preparing data for visualization and statistical modeling.

A “tidy” dataset has the following structure (illustrated in the sketch after this list):

  • Each variable forms a column
  • Each observation forms a row
  • Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
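
As an illustration, the following Python/pandas sketch reshapes a hypothetical “wide” table, with one weight column per visit, into a tidy layout in which each variable forms a column and each observation (a subject visit) forms a row. The column names are placeholders.

```python
import pandas as pd

# "Wide" layout: the visit number is buried in the column names.
wide = pd.DataFrame({
    "study_id": ["S001", "S002"],
    "visit_1_weight_kg": [70.2, 65.8],
    "visit_2_weight_kg": [69.5, 66.1],
})

# Reshape so each variable forms a column and each observation
# (one subject visit) forms a row.
tidy = wide.melt(id_vars="study_id", var_name="visit", value_name="weight_kg")
print(tidy)
```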
Validation: Review your data

Validation helps ensure that data is collected correctly. Best practices for validating your datasets (see the sketch after this list):

  • Program valid ranges for inputting data into fields when applicable.
  • Apply data formatting to fields in advance to prevent the risk of inaccurate "automatic" formatting
  • Prevent the entry of leading and/or trailing spaces or other characters that may interfere with data analysis
  • Plan for “other” data responses
  • Plan for “prefer not to answer”
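
One way to apply these checks after data are collected is sketched below in Python with pandas; the field names and the 60–200 range are illustrative assumptions, not recommended limits.

```python
import pandas as pd

responses = pd.DataFrame({
    "study_id": ["S001 ", " S002", "S003"],  # stray leading/trailing spaces
    "systolic_bp": [118, 250, 95],           # 250 falls outside the example range
})

# Remove leading/trailing whitespace that can break joins and grouping.
responses["study_id"] = responses["study_id"].str.strip()

# Flag values outside a pre-specified valid range instead of silently keeping them.
in_range = responses["systolic_bp"].between(60, 200)
print(responses[~in_range])  # rows that need review
```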
Standardization: Establish consistency

Standardization ensures that the data is internally consistent: each data element you collect has the same kind and format throughout the dataset. It also helps minimize data collection and analysis errors and prevents inconsistencies. Best practices for standardizing your data (see the sketch after this list):

  • Data should be coded harmoniously
  • Standardize free text into categorical data
  • Treat date and time consistently. Choose one date and time format and use it throughout (e.g., ISO 8601: YYYY-MM-DD)
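
As a rough illustration, the following Python/pandas sketch converts mixed date strings to ISO 8601 and maps free-text responses onto a small controlled set of categories. The column names and category codes are assumptions, and the format="mixed" option requires pandas 2.0 or later.

```python
import pandas as pd

raw = pd.DataFrame({
    "visit_date": ["03/15/2024", "2024-03-16", "16 Mar 2024"],
    "smoker": ["Yes", "y", "never"],
})

# Parse the mixed date strings (format="mixed" requires pandas >= 2.0)
# and re-serialize every value in one ISO 8601 format: YYYY-MM-DD.
raw["visit_date"] = (
    pd.to_datetime(raw["visit_date"], format="mixed").dt.strftime("%Y-%m-%d")
)

# Map free-text answers onto a small, controlled set of categories.
smoking_codes = {"yes": "current", "y": "current", "never": "never"}
raw["smoker"] = raw["smoker"].str.lower().map(smoking_codes)
print(raw)
```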
Cleaning: Make your data easier to work with

Before performing analysis of your data, review the datasets for inaccuracies, inconsistencies, or sensitive data. Cleaning your data allows you to identify outliers or errors before you compile your results. Best practices for cleaning your data (see the sketch after this list):

  • Check for outliers. Ensure all data elements are in the correct formats and ranges.
  • Check for missing data. Ensure there are no missing data items or records that create null elements, and code missing data appropriately.
  • Ensure that your data does not contain Protected Health Information (PHI). HIPAA requires that researchers protect the privacy and confidentiality of their patients. No individually identifiable health information should be included in your datasets.
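
The sketch below illustrates these checks in Python with pandas: recoding a legacy missing-value sentinel, counting missing values, and applying a simple range check for outliers. The variable names, the -999 sentinel, and the 30–250 kg range are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "study_id": ["S001", "S002", "S003"],
    "weight_kg": [70.2, -999.0, None],  # -999 is a hypothetical legacy missing code
})

# Recode the legacy sentinel as a proper missing value, then review missingness.
data["weight_kg"] = data["weight_kg"].replace(-999.0, np.nan)
print(data.isna().sum())  # missing values per column

# Simple range check to surface impossible values and outliers before analysis.
outliers = data[(data["weight_kg"] < 30) | (data["weight_kg"] > 250)]
print(outliers)
```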
Documentation: Ensure dataset metadata

Before analyzing or sharing your data, ensure that you have appropriate documentation. Appropriate documentation facilitates the understanding, analysis, sharing, and reuse of your data. Best practices for documenting your data (see the sketch after this list):

  • Data should be stored with appropriate metadata
  • Create and use a data dictionary and README files
  • Save data as machine-readable ASCII or Unicode files
  • Adopt appropriate file naming practices to accommodate multiple versions of data files
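
As a minimal sketch, the following Python snippet writes a small data dictionary to a plain-text CSV file; the variable names, types, and descriptions are placeholders for your own study's metadata.

```python
import csv

# Hypothetical variables; replace with your own study's fields.
data_dictionary = [
    {"variable": "study_id", "type": "string",
     "description": "Unique subject identifier"},
    {"variable": "visit_date", "type": "date (ISO 8601)",
     "description": "Date of the study visit"},
    {"variable": "weight_kg", "type": "numeric",
     "description": "Body weight in kilograms"},
]

# Save the dictionary as a plain-text CSV so it stays machine readable
# alongside the dataset it describes.
with open("data_dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["variable", "type", "description"])
    writer.writeheader()
    writer.writerows(data_dictionary)
```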