Research Data Management

Analysis Ready Datasets

Analysis-ready datasets have been responsibly collected and reviewed so that analysis of the data yields clear, consistent, and error-free results to the greatest extent possible. When working on a research project, take steps to ensure that your data is safe, authentic, and usable.

Because raw data is often messy, data management aims to clean it before analysis. The following concepts will help you prepare analysis-ready datasets.

Creation: Working with two-dimensional data

Spreadsheets store, view, analyze, and alter data in two dimensions: rows and columns. Best practices for creating datasets (illustrated in the sketch after this list):

  • All data should be labeled
  • Each experimental subject should have a unique study ID
  • Data should be in rectangular format (flat files)
  • Rows should represent the appropriate unit of analysis
  • Columns should represent the unique attributes of the rows
  • Data files should contain the same number of columns in each row. Problems arise when data are missing in the middle of a row
  • Data should be atomic within each column; distinct pieces of information should not be combined into a single column
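
The sketch below illustrates these practices in Python using pandas; the column names and values (study_id, age_years, site) are hypothetical placeholders for your own study variables.

```python
import pandas as pd

# Each row is one experimental subject; each column is one atomic attribute.
subjects = pd.DataFrame({
    "study_id": ["S001", "S002", "S003"],  # hypothetical unique study IDs
    "age_years": [34, 41, 29],
    "site": ["A", "B", "A"],
})

# The table is rectangular (every row has the same columns) and each
# study ID identifies exactly one subject.
assert subjects["study_id"].is_unique
print(subjects.shape)  # (3, 3): rows x columns
```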
Formatting: Tidy data principles

Tidy data has become a standard format in the sciences because it lets people easily turn a data table into graphs, analyses, and insight. Dr. Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University, coined the term “tidy data” to minimize the effort involved in preparing data for visualization and statistical modeling.

A “tidy” dataset has the following structure (illustrated in the sketch after this list):

  • Each variable forms a column
  • Each observation forms a row
  • Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
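
As an illustration, the following Python/pandas sketch reshapes a hypothetical “wide” table, with one weight column per visit, into a tidy layout in which each variable forms a column and each observation (a subject visit) forms a row. The column names are placeholders.

```python
import pandas as pd

# "Wide" layout: the visit number is buried in the column names.
wide = pd.DataFrame({
    "study_id": ["S001", "S002"],
    "visit_1_weight_kg": [70.2, 65.8],
    "visit_2_weight_kg": [69.5, 66.1],
})

# Reshape so each variable forms a column and each observation
# (one subject visit) forms a row.
tidy = wide.melt(id_vars="study_id", var_name="visit", value_name="weight_kg")
print(tidy)
```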
Validation: Review your data

Validation helps ensure that data is collected correctly. Best practices for validating your datasets (see the sketch after this list):

  • Program valid ranges for inputting data into fields when applicable.
  • Apply data formatting to fields in advance to prevent the risk of inaccurate "automatic" formatting
  • Prevent the entry of leading and/or trailing spaces or other characters that may interfere with data analysis
  • Plan for “other” data responses
  • Plan for “prefer not to answer”
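
One way to apply these checks after data are collected is sketched below in Python with pandas; the field names and the 60–200 range are illustrative assumptions, not recommended limits.

```python
import pandas as pd

responses = pd.DataFrame({
    "study_id": ["S001 ", " S002", "S003"],  # stray leading/trailing spaces
    "systolic_bp": [118, 250, 95],           # 250 falls outside the example range
})

# Remove leading/trailing whitespace that can break joins and grouping.
responses["study_id"] = responses["study_id"].str.strip()

# Flag values outside a pre-specified valid range instead of silently keeping them.
in_range = responses["systolic_bp"].between(60, 200)
print(responses[~in_range])  # rows that need review
```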
Standardization: Establish consistency

Standardization ensures that the data is internally consistent: each data element you collect has the same kind and format throughout the dataset. It also helps minimize data collection and analysis errors and prevents inconsistencies. Best practices for standardizing your data (see the sketch after this list):

  • Data should be coded harmoniously
  • Standardize free text into categorical data
  • Treat date and time consistently. Choose one date and time format and use it throughout (e.g., ISO 8601: YYYY-MM-DD)
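
As a rough illustration, the following Python/pandas sketch converts mixed date strings to ISO 8601 and maps free-text responses onto a small controlled set of categories. The column names and category codes are assumptions, and the format="mixed" option requires pandas 2.0 or later.

```python
import pandas as pd

raw = pd.DataFrame({
    "visit_date": ["03/15/2024", "2024-03-16", "16 Mar 2024"],
    "smoker": ["Yes", "y", "never"],
})

# Parse the mixed date strings (format="mixed" requires pandas >= 2.0)
# and re-serialize every value in one ISO 8601 format: YYYY-MM-DD.
raw["visit_date"] = (
    pd.to_datetime(raw["visit_date"], format="mixed").dt.strftime("%Y-%m-%d")
)

# Map free-text answers onto a small, controlled set of categories.
smoking_codes = {"yes": "current", "y": "current", "never": "never"}
raw["smoker"] = raw["smoker"].str.lower().map(smoking_codes)
print(raw)
```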
Cleaning: Make your data easier to work with

Before performing analysis of your data, review the datasets for inaccuracies, inconsistencies, or sensitive data. Cleaning your data allows you to identify outliers or errors before you compile your results. Best practices for cleaning your data (see the sketch after this list):

  • Check for outliers. Ensure all data elements are in the correct formats and ranges.
  • Check for missing data. Ensure there are no missing data items or records that create null elements, and code missing data appropriately.
  • Ensure that your data does not contain Protected Health Information (PHI). HIPAA requires that researchers protect the privacy and confidentiality of their patients. No individually identifiable health information should be included in your datasets.
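
The sketch below illustrates these checks in Python with pandas: recoding a legacy missing-value sentinel, counting missing values, and applying a simple range check for outliers. The variable names, the -999 sentinel, and the 30–250 kg range are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "study_id": ["S001", "S002", "S003"],
    "weight_kg": [70.2, -999.0, None],  # -999 is a hypothetical legacy missing code
})

# Recode the legacy sentinel as a proper missing value, then review missingness.
data["weight_kg"] = data["weight_kg"].replace(-999.0, np.nan)
print(data.isna().sum())  # missing values per column

# Simple range check to surface impossible values and outliers before analysis.
outliers = data[(data["weight_kg"] < 30) | (data["weight_kg"] > 250)]
print(outliers)
```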
Documentation: Ensure dataset metadata

Before analyzing or sharing your data, ensure that you have appropriate documentation. Appropriate documentation facilitates the understanding, analysis, sharing, and reuse of your data. Best practices for documenting your data (see the sketch after this list):

  • Data should be stored with appropriate metadata
  • Create and use a data dictionary and README files
  • Save data as machine-readable ASCII or Unicode files
  • Adopt appropriate file naming practices to accommodate multiple versions of data files
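
As a minimal sketch, the following Python snippet writes a small data dictionary to a plain-text CSV file; the variable names, types, and descriptions are placeholders for your own study's metadata.

```python
import csv

# Hypothetical variables; replace with your own study's fields.
data_dictionary = [
    {"variable": "study_id", "type": "string",
     "description": "Unique subject identifier"},
    {"variable": "visit_date", "type": "date (ISO 8601)",
     "description": "Date of the study visit"},
    {"variable": "weight_kg", "type": "numeric",
     "description": "Body weight in kilograms"},
]

# Save the dictionary as a plain-text CSV so it stays machine readable
# alongside the dataset it describes.
with open("data_dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["variable", "type", "description"])
    writer.writeheader()
    writer.writerows(data_dictionary)
```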