Dealing with missing data and outliers in a clinical trial

Sep 26, 2023

Missing data and outliers are common challenges in clinical statistical programming. Both missing data and outliers can affect the quality and validity of the statistical analysis and inference, and therefore, it is essential to identify, report, and handle them appropriately.

This article provides an overview of best practices for dealing with missing data and outliers in clinical statistical programming with CDISC SDTM and ADaM. The article covers the following topics:

Identification and reporting of missing data and outliers
Handling of missing data and outliers in ADaM datasets
Documentation of the rationale and logic for handling missing data and outliers

Missing data refers to the absence of values for certain variables or observations in a dataset. Missing data can occur for various reasons, such as non-response, measurement error, data entry error, data processing error, or data censoring. Because missing data can affect the quality and validity of the statistical analysis and inference, it is essential to identify, report, and handle it appropriately.

Outliers are values that are significantly different from the rest of the data. Outliers can be caused by genuine variation, measurement error, data entry error, or data processing error. Outliers can also affect the accuracy and reliability of the statistical analysis and inference, and therefore, it is important to detect, report, and handle outliers properly.

One of the principles of ADaM is that datasets should be “one proc away”, which means that any analysis result should be reproducible by applying a single SAS procedure to an ADaM dataset. This principle implies that ADaM datasets should contain all the necessary information and variables to perform the planned analysis without further manipulation or transformation.

Therefore, when dealing with missing data and outliers in CDISC SDTM and ADaM datasets, it is recommended to follow these steps:

Identify

Identify the source and type of missing data and outliers in the raw or SDTM datasets. For example, use descriptive statistics, graphical methods, or data quality checks to detect missing values or extreme values in the data. Missing data can arise from various reasons, such as incomplete data collection, non-response, measurement errors, or data processing errors. Outliers can be caused by natural variability, errors in data entry or measurement, or unusual events. Some common methods to identify missing data and outliers are:
Descriptive statistics: Use summary measures such as mean, median, standard deviation, minimum, maximum, frequency, and percentage to describe the distribution of the data and detect any anomalies or gaps. For example, use PROC MEANS or PROC UNIVARIATE in SAS to calculate these statistics.
Graphical methods: Use visual tools such as histograms, boxplots, scatterplots, or heatmaps to explore the shape and spread of the data and identify any outliers or clusters. For example, use PROC SGPLOT or PROC SGSCATTER in SAS to create these plots.
Data quality checks: Use logical rules or criteria to validate the accuracy and consistency of the data and flag any errors or discrepancies. For example, use PROC FREQ or PROC SQL in SAS to check for invalid values, duplicates, or mismatches.
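The checks listed above can be sketched in a few lines. The article points to SAS procedures for this step; the sketch below uses plain Python instead, and the records, the `USUBJID`/`VISIT`/`SYSBP` layout, and the 60 to 250 plausibility range are all made-up assumptions for illustration, not values from any study.

```python
import statistics

# Hypothetical vital-signs records (illustrative only).
# None marks a missing systolic blood pressure value.
records = [
    {"USUBJID": "001", "VISIT": "WEEK 1", "SYSBP": 120},
    {"USUBJID": "002", "VISIT": "WEEK 1", "SYSBP": None},
    {"USUBJID": "003", "VISIT": "WEEK 1", "SYSBP": 118},
    {"USUBJID": "004", "VISIT": "WEEK 1", "SYSBP": 1250},  # likely a data entry error
    {"USUBJID": "005", "VISIT": "WEEK 1", "SYSBP": 131},
    {"USUBJID": "005", "VISIT": "WEEK 1", "SYSBP": 131},   # duplicate record
]


def describe(values):
    """Descriptive statistics computed on the non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "nmiss": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


def out_of_range(rows, var, low, high):
    """Data quality check: non-missing values outside an assumed plausible range."""
    return [r for r in rows if r[var] is not None and not (low <= r[var] <= high)]


def duplicate_keys(rows, keys):
    """Data quality check: key combinations that occur more than once."""
    seen, dups = set(), set()
    for r in rows:
        key = tuple(r[name] for name in keys)
        if key in seen:
            dups.add(key)
        seen.add(key)
    return sorted(dups)


summary = describe([r["SYSBP"] for r in records])
suspect = out_of_range(records, "SYSBP", 60, 250)
dups = duplicate_keys(records, ["USUBJID", "VISIT"])

print(summary)
print("Out of range:", [r["USUBJID"] for r in suspect])
print("Duplicate keys:", dups)
```

Here the large gap between the mean and the median is itself a hint that an extreme value is present, which the range check then pinpoints.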

Report

Report the frequency and distribution of missing data and outliers in the raw or SDTM datasets. For example, use summary tables or listings to show the number and percentage of missing values or outliers for each variable or observation in the data. Additionally, explain the reasons for missing or outlying values, such as adverse events, protocol deviations, measurement errors, or data entry errors. Use appropriate methods to handle missing data and outliers, such as imputation, deletion, transformation, or robust analysis techniques. Refer to the CDISC guidelines and the FDA guidance for best practices and recommendations on dealing with missing data and outliers in clinical trials.
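As a rough illustration of such a summary table, the sketch below counts missing values per variable and prints the number and percentage. It is written in plain Python rather than SAS, and the lab records and the variable names `ALT` and `AST` are hypothetical.

```python
# Hypothetical lab records (illustrative only); None marks a missing value.
rows = [
    {"USUBJID": "001", "ALT": 32.0, "AST": 28.0},
    {"USUBJID": "002", "ALT": None, "AST": 30.0},
    {"USUBJID": "003", "ALT": 41.0, "AST": None},
    {"USUBJID": "004", "ALT": None, "AST": 25.0},
]


def missing_summary(rows, variables):
    """Return one summary row per variable: N, number missing, percent missing."""
    total = len(rows)
    table = []
    for var in variables:
        nmiss = sum(1 for r in rows if r.get(var) is None)
        pct = 100.0 * nmiss / total if total else 0.0
        table.append(
            {"variable": var, "n": total, "nmiss": nmiss, "pct_missing": round(pct, 1)}
        )
    return table


report = missing_summary(rows, ["ALT", "AST"])
for line in report:
    print(
        f"{line['variable']:<8} N={line['n']:>3}  "
        f"missing={line['nmiss']:>3} ({line['pct_missing']:.1f}%)"
    )
```

The same function could be pointed at an outlier flag instead of a missing-value check to produce the matching outlier table.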

Handle

Handle missing data and outliers in the ADaM datasets according to the analysis plan and statistical methods. For example, use imputation methods, exclusion criteria, transformation methods, or robust methods to deal with missing values or outliers in the analysis variables or parameters. Imputation methods replace missing data with estimated values based on other available information, such as the mean, median, or most frequent value of the variable. Exclusion criteria define the reasons for which potential study participants or observations are to be excluded from the analysis, such as ethical considerations, practical issues, or confounding factors. Transformation methods apply mathematical functions to the data to reduce skewness, heteroscedasticity, or non-linearity, such as taking logarithms, square roots, or reciprocals. Robust methods are statistical techniques that are less sensitive to outliers or deviations from model assumptions, such as iteratively reweighted least squares, median regression, or regression with a Huber loss function.
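To make the idea of a robust method concrete, here is a minimal sketch of a Huber M-estimate of location, computed by iteratively reweighted averaging. The tuning constant 1.345 and the MAD-based scale are conventional choices assumed here rather than anything the analysis plan would dictate, and the data values are invented.

```python
import statistics


def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location with a fixed, MAD-based scale.

    Values within c scaled units of the current estimate get full weight;
    values further away are down-weighted in proportion to their distance.
    """
    mu = statistics.median(values)
    mad = statistics.median([abs(v - mu) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return mu  # more than half the values are identical; keep the median
    for _ in range(max_iter):
        weights = []
        for v in values:
            r = abs(v - mu) / scale
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Hypothetical analysis values with one extreme observation.
values = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 19.7]

plain_mean = statistics.mean(values)
robust = huber_location(values)
print(f"mean = {plain_mean:.2f}, Huber estimate = {robust:.2f}")
```

The ordinary mean is pulled toward the extreme value, while the Huber estimate stays close to the bulk of the data without discarding any observation.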

Document

Document the rationale and logic for handling missing data and outliers in the ADaM datasets. For example, use metadata files or define documents to explain how missing values or outliers were identified, reported, and handled in the derivation of analysis variables or parameters. Additionally, provide clear and consistent rules for imputing missing data or excluding outliers, and justify the choice of imputation method or exclusion criterion based on the study design, analysis objectives, and statistical assumptions. Also, describe the impact of missing data or outliers on the analysis results and sensitivity tests, if any.

Missing data and outliers can affect the validity and reliability of statistical analysis, and therefore need to be handled appropriately in the ADaM datasets. One way to deal with missing data is to use static imputation, which replaces the missing values with a fixed value, such as the mean, median, or mode of the variable. This method is simple and easy to implement, but it can introduce bias and reduce variability in the data. Another way to deal with missing data is to use dynamic imputation, which replaces the missing values with values predicted from other variables, using methods such as k-nearest neighbors (KNN). This method can preserve the distribution and relationships in the data, but it can also introduce noise and uncertainty in the imputed values.
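
The difference between the two approaches can be sketched as follows. This is a simplified illustration in plain Python: the records are invented, `AGE` is assumed to be a complete predictor, `k=2` is an arbitrary choice, and the flag name `IMPFL` is a hypothetical variable used here only to mark imputed values.

```python
import statistics

# Hypothetical records: AGE is complete, AVAL (analysis value) has gaps.
rows = [
    {"USUBJID": "001", "AGE": 34, "AVAL": 5.1},
    {"USUBJID": "002", "AGE": 36, "AVAL": 5.3},
    {"USUBJID": "003", "AGE": 35, "AVAL": None},
    {"USUBJID": "004", "AGE": 61, "AVAL": 7.9},
    {"USUBJID": "005", "AGE": 63, "AVAL": 8.1},
    {"USUBJID": "006", "AGE": 62, "AVAL": None},
]


def impute_static(rows, var):
    """Static imputation: every gap receives the same value (the observed median)."""
    fill = statistics.median([r[var] for r in rows if r[var] is not None])
    out = []
    for r in rows:
        new = dict(r)
        new["IMPFL"] = "Y" if r[var] is None else "N"  # hypothetical flag name
        if r[var] is None:
            new[var] = fill
        out.append(new)
    return out


def impute_knn(rows, var, predictor, k=2):
    """Dynamic imputation: each gap receives the mean of the k records
    whose predictor value is closest to that of the incomplete record."""
    donors = [r for r in rows if r[var] is not None]
    out = []
    for r in rows:
        new = dict(r)
        new["IMPFL"] = "Y" if r[var] is None else "N"  # hypothetical flag name
        if r[var] is None:
            nearest = sorted(donors, key=lambda d: abs(d[predictor] - r[predictor]))[:k]
            new[var] = statistics.mean([d[var] for d in nearest])
        out.append(new)
    return out


static = impute_static(rows, "AVAL")
knn = impute_knn(rows, "AVAL", "AGE", k=2)

for s, d in zip(static, knn):
    if s["IMPFL"] == "Y":
        print(s["USUBJID"], "static:", round(s["AVAL"], 2), "knn:", round(d["AVAL"], 2))
```

Static imputation gives both incomplete subjects the same value, whereas the nearest-neighbor version gives each one a value in line with subjects of similar age. Keeping a flag alongside the imputed value makes it possible to trace which values were observed and which were derived.
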

Outliers are extreme or unusual values that deviate significantly from the rest of the data. They can be caused by measurement errors, data entry errors, or natural variation. Outliers can distort the summary statistics and influence the estimation and testing of parameters. Therefore, it is important to identify and report outliers in the ADaM datasets, and decide whether to keep them, remove them, or adjust them. One way to identify outliers is to use graphical methods, such as boxplots or scatterplots, which can visually display the distribution and dispersion of the data. Another way is to use numerical methods, such as z-scores, which measure how many standard deviations a value lies from the mean, or the interquartile range, which flags values that fall far outside the middle 50% of the data.
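Both numerical rules can be sketched with the Python standard library. The cutoffs used below (an absolute z-score above 3, and 1.5 times the interquartile range beyond the quartiles) are common conventions assumed for illustration, and the data values are invented.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical analysis values with one extreme observation.
values = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1, 4.9, 5.0, 19.7]

print("z-score rule (3.0):", zscore_outliers(values))
print("z-score rule (2.5):", zscore_outliers(values, threshold=2.5))
print("IQR rule:", iqr_outliers(values))
```

With only ten observations, the extreme value inflates the standard deviation so much that its own z-score stays below 3, so the z-score rule at that cutoff flags nothing while the IQR rule does flag it. This is one reason to pre-specify the detection rule and to check it against the sample size.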

By following these steps, one can ensure that missing data and outliers are dealt with in a transparent and consistent manner in CDISC SDTM and ADaM programming. This will help to maintain the quality and integrity of the statistical analysis and reporting of clinical trial results.

Written by Christian Baghai
