Mastering the Creation of Analysis Datasets Specifications: A Crucial Step in Clinical Study Reporting

Christian Baghai
19 min read · Mar 7, 2023


Why do we create ADaM datasets?

In the field of clinical research, the creation of tables, figures and listings (TFLs) for a Clinical Study Report (CSR) is an important aspect of the reporting process. These TFLs provide a summary of the data collected during the clinical trial and help to convey the results of the study to regulatory authorities and other stakeholders. In order to create these TFLs, it is necessary to produce analysis datasets that are structured in a standard way and are analysis-ready.

Creating analysis datasets can be a complex and time-consuming process. It requires a thorough understanding of the CDISC standards, the protocol, the SAP, the CRF, and the SDTM data. Additionally, it requires collaboration between the study team, the data management team, and the statistical programming team. Therefore, it is crucial to plan and coordinate the creation of analysis datasets well in advance to avoid delays and errors.

ADaM

The pharmaceutical industry standard for producing analysis datasets is the CDISC Analysis Data Model (ADaM). The ADaM standard provides detailed specifications on how to create analysis datasets that are structured in a standardized manner, making them easy to use and interpret. These datasets are analysis-ready, meaning that they can be used to generate TFLs with minimal additional manipulation.

SDTM

The CDISC Study Data Tabulation Model (SDTM) is used as the input dataset structure for ADaM datasets. SDTM is a standard format for organizing and submitting clinical trial data to regulatory authorities. It provides a standard structure for collecting and organizing data, making it easier to compare data across different trials and to combine data from multiple sources.

ADaM dataset specifications

To create an ADaM dataset, it is necessary to follow the ADaM specifications, which provide detailed guidance on how to structure the dataset, which variables to include and how to create them. The specifications cover a wide range of topics, including:

  • How to define the dataset structure: This includes information on the primary and secondary endpoints of the study, as well as any subgroup analyses that may be required.
  • How to create the analysis variables: This includes information on how to calculate derived variables and how to handle missing data.
  • How to define the analysis populations: This includes information on how to define the analysis populations (such as the intention-to-treat population or the per-protocol population) and how to handle protocol deviations.

The process of creating analysis datasets

The process of creating analysis datasets for a clinical trial typically begins with the collection of raw data. This data is then cleaned and transformed into SDTM format, which provides a standardized structure for the data. From there, the data is transformed into ADaM format, which provides a standardized structure for the analysis datasets.
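The raw-to-SDTM-to-ADaM flow can be sketched in miniature. This is a hypothetical illustration using plain Python dictionaries; the variable names follow common CDISC conventions (VSTESTCD, PARAMCD, AVAL), but the mapping rules shown are simplified assumptions, not a complete implementation of either standard.

```python
# Illustrative sketch of the raw -> SDTM -> ADaM flow using plain dicts.
# The mapping rules here are simplified assumptions for illustration only.

raw_record = {"subj": "1001", "test": "Systolic BP", "result": "120", "visit": "Week 2"}

def raw_to_sdtm(raw):
    """Map a raw vital-signs record into an SDTM VS-like structure."""
    return {
        "USUBJID": raw["subj"],
        "VSTESTCD": "SYSBP" if raw["test"] == "Systolic BP" else "OTHER",
        "VSORRES": raw["result"],          # result as originally collected
        "VISIT": raw["visit"],
    }

def sdtm_to_adam(sdtm):
    """Map an SDTM VS record into an ADaM BDS-like structure."""
    return {
        "USUBJID": sdtm["USUBJID"],
        "PARAMCD": sdtm["VSTESTCD"],
        "AVAL": float(sdtm["VSORRES"]),    # numeric analysis value
        "AVISIT": sdtm["VISIT"],
    }

advs_row = sdtm_to_adam(raw_to_sdtm(raw_record))
print(advs_row["PARAMCD"], advs_row["AVAL"])
```

In practice each step is driven by a written specification rather than hard-coded logic, but the shape of the transformation is the same: collected values become standardized tabulation variables, which in turn become analysis-ready variables.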

Once the analysis datasets have been created, they can be used to generate TFLs. This process typically involves the use of statistical software such as SAS or R to generate tables, figures and listings based on the analysis datasets. These TFLs are then reviewed and verified to ensure that they accurately reflect the data collected during the trial.

The protocol

The protocol is a critical document in the field of clinical research. It is a detailed plan that outlines how a clinical study will be conducted, including study timings, assessments, data collection methods, and endpoints. The protocol is the backbone of the study, and it is important for anyone involved in the study to thoroughly read and understand this document before proceeding with any other tasks.

When reading a protocol, there are several key questions that should be asked. These questions will help to ensure that everyone involved in the study has a clear understanding of the study design, and that the study is being conducted in a way that will produce valid and reliable results. Some of the initial questions that should be asked when reading a protocol include:

  1. What is the study population? The protocol should clearly define the study population, including inclusion and exclusion criteria. This information is critical for ensuring that the study is being conducted in a way that will produce meaningful results.
  2. What are the study objectives? The protocol should clearly state the study objectives, including primary and secondary endpoints. This information is critical for ensuring that the study is designed to answer the research questions of interest.
  3. What are the study procedures? The protocol should detail the study procedures, including how data will be collected, what assessments will be performed, and what study interventions will be used. This information is critical for ensuring that the study is being conducted in a way that is consistent with best practices in clinical research.
  4. What are the study timelines? The protocol should detail the study timelines, including when data will be collected, when analyses will be performed, and when the study is expected to be completed. This information is critical for ensuring that the study is being conducted in a way that is efficient and timely.
  5. What are the safety considerations? The protocol should detail the safety considerations for the study, including adverse event reporting requirements, monitoring plans, and risk management strategies. This information is critical for ensuring that the study is being conducted in a way that is safe for study participants.

Specific information for ADaM specifications

Blinding is an important consideration in clinical research, as it helps to reduce bias and ensure that the study results are valid. It is important to determine whether any data, apart from the randomization schema, will remain blinded until Database Lock (DBL). This will help to ensure that the study is being conducted in a way that is consistent with best practices in clinical research.

Subject disposition is another important consideration. Is this done at the study level? How is treatment discontinuation handled? Will a subject remain in the study if they have discontinued IMP? What visit schedule is followed for subjects who discontinue early?

The number of subjects in the study is another important consideration, as this will determine the volume of data that is expected. It is important to consider the duration of the study, as well as the visit timings. Are these the same for all trial arms? Are visits split into timepoints? Are they held across more than one day? Are they grouped into study periods, and are these likely to be used in the analysis?

Primary, secondary, and exploratory objectives are critical components of the study protocol. It is important to understand what these objectives are, how they are to be measured, and what datasets will be needed. Are there any corporate templates? Will a new standard need to be requested? What structure will be needed, and what are the data sources? Are multiple sources required for one efficacy/safety/exploratory dataset? It is also important to consider whether any statistical tests are specified in the protocol, and whether any tests are done at particular timepoints.

Inclusion/exclusion criteria should also be considered when crafting ADaM specifications. Will these need to be programmed? Will these need to be considered when defining protocol deviations? Concomitant medications are another important consideration. Are any prohibited? Are rescue medications defined? Are there any medications that are not investigational medicinal products (IMP) that need to be included in the analysis?

Subject demographics and characteristics should be included in ADSL. It is important to consider the medical history/condition of the subjects, as well as any underlying diseases, surgeries, or additional diagnostic tests that may affect their progress through the study.

Subgroup analysis and covariates are also important considerations. It is important to consider whether subgroup analysis data will be from ADSL, and to check the SDTM origin of these data. Covariates should be considered, and it should be determined whether they will be from ADSL or if it is more efficient for them to be created in the efficacy dataset. Company policy for covariates should also be considered.

Exploratory analysis is another important consideration. It is important to understand how this data is collected, whether any specialist instruments are used, and what the company standards are. Is this similar enough to the primary efficacy that it will be included in the same ADaM, or will an additional dataset be required?

Finally, it is important to consider whether any data collected may be analyzed by a different team, such as PK data. If so, does an ADPK dataset specification still need to be created?

Information from Case Report Form (CRF)

In the field of clinical research, the Case Report Form (CRF) is a critical document that investigators use to input data into the database. The CRF is designed to mirror the data collection modules in the database and includes code lists for questions with lists of answers that investigators can choose from. Annotating the CRF with the SDTM domain is an important step in the data collection process, as it helps to ensure that the data collected is suitable for analysis.

The CRF is an important document in the data collection process, as it provides a standardized method for collecting data from study participants. The CRF should be designed to be consistent with the study protocol, ensuring that the data collected is relevant to the study objectives and endpoints. The CRF should also be designed to be user-friendly, with clear and concise instructions for investigators on how to complete the form.

Code lists

One important aspect of the CRF is the use of code lists for questions with lists of answers that investigators can choose from. This helps to ensure that the data collected is consistent and that it can be easily analyzed. Code lists should be created in a standardized format, and they should be consistent with the SDTM domain. This will help to ensure that the data collected is suitable for analysis and that it can be easily mapped to the appropriate ADaM dataset.
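As a small illustration, applying a code list is essentially a lookup from collected codes to standardized terms. The severity code list below is hypothetical, not taken from any official CDISC controlled terminology.

```python
# Hypothetical sketch: applying a standardized code list to CRF responses.
# The code list values below are illustrative, not an official CDISC list.
severity_codelist = {"1": "MILD", "2": "MODERATE", "3": "SEVERE"}

crf_responses = ["1", "3", "2", "1"]
decoded = [severity_codelist[code] for code in crf_responses]
print(decoded)   # standardized terms ready for SDTM/ADaM mapping
```

Keeping the decode in one shared code list, rather than scattered through programs, is what makes the collected data consistent across forms and easy to map downstream.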

Annotated CRF

Annotating the CRF with the SDTM domain is an important step in the data collection process. This involves adding metadata to the CRF that indicates the SDTM domain that each data element is associated with. This metadata helps to ensure that the data collected is consistent with the SDTM standard and that it can be easily mapped to the appropriate ADaM dataset. Annotating the CRF with the SDTM domain also helps to ensure that the data collected is suitable for analysis and that it can be easily integrated into the overall study database.

When annotating the CRF with the SDTM domain, it is important to consider the format of the data as well as the anticipated values of the variables. This will help to ensure that the data collected is consistent with the SDTM standard and that it can be easily mapped to the appropriate ADaM dataset. It is also important to ensure that the metadata is consistent with the SDTM standard and that it includes all of the necessary information to ensure that the data collected is suitable for analysis.

Study Data Tabulation Model (SDTM) specification

The Study Data Tabulation Model (SDTM) specification is a critical document that contains the mappings from the raw data to the SDTM datasets. The SDTM specification provides a standardized format for organizing and presenting study data, and it is an essential component of the data analysis process.

The SDTM specification serves as a reference document when writing ADaM specifications. The ADaM specification outlines how the analysis datasets will be structured and what variables will be included. By referencing the SDTM specification, it is possible to ensure that the ADaM specifications are consistent with the SDTM standard and that the analysis datasets are structured in a way that is suitable for analysis.

The SDTM specification includes a number of important elements, including the number and type of domains to be mapped, the variables that will be included in each domain, the mappings, code lists, and any derivations (for example, of Baseline). The SDTM specification also includes the final structure of the dataset, which serves as a reference document when writing the ADaM specifications.

CDASH

One important consideration when working with the SDTM specification is whether the database is set up to be CDASH compliant. If so, the mapping from raw data to the SDTM datasets will be straightforward. CDASH (Clinical Data Acquisition Standards Harmonization) is a set of standard data collection templates that are used to streamline the data collection process and ensure that data is collected in a standardized format. CDASH compliant databases make it easier to create mappings between raw data and SDTM datasets, and they also help to ensure that the data collected is consistent and can be easily analyzed.

When working with the SDTM specification, it is important to consider the more complicated SDTM domains, such as Disposition (DS) and Laboratory Test Results (LB). These domains require careful attention to detail, as they involve complex data structures and may require additional mappings and derivations.

Statistical Analysis Plan (SAP)

The Statistical Analysis Plan (SAP) is a critical document that expands on the analysis from the protocol. The SAP provides a detailed plan for how the study data will be analyzed, including the statistical methods that will be used and the final outputs that will be produced. The SAP is an essential component of the data analysis process, and it is important to carefully review the document to ensure that the analysis is valid, reliable, and accurate.

When reading the SAP, there are a number of important points and questions to keep in mind. These include:

  1. What statistical methods will be used? The SAP should provide a detailed explanation of the statistical methods that will be used to analyze the data. This should include information on the types of statistical tests that will be used, the significance level that will be used, and any adjustments that will be made for multiple comparisons.
  2. What endpoints will be analyzed? The SAP should specify which endpoints will be analyzed and how they will be measured. This may include primary, secondary, and exploratory endpoints.
  3. What populations will be analyzed? The SAP should specify which populations will be analyzed, such as the intent-to-treat (ITT) population or the per-protocol population.
  4. What datasets will be created? The SAP should provide information on which datasets will be created and how they will be structured. This may include information on the ADaM datasets that will be created and the variables that will be included.
  5. What statistical models will be used? The SAP should provide information on the statistical models that will be used to analyze the data. This may include information on the types of regression models that will be used, such as logistic regression or linear regression.
  6. What adjustments will be made for confounding variables? The SAP should specify whether any adjustments will be made for confounding variables, such as age, gender, or baseline characteristics.
  7. What sensitivity analyses will be conducted? The SAP should provide information on any sensitivity analyses that will be conducted to assess the robustness of the study results.
  8. What assumptions will be made? The SAP should specify the assumptions that will be made when analyzing the data, such as assumptions of normality or homogeneity of variance.
  9. What significance level will be used? The SAP should specify the significance level that will be used to assess the statistical significance of the study results.
  10. What outputs will be produced? The SAP should provide information on the final outputs that will be produced, such as tables, figures, and listings. These outputs should be consistent with the objectives and endpoints of the study.

Elements to be considered to create ADaM

Subject level flags in ADSL

One of the essential elements to consider when creating analysis sets is the subject level flags in ADSL. These flags should be populated for all subjects/observations in a dataset, and their values should be either Y or N. Additionally, all populations defined in the Statistical Analysis Plan (SAP) should be considered when creating these datasets.
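A minimal sketch of deriving such flags follows. The rules used (any dose received implies the safety population; randomized implies ITT) are commonly used definitions, but the actual rules for a given study always come from the SAP.

```python
# Minimal sketch of deriving subject-level population flags for ADSL.
# The rules below are common conventions; the SAP defines the real ones.

subjects = [
    {"USUBJID": "1001", "randomized": True,  "doses_received": 10},
    {"USUBJID": "1002", "randomized": True,  "doses_received": 0},
    {"USUBJID": "1003", "randomized": False, "doses_received": 0},
]

for subj in subjects:
    subj["ITTFL"] = "Y" if subj["randomized"] else "N"           # intent-to-treat
    subj["SAFFL"] = "Y" if subj["doses_received"] > 0 else "N"   # safety population

print([(s["USUBJID"], s["ITTFL"], s["SAFFL"]) for s in subjects])
```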

Disposition

Another element to consider when creating analysis sets is disposition. Study disposition should be clearly distinguished from study treatment disposition, and ADaM Implementation Guide version 1.1 provides naming conventions for these variables. It is also essential to decide whether Screen Failure subjects will be included in ADaM and, if so, which datasets they will be included in.

Protocol Deviations

When creating analysis sets, it is also crucial to define the dataset that will contain the Protocol Deviations. This can be variables in ADSL or a separate dataset, depending on the Sponsor’s choice. Working with the Statistician and the study team to define the deviations is essential, and it is necessary to use the CRF and SDTM specifications to explicitly define the programmable deviations. Additionally, deviations that cannot be defined due to the way data is captured in the CRF should be incorporated into the analysis sets.

Baseline

Another essential element to consider when creating analysis sets is Baseline. It is necessary to define Baseline for Safety and Efficacy and decide whether it is the same for all measurements or whether it needs to be defined differently across datasets. It is also crucial to determine whether multiple Baselines are required for multiple analysis periods and if the definition/flags from SDTM can be used or if the SAP requires a more complex derivation, such as multiple observations.
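One common baseline rule is that the last non-missing measurement on or before first dose is flagged as baseline (ABLFL = "Y"). The sketch below assumes that rule; the actual definition must come from the SAP. Dates are ISO-8601 strings, so plain string comparison orders them correctly.

```python
# Sketch of a common baseline rule: last non-missing value on or before
# first dose is flagged ABLFL = "Y". The rule itself comes from the SAP.

first_dose_date = "2023-01-10"
records = [
    {"ADT": "2023-01-03", "AVAL": 118.0},
    {"ADT": "2023-01-09", "AVAL": 122.0},   # last pre-dose value -> baseline
    {"ADT": "2023-01-24", "AVAL": 115.0},
]

pre_dose = [r for r in records if r["AVAL"] is not None and r["ADT"] <= first_dose_date]
baseline = max(pre_dose, key=lambda r: r["ADT"]) if pre_dose else None

for r in records:
    r["ABLFL"] = "Y" if r is baseline else ""

print([(r["ADT"], r["ABLFL"]) for r in records])
```

When the SAP requires multiple baselines for multiple analysis periods, the same selection is simply repeated once per period, with the window boundaries changed.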

Missing data

Missing data is another crucial element to consider when creating analysis sets. It is essential to determine how missing data will be imputed for efficacy and safety, and whether dates and times should be imputed. It is also necessary to decide whether visit windowing should be applied and to which datasets. Additionally, it should be determined how the analysis visits are structured and whether APERIOD and APHASE variables are required.
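Partial-date imputation is a typical example. The sketch below assumes a convention of imputing a missing day or month to "01"; conventions differ between sponsors (missing days are sometimes imputed to "15", or conservatively toward or away from the treatment start date), and the chosen rule is fixed in the SAP, not here.

```python
# Sketch of imputing partial dates. The "impute to 01" convention used
# here is an assumption; the SAP fixes the actual rule per study.

def impute_partial_date(partial):
    """Complete a partial ISO-8601 date string such as '2023' or '2023-05'."""
    parts = partial.split("-")
    year = parts[0]
    month = parts[1] if len(parts) > 1 else "01"
    day = parts[2] if len(parts) > 2 else "01"
    return f"{year}-{month}-{day}"

print(impute_partial_date("2023-05"))   # -> 2023-05-01
print(impute_partial_date("2023"))      # -> 2023-01-01
```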

Efficacy models

Efficacy models are also an essential element to consider when creating analysis sets. Primary endpoints, derivations, imputations, timepoints, models, covariates, and subgroups must all be considered and included in the efficacy dataset(s). It is also necessary to perform sensitivity analyses of these primary endpoints and determine whether the definition of a responder vs. a non-responder needs to be a flag in ADSL as well as the efficacy ADaM. Secondary endpoints and exploratory endpoints should also be considered, and it should be determined whether they are similar to the primary endpoint and included in the same ADaM or if additional datasets are required.
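A responder definition usually reduces to a rule over the derived analysis value, for example a percent change from baseline meeting a threshold. The 20% cut-off below is a placeholder assumption; the real definition lives in the protocol and SAP.

```python
# Sketch of a responder derivation based on percent change from baseline.
# The 20% reduction threshold is a placeholder, not a real study rule.

RESPONSE_THRESHOLD_PCT = -20.0   # at least a 20% reduction counts as response

def pct_change(baseline, value):
    return (value - baseline) / baseline * 100.0

def responder_flag(baseline, value):
    return "Y" if pct_change(baseline, value) <= RESPONSE_THRESHOLD_PCT else "N"

print(responder_flag(100.0, 75.0))   # 25% reduction
print(responder_flag(100.0, 90.0))   # 10% reduction
```

If the flag is needed both for subject-level summaries and for the efficacy model, this is exactly the case where it may belong in ADSL as well as the efficacy ADaM.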

Safety summaries

Safety summaries are another critical element to consider when creating analysis sets. The complexity of these summaries should be determined, and it should be decided whether all safety data will be analyzed in the same way. Imputations required should also be determined, and the appropriate dataset structure (ADaM Basic Data Structure (BDS), Occurrence Dataset Structure (OCCDS), or other) should be chosen. Additionally, coding or standardizing of tests should be considered.
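A typical OCCDS-style safety derivation is the treatment-emergent flag. The sketch below assumes the simplest rule, that an event is treatment-emergent if it starts on or after first dose; real definitions often add a lag window after last dose, and the SAP governs.

```python
# Sketch of a treatment-emergent AE flag (TRTEMFL) for an OCCDS-style
# dataset. The simple "on or after first dose" rule is an assumption;
# real rules often include a post-treatment lag defined in the SAP.

first_dose = "2023-01-10"
adverse_events = [
    {"AEDECOD": "HEADACHE", "ASTDT": "2023-01-05"},
    {"AEDECOD": "NAUSEA",   "ASTDT": "2023-01-12"},
]

for ae in adverse_events:
    ae["TRTEMFL"] = "Y" if ae["ASTDT"] >= first_dose else "N"

print([(ae["AEDECOD"], ae["TRTEMFL"]) for ae in adverse_events])
```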

Covariates

Covariates can be included in different SDTM datasets depending on the variable’s origin and meaning. For example, demographic covariates, such as age, sex, and race, are usually included in the Demographics (DM) domain, while medical history and concomitant medication covariates are included in the Medical History (MH) and Concomitant Medications (CM) domains, respectively. Additionally, some domains, such as the Laboratory Tests (LB) and Vital Signs (VS) domains, may include covariates related to the test or measurement being performed.

When considering whether covariates should be added to ADSL, it is essential to evaluate their relevance and potential impact on the analysis. If the covariate is important and used frequently in different analysis sets, it may be beneficial to add it to ADSL. However, if the covariate is only used in a specific analysis set, it may not be necessary to include it in ADSL.

Furthermore, when creating analysis sets, it is important to determine whether covariates should be included in all ADaM datasets. The decision to include a covariate in all ADaM datasets may depend on its relevance to the analysis and whether it is expected to affect multiple endpoints. Including covariates in all ADaM datasets may increase the file size and processing time, so it is essential to evaluate the benefits and drawbacks of this approach.

Subgroups

Another important element to consider when creating analysis sets is subgroups. Subgroups are groups of subjects within a study that share a specific characteristic, such as age or disease severity. Subgroup analysis is often performed to assess the treatment effect in specific populations or to identify potential subgroups that may benefit more from the treatment.

When creating analysis sets, it is essential to determine how subgroups are grouped and whether they should be stored in ADSL. Subgroups are usually stored in ADSL, but it is important to evaluate whether they should be copied to all other ADaM datasets. Copying subgroups to all ADaM datasets may increase the file size and processing time, so it is important to evaluate the benefits and drawbacks of this approach.
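Carrying a subgroup variable from ADSL onto another dataset is a subject-level merge, which keeps the derivation in one place. A minimal sketch, using a hypothetical age-group variable AGEGR1:

```python
# Sketch of copying an ADSL-stored subgroup variable onto a BDS efficacy
# dataset via a subject-level merge, rather than re-deriving it per dataset.
# AGEGR1 and the groupings are illustrative.

adsl = {
    "1001": {"AGEGR1": "<65"},
    "1002": {"AGEGR1": ">=65"},
}

adeff = [
    {"USUBJID": "1001", "PARAMCD": "SCORE", "AVAL": 4.0},
    {"USUBJID": "1002", "PARAMCD": "SCORE", "AVAL": 7.0},
]

for row in adeff:
    row["AGEGR1"] = adsl[row["USUBJID"]]["AGEGR1"]   # copy subgroup from ADSL

print([(r["USUBJID"], r["AGEGR1"]) for r in adeff])
```

Deriving the subgroup once in ADSL and merging it outward avoids the risk of two datasets disagreeing on a subject's group.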

Required datasets

When deciding what datasets to create, it may be useful to find out what datasets are required for a submission. Regulatory agencies, such as the US Food and Drug Administration (FDA), may require a specific subset of datasets for submission. The required datasets may vary depending on the study design, endpoints, and data collected. Therefore, it is important to review the regulatory guidelines and discuss with the study team to ensure that the required datasets are created.

ADSL

However, regardless of the required datasets for submission, an ADSL dataset is mandatory for an analysis dataset submission to be considered ADaM compliant. The ADSL dataset contains subject-level data and is the foundation for creating the other analysis datasets. It is essential for producing accurate and consistent analysis across different endpoints and analysis sets.

In addition to the ADSL dataset, other analysis datasets may be required depending on the study design and endpoints. For example, efficacy datasets may include primary and secondary endpoints, responder analyses, and subgroup analyses. Safety datasets may include adverse events, concomitant medications, and vital signs. Each dataset should be structured in a standard way, as defined by the ADaM specifications, to ensure consistency and comparability across studies.

TFL shells

TFL shells, or table, figure, and listing shells, are mock-ups of the outputs that give the programmer a template for creating the final outputs for the clinical study report (CSR). These shells typically follow a standard template, but for some efficacy analyses, new templates may be required. The programming notes contained in the TFL shells may also contain additional information not included in the SAP, such as sort orders, parameters that must be included, and selected coded variables.

It is important to note that the information contained in the TFL shells needs to be included in the ADaM dataset specifications. This may include information on observation selection methods, which variables are to be summarized, categorized, or selected for the model, and any other programming notes. If this information cannot be included in the ADaM specification itself, then it must be included in the Analysis Data Reviewer’s Guide.

Annotated TFL shells

To ensure that the ADaM datasets have been created for all outputs and nothing has been missed, it is recommended to annotate the TFL shells with the dataset, observation selection method, and the variables to be summarized or categorized for the model. This will help the programming team to select the correct variables for the analysis, as created by the specification writer. This ensures that the protocol and SAP are interpreted correctly and consistently, giving high-quality outputs to be used in the clinical study reports.

Furthermore, by annotating the TFL shells with the necessary information, it can help to streamline the programming process, minimize errors, and reduce the need for revisions. It can also help to improve the quality and consistency of the outputs, ensuring that they are accurate and informative.

Putting together the ADaM specification

When putting together the ADaM specification, it is important to consider all the elements of the analysis plan, as discussed in previous sections. It is also important to follow CDISC standards and guidelines for ADaM specifications. Here are some key steps to follow:

  1. Define the structure of the datasets: Use SDTM specifications to determine the structure of the datasets. This includes the number and type of domains, as well as the variables to be included. Consider any required derivations or transformations, and define them clearly in the specifications.
  2. Define any flags or variables needed for analysis sets: Review the SAP to determine the analysis sets that will be needed. Define any subject level or observation level flags or variables that will be needed for these sets.
  3. Define the structure of the safety datasets: Determine the structure of the safety datasets, including the summaries and any required imputations.
  4. Define any required baseline data: Determine the definition of baseline data for safety and efficacy datasets. Consider any required derivations or transformations.
  5. Define any required programming for protocol deviations: Define the dataset that will contain the deviations and work with the Statistician and study team to define the deviations. Use the CRF and SDTM specifications to explicitly define the programmable deviations.
  6. Ensure compliance with CDISC standards: Ensure that the ADaM specifications comply with CDISC standards and guidelines, and follow any applicable company or industry standards.
  7. Review and finalize the specifications: Review the ADaM specifications for completeness and accuracy, and finalize them for use in programming.

ADaM structure

Datasets are the foundation of ADaM specifications, and they must be created in accordance with Sponsor guidelines and CDISC ADaM standards. The define.xml file requires metadata about each dataset, such as the ADaM dataset class (ADSL, BDS, OCCDS, or other), name, description, and primary key. Naming conventions for ADaM datasets can also aid in maintaining traceability between SDTM and ADaM. For example, naming the ADaM dataset after the SDTM input domain, such as ADAE for the AE domain, signals that an OCCDS dataset applies an occurrence structure to the SDTM adverse events data.
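The dataset-level metadata that feeds define.xml can be thought of as a small table: one row per dataset, with its class, label, and key variables. A hypothetical sketch, with illustrative labels and keys:

```python
# Hypothetical sketch of dataset-level metadata feeding define.xml:
# class, label, and key variables per ADaM dataset. Values illustrative.

dataset_metadata = [
    {"DATASET": "ADSL", "CLASS": "SUBJECT LEVEL ANALYSIS DATASET",
     "LABEL": "Subject-Level Analysis Dataset",
     "KEYS": ["STUDYID", "USUBJID"]},
    {"DATASET": "ADAE", "CLASS": "OCCURRENCE DATA STRUCTURE",
     "LABEL": "Adverse Events Analysis Dataset",
     "KEYS": ["STUDYID", "USUBJID", "AEDECOD", "ASTDT"]},
]

for meta in dataset_metadata:
    print(meta["DATASET"], "-", meta["LABEL"])
```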

Variables

Variables are the building blocks of ADaM datasets, and they must be created in accordance with Sponsor guidelines and CDISC ADaM standards. For each dataset, the required variables must be identified based on the input documentation, including the protocol, SAP, and CRF. Identifier variables, timepoint variables, and topic variables (coded terms or parameter names and analysis values) must be included in the dataset structure. Additionally, Baseline, analysis periods, and analysis enabling variables may be required or permitted for the dataset structure.

Code lists

Code lists are often overlooked in the ADaM specification creation process, but they play a critical role in ensuring consistency and accuracy in the analysis. Creating code lists for categorical or grouping variables, such as subgroups, analysis categories, analysis visits, or time points, can guide the programmer on how to populate variables and ensure consistency with the final outputs. CDISC SDTM and ADaM standards should be used as much as possible when creating code lists, and they should be included in the metadata to ensure compliance with the CDISC standards.

Computational algorithms

Finally, computational algorithms can aid in creating derivations or imputations that are required across multiple datasets. For example, visit windowing or analysis period creation can be applied to more than one dataset, and an imputation technique may be required in multiple datasets. Creating a computational algorithm can serve as the basis for a macro that can be called in multiple dataset creation programs, ensuring consistency and accuracy in the analysis. Sponsor guidelines should be followed when creating computational algorithms to ensure that the metadata is created correctly.
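Visit windowing is a good example of such a shared algorithm: one function (or macro, in SAS terms) assigns the analysis visit from the study day, and every dataset program calls it. The window boundaries below are illustrative; actual windows are defined in the SAP.

```python
# Sketch of a reusable visit-windowing algorithm, analogous to a macro
# that several dataset programs could call. Window boundaries illustrative.

VISIT_WINDOWS = [
    ("BASELINE", -7, 1),
    ("WEEK 2",    2, 21),
    ("WEEK 4",   22, 35),
]

def assign_avisit(study_day):
    """Return the analysis visit whose window contains study_day, else None."""
    for avisit, low, high in VISIT_WINDOWS:
        if low <= study_day <= high:
            return avisit
    return None

print(assign_avisit(15))   # falls in the WEEK 2 window
print(assign_avisit(30))   # falls in the WEEK 4 window
```

Because every dataset uses the same table of windows, a change to one boundary propagates everywhere, which is precisely the consistency benefit the specification should document.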

Tracking specification updates

Firstly, it is important to understand that ADaM dataset specifications are not set in stone and may evolve throughout the study. As conversations with the Statistician and study team evolve, so may the analysis and the specification. This means that any changes to the specifications must be carefully controlled and logged to ensure that the ADaM datasets remain consistent and accurate.

One method of tracking specification updates is to put updates in a different color to highlight any differences. This makes it easy for the programming team to see where changes have been made and what those changes are. Another method is to date and log the updates in a dedicated part of the specification. This provides a clear record of when changes were made and what they were. Additionally, detailing updates in a separate document or email to the programming team can ensure that everyone is aware of any changes and can update their work accordingly.

Finally, logging questions and decisions in a dedicated document that can be uploaded into the Trial Master File (TMF) can provide a clear audit trail of the changes made to the ADaM dataset specifications throughout the study. This ensures that any regulatory inquiries can be easily answered and that the trial remains compliant.

It is important to note that while updates to ADaM specifications may be necessary, they should be made with care and consideration. Any changes must be thoroughly reviewed by the Statistician and study team to ensure that they are necessary and will not affect the accuracy or integrity of the study results. Additionally, any updates should be communicated clearly to the programming team to ensure that all members are aware of the changes and can adjust their work accordingly.

In addition to specification updates, data issues may also be encountered during the creation of ADaM datasets. These issues should also be logged and communicated to the Data Management team. If certain errors are identified consistently, then additional checks or study site retraining may be required to ensure that the data is accurate and consistent across all sites.
