Automating Data Definition Document Creation: Streamlining Content Creation and Facilitating Document Review

Christian Baghai
Mar 6, 2023

Introduction

Data definition documents are an essential part of any study involving data collection and analysis. The document specifies the variables defined in the database and provides other critical information such as origin and controlled terminology. It should be created before the database is validated, so that the database can be checked against the intended purpose of the study.

FDA eSub and eCTD

The FDA electronic submission (eSub) guidance and the electronic Common Technical Document (eCTD) documents both require a document describing the content and structure of the data included in any submission. Typically, three Data Definition Documents are needed for a submission to the FDA: ODM, SDTM, and ADS/ADaM. The ODM document provides the Case Report Tabulations data definitions in XML format, the SDTM document provides the Study Data Tabulation data definitions in XML format, and the ADS/ADaM document provides the analysis data definitions in PDF format.

Content creation and format conversion

The process of creating a data definition document is tedious and time-consuming, and it can be divided into two steps: content creation and format conversion. The content creation process involves defining variables, variable attributes, and comments. The format conversion process involves converting the document to XML or PDF with all the necessary hyperlinks.

This article will focus on the automation of content creation using two macros in Excel. The macros can produce both short and long versions of the data definition document, which helps facilitate the document review process during the study.

Content Creation

Defining variables, variable attributes, and comments is typically done manually, which is time-consuming and prone to errors. A macro developed in Excel can automate this step, reducing both the errors and the time needed to create the document.

The first macro creates the short version of the data definition document, which contains only the key columns: Variable Name, Label, Type, Length, and Description. The user enters the required information for each variable, and the macro automatically populates the remaining columns.

The second macro creates the long version, which includes all the columns required for a complete data definition document: Variable Name, Label, Type, Length, Origin, Code List, Controlled Terminology, Question Text, and Description. As with the short version, the user enters the required information for each variable, and the macro populates the remaining columns.

The macros are designed to be user-friendly and let the user add or delete columns as needed. This flexibility lets the same macros be adapted to the needs of different studies.
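The relationship between the two versions can be illustrated with a minimal sketch. It is written in Python rather than as an Excel macro, and it is not the macros described above: the helper name `make_version` and the sample variable are hypothetical, and only the column names come from the text.

```python
# Column sets as listed in the article; everything else here is illustrative.
LONG_COLUMNS = [
    "Variable Name", "Label", "Type", "Length", "Origin",
    "Code List", "Controlled Terminology", "Question Text", "Description",
]
SHORT_COLUMNS = ["Variable Name", "Label", "Type", "Length", "Description"]


def make_version(rows, columns):
    """Return one dict per variable restricted to `columns`.

    Columns the user has not filled in yet are created as empty strings,
    so every row has the same layout.
    """
    return [{col: row.get(col, "") for col in columns} for row in rows]


# Hypothetical user input: one variable with a few attributes entered by hand.
entered = [{"Variable Name": "AGE", "Label": "Age", "Type": "Num", "Length": 8}]

long_version = make_version(entered, LONG_COLUMNS)
short_version = make_version(long_version, SHORT_COLUMNS)
```

Deriving the short version from the long one, rather than maintaining both by hand, keeps the two from drifting apart during review.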

Document Review Process

The data definition document is used to ensure that the data collected are accurate and complete and that the database meets the intended purpose of the study.

The macros help facilitate document review during the study. The short version is used for team review, and the long version is loaded into an internal application that converts the Excel file to the desired format (XML or PDF) with all necessary hyperlinks.

Problem with controlled terminology and format

Data definition documents specify the variables defined in the database, along with other critical information such as origin, controlled terminology, and format. Two common deliverables are Define.pdf, a PDF used for analysis data (ADDM), and Define.xml, an XML file used for SDTM (DM). Creating these documents is tedious and time-consuming, and some fields are harder to populate than others.

Using Existing Source Files

One way to simplify the creation process of data definition documents is to use existing source files. By using these files, we can automatically populate fields like Origin and Controlled Terms or Format. This approach saves time and effort while ensuring that the information provided is accurate.

Origin

For example, in the case of Origin, we can use existing source files to determine where the data came from. This approach is particularly useful when dealing with data from multiple sources or when the Origin field is not clear. By using the source files, we can accurately determine the Origin of the data and populate the field automatically.

Controlled Terms and Format

Similarly, in the case of Controlled Terms or Format, we can use existing source files to determine the appropriate controlled terms. This approach is especially useful when the CRF design does not use CDISC recommended controlled terms. By using the source files, we can map the variables to the appropriate controlled terms automatically.
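As a rough sketch of this mapping step, the Python below translates values as collected on the CRF into controlled terms and sets aside anything it cannot map. The mapping table, the function name `controlled_terms`, and the sample values are all assumptions made for illustration; a real mapping would have to be checked against the terminology version used by the study.

```python
# Hypothetical mapping from CRF values to controlled terms, keyed by variable.
CRF_TO_CONTROLLED = {
    "SEX": {"Male": "M", "Female": "F"},
}


def controlled_terms(variable, source_values, mapping=CRF_TO_CONTROLLED):
    """Return (mapped terms, unmapped raw values) for one variable."""
    lookup = mapping.get(variable, {})
    mapped, unmapped = [], []
    for value in source_values:
        if value in lookup:
            mapped.append(lookup[value])
        else:
            # Keep the raw value so a reviewer can decide how to handle it.
            unmapped.append(value)
    return mapped, unmapped


terms, needs_review = controlled_terms("SEX", ["Male", "Female", "Unknown"])
```

Returning the unmapped values separately, instead of dropping them, leaves a list that still needs a manual decision.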

Using existing source files also ensures consistency between the data definition document and the variables defined in the database. By comparing the data definition document against the source files, we can identify any discrepancies and ensure that the document accurately reflects the variables defined in the database.
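One simple form of this comparison is a set difference on variable names. The sketch below assumes both sides have already been reduced to lists of names; the function name and the sample variables are hypothetical.

```python
def find_discrepancies(document_vars, database_vars):
    """Compare variable names in the document with those in the database."""
    doc, db = set(document_vars), set(database_vars)
    return {
        # In the database but not described in the document.
        "missing_from_document": sorted(db - doc),
        # Described in the document but absent from the database.
        "not_in_database": sorted(doc - db),
    }


report = find_discrepancies(
    document_vars=["USUBJID", "AGE", "SEX"],
    database_vars=["USUBJID", "AGE", "RACE"],
)
```

The same pattern extends to attributes such as label or type by comparing (name, attribute) pairs instead of bare names.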

Validation

The approach of using existing source files can also be used for validation purposes. By comparing the data definition document against internal standards, we can ensure that the document meets the necessary criteria. This approach is particularly useful when working towards submission milestones, as it helps ensure that the document is accurate and complete.
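What "internal standards" contain is not specified here, so the sketch below invents a small rule set purely for illustration: a list of columns that must be filled in and a set of allowed type values. The names `REQUIRED`, `ALLOWED_TYPES`, and `validate` are all hypothetical.

```python
# Hypothetical internal standard; replace with the rules your team actually uses.
REQUIRED = ("variable", "label", "type", "origin")
ALLOWED_TYPES = {"Char", "Num"}


def validate(rows):
    """Return a list of (variable, problem) tuples; an empty list means no findings."""
    findings = []
    for row in rows:
        name = row.get("variable", "<unnamed>")
        for column in REQUIRED:
            if not str(row.get(column, "")).strip():
                findings.append((name, f"missing {column}"))
        if row.get("type") and row["type"] not in ALLOWED_TYPES:
            findings.append((name, f"unexpected type {row['type']!r}"))
    return findings


findings = validate([
    {"variable": "AGE", "label": "Age", "type": "Num", "origin": "CRF"},
    {"variable": "SEX", "label": "", "type": "Text", "origin": "CRF"},
])
```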

Output Files

Creating a data definition document requires having all the necessary column information available, including variables, labels, types, code/format, controlled terms, origin, role, and comments.

In this case, a CSV file is created that contains at least these columns. The variable, label, type, and code/format columns are obtained from the SAS contents of the datasets. The remaining columns previously had to be created manually; here they come from sources other than the SAS datasets.
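A minimal Python sketch of assembling such a CSV follows. It assumes the SAS contents have already been exported to a list of records and that the other attributes are available as a lookup keyed by variable name; both inputs, their values, and the function names are made up for the example.

```python
import csv
import io

COLUMNS = ["variable", "label", "type", "code/format",
           "controlled terms", "origin", "role", "comments"]

# Assumed stand-in for an export of the SAS dataset contents.
sas_contents = [
    {"variable": "AGE", "label": "Age", "type": "Num", "code/format": "8."},
    {"variable": "SEX", "label": "Sex", "type": "Char", "code/format": "$1."},
]
# Assumed stand-in for attributes that come from sources other than SAS.
other_sources = {"AGE": {"origin": "CRF", "comments": "Illustrative comment"}}


def build_rows(contents, extras):
    """Merge SAS metadata with the extra attributes, blank-filling any gaps."""
    rows = []
    for meta in contents:
        row = dict.fromkeys(COLUMNS, "")
        row.update(meta)
        row.update(extras.get(meta["variable"], {}))
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(sas_contents, other_sources)
csv_text = to_csv(rows)
```

Blank cells in the output mark the fields that still have to be supplied from another source or by hand.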

Utilizing Existing Documents

The challenge in creating a data definition document electronically is how to utilize existing documents, such as standard or similar study data definition documents, to create the document for your study. This approach is particularly useful when datasets were created without formal dataset specification documents. Additionally, this approach can be used to enhance or improve dataset specifications that already exist during the analysis of the study.

The process of utilizing existing documents to create a data definition document involves several steps. These steps include:

Step 1: Obtain Existing Documents

Obtain any existing documents that may be used as a reference for creating the data definition document.

These documents may include standard or similar study data definition documents.

Step 2: Analyze Existing Documents

Analyze the existing documents to identify the relevant information needed for the data definition document.

This analysis involves identifying the key columns needed for the data definition document, including variables, labels, types, code/format, controlled terms, origin, role, and comments.

Step 3: Extract Information

Extract the relevant information from the existing documents and store it in a format that can be used to create the data definition document. This step may involve manual data entry or the use of macros to automate the process.

Step 4: Create Data Definition Document

Use the extracted information to create the data definition document. This step may involve the use of macros or other tools to automate the process.

Step 5: Review and Validate

Review and validate the data definition document to ensure that it accurately reflects the variables defined in the database and meets any necessary criteria.

This process makes it possible to create a data definition document electronically, using existing documents as a reference.
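Steps 3 and 4 might look something like the following Python sketch, which drafts rows for the current study by copying selected attributes from a reference document whenever a variable name matches. The function name, the choice of fields to copy, and the `needs_review` flag are assumptions for illustration, not part of the process described above.

```python
def prefill_from_reference(study_vars, reference_rows,
                           fields=("controlled terms", "origin", "role", "comments")):
    """Draft one row per study variable, copying `fields` from a reference document."""
    reference = {row["variable"]: row for row in reference_rows}
    drafted = []
    for var in study_vars:
        ref = reference.get(var, {})
        row = {"variable": var}
        for field in fields:
            row[field] = ref.get(field, "")
        # Flag variables with no reference entry so Step 5 can focus on them.
        row["needs_review"] = not ref
        drafted.append(row)
    return drafted


# Hypothetical reference document from a similar study.
reference_doc = [{"variable": "SEX", "origin": "CRF", "controlled terms": "M, F"}]
draft = prefill_from_reference(["SEX", "HEIGHT"], reference_doc)
```

Copied values are only a starting point; the review in Step 5 still has to confirm that they apply to the current study.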

Creating a data definition document is a crucial step in any study involving data collection and analysis. The document specifies the variables defined in the database and provides other critical information, such as origin, controlled terminology, and format. The process of creating the document can be tedious and time-consuming, but by using macros and utilizing existing documents, the process can be streamlined and made more efficient.
