Common Issues with Define.xml: Impact on Regulatory Review Process
Introduction
Define.xml is a critical document in the regulatory submission process that describes the structure and content of the submitted study data. It acts as the “Table of Contents” of the submission package and provides information on the study design, variable definitions, and other important details. A complete, properly created Define.xml can significantly improve the efficiency of the regulatory review process, while a poorly created one can result in delays and additional questions from the regulatory authorities.
Data Quality Findings
Despite Define.xml being a part of the data submission process for more than a decade, the industry is still struggling with the document’s creation and submission. Data quality findings from the FDA’s JumpStart service highlighted that a significant percentage of Define.xml documents submitted to regulatory authorities with study data are still faulty. In this article, we will discuss some common issues with Define.xml and their impact on the regulatory review process.
Common Issues with Define.xml
Non-Descriptive Values Used for Race:
One of the most common issues found in Define.xml is the use of non-descriptive values for race, such as “Other,” “Missing,” “<null>,” and so on. Such values are not specific and do not provide any information on the actual race of the study participants. The regulatory authorities expect descriptive values from CDISC controlled terminology, such as “WHITE,” “BLACK OR AFRICAN AMERICAN,” “ASIAN,” and so on, to accurately represent the race of the participants.
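As a quick illustration, this kind of finding can be scripted as a pre-submission probe. The sketch below is a minimal example, assuming a Demographics transport file named dm.xpt and the CDISC Race codelist terms; both the file name and the terminology list are assumptions to adapt to your own package.

```python
# Minimal sketch: flag non-descriptive RACE values in a Demographics
# dataset. The file name "dm.xpt" and the CDISC Race codelist terms
# below are assumptions; adapt both to your submission package.
import pandas as pd

CT_RACE = {
    "AMERICAN INDIAN OR ALASKA NATIVE",
    "ASIAN",
    "BLACK OR AFRICAN AMERICAN",
    "NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER",
    "WHITE",
}

dm = pd.read_sas("dm.xpt", format="xport", encoding="utf-8")
bad = dm.loc[~dm["RACE"].isin(CT_RACE), ["USUBJID", "RACE"]]
if bad.empty:
    print("All RACE values use controlled terminology.")
else:
    print(bad.drop_duplicates())
```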
Separate UNIT Codelist Not Used for Each Variable:
Another common issue found in Define.xml is the failure to use a separate UNIT codelist for each variable. A separate UNIT codelist for each variable is critical for the accurate representation of the data: a single catch-all UNIT codelist mixes units for unrelated measurements, forcing the reviewer to guess which units apply to which variable and risking incorrect interpretation of the data.
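A minimal sketch of how this can be checked: the script below parses define.xml with the standard library and groups result-unit variables by the codelist they reference, so a single UNIT codelist shared across many variables stands out. The ODM namespace URI and file name are assumptions based on Define-XML 2.0 conventions.

```python
# Minimal sketch: group result-unit variables in define.xml by the
# codelist they reference. A codelist shared by many variables is the
# "catch-all" pattern this finding warns about. The ODM namespace URI
# and the file name are assumptions based on Define-XML 2.0 conventions.
import xml.etree.ElementTree as ET
from collections import defaultdict

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"
root = ET.parse("define.xml").getroot()

by_codelist = defaultdict(list)
for item in root.iter(f"{ODM}ItemDef"):
    name = item.get("Name", "")
    if name.endswith(("ORRESU", "STRESU")):
        ref = item.find(f"{ODM}CodeListRef")
        if ref is not None:
            by_codelist[ref.get("CodeListOID")].append(name)

for oid, variables in by_codelist.items():
    flag = "  <-- shared by multiple variables" if len(variables) > 1 else ""
    print(f"{oid}: {', '.join(variables)}{flag}")
```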
Define.xml v2.0 Not Used:
Despite Define.xml v2.0 being available for more than a decade, many submissions still use the earlier version (v1.0). Define.xml v2.0 offers several enhancements, including improved support for complex studies, streamlined authoring and review processes, and better integration with other metadata standards. Failure to use Define.xml v2.0 can result in delays in the regulatory review process.
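Checking which version a file declares is straightforward. The sketch below assumes Define-XML 2.0 conventions, where the MetaDataVersion element carries a def:DefineVersion attribute; an older v1.0 file will simply not have the v2.0 attribute.

```python
# Minimal sketch: report the Define-XML version a file declares. In
# Define-XML 2.0 the MetaDataVersion element carries a def:DefineVersion
# attribute; a v1.0 file will not have it. File name is an assumption.
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"
DEF = "{http://www.cdisc.org/ns/def/v2.0}"

root = ET.parse("define.xml").getroot()
mdv = root.find(f"{ODM}Study/{ODM}MetaDataVersion")
version = mdv.get(f"{DEF}DefineVersion") if mdv is not None else None
print(f"Declared Define-XML version: {version or 'not declared (likely v1.0)'}")
```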
Applicant Extended by 10% or More When the Codelist was Extensible:
CDISC controlled terminology designates certain codelists as “extensible,” meaning applicants may add their own terms when no standard term fits. The finding here is that applicants extended an extensible codelist by 10% or more, which suggests the standard terminology was bypassed rather than supplemented, and such heavy extension triggers additional review by the regulatory authorities, delaying the review process. Applicants should extend codelists only when necessary and provide a clear explanation of why each added term is required.
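The extension percentage can be estimated directly from the file. The sketch below assumes Define-XML 2.0, which flags sponsor-added terms with def:ExtendedValue="Yes" on codelist items; the 10% threshold mirrors the finding above.

```python
# Minimal sketch: estimate how heavily each codelist was extended, using
# the def:ExtendedValue="Yes" flag Define-XML 2.0 places on sponsor-added
# terms. The 10% threshold mirrors the finding; file name is assumed.
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"
DEF = "{http://www.cdisc.org/ns/def/v2.0}"

root = ET.parse("define.xml").getroot()
for cl in root.iter(f"{ODM}CodeList"):
    items = (cl.findall(f"{ODM}CodeListItem")
             + cl.findall(f"{ODM}EnumeratedItem"))
    if not items:
        continue
    extended = sum(1 for i in items if i.get(f"{DEF}ExtendedValue") == "Yes")
    pct = 100 * extended / len(items)
    if pct >= 10:
        print(f"{cl.get('Name')}: {extended}/{len(items)} terms added ({pct:.0f}%)")
```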
Trial Summary Domain Does Not Follow Standards:
The Trial Summary (TS) domain is a critical SDTM dataset, described in Define.xml, that provides an overview of the study design, participants, and other important details. Failure to follow the standards for the Trial Summary domain can result in confusion and delays in the regulatory review process. Applicants should ensure that they follow the standards for the Trial Summary domain and provide accurate and comprehensive information.
Potential Clinically Relevant Duplicate Records Exist:
Duplicate records can result in inaccurate interpretation of the data and can have a significant impact on the regulatory review process. Applicants should identify and eliminate any potential duplicate records before submission.
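A simple pre-submission probe is to test each dataset against its natural key. The sketch below uses a Laboratory (LB) transport file and an illustrative key; the actual key variables should come from the key structure declared in your Define.xml.

```python
# Minimal sketch: test a Findings dataset against its natural key. The
# file name and key variables are illustrative; the real keys should
# come from the key structure declared in define.xml.
import pandas as pd

lb = pd.read_sas("lb.xpt", format="xport", encoding="utf-8")
key = ["USUBJID", "LBTESTCD", "LBDTC"]  # assumed natural key for LB

dups = lb[lb.duplicated(subset=key, keep=False)].sort_values(key)
if dups.empty:
    print("No records share a natural key.")
else:
    print(f"{len(dups)} records share a natural key:")
    print(dups[key])
```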
RFPENDTC Not Populated According to SDTM Guidance:
RFPENDTC is the Subject Reference End Date/Time in the Demographics (DM) domain and marks the end of each subject’s participation in the study, not the end of the study itself. Failure to populate RFPENDTC according to SDTM guidance can result in delays in the regulatory review process. Applicants should follow the SDTM guidance when populating RFPENDTC and provide accurate information.
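A basic completeness probe is easy to script. The sketch below only flags subjects with an unpopulated RFPENDTC in DM; the full derivation rules are study-specific, so treat this as a starting point rather than a conformance check.

```python
# Minimal sketch: flag subjects with an unpopulated RFPENDTC in DM.
# This is only a completeness probe; the actual derivation rules for
# RFPENDTC are study-specific. File name follows SDTM conventions.
import pandas as pd

dm = pd.read_sas("dm.xpt", format="xport", encoding="utf-8")
missing = dm[dm["RFPENDTC"].isna() | (dm["RFPENDTC"] == "")]
print(f"{len(missing)} of {len(dm)} subjects have no RFPENDTC populated")
```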
Use Define.xml v2.0
The first suggestion provided by JumpStart is to use Define.xml v2.0. As discussed above, many submissions still use the earlier version despite v2.0 having been available for more than a decade, and the enhancements in v2.0 (improved support for complex studies, streamlined authoring and review, and better integration with other metadata standards) directly benefit the review process.
Include Detailed Description of Data Elements
JumpStart suggests including a detailed description of data elements in Define.xml, such as codelists that describe categories, subcategories, and reference time points. Additionally, including reproducible computational algorithms for all derived variables, applicable value-level metadata, and descriptions of SUPPQUAL domains can significantly improve the efficiency of the regulatory review process.
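One scriptable slice of this recommendation is verifying that every computational method reference resolves. The sketch below, assuming ODM 1.3 / Define-XML 2.0 conventions and an assumed file name, flags ItemRefs whose MethodOID points at a MethodDef that does not exist in the file.

```python
# Minimal sketch: flag derived-variable references whose MethodOID does
# not resolve to a MethodDef in the same file. Element and attribute
# names follow ODM 1.3 / Define-XML 2.0; the file name is an assumption.
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"
root = ET.parse("define.xml").getroot()

methods = {m.get("OID") for m in root.iter(f"{ODM}MethodDef")}
for ig in root.iter(f"{ODM}ItemGroupDef"):
    for ref in ig.findall(f"{ODM}ItemRef"):
        moid = ref.get("MethodOID")
        if moid and moid not in methods:
            print(f"{ig.get('Name')}: {ref.get('ItemOID')} "
                  f"references missing method {moid}")
```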
Explanations of Sponsor-Defined Identifiers
Providing explanations of sponsor-defined identifiers (e.g., --SPID, --GRPID) can make it easier for regulatory authorities to understand the data and interpret it correctly.
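A small probe can flag identifier variables with no explanation at all. The sketch below assumes the explanation lives either in the ItemDef’s Description or behind a def:CommentOID reference, which are the usual Define-XML 2.0 placements; adjust if your metadata models this differently.

```python
# Minimal sketch: list sponsor-defined identifier variables that carry
# no explanatory text. Assumes the explanation lives either in the
# ItemDef's Description or behind a def:CommentOID reference, the usual
# Define-XML 2.0 placements. File name is an assumption.
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"
DEF = "{http://www.cdisc.org/ns/def/v2.0}"

root = ET.parse("define.xml").getroot()
for item in root.iter(f"{ODM}ItemDef"):
    name = item.get("Name", "")
    if name.endswith(("SPID", "GRPID")):
        has_desc = item.find(f"{ODM}Description/{ODM}TranslatedText") is not None
        has_comment = item.get(f"{DEF}CommentOID") is not None
        if not (has_desc or has_comment):
            print(f"{name}: no explanation provided")
```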
Provide Separate Unit Code Lists for Each Domain
JumpStart suggests providing separate unit codelists for each domain. As noted in the issues above, separate UNIT codelists are critical for the accurate representation of the data; without them, units for unrelated measurements are mixed together, leading to confusion and incorrect interpretation of the data.
Provide Study Data Reviewer’s Guide (SDRG)
Providing a Study Data Reviewer’s Guide (SDRG) for each data package with each section populated can significantly improve the efficiency of the regulatory review process. The SDRG should include a clear and detailed explanation for all “non-fixable” issues identified by FDA Validation Rules and a Data Flow diagram that shows traceability between data capture, storage, and creation of datasets.
Dataset-Level Information Checks
The dataset level information check involves verifying the accuracy and completeness of the individual datasets in the submission. Here are some critical elements that need to be checked:
• All submitted domains (.xpt) contain data and are not empty.
The first element that needs to be verified is whether all the datasets are present and contain data. It is critical to ensure that no dataset is missing or empty, as this can impact the analysis and interpretation of the results. The reviewer needs to confirm that every dataset in the package contains records.
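A minimal sketch of this check, assuming the transport files sit in a “datasets” folder:

```python
# Minimal sketch: verify every transport file contains at least one
# record. Assumes the .xpt files sit in a "datasets" folder.
import glob
import pandas as pd

for path in sorted(glob.glob("datasets/*.xpt")):
    df = pd.read_sas(path, format="xport")
    status = "OK" if len(df) > 0 else "EMPTY"
    print(f"{path}: {len(df)} records [{status}]")
```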
• Dataset labels follow the Implementation Guides and are not missing from the actual data or Define.xml.
The dataset labels need to be checked to ensure they are consistent with the Implementation Guides and match the labels in both the actual data and the Define.xml file. The labels need to be accurate and consistent across the submission, and any discrepancies need to be identified and corrected.
• The domain label matches the actual dataset (.xpt) label.
The domain label in Define.xml needs to match the label stored in the .xpt file itself. It is critical to ensure that the two are consistent, as any discrepancy can lead to confusion and errors in the analysis.
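The comparison can be automated with pyreadstat, which exposes the transport file’s label as meta.file_label. The sketch below compares the DM label in dm.xpt with the ItemGroupDef description in define.xml; the file names and the Description placement follow Define-XML 2.0 conventions and are assumptions.

```python
# Minimal sketch: compare the label stored inside dm.xpt with the label
# declared for DM in define.xml. pyreadstat exposes the transport file's
# label as meta.file_label; the Description placement follows Define-XML
# 2.0 conventions. File names are assumptions.
import xml.etree.ElementTree as ET
import pyreadstat

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"

_, meta = pyreadstat.read_xport("dm.xpt")
xpt_label = meta.file_label

root = ET.parse("define.xml").getroot()
for ig in root.iter(f"{ODM}ItemGroupDef"):
    if ig.get("Name") == "DM":
        desc = ig.find(f"{ODM}Description/{ODM}TranslatedText")
        define_label = desc.text if desc is not None else None
        print(f"xpt: {xpt_label!r}  define.xml: {define_label!r}  "
              f"match: {xpt_label == define_label}")
```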
• All attributes (structure, class, and keys) are present.
The structure, class, and key attributes need to be present and accurate. The structure attribute describes the shape of the dataset (for example, one record per subject per visit), the class attribute identifies the observation class the dataset belongs to, and the keys identify the variables that uniquely identify each record. It is critical to ensure that all of these attributes are present and accurate, as they drive the data analysis.
• The order of domains in Define.xml and the csdrg/adrg is correct.
The order of domains in Define.xml and the csdrg/adrg needs to be correct, following the SDTM or ADaM Implementation Guides. In SDTM, the order of domains should be Trial Design, Special Purpose, Interventions, Events, Findings, Findings About, and Relationship datasets. In ADaM, the order should be ADSL first, followed by all other datasets alphabetically by dataset name.
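A small sorting helper makes the SDTM ordering concrete. The class map below is illustrative and deliberately incomplete; extend it with the domains in your submission.

```python
# Minimal sketch: sort SDTM domains into the review order described
# above. The class map is illustrative and deliberately incomplete;
# extend it with the domains in your submission.
SDTM_CLASS = {
    "TA": 0, "TE": 0, "TI": 0, "TS": 0, "TV": 0,  # trial design
    "DM": 1, "CO": 1, "SE": 1, "SV": 1,           # special purpose
    "CM": 2, "EX": 2, "PR": 2,                    # interventions
    "AE": 3, "DS": 3, "MH": 3,                    # events
    "EG": 4, "LB": 4, "QS": 4, "VS": 4,           # findings
    "FA": 5,                                      # findings about
    "RELREC": 6, "SUPPAE": 6,                     # relationships
}

domains = ["LB", "DM", "AE", "TS", "SUPPAE", "VS", "EX"]
ordered = sorted(domains, key=lambda d: (SDTM_CLASS.get(d, 9), d))
print(ordered)  # ['TS', 'DM', 'EX', 'AE', 'LB', 'VS', 'SUPPAE']
```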
• The order of domains is the same as the order in the csdrg/adrg.
The order of domains in the Define.xml file needs to be the same as the order in the csdrg/adrg. This ensures consistency and accuracy in the submission and helps the reviewer navigate the submission more efficiently.
• The key variables are valid natural keys, not surrogate keys.
The key variables declared for each dataset need to be valid natural keys rather than surrogate keys. For example, USUBJID combined with topic and timing variables forms a natural key, while a sequence variable such as AESEQ is a surrogate key and should not be relied on in the key structure.
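This can also be probed programmatically. The sketch below, assuming ODM conventions where key variables are ItemRefs carrying a KeySequence attribute, flags any dataset whose declared keys include a --SEQ variable.

```python
# Minimal sketch: flag datasets whose declared key structure includes a
# surrogate --SEQ variable. In ODM, key variables are ItemRefs carrying
# a KeySequence attribute. The file name is an assumption.
import xml.etree.ElementTree as ET

ODM = "{http://www.cdisc.org/ns/odm/v1.3}"
root = ET.parse("define.xml").getroot()

# Map ItemOID -> variable name so key ItemRefs can be resolved.
names = {i.get("OID"): i.get("Name") for i in root.iter(f"{ODM}ItemDef")}

for ig in root.iter(f"{ODM}ItemGroupDef"):
    keys = [names.get(r.get("ItemOID"), "?")
            for r in ig.findall(f"{ODM}ItemRef") if r.get("KeySequence")]
    if any(k.endswith("SEQ") for k in keys):
        print(f"{ig.get('Name')}: surrogate key variable in {keys}")
```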
In addition to these elements, it is essential to ensure that the Define.xml file has been validated using the Pinnacle 21 validator and that no major issues or rejection criteria are present. All remaining validation messages need to be appropriately explained in the csdrg/adrg as applicable.
In conclusion, conducting individual dataset- and variable-level metadata checks is critical to ensuring the accuracy, consistency, and validity of the data described in the Define.xml file. By carefully checking the list of variables, data types, attributes, order, and descriptions, reviewers can identify and correct any errors or inconsistencies. Furthermore, avoiding irrelevant or misleading information in the Define.xml file ensures that reviewers can focus on the critical aspects of the study results, leading to a successful review process.