Streamlining Define.xml Generation: Practical Approach Using SDTM Specifications and Excel
INTRODUCTION
The Clinical Data Interchange Standards Consortium (CDISC) has been pivotal in standardizing the structure of clinical trial data in the pharmaceutical industry. A key product of these efforts is Define.xml (the Case Report Tabulation Data Definition Specification, CRT-DDS), which the Food and Drug Administration (FDA) requires for drug submissions. Define.xml describes the structure and contents of the data collected during a clinical trial, which supports automation and streamlines regulatory review. It is based on the CDISC Operational Data Model (ODM), available at http://www.cdisc.org/standards/index.html.
Creating the code for Define.xml presents three challenges that SAS programmers typically grapple with:
- Basic understanding of XML
- Comprehensive understanding of the CDISC-specific XML structure of Define.xml
- Proficiency in SAS to generate the XML code
The first two are intrinsic challenges that cannot be circumvented. However, for the third challenge, alternative methods can be employed. Rather than using SAS or XML tools, SDTM specifications and Microsoft Excel can be leveraged to program Define.xml in a more practical and efficient way.
PROCESS FLOW OF DEFINE.XML CODE GENERATION
The process flow comprises three major steps:
- XPT File Generation
- Annotated CRF Generation
- Use of SDTM Specifications to Generate Code of Define.xml
STEP 1. XPT FILE GENERATION
Before creating the Define.xml code, the SDTM datasets need to be converted into .xpt (SAS transport) files. The SAS XPORT engine handles this conversion, using either a DATA step or PROC COPY.
Here’s how it’s done:
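/* Library that holds the SDTM datasets */
LIBNAME source 'SAS-data-library';
/* The XPORT engine writes a SAS transport file */
LIBNAME xportout XPORT 'transport-file';
/* Copy dataset xyz into the transport file */
DATA xportout.xyz;
   SET source.xyz;
RUN;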
Alternatively, use the PROC COPY method:
/* Copies every dataset in the source library into the transport file */
PROC COPY IN=source OUT=xportout MEMTYPE=DATA;
RUN;
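As a quick sanity check after either method, the first bytes of the output file can be inspected. The sketch below is written in Python rather than SAS and is not part of the original workflow. It relies on the fact that a SAS Version 5 transport file, which is what the XPORT engine writes, begins with a fixed library header record. The function name looks_like_xpt_v5 and the path ae.xpt are hypothetical.

```python
from pathlib import Path

# First 48 bytes of the library header record that opens a SAS V5 transport file.
XPT_V5_PREFIX = b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"


def looks_like_xpt_v5(path):
    """Return True if the file starts with the V5 transport library header."""
    with open(path, "rb") as handle:
        return handle.read(len(XPT_V5_PREFIX)) == XPT_V5_PREFIX


if __name__ == "__main__":
    # "ae.xpt" is a placeholder for a file produced in Step 1.
    candidate = Path("ae.xpt")
    if candidate.exists():
        print(candidate, "looks like XPT v5:", looks_like_xpt_v5(candidate))
    else:
        print(candidate, "not found; nothing to check")
```

This only confirms the file type; it does not validate the datasets inside the transport file.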
STEP 2. ANNOTATED CRF GENERATION
An annotated CRF, prepared by the data management team, is available in most clinical trials. However, variable attributes may be modified across the SDTM datasets according to the SDTM specifications (which in turn depend on the study's statistical analysis plan, or SAP), so these changes need to be incorporated into the annotated CRF for Define.xml.
STEP 3. USE SDTM SPECIFICATIONS TO GENERATE CODE OF DEFINE.XML
Define.xml generally comprises four sections:
- Table of Contents (TOC, or Data Metadata)
- Collection of Data Definition Tables (Variable Level Metadata)
- Controlled Terminology
- ODM XML Header, Study, and MetaDataVersion
The first two sections constitute the main part of Define.xml.
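To make the relationship between these sections concrete, the following Python sketch builds an empty shell with the ODM nesting (ODM, Study, MetaDataVersion) and one placeholder element per content section. It is an illustration only: namespaces, the GlobalVariables block, and all required attributes other than OID are omitted, and the names build_skeleton, STUDY.X, and MDV.X are hypothetical.

```python
import xml.etree.ElementTree as ET

# ODM element that carries each content section, in document order.
SECTION_ELEMENTS = {
    "Table of Contents": "ItemGroupDef",
    "Variable Level Metadata": "ItemDef",
    "Controlled Terminology": "CodeList",
}


def build_skeleton(study_oid, mdv_oid):
    """Build an ODM > Study > MetaDataVersion shell with one placeholder per section."""
    odm = ET.Element("ODM")
    study = ET.SubElement(odm, "Study", OID=study_oid)
    mdv = ET.SubElement(study, "MetaDataVersion", OID=mdv_oid)
    for tag in SECTION_ELEMENTS.values():
        # Placeholder only; a real file has one element per dataset,
        # variable, or code list.
        ET.SubElement(mdv, tag)
    return odm


if __name__ == "__main__":
    root = build_skeleton("STUDY.X", "MDV.X")
    print(ET.tostring(root, encoding="unicode"))
```

The point of the shell is that the header section wraps the other three: everything generated from the SDTM specifications ends up inside MetaDataVersion.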
GENERATE THE TOC SECTION
The TOC catalogues all the datasets (domains) included in the drug submission. It’s fairly straightforward to create an Excel sheet for the TOC based on the SDTM specifications and the SDTM IG.
The final column of the sheet generates a hyperlink to the XPT files created earlier. The ODM (Operational Data Model) element ItemGroupDef is used to generate the XML code for the TOC section. Here’s an example for the AE domain:
<ItemGroupDef OID="AE"
    Name="AE"
    Repeating="Yes"
    IsReferenceData="No"
    Purpose="Tabulation"
    def:Label="Adverse Events"
    def:Structure="One record per adverse event per subject"
    def:DomainKeys="STUDYID, USUBJID, AEDECOD, AESTDTC"
    def:Class="Events"
    def:ArchiveLocationID="LOCATION.AE">
  … …
  <def:leaf ID="LOCATION.AE"
      xlink:href="ae.xpt">
    <def:title>ae.xpt</def:title>
  </def:leaf>
</ItemGroupDef>
This creates two hyperlinks, ‘Adverse Events’ and ‘ae.xpt’. The first links to the variable level metadata section for the domain, and the second links to the domain’s XPT file.
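The same row-to-element mapping can be sketched in code. The Python example below is a stand-in for the string concatenation that would be done in the Excel sheet, not the method used in this paper. It turns one TOC row into an ItemGroupDef string. The dictionary keys (domain, label, keys, and so on) are hypothetical column names, and the XML comment is a placeholder for the ItemRef content elided in the example above.

```python
from xml.sax.saxutils import escape, quoteattr

# One TOC row, with hypothetical column names; values follow the AE example.
AE_ROW = {
    "domain": "AE",
    "repeating": "Yes",
    "is_reference_data": "No",
    "purpose": "Tabulation",
    "label": "Adverse Events",
    "structure": "One record per adverse event per subject",
    "keys": "STUDYID, USUBJID, AEDECOD, AESTDTC",
    "class": "Events",
}


def item_group_def(row):
    """Render one TOC row as an ItemGroupDef element string."""
    domain = row["domain"].strip()
    location = f"LOCATION.{domain}"
    xpt_name = f"{domain.lower()}.xpt"
    attributes = [
        ("OID", domain),
        ("Name", domain),
        ("Repeating", row["repeating"]),
        ("IsReferenceData", row["is_reference_data"]),
        ("Purpose", row["purpose"]),
        ("def:Label", row["label"]),
        ("def:Structure", row["structure"]),
        ("def:DomainKeys", row["keys"]),
        ("def:Class", row["class"]),
        ("def:ArchiveLocationID", location),
    ]
    # quoteattr escapes &, < and > and wraps the value in quotes;
    # strip() removes stray spaces that often creep into spreadsheet cells.
    attr_text = "\n    ".join(
        f"{name}={quoteattr(value.strip())}" for name, value in attributes
    )
    return (
        f"<ItemGroupDef {attr_text}>\n"
        "  <!-- ItemRef elements for the domain's variables go here -->\n"
        f"  <def:leaf ID={quoteattr(location)}\n"
        f"      xlink:href={quoteattr(xpt_name)}>\n"
        f"    <def:title>{escape(xpt_name)}</def:title>\n"
        "  </def:leaf>\n"
        "</ItemGroupDef>"
    )


if __name__ == "__main__":
    print(item_group_def(AE_ROW))
```

Deriving the leaf ID and the file name from the domain code keeps the ArchiveLocationID attribute and the def:leaf ID in agreement, which is easy to get wrong when the values are typed by hand.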
CONCLUSION
Define.xml generation may seem daunting at first, but it becomes manageable when broken into three steps: XPT file generation, annotated CRF generation, and code generation from the SDTM specifications. Using a familiar tool such as Excel for the last step further simplifies the work, improving efficiency and reducing errors.