A data step creates a SAS dataset, often a temporary one. The dataset is usually the result of extracting or manipulating data from a database, another SAS dataset, or an external raw file. Data step processing consists of a compilation phase and an execution phase. During the compilation phase, each statement in the data step is scanned for syntax errors. The input buffer (needed only when raw data is read) and the Program Data Vector (PDV) are created during this phase, and the descriptor portion of the SAS dataset is created at the end of it.
Program Data Vector
The Program Data Vector is a logical area of memory that is created during the compilation phase and filled during the execution phase. SAS builds a SAS dataset by reading one observation at a time into the PDV.
The program data vector contains two types of variables.
. Permanent (dataset and computed variables)
. Temporary (automatic and option defined).
Temporary variables are never written to the output dataset. There are two types:
. Automatic (_N_ and _ERROR_)
. Option defined (e.g., FIRST.by-variable, LAST.by-variable, IN=variable, END=variable).
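The contents of the PDV can be inspected by writing it to the log. The following is a minimal sketch, assuming the SASHELP.CLASS sample dataset is available; the names work.class_sorted, work.pdv_demo, last_obs and ratio are chosen here for illustration only.

```sas
/* Sort a copy of the sample data so that BY-group processing is valid */
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

data work.pdv_demo;
    set work.class_sorted end=last_obs;  /* END= defines the temporary variable last_obs */
    by sex;                              /* BY defines FIRST.sex and LAST.sex */
    ratio = weight / height;             /* computed, permanent variable */

    /* Write every variable in the PDV to the log, temporary ones included */
    if first.sex or last_obs then put _all_;
run;
```

The log lines should list the dataset variables and ratio together with last_obs, FIRST.sex, LAST.sex, _N_ and _ERROR_, while work.pdv_demo itself holds only the permanent variables.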
Using the DROP / KEEP statement
The DROP statement lists the variables that are not written to the output dataset. It applies to all the output datasets in a data step.
The KEEP statement lists the variables that are written to the output datasets; all other variables are left out.
Variables that are named on a DROP statement, or omitted from a KEEP statement, remain in the PDV and can still be used until the end of data step processing. They are only excluded when the observation is written out.
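As an illustration of the point above, here is a short sketch in which dropped variables are still used in a calculation. It assumes the SASHELP.CLASS sample dataset; work.class_ratio and ratio are illustrative names.

```sas
data work.class_ratio;
    set sashelp.class;
    drop height weight;        /* excluded from the output dataset only */
    ratio = weight / height;   /* both variables are still available in the PDV */
run;
```

The output dataset should contain name, sex, age and ratio, but not height or weight.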
KEEP= / DROP= data set option
When the DROP= option is used in the SET statement, it specifies the variables that are not read from the input dataset into the Program Data Vector. When the DROP= option is used in the first line of the data step code (the DATA statement), it lists the variables in the Program Data Vector that are not written to the output dataset.
The KEEP= option can also be used in the SET statement, where it lists the variables that are read from the input dataset into the Program Data Vector. When the KEEP= option is used in the DATA statement, it specifies the variables that are written from the Program Data Vector to the output dataset.
The terminology is also important. When KEEP= or DROP= appears in parentheses after a dataset name inside another statement, it is called a data set option. When KEEP or DROP stands alone, it is called a statement.
The same distinction applies to WHERE, which exists both as a statement and as the WHERE= data set option. IF exists only as a statement.
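The two positions of these options can be combined in one step. The sketch below assumes the SASHELP.CLASS sample dataset; work.class_small and ratio are illustrative names.

```sas
data work.class_small (drop=height weight);       /* in the PDV, but not written out */
    set sashelp.class (keep=name height weight);  /* only these three enter the PDV */
    ratio = weight / height;
run;
```

Here sex and age never reach the PDV, height and weight reach it but are not written, and the output dataset should contain only name and ratio.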
WHERE and IF
The WHERE statement selects observations before they are read into the PDV.
Therefore, when the WHERE statement is used with a BY statement, the WHERE statement is executed first, and the BY groups are formed from the selected observations only.
Variables that are created in the data step cannot be used in a WHERE statement. The WHERE statement also cannot be used to select records from an external file that contains raw data (for example, an XML file).
Overall, the WHERE statement can improve the efficiency of a SAS program because observations that fail the condition are never loaded into the PDV.
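The interaction between WHERE and BY can be sketched as follows, assuming the SASHELP.CLASS sample dataset. The names work.class_by_sex, work.where_demo and first_in_group are illustrative, and the age threshold is arbitrary.

```sas
proc sort data=sashelp.class out=work.class_by_sex;
    by sex;
run;

data work.where_demo;
    set work.class_by_sex;
    where age >= 14;            /* applied before observations reach the PDV */
    by sex;
    first_in_group = first.sex; /* copy the temporary flag into a permanent variable */
run;
```

Because the filter runs first, FIRST.sex is evaluated over the selected observations, so each group that has any selected observation should have exactly one row with first_in_group equal to 1.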
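The subsetting IF statement selects observations after they have been read into the PDV.
When the IF statement is used with a BY statement, the BY statement is processed first, so the FIRST. and LAST. variables are set from all the observations that were read, not only from those that pass the IF condition.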
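A sketch of this ordering, assuming the SASHELP.CLASS sample dataset; work.class_by_sex, work.if_demo and first_in_group are illustrative names, and the age threshold is arbitrary.

```sas
proc sort data=sashelp.class out=work.class_by_sex;
    by sex;
run;

data work.if_demo;
    set work.class_by_sex;
    by sex;                     /* FIRST.sex is set from every observation read */
    first_in_group = first.sex; /* copy the temporary flag into a permanent variable */
    if age >= 14;               /* subsetting IF: applied once the values are in the PDV */
run;
```

If the first observation of a group fails the IF condition, no retained row in that group has first_in_group equal to 1, because the flag was assigned before the selection took place.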
Because the IF statement works with values that are already in the PDV, it can select observations based on variables created in the data step. The IF statement can also be used to select records read from an external file through the PDV (for example, an XML file).
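Both uses can be shown in one small sketch. The in-stream data lines stand in for an external raw file, and the dataset name, variable names, values and threshold are invented for illustration.

```sas
data work.high_ratio;
    input name $ height weight;  /* raw records are read through the input buffer */
    ratio = weight / height;     /* variable created in the data step */
    if ratio > 1.5;              /* subsetting IF on the computed variable */
    datalines;
Alice 56.5 84.0
Bob 64.0 110.0
Carol 60.0 95.0
;
run;
```

A WHERE statement could not replace the IF here, both because ratio does not exist in any input dataset and because the records come from raw data rather than from a SAS dataset.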