Regular Expressions: A Powerful Tool for Text String Manipulation in SAS
Regular expressions (RX) are a powerful tool for searching and manipulating text strings. In particular, Perl regular expressions (PRX), introduced in SAS version 9, provide an easy-to-use and versatile solution for complex string manipulation tasks, especially when dealing with unstructured text strings.
While SAS already has a rich set of string manipulation functions, some text patterns are too complex for the traditional character functions to handle. This is where PRX comes in handy. PRX can search for and extract multiple pattern matches from a text string in a single step, and it can also perform several string replacements.
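To make the idea of extracting several matches and replacing several substrings concrete, here is a minimal sketch. It is written in Python, whose `re` module follows Perl-style syntax for the constructs used, purely as a stand-in; it is not SAS code. The sample text and the names `extract_dates` and `mask_dates` are hypothetical, and the pattern itself should carry over to the PRX functions when written between slashes.

```python
import re

# Hypothetical free-text note; not taken from the article.
note = "Visit 1 on 2021-03-04, visit 2 on 2021-04-15, follow-up pending."

# Four digits, a hyphen, two digits, a hyphen, two digits.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_dates(text):
    """Return every date-like substring found in one scan of the text."""
    return DATE.findall(text)


def mask_dates(text):
    """Replace every date-like substring with a placeholder in one call."""
    return DATE.sub("<DATE>", text)


dates = extract_dates(note)
masked = mask_dates(note)
print(dates)
print(masked)
```

A single pattern drives both the extraction and the replacement, which is the convenience the paragraph above describes.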
In addition to the traditional RX functions and call routines such as RXPARSE, RXMATCH, and CALL RXCHANGE, SAS version 9 introduced PRX functions and call routines, such as PRXPARSE, PRXCHANGE, PRXMATCH, CALL PRXCHANGE, and CALL PRXSUBSTR. These functions can be extended to the SAS macro environment through the use of %SYSFUNC and %SYSCALL, making PRX a highly useful tool for text string manipulation in SAS.
Simple Word Matching
The simplest form of regular expression is a word or a string of characters. A regular expression consisting of a word matches any string containing that word. For example, /world/ matches any string that contains the character sequence “world” anywhere inside it, including as part of a longer word such as “worldwide”. Matching is case-sensitive by default, so /world/ does not match “World”.
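A minimal sketch of simple word matching follows. It uses Python's `re` module as a stand-in, since its syntax for a literal pattern is the same as Perl's; the sample strings are hypothetical. In Perl and SAS the pattern is written between slashes, whereas Python takes the bare pattern as a string.

```python
import re

# Equivalent of the Perl-style pattern /world/.
pattern = re.compile(r"world")

# Hypothetical sample strings.
samples = ["Hello world", "worldwide web", "Hello World", "word"]

# search() looks for the pattern anywhere in the string.
results = {s: bool(pattern.search(s)) for s in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

The pattern matches inside "worldwide" because it looks for a character sequence, not a standalone word, and it misses "World" because matching is case-sensitive.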
Using Character Classes
A character class allows a set of possible characters, rather than just a single character, to match at a particular point in a regular expression. Character classes are denoted by square brackets [...] with the set of characters to be matched inside. For example, /[bcr]at/ matches ‘bat’, ‘cat’, and ‘rat’. There are several abbreviations for common character classes: \d (a digit), \s (a whitespace character), \w (a word character: a letter, digit, or underscore), and their negations \D, \S, and \W. The period ‘.’ matches any single character except a newline.
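The character classes above can be tried out with a short sketch. Python's `re` module is used as a stand-in because it shares this part of the Perl syntax; the sample strings and variable names are hypothetical.

```python
import re

# [bcr]at: one of b, c, or r, followed by "at".
rhymes = re.findall(r"[bcr]at", "bat cat hat rat mat")

# \d is a digit and \s a whitespace character;
# the capitalised form \D matches anything that is not a digit.
digits = re.findall(r"\d", "Room 42B")
non_digits = re.findall(r"\D", "42B")
spaces = re.findall(r"\s", "a b\tc")

# The period matches any single character except a newline,
# so "ct" (nothing between c and t) is not matched.
dot = re.findall(r"c.t", "cat cot c-t ct")

print(rhymes, digits, non_digits, spaces, dot)
```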
Alternation and Grouping
The alternation metacharacter “|” allows a regular expression to match different possible words or character strings. Used on its own, it alternates between whole regular expressions: /cat|dog/ matches either ‘cat’ or ‘dog’. To alternate only part of a regular expression, the grouping metacharacters ( ) are needed as well. Grouping allows parts of a regular expression to be treated as a single unit. For example, /c(a|o)t/ matches ‘cat’ and ‘cot’.
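The difference between alternating a whole expression and alternating a grouped part can be seen in a small sketch. Python's `re` module stands in for Perl-style syntax here, and the sample text and variable names are hypothetical.

```python
import re

text = "cat cot cut dog"

# Alternation over the whole expression: either word matches.
whole = re.findall(r"cat|dog", text)

# Grouping limits the alternation to one part: c, then a or o, then t.
# finditer is used so that the full match, not just the group, is kept.
part = [m.group(0) for m in re.finditer(r"c(a|o)t", text)]

# Without the parentheses, ca|ot means "ca" or "ot".
ungrouped = re.findall(r"ca|ot", text)

print(whole, part, ungrouped)
```

The last case shows why the parentheses matter: without them the alternation splits the entire pattern into two independent alternatives.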
Matching Repetitions
The quantifier metacharacters ?, *, +, and {} specify how many times a portion of a regular expression may repeat and still count as a match. A quantifier is placed immediately after the character, character class, or grouping it applies to. ? matches 0 or 1 times, * matches 0 or more times, + matches 1 or more times, {n} matches exactly n times, {n,} matches at least n times, and {n,m} matches at least n but no more than m times.
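Each quantifier can be checked against a few strings in a minimal sketch. Python's `re` module is used as a stand-in for Perl-style syntax; the helper `full` and the sample strings are hypothetical. The helper requires the whole string to match so that the repetition counts are easy to see.

```python
import re


def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# ? : zero or one occurrence of the preceding item
optional = [full(r"colou?r", s) for s in ("color", "colour", "colouur")]

# * : zero or more occurrences
star = [full(r"ab*c", s) for s in ("ac", "abc", "abbbc")]

# + : one or more occurrences
plus = [full(r"ab+c", s) for s in ("ac", "abc", "abbbc")]

# {n} : exactly n occurrences
exact = [full(r"\d{3}", s) for s in ("12", "123", "1234")]

# {n,m} : at least n and at most m occurrences
ranged = [full(r"\d{2,3}", s) for s in ("1", "12", "123", "1234")]

print(optional, star, plus, exact, ranged)
```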
Position Matching
Perl has another set of special characters, ^, $, \b, and \B, that do not match any character at all but represent a particular place in a string. These special characters allow text to be matched at specific locations in a string, which is a major advantage of regular expressions over other text matching functions. ^ matches the beginning of a line, $ matches the end of a line, \b matches a word boundary (the position between a word character and a non-word character), and \B matches any position that is not a word boundary.
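The position metacharacters can be illustrated with a single-line string, for which the start and end of the line coincide with the start and end of the text. This is a minimal sketch using Python's `re` module as a stand-in for Perl-style syntax; the sample text and variable names are hypothetical, and the positions reported are Python's zero-based offsets.

```python
import re

text = "cat concatenate cat"

# ^ anchors the match at the start of the text, $ at the end.
starts = bool(re.search(r"^cat", text))
ends = bool(re.search(r"cat$", text))
ends_with_dog = bool(re.search(r"dog$", text))

# \b requires a word boundary on each side:
# only the two standalone words match.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B requires that there be no boundary:
# only the "cat" embedded in "concatenate" matches.
embedded = [m.start() for m in re.finditer(r"\Bcat\B", text)]

print(starts, ends, ends_with_dog, whole_words, embedded)
```

None of these metacharacters consumes a character; they only constrain where the surrounding pattern is allowed to match.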
In conclusion, Perl regular expressions provide a flexible and powerful way to perform pattern matching and string manipulation tasks. With the knowledge of these basic features, one can start to construct their own regular expressions to solve a variety of text-related problems.