In a pattern match, you may match any character that has--or that does not have--a particular property. There are four ways to specify character classes. You may specify a character class in the traditional way using square brackets and enumerating the possible characters, or you may use any of three mnemonic shortcuts: the classic Perl classes, the new Perl Unicode properties, or the standard POSIX classes. Each of these shortcuts matches only one character from its set. Quantify them to match larger expanses, such as \d+ to match one or more digits. (An easy mistake is to think that \w matches a word. Use \w+ to match a word.)
An enumerated list of characters in square brackets is called a character class and matches any one of the characters in the list. For example, [aeiouy] matches a letter that can be a vowel in English. (For Welsh add a "w", for Scottish an "r".) To match a right square bracket, either backslash it or place it first in the list.
Character ranges may be indicated using a hyphen and the a-z notation. Multiple ranges may be combined; for example, [0-9a-fA-F] matches one hex "digit". You may use a backslash to protect a hyphen that would otherwise be interpreted as a range delimiter, or just put it at the beginning or end of the class (a practice which is arguably less readable but more traditional).
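As a small sketch of the points so far (single-character matching, quantifiers, combined ranges, and literal brackets and hyphens), consider the following. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# \w matches exactly one word character; \w+ matches a run of them.
my ($one)  = "camel herd" =~ /(\w)/;
my ($word) = "camel herd" =~ /(\w+)/;

# Two ranges and a quantifier: a run of hex digits after the "#".
my ($hex) = "color: #FF8800;" =~ /#([0-9a-fA-F]+)/;

# A right bracket placed first and a hyphen placed last are both literal.
my @literals = "a]b-c" =~ /([]-])/g;

print "one=$one word=$word hex=$hex literals=@literals\n";
```

The last pattern is legal but hard to read, which is one argument for backslashing both characters instead.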
A caret (or circumflex, or hat, or up arrow) at the front of the character class inverts the class, causing it to match any single character not in the list. (To match a caret, either don't put it first, or better, escape it with a backslash.) For example, [^aeiouy] matches any character that isn't a vowel. Be careful with character class negation, though, because the universe of characters is expanding. For example, that character class matches consonants--and also matches spaces, newlines, and anything (including vowels) in Cyrillic, Greek, or nearly any other script, not to mention every ideograph in Chinese, Japanese, and Korean. And someday maybe even Cirth, Tengwar, and Klingon. (Linear B and Etruscan, for sure.) So it might be better to specify your consonants explicitly, such as [cbdfghjklmnpqrstvwxyz], or [b-df-hj-np-tv-z] for short. (This also solves the issue of "y" needing to be in two places at once, which a set complement would preclude.)
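To see how much more a negated class lets through than an explicit one, here is a rough comparison. The helper name consonant_checks and the test characters are assumptions chosen for this sketch.

```perl
use strict;
use warnings;

# Returns (negated, explicit): does the character match the negated vowel
# class, and does it match the explicit consonant class?
sub consonant_checks {
    my ($char) = @_;
    my $negated  = $char =~ /\A[^aeiouy]\z/         ? 1 : 0;
    my $explicit = $char =~ /\A[b-df-hj-np-tv-z]\z/ ? 1 : 0;
    return ($negated, $explicit);
}

# "t", a space, a newline, and U+03B1 GREEK SMALL LETTER ALPHA (a vowel).
for my $char ("t", " ", "\n", "\x{3B1}") {
    my ($negated, $explicit) = consonant_checks($char);
    printf "U+%04X negated=%d explicit=%d\n", ord($char), $negated, $explicit;
}
```

Only "t" satisfies both classes; the space, the newline, and the Greek vowel all slip through the negated one.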
Normal character metasymbols are supported inside a character class (see "Specific Characters"), such as \n, \t, \cX, \NNN, and \N{NAME}. Additionally, you may use \b within a character class to mean a backspace, just as it does in a double-quoted string. Normally, in a pattern match, it means a word boundary. But zero-width assertions don't make any sense in character classes, so here \b returns to its normal meaning in strings. You may also use any predefined character class described later in the chapter (classic, Unicode, or POSIX), but don't try to use them as endpoints of a range--that doesn't make sense, so the "-" will be interpreted literally.
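The two meanings of \b can be told apart by counting matches in a string that contains a real backspace. This is only a sketch, and the sample text is invented.

```perl
use strict;
use warnings;

my $text = "ab\bc";    # in a double-quoted string, \b is a backspace (chr 8)

# Inside brackets, \b is the backspace character itself.
my $backspaces = () = $text =~ /[\b]/g;

# Outside brackets, \b is a zero-width word boundary.
my $boundaries = () = $text =~ /\b/g;

# Other character metasymbols keep their meaning inside a class.
my $has_control = "tab\there" =~ /[\t\n]/ ? 1 : 0;

print "backspaces=$backspaces boundaries=$boundaries control=$has_control\n";
```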
All other metasymbols lose their special meaning inside square brackets. In particular, you can't use any of the three generic wildcards: ".", \X, or \C. The first often surprises people, but it doesn't make much sense to use the universal character class within a restricted one, and you often want to match a literal dot as part of a character class--when you're matching filenames, for instance. It's also meaningless to specify quantifiers, assertions, or alternation inside a character class, since the characters are interpreted individually. For example, [fee|fie|foe|foo] means the same thing as [feio|].
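One way to convince yourself that alternation is meaningless inside brackets is to enumerate what each class actually matches. The helper members below is hypothetical and looks only at printable ASCII.

```perl
use strict;
use warnings;

# Collect every printable ASCII character that a pattern matches.
sub members {
    my ($pattern) = @_;
    return join "", grep { /$pattern/ } map { chr($_) } 32 .. 126;
}

my $verbose = members(qr/[fee|fie|foe|foo]/);
my $compact = members(qr/[feio|]/);
print "same set: '$verbose'\n" if $verbose eq $compact;

# A dot in a class is a literal dot, not a wildcard.
for my $name ("notes.txt", "notes txt") {
    printf "%-10s %s\n", $name, ($name =~ /\A[\w.]+\z/ ? "ok" : "rejected");
}
```

If the dot were still a wildcard inside the brackets, the second filename would be accepted too.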
Since the beginning, Perl has provided a number of character class shortcuts. These are listed in Table 5-8. All of them are backslashed alphabetic metasymbols, and in each case, the uppercase version is the negation of the lowercase version. The meanings of these are not quite as fixed as you might expect; the meanings can be influenced by locale settings. Even if you don't use locales, the meanings can change whenever a new Unicode standard comes out, adding scripts with new digits and letters. (To keep the old byte meanings, you can always use bytes. For explanations of the utf8 meanings, see "Unicode Properties" later in this chapter. In any case, the utf8 meanings are a superset of the byte meanings.)
Symbol | Meaning | As Bytes | As utf8 |
---|---|---|---|
\d | Digit | [0-9] | \p{IsDigit} |
\D | Nondigit | [^0-9] | \P{IsDigit} |
\s | Whitespace | [ \t\n\r\f] | \p{IsSpace} |
\S | Nonwhitespace | [^ \t\n\r\f] | \P{IsSpace} |
\w | Word character | [a-zA-Z0-9_] | \p{IsWord} |
\W | Non-(word character) | [^a-zA-Z0-9_] | \P{IsWord} |
(Yes, we know most words don't have numbers or underscores in them; \w is for matching "words" in the sense of tokens in a typical programming language. Or Perl, for that matter.)
These metasymbols may be used either outside or inside square brackets, that is, either standalone or as part of a constructed character class:
    if ($var =~ /\D/)        { warn "contains non-digit" }
    if ($var =~ /[^\w\s.]/)  { warn "contains non-(word, space, dot)" }
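The difference between the byte and utf8 meanings shows up as soon as a digit from another script appears. The following sketch assumes a perl recent enough to apply Unicode rules to characters above U+00FF without further declarations; the sample strings are invented.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE
my $shortcut = $arabic_three =~ /\A\d\z/    ? 1 : 0;
my $explicit = $arabic_three =~ /\A[0-9]\z/ ? 1 : 0;
printf "shortcut=%d explicit=%d\n", $shortcut, $explicit;

# Shortcuts combine with other members inside brackets.
for my $string ("width_2 = 3.5", "50% off") {
    printf "%-14s %s\n", $string,
        ($string =~ /[^\w\s.=]/ ? "has other characters" : "word, space, dot, = only");
}
```

If you really mean the ten ASCII digits, spelling out [0-9] is the safer choice.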
Unicode properties are available using \p{PROP} and its set complement, \P{PROP}. For the rare properties with one-character names, braces are optional, as in \pN to indicate a numeric character (not necessarily decimal--Roman numerals are numeric characters too). These property classes may be used by themselves or combined in a constructed character class:
    if ($var =~ /^\p{IsAlpha}+$/)      { print "all alphabetic" }
    if ($var =~ s/[\p{Zl}\p{Zp}]/\n/g) { print "fixed newline wannabes" }

Some properties are directly defined in the Unicode standard, and some properties are composites defined by Perl, based on the standard properties. Zl and Zp are standard Unicode properties representing line separators and paragraph separators, while IsAlpha is defined by Perl to be a property class combining the standard properties Ll, Lu, Lt, and Lo (that is, letters that are lowercase, uppercase, titlecase, or other). As of version 5.6.0 of Perl, you need to use utf8 for these properties to work. This restriction will be relaxed in the future.
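Here is a short sketch of a numeric-but-not-decimal character and of the separator properties at work. It assumes a perl in which the utf8 restriction mentioned above has already been relaxed; the variable names and sample text are illustrative.

```perl
use strict;
use warnings;

my $roman_one = "\x{2160}";    # ROMAN NUMERAL ONE: numeric, but not a decimal digit
my $is_numeric = $roman_one =~ /\pN/ ? 1 : 0;
my $is_digit   = $roman_one =~ /\d/  ? 1 : 0;
print "numeric=$is_numeric digit=$is_digit\n";

# Turn Unicode line and paragraph separators into plain newlines.
my $text = "one\x{2028}two\x{2029}three";
(my $fixed = $text) =~ s/[\p{Zl}\p{Zp}]/\n/g;
print $fixed, "\n";
```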
There are a great many properties. We'll list the ones we know about, but the list is necessarily incomplete. New properties are likely to be in new versions of Unicode, and you can even define your own properties. More about that later.
The Unicode Consortium produces the online resources that turn into the various files Perl uses in its Unicode implementation. For more about these files, see Chapter 15, "Unicode". You can get a nice overview of Unicode in the document PATH_TO_PERLLIB/unicode/Unicode3.html where PATH_TO_PERLLIB is what is printed out by:
    perl -MConfig -le 'print $Config{privlib}'

Most Unicode properties are of the form \p{IsPROP}. The Is is optional, since it's so common, but you may prefer to leave it in for readability.
First, Table 5-9 lists Perl's composite properties. They're defined to be reasonably close to the standard POSIX definitions for character classes.
Property | Equivalent |
---|---|
IsASCII | [\x00-\x7f] |
IsAlnum | [\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}] |
IsAlpha | [\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}] |
IsCntrl | \p{IsC} |
IsDigit | \p{Nd} |
IsGraph | [^\pC\p{IsSpace}] |
IsLower | \p{IsLl} |
IsPrint | \P{IsC} |
IsPunct | \p{IsP} |
IsSpace | [\t\n\f\r\p{IsZ}] |
IsUpper | [\p{IsLu}\p{IsLt}] |
IsWord | [_\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}] |
IsXDigit | [0-9a-fA-F] |
Perl also provides the following composites for each of the main categories of standard Unicode properties (see the next section):
Property | Meaning | Normative |
---|---|---|
IsC | Crazy control codes and such | Yes |
IsL | Letters | Partly |
IsM | Marks | Yes |
IsN | Numbers | Yes |
IsP | Punctuation | No |
IsS | Symbols | No |
IsZ | Separators (Zeparators?) | Yes |
Table 5-10 lists the most basic standard Unicode properties, derived from each character's category. No character is a member of more than one category. Some properties are normative; others are merely informative. See the Unicode Standard for the standard spiel on just how normative the normative information is, and just how informative the informative information isn't.
Property | Meaning | Normative |
---|---|---|
IsCc | Other, Control | Yes |
IsCf | Other, Format | Yes |
IsCn | Other, Not assigned | Yes |
IsCo | Other, Private Use | Yes |
IsCs | Other, Surrogate | Yes |
IsLl | Letter, Lowercase | Yes |
IsLm | Letter, Modifier | No |
IsLo | Letter, Other | No |
IsLt | Letter, Titlecase | Yes |
IsLu | Letter, Uppercase | Yes |
IsMc | Mark, Combining | Yes |
IsMe | Mark, Enclosing | Yes |
IsMn | Mark, Nonspacing | Yes |
IsNd | Number, Decimal digit | Yes |
IsNl | Number, Letter | Yes |
IsNo | Number, Other | Yes |
IsPc | Punctuation, Connector | No |
IsPd | Punctuation, Dash | No |
IsPe | Punctuation, Close | No |
IsPf | Punctuation, Final quote | No |
IsPi | Punctuation, Initial quote | No |
IsPo | Punctuation, Other | No |
IsPs | Punctuation, Open | No |
IsSc | Symbol, Currency | No |
IsSk | Symbol, Modifier | No |
IsSm | Symbol, Math | No |
IsSo | Symbol, Other | No |
IsZl | Separator, Line | Yes |
IsZp | Separator, Paragraph | Yes |
IsZs | Separator, Space | Yes |
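Because no character belongs to more than one of these categories, a simple loop can report which one applies. The sketch below uses the two-letter names without the optional Is prefix; category_of is a hypothetical helper, not something Perl provides.

```perl
use strict;
use warnings;

# The two-letter category names from the table above.
my @categories = qw(
    Cc Cf Cn Co Cs  Ll Lm Lo Lt Lu  Mc Me Mn  Nd Nl No
    Pc Pd Pe Pf Pi Po Ps  Sc Sk Sm So  Zl Zp Zs
);

# Precompile one pattern per category, such as qr/\p{Lu}/.
my %pattern = map { my $p = "\\p{$_}"; ($_ => qr/$p/) } @categories;

# Return the category that a single character falls into.
sub category_of {
    my ($char) = @_;
    for my $category (@categories) {
        return $category if $char =~ $pattern{$category};
    }
    return;
}

for my $char ("A", "z", "7", "\$", "_", " ", "+") {
    printf "U+%04X %s\n", ord($char), category_of($char);
}
```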
Another useful set of properties has to do with whether a given character can be decomposed (either canonically or compatibly) into other simpler characters. Canonical decomposition doesn't lose any formatting information. Compatibility decomposition may lose formatting information such as whether a character is a superscript.
Property | Information Lost |
---|---|
IsDecoCanon | Nothing |
IsDecoCompat | Something (one of the following) |
IsDCcircle | Circle around character |
IsDCfinal | Final position preference (Arabic) |
IsDCfont | Variant font preference |
IsDCfraction | Vulgar fraction characteristic |
IsDCinitial | Initial position preference (Arabic) |
IsDCisolated | Isolated position preference (Arabic) |
IsDCmedial | Medial position preference (Arabic) |
IsDCnarrow | Narrow characteristic |
IsDCnoBreak | Nonbreaking preference on space or hyphen |
IsDCsmall | Small characteristic |
IsDCsquare | Square around CJK character |
IsDCsub | Subscription |
IsDCsuper | Superscription |
IsDCvertical | Rotation (horizontal to vertical) |
IsDCwide | Wide characteristic |
IsDCcompat | Identity (miscellaneous) |
Here are some properties of interest to people doing bidirectional rendering:
Property | Meaning |
---|---|
IsBidiL | Left-to-right |
IsBidiLRE | Left-to-right embedding |
IsBidiLRO | Left-to-right override |
IsBidiR | Right-to-left (Arabic, Hebrew) |
IsBidiAL | Right-to-left Arabic |
IsBidiRLE | Right-to-left embedding |
IsBidiRLO | Right-to-left override |
IsBidiPDF | Pop directional format |
IsBidiEN | European number |
IsBidiES | European number separator |
IsBidiET | European number terminator |
IsBidiAN | Arabic number |
IsBidiCS | Common number separator |
IsBidiNSM | Nonspacing mark |
IsBidiBN | Boundary neutral |
IsBidiB | Paragraph separator |
IsBidiS | Segment separator |
IsBidiWS | Whitespace |
IsBidiON | Other Neutrals |
IsMirrored | Reverse when used right-to-left |
The following properties classify various syllabaries according to vowel sounds:
    IsSylA    IsSylE    IsSylO    IsSylWAA  IsSylWII
    IsSylAA   IsSylEE   IsSylOO   IsSylWC   IsSylWO
    IsSylAAI  IsSylI    IsSylU    IsSylWE   IsSylWOO
    IsSylAI   IsSylII   IsSylV    IsSylWEE  IsSylWU
    IsSylC    IsSylN    IsSylWA   IsSylWI   IsSylWV

For example, \p{IsSylA} would match \N{KATAKANA LETTER KA} but not \N{KATAKANA LETTER KU}.
Now that we've basically told you all these Unicode 3.0 properties, we should point out that a few of the more esoteric ones aren't implemented in version 5.6.0 of Perl because its implementation was based in part on Unicode 2.0, and things like the bidirectional algorithm were still being worked out. However, by the time you read this, the missing properties may well be implemented, so we listed them anyway.
Some Unicode properties are of the form \p{InSCRIPT}. (Note the distinction between Is and In.) The In properties are for testing block ranges of a particular SCRIPT. If you have a character, and you wonder whether it is written in Greek script, you could test with:

    print "It's Greek to me!\n" if chr(931) =~ /\p{InGreek}/;

That works by checking whether a character is "in" the valid range of that script type. This may be negated with \P{InSCRIPT} to find out whether something isn't in a particular script's block, such as \P{InDingbats} to test whether a string contains a non-dingbat. Block properties include the following:

    InArabic      InCyrillic    InHangulJamo  InMalayalam   InSyriac
    InArmenian    InDevanagari  InHebrew      InMongolian   InTamil
    InArrows      InDingbats    InHiragana    InMyanmar     InTelugu
    InBasicLatin  InEthiopic    InKanbun      InOgham       InThaana
    InBengali     InGeorgian    InKannada     InOriya       InThai
    InBopomofo    InGreek       InKatakana    InRunic       InTibetan
    InBoxDrawing  InGujarati    InKhmer       InSinhala     InYiRadicals
    InCherokee    InGurmukhi    InLao         InSpecials    InYiSyllables

Not to mention jawbreakers like these:

    InAlphabeticPresentationForms     InHalfwidthandFullwidthForms
    InArabicPresentationForms-A       InHangulCompatibilityJamo
    InArabicPresentationForms-B       InHangulSyllables
    InBlockElements                   InHighPrivateUseSurrogates
    InBopomofoExtended                InHighSurrogates
    InBraillePatterns                 InIdeographicDescriptionCharacters
    InCJKCompatibility                InIPAExtensions
    InCJKCompatibilityForms           InKangxiRadicals
    InCJKCompatibilityIdeographs      InLatin-1Supplement
    InCJKRadicalsSupplement           InLatinExtended-A
    InCJKSymbolsandPunctuation        InLatinExtended-B
    InCJKUnifiedIdeographs            InLatinExtendedAdditional
    InCJKUnifiedIdeographsExtensionA  InLetterlikeSymbols
    InCombiningDiacriticalMarks       InLowSurrogates
    InCombiningHalfMarks              InMathematicalOperators
    InCombiningMarksforSymbols        InMiscellaneousSymbols
    InControlPictures                 InMiscellaneousTechnical
    InCurrencySymbols                 InNumberForms
    InEnclosedAlphanumerics           InOpticalCharacterRecognition
    InEnclosedCJKLettersandMonths     InPrivateUse
    InGeneralPunctuation              InSuperscriptsandSubscripts
    InGeometricShapes                 InSmallFormVariants
    InGreekExtended                   InSpacingModifierLetters

And the winner is:

    InUnifiedCanadianAboriginalSyllabics

See PATH_TO_PERLLIB/unicode/In/*.pl to get an up-to-date listing of all of these character block properties. Note that these In properties are only testing to see if the character is in the block of characters allocated for that script. There is no guarantee that all characters in that range are defined; you also need to test against one of the Is properties discussed earlier to see if the character is defined. There is also no guarantee that a particular language doesn't use characters outside its assigned block. In particular, many European languages mix extended Latin characters with Latin-1 characters.
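The gap between "in the block" and "actually defined" can be sketched by pairing a block test with the Cn (not assigned) category. The helper greek_status is hypothetical, and the choice of U+0378 assumes that code point is still unassigned in the Unicode tables your perl ships with.

```perl
use strict;
use warnings;

# Report block membership and whether the code point is assigned.
sub greek_status {
    my ($code_point) = @_;
    my $char     = chr $code_point;
    my $in_block = $char =~ /\p{InGreek}/ ? 1 : 0;
    my $assigned = $char =~ /\P{Cn}/      ? 1 : 0;    # Cn means "not assigned"
    return ($in_block, $assigned);
}

# U+03A3 is GREEK CAPITAL LETTER SIGMA, U+0378 sits inside the Greek block
# (assumed unassigned here), and U+0041 is the Latin letter A.
for my $code_point (0x3A3, 0x378, 0x41) {
    printf "U+%04X in_block=%d assigned=%d\n",
        $code_point, greek_status($code_point);
}
```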
But hey, if you need a particular property that isn't provided, that's not a big problem. Read on.
To define your own property, you need to write a subroutine with the name of the property you want (see Chapter 6, "Subroutines"). The subroutine should be defined in the package that needs the property (see Chapter 10, "Packages"), which means that if you want to use it in multiple packages, you'll either have to import it from a module (see Chapter 11, "Modules"), or inherit it as a class method from the package in which it is defined (see Chapter 12, "Objects").
Once you've got that all settled, the subroutine should return data in the same format as the files in the PATH_TO_PERLLIB/unicode/Is directory. That is, just return a list of characters or character ranges in hexadecimal, one per line. If there is a range, the two numbers are separated by a tab. Suppose you wanted a property that would be true if your character is in the range of either of the Japanese syllabaries, known as hiragana and katakana. (Together they're known as kana.) You can just put in the two ranges like this:
    sub InKana {
        return <<'END';
    3040	309F
    30A0	30FF
    END
    }

Alternatively, you could define it in terms of existing property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

You can also do set subtraction using a "-" prefix. Suppose you only wanted the actual characters, not just the block ranges of characters. You could weed out all the undefined ones like this:

    sub IsKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

You can also start with a complemented character set using the "!" prefix:

    sub IsNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Perl itself uses exactly the same tricks to define the meanings of its "classic" character classes (like \w) when you include them in your own custom character classes (like [-.\w\s]). You might think that the more complicated you get with your rules, the slower they will run, but in fact, once Perl has calculated the bit pattern for a particular 64-bit swatch of your property, it caches it so it never has to recalculate the pattern again. (It does it in 64-bit swatches so that it doesn't even have to decode your utf8 to do its lookups.) Thus, all character classes, built-in or custom, run at essentially the same speed (fast) once they get going.
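Once such subroutines exist, the new properties are used like any built-in one. The sketch below writes the tab and newline escapes explicitly in double-quoted strings rather than in a here-document, so that the separators are visible. The helper kana_checks is made up, and U+3040 is assumed to be an unassigned code point inside the hiragana block.

```perl
use strict;
use warnings;

# Block ranges only: hex code points, tab-separated, one range per line.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}

# The same blocks minus unassigned code points.
sub IsKana {
    return "+utf8::InHiragana\n+utf8::InKatakana\n-utf8::IsCn\n";
}

# Returns (in the kana blocks?, an assigned kana character?).
sub kana_checks {
    my ($char) = @_;
    return ($char =~ /\p{InKana}/ ? 1 : 0, $char =~ /\p{IsKana}/ ? 1 : 0);
}

# HIRAGANA LETTER A, KATAKANA LETTER KA, U+3040, and the Latin letter A.
for my $char ("\x{3042}", "\x{30AB}", "\x{3040}", "A") {
    printf "U+%04X in=%d is=%d\n", ord($char), kana_checks($char);
}
```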
Unlike Perl's other character class shortcuts, the POSIX-style character-class syntax notation, [:CLASS:], is available for use only when constructing other character classes, that is, inside an additional pair of square brackets. For example, /[.,[:alpha:][:digit:]]/ will search for one character that is either a literal dot (because it's in a character class), a comma, an alphabetic character, or a digit.
The POSIX classes available as of revision 5.6 of Perl are shown in Table 5-11.
Class | Meaning |
---|---|
alnum | Any alphanumeric, that is, an alpha or a digit. |
alpha | Any letter. (That's a lot more letters than you think, unless you're thinking Unicode, in which case it's still a lot.) |
ascii | Any character with an ordinal value between 0 and 127. |
cntrl | Any control character. Usually characters that don't produce output as such, but instead control the terminal somehow; for example, newline, form feed, and backspace are all control characters. Characters with an ord value less than 32 are most often classified as control characters. |
digit | A character representing a decimal digit, such as 0 to 9. (Includes other characters under Unicode.) Equivalent to \d. |
graph | Any alphanumeric or punctuation character. |
lower | A lowercase letter. |
print | Any alphanumeric or punctuation character or space. |
punct | Any punctuation character. |
space | Any space character. Includes tab, newline, form feed, and carriage return (and a lot more under Unicode). Equivalent to \s. |
upper | Any uppercase (or titlecase) letter. |
word | Any identifier character, either an alnum or underline. |
xdigit | Any hexadecimal digit. Though this may seem silly ([0-9a-fA-F] works just fine), it is included for completeness. |
You can negate the POSIX character classes by prefixing the class name with a ^ following the [:. (This is a Perl extension.) For example:
POSIX | Classic |
---|---|
[:^digit:] | \D |
[:^space:] | \S |
[:^word:] | \W |
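A quick sketch of POSIX classes inside a constructed class, together with a check that the negated form agrees with the classic shortcut over the ASCII range. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Runs of dots, commas, letters, and digits; spaces and "!" end a run.
my @runs = "a1, b2. c3!" =~ /([.,[:alpha:][:digit:]]+)/g;
print join("|", @runs), "\n";

# [[:^digit:]] and \D should agree on every ASCII character.
my @disagree = grep {
    my $char = chr $_;
    ($char =~ /[[:^digit:]]/ ? 1 : 0) != ($char =~ /\D/ ? 1 : 0);
} 0 .. 127;
print "disagreements: ", scalar(@disagree), "\n";
```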
If the use utf8 pragma is not requested, but the use locale pragma is, the classes correlate directly with the equivalent functions in the C library's isalpha(3) interface (except for word, which is a Perl extension, mirroring \w).
If the utf8 pragma is used, POSIX character classes are exactly equivalent to the corresponding Is properties listed in Table 5-9. For example [:lower:] and \p{Lower} are equivalent, except that the POSIX classes may only be used within constructed character classes, whereas Unicode properties have no such restriction and may be used in patterns wherever Perl shortcuts like \s and \w may be used.
The brackets are part of the POSIX-style [::] construct, not part of the whole character class. This leads to writing patterns like /^[[:lower:][:digit:]]+$/, to match a string consisting entirely of lowercase letters or digits (plus an optional trailing newline). In particular, this does not work:
    42 =~ /^[:digit:]$/  # WRONG

That's because it's not inside a character class. Rather, it is a character class, the one representing the characters ":", "i", "t", "g", and "d". Perl doesn't care that you specified ":" twice.
Here's what you need instead:

    42 =~ /^[[:digit:]]+$/

The POSIX character classes [.cc.] and [=cc=] are recognized but produce an error indicating they are not supported. Trying to use any POSIX character class in older versions of Perl is likely to fail miserably, and perhaps even silently. If you're going to use POSIX character classes, it's best to require a new version of Perl by saying:

    use 5.6.0;
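Returning to the single-bracket mistake, a small sketch makes the difference concrete. The two function names are hypothetical, and the no warnings line assumes a perl new enough to warn about the mistaken form.

```perl
use strict;
use warnings;

# One pair of brackets: an ordinary class of ":", "d", "i", "g", and "t".
sub matches_single_brackets {
    my ($string) = @_;
    no warnings 'regexp';    # newer perls warn about this very mistake
    return $string =~ /^[:digit:]$/ ? 1 : 0;
}

# Two pairs: the POSIX digit class inside a constructed character class.
sub matches_double_brackets {
    my ($string) = @_;
    return $string =~ /^[[:digit:]]+$/ ? 1 : 0;
}

for my $string ("42", "d", ":") {
    printf "%-2s single=%d double=%d\n", $string,
        matches_single_brackets($string), matches_double_brackets($string);
}
```

The single-bracket form happily accepts "d" and ":" while rejecting "42", which is the opposite of what was intended.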
Copyright © 2001 O'Reilly & Associates. All rights reserved.