You want to use regular expressions on a string containing more than one line, but the special characters . (any character but newline), ^ (start of string), and $ (end of string) don't seem to work for you. This might happen if you're reading in multiline records or the whole file at once.
Use /m, /s, or both as pattern modifiers. /s lets . match newline (normally it doesn't). If the string had more than one line in it, then /foo.*bar/s could match a "foo" on one line and a "bar" on a following line. This doesn't affect dots in character classes like [#%.], since they are regular periods anyway.
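As a minimal sketch of what /s changes, here is a made-up two-line string (the variable names and sample text are ours, not part of the recipe):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record: "foo" on the first line, "bar" on the second.
my $record = "foo is here\nand bar is there\n";

# Without /s the dot stops at the newline, so the match fails.
my $without_s = ($record =~ /foo.*bar/)  ? 1 : 0;

# With /s the dot also matches "\n", so the match can span both lines.
my $with_s    = ($record =~ /foo.*bar/s) ? 1 : 0;

# A dot inside a character class is a literal period, with or without /s,
# so it does not match the newline between "a" and "b".
my $class_hit = ("a\nb" =~ /a[#%.]b/s)   ? 1 : 0;

print "without /s: $without_s\n";   # 0
print "with /s:    $with_s\n";      # 1
print "class:      $class_hit\n";   # 0
```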
The /m modifier lets ^ and $ match next to a newline. /^=head[1-7]$/m would match that pattern not just at the beginning of the record, but anywhere right after a newline as well.
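The effect of /m can be sketched the same way. The sample records below are invented for illustration; only the pattern comes from the text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record whose heading sits on the second line.
my $record = "some text\n=head1\nmore text\n";

# Without /m, ^ matches only at the start of the string.
my $plain = ($record =~ /^=head[1-7]$/)  ? 1 : 0;

# With /m, ^ and $ also match just after and just before a newline.
my $multi = ($record =~ /^=head[1-7]$/m) ? 1 : 0;

# Combined with /g in list context, /m collects every matching line.
my @heads = ("=head1\nbody\n=head2\n" =~ /^(=head[1-7])$/mg);

print "plain: $plain\n";    # 0
print "multi: $multi\n";    # 1
print "heads: @heads\n";    # =head1 =head2
```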
A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. To match across newlines, you need to make . match a newline; it ordinarily does not. In cases where newlines are important and you've read more than one line into a string, you'll probably prefer to have ^ and $ match beginning- and end-of-line, not just beginning- and end-of-string.
The difference between /m and /s is important: /m makes ^ and $ match next to a newline, while /s makes . match newlines. You can even use them together; they're not mutually exclusive options.
Example 6.2 creates a filter to strip HTML tags out of each file in @ARGV and send the results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file, because @ARGV may have several arguments in it. In this case, each read would get a whole file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just .* for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries. Using .*? in conjunction with /s solves these problems, at least in this case.
#!/usr/bin/perl
# killtags - very bad html tag killer
undef $/;           # each read is whole file
while (<>) {        # get one whole file at a time
    s/<.*?>//gs;    # strip tags (terribly)
    print;          # print file to STDOUT
}
Because the closing delimiter of a tag is just a single character, it would be much faster to use s/<[^>]*>//gs, but that's still a naïve approach: it doesn't correctly handle tags inside HTML comments or angle brackets in quotes (<IMG SRC="here.gif" ALT="<<Ooh la la!>>">). Recipe 20.6 explains how to avoid these problems.
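The failure modes mentioned above can be seen in a short sketch. The first sample string and all variable names are made up for illustration; the IMG tag is the one from the text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input: the second tag wraps across a line break.
my $html = "<b>bold</b> and <a\nhref=\"x\">link</a>\n";

(my $greedy  = $html) =~ s/<.*>//g;      # greedy dot, no /s
(my $no_s    = $html) =~ s/<.*?>//g;     # minimal, but dot stops at newline
(my $minimal = $html) =~ s/<.*?>//gs;    # minimal, and dot crosses newlines
(my $negated = $html) =~ s/<[^>]*>//gs;  # negated class, same result here

# The IMG tag from the text: angle brackets inside a quoted attribute.
my $img = q{<IMG SRC="here.gif" ALT="<<Ooh la la!>>">};
(my $mangled = $img) =~ s/<[^>]*>//gs;

print "greedy:  $greedy";     # eats "bold" along with the tags around it
print "no /s:   $no_s";       # leaves the tag that spans two lines
print "minimal: $minimal";    # bold and link
print "negated: $negated";    # bold and link
print "mangled: $mangled\n";  # debris is left behind instead of nothing
```

The last line shows why even the faster negated-class version is only a rough tool: the match ends at the first > it sees, which here is inside the quoted ALT text.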
Example 6.3 takes a plain text document and looks for lines at the start of paragraphs that look like "Chapter 20: Better Living Through Chemisery". It wraps these with an appropriate HTML level one header. Because the pattern is relatively complex, we use the /x modifier so we can embed whitespace and comments.
#!/usr/bin/perl
# headerfy: change certain chapter headers to html
$/ = '';
while ( <> ) {              # fetch a paragraph
    s{
        \A                  # start of record
        (                   # capture in $1
            Chapter         # text string
            \s+             # mandatory whitespace
            \d+             # decimal number
            \s*             # optional whitespace
            :               # a real colon
            . *             # anything not a newline till end of line
        )
    }{<H1>$1</H1>}gx;
    print;
}
Here it is as a one-liner from the command line if those extended comments just get in the way of understanding:
% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile
This problem is interesting because we need to be able to specify both start-of-record and end-of-line in the same pattern. We could normally use ^ for start-of-record, but if we also wanted $ to indicate end-of-line and not just end-of-record, we would have to add the /m modifier, which changes both ^ and $. So instead of using ^ to match beginning-of-record, we use \A, which matches only there whether or not /m is in effect. (The pattern as written needs neither $ nor /m, because without /s the .* stops at the first newline by itself. We're not using it here, but in case you're interested, the version of $ that always matches end-of-record even in the presence of /m is \Z.)
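A small sketch of how \A and \Z keep their meaning under /m, using an invented two-line paragraph (the sample text and variable names are assumptions for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical paragraph record in which both lines start with "Chapter".
my $para = "Chapter 1: Intro\nChapter 2: More\n";

# Under /m, ^ matches at the start of every line ...
my @caret = ($para =~ /^(Chapter\s+\d+)/mg);

# ... while \A still matches only at the start of the record.
my @bs_a  = ($para =~ /\A(Chapter\s+\d+)/mg);

# Under /m, $ matches before every newline ...
my @dollar = ($para =~ /(\w+)$/mg);

# ... while \Z matches only at the end of the record
# (just before a trailing newline, if there is one).
my @bs_z   = ($para =~ /(\w+)\Z/mg);

print "^  : @caret\n";    # Chapter 1 Chapter 2
print "\\A : @bs_a\n";    # Chapter 1
print "\$  : @dollar\n";  # Intro More
print "\\Z : @bs_z\n";    # More
```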
The following example uses both /s and /m together, because we want ^ to match the beginning of any line in the paragraph and also want dot to be able to match a newline. (Because they are unrelated, using them together is simply the sum of the parts. If you have the questionable habit of using "single line" as a mnemonic for /s and "multiple line" for /m, then you may think you can't use them together.) The predefined variable $. holds the record number of the filehandle most recently read. The predefined variable $ARGV holds the name of the file currently being read by implicit <ARGV> processing.
$/ = '';            # paragraph read mode for readline access
while (<ARGV>) {
    while (m#^START(.*?)^END#gsm) {  # /s makes . span line boundaries
                                     # /m makes ^ match near newlines
                                     # /g moves on to the next chunk each pass;
                                     # without it a match would loop forever
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}
If you've already committed to using the /m modifier, you can use \A and \Z to get the old meanings of ^ and $ respectively. But what if you've used the /s modifier and want to get the original meaning of . ? You can use [^\n]. If you don't care to use /s but want the notion of matching any character, you could construct a character class that matches any one byte, such as [\000-\377] or even [\d\D]. You can't use [.\n] because . is not special in a character class.
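These substitutes can be checked with a short sketch on a made-up two-line string (the variable names are ours):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "ab\ncd";

# Under /s, [^\n] recovers the old "anything but newline" meaning of dot.
my ($line)  = $text =~ /\A([^\n]*)/s;

# Without /s, [\d\D] still matches any character, newline included.
my ($whole) = $text =~ /\A([\d\D]*)/;

# [.\n] matches only a literal period or a newline, not "any character".
my $dot_class = ("x" =~ /\A[.\n]\z/) ? 1 : 0;

print "line:      $line\n";                            # ab
print "whole has newline: ", ($whole =~ /\n/ ? 1 : 0), "\n";  # 1
print "dot class: $dot_class\n";                       # 0
```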
The $/ variable in perlvar(1) and in the "Special Variables" section of Chapter 2 of Programming Perl; the /s and /m modifiers in perlre(1) and "the fine print" section of Chapter 2 of Programming Perl; the "String Anchors" section of Mastering Regular Expressions; we talk more about the special variable $/ in Chapter 8.