pcre2test (1) - Linux Manuals
NAME
pcre2test - a program for testing Perl-compatible regular expressions.
SYNOPSIS
pcre2test [options] [input file [output file]]
pcre2test is a test program for the PCRE2 regular expression libraries, but it can also be used for experimenting with regular expressions. This document describes the features of the test program; for details of the regular expressions themselves, see the pcre2pattern documentation. For details of the PCRE2 library function calls and their options, see the pcre2api documentation.
The input for pcre2test is a sequence of regular expression patterns and subject strings to be matched. There are also command lines for setting defaults and controlling some special actions. The output shows the result of each match attempt. Modifiers on external or internal command lines, the patterns, and the subject lines specify PCRE2 function options, control how the subject is processed, and what output is produced.
As the original fairly simple PCRE library evolved, it acquired many different features, and as a result, the original pcretest program ended up with a lot of options in a messy, arcane syntax, for testing all the features. The move to the new PCRE2 API provided an opportunity to re-implement the test program as pcre2test, with a cleaner modifier syntax. Nevertheless, there are still many obscure modifiers, some of which are specifically designed for use in conjunction with the test script and data files that are distributed as part of PCRE2. All the modifiers are documented here, some without much justification, but many of them are unlikely to be of use except when testing the libraries.
PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
Different versions of the PCRE2 library can be built to support character strings that are encoded in 8-bit, 16-bit, or 32-bit code units. One, two, or all three of these libraries may be simultaneously installed. The pcre2test program can be used to test all the libraries. However, its own input and output are always in 8-bit format. When testing the 16-bit or 32-bit libraries, patterns and subject strings are converted to 16- or 32-bit format before being passed to the library functions. Results are converted back to 8-bit code units for output.
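The widening and narrowing step can be pictured with a small Python sketch. It is an illustration only, not pcre2test's code: the helper names widen and narrow are hypothetical, it covers only the non-UTF case under the assumption that each input byte becomes one code unit, and it fixes little-endian byte order for readability.

```python
import struct

# Standard-size struct codes for one code unit of each width.
_UNIT_FORMAT = {8: "B", 16: "H", 32: "I"}


def widen(data: bytes, width: int) -> bytes:
    """Store each input byte in one code unit of the given width."""
    return struct.pack(f"<{len(data)}{_UNIT_FORMAT[width]}", *data)


def narrow(buffer: bytes, width: int) -> bytes:
    """Convert code units back to bytes; raises ValueError above 255."""
    count = len(buffer) // (width // 8)
    units = struct.unpack(f"<{count}{_UNIT_FORMAT[width]}", buffer)
    return bytes(units)


wide = widen(b"cat", 16)
print(len(wide))         # 6: three 16-bit code units
print(narrow(wide, 16))  # b'cat'
```

The sketch simply fails on code units above 255 when narrowing; it does not model how pcre2test displays such values.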
In the rest of this document, the names of library functions and structures are given in generic form, for example, pcre2_compile(). The actual names used in the libraries have a suffix _8, _16, or _32, as appropriate.
INPUT ENCODING
Input to pcre2test is processed line by line, either by calling the C library's fgets() function, or via the libreadline library (see below). The input is processed using C's string functions, so it must not contain binary zeroes, even though in Unix-like environments fgets() treats any bytes other than newline as data characters. In some Windows environments character 26 (hex 1A) causes an immediate end of file, and no further data is read.
For maximum portability, therefore, it is safest to avoid non-printing characters in pcre2test input files. There is a facility for specifying a pattern's characters as hexadecimal pairs, thus making it possible to include binary zeroes in a pattern for testing purposes. Subject lines are processed for backslash escapes, which makes it possible to include any data value.
COMMAND LINE OPTIONS
- -8
- If the 8-bit library has been built, this option causes it to be used (this is the default). If the 8-bit library has not been built, this option causes an error.
- -16
- If the 16-bit library has been built, this option causes it to be used. If only the 16-bit library has been built, this is the default. If the 16-bit library has not been built, this option causes an error.
- -32
- If the 32-bit library has been built, this option causes it to be used. If only the 32-bit library has been built, this is the default. If the 32-bit library has not been built, this option causes an error.
- -b
- Behave as if each pattern has the /fullbincode modifier; the full internal binary form of the pattern is output after compilation.
- -C
- Output the version number of the PCRE2 library, and all available information about the optional features that are included, and then exit with zero exit code. All other options are ignored.
- -C option
- Output information about a specific build-time option, then exit. This functionality is intended for use in scripts such as RunTest. The following options output the value and set the exit code as indicated:
    ebcdic-nl    the code for LF (= NL) in an EBCDIC environment:
                   0x15 or 0x25
                   0 if used in an ASCII environment
                   exit code is always 0
    linksize     the configured internal link size (2, 3, or 4)
                   exit code is set to the link size
    newline      the default newline setting:
                   CR, LF, CRLF, ANYCRLF, or ANY
                   exit code is always 0
    bsr          the default setting for what \R matches:
                   ANYCRLF or ANY
                   exit code is always 0
  The following options output 1 for true or 0 for false, and set the exit code to the same value:
    backslash-C  \C is supported (not locked out)
    ebcdic       compiled for an EBCDIC environment
    jit          just-in-time support is available
    pcre2-16     the 16-bit library was built
    pcre2-32     the 32-bit library was built
    pcre2-8      the 8-bit library was built
    unicode      Unicode support is available
  If an unknown option is given, an error message is output; the exit code is 0.
- -d
- Behave as if each pattern has the debug modifier; the internal form and information about the compiled pattern is output after compilation; -d is equivalent to -b -i.
- -dfa
- Behave as if each subject line has the dfa modifier; matching is done using the pcre2_dfa_match() function instead of the default pcre2_match().
- -help
- Output a brief summary of these options and then exit.
- -i
- Behave as if each pattern has the /info modifier; information about the compiled pattern is given after compilation.
- -jit
- Behave as if each pattern line has the jit modifier; after successful compilation, each pattern is passed to the just-in-time compiler, if available.
- -pattern modifier-list
- Behave as if each pattern line contains the given modifiers.
- -q
- Do not output the version number of pcre2test at the start of execution.
- -S size
- On Unix-like systems, set the size of the run-time stack to size megabytes.
- -subject modifier-list
- Behave as if each subject line contains the given modifiers.
- -t
- Run each compile and match many times with a timer, and output the resulting times per compile or match. When JIT is used, separate times are given for the initial compile and the JIT compile. You can control the number of iterations that are used for timing by following -t with a number (as a separate item on the command line). For example, "-t 1000" iterates 1000 times. The default is to iterate 500,000 times.
- -tm
- This is like -t except that it times only the matching phase, not the compile phase.
- -T -TM
- These behave like -t and -tm, but in addition, at the end of a run, the total times for all compiles and matches are output.
- -version
- Output the PCRE2 version number and then exit.
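A script that needs a build-time setting can combine the printed value and the exit code of "-C option". The following Python sketch shows one way to do that. The helper name query_build_option is hypothetical, and the code assumes only that a program called pcre2test may or may not be on the PATH.

```python
import shutil
import subprocess


def query_build_option(option, exe="pcre2test"):
    """Run `<exe> -C <option>`.

    Returns (printed value, exit code), or None if the program is not found.
    """
    path = shutil.which(exe)
    if path is None:
        return None
    proc = subprocess.run([path, "-C", option], capture_output=True, text=True)
    return proc.stdout.strip(), proc.returncode


result = query_build_option("linksize")
if result is None:
    print("pcre2test not found on PATH")
else:
    value, exit_code = result
    print(f"linksize: printed {value!r}, exit code {exit_code}")
```

For the true/false options such as jit, the exit code alone is enough, which is why they are convenient in shell conditionals.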
DESCRIPTION
If pcre2test is given two filename arguments, it reads from the first and writes to the second. If the first name is "-", input is taken from the standard input. If pcre2test is given only one argument, it reads from that file and writes to stdout. Otherwise, it reads from stdin and writes to stdout.
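The argument rules above amount to a small decision table. A minimal Python sketch follows; the function name resolve_io is hypothetical, None stands for the standard stream, and treating "-" as standard input in the one-argument case as well is an assumption.

```python
def resolve_io(filenames):
    """Map 0, 1 or 2 filename arguments to (input, output).

    None means "use stdin" or "use stdout" respectively.
    """
    if len(filenames) > 2:
        raise ValueError("at most two filenames are accepted")
    infile = filenames[0] if len(filenames) >= 1 else None
    outfile = filenames[1] if len(filenames) == 2 else None
    if infile == "-":  # "-" selects standard input
        infile = None
    return infile, outfile


print(resolve_io([]))                        # (None, None)
print(resolve_io(["tests.txt"]))             # ('tests.txt', None)
print(resolve_io(["-", "results.txt"]))      # (None, 'results.txt')
```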
When pcre2test is built, a configuration option can specify that it should be linked with the libreadline or libedit library. When this is done, if the input is from a terminal, it is read using the readline() function. This provides line-editing and history facilities. The output from the -help option states whether or not readline() will be used.
The program handles any number of tests, each of which consists of a set of input lines. Each set starts with a regular expression pattern, followed by any number of subject lines to be matched against that pattern. In between sets of test data, command lines that begin with # may appear. This file format, with some restrictions, can also be processed by the perltest.sh script that is distributed with PCRE2 as a means of checking that the behaviour of PCRE2 and Perl is the same.
When the input is a terminal, pcre2test prompts for each line of input, using "re>" to prompt for regular expression patterns, and "data>" to prompt for subject lines. Command lines starting with # can be entered only in response to the "re>" prompt.
Each subject line is matched separately and independently. If you want to do multi-line matches, you have to use the \n escape sequence (or \r or \r\n, etc., depending on the newline setting) in a single line of input to encode the newline sequences. There is no limit on the length of subject lines; the input buffer is automatically extended if it is too small. There are replication features that make it possible to generate long repetitive pattern or subject lines without having to supply them explicitly.
An empty line or the end of the file signals the end of the subject lines for a test, at which point a new pattern or command line is expected if there is still input to be read.
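The layout of an input file (a pattern, its subject lines, then an empty line) can be sketched as a small Python grouping routine. This is an illustration under simplifying assumptions, not the program's parser: the name split_tests is hypothetical, patterns are assumed to fit on one line, lines beginning with # between tests are skipped rather than interpreted, and blank lines where a pattern is expected are ignored.

```python
def split_tests(lines):
    """Group input lines into (pattern_line, [subject_lines]) pairs."""
    tests = []
    current = None  # None means a pattern or command line is expected
    for raw in lines:
        line = raw.rstrip("\n")
        if current is None:
            if line.strip() == "" or line.startswith("#"):
                continue  # blank, comment or command line between tests
            current = (line, [])
        elif line.strip() == "":
            tests.append(current)  # an empty line ends the subject lines
            current = None
        else:
            current[1].append(line)
    if current is not None:  # end of file also ends the subject lines
        tests.append(current)
    return tests


sample = ["/abc/i", "ABC", "xabcx", "", "# a comment", "/x+/", "xxx"]
for pattern, subjects in split_tests(sample):
    print(pattern, subjects)
```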
COMMAND LINES
In between sets of test data, a line that begins with # is interpreted as a command line. If the first character is followed by white space or an exclamation mark, the line is treated as a comment, and ignored. Otherwise, the following commands are recognized:
- #forbid_utf
- Subsequent patterns automatically have the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of patterns. This command also forces an error if a subsequent pattern contains any occurrences of \P, \p, or \X, which are still supported when PCRE2_UTF is not set, but which require Unicode property support to be included in the library.
This is a trigger guard that is used in test files to ensure that UTF or Unicode property tests are not accidentally added to files that are used when Unicode support is not included in the library. Setting PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained by the use of #pattern; the difference is that #forbid_utf cannot be unset, and the automatic options are not displayed in pattern information, to avoid cluttering up test output.
- #load <filename>
- This command is used to load a set of precompiled patterns from a file, as described in the section entitled "Saving and restoring compiled patterns" below.
- #newline_default [<newline-list>]
- When PCRE2 is built, a default newline convention can be specified. This determines which characters and/or character pairs are recognized as indicating a newline in a pattern or subject string. The default can be overridden when a pattern is compiled. The standard test files contain tests of various newline conventions, but the majority of the tests expect a single linefeed to be recognized as a newline by default. Without special action the tests would fail when PCRE2 is compiled with either CR or CRLF as the default newline.
The #newline_default command specifies a list of newline types that are acceptable as the default. Each type must be one of CR, LF, CRLF, ANYCRLF, or ANY (in upper or lower case), for example:
  #newline_default LF Any anyCRLF
If the default newline is in the list, this command has no effect. Otherwise, except when testing the POSIX API, a newline modifier that specifies the first newline convention in the list (LF in the above example) is added to any pattern that does not already have a newline modifier. If the newline list is empty, the feature is turned off. This command is present in a number of the standard test input files.
When the POSIX API is being tested there is no way to override the default newline convention, though it is possible to set the newline convention from within the pattern. A warning is given if the posix modifier is used when #newline_default would set a default for the non-POSIX API.
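The decision made by #newline_default can be summarised in a few lines of Python. This is a sketch of the rule as described here, not the program's code: the function name newline_modifier_to_add and its parameters are hypothetical, and the warning issued in the POSIX case is not modelled.

```python
VALID_NEWLINES = ("CR", "LF", "CRLF", "ANYCRLF", "ANY")


def newline_modifier_to_add(build_default, acceptable, pattern_modifiers,
                            posix=False):
    """Return the modifier to add to a pattern, or None if none is needed."""
    types = [t.upper() for t in acceptable]
    for t in types:
        if t not in VALID_NEWLINES:
            raise ValueError(f"invalid newline type: {t}")
    if not types:                        # empty list: feature turned off
        return None
    if build_default.upper() in types:   # default is acceptable: no effect
        return None
    if posix:                            # POSIX API: default cannot be overridden
        return None
    for modifier in pattern_modifiers:   # pattern already sets newline
        if modifier.split("=", 1)[0].strip() == "newline":
            return None
    return f"newline={types[0].lower()}"


print(newline_modifier_to_add("CR", ["LF", "Any", "anyCRLF"], ["caseless"]))
```

With a build default of CR and the list LF, Any, anyCRLF, the call above yields newline=lf, the first convention in the list.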
- #pattern <modifier-list>
- This command sets a default modifier list that applies to all subsequent patterns. Modifiers on a pattern can change these settings.
- #perltest
- The appearance of this line causes all subsequent modifier settings to be checked for compatibility with the perltest.sh script, which is used to confirm that Perl gives the same results as PCRE2. Also, apart from comment lines, none of the other command lines are permitted, because they and many of the modifiers are specific to pcre2test, and should not be used in test files that are also processed by perltest.sh. The #perltest command helps detect tests that are accidentally put in the wrong file.
- #pop [<modifiers>]
- This command is used to manipulate the stack of compiled patterns, as described in the section entitled "Saving and restoring compiled patterns" below.
- #save <filename>
- This command is used to save a set of compiled patterns to a file, as described in the section entitled "Saving and restoring compiled patterns" below.
- #subject <modifier-list>
- This command sets a default modifier list that applies to all subsequent subject lines. Modifiers on a subject line can change these settings.
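How a line beginning with # is told apart as a comment or a command can be sketched in Python as follows. The function name classify_hash_line is hypothetical, the command set lists the commands as I understand them for this version of pcre2test (check it against your build), and a lone "#" is treated as a comment by assumption.

```python
COMMANDS = {
    "#forbid_utf", "#load", "#newline_default", "#pattern",
    "#perltest", "#pop", "#save", "#subject",
}


def classify_hash_line(line):
    """Return ("comment" | "command" | "unknown" | "not-command", name)."""
    if not line.startswith("#"):
        return ("not-command", None)
    # "#" followed by white space or "!" marks a comment.
    if len(line) == 1 or line[1].isspace() or line[1] == "!":
        return ("comment", None)
    name = line.split(None, 1)[0]
    if name in COMMANDS:
        return ("command", name)
    return ("unknown", name)


for text in ["# just a note", "#!ignored", "#pattern caseless", "#bogus"]:
    print(text, "->", classify_hash_line(text))
```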
MODIFIER SYNTAX
Modifier lists are used with both pattern and subject lines. Items in a list are separated by commas followed by optional white space. Trailing white space in a modifier list is ignored. Some modifiers may be given for both patterns and subject lines, whereas others are valid only for one or the other. Each modifier has a long name, for example "anchored", and some of them must be followed by an equals sign and a value, for example, "offset=12". Values cannot contain comma characters, but may contain spaces. Modifiers that do not take values may be preceded by a minus sign to turn off a previous setting.
A few of the more common modifiers can also be specified as single letters, for example "i" for "caseless". In documentation, following the Perl convention, these are written with a slash ("the /i modifier") for clarity. Abbreviated modifiers must all be concatenated in the first item of a modifier list. If the first item is not recognized as a long modifier name, it is interpreted as a sequence of these abbreviations. For example:
  /abc/ig,newline=cr,jit=3
This is a pattern line whose modifier list starts with two one-letter modifiers (/i and /g). The lower-case abbreviated modifiers are the same as used in Perl.
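The modifier-list rules can be made concrete with a short Python sketch. It is a simplified model, not the program's parser: the name parse_modifiers is hypothetical, and the tables hold only a small subset of long names and single-letter abbreviations chosen for illustration.

```python
# A small illustrative subset of modifier names; the real list is much longer.
ABBREVIATIONS = {"i": "caseless", "g": "global", "m": "multiline",
                 "s": "dotall", "x": "extended"}
LONG_NAMES = set(ABBREVIATIONS.values()) | {"anchored", "offset",
                                            "newline", "jit"}


def parse_modifiers(text):
    """Parse a modifier list into {long_name: True | False | value}."""
    settings = {}
    if not text.strip():
        return settings
    # Items are separated by commas followed by optional white space;
    # trailing white space in the list is ignored.
    items = [item.lstrip() for item in text.rstrip().split(",")]
    for index, item in enumerate(items):
        name, has_value, value = item.partition("=")
        if has_value and name in LONG_NAMES:
            settings[name] = value
        elif name.startswith("-") and name[1:] in LONG_NAMES:
            settings[name[1:]] = False        # minus turns a setting off
        elif name in LONG_NAMES:
            settings[name] = True
        elif index == 0 and name and all(c in ABBREVIATIONS for c in name):
            for c in name:                    # first item: abbreviations
                settings[ABBREVIATIONS[c]] = True
        else:
            raise ValueError(f"unrecognized modifier: {item!r}")
    return settings


print(parse_modifiers("ig,newline=cr,jit=3"))
```

Abbreviations are accepted only in the first item, so a list such as "anchored,ig" is rejected by this sketch.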
PATTERN SYNTAX
A pattern line must start with one of the following characters (common symbols, excluding pattern meta-characters):
  / ! " ' ` - = _ : ; , % & @ ~
This is interpreted as the pattern's delimiter. A regular expression may be continued over several input lines, in which case the newline characters are included within it. It is possible to include the delimiter within the pattern by escaping it with a backslash, for example
  /abc\/def/
If you do this, the escape and the delimiter form part of the pattern, but since the delimiters are all non-alphanumeric, this does not affect its interpretation. If the terminating delimiter is immediately followed by a backslash, for example,
  /abc/\
then a backslash is added to the end of the pattern. This is done to provide a way of testing the error condition that arises if a pattern finishes with a backslash, because
  /abc\/
is interpreted as the first line of a pattern that starts with "abc/", causing pcre2test to read the next line as a continuation of the regular expression.
A pattern can be followed by a modifier list (details below).
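The delimiter rules, including the escaped delimiter and the trailing-backslash case, can be sketched in Python. This is an illustration rather than the program's own scanner: the name split_pattern_line is hypothetical, the delimiter set is my reading of the characters pcre2test accepts and should be checked against your version, and a return value of None stands for "the pattern continues on the next line".

```python
# Assumed set of accepted delimiter characters.
DELIMITERS = "/!\"'`-=_:;,%&@~"


def split_pattern_line(line):
    """Split a pattern line into (pattern, modifier_text).

    Returns None when no closing delimiter is found on this line.
    """
    if not line or line[0] not in DELIMITERS:
        raise ValueError("invalid pattern delimiter")
    delimiter = line[0]
    i = 1
    while i < len(line):
        if line[i] == "\\":
            i += 2          # keep the escape and the escaped character
            continue
        if line[i] == delimiter:
            pattern, rest = line[1:i], line[i + 1:]
            if rest.startswith("\\"):
                pattern += "\\"   # delimiter followed by a backslash
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None


print(split_pattern_line(r"/abc\/def/"))
print(split_pattern_line("/abc/\\"))
print(split_pattern_line(r"/abc\/"))
```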
SUBJECT LINE SYNTAX
Before each subject line is passed to pcre2_match() or pcre2_dfa_match(), leading and trailing white space is removed, and the line is scanned for backslash escapes. The following provide a means of encoding non-printing characters in a visible way:
  \a         alarm (BEL, \x07)
  \b         backspace (\x08)
  \e         escape (\x1b)
  \f         form feed (\x0c)
  \n         newline (\x0a)
  \r         carriage return (\x0d)
  \t         tab (\x09)
  \v         vertical tab (\x0b)
  \ddd       octal number (up to 3 octal digits)
  \o{dd...}  octal number (any number of octal digits)
  \xhh       hexadecimal byte (up to 2 hex digits)
  \x{hh...}  hexadecimal number (any number of hex digits)
The use of \x{hh...} does not depend on the use of the utf modifier on the pattern; it is always recognized. There may be any number of hexadecimal digits inside the braces; invalid values provoke error messages.
Note that \xhh specifies one byte rather than one character in UTF-8 mode; this makes it possible to construct invalid UTF-8 sequences for testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8 character in UTF-8 mode, generating more than one byte if the value is greater than 127. When testing the 8-bit library not in UTF-8 mode, \x{hh} generates one byte for values less than 256, and causes an error for greater values.
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it possible to construct invalid UTF-16 sequences for testing purposes.
In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This makes it possible to construct invalid UTF-32 sequences for testing purposes.
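The difference between \xhh and \x{hh} for the 8-bit library is easiest to see on concrete bytes. The Python sketch below models only these two escapes; the name decode_hex_escapes is hypothetical, other escapes are left untouched, and the input text is assumed to contain only characters below 256.

```python
import re

_HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}|\\x([0-9A-Fa-f]{1,2})")


def decode_hex_escapes(subject, utf):
    """Expand \\xhh and \\x{hh...} as the 8-bit library would receive them."""
    out = bytearray()
    pos = 0
    for match in _HEX_ESCAPE.finditer(subject):
        out += subject[pos:match.start()].encode("latin-1")
        if match.group(1) is not None:        # braced form \x{hh...}
            value = int(match.group(1), 16)
            if utf:
                out += chr(value).encode("utf-8")  # one UTF-8 character
            elif value < 256:
                out.append(value)                  # one byte
            else:
                raise ValueError("value too large for non-UTF 8-bit mode")
        else:                                 # \xhh is always a single byte
            out.append(int(match.group(2), 16))
        pos = match.end()
    out += subject[pos:].encode("latin-1")
    return bytes(out)


print(decode_hex_escapes(r"a\xe9", utf=True))    # b'a\xe9' (invalid UTF-8)
print(decode_hex_escapes(r"a\x{e9}", utf=True))  # b'a\xc3\xa9'
```

In UTF mode the sketch relies on Python's own encoder, so surrogates and values beyond the Unicode range raise Python exceptions rather than pcre2test's error messages.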
There is a special backslash sequence that specifies replication of one or more characters:
  \[<characters>]{<count>}
This makes it possible to test long strings without having to provide them as part of the file. For example:
  \[abc]{4}
is converted to "abcabcabcabc". This feature does not support nesting. To include a closing square bracket in the characters, code it as \x5D.
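A minimal Python sketch of the replication feature follows, assuming the syntax is a backslash, the characters in square brackets, and a decimal count in braces. The name expand_replication is hypothetical, and the sketch does not process escapes such as \x5D inside the brackets.

```python
import re

# Backslash, "[", characters without "]", "]", then "{count}".
_REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(text):
    """Replace each replication sequence by the repeated characters."""
    return _REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), text)


print(expand_replication(r"\[abc]{4}"))    # abcabcabcabc
print(expand_replication(r"x\[ab]{2}y"))   # xababy
```

Because the bracketed part cannot contain "]", nesting is not supported here either, in line with the restriction stated above.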
A backslash followed by an equals sign marks the end of the subject string and the start of a modifier list. For example:
  abc\=notbol,notempty
If the subject string is empty and \= is followed by white space, the line is treated as a comment line, and is not used for matching. For example:
  \= This is a comment.
A backslash followed by any other non-alphanumeric character just escapes that character. A backslash followed by anything else causes an error. However, if the very last character in the line is a backslash (and there is no modifier list), it is ignored. This gives a way of passing an empty line as data, since a real empty line terminates the data input.
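The handling of \= and of a lone trailing backslash can be sketched in Python. The function name split_subject_line is hypothetical; the sketch only separates the subject text from the modifier text and does not decode the other escapes. It returns None for a comment line.

```python
def split_subject_line(line):
    """Split a subject line into (subject_text, modifier_text).

    Returns None when the line is a comment ("\\=" followed by white space
    with an empty subject).
    """
    text = line.strip()   # leading and trailing white space is removed
    i = 0
    while i < len(text):
        if text[i] == "\\":
            if text[i + 1:i + 2] == "=":
                subject, modifiers = text[:i], text[i + 2:]
                if subject == "" and modifiers[:1].isspace():
                    return None               # comment line
                return subject, modifiers
            if i == len(text) - 1:
                return text[:i], ""           # lone final backslash ignored
            i += 2                            # skip the escaped character
            continue
        i += 1
    return text, ""


print(split_subject_line(r"abc\=notbol,notempty"))
print(split_subject_line(r"\= This is a comment."))
print(split_subject_line("\\"))   # empty subject passed as data
```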
PATTERN MODIFIERS
There are several types of modifier that can appear in pattern lines. Except where noted below, they may also be used in #pattern commands. A pattern's modifier list can add to or override default modifiers that were set by a previous #pattern command.
Setting compilation options
The following modifiers set options for pcre2_compile(). The most common ones have single-letter abbreviations. See pcre2api for a description of their effects.
allow_empty_class
alt_bsux
alt_circumflex