pcretest (1) - Linux Manuals
NAME
pcretest - a program for testing Perl-compatible regular expressions.
SYNOPSIS
pcretest [options] [input file [output file]]
pcretest was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular expressions. This document describes the features of the test program; for details of the regular expressions themselves, see the pcrepattern documentation. For details of the PCRE library function calls and their options, see the pcreapi, pcre16 and pcre32 documentation.
The input for pcretest is a sequence of regular expression patterns and strings to be matched, as described below. The output shows the result of each match. Options on the command line and the patterns control PCRE options and exactly what is output.
As PCRE has evolved, it has acquired many different features, and as a result, pcretest now has rather a lot of obscure options for testing every possible feature. Some of these options are specifically designed for use in conjunction with the test script and data files that are distributed as part of PCRE, and are unlikely to be of use otherwise. They are all documented here, but without much justification.
PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
From release 8.30, two separate PCRE libraries can be built. The original one supports 8-bit character strings, whereas the newer 16-bit library supports character strings encoded in 16-bit units. From release 8.32, a third library can be built, supporting character strings encoded in 32-bit units. The pcretest program can be used to test all three libraries. However, it is itself still an 8-bit program, reading 8-bit input and writing 8-bit output. When testing the 16-bit or 32-bit library, the patterns and data strings are converted to 16- or 32-bit format before being passed to the PCRE library functions. Results are converted to 8-bit for output.
References to functions and structures of the form pcre[16|32]_xx below mean "pcre_xx when using the 8-bit library, pcre16_xx when using the 16-bit library, or pcre32_xx when using the 32-bit library".
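The shorthand above can be spelled out mechanically. The following Python sketch is only an illustration of the naming convention; the helper name pcre_name is hypothetical and is not part of PCRE or pcretest.

```python
def pcre_name(suffix, width=8):
    """Expand the pcre[16|32]_xx shorthand for one code unit width.

    pcre_name is a hypothetical helper used only for illustration.
    """
    prefixes = {8: "pcre_", 16: "pcre16_", 32: "pcre32_"}
    if width not in prefixes:
        raise ValueError("unsupported code unit width: %r" % (width,))
    return prefixes[width] + suffix


if __name__ == "__main__":
    # Show what pcre[16|32]_exec() stands for with each library.
    for width in (8, 16, 32):
        print(width, pcre_name("exec", width))
```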
COMMAND LINE OPTIONS
- -8
- If the 8-bit library has been built, this option causes the 8-bit library to be used (which is the default); if the 8-bit library has not been built, this option causes an error.
- -16
- If the 16-bit library has been built together with the 8-bit or the 32-bit library, this option causes the 16-bit library to be used. If only the 16-bit library has been built, this is the default (so has no effect). If only the 8-bit or the 32-bit library has been built, this option causes an error.
- -32
- If the 32-bit library has been built together with the 8-bit or the 16-bit library, this option causes the 32-bit library to be used. If only the 32-bit library has been built, this is the default (so has no effect). If only the 8-bit or the 16-bit library has been built, this option causes an error.
- -b
- Behave as if each pattern has the /B (show byte code) modifier; the internal form is output after compilation.
- -C
- Output the version number of the PCRE library, and all available information about the optional features that are included, and then exit. All other options are ignored.
- -C option
- Output information about a specific build-time option, then exit. This functionality is intended for use in scripts such as RunTest. The following options output the value indicated:
  ebcdic-nl  the code for LF (= NL) in an EBCDIC environment: 0x15 or 0x25; 0 if used in an ASCII environment
  linksize   the internal link size (2, 3, or 4)
  newline    the default newline setting: CR, LF, CRLF, ANYCRLF, or ANY
  The following options output 1 for true or 0 for false:
  ebcdic     compiled for an EBCDIC environment
  jit        just-in-time support is available
  pcre16     the 16-bit library was built
  pcre32     the 32-bit library was built
  pcre8      the 8-bit library was built
  ucp        Unicode property support is available
  utf        UTF-8 and/or UTF-16 and/or UTF-32 support is available
- -d
- Behave as if each pattern has the /D (debug) modifier; the internal form and information about the compiled pattern are output after compilation; -d is equivalent to -b -i.
- -dfa
- Behave as if each data line contains the \D escape sequence; this causes the alternative matching function, pcre[16|32]_dfa_exec(), to be used instead of the standard pcre[16|32]_exec() function (more detail is given below).
- -help
- Output a brief summary of these options and then exit.
- -i
- Behave as if each pattern has the /I modifier; information about the compiled pattern is given after compilation.
- -M
- Behave as if each data line contains the \M escape sequence; this causes PCRE to discover the minimum MATCH_LIMIT and MATCH_LIMIT_RECURSION settings by calling pcre[16|32]_exec() repeatedly with different limits.
- -m
- Output the size of each compiled pattern after it has been compiled. This is equivalent to adding /M to each regular expression. The size is given in bytes for all three libraries.
- -o osize
- Set the number of elements in the output vector that is used when calling pcre[16|32]_exec() or pcre[16|32]_dfa_exec() to be osize. The default value is 45, which is enough for 14 capturing subexpressions for pcre[16|32]_exec() or 22 different matches for pcre[16|32]_dfa_exec(). The vector size can be changed for individual matching calls by including \O in the data line (see below).
- -p
- Behave as if each pattern has the /P modifier; the POSIX wrapper API is used to call PCRE. None of the other options has any effect when -p is set. This option can be used only with the 8-bit library.
- -q
- Do not output the version number of pcretest at the start of execution.
- -S size
- On Unix-like systems, set the size of the run-time stack to size megabytes.
- -s or -s+
- Behave as if each pattern has the /S modifier; in other words, force each pattern to be studied. If -s+ is used, all the JIT compile options are passed to pcre[16|32]_study(), causing just-in-time optimization to be set up if it is available, for both full and partial matching. Specific JIT compile options can be selected by following -s+ with a digit in the range 1 to 7, which selects the JIT compile modes as follows:
1 normal match only
2 soft partial match only
3 normal match and soft partial match
4 hard partial match only
6 soft and hard partial match
7 all three modes (default)
If -s++ is used instead of -s+ (with or without a following digit), the text "(JIT)" is added to the first output line after a match or no match when JIT-compiled code was actually used.
Note that there are pattern options that can override -s, either specifying no studying at all, or suppressing JIT compilation.
If the /I or /D option is present on a pattern (requesting output about the compiled pattern), information about the result of studying is not included when studying is caused only by -s and neither -i nor -d is present on the command line. This behaviour means that the output from tests that are run with and without -s should be identical, except when options that output information about the actual running of a match are set.
The -M, -t, and -tm options, which give information about resources used, are likely to produce different output with and without -s. Output may also differ if the /C option is present on an individual pattern. This uses callouts to trace the matching process, and this may be different between studied and non-studied patterns. If the pattern contains (*MARK) items there may also be differences, for the same reason. The -s command line option can be overridden for specific patterns that should never be studied (see the /S pattern modifier below).
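The digits listed for -s+ are consistent with a three-bit mask (1 for normal match, 2 for soft partial match, 4 for hard partial match). The Python sketch below decodes a digit under that reading. It is an inference from the table, not a statement about how pcretest is implemented; the names JIT_MODE_BITS and jit_modes are hypothetical, and the table does not list the digit 5, so its decoding here is an assumption.

```python
from typing import List

# Assumed bit meanings, inferred from the digits 1, 2 and 4 in the table.
JIT_MODE_BITS = {
    1: "normal match",
    2: "soft partial match",
    4: "hard partial match",
}


def jit_modes(digit: int = 7) -> List[str]:
    """Return the JIT compile modes selected by the digit after -s+.

    The default of 7 mirrors the table entry "all three modes (default)".
    """
    if not 1 <= digit <= 7:
        raise ValueError("digit must be in the range 1 to 7")
    return [name for bit, name in JIT_MODE_BITS.items() if digit & bit]


if __name__ == "__main__":
    for digit in (1, 2, 3, 4, 6, 7):
        print(digit, ", ".join(jit_modes(digit)))
```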
- -t
- Run each compile, study, and match many times with a timer, and output the resulting time per compile or match (in milliseconds). Do not set -m with -t, because you will then get the size output a zillion times, and the timing will be distorted. You can control the number of iterations that are used for timing by following -t with a number (as a separate item on the command line). For example, "-t 1000" would iterate 1000 times. The default is to iterate 500000 times.
- -tm
- This is like -t except that it times only the matching phase, not the compile or study phases.
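The figures quoted for the -o option above (14 capturing subexpressions or 22 matches for the default size of 45) follow from how the output vector is used. As a sketch, assuming the layout described in the pcreapi documentation: pcre[16|32]_exec() uses only the first two-thirds of the vector for offset pairs, with the first pair reserved for the whole match, while pcre[16|32]_dfa_exec() uses one pair of elements per match. The function names below are hypothetical.

```python
def exec_capture_capacity(osize):
    """Capturing subexpressions that fit in an output vector of osize elements.

    Assumes two-thirds of the vector holds (start, end) pairs and that
    the first pair describes the whole match.
    """
    pairs = osize // 3          # (2/3 of osize) / 2 elements per pair
    return max(pairs - 1, 0)    # one pair is taken by the overall match


def dfa_match_capacity(osize):
    """Matches that fit when each match needs one (start, end) pair."""
    return osize // 2


if __name__ == "__main__":
    default_osize = 45
    print(exec_capture_capacity(default_osize))  # 14
    print(dfa_match_capacity(default_osize))     # 22
```

A size that is not a multiple of three wastes the remainder for pcre[16|32]_exec(), which is one reason to choose multiples of three when passing -o.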
DESCRIPTION
If pcretest is given two filename arguments, it reads from the first and writes to the second. If it is given only one filename argument, it reads from that file and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and prompts for each line of input, using "re>" to prompt for regular expressions, and "data>" to prompt for data lines.
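The rule for choosing input and output can be sketched in Python as follows. This is an illustration only: select_streams is a hypothetical name, and raising an error for more than two filenames is an assumption, since the text does not say what happens in that case.

```python
def select_streams(filenames):
    """Return (source, sink, prompts) for a list of filename arguments.

    prompts is a (pattern prompt, data prompt) tuple when reading
    from stdin, otherwise None.
    """
    if len(filenames) > 2:
        # Assumption: the text does not describe this case.
        raise ValueError("at most two filename arguments are expected")
    source = filenames[0] if len(filenames) >= 1 else "stdin"
    sink = filenames[1] if len(filenames) == 2 else "stdout"
    prompts = ("re>", "data>") if not filenames else None
    return source, sink, prompts


if __name__ == "__main__":
    print(select_streams([]))
    print(select_streams(["tests.in"]))
    print(select_streams(["tests.in", "tests.out"]))
```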
When pcretest is built, a configuration option can specify that it should be linked with the libreadline library. When this is done, if the input is from a terminal, it is read using the readline() function. This provides line-editing and history facilities. The output from the -help option states whether or not readline() will be used.
The program handles any number of sets of input on a single input file. Each set starts with a regular expression, and continues with any number of data lines to be matched against the pattern.
Each data line is matched separately and independently. If you want to do multi-line matches, you have to use the \n escape sequence (or \r or \r\n, etc., depending on the newline setting) in a single line of input to encode the newline sequences. There is no limit on the length of data lines; the input buffer is automatically extended if it is too small.
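To make the multi-line case concrete, the Python sketch below turns the two-character escapes \n and \r typed on one data line into real newline characters. It is a deliberately reduced illustration: decode_data_line is a hypothetical name, and every other escape sequence that pcretest recognizes is left untouched here.

```python
# Only the two escapes named in the paragraph above are handled.
NEWLINE_ESCAPES = {"n": "\n", "r": "\r"}


def decode_data_line(line):
    """Replace literal \\n and \\r escapes in one data line."""
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in NEWLINE_ESCAPES:
            out.append(NEWLINE_ESCAPES[line[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    subject = decode_data_line(r"first\nsecond\r\nthird")
    print(subject.splitlines())
```

A data line typed as `first\nsecond` therefore reaches the matcher as a single subject string containing two lines.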
An empty line signals the end of the data lines, at which point a new regular expression is read. The regular expressions are given enclosed in any non-alphanumeric delimiters other than backslash, for example:
  /(a|bc)x+yz/
White space before the initial delimiter is ignored. A regular expression may
be continued over several input lines, in which case the newline characters are
included within it. It is possible to include the delimiter within the pattern
by escaping it, for example
  /abc\/def/
If you do so, the escape and the delimiter form part of the pattern, but since
delimiters are always non-alphanumeric, this does not affect its interpretation.
If the terminating delimiter is immediately followed by a backslash, for
example,
  /abc/\
then a backslash is added to the end of the pattern. This is done to provide a
way of testing the error condition that arises if a pattern finishes with a
backslash, because
  /abc\/
is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
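The delimiter rules above can be summarized in a short Python sketch. It is a simplified model and not pcretest's actual parser: split_pattern_line is a hypothetical name, and instead of reading a continuation line it returns None when no terminating delimiter is found.

```python
def split_pattern_line(line):
    """Split one input line into (pattern, modifiers).

    Returns None when the pattern is not terminated on this line,
    which is where pcretest would read a continuation line.
    """
    text = line.lstrip()            # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty line")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")
    i = 1
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 2                  # escape and escaped character both stay in the pattern
            continue
        if text[i] == delim:
            pattern = text[1:i]
            rest = text[i + 1:]
            if rest.startswith("\\"):
                pattern += "\\"     # delimiter followed by backslash: append a backslash
                rest = rest[1:]
            return pattern, rest.strip()
        i += 1
    return None


if __name__ == "__main__":
    for line in ["/(a|bc)x+yz/i", "/abc\\/def/", "/abc/\\", "/abc\\/"]:
        print(repr(line), "->", split_pattern_line(line))
```

The last input shows the case described above: the escaped delimiter does not end the pattern, so more input is needed.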
A pattern may be followed by any number of modifiers, which are mostly single
characters, though some of these can be qualified by further characters.
Following Perl usage, these are referred to below as, for example, "the
/i modifier", even though the delimiter of the pattern need not always be
a slash, and no slash is used when writing modifiers. White space may appear
between the final pattern delimiter and the first modifier, and between the
modifiers themselves. For reference, here is a complete list of modifiers. They
fall into several groups that are described in detail in the following
sections.
PATTERN MODIFIERS