pcreapi (3)
pcreapi: Perl-compatible regular expressions
NAME
PCRE - Perl-compatible regular expressions
#include <pcre.h>
PCRE NATIVE API BASIC FUNCTIONS
pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr);
pcre_extra *pcre_study(const pcre *code, int options, const char **errptr);
void pcre_free_study(pcre_extra *extra);
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize);
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, int *workspace, int wscount);
PCRE NATIVE API STRING EXTRACTION FUNCTIONS
int pcre_copy_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, char *buffer, int buffersize);
int pcre_copy_substring(const char *subject, int *ovector, int stringcount, int stringnumber, char *buffer, int buffersize);
int pcre_get_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, const char **stringptr);
int pcre_get_stringnumber(const pcre *code, const char *name);
int pcre_get_stringtable_entries(const pcre *code, const char *name, char **first, char **last);
int pcre_get_substring(const char *subject, int *ovector, int stringcount, int stringnumber, const char **stringptr);
int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, const char ***listptr);
void pcre_free_substring(const char *stringptr);
void pcre_free_substring_list(const char **stringptr);
PCRE NATIVE API AUXILIARY FUNCTIONS
int pcre_jit_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, pcre_jit_stack *jstack);
pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
void pcre_jit_stack_free(pcre_jit_stack *stack);
void pcre_assign_jit_stack(pcre_extra *extra, pcre_jit_callback callback, void *data);
const unsigned char *pcre_maketables(void);
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, void *where);
int pcre_refcount(pcre *code, int adjust);
int pcre_config(int what, void *where);
const char *pcre_version(void);
int pcre_pattern_to_host_byte_order(pcre *code, pcre_extra *extra, const unsigned char *tables);
PCRE NATIVE API INDIRECTED FUNCTIONS
void *(*pcre_malloc)(size_t);
void (*pcre_free)(void *);
void *(*pcre_stack_malloc)(size_t);
void (*pcre_stack_free)(void *);
int (*pcre_callout)(pcre_callout_block *);
int (*pcre_stack_guard)(void);
PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. To avoid too much complication, this document describes the 8-bit versions of the functions, with only occasional references to the 16-bit and 32-bit libraries.
The 16-bit and 32-bit functions operate in the same way as their 8-bit counterparts; they just use different data types for their arguments and results, and their names start with pcre16_ or pcre32_ instead of pcre_. For every option that has UTF8 in its name (for example, PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the 16-bit and 32-bit option names define the same bit values.
References to bytes and UTF-8 in this document should be read as references to 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data units and UTF-32 when using the 32-bit library, unless specified otherwise. More details of the specific differences for the 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
PCRE API OVERVIEW
PCRE has its own native API, which is described in this document. There are also some wrapper functions (for the 8-bit library only) that correspond to the POSIX regular expression API, but they do not give access to all the functionality. They are described in the pcreposix documentation. Both of these APIs define a set of C function calls. A C++ wrapper (again for the 8-bit library only) is also distributed with PCRE. It is documented in the pcrecpp page.
The native API C function prototypes are defined in the header file pcre.h, and on Unix-like systems the (8-bit) library itself is called libpcre. It can normally be accessed by adding -lpcre to the command for linking an application that uses PCRE. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release numbers for the library. Applications can use these to include support for different releases of PCRE.
In a Windows environment, if you want to statically link an application program against a non-dll pcre.a file, you must define PCRE_STATIC before including pcre.h or pcrecpp.h, because otherwise the pcre_malloc() and pcre_free() exported functions will be declared __declspec(dllimport), with unwanted results.
The functions pcre_compile(), pcre_compile2(), pcre_study(), and pcre_exec() are used for compiling and matching regular expressions in a Perl-compatible manner. A sample program that demonstrates the simplest way of using them is provided in the file called pcredemo.c in the PCRE source distribution. A listing of this program is given in the pcredemo documentation, and the pcresample documentation describes how to compile and run it.
Just-in-time compiler support is an optional feature of PCRE that can be built in appropriate hardware environments. It greatly speeds up the matching performance of many patterns. Simple programs can easily request that it be used if available, by setting an option that is ignored when it is not relevant. More complicated programs might need to make use of the functions pcre_jit_stack_alloc(), pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control the JIT code's memory usage.
From release 8.32 there is also a direct interface for JIT execution, which gives improved performance. The JIT-specific functions are discussed in the pcrejit documentation.
A second matching function, pcre_dfa_exec(), which is not Perl-compatible, is also provided. This uses a different algorithm for the matching. The alternative algorithm finds all possible matches (at a given point in the subject), and scans the subject just once (unless there are lookbehind assertions). However, this algorithm does not return captured substrings. A description of the two matching algorithms and their advantages and disadvantages is given in the pcrematching documentation.
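In addition to the main compiling and matching functions, there are convenience functions for extracting captured substrings from a subject string that is matched by pcre_exec(). They are:
  pcre_copy_substring()
  pcre_copy_named_substring()
  pcre_get_substring()
  pcre_get_named_substring()
  pcre_get_substring_list()
  pcre_get_stringnumber()
  pcre_get_stringtable_entries()
pcre_free_substring() and pcre_free_substring_list() are also
provided, to free the memory used for extracted strings.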
The function pcre_maketables() is used to build a set of character tables
in the current locale for passing to pcre_compile(), pcre_exec(),
or pcre_dfa_exec(). This is an optional facility that is provided for
specialist use. Most commonly, no special tables are passed, in which case
internal tables that are generated when PCRE is built are used.
The function pcre_fullinfo() is used to find out information about a
compiled pattern. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.
The function pcre_refcount() maintains a reference count in a data block
containing a compiled pattern. This is provided for the benefit of
object-oriented applications.
The global variables pcre_malloc and pcre_free initially contain
the entry points of the standard malloc() and free() functions,
respectively. PCRE calls the memory management functions via these variables,
so a calling program can replace them if it wishes to intercept the calls. This
should be done before calling any PCRE functions.
The global variables pcre_stack_malloc and pcre_stack_free are also
indirections to memory management functions. These special functions are used
only when PCRE is compiled to use the heap for remembering data, instead of
recursive function calls, when running the pcre_exec() function. See the
pcrebuild
documentation for details of how to do this. It is a non-standard way of
building PCRE, for use in environments that have limited stacks. Because of the
greater use of memory management, it runs more slowly. Separate functions are
provided so that special-purpose external code can be used for this case. When
used, these functions always allocate memory blocks of the same size. There is
a discussion about PCRE's stack usage in the
pcrestack
documentation.
The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at specified
points during a matching operation. Details are given in the
pcrecallout
documentation.
The global variable pcre_stack_guard initially contains NULL. It can be
set by the caller to a function that is called by PCRE whenever it starts
to compile a parenthesized part of a pattern. When parentheses are nested, PCRE
uses recursive function calls, which use up the system stack. This function is
provided so that applications with restricted stacks can force a compilation
error if the stack runs out. The function should return zero if all is well, or
non-zero to force an error.
NEWLINES
PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, or any
Unicode newline sequence. The Unicode newline sequences are the three just
mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029).
Each of the first three conventions is used by at least one operating system as
its standard newline sequence. When PCRE is built, a default can be specified.
If none is specified, the default is LF, which is the Unix standard. When PCRE is run, the
default can be overridden, either when a pattern is compiled, or when it is
matched.
At compile time, the newline convention can be specified by the options
argument of pcre_compile(), or it can be specified by special text at the
start of the pattern itself; this overrides any other settings. See the
pcrepattern
page for details of the special character sequences.
In the PCRE documentation the word "newline" is used to mean "the character or
pair of characters that indicate a line break". The choice of newline
convention affects the handling of the dot, circumflex, and dollar
metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
recognized line ending sequence, the match position advancement for a
non-anchored pattern. There is more detail about this in the
section on pcre_exec() options
below.
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches, which is
controlled in a similar way, but by separate options.
MULTITHREADING
The PCRE functions can be used in multi-threading applications, with the
proviso that the memory management functions pointed to by pcre_malloc,
pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout and stack-checking functions pointed to by pcre_callout and
pcre_stack_guard, are shared by all threads.
The compiled form of a regular expression is not altered during matching, so
the same compiled pattern can safely be used by several threads at once.
If the just-in-time optimization feature is being used, it needs separate
memory stack areas for each thread. See the
pcrejit
documentation for more details.
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a later
time, possibly by a different program, and even on a host other than the one on
which it was compiled. Details are given in the
pcreprecompile
documentation, which includes a description of the
pcre_pattern_to_host_byte_order() function. However, compiling a regular
expression with one version of PCRE for use with a different version is not
guaranteed to work and may cause crashes.
CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
The function pcre_config() makes it possible for a PCRE client to
discover which optional features have been compiled into the PCRE library. The
pcrebuild
documentation has more details about these optional features.
The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable into
which the information is placed. The returned value is zero on success, or the
negative error code PCRE_ERROR_BADOPTION if the value in the first argument is
not recognized. The following information is available:
  PCRE_CONFIG_UTF8
The output is an integer that is set to one if UTF-8 support is available;
otherwise it is set to zero. This value should normally be given to the 8-bit
version of this function, pcre_config(). If it is given to the 16-bit
or 32-bit version of this function, the result is PCRE_ERROR_BADOPTION.
  PCRE_CONFIG_UTF16
The output is an integer that is set to one if UTF-16 support is available;
otherwise it is set to zero. This value should normally be given to the 16-bit
version of this function, pcre16_config(). If it is given to the 8-bit
or 32-bit version of this function, the result is PCRE_ERROR_BADOPTION.
  PCRE_CONFIG_UTF32
The output is an integer that is set to one if UTF-32 support is available;
otherwise it is set to zero. This value should normally be given to the 32-bit
version of this function, pcre32_config(). If it is given to the 8-bit
or 16-bit version of this function, the result is PCRE_ERROR_BADOPTION.
  PCRE_CONFIG_UNICODE_PROPERTIES
The output is an integer that is set to one if support for Unicode character
properties is available; otherwise it is set to zero.
  PCRE_CONFIG_JIT
The output is an integer that is set to one if support for just-in-time
compiling is available; otherwise it is set to zero.
  PCRE_CONFIG_JITTARGET
The output is a pointer to a zero-terminated "const char *" string. If JIT
support is available, the string contains the name of the architecture for
which the JIT compiler is configured, for example "x86 32bit (little endian +
unaligned)". If JIT support is not available, the result is NULL.
  PCRE_CONFIG_NEWLINE
The output is an integer whose value specifies the default character sequence
that is recognized as meaning "newline". The values that are supported in
ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for
ANYCRLF, and -1 for ANY. In EBCDIC environments, CR, ANYCRLF, and ANY yield the
same values. However, the value for LF is normally 21, though some EBCDIC
environments use 37. The corresponding values for CRLF are 3349 and 3365. The
default should normally correspond to the standard sequence for your operating
system.
  PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences the \R
escape sequence matches by default. A value of 0 means that \R matches any
Unicode line ending sequence; a value of 1 means that \R matches only CR, LF,
or CRLF. The default can be overridden when a pattern is compiled or matched.
  PCRE_CONFIG_LINK_SIZE
The output is an integer that contains the number of bytes used for internal
linkage in compiled regular expressions. For the 8-bit library, the value can
be 2, 3, or 4. For the 16-bit and 32-bit libraries, the value is either 2 or 4
and is still a number of bytes. The default value of 2 is sufficient for all but the
most massive patterns, since it allows the compiled pattern to be up to 64K in
size. Larger values allow larger regular expressions to be compiled, at the
expense of slower matching.
  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
The output is an integer that contains the threshold above which the POSIX
interface uses malloc() for output vectors. Further details are given in
the
pcreposix
documentation.
  PCRE_CONFIG_PARENS_LIMIT
The output is a long integer that gives the maximum depth of nesting of
parentheses (of any kind) in a pattern. This limit is imposed to cap the amount
of system stack used when a pattern is compiled. It is specified when PCRE is
built; the default is 250. This limit does not take into account the stack that
may already be used by the calling application. For finer control over
compilation stack usage, you can set a pointer to an external checking function
in pcre_stack_guard.
  PCRE_CONFIG_MATCH_LIMIT
The output is a long integer that gives the default limit for the number of
internal matching function calls in a pcre_exec() execution. Further
details are given with pcre_exec() below.
  PCRE_CONFIG_MATCH_LIMIT_RECURSION
The output is a long integer that gives the default limit for the depth of
recursion when calling the internal matching function in a pcre_exec()
execution. Further details are given with pcre_exec() below.
  PCRE_CONFIG_STACKRECURSE
The output is an integer that is set to one if internal recursion when running
pcre_exec() is implemented by recursive function calls that use the stack
to remember their state. This is the usual way that PCRE is compiled. The
output is zero if PCRE was compiled to use blocks of data on the heap instead
of recursive function calls. In this case, pcre_stack_malloc and
pcre_stack_free are called to manage memory blocks on the heap, thus
avoiding the use of the stack.
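COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options,
int *errorcodeptr,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
Either of the functions pcre_compile() or pcre_compile2() can be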
called to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned. To avoid
too much repetition, we refer just to pcre_compile() below, but the
information applies equally to pcre_compile2().
The pattern is a C string terminated by a binary zero, and is passed in the
pattern argument. A pointer to a single block of memory that is obtained
via pcre_malloc is returned. This contains the compiled code and related
data. The pcre type is defined for the returned block; this is a typedef
for a structure whose contents are not externally defined. It is up to the
caller to free the memory (via pcre_free) when it is no longer required.
Although the compiled code of a PCRE regex is relocatable, that is, it does not
depend on memory location, the complete pcre data block is not
fully relocatable, because it may contain a copy of the tableptr
argument, which is an address (see below).
The options argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that are
compatible with Perl, but some others as well) can also be set and unset from
within the pattern (see the detailed description in the
pcrepattern
documentation). For those options that can be different in different parts of
the pattern, the contents of the options argument specifies their
settings at the start of compilation and execution. The PCRE_ANCHORED,
PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
PCRE_NO_START_OPTIMIZE options can be set at the time of matching as well as at
compile time.
If errptr is NULL, pcre_compile() returns NULL immediately.
Otherwise, if compilation of a pattern fails, pcre_compile() returns
NULL, and sets the variable pointed to by errptr to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it. Normally, the offset from the start of the pattern to the
data unit that was being processed when the error was discovered is placed in
the variable pointed to by erroffset, which must not be NULL (if it is,
an immediate error is given). However, for an invalid UTF-8 or UTF-16 string,
the offset is that of the first data unit of the failing character.
Some errors are not detected until the whole pattern has been scanned; in these
cases, the offset passed back is the length of the pattern. Note that the
offset is in data units, not characters, even in a UTF mode. It may sometimes
point into the middle of a UTF-8 or UTF-16 character.
If pcre_compile2() is used instead of pcre_compile(), and the
errorcodeptr argument is not NULL, a non-zero error code number is
returned via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the default C
locale. Otherwise, tableptr must be an address that is the result of a
call to pcre_maketables(). This value is stored with the compiled
pattern, and used again by pcre_exec() and pcre_dfa_exec() when the
pattern is matched. For more discussion, see the section on locale support
below.
This code fragment shows a typical straightforward call to pcre_compile():
The following names for option bits are defined in the pcre.h header
file:
  PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it is
constrained to match only at the first matching point in the string that is
being searched (the "subject string"). This effect can also be achieved by
appropriate constructs in the pattern itself, which is the only way to do it in
Perl.
  PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the callout
facility, see the
pcrecallout
documentation.
  PCRE_BSR_ANYCRLF
  PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF, or to
match any Unicode newline sequence. The default is specified when PCRE is
built. It can be overridden from within the pattern, or by setting an option
when a compiled pattern is matched.
  PCRE_CASELESS
If this bit is set, letters in the pattern match both upper and lower case
letters. It is equivalent to Perl's /i option, and it can be changed within a
pattern by a (?i) option setting. In UTF-8 mode, PCRE always understands the
concept of case for characters whose values are less than 128, so caseless
matching is always possible. For characters with higher values, the concept of
case is supported if PCRE is compiled with Unicode property support, but not
otherwise. If you want to use caseless matching for characters 128 and above,
you must ensure that PCRE is compiled with Unicode property support as well as
with UTF-8 support.
  PCRE_DOLLAR_ENDONLY
If this bit is set, a dollar metacharacter in the pattern matches only at the
end of the subject string. Without this option, a dollar also matches
immediately before a newline at the end of the string (but not before any other
newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
There is no equivalent to this option in Perl, and no way to set it within a
pattern.
  PCRE_DOTALL
If this bit is set, a dot metacharacter in the pattern matches a character of
any value, including one that indicates a newline. However, it only ever
matches one character, even if newlines are coded as CRLF. Without this option,
a dot does not match when the current position is at a newline. This option is
equivalent to Perl's /s option, and it can be changed within a pattern by a
(?s) option setting. A negative class such as [^a] always matches newline
characters, independent of the setting of this option.
  PCRE_DUPNAMES
If this bit is set, names used to identify capturing subpatterns need not be
unique. This can be helpful for certain types of pattern when it is known that
only one instance of the named subpattern can ever be matched. There are more
details of named subpatterns below; see also the
pcrepattern
documentation.
  PCRE_EXTENDED
If this bit is set, most white space characters in the pattern are totally
ignored except when escaped or inside a character class. However, white space
is not allowed within sequences such as (?> that introduce various
parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
However, ignorable white space is permitted between an item and a following
quantifier and between a quantifier and a following + that indicates
possessiveness.
White space formerly did not include the VT character (code 11), because Perl
did not treat this character as white space. However, Perl changed at release
5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
PCRE_EXTENDED also causes characters between an unescaped # outside a character
class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
equivalent to Perl's /x option, and it can be changed within a pattern by a
(?x) option setting.
Which characters are interpreted as newlines is controlled by the options
passed to pcre_compile() or by a special sequence at the start of the
pattern, as described in the section entitled
"Newline conventions"
in the pcrepattern documentation. Note that the end of this type of
comment is a literal newline sequence in the pattern; escape sequences that
happen to represent a newline do not count.
This option makes it possible to include comments inside complicated patterns.
Note, however, that this applies only to data characters. White space characters
may never appear within special character sequences in a pattern, for example
within the sequence (?( that introduces a conditional subpattern.
  PCRE_EXTRA
This option was invented in order to turn on additional functionality of PCRE
that is incompatible with Perl, but it is currently of very little use. When
set, any backslash in a pattern that is followed by a letter that has no
special meaning causes an error, thus reserving these combinations for future
expansion. By default, as in Perl, a backslash followed by a letter with no
special meaning is treated as a literal. (Perl can, however, be persuaded to
give an error for this, by running it with the -w option.) There are at present
no other features controlled by this option. It can also be set by a (?X)
option setting within a pattern.
  PCRE_FIRSTLINE
If this option is set, an unanchored pattern is required to match before or at
the first newline in the subject string, though the matched text may continue
over the newline.
  PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways so that it is
compatible with JavaScript rather than Perl. The changes are as follows:
(1) A lone closing square bracket in a pattern causes a compile-time error,
because this is illegal in JavaScript (by default it is treated as a data
character). Thus, the pattern AB]CD becomes illegal when this option is set.
(2) At run time, a back reference to an unset subpattern group matches an empty
string (by default this causes the current matching alternative to fail). A
pattern such as (\1)(a) succeeds when this option is set (assuming it can find
an "a" in the subject), whereas it fails by default, for Perl compatibility.
(3) \U matches an upper case "U" character; by default \U causes a compile
time error (Perl uses \U to upper case subsequent characters).
(4) \u matches a lower case "u" character unless it is followed by four
hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, \u causes a compile time error (Perl uses it to upper
case the following character).
(5) \x matches a lower case "x" character unless it is followed by two
hexadecimal digits, in which case the hexadecimal number defines the code point
to match. By default, as in Perl, a hexadecimal number is always expected after
\x, but it may have zero, one, or two digits (so, for example, \xz matches a
binary zero character followed by z).
  PCRE_MULTILINE
By default, for the purposes of matching "start of line" and "end of line",
PCRE treats the subject string as consisting of a single line of characters,
even if it actually contains newlines. The "start of line" metacharacter (^)
matches only at the start of the string, and the "end of line" metacharacter
($) matches only at the end of the string, or before a terminating newline
(except when PCRE_DOLLAR_ENDONLY is set). Note, however, that unless
PCRE_DOTALL is set, the "any character" metacharacter (.) does not match at a
newline. This behaviour (for ^, $, and dot) is the same as Perl.
When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
match immediately following or immediately before internal newlines in the
subject string, respectively, as well as at the very start and end. This is
equivalent to Perl's /m option, and it can be changed within a pattern by a
(?m) option setting. If there are no newlines in a subject string, or no
occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.
  PCRE_NEVER_UTF
This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or
UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
creator of the pattern from switching to UTF interpretation by starting the
pattern with (*UTF). This may be useful in applications that process patterns
from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
causes an error.
  PCRE_NEWLINE_CR
  PCRE_NEWLINE_LF
  PCRE_NEWLINE_CRLF
  PCRE_NEWLINE_ANYCRLF
  PCRE_NEWLINE_ANY
These options override the default newline definition that was chosen when PCRE
was built. Setting the first or the second specifies that a newline is
indicated by a single character (CR or LF, respectively). Setting
PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character
CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies that any of the three
preceding sequences should be recognized. Setting PCRE_NEWLINE_ANY specifies
that any Unicode newline sequence should be recognized.
In an ASCII/Unicode environment, the Unicode newline sequences are the three
just mentioned, plus the single characters VT (vertical tab, U+000B), FF (form
feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029). For the 8-bit library, the last two are
recognized only in UTF-8 mode.
When PCRE is compiled to run in an EBCDIC (mainframe) environment, the code for
CR is 0x0d, the same as ASCII. However, the character code for LF is normally
0x15, though in some EBCDIC environments 0x25 is used. Whichever of these is
not LF is made to correspond to Unicode's NEL character. EBCDIC codes are all
less than 256. For more details, see the
pcrebuild
documentation.
The newline setting in the options word uses three bits that are treated
as a number, giving eight possibilities. Currently only six are used (default
plus the five values above). This means that if you set more than one newline
option, the combination may or may not be sensible. For example,
PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but
other combinations may yield unused numbers and cause an error.
The only time that a line break in a pattern is specially recognized when
compiling is when PCRE_EXTENDED is set. CR and LF are white space characters,
and so are ignored in this mode. Also, an unescaped # outside a character class
indicates a comment that lasts until after the next line break sequence. In
other circumstances, line break sequences in patterns are treated as literal
data.
The newline option that is set at compile time becomes the default that is used
for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
  PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing parentheses in
the pattern. Any opening parenthesis that is not followed by ? behaves as if it
were followed by ?: but named parentheses can still be used for capturing (and
they acquire numbers in the usual way). There is no equivalent of this option
in Perl.
  PCRE_NO_AUTO_POSSESS
If this option is set, it disables "auto-possessification". This is an
optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some of them are never taken. You can set
this option if you want the matching functions to do a full unoptimized search
and run all the callouts, but it is mainly provided for testing purposes.
  PCRE_NO_START_OPTIMIZE
This is an option that acts at matching time; that is, it is really an option
for pcre_exec() or pcre_dfa_exec(). If it is set at compile time,
it is remembered with the compiled pattern and assumed at matching time. This
is necessary if you want to use JIT execution, because the JIT compiler needs
to know whether or not this option is set. For details see the discussion of
PCRE_NO_START_OPTIMIZE
below.
  PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII characters
are recognized, but if PCRE_UCP is set, Unicode properties are used instead to
classify characters. More details are given in the section on
generic character types
in the
pcrepattern
page. If you set PCRE_UCP, matching one of the items it affects takes much
longer. The option is available only if PCRE has been compiled with Unicode
property support.
  PCRE_UNGREEDY
This option inverts the "greediness" of the quantifiers so that they are not
greedy by default, but become greedy if followed by "?". It is not compatible
with Perl. It can also be set by a (?U) option setting within the pattern.
  PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as strings
of UTF-8 characters instead of single-byte strings. However, it is available
only when PCRE is built to include UTF support. If not, the use of this option
provokes an error. Details of how this option changes the behaviour of PCRE are
given in the
pcreunicode
page.
  PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. There is a discussion about the
validity of UTF-8 strings
in the
pcreunicode
page. If an invalid UTF-8 sequence is found, pcre_compile() returns an
error. If you already know that your pattern is valid, and you want to skip
this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK option.
When it is set, the effect of passing an invalid UTF-8 string as a pattern is
undefined. It may cause your program to crash or loop. Note that this option
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress
the validity checking of subject strings only. If the same string is being
matched many times, the option can be safely set for the second and subsequent
matchings to improve performance.
COMPILATION ERROR CODES
The following table lists the error codes that may be returned by
pcre_compile2(), along with the error messages that may be returned by
both compiling functions. Note that error messages are always 8-bit ASCII
strings, even in 16-bit or 32-bit mode. As PCRE has developed, some error codes
have fallen out of use. To avoid confusion, they have not been re-used.