class RegularExpressions

Regular expressions.

Regular expressions are used to describe character patterns in strings. For example, you can look for all paragraphs in a tagged text, no matter which style the paragraph uses:

<ParaStyle:Style1>...
<ParaStyle:Stil1>...
<ParaStyle:>
<ParaStyle:2\>3>

Search <ParaStyle:(<0x[0-9A-F]{4,4}>|\\\\>|\\\\<|[^<>])*>
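The search above can be tried outside of cScript as a rough illustration. The following sketch uses Python's re module, which is only a stand-in chosen for demonstration and is neither the GNU nor the PCRE implementation. It assumes that the four backslashes in the expression above are the string-escaped spelling of a pattern that matches one literal backslash, so the raw Python string below uses two.

```python
import re

# The search expression from above. In a raw string, \\ in the pattern
# means "one literal backslash" (assumption: the documented expression
# is shown with string escaping applied).
PARA_TAG = re.compile(r"<ParaStyle:(<0x[0-9A-F]{4,4}>|\\>|\\<|[^<>])*>")

samples = [
    "<ParaStyle:Style1>...",
    "<ParaStyle:Stil1>...",
    "<ParaStyle:>",
    r"<ParaStyle:2\>3>",
]

# Collect the matched tag of every sample line.
found = [PARA_TAG.search(line).group(0) for line in samples]
for tag in found:
    print(tag)
```

All four sample lines are matched, including the empty style name and the style name containing an escaped >.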

You can use regular expressions in the following situations:

strreplace
strstr
strstrpos
Text layout condition Contains
Text layout rule Search and replace

There are many dialects of regular expressions. With the priint:comet InDesign® Plug-Ins you can use two of them:

GNU-conforming regular expressions
PCRE (Perl Compatible Regular Expressions), since v3.3.1 R4000 and at least CS5

Since v3.4 R9000 and CS5, regular expressions are parsed by PCRE only. Only CS4 still uses the old implementation for GNU-compatible regular expressions.

GNU-conforming regular expressions support the base functionality of regular expressions (characters and substrings, counters, word boundaries, and subexpressions).

PCRE supports the full functionality expected from modern string matchers. For example, you may use the following features:

Lookarounds
Conditions
Unicode characters ([äöü]{0,7} or \x{00E6}). The \u00E6 syntax is not supported by PCRE.
Unicode character properties (\p{Ll} or \p{Thai})
Modifiers (e.g. (?i))
Comments
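
Some of these features can be tried quickly in other engines. The sketch below uses Python's re module as a stand-in; this is an assumption for illustration only, since Python supports lookarounds, inline modifiers, and inline comments but not the \p{...} or \x{...} syntax of PCRE. The sample strings are invented.

```python
import re

text = "10 EUR, 20 USD, 30 EUR"

# Lookahead: digits only if they are followed by " EUR".
eur_amounts = re.findall(r"\d+(?= EUR)", text)
print(eur_amounts)  # ['10', '30']

# Inline modifier (?i): match regardless of upper or lower case.
tag_hits = re.findall(r"(?i)parastyle", "<ParaStyle:A> <PARASTYLE:B>")
print(tag_hits)  # ['ParaStyle', 'PARASTYLE']

# Inline comment (?#...): ignored by the matcher.
years = re.findall(r"\b\d{4}\b(?# four-digit number)", "in 1999 and 2024")
print(years)  # ['1999', '2024']
```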

Please take care to escape every \ in strings with a preceding \.

If you are not familiar with regular expressions, you can find many descriptions and examples on the web. If you want to learn something about the theory of regular expressions, search for formal languages; regular expressions are a subfamily of the formal languages.

An excellent description of PCRE can be found here.
The online regex tester is extremely helpful.

To tell strreplace and strstrpos that the search string is a regular expression, use one of the prefixes

regexp: or pcre:

Since v3.4 R9000 and CS5, the prefix regexp: refers to PCRE too.

Characters and substrings in regular expressions stand for themselves.

Search for all occurrences of 1234 in a given string:

456 1234 6748 441234567 64641329 4321 4321 999

Search 1234

Of course, a simple search can solve this problem too. Things get harder if you allow any order of the digits 1, 2, 3, 4. (You could search for each of the 24 combinations ...)

Look for all occurrences of 1234 in any order of the digits:

456 1234 6748 441234567 64641329 4321 999

Search [1234][1234][1234][1234]

Every [] expression matches exactly one of the characters inside the brackets. The brackets may contain any number of ASCII characters or ASCII character ranges. So [a-zA-Z_] will match any letter and the underscore. Use [^...] if the character must not match. [^ \t\r\n] will match any character except the blank, the tab, and the line delimiters.
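
To see what the bracket expressions actually find, here is a small sketch using Python's re module (a stand-in for illustration, not the cScript implementation) on the sample string from above.

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

# Four positions, each allowing one of the characters 1, 2, 3, 4.
four_of_1234 = re.findall(r"[1234][1234][1234][1234]", text)
print(four_of_1234)  # ['1234', '4412', '4132', '4321']

# [a-zA-Z_] matches one letter or underscore at a time.
letters = re.findall(r"[a-zA-Z_]", "a_1 B")
print(letters)  # ['a', '_', 'B']

# [^ \t\r\n] matches one character that is not blank, tab, or line end.
non_blank = re.findall(r"[^ \t\r\n]", "a b\tc\n")
print(non_blank)  # ['a', 'b', 'c']
```

Because every bracket checks only a single character, sequences with repeated digits such as 4412 are found as well.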

Regular expressions can find any UTF-8 character. However, the expression itself must not contain characters with codes greater than 127, such as ä, ö, ü.

If you want to search for characters that have a special meaning in regular expressions, escape them with a backslash.
And take care to escape this backslash in the cScript string itself!
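
The two levels of escaping can be illustrated with Python, whose ordinary string literals treat the backslash much like C-style strings do. This is an analogy under that assumption, not cScript code, and the sample text is invented.

```python
import re

text = "3.14 3x14"

# Unescaped, the dot is a regex metacharacter and matches any character.
unescaped = re.findall("3.14", text)
print(unescaped)  # ['3.14', '3x14']

# Level 1: the regex needs \. to mean a literal dot.
# Level 2: the string literal needs \\ to contain one backslash.
pattern_text = "3\\.14"
print(pattern_text)  # 3\.14

escaped = re.findall(pattern_text, text)
print(escaped)  # ['3.14']
```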

Using character ranges in the above expression does not really improve things:

Look for all occurrences of 1234 in any order of the digits:

456 1234 6748 441234567 64641329 4321 999

Search [1-4][1-4][1-4][1-4]

Only counters lead to better expressions:

	Description	Example
?	0 or 1 time	ab?a (aa, aba)
+	one or more times	ab+a (aba, abba, abbba, ...) a(b|c)+d (abd, acd, abbcd, ...)
*	any number of times, including 0	ab*a (aa, aba, abba, abbba, ...) a(b|c)*d (ad, abd, acd, abcbcbcd, ...)
{n}	exactly n times	ab{3}c (abbbc)
{n,m}	n to m times	ab{2,3}a (abba, abbba)
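
The behavior of the counters can be checked with a few candidate strings. The sketch below uses Python's re module as a stand-in for illustration; the helper matches is hypothetical and only reports which candidates a pattern matches completely.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional = matches(r"ab?a", ["aa", "aba", "abba"])
print(optional)  # ['aa', 'aba']

one_or_more = matches(r"ab+a", ["aa", "aba", "abbba"])
print(one_or_more)  # ['aba', 'abbba']

any_times = matches(r"ab*a", ["aa", "aba", "abbba"])
print(any_times)  # ['aa', 'aba', 'abbba']

exactly = matches(r"ab{3}c", ["abbc", "abbbc", "abbbbc"])
print(exactly)  # ['abbbc']

ranged = matches(r"ab{2,3}a", ["aba", "abba", "abbba", "abbbba"])
print(ranged)  # ['abba', 'abbba']

group_plus = matches(r"a(b|c)+d", ["ad", "abd", "acd", "abbcd"])
print(group_plus)  # ['abd', 'acd', 'abbcd']
```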

With counters, the above expression becomes concise.

Look for all occurrences of 1234 in any order of the digits:

456 1234 6748 441234567 64641329 4321 999

Search [1-4]{4}
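
The three spellings used so far describe the same pattern. A quick check with Python's re module (a stand-in for illustration only) on the sample string:

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

variants = [
    r"[1234][1234][1234][1234]",  # explicit character lists
    r"[1-4][1-4][1-4][1-4]",      # character ranges
    r"[1-4]{4}",                  # range plus counter
]

# Run every variant against the same text and compare the results.
results = [re.findall(pattern, text) for pattern in variants]
print(results[0])                           # ['1234', '4412', '4132', '4321']
print(results[0] == results[1] == results[2])  # True
```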

With \b you can describe word boundaries. If you want to find whole words only with the above expression, simply add \b at the beginning and the end of the expression.

Look for all words consisting of the digits 1234 in any order:

456 1234 6748 441234567 64641329 4321 999

Search \b[1-4]{4}\b
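
The effect of the word boundaries can be compared directly. This sketch again uses Python's re module as a stand-in for illustration:

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

# Without boundaries, matches inside longer numbers are found too.
anywhere = re.findall(r"[1-4]{4}", text)
print(anywhere)  # ['1234', '4412', '4132', '4321']

# With \b on both sides, only complete words remain.
whole_words = re.findall(r"\b[1-4]{4}\b", text)
print(whole_words)  # ['1234', '4321']
```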

Any part of a regular expression can be set in parentheses. You can refer to these subexpressions using the following descriptors:

	Domain	Description
\0	replace string	Current value of the complete expression
\1, \2, ..., \9	regular expression and replace string	Current value of the first, second, ... subexpression. Using \1, ..., \9 in the regular expression, you can, for example, search for numbers beginning and ending with the same digit: \b([0-9])[0-9]*\1\b
\pC[0-9]	replace string	Repeats the character C as many times as the current value is long. The ASCII code of C must be in the range [1-127].
\u[0-9]	replace string	Upper-cased current value
\l[0-9]	replace string	Lower-cased current value
\r[0-9]	replace string	Reversed current value
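
A back-reference inside the search expression can be tried as follows. The sketch uses Python's re module as a stand-in for illustration and an invented sample string; it searches for numbers that begin and end with the same digit.

```python
import re

text = "121 1231 45 7 3443 989 10"

# \1 must repeat exactly what the first parenthesized part matched.
same_ends = re.compile(r"\b([0-9])[0-9]*\1\b")

hits = [m.group(0) for m in same_ends.finditer(text)]
print(hits)  # ['121', '1231', '3443', '989']
```

A single digit such as 7 is not found, because the pattern needs one digit for the group and a second one for \1.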

Find all 3-digit substrings enclosed by letters and swap these letters:

000v5664w358x00l345m50v523w1f789g6040h928i01
000v5664x358w00m345l50w523v1g789f6040i928h01

Search ([a-z])([1-9]{3})([a-z])
Replace \3\2\1
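
The swap can be reproduced with Python's re.sub, which happens to accept the same \3\2\1 notation in its replacement string. Python is used here only as a stand-in for illustration.

```python
import re

source = "000v5664w358x00l345m50v523w1f789g6040h928i01"

# Group 1: leading letter, group 2: three digits 1-9, group 3: trailing letter.
swapped = re.sub(r"([a-z])([1-9]{3})([a-z])", r"\3\2\1", source)
print(swapped)  # 000v5664x358w00m345l50w523v1g789f6040i928h01
```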

Find all words containing an r and replace each by the same word in upper case:

Search \b([^[:space:][:punct:]]*r[^[:space:][:punct:]]*)\b
Replace \u0
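
The \u0 descriptor is specific to the replace strings described here. As a rough emulation, the sketch below uses Python's re.sub with a replacement function. Two assumptions are made: Python does not support the [:space:] and [:punct:] classes, so \w is used as an approximation of "neither space nor punctuation", and the sample sentence is invented.

```python
import re

text = "Regular expressions are great fun"

def upper_whole_match(match):
    """Emulates \\u0: return the complete match in upper case."""
    return match.group(0).upper()

# Approximation of the search above: a word containing a lower-case r.
result = re.sub(r"\b(\w*r\w*)\b", upper_whole_match, text)
print(result)  # REGULAR EXPRESSIONS ARE GREAT fun
```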

Replace all paragraph tags of a tagged text by an HTML comment containing the upper-cased style name.

Search <ParaStyle:((([^\>])|(\\>))*)>
Replace <!--\u1-->
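
A rough emulation of this replacement in Python (a stand-in for illustration, not cScript): the \u1 descriptor is replaced by a function that upper-cases the first subexpression. The sketch assumes the standard HTML comment delimiters <!-- and -->, and the tagged text is invented.

```python
import re

tagged = "<ParaStyle:Head1>Title<ParaStyle:body>Text"

def to_comment(match):
    """Emulates \\u1 inside an HTML comment."""
    return "<!--" + match.group(1).upper() + "-->"

# Group 1 holds the style name between "<ParaStyle:" and ">".
html = re.sub(r"<ParaStyle:((([^\>])|(\\>))*)>", to_comment, tagged)
print(html)  # <!--HEAD1-->Title<!--BODY-->Text
```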

Look for all tags of a tagged text:

Search (<[a-zA-Z][a-zA-Z0-9_]*:(<)?)((([^><])|(\\>)|(\\<))*)(>)*

PCRE handles the search for regular expressions in three steps:

Compile (pcre_compile): This step compiles a regular expression into an internal form.
Study (pcre_study): This step studies a compiled pattern to see if additional information can be extracted that might speed up matching.
Execute (pcre_exec): This step matches a compiled regular expression against a given subject string, using a matching algorithm that is similar to Perl's.

Each of the three steps, which are executed automatically for every regular expression search using PCRE, can receive additional options. The available options are listed below. For more information about the options, please search the web for the option names (e.g., PCRE_BSR_ANYCRLF).

The option names are not defined in cScript; please use the corresponding numbers. To use several options, combine the numbers with | (bitwise OR). The options are used exclusively for processing regular expressions with PCRE.
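
Combining option numbers works like any other bit mask. The sketch below is written in Python for illustration only; the constant names are defined locally because, as stated above, cScript does not define them, and the values are copied from the option tables of this page.

```python
# Option numbers as listed in the option tables of this page.
PCRE_CASELESS = 0x00000001
PCRE_DOTALL = 0x00000004
PCRE_NEWLINE_CR = 0x00100000
PCRE_NEWLINE_LF = 0x00200000
PCRE_NEWLINE_CRLF = 0x00300000

# Independent options are combined with the bitwise OR operator.
compile_options = PCRE_CASELESS | PCRE_DOTALL
print(hex(compile_options))  # 0x5

# The newline options share bits: OR-ing CR and LF yields the CRLF value,
# so it seems safest to pass only one newline option at a time.
overlap = PCRE_NEWLINE_CR | PCRE_NEWLINE_LF
print(overlap == PCRE_NEWLINE_CRLF)  # True
```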

This step compiles a regular expression into an internal form. The option PCRE_UTF8 is always activated by default. On Windows, the option PCRE_UCP is activated as well.

Option name	Value	Description
PCRE_ANCHORED	0x00000010	Force pattern anchoring
PCRE_AUTO_CALLOUT	0x00004000	Compile automatic callouts
PCRE_BSR_ANYCRLF	0x00800000	\R matches only CR, LF, or CRLF
PCRE_BSR_UNICODE	0x01000000	\R matches all Unicode line endings
PCRE_CASELESS	0x00000001	Do caseless matching
PCRE_DOLLAR_ENDONLY	0x00000020	$ not to match newline at end
PCRE_DOTALL	0x00000004	. matches anything including NL
PCRE_DUPNAMES	0x00080000	Allow duplicate names for subpatterns
PCRE_EXTENDED	0x00000008	Ignore white space and # comments
PCRE_EXTRA	0x00000040	PCRE extra features (not much use currently)
PCRE_FIRSTLINE	0x00040000	Force matching to be before newline
PCRE_JAVASCRIPT_COMPAT	0x02000000	JavaScript compatibility
PCRE_MULTILINE	0x00000002	^ and $ match newlines within data
PCRE_NEVER_UTF	0x00010000	Lock out UTF, e.g. via (*UTF)
PCRE_NEWLINE_ANY	0x00400000	Recognize any Unicode newline sequence
PCRE_NEWLINE_ANYCRLF	0x00500000	Recognize CR, LF, and CRLF as newline sequences
PCRE_NEWLINE_CR	0x00100000	Set CR as the newline sequence
PCRE_NEWLINE_CRLF	0x00300000	Set CRLF as the newline sequence
PCRE_NEWLINE_LF	0x00200000	Set LF as the newline sequence
PCRE_NO_AUTO_CAPTURE	0x00001000	Disable numbered capturing parentheses (named ones available)
PCRE_NO_START_OPTIMIZE	0x04000000	Disable match-time start optimizations
PCRE_NO_UTF8_CHECK	0x00002000	Do not check the pattern for UTF-8 validity
PCRE_UCP	0x20000000	Use Unicode properties for \d, \w, etc.
PCRE_UNGREEDY	0x00000200	Invert greediness of quantifiers
PCRE_UTF8	0x00000800	Run in pcre_compile() UTF-8 mode

This step studies a compiled pattern to see if additional information can be extracted that might speed up matching.

Option name	Value	Description
PCRE_STUDY_JIT_COMPILE	0x0001	Request just-in-time compilation if possible

This step matches a compiled regular expression against a given subject string, using a matching algorithm that is similar to Perl's.

Option name	Value	Description
PCRE_ANCHORED	0x00000010	Match only at the first position
PCRE_BSR_ANYCRLF	0x00800000	\R matches only CR, LF, or CRLF
PCRE_BSR_UNICODE	0x01000000	\R matches all Unicode line endings
PCRE_NEWLINE_ANY	0x00400000	Recognize any Unicode newline sequence
PCRE_NEWLINE_ANYCRLF	0x00500000	Recognize CR, LF, and CRLF as newline sequences
PCRE_NEWLINE_CR	0x00100000	Recognize CR as the only newline sequence
PCRE_NEWLINE_CRLF	0x00300000	Recognize CRLF as the only newline sequence
PCRE_NEWLINE_LF	0x00200000	Recognize LF as the only newline sequence
PCRE_NOTBOL	0x00000080	Subject string is not the beginning of a line
PCRE_NOTEOL	0x00000100	Subject string is not the end of a line
PCRE_NOTEMPTY	0x00000400	An empty string is not a valid match
PCRE_NOTEMPTY_ATSTART	0x10000000	An empty string at the start of the subject is not a valid match
PCRE_NO_START_OPTIMIZE	0x04000000	Do not do "start-match" optimizations
PCRE_NO_UTF8_CHECK	0x00002000	Do not check the subject for UTF-8 validity
PCRE_PARTIAL	0x00008000	Return PCRE_ERROR_PARTIAL for a partial match if no full matches are found
PCRE_PARTIAL_SOFT	0x00008000	Return PCRE_ERROR_PARTIAL for a partial match if no full matches are found
PCRE_PARTIAL_HARD	0x08000000	Return PCRE_ERROR_PARTIAL for a partial match if that is found before a full match

Since

Available
priint:comet InDesign® Plug-Ins, comet_pdf
