Regular expressions.

Regular expressions describe character patterns in strings. For example, you can find all paragraphs in a tagged text no matter which style the paragraph uses:

<ParaStyle:Style1>...
<ParaStyle:Stil1>...
<ParaStyle:>
<ParaStyle:2\>3>

Search    <ParaStyle:(<0x[0-9A-F]{4,4}>|\\\\>|\\\\<|[^<>])*>
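As a rough illustration, the pattern can be tried outside of InDesign with Python's re module as a stand-in for PCRE. The sketch assumes that the doubled backslashes in the pattern above come from cScript string escaping, so that the expression reaching the regex engine contains \\> and \\<. This reading is an assumption; the text does not state it.

```python
import re

# Assumption: the regex engine receives \\> and \\< (one escaped backslash
# followed by > or <); the extra backslashes shown above are cScript escaping.
PARA_TAG = re.compile(r"<ParaStyle:(<0x[0-9A-F]{4,4}>|\\>|\\<|[^<>])*>")

samples = [
    "<ParaStyle:Style1>Text",
    "<ParaStyle:Stil1>Text",
    "<ParaStyle:>Text",
    r"<ParaStyle:2\>3>Text",
]

# Collect the matched tag at the start of every sample line.
tags = [PARA_TAG.match(line).group(0) for line in samples]
for tag in tags:
    print(tag)
```

Each sample yields its complete paragraph tag, including the one whose style name contains an escaped >.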

You can use regular expressions in the following situations:

There are many dialects of regular expressions. The priint:comet InDesign® Plug-Ins support two of them:

Since v3.4 R9000 and CS5, regular expressions are parsed by PCRE only. Only CS4 still uses the old implementation for GNU-compatible regular expressions.

GNU-compatible regular expressions support the basic functionality of regular expressions (characters and substrings, counters, word boundaries and sub expressions).

PCRE supports the full functionality expected from modern string matchers. Please take care to escape every \ in strings with a preceding \.

If you are not familiar with regular expressions, you can find many descriptions and examples on the web. If you want to learn about the theory behind regular expressions, search for formal languages; the languages described by regular expressions are a subfamily of the formal languages.

An excellent description of PCRE can be found here.
The online regex tester is extremely helpful.

To tell strreplace and strstrpos that the search string is a regular expression, use one of the prefixes

    regexp: or pcre:

Since v3.4 R9000 and CS5 the prefix regexp: points to PCRE too.
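The effect of the prefixes can be pictured with a small Python emulation. The helper find_pos below is hypothetical and is not the signature of strstrpos; it only sketches the idea that a prefixed search string is treated as a regular expression and an unprefixed one as plain text.

```python
import re

REGEX_PREFIXES = ("regexp:", "pcre:")

def find_pos(text: str, needle: str) -> int:
    """Hypothetical helper: index of the first hit, or -1 if there is none."""
    for prefix in REGEX_PREFIXES:
        if needle.startswith(prefix):
            # Prefixed search strings are treated as regular expressions.
            match = re.search(needle[len(prefix):], text)
            return match.start() if match else -1
    # Without a prefix the search string is taken literally.
    return text.find(needle)

print(find_pos("a1b22c", "pcre:[0-9]{2}"))  # regular expression search
print(find_pos("a1b22c", "[0-9]{2}"))       # literal search
```

The first call finds the two digits at index 3; the second call looks for the literal characters [0-9]{2} and finds nothing.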

Characters and substrings in regular expressions stand for themselves.

Search for all occurrences of 1234 in a given string:

456 1234 6748 441234567 64641329 4321 4321 999

Search   1234

Of course, a simple search can solve this problem too. Things get harder if you allow any order of the digits 1, 2, 3, 4. (You could search for each of the 24 combinations ...)

Look for all occurrences of 1234 in any order of the digits:

456 1234 6748 441234567 64641329 4321 999

Search    [1234][1234][1234][1234]

Every [] expression matches exactly one of the characters inside the brackets. The brackets may contain any number of ASCII characters or ASCII character ranges. So [a-zA-Z_] will match any letter and the underscore. Use [^...] if the listed characters must not match. [^ \t\r\n] will match any character except the blank, the tab and the line delimiters.
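A short Python sketch (Python's re module standing in for the plug-in's regex engine) shows the bracket expressions at work on the sample string. Note that the classes also accept repeated digits, so a hit such as 4412 is reported although it is not a rearrangement of 1234; the sort-and-compare filter is an addition for illustration only.

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

# Each bracket expression matches exactly one character from its set.
hits = re.findall(r"[1234][1234][1234][1234]", text)
print(hits)

# Keep only the hits that use each of the digits 1, 2, 3, 4 exactly once.
permutations = [h for h in hits if sorted(h) == list("1234")]
print(permutations)

# A negated class: runs of characters that are not blank, tab or line end.
words = re.findall(r"[^ \t\r\n]+", "a_1\tb  c\n")
print(words)
```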

Regular expressions can find any UTF-8 character. The expression itself, however, must not contain characters above ASCII code 127, such as ä, ö, ü.

If you want to search for characters that have a special meaning in regular expressions, use the backslash to escape them.
And take care to escape this backslash in the cScript string itself!
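The double escaping can be illustrated in Python, whose ordinary string literals treat the backslash much like C-style strings do. This is only an analogy; the exact escaping rules of cScript strings are assumed to be comparable.

```python
import re

text = "a[1] a1"

# Unescaped, [1] is a bracket expression that matches the single character 1.
unescaped = re.findall("[1]", text)
print(unescaped)

# To match the brackets literally, the regex engine must see \[1\].
# In an ordinary string literal each backslash has to be doubled ...
escaped = re.findall("\\[1\\]", text)
print(escaped)

# ... while a raw string passes single backslashes through unchanged.
escaped_raw = re.findall(r"\[1\]", text)
print(escaped_raw)
```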

Using character ranges in the above expression does not improve things much:

Look for all occurrences of 1234 in any order of the digits:

456 1234 6748 441234567 64641329 4321 999

Search    [1-4][1-4][1-4][1-4]

Counters lead to better expressions:

Counter  Description           Examples
?        zero or one time      ab?a (aa, aba)
+        one or more times     ab+a (aba, abba, abbba, ...)
                               a(b|c)+d (abd, acd, abbcd, ...)
*        any number of times   ab*a (aa, aba, abba, abbba, ...)
                               a(b|c)*d (ad, abd, acd, abcbcbcd, ...)
{n}      exactly n times       ab{3}c (abbbc)
{n,m}    n to m times          ab{2,3}a (abba, abbba)
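The counter examples can be checked with a few lines of Python (again using the re module as a stand-in). The range counter is written without a space inside the braces, as in {2,3}. The helper name matches is introduced here for illustration.

```python
import re

# Pattern -> strings that should match completely, taken from the examples.
examples = {
    r"ab?a": ["aa", "aba"],
    r"ab+a": ["aba", "abba", "abbba"],
    r"a(b|c)+d": ["abd", "acd", "abbcd"],
    r"ab*a": ["aa", "aba", "abba", "abbba"],
    r"a(b|c)*d": ["ad", "abd", "acd", "abcbcbcd"],
    r"ab{3}c": ["abbbc"],
    r"ab{2,3}a": ["abba", "abbba"],
}

def matches(pattern: str, candidate: str) -> bool:
    """True if the whole candidate string is matched by the pattern."""
    return re.fullmatch(pattern, candidate) is not None

for pattern, candidates in examples.items():
    for candidate in candidates:
        print(pattern, candidate, matches(pattern, candidate))

# A few strings that fall outside the limits set by the counters.
print(matches(r"ab?a", "abba"))     # two b's are one too many for ?
print(matches(r"ab+a", "aa"))       # + needs at least one b
print(matches(r"ab{2,3}a", "aba"))  # fewer than two b's
```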

With counters, the above expression becomes much shorter.

Look for all occurrences of 1234 in any order of the digits:

456 1234 6748 441234567 64641329 4321 999

Search    [1-4]{4}

With \b you can describe word boundaries. If you want to find whole words only with the above expression, simply add \b at the beginning and the end of the expression.

Look for all words consisting of the digits 1234 in any order:

456 1234 6748 441234567 64641329 4321 999

Search    \b[1-4]{4}\b
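The difference the word boundaries make can be seen by running both expressions over the sample string, here sketched with Python's re module as a stand-in:

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

# Without boundaries, hits inside longer numbers are reported as well.
anywhere = re.findall(r"[1-4]{4}", text)
print(anywhere)

# With \b on both sides, only complete four-digit words remain.
whole_words = re.findall(r"\b[1-4]{4}\b", text)
print(whole_words)
```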

Any part of a regular expression can be set in parentheses. You can refer to these sub expressions using the following descriptors:

Descriptor       Domain                                 Description
\0               replace string                         Current value of the complete expression
\1, \2, ..., \9  regular expression and replace string  Current value of the first, second, ... sub expression
\pC[0-9]         replace string                         Repeat the letter C as often as the current value is long. The ASCII code of C must be in the range 1-127.
\u[0-9]          replace string                         Upper cased current value
\l[0-9]          replace string                         Lower cased current value
\r[0-9]          replace string                         Reversed current value

Using \1, ..., \9 in the regular expression, you can search for numbers beginning and ending with the same digit:

Search    \b([0-9])[0-9]*\1\b
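The following Python sketch shows a backreference inside the search expression and emulates two of the replace descriptors with replacement functions. Python's re module has no \p or \r descriptors, so the lambdas only imitate the described behaviour; reading \pC[0-9] as "\p, fill character, sub expression number" is an assumption.

```python
import re

text = "121 45 7337 808 12 9"

# \1 inside the pattern must repeat whatever the first group captured.
same_ends = [m.group(0) for m in re.finditer(r"\b([0-9])[0-9]*\1\b", text)]
print(same_ends)

# Emulation of \p*1: one "*" per character of the first sub expression.
masked = re.sub(r"([0-9]+)", lambda m: "*" * len(m.group(1)), "pin 1234 ok")
print(masked)

# Emulation of \r1: the first sub expression reversed.
reversed_word = re.sub(r"([a-z]+)", lambda m: m.group(1)[::-1], "abc 12")
print(reversed_word)
```

Single-digit numbers such as 9 are not found, because the expression requires the captured digit to occur a second time.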

Find all 3-digit substrings enclosed by letters and swap these letters:

000v5664w358x00l345m50v523w1f789g6040h928i01
000v5664x358w00m345l50w523v1g789f6040i928h01

Search    ([a-z])([1-9]{3})([a-z])
Replace   \3\2\1
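The same search and replace can be reproduced in Python, whose re.sub happens to accept \1, \2 and \3 in the replace string as well:

```python
import re

source = "000v5664w358x00l345m50v523w1f789g6040h928i01"

# Groups 1 and 3 are the enclosing letters, group 2 the three digits.
swapped = re.sub(r"([a-z])([1-9]{3})([a-z])", r"\3\2\1", source)
print(swapped)
```

The four-digit run in v5664w is left alone, since the expression allows exactly three digits between the letters.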

Find all words containing an r and replace each by the same word in upper case:

Search   \b([^[:space:][:punct:]]*r[^[:space:][:punct:]]*)\b
Replace  \u0
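A Python approximation of this replacement is sketched below. Python's re module supports neither the POSIX classes [:space:] and [:punct:] nor the \u descriptor, so the sketch uses \w as a rough stand-in for "neither space nor punctuation" and a replacement function in place of \u0.

```python
import re

text = "a red car parked near the tree"

# Stand-in pattern: \w approximates "neither space nor punctuation".
WORD_WITH_R = re.compile(r"\b(\w*r\w*)\b")

# Emulates the replace string \u0: the complete match in upper case.
result = WORD_WITH_R.sub(lambda m: m.group(0).upper(), text)
print(result)
```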

Replace all paragraph tags of a tagged text with an HTML comment containing the upper cased style name.

Search   <ParaStyle:((([^\>])|(\\>))*)>
Replace  <!--\u1-->

Look for all tags in a tagged text:

Search    (<[a-zA-Z][a-zA-Z0-9_]*:(<)?)((([^><])|(\\>)|(\\<))*)(>)*

PCRE handles the search for regular expressions in three steps:

  1. Compile (pcre_compile) : This step compiles a regular expression into an internal form.
  2. Study (pcre_study) : This step studies a compiled pattern, to see if additional information can be extracted that might speed up matching.
  3. Execute (pcre_exec) : This step matches a compiled regular expression against a given subject string, using a matching algorithm that is similar to Perl's.

Each of the three steps, which are executed automatically for every regular expression search using PCRE, can receive additional options. The available options are listed below. For more information about the options, please search the web for the option names (e.g., PCRE_BSR_ANYCRLF).

The option names are not defined in cScript; please use the corresponding numbers. To use several options, combine the numbers with | (bitwise OR). The options are used exclusively for processing regular expressions with PCRE.
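Combining option numbers works like any other bit mask. The Python sketch below uses the numeric values listed for PCRE_CASELESS and PCRE_MULTILINE on this page; the second half uses Python's own re flags only as an analogy for the effect of these two options and does not show how cScript passes them on.

```python
import re

# Numeric values as listed in the option tables on this page.
PCRE_CASELESS = 0x00000001
PCRE_MULTILINE = 0x00000002

# Several options are combined with the bitwise OR operator.
options = PCRE_CASELESS | PCRE_MULTILINE
print(hex(options))

# Python's re module combines its own flags the same way. These flag
# objects belong to Python and are not the numbers cScript expects.
hits = re.findall(r"^abc", "ABC\nabc", re.IGNORECASE | re.MULTILINE)
print(hits)
```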

This step compiles a regular expression into an internal form. The option PCRE_UTF8 is always activated by default. On Windows, the option PCRE_UCP is activated as well.
Option name               Value        Description
PCRE_ANCHORED             0x00000010   Force pattern anchoring
PCRE_AUTO_CALLOUT         0x00004000   Compile automatic callouts
PCRE_BSR_ANYCRLF          0x00800000   \R matches only CR, LF, or CRLF
PCRE_BSR_UNICODE          0x01000000   \R matches all Unicode line endings
PCRE_CASELESS             0x00000001   Do caseless matching
PCRE_DOLLAR_ENDONLY       0x00000020   $ not to match newline at end
PCRE_DOTALL               0x00000004   . matches anything including NL
PCRE_DUPNAMES             0x00080000   Allow duplicate names for subpatterns
PCRE_EXTENDED             0x00000008   Ignore white space and # comments
PCRE_EXTRA                0x00000040   PCRE extra features (not much use currently)
PCRE_FIRSTLINE            0x00040000   Force matching to be before newline
PCRE_JAVASCRIPT_COMPAT    0x02000000   JavaScript compatibility
PCRE_MULTILINE            0x00000002   ^ and $ match newlines within data
PCRE_NEVER_UTF            0x00010000   Lock out UTF, e.g. via (*UTF)
PCRE_NEWLINE_ANY          0x00400000   Recognize any Unicode newline sequence
PCRE_NEWLINE_ANYCRLF      0x00500000   Recognize CR, LF, and CRLF as newline sequences
PCRE_NEWLINE_CR           0x00100000   Set CR as the newline sequence
PCRE_NEWLINE_CRLF         0x00300000   Set CRLF as the newline sequence
PCRE_NEWLINE_LF           0x00200000   Set LF as the newline sequence
PCRE_NO_AUTO_CAPTURE      0x00001000   Disable numbered capturing parentheses (named ones available)
PCRE_NO_START_OPTIMIZE    0x04000000   Disable match-time start optimizations
PCRE_NO_UTF8_CHECK        0x00002000   Do not check the pattern for UTF-8 validity
PCRE_UCP                  0x20000000   Use Unicode properties for \d, \w, etc.
PCRE_UNGREEDY             0x00000200   Invert greediness of quantifiers
PCRE_UTF8                 0x00000800   Run in pcre_compile() UTF-8 mode

This step studies a compiled pattern, to see if additional information can be extracted that might speed up matching.
Option name               Value        Description
PCRE_STUDY_JIT_COMPILE    0x0001       Requests just-in-time compilation if possible.

This step matches a compiled regular expression against a given subject string, using a matching algorithm that is similar to Perl's.
Option name               Value        Description
PCRE_ANCHORED             0x00000010   Match only at the first position
PCRE_BSR_ANYCRLF          0x00800000   \R matches only CR, LF, or CRLF
PCRE_BSR_UNICODE          0x01000000   \R matches all Unicode line endings
PCRE_NEWLINE_ANY          0x00400000   Recognize any Unicode newline sequence
PCRE_NEWLINE_ANYCRLF      0x00500000   Recognize CR, LF, and CRLF as newline sequences
PCRE_NEWLINE_CR           0x00100000   Recognize CR as the only newline sequence
PCRE_NEWLINE_CRLF         0x00300000   Recognize CRLF as the only newline sequence
PCRE_NEWLINE_LF           0x00200000   Recognize LF as the only newline sequence
PCRE_NOTBOL               0x00000080   Subject string is not the beginning of a line
PCRE_NOTEOL               0x00000100   Subject string is not the end of a line
PCRE_NOTEMPTY             0x00000400   An empty string is not a valid match
PCRE_NOTEMPTY_ATSTART     0x10000000   An empty string at the start of the subject is not a valid match
PCRE_NO_START_OPTIMIZE    0x04000000   Do not do "start-match" optimizations
PCRE_NO_UTF8_CHECK        0x00002000   Do not check the subject for UTF-8 validity
PCRE_PARTIAL              0x00008000   Return PCRE_ERROR_PARTIAL for a partial match if no full matches are found
PCRE_PARTIAL_SOFT         0x00008000   Return PCRE_ERROR_PARTIAL for a partial match if no full matches are found
PCRE_PARTIAL_HARD         0x08000000   Return PCRE_ERROR_PARTIAL for a partial match if that is found before a full match


priint:comet InDesign® Plug-Ins, comet_pdf
