Regular expressions describe character patterns in strings. For example, you can find all paragraphs in a tagged text no matter which style the paragraph uses:
<ParaStyle:Style1>...
<ParaStyle:Stil1>...
<ParaStyle:>
<ParaStyle:2\>3>
Search <ParaStyle:(<0x[0-9A-F]{4,4}>|\\\\>|\\\\<|[^<>])*>
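As a rough illustration, the search above can be tried with Python's re module. Python's engine is not PCRE, but it accepts this pattern. The sketch assumes that the doubled backslashes in the pattern above come from string escaping, so the raw string below uses a single escaped backslash; that is an assumption, not something stated here.

```python
import re

# Adapted pattern: a hex entity such as <0x00E4>, an escaped \> or \<,
# or any character that is neither < nor >.
PARA_TAG = re.compile(r"<ParaStyle:(?:<0x[0-9A-F]{4}>|\\>|\\<|[^<>])*>")

lines = [
    "<ParaStyle:Style1>...",
    "<ParaStyle:Stil1>...",
    "<ParaStyle:>",
    r"<ParaStyle:2\>3>",
]

# Collect the matched tag of every sample line.
found = [PARA_TAG.search(line).group(0) for line in lines]
print(found)
```

Every sample line yields its complete paragraph tag, including the one with the escaped >.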
You can use regular expressions in the following situations:
Since v3.4 R9000 and CS5, regular expressions are parsed by PCRE only. Only CS4 still uses the old implementation for GNU-compatible regular expressions.
GNU-conformant regular expressions support the basic functionality of regular expressions (characters and substrings, counters, word boundaries, and subexpressions).
PCRE supports the full functionality expected from modern string matchers. For example, you may use the following features:
If you are not familiar with regular expressions, you can find many descriptions and examples on the web.
If you want to learn about the theory of regular expressions, search for formal languages;
regular expressions are a subfamily of the formal languages.
An excellent description of PCRE can be found here.
The online regex tester is extremely helpful.
To tell strreplace and strstrpos that the search string is a regular expression, use one of the prefixes
regexp: or pcre:
Since v3.4 R9000 and CS5, the prefix regexp: selects PCRE too.
Characters and substrings in a regular expression match themselves.
Search for all occurrences of 1234 in a given string:
456 1234 6748 441234567 64641329 4321 4321 999
Search 1234
Of course, a simple search can solve this problem too. Things get harder if you allow the digits 1, 2, 3, 4 in any order. (You would have to search for all 24 combinations ...)
Look for all occurrences of 1234 in any order of the digits:
456 1234 6748 441234567 64641329 4321 999
Search [1234][1234][1234][1234]
Every [] expression matches exactly one of the characters inside the brackets. The brackets may contain any number of ASCII characters or ASCII character ranges. So [a-zA-Z_] will match any letter and the underscore. Use [^...] if the character must not match. [^ \t\r\n] will match any character except the blank, the tab and the line delimiters.
A regular expression may find any UTF-8 character. But the expression itself must not contain characters with codes greater than 127, such as ä, ö, ü.
If you want to search for characters that have a special meaning in regular expressions, use the backslash to escape them.
And take care to escape this backslash in the cScript string itself!
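The escaping rule can be sketched in Python, used here only for illustration; cScript string literals may differ in detail. The sample text and variable names are made up for the example.

```python
import re

text = "price [EUR]: 3.50"

# "." and "[" are regex metacharacters, so each is escaped with a backslash.
dot_positions = [m.start() for m in re.finditer(r"\.", text)]
bracket = re.search(r"\[[A-Z]+\]", text).group(0)

# In a normal (non-raw) string literal the backslash itself must be doubled,
# much like in a C-style string.
same_pattern = "\\." == r"\."

print(dot_positions, bracket, same_pattern)
```

Without the backslash, the dot would match any character and the bracket would open a character class.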
Using character ranges in the above expression does not make things much better:
Look for all occurrences of 1234 in any order of the digits:
456 1234 6748 441234567 64641329 4321 999
Search [1-4][1-4][1-4][1-4]
Counters lead to better expressions:
Counter | Description | Example |
? | zero or one time | ab?a (aa, aba) |
+ | one or more times | ab+a (aba, abba, abbba, ...), a(b|c)+d (abd, acd, abbcd, ...) |
* | any number of times, including zero | ab*a (aa, aba, abba, abbba, ...), a(b|c)*d (ad, abd, acd, abcbcbcd, ...) |
{n} | exactly n times | ab{3}c (abbbc) |
{n,m} | n to m times | ab{2,3}a (abba, abbba) |
With counters, the above expression becomes concise.
Look for all occurrences of 1234 in any order of the digits:
456 1234 6748 441234567 64641329 4321 999
Search [1-4]{4}
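The three spellings of this search can be compared with a short Python sketch. Python's re module is used only for illustration and is not the PCRE engine described here.

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

# Explicit character list, character range, and range with a counter.
patterns = ["[1234][1234][1234][1234]", "[1-4][1-4][1-4][1-4]", "[1-4]{4}"]

# findall returns the non-overlapping matches from left to right.
results = [re.findall(p, text) for p in patterns]

for pattern, hits in zip(patterns, results):
    print(pattern, hits)
```

All three patterns return the same hits. Note that a character class does not check that each digit occurs only once, so runs such as 4412 inside 441234567 are reported as well.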
With \b you can describe word boundaries. If you want to find whole words only with the above expression, simply add \b at the beginning and the end of the expression.
Look for all words consisting of the digits 1, 2, 3, 4 in any order:
456 1234 6748 441234567 64641329 4321 999
Search \b[1-4]{4}\b
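A minimal Python sketch of the word-boundary search, again using Python's re module only as a stand-in for PCRE:

```python
import re

text = "456 1234 6748 441234567 64641329 4321 999"

# \b requires a word boundary on both sides, so digit runs embedded
# in longer numbers no longer match.
whole_words = re.findall(r"\b[1-4]{4}\b", text)
print(whole_words)
```

Only the standalone words 1234 and 4321 remain; the hits inside 441234567 and 64641329 are gone.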
Any part of a regular expression can be enclosed in parentheses. You can refer to these subexpressions using the following descriptors:
Descriptor | Domain | Description |
\0 | replace string | Current value of the complete expression |
\1, \2, ..., \9 | regular expression and replace string | Current value of the first, second, ... subexpression. Using \1, ..., \9 in the regular expression, you can search for numbers beginning and ending with the same digit: \b([0-9])[0-9]*\1\b |
\pC[0-9] | replace string | Repeats the character C as many times as the current value is long. The ASCII code of C must be in the range [1-127] |
\u[0-9] | replace string | Upper-cased current value |
\l[0-9] | replace string | Lower-cased current value |
\r[0-9] | replace string | Reversed current value |
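The effect of these descriptors can be approximated in Python. Python's re module supports \1 ... \9 but has no \p, \u, \l or \r in replace strings, so the sketch below emulates them with ordinary string operations on the match groups. The sample strings and variable names are illustrative assumptions.

```python
import re

# Back-reference inside the pattern: the first digit must reappear at the end.
same_ends = [m.group(0)
             for m in re.finditer(r"\b([0-9])[0-9]*\1\b", "121 345 7 9009 10")]

m = re.search(r"([A-Za-z]+)-([0-9]+)", "id: Abc-12345")

whole = m.group(0)                # like \0: the complete match
upper = m.group(1).upper()        # like \u1: upper-cased first subexpression
lower = m.group(1).lower()        # like \l1: lower-cased first subexpression
reverse = m.group(2)[::-1]        # like \r2: reversed second subexpression
padded = "*" * len(m.group(2))    # like \p*2: one * per character of the value

print(same_ends, whole, upper, lower, reverse, padded)
```

The single digit 7 is not reported by the back-reference search, because the pattern needs a second digit to match \1.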
Find all 3-digit substrings enclosed by letters and swap these letters:
000v5664w358x00l345m50v523w1f789g6040h928i01
000v5664x358w00m345l50w523v1g789f6040i928h01
Search ([a-z])([1-9]{3})([a-z])
Replace \3\2\1
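The same search and replace can be reproduced with Python's re.sub, which also understands \1 ... \9 in the replace string. This is only an illustration in another engine, not the cScript call itself.

```python
import re

source = "000v5664w358x00l345m50v523w1f789g6040h928i01"

# Groups: leading letter, three digits from 1-9, trailing letter.
# The replace string emits them in the order 3, 2, 1.
swapped = re.sub(r"([a-z])([1-9]{3})([a-z])", r"\3\2\1", source)
print(swapped)
```

The sequences v5664w and 6040h are left alone: the first has four digits between the letters, the second contains a 0, which [1-9] does not accept.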
Find all words containing an r and replace each one with the same word in upper case:
Search \b([^[:space:][:punct:]]*r[^[:space:][:punct:]]*)\b
Replace \u0
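A rough Python equivalent is sketched below. Two things differ from the expression above and are assumptions of the sketch: Python's re module has no \u0 descriptor, so a callback upper-cases the match, and it does not support POSIX classes such as [:space:] inside brackets, so \w is used as an approximation (it also accepts the underscore).

```python
import re

# Approximation of "a run of non-space, non-punctuation characters containing an r".
WORD_WITH_R = re.compile(r"\b\w*r\w*\b")

def upper_words_with_r(text: str) -> str:
    """Return text with every word that contains a lower-case r upper-cased."""
    return WORD_WITH_R.sub(lambda m: m.group(0).upper(), text)

result = upper_words_with_r("The river runs far.")
print(result)
```

The word "The" stays unchanged because it contains no r.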
Replace all paragraph tags of a tagged text by an HTML comment including the upper-cased style name.
Search <ParaStyle:((([^\>])|(\\>))*)>
Replace <!--\u1-->
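This replacement can be sketched in Python as follows. The pattern is adapted rather than copied: the escaped \> alternative is tried first so that an escaped > inside a style name is not taken as the end of the tag, and a callback stands in for \u1, which Python's re module does not provide. The output uses standard HTML comment syntax. The function name is hypothetical.

```python
import re

# Group 1 captures the style name; an escaped \> is consumed as a unit.
PARA_TAG = re.compile(r"<ParaStyle:((?:\\>|[^>])*)>")

def tags_to_comments(tagged: str) -> str:
    """Replace each paragraph tag with an HTML comment holding the upper-cased style name."""
    return PARA_TAG.sub(lambda m: "<!--" + m.group(1).upper() + "-->", tagged)

converted = tags_to_comments("<ParaStyle:Headline>Title")
print(converted)
```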
Look for all tags of a tagged text:
Search (<[a-zA-Z][a-zA-Z0-9_]*:(<)?)((([^><])|(\\>)|(\\<))*)(>)*
PCRE handles the search for regular expressions in three steps:
Each of the three steps, which are executed automatically for every PCRE regular-expression search, can receive additional options. The available options are listed below. For more information about an option, search the web for the option name (e.g., PCRE_BSR_ANYCRLF).
The option names are not defined in cScript; please use the corresponding numbers. To use several options, combine the numbers with | (bitwise OR). The options are used exclusively for processing regular expressions with PCRE.
This step compiles a regular expression into an internal form. The option PCRE_UTF8 is always activated by default. On Windows, the option PCRE_UCP is activated as well.
Option name | Value | Description |
PCRE_ANCHORED | 0x00000010 | Force pattern anchoring |
PCRE_AUTO_CALLOUT | 0x00004000 | Compile automatic callouts |
PCRE_BSR_ANYCRLF | 0x00800000 | \R matches only CR, LF, or CRLF |
PCRE_BSR_UNICODE | 0x01000000 | \R matches all Unicode line endings |
PCRE_CASELESS | 0x00000001 | Do caseless matching |
PCRE_DOLLAR_ENDONLY | 0x00000020 | $ not to match newline at end |
PCRE_DOTALL | 0x00000004 | . matches anything including NL |
PCRE_DUPNAMES | 0x00080000 | Allow duplicate names for subpatterns |
PCRE_EXTENDED | 0x00000008 | Ignore white space and # comments |
PCRE_EXTRA | 0x00000040 | PCRE extra features (not much use currently) |
PCRE_FIRSTLINE | 0x00040000 | Force matching to be before newline |
PCRE_JAVASCRIPT_COMPAT | 0x02000000 | JavaScript compatibility |
PCRE_MULTILINE | 0x00000002 | ^ and $ match newlines within data |
PCRE_NEVER_UTF | 0x00010000 | Lock out UTF, e.g. via (*UTF) |
PCRE_NEWLINE_ANY | 0x00400000 | Recognize any Unicode newline sequence |
PCRE_NEWLINE_ANYCRLF | 0x00500000 | Recognize CR, LF, and CRLF as newline sequences |
PCRE_NEWLINE_CR | 0x00100000 | Set CR as the newline sequence |
PCRE_NEWLINE_CRLF | 0x00300000 | Set CRLF as the newline sequence |
PCRE_NEWLINE_LF | 0x00200000 | Set LF as the newline sequence |
PCRE_NO_AUTO_CAPTURE | 0x00001000 | Disable numbered capturing parentheses (named ones available) |
PCRE_NO_START_OPTIMIZE | 0x04000000 | Disable match-time start optimizations |
PCRE_NO_UTF8_CHECK | 0x00002000 | Do not check the pattern for UTF-8 validity |
PCRE_UCP | 0x20000000 | Use Unicode properties for \d, \w, etc. |
PCRE_UNGREEDY | 0x00000200 | Invert greediness of quantifiers |
PCRE_UTF8 | 0x00000800 | Run in pcre_compile() UTF-8 mode |
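Combining several option numbers is plain bitwise arithmetic. The Python sketch below only shows the arithmetic with values copied from the table above; how the resulting number is handed to a cScript function is not shown here, and the helper name has_option is hypothetical.

```python
# Values copied from the compile-options table.
PCRE_CASELESS = 0x00000001
PCRE_MULTILINE = 0x00000002
PCRE_DOTALL = 0x00000004
PCRE_EXTENDED = 0x00000008

# Bitwise OR sets every bit that is set in any operand.
compile_options = PCRE_CASELESS | PCRE_MULTILINE | PCRE_DOTALL
print(hex(compile_options))

def has_option(options: int, flag: int) -> bool:
    """Return True if all bits of flag are set in options."""
    return (options & flag) == flag

print(has_option(compile_options, PCRE_DOTALL))
print(has_option(compile_options, PCRE_EXTENDED))
```

The PCRE_NEWLINE_* values share bits with each other (for example, 0x00500000 contains 0x00400000 and 0x00100000), so pick exactly one of them rather than combining several.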
This step studies a compiled pattern, to see if additional information can be extracted that might speed up matching.
Option name | Value | Description |
PCRE_STUDY_JIT_COMPILE | 0x0001 | Requests just-in-time compilation if possible. |
This step matches a compiled regular expression against a given subject string, using a matching algorithm that is similar to Perl's.
Option name | Value | Description |
PCRE_ANCHORED | 0x00000010 | Match only at the first position |
PCRE_BSR_ANYCRLF | 0x00800000 | \R matches only CR, LF, or CRLF |
PCRE_BSR_UNICODE | 0x01000000 | \R matches all Unicode line endings |
PCRE_NEWLINE_ANY | 0x00400000 | Recognize any Unicode newline sequence |
PCRE_NEWLINE_ANYCRLF | 0x00500000 | Recognize CR, LF, & CRLF as newline sequences |
PCRE_NEWLINE_CR | 0x00100000 | Recognize CR as the only newline sequence |
PCRE_NEWLINE_CRLF | 0x00300000 | Recognize CRLF as the only newline sequence |
PCRE_NEWLINE_LF | 0x00200000 | Recognize LF as the only newline sequence |
PCRE_NOTBOL | 0x00000080 | Subject string is not the beginning of a line |
PCRE_NOTEOL | 0x00000100 | Subject string is not the end of a line |
PCRE_NOTEMPTY | 0x00000400 | An empty string is not a valid match |
PCRE_NOTEMPTY_ATSTART | 0x10000000 | An empty string at the start of the subject is not a valid match |
PCRE_NO_START_OPTIMIZE | 0x04000000 | Do not do "start-match" optimizations |
PCRE_NO_UTF8_CHECK | 0x00002000 | Do not check the subject for UTF-8 validity |
PCRE_PARTIAL | 0x00008000 | Return PCRE_ERROR_PARTIAL for a partial match if no full matches are found |
PCRE_PARTIAL_SOFT | 0x00008000 | Return PCRE_ERROR_PARTIAL for a partial match if no full matches are found |
PCRE_PARTIAL_HARD | 0x08000000 | Return PCRE_ERROR_PARTIAL for a partial match if that is found before a full match |