PHP Classes

PHP Regex Advanced: Match MSDOS/UNIX patterns with regular expressions

Ratings: 74%
Unique user downloads: total 364, this week 1
Download rankings: all time 6,866, this week 560

Version: advanced-regex 1.0.8
License: BSD License
PHP version: 5.5
Categories: PHP 5, Text processing
Description 

This class can match MSDOS or UNIX patterns with regular expressions among other features. Currently it can:

- Take a given pattern in MSDOS or UNIX format and convert it to a regular expression, to determine whether a given string matches that pattern.
- Take a list of regular expressions and a string to evaluate, and return a string that identifies the sequence of regular expressions that the input string matches.
- Take a list of strings and develop a regular expression that would match those strings.
- Use regular expressions that specify the same named capture group several times.
- Meta-match groups of lines based on a set of supplied regular expressions. Meta-matching is very powerful since a single regular expression can be used to match groups of lines in a file (each line being itself described by its own regular expression).

Innovation Award
PHP Programming Innovation award nominee
September 2015
Number 11


Prize: One downloadable copy of PhpED Professional
Regular expressions are powerful for matching and extracting patterns of text from strings. However, it is sometimes hard to figure out the right pattern to match the strings we need.

This class provides a solution to automatically create regular expressions that match a set of given strings with common patterns.

Manuel Lemos
Name: Christian Vigh <contact>
Classes: 32 packages by
Country: France
Age: 58
All time rank: 13810 in France France
Week rank: 11 Up1 in France France Up
Innovation award
Nominee: 20x
Winner: 3x

Example

<?php
// test Regex class methods
// this script should be run in command-line mode
include ( 'Regex.class.php' ) ;

if  ( php_sapi_name ( ) == 'cli' )
   {
    $eol = PHP_EOL ;
    $tab = "\t" ;
   }
else
   {
    $eol = "<br/>" ;
    $tab = str_repeat ( "&nbsp;", 8 ) ;
   }


// IsRegex() method example
$re =
   [
    '/foo bar/',        // valid
    '#foo bar#imsx',    // valid
    '/foo bar'          // invalid
   ] ;

foreach ( $re as $item )
    echo ( "Regex::IsRegex ( $item ) = " . ( ( Regex::IsRegex ( $item ) ) ? 'true' : 'false' ) . $eol ) ;

echo ( $eol ) ;

// Matches() example
$matches =
   [
    [ 'file.txt', '*.txt' ],            // match
    [ 'file.txt', 'file.*' ],           // match
    [ 'dir/file.txt', '*.txt' ],        // no match
    [ 'dir/file.txt', 'dir/*.txt' ],    // match
    [ 'dir/file.txt', '*/*.txt' ],      // match
    [ 'dir/file.txt', '*\\file.txt' ]   // match
   ] ;

foreach ( $matches as $match )
    echo ( "Regex::Matches ( {$match [0]}, {$match [1]} ) = " . ( ( Regex::Matches ( $match [0], $match [1] ) ) ? 'true' : 'false' ) . $eol ) ;

echo ( $eol ) ;

// DevelopExpression() example
$expressions =
   [
    'file[0-9].txt',        // Gives 'file0.txt' to 'file9.txt'
    'file[a-c][0-2].bin'    // Gives 'filea0.bin' to 'filea2.bin', 'fileb0.bin' to 'fileb2.bin', etc.
   ] ;

foreach ( $expressions as $expression )
   {
    echo ( "Regex::DevelopExpression ( $expression ) = $eol$tab" ) ;
    $developed_expressions = Regex::DevelopExpression ( $expression ) ;
    echo ( rtrim ( str_replace ( "\n", "$eol$tab", print_r ( $developed_expressions, true ) ) ) ) ;
    echo ( "$eol" ) ;
   }

echo ( $eol ) ;

// PregMatchEx() example
$subject = "a:1 b:2" ;
$re = '/(?P<sequence> (?P<letter> [a-z]) : (?P<digit> [0-9])) \s (?P<sequence> (?P<letter> [a-z]) : (?P<digit> [0-9]))/imsx' ;

echo ( "Regex::PregMatchEx ( $subject, $re, WIPE ) : " ) ;
$result = Regex::PregMatchEx ( $re, $subject, $match, PREG_WIPE_MATCHES | PREG_OFFSET_CAPTURE ) ;
echo ( "\t" . str_replace ( "\n", "$eol$tab", print_r ( $match, true ) ) ) ;
echo ( $eol ) ;

echo ( "Regex::PregMatchEx ( $subject, $re, NOWIPE ) : " ) ;
$result = Regex::PregMatchEx ( $re, $subject, $match, PREG_OFFSET_CAPTURE ) ;
echo ( "\t" . str_replace ( "\n", "$eol$tab", print_r ( $match, true ) ) ) ;
echo ( $eol ) ;


// MetaPregMatchEx() example
$regex_list =
   [
    '1' => '/message start/',
    '2' => '/log: \s* (?P<logmessage> .*)/imsx',
    '3' => '/message end/'
   ] ;
$sequence = '/ \1 \2* \3 /imsx' ;
$lines =
   [
    'message start',
    'log: this is log message 1',
    'log: this is log message 2',
    'message end'
   ] ;

echo ( "Regex::MetaPregMatchEx ( ) = " ) ;
$status = Regex::MetaPregMatchEx ( $sequence, $regex_list, $lines ) ;
echo ( ( $status ) ? 'true' : 'false' ) ;


Details

INTRODUCTION

The Regex class provides static methods that encapsulate normal php regex functions (preg_*). It is intended to provide additional functionalities such as :

- Matching filenames using an Msdos/Unix-style convention (eg, "*.txt")
- Developing wildcard expressions ; for example, the expression "file[1-3].txt" could yield an array containing : [ "file1.txt", "file2.txt", "file3.txt" ]
- Using preg matching functions while being able to specify the same named capture group several times
- Using a meta-regular expression and a set of individual regular expressions to match sequences of lines

METHODS

Msdos/Unix style wildcards

Regex::Matches ( $file, $pattern, $case_sensitive = false )

Checks if a filename corresponds to a wildcard mask.

The authorized wildcards are the following :

- "?" : represents zero or one character, except the path delimiter.
- "*" : represents zero or more characters, except the path delimiter.
- character class : represents a range of characters. For example, "[A-Z]" means "all uppercase letters", while "[^a-z]" means "anything but lowercase letters"

Returns true if $pattern matches $file, false otherwise.

Regex::WildcardToRegex ( $pattern, $escaped_chars = "" )

Converts an msdos/unix-style wildcard to a regular expression.
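The conversion can be pictured with a minimal, self-contained sketch. It is not the library's WildcardToRegex() code: the function name wildcard_to_regex_sketch is hypothetical, and treating both "/" and "\" as path delimiters is an assumption drawn from the example script above.

```php
<?php
// Hypothetical helper, NOT the library's implementation: a sketch of
// turning an Msdos/Unix-style wildcard into an anchored regular expression.
function wildcard_to_regex_sketch($pattern, $case_sensitive = false)
{
    $regex  = '';
    $length = strlen($pattern);

    for ($i = 0; $i < $length; $i++) {
        $ch = $pattern[$i];

        if ($ch === '*') {
            $regex .= '[^/\\\\]*';      // zero or more characters, except a path delimiter
        } elseif ($ch === '?') {
            $regex .= '[^/\\\\]?';      // zero or one character, except a path delimiter
        } elseif ($ch === '[' && ($end = strpos($pattern, ']', $i + 1)) !== false) {
            $regex .= substr($pattern, $i, $end - $i + 1);  // copy the character class as-is
            $i = $end;
        } elseif ($ch === '/' || $ch === '\\') {
            $regex .= '[/\\\\]';        // assumption: both delimiters are equivalent
        } else {
            $regex .= preg_quote($ch, '#');
        }
    }

    return '#^' . $regex . '$#' . ($case_sensitive ? '' : 'i');
}

var_dump(preg_match(wildcard_to_regex_sketch('*.txt'), 'file.txt') === 1);          // bool(true)
var_dump(preg_match(wildcard_to_regex_sketch('*.txt'), 'dir/file.txt') === 1);      // bool(false)
var_dump(preg_match(wildcard_to_regex_sketch('dir/*.txt'), 'dir/file.txt') === 1);  // bool(true)
```

Anchoring the expression with ^ and $ is what makes "*.txt" reject "dir/file.txt": the wildcard cannot cross the path delimiter, so the whole name has to fit the mask.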

Regular expressions

Regex::MetaPregMatchEx ( $sequence, $regex_list, $subject_array, &$matches = null, $flags = 0, $match_all = false, $missing_matches = [] )

A meta-matching artefact for regular expressions.

Suppose you have to scan a sequence of lines, such as in a log file. You want to recognize which sequence follows which pattern.

A sequence in an example log file could be, for example :

- A line containing "message start"
- Any number of lines starting with "log:" and followed by any sequence of characters
- A line containing "message end"

The following example gives a layout of such a log file :

 message start
 log: message 1
 log: message 2
 ...
 log: message n
 message end

The purpose is to check whether a sequence of lines matches this scheme ; a set of regular expressions is first needed to match each particular line in a sequence :

 $regex_list =
    [
 		'1' => '/message start/',
 		'2' => 'log: \s* (?P<logmessage> .*)',
 		'3' => '/message end/'
     ] ;

Then, to match a set of lines containing 'message start', having an unlimited number of lines starting with 'log:', then ending with a line containing 'message end', you would have to provide a regular expression using a backreference-style syntax referencing the keys of our $regex_list array, which would give :

 $sequence	=  '\1 \2* \3' ;

meaning :

- The first line must be the one identified by '\1', ie 'message start'
- There can be any number of lines identified by '\2', ie starting with 'log:'
- The last line must be 'message end'

Note that each $regex_list item is a regular expression which can contain group captures, either named or not.

If it does not contain re delimiters, then '/ /imsx' is assumed, so do not forget that spaces will not be significant.

Thus, to check whether a set of lines (in an array) matches the regular expressions specified in $sequence and defined in $regex_list, a simple call is enough :

 
 $status = Regex::MetaPregMatchEx ( $sequence, $regex_list, $lines ) ;

In this method, the sequence parameter is a regular expression containing preg backreference-style constructs that refer to array keys in the $regex_list array.

The following preg-style backreferences are supported ('x' stands for a sequence of digits, 'name' for a group capture name) :

- \x
- \gx
- \g{x}
- (?P=name)
- \k<name>
- \k'name'
- \k{name}
- \g{name}

The regex_list parameter is an associative array whose keys are backreference ids (either the 'x' or the 'name' string described in the $sequence parameter help) and whose values are regular expressions.

Each entry is meant to match one or more lines of a sequence of lines.

If no delimiter encloses the regex, then a default delimiter '/' will be used, and the 'imsx' preg options will be automatically added before performing the match.

The $subject_array parameter is an array of input lines to be matched against the specified sequence.

The $matches parameter is a reference to an array which will receive the individual matches.

Each entry is an associative array having the following keys :

- 'reference' : the original string reference.
- 'regex' : the regex that matched the line.
- 'matches' : array of matches. Note that since the method uses PregMatchEx(), an additional level of indirection is added compared with Regex::PregMatch(), since several captures can have the same name.

$flags is a combination of PREG_... flags.

$missing_matches, if specified, will receive the indexes of the non-matching lines.

The function will return true if the specified lines match the sequence, and false otherwise.
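One way to picture meta-matching is as two passes: label each line with the key of the regex that recognizes it, then match the resulting label string against the sequence. The sketch below follows that idea only. It is not the library's algorithm; the function name meta_match_sketch is hypothetical, only the numeric \x reference form is handled, $sequence is assumed to have no delimiters, and every entry of $regex_list is assumed to be a delimited regex.

```php
<?php
// Hypothetical sketch of the meta-matching idea; not the library's code.
function meta_match_sketch($sequence, array $regex_list, array $lines)
{
    // Pass 1: label every line with the key of the first regex that matches it.
    $labels = '';
    foreach ($lines as $line) {
        $label = null;
        foreach ($regex_list as $key => $regex) {
            if (preg_match($regex, $line) === 1) {
                $label = $key;
                break;
            }
        }
        if ($label === null) {
            return false;               // no regex recognizes this line
        }
        $labels .= '<' . $label . '>';
    }

    // Pass 2: turn each \x reference into a group matching the label "<x>".
    $meta = preg_replace('/\\\\(\d+)/', '(?:<$1>)', $sequence);

    // The x flag makes the spaces of $sequence insignificant.
    return preg_match('/^' . $meta . '$/x', $labels) === 1;
}

$regex_list = [
    '1' => '/message start/',
    '2' => '/log: \s* (?P<logmessage> .*)/imsx',
    '3' => '/message end/',
];
$lines = [
    'message start',
    'log: this is log message 1',
    'log: this is log message 2',
    'message end',
];

var_dump(meta_match_sketch('\1 \2* \3', $regex_list, $lines));   // bool(true)
```

Here the four lines are labelled "<1><2><2><3>", which the rewritten sequence (?:<1>) (?:<2>)* (?:<3>) accepts.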

Regex::PregMatch ( $pattern, $subject, &$matches = null, $flags = 0, $offset = 0 )

Encapsulates the preg_match() function and optionally wipes unnamed captures from the returned $matches array if the custom PREG_WIPE_MATCHES flag is specified.

Regex::PregMatchAll ( $pattern, $subject, &$matches = null, $flags = 0, $offset = 0 )

Encapsulates the preg_match_all() function and optionally wipes unnamed captures from the returned $matches array if the custom PREG_WIPE_MATCHES flag is specified.

Regex::PregReplace ( $pattern, $replacement, $subject, $limit = -1, $count = null )

Encapsulates the preg_replace() function.

Regex::PregMatchEx ( $pattern, $subject, &$matches = null, $flags = 0, $offset = 0 )

An extended version of Regex::PregMatch() that allows for specifying multiple named captures of the same name.

The $matches array will hold the captured groups. Since named captures of the same name can be specified more than once, each array item will contain an additional level of indirection : an array for each matched item.

Thus, the elements of a capture group named <pattern> will be accessible through the following expressions :

- $matches [ 'pattern' ] [0] yields the first expression matched by the named capture "pattern"
- count ( $matches [ 'pattern' ] ) gives the number of expressions matched by the named capture "pattern"

Regex::PregMatchAllEx ( $pattern, $subject, &$matches = null, $flags = 0, $offset = 0 )

An extended version of Regex::PregMatchAll() that allows for specifying multiple named captures of the same name.

Empty subarrays or subarrays having an offset of -1 will be removed from the resulting matches.

Expression development

Regex::DevelopExpression ( $expression, $limit = 10000 )

Expands a factorized string expression.

Sometimes, it is necessary to represent a set of values with a factorized expression, much as the shell lets us match a set of files using a pattern.

This method processes input strings that contain character classes and generates an array of values that represent all the possible combinations. For example :

"file[a-c].txt"

will expand to the following array of strings :

[ "filea.txt", "fileb.txt", "filec.txt" ]

Currently, character classes can only be alphabetic or alphanumeric, such as in the following example :

"file[a-b][0-1]"

which will expand to :

[ "filea0", "filea1", "fileb0", "fileb1" ]

Numeric values can be zero-padded, using an optional integer width preceded by a slash, like in the following example :

"file[0-1]/4"

which will expand to :

[ "file0000", "file0001" ]

For alphabetic character classes, the case of the first character determines the case of the expanded result ; for example :

"file[A-c]"

will give :

[ "fileA", "fileB", "fileC" ]

Finally, angle brackets can be escaped using the backslash character.

The $limit parameter is set to an arbitrary value of 10000 to limit the number of results returned.
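The basic expansion can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name develop_expression_sketch is hypothetical, only single ranges of the form [x-y] are recognized, and the zero-padding, mixed-case and escaping rules described above are left out.

```php
<?php
// Hypothetical sketch: expands single "[x-y]" ranges only.
function develop_expression_sketch($expression, $limit = 10000)
{
    // Split into literal text and "[x-y]" ranges, keeping the ranges.
    $parts = preg_split(
        '/(\[[A-Za-z0-9]-[A-Za-z0-9]\])/',
        $expression,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $results = [''];

    foreach ($parts as $part) {
        // A literal contributes itself; a range contributes every value it covers.
        $choices = preg_match('/^\[([A-Za-z0-9])-([A-Za-z0-9])\]$/', $part, $m)
            ? range($m[1], $m[2])
            : [$part];

        $next = [];
        foreach ($results as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
                if (count($next) >= $limit) {
                    break 2;            // crude cap on the number of results
                }
            }
        }
        $results = $next;
    }

    return $results;
}

print_r(develop_expression_sketch('file[a-b][0-1]'));
// [ "filea0", "filea1", "fileb0", "fileb1" ]
```

Each range multiplies the number of results, which is why a cap such as $limit is useful: three [0-9] classes already produce 1000 strings.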

Various methods

Regex::CombinationsOf ( $array, $limit )

Takes an array containing values and nested arrays, and generates all the possible combinations, each nested array providing alternatives for the generation.

For example, the following input array :

[ [ 'a', 'b' ], 1, 2, [ 'x', 'y', 'z' ] ]

will generate the following results :

   [
 		[0] => [ 'a', 1, 2, 'x' ] 
 		[1] => [ 'b', 1, 2, 'x' ] 
 		[2] => [ 'a', 1, 2, 'y' ] 
 		[3] => [ 'b', 1, 2, 'y' ] 
 		[4] => [ 'a', 1, 2, 'z' ] 
 		[5] => [ 'b', 1, 2, 'z' ] 
 	]

Note that the combination generation is computed from left to right, and that only one level of nesting is allowed.

The $limit parameter is set to an arbitrary value to limit the number of results returned.
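The ordering shown above, where the leftmost nested array varies fastest, can be reproduced with a short sketch. The function name combinations_of_sketch and the default value of $limit are assumptions made for illustration; this is not the library's code.

```php
<?php
// Hypothetical sketch reproducing the documented output order.
function combinations_of_sketch(array $array, $limit = 10000)
{
    $results = [[]];

    foreach ($array as $item) {
        // A scalar is a single choice; a nested array provides alternatives.
        $choices = is_array($item) ? $item : [$item];
        $next    = [];

        // Looping on choices first keeps the leftmost alternatives varying fastest.
        foreach ($choices as $choice) {
            foreach ($results as $partial) {
                if (count($next) >= $limit) {
                    break 2;
                }
                $partial[] = $choice;
                $next[]    = $partial;
            }
        }
        $results = $next;
    }

    return $results;
}

print_r(combinations_of_sketch([['a', 'b'], 1, 2, ['x', 'y', 'z']]));
```

Swapping the two inner loops would still generate the same six combinations, but with the rightmost nested array varying fastest instead.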

Regex::IsRegex ( $expression )

Checks if the specified expression is a valid regular expression.

The $delimiter parameter indicates the delimiter character that is to be used for the regular expression. If not specified, the delimiter character will be taken from the first character of the specified regular expression.

IsRegex returns true if $expression is a valid regular expression, and false otherwise.

Note that the PCRE package allows for regular expressions to be delimited by the following symmetric characters : [], {}, <> and (). The IsRegex method behaves the same way.
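A common way to perform this kind of check in plain PHP is to let PCRE try to compile the expression. The sketch below assumes nothing about how IsRegex() is implemented; the function name is_regex_sketch is hypothetical.

```php
<?php
// Hypothetical sketch: relies on preg_match() returning false (and raising
// a warning, silenced here with @) when the pattern fails to compile.
function is_regex_sketch($expression)
{
    return @preg_match($expression, '') !== false;
}

var_dump(is_regex_sketch('/foo bar/'));      // bool(true)
var_dump(is_regex_sketch('#foo bar#imsx'));  // bool(true)
var_dump(is_regex_sketch('/foo bar'));       // bool(false): no ending delimiter
var_dump(is_regex_sketch('{foo bar}i'));     // bool(true): bracket-style delimiters
```

The comparison must be strict (!== false), since preg_match() returns 0 for a valid pattern that simply does not match the empty subject.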

Regex::MultiSubstrReplace ( $subject, $replacements )

Performs multiple substring replacements within the same string.

$subject is the string to be processed. $replacements is an array of arrays containing 3 elements :

- The string to be replaced in $subject
- The replacement string
- The offset, in $subject, of the string to be replaced.

This function can be used with an array based on a match list returned by the preg_match_all() function. Internally, this function is used by the NormalizeMetaSequence() and RenumberNamedCaptures() methods.
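When replacements are located by offset, one replacement can shift the offsets of those that follow it. A usual remedy, sketched below, is to apply them from the highest offset down. The function name multi_substr_replace_sketch is hypothetical and the library may proceed differently.

```php
<?php
// Hypothetical sketch: each replacement is [ search, replace, offset ].
function multi_substr_replace_sketch($subject, array $replacements)
{
    // Highest offset first, so earlier offsets stay valid after each change.
    usort($replacements, function ($a, $b) {
        return $b[2] - $a[2];
    });

    foreach ($replacements as $replacement) {
        list($search, $replace, $offset) = $replacement;
        $subject = substr_replace($subject, $replace, $offset, strlen($search));
    }

    return $subject;
}

echo multi_substr_replace_sketch('a:1 b:2', [['1', 'one', 2], ['2', 'two', 6]]);
// a:one b:two
```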

Internal methods

This section lists methods that are used internally by the Regex class, but are made public in case of...

Regex::GroupNamedCaptures ( $match, $replacements )

Used by the Regex::Preg...Ex methods. Once capture groups have been renamed by the Regex::RenumberNamedCaptures() method to ensure unique capture names, this function is called to rename back the matched elements returned by the preg_... functions in order to regroup named captures together, using their original names.

The $match parameter is returned by one of the preg_... functions.

$replacements is an associative array returned by the Regex::RenumberNamedCaptures(), whose keys are the computed unique capture group names, and values are the original capture names (which can be present several times in the original regular expression).

Regex::NormalizeMetaSequence ( $sequence, $subsequences = null )

Normalizes a meta-sequence, which uses preg-like backreference syntax to reference regular expressions indexed by the backreference value in the $match_definitions array.

The method accepts all the backreference syntaxes that are recognized by the preg_replace function ('x' stands for a sequence of digits, 'name' for a group capture name) :

- \x
- \gx
- \g{x}
- (?P=name)
- \k<name>
- \k'name'
- \k{name}
- \g{name}

All those forms are normalized in the input sequence as (\x) or (\name) (note the enclosing parentheses to prevent side effects when performing the match).

Regex::PregWipeMatches ( &$matches, $flags )

Removes unnamed captures from the result of a call to a preg_* function.

The $flags parameter is the one supplied to a preg_match...() function ; it is used to determine whether the PREG_OFFSET_CAPTURE flag has been specified. If so, an unmatched capture in $matches may be either an empty string or a two-element array whose second element (the offset) is -1 ; such entries are removed from the result.
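The core of the wiping step can be sketched with plain preg_match(): named captures appear in $matches under string keys, unnamed ones under integer keys. The function name wipe_unnamed_captures_sketch is hypothetical, and the PREG_OFFSET_CAPTURE clean-up described above is not covered.

```php
<?php
// Hypothetical sketch: keeps only the entries stored under string keys.
function wipe_unnamed_captures_sketch(array $matches)
{
    $named = [];
    foreach ($matches as $key => $value) {
        if (is_string($key)) {
            $named[$key] = $value;
        }
    }
    return $named;
}

preg_match('/(?P<letter>[a-z]):(?P<digit>[0-9])/', 'a:1', $matches);
// $matches has the keys 0, 'letter', 1, 'digit' and 2
$named = wipe_unnamed_captures_sketch($matches);
print_r($named);
// [ 'letter' => 'a', 'digit' => '1' ]
```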

Regex::RenumberNamedCaptures ( $pattern, &$correspondances = [], $prefix = 'match_' )

Reassigns unique identifiers to named captures within a regular expression. The new identifiers will have the form "prefix_x", where "prefix" is given by the $prefix parameter, and "x" a unique identifier starting from 0.

On output, the $correspondances array will hold an associative array whose keys are the new capture group names, and values the old ones.
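The renaming and regrouping steps described in this section can be illustrated together. The sketch below is not the library's code: the function name renumber_named_captures_sketch is hypothetical, only the (?P<name> ...) syntax is handled, and building the new names as $prefix followed by a counter (match_0, match_1, ...) is an assumption about the naming scheme.

```php
<?php
// Hypothetical sketch: gives every named capture a unique name and records
// the new name => original name correspondence.
function renumber_named_captures_sketch($pattern, &$correspondances = [], $prefix = 'match_')
{
    $correspondances = [];
    $index           = 0;

    return preg_replace_callback(
        '/\(\?P<(\w+)>/',
        function ($m) use (&$correspondances, &$index, $prefix) {
            $new                   = $prefix . $index++;
            $correspondances[$new] = $m[1];
            return '(?P<' . $new . '>';
        },
        $pattern
    );
}

$pattern = '/(?P<pair>(?P<letter>[a-z]):(?P<digit>[0-9])) (?P<pair>(?P<letter>[a-z]):(?P<digit>[0-9]))/';
$unique  = renumber_named_captures_sketch($pattern, $names);
preg_match($unique, 'a:1 b:2', $m);

// Regroup the captures under their original names.
$grouped = [];
foreach ($names as $new => $old) {
    $grouped[$old][] = $m[$new];
}
print_r($grouped['letter']);
// [ 'a', 'b' ]
```

Renaming is needed because PCRE rejects a pattern that declares the same group name twice unless duplicate names are explicitly enabled; regrouping afterwards restores the names the caller wrote.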

Regex::ReplaceNamedPatterns ( $pattern, $subject, $replacements, $options = null )

Replace named patterns in a string. This function uses the result of self::PregMatchAll() to match named patterns with the supplied input array $replacements.

$pattern is a pattern matching subpart(s) of the specified subject string. $subject is the string to be matched against. $replacements is an associative array whose keys are the pattern names (as specified in the (?P<name> re) parts of a regular expression) and whose values are themselves associative arrays.

Each entry in these arrays has the following meaning :

- key : A regular expression specifying the value of the named pattern. Do not put anchors or delimiters in this pattern since they are automatically added.
- value : The replacement value for the named pattern specified by the key value.

$options is a combination of PREG_... flags.


Files

File             Type             Role     Description      Accessible without login
DISCLAIMER       Plain text file  Data     Disclaimer       Yes
example.php      Plain text file  Example  Example script   Yes
LICENSE          Plain text file  Lic.     License file     Yes
NOTICE           Plain text file  Data     Auxiliary data   Yes
README.html      HTML file        Doc.     Readme file      Yes
Regex.class.php  Plain text file  Class    The Regex class  No
Regex.md         Plain text file  Doc.     Documentation    Yes

User Ratings (all time)

Utility: 91%
Consistency: 100%
Documentation: 91%
Examples: 91%
Tests: -
Videos: -
Overall: 74%
Rank: 107