TOC PREV NEXT INDEX

Put your logo here!


23 HLA Pattern Matching Routines (patterns.hhf)


The HLA Standard Library provides a set of string/pattern matching routines that are similar in use to those provided by the SNOBOL4 and Icon programming languages. These pattern matching routines support recursion and backtracking, allowing the specification of context-free grammars as well as regular expressions.

Note: all the HLA pattern matching routines are members of the pat namespace, hence you must preface each pattern matching routine with "pat." or HLA will report an error.

23.1 Pat.match and Pat.endmatch Syntax

The HLA pat.match and pat.endmatch macros provide the basic tools for pattern matching. These macro statement allow one of the following two syntaxes:

// Match syntax #1:
 

 
pat.match( StringValue );
 
	<< Sequence of match functions>>
 
	<< Code to execute on a successful match >>
 

 
   pat.if_failure
 

 
	<< Code to execute if the match fails >>
 

 
pat.endmatch;
 

 

StringValue is either an HLA string variable or a string constant.


 

 
// Match syntax #2:
 

 
pat.match( StartOfStr, EndOfStr );
 
	<< Sequence of match functions>>
 
	<< Code to execute on a successful match >>
 

 
   pat.if_failure
 

 
	<< Code to execute if the match fails >>
 

 
pat.endmatch;
 

 

The StartOfStr and EndOfStr parameters (in syntax #2) must be dword pointers to characters. StartOfStr points at the first character of a sequence of characters to match against. EndOfStr must point at the first byte beyond the last character in the sequence to consider.

The pat.match statement, along with many of the matching functions, pushes data onto the stack that may not be cleaned up until execution of the pat.endmatch statement. Therefore, you must never jump into a pat.match..pat.endmatch block. Likewise, unless you are prepared to clean up the stack yourself, you should not jump out of a pat.match..pat.endmatch block1.

During a normal match operation, the pat.match block executes the sequence of string matching functions. If all the functions in the list execute and successfully match their portion of the string, control falls through to the statements after the match sequence. This code should do whatever is necessary if the pattern matches.

On the other hand, if a failure occurs and the pattern matching routines cannot match the specified string, then control transfers to the pat.if_failure section and the associated statements execute. Like an IF..THEN..ELSE statement, the program automatically jumps over the pat.if_failure section if the "successful match" statements execute.

Consider the following example that matches a string containing a single HLA identifier:


 
pat.match( StrToTest );
 

 
  pat.oneCset( { 'a'..'z', 'A'..'Z', '_'} );
 
  pat.zeroOrMoreCset( { 'a'..'z', 'A'..'Z', '0'..'9', '_'});
 
  pat.EOS;
 

 
  stdout.put( "The string is a valid HLA identifier" nl );
 

 
 pat.if_failure
 

 
  stdout.put( "The string is not a valid HLA id" nl );
 

 
pat.endmatch;
 

 

The pat.oneCset function matches a single character in StrToTest that is a member of the character set appearing in the parameter list. This call requires that the first character of StrToTest be an alphabetic character or an underscore.

After pat.oneCset matches a character, the pattern matching routines advance a cursor into StrToTest so that it points just beyond the character matched by pat.oneCset. Indeed, all pattern matching routines operate in this manner, they maintain a cursor (in ESI) that points beyond the characters just matched. So had StrToTest contained the string "Hello", ESI would be pointing at the "e" in "Hello" immediately after the execution of the pat.oneCset pattern matching routine.

The HLA pattern matching routines also return EBX pointing at the first character matched by the routine. In the current example being considered, EBX would be returned pointing at the "H" in "Hello" by the pat.oneCset routine.

The pat.zeroOrMoreCset routine continues where pat.oneCset leaves off. It matches zero or more characters (starting at the location pointed at by ESI). In this particular example, pat.zeroOrMoreCset matches zero or more alphanumeric and underscore characters, hence the code will match "ello" in "Hello".

The pat.EOS macro matches the end of the string, just to make sure there aren't any other illegal (nonalphanumeric) characters in the string. Note that pat.zeroOrMoreCset stops upon encountering the first non-alphanumeric character. The remainder of the pattern (EOS, in this case) must verify that pat.zeroOrMoreCset didn't stop on an illegal character.

Had the StrToTest variable contained the string "Hello", then the pattern would successfully match the string and the program would print "The string is a valid HLA identifier" and continue execution after the pat.endmatch statement.

Because of the way HLA pattern matching routines implement backtracking, each matching routine may leave data on the stack when it successfully returns. This information is necessary to implement backtracking. Although the pat.endmatch code cleans up the stack upon exit, it is important to realize that stack is not static. In particular, you cannot push data on the stack before one pattern matching routine and expect to pop it off the stack when that matching routine returns. Instead, you'll pop the data that the matching routine left on the stack (which will probably crash the system if backtracking occurs). It is okay to manipulate the stack in the code section following all the matching functions (or in the failure section), but you must leave the stack intact between calls to pattern matching routines2.

23.2 Alternation

Another way to handle failure is with the pat.alternate macro. A pat.match..pat.endmatch macro invocation may optionally contain one or more pat.alternate sections before the (required) pat.if_failure section. The pat.alternate sections "intercept" failures from the previous section(s) and allow an attempt to rematch the string with a different pattern (somewhat like the ELSEIF clause of an IF..THEN..ELSEIF..ELSE..THEN statement). The following example demonstrates how you could use this:


 
pat.match( StrToTest );
 

 
  pat.oneCset( { 'a'..'z', 'A'..'Z', '_'} );
 
  pat.zeroOrMoreCset( { 'a'..'z', 'A'..'Z', '0'..'9', '_'});
 
  pat.EOS;
 

 
  stdout.put( "The string is a valid HLA identifier" nl );
 

 
 pat.alternate
 

 
  pat.oneOrMoreCset( {'0'..'9', '_'} );
 
  pat.EOS;
 

 
  stdout.put( "The string is a valid HLA unsigned integer constant" nl );
 

 
 pat.if_failure
 

 
  stdout.put( "The string is not a valid HLA id or integer constant" nl );
 

 
pat.endmatch;
 

 

In this example, if the pattern fails to match an HLA identifier, the pattern matching code attempts to see if it matches an integer constant (in the pat.alternate section). If this fails as well, then the whole pattern fails to match.

23.3 Pattern Matching Macros

The HLA patterns library implements several of the pattern matching routines as keyword macros within the pat.match macro. These include pat.EOS, pat.position, pat.atPos, pat.skip, pat.getPos, pat.fail, pat.fence, pat.zeroOrOnePat, pat.zeroOrMorePat, and pat.oneOrMorePat. The following sections describe each of these functions:

pat.EOS
 

The pat.EOS macro matches the end of the string. It succeeds if the current "cursor" value (ESI) is pointing at the end of the string to match. It fails otherwise.

pat.position( n )
 

This function repositions the cursor to character n in the string that pat.match is processing. This function fails if repositioning the cursor would move it outside the bounds of the string. Note that the index of the first character in the string is zero.

pat.atPos( n )
 

This function succeeds if the cursor is currently at position n in the string that pat.match is processing. It fails otherwise.

pat.skip( n )
 

This function advances the cursor n positions from its current location. This function succeeds if the new cursor position is within the bounds of the string; it fails otherwise.

pat.getPos( dest )
 

This function places the current cursor position in the specified destination operand. This function always succeeds.

pat.fail
 

This forces an immediate failure, backtracking if necessary.

pat.fence
 

This function cleans all the backtracking information off the stack. Any pattern matching function following fence will not be able to backtrack to the routines immediately preceding fence in the current pat.match statement.

pat.onePat;
 
	<< pattern matching statements >>
 
pat.endOnePat;
 

<< pattern matching statements >> are some statements that correspond to an HLA pattern sequence (it may contain pattern matching function calls, x86 code, and pat.alternate sections; it may not contain a pat.if_failure section or a pat.fence invocation). The program evaluates the pattern. If it succeeds, control falls to the next statement following the pat.pattern call. If it fails, then control transfers directly to the pat.if_failure section in the surrounding pat.match call.

This macro is primarily used to create "parenthetical patterns" as a convenience when creating complex patterns. Here's an example:


 
pat.match( SomeString );
 

 
	pat.onePat
 

 
			pat.matchStr( "Black" );
 

 
		pat.alternate
 

 
			pat.matchStr( "Blue" );
 

 
	pat.endOnePat;
 

 
	pat.onePat;
 

 
			pat.matchStr( "bird" );
 

 
		pat.alternate
 

 
			pat.matchStr( "berry" );
 

 
	pat.endOnePat;
 

 
	stdout.put
 
	( 
 
		"It was 'blackbird', 'bluebird', 'blackberry', or 'blueberry'",
 
		nl 
 
	);
 

 
  pat.if_failure
 

 
	stdout.put( "Failed to match the pattern" nl );
 

 
pat.endmatch;
 

 

This (rather famous) pattern matches the same thing as the regular expression:


 
	( "Black" | "Blue" )( "bird" | "berry" )
 

 

Immediately after the pat.endOnePat statement, EBX points at the start of the text associated with the pattern match between the pat.onePat and pat.endOnePat calls. Therefore, you can call functions like pat.extract to extract the entire string matched by the pattern between the pat.onePat and pat.endOnePat calls. This function fully supports backtracking, even across the patterns within the parenthetical pattern expression.

pat.zeroOrOnePat;
 
	<< pattern matching statements >>
 
pat.endZeroOrOnePat;
 

<< pattern matching statements >> are some statemennts that correspond to an HLA pattern sequence (it may contain pattern matching function calls, x86 code, and pat.alternate sections; it may not contain a pat.if_failure section or a pat.fence invocation). This call invokes the pattern matching function zero or one times to match additional characters in the current string. This function always succeeds since it can match zero times. This function fully supports backtracking.

pat.zeroOrMorePat;
 
	<< pattern matching statements >>
 
pat.endZeroOrMorePat
 

Pattern is sequence of pattern matching function calls (just like pat.pattern above; including allowing a pat.alternate section but not allowing a pat.if_failure section). This call invokes the pattern matching function zero or more times to match additional characters in the current string.

pat.oneOrMorePat
 
	<< pattern matching statements >>
 
pat.endOneOrMorePat
 

<< pattern matching statements >> are some statemennts that correspond to an HLA pattern sequence (it may contain pattern matching function calls, x86 code, and pat.alternate sections; it may not contain a pat.if_failure section or a pat.fence invocation). This call invokes the pattern matching function one or more times to match additional characters in the current string. It must match at least one occurrence of the pattern in order to succeed.

23.4 Pattern Matching Functions

The following sections describe each of the pattern matching functions provided by the HLA patterns module.

23.4.1 Character Set Matching Procedures

procedure pat.peekCset( cst:cset ); 
 

Succeeds if the following character is in the specified set. Fails otherwise. Does not advance the cursor if the character is an element of cst.

procedure pat.oneCset( cst:cset ); 
 

Succeeds, and advances the cursor, if the character at the cursor position is in cst. Fails otherwise.

procedure pat.upToCset( cst:cset ); 
 

Advances the cursor until it finds a character in cst. Fails if none ofthe characters following the cursor position (to the end of the string) are in cst.

procedure pat.zeroOrOneCset( cst:cset )
 

Optionally matches a single character in the string. If the following character is in the character set, this routine advances the cursor and signals success. If the following character is not in the string, this routine simply signals success.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match the character before returning. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. If the following match routine still fails, then this routine fails.

procedure pat.l_ZeroOrOneCset( cst:cset )
 

Optionally matches a single character in the string. If the following character is in the character set, this routine advances the cursor and signals success. If the following character is not in the string, this routine simply signals success.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will start by matching zero characters in the string. If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. If the following routine still fails, then this routine signals failure.

procedure pat.zeroOrMoreCset( cst:cset ); 
 

Matches zero or more characters in the specified character set.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up beyond the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_ZeroOrMoreCset( cst:cset ); 
 

Matches zero or more characters in the specified character set.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.oneOrMoreCset( cst:cset ); 
 

Matches one or more characters in the specified character set. Immediately fails if there isn't at least one character in cst.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_OneOrMoreCset( cst:cset ); 
 

Matches one or more characters in the specified character set. Immediately fails if there isn't at least one character in cst.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., one). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.exactlyNCset( cst:cset; n:uns32 ); 
 

Matches exactly n characters that are members of cst. If any of the next n characters are not in cst, this routines returns failure.

Note: The character at position (n+1) must not be a member of cst or this routine fails.

procedure pat.firstNCset( cst:cset; n:uns32 ); external;
 

Matches n characters that are members of cst.

Note: The character at position (n+1) may be a member of cst. Whether or not it is, this routine succeeds if the first n characters are members of cst.

procedure pat.norLessCset( cst:cset; n:uns32 ); 
 

This routine matches n or fewer characters belonging to the cst set. The character at position (n+1) does not affect the success or failure of this routine. Note that this routine returns true even if it matches zero characters.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NorLessCset( cst:cset; n:uns32 ); 
 

This routine matches n or fewer characters belonging to the cst set. The character at position (n+1) does not affect the success or failure of this routine.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.norMoreCset( cst:cset; n:uns32 ); 
 

This routine matches at least n characters belonging to the cst set. If fewer than n characters match the set, this routine returns failure.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NorMoreCset( cst:cset; n:uns32 ); 
 

This routine matches at least n characters belonging to the cst set. If fewer than n characters match the set, this routine returns failure.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.ntoMCset( cst:cset; n:uns32; m:uns32 ); 
 

This routine matches at least n characters and no more than m characters belonging to the cst set. If fewer than n characters match the set, this routine returns failure. This routine does not fail if more than m characters belong to the set. However, it only matches through position m.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NtoMCset( cst:cset; n:uns32; m:uns32 ); external;
 

This routine matches at least n characters and no more than m characters belonging to the cst set. If fewer than n characters match the set, this routine returns failure. This routine does not fail if more than m characters belong to the set. However, it only matches through position m.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond position m (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.exactlyNtoMCset( cst:cset; n:uns32; m:uns32 ); 
 

This routine matches at least n characters and no more than m characters belonging to the cst set. If fewer than n characters match the set, this routine returns failure. This routine fails if more than m characters belong to the set.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_ExactlyNtoMCset( cst:cset; n:uns32; m:uns32 ); external;
 

This routine matches at least n characters and no more than m characters belonging to the cst set. If fewer than n characters match the set, this routine returns failure. This routine fails if more than m characters belong to the set.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond position m (in which case this routine fails) or the following match routine(s) succeed.

23.4.2 Character Matching Procedures

procedure pat.peekChar( c:char );
 

This routine succeeds if the character pointed at by the cursor (ESI) is equal to c; it fails otherwise. This routine does not advance the cursor.

procedure pat.oneChar( c:char ); 
 

This routine succeeds if the character pointed at by the cursor (ESI) is equal to c; it fails otherwise. If it succeeds, this routine advances the cursor.

procedure pat.upToChar( c:char ); 
 

This routine matches all characters in a string from the cursor position up to the specified parameter. It fails if the specified character is not in the string. Note that this routine leaves the cursor pointing at the character specified by the parameter (i.e., it still remains to be matched).

procedure pat.zeroOrOneChar( c:char ); 
 

This routine matches zero or one occurrences of the character parameter.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match one character in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine.

procedure pat.l_ZeroOrOneChar( c:char ); 
 

This routine matches zero or one occurrences of the character parameter.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. If that fails, then this routine fails.

procedure pat.zeroOrMoreChar( c:char ); 
 

This routine matches zero or more occurrences of the character parameter. It leaves the cursor pointing at the end of the string or the first character that is not equal to c.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_ZeroOrMoreChar( c:char ); 
 

This routine matches zero or more occurrences of the character parameter. It leaves the cursor pointing at the end of the string or the first character that is not equal to c.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.oneOrMoreChar( c:char ); 
 

This routine matches one or more occurrences of the character parameter. It leaves the cursor pointing at the end of the string or the first character that is not equal to c. It fails if there isn't at least one copy of c at the cursor position.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_OneOrMoreChar( c:char ); 
 

This routine matches one or more occurrences of the character parameter. It leaves the cursor pointing at the end of the string or the first character that is not equal to c. It fails if there isn't at least one copy of c at the cursor position.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., one). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.exactlyNChar( c:char; n:uns32 ); 
 

This routine matches exactly n copies of the character c in the string. If more, or less, copies of c appear in the string, this routine fails.

procedure pat.firstNChar( c:char; n:uns32 ); 
 

This routine matches exactly n copies of the character c in the string. If less, copies of c appear in the string, this routine fails. If more copies of c appear in the string, this routine succeeds, however, it only matches the first n copies.

procedure pat.norLessChar( c:char; n:uns32 ); 
 

This procedure matches n or fewer copies of c in the current string. If additional copies of c appear in the string, this routine still succeeds but it only matches the first n copies.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NorLessChar( c:char; n:uns32 ); 
 

This procedure matches n or fewer copies of c in the current string. If additional copies of c appear in the string, this routine still succeeds but it only matches the first n copies.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.norMoreChar( c:char; n:uns32 ); 
 

This procedure matches n or more copies of c in the current string. It fails if there are fewer than n copies of c.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to position n (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NorMoreChar( c:char; n:uns32 );

This procedure matches n or more copies of c in the current string. It fails if there are fewer than n copies of character c in the string.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.ntoMChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position. This routine succeeds even if there are more than m copies of the character, however, it will only match the first m characters in the string.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to position n (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NtoMChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position. This routine succeeds even if there are more than m copies of the character, however, it will only match the first m characters in the string.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond position m (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.exactlyNtoMChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position. This routine fails if there are more than m copies of the character in the string.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to position n (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_ExactlyNtoMChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position. This routine fails if there are more than m copies of the character in the string.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond position m (in which case this routine fails) or the following match routine(s) succeed.

23.4.3 Case Insensitive Character Matching Routines

These routines are semantically identical to the above routines with one difference- when they compare the characters they use a case insensitive comparison. Please see the descriptions above for an explanation of these routines.

procedure pat.peekiChar( c:char ); 
 

This routine succeeds if the character pointed at by the cursor (ESI) is equal to c using a case insensitive comparison; it fails otherwise. This routine does not advance the cursor.

procedure pat.oneiChar( c:char ); 
 

This routine succeeds if the character pointed at by the cursor (ESI) is equal to c using a case insensitive comparison it fails otherwise. If it succeeds, this routine advances the cursor.

procedure pat.upToiChar( c:char ); 
 

Using a case insensitive comparison, this routine matches all characters in a string from the cursor position up to the specified parameter. It fails if the specified character is not in the string. Note that this routine leaves the cursor pointing at the character specified by the parameter (i.e., it still remains to be matched).

procedure pat.zeroOrOneiChar( c:char ); 
 

This routine matches zero or one occurrences of the character parameter using a case insenstive comparison.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match one character in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine.

procedure pat.l_ZeroOrOneiChar( c:char ); 
 

This routine matches zero or one occurrences of the character parameter using a case insenstive comparison.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. If that fails, then this routine fails.

procedure pat.zeroOrMoreiChar( c:char ); 
 

This routine matches zero or more occurrences of the character parameter using a case insenstive comparison. It leaves the cursor pointing at the end of the string or the first character that is not equal to c.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_ZeroOrMoreiChar( c:char ); 
 

This routine matches zero or more occurrences of the character parameter using a case insenstive comparison. It leaves the cursor pointing at the end of the string or the first character that is not equal to c.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.oneOrMoreiChar( c:char ); 
 

This routine matches one or more occurrences of the character parameter using a case insenstive comparison. It leaves the cursor pointing at the end of the string or the first character that is not equal to c. It fails if there isn't at least one copy of c at the cursor position.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_OneOrMoreiChar( c:char ); 
 

This routine matches one or more occurrences of the character parameter using a case insenstive comparison. It leaves the cursor pointing at the end of the string or the first character that is not equal to c. It fails if there isn't at least one copy of c at the cursor position.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., one). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.exactlyNiChar( c:char; n:uns32 ); 
 

This routine matches exactly n copies of the character c in the string using a case insenstive comparison. If more, or less, copies of c appear in the string, this routine fails.

procedure pat.firstNiChar( c:char; n:uns32 ); 
 

This routine matches exactly n copies of the character c in the string using a case insenstive comparison. If less, copies of c appear in the string, this routine fails. If more copies of c appear in the string, this routine succeeds, however, it only matches the first n copies.

procedure pat.norLessiChar( c:char; n:uns32 ); 
 

This procedure matches n or fewer copies of c in the current string using a case insenstive comparison. If additional copies of c appear in the string, this routine still succeeds but it only matches the first n copies.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to the original cursor position (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NorLessiChar( c:char; n:uns32 ); 
 

This procedure matches n or fewer copies of c in the current string using a case insenstive comparison. If additional copies of c appear in the string, this routine still succeeds but it only matches the first n copies.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., zero). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.norMoreiChar( c:char; n:uns32 ); 
 

This procedure matches n or more copies of c in the current string using a case insenstive comparison. It fails if there are fewer than n copies of c.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to position n (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NorMoreiChar( c:char; n:uns32 ); 
 

This procedure matches n or more copies of c in the current string using a case insenstive comparison. It fails if there are fewer than n copies of character c in the string.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond the end of the string (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.ntoMiChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position using a case insenstive comparison. This routine succeeds even if there are more than m copies of the character, however, it will only match the first m characters in the string.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to position n (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_NtoMiChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position using a case insenstive comparison. This routine succeeds even if there are more than m copies of the character, however, it will only match the first m characters in the string.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond position m (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.exactlyNtoMiChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position using a case insenstive comparison. This routine fails if there are more than m copies of the character in the string.

This function uses an "aggressive" or "eager" pattern matching algorithm. It will attempt to match as many characters as possible in the string. If doing so would cause a following match routine to fail, this routine will backtrack one character and retry the following match routine. This continues until it backs up to position n (in which case this routine fails) or the following match routine(s) succeed.

procedure pat.l_ExactlyNtoMiChar( c:char; n:uns32; m:uns32 ); 
 

This procedure matches between n and m copies of the character c starting at the current cursor (ESI) position using a case insenstive comparison. This routine fails if there are more than m copies of the character in the string.

This function uses a "deferred" or "lazy" pattern matching algorithm. It will attempt to match as few characters as possible in the string (i.e., n). If doing so would cause a following match routine to fail, this routine will backtrack and advance one character and then retry the following match routine. This continues until it advances beyond position m (in which case this routine fails) or the following match routine(s) succeed.

23.4.4 String matching procedures

procedure pat.matchStr( s:string ); 
 

If the sequence of characters at the current cursor position (ESI) match the specified string, this routine succeeds, otherwise it fails.

procedure pat.matchiStr( s:string ); 
 

Like pat.matchStr, except this routine does a case insensitive comparison.

procedure pat.matchToStr( s:string ); 
 

This routine matches all characters up to, and including, the specified string. If it matches a string and a following pattern matching routine fails, this routine handles the backtracking and searches for the next string that matches. The backtracking is lazy insofar is this routine will always match the minimum number of occurrences of s in the string in order to succeed.

procedure pat.upToStr( s:string ); 
 

This routine matches all characters up to, but not including, the specified string. If it matches a string and a following pattern matching routine fails, this routine handles the backtracking and searches for the next string that matches. The backtracking is lazy insofar is this routine will always match the minimum number of occurrences of s in the string in order to succeed.

procedure pat.matchToiStr( s:string ); 
 

Like pat.matchToStr, except this routine does a case insensitive comparison.

procedure pat.upToiStr( s:string ); 
 

Like pat.upToStr, except this routine does a case insensitive comparison.

procedure pat.matchWord( s:string ); 
 

This routine is similar to pat.matchStr except that it requires a delimiter character after the string it matches. The delimiter character is a member of the WordDelims character set (internal to the patterns.hhf code). WordDelims is, by default, the character set "-{'a'..'z', 'A'..'Z', '0'..'9', '_'}" (that is, all character except the alphanumeric characters and the underscore). See the getWordDelims and setWordDelims procedures if you are interested in changing the word delimiters set.

procedure pat.matchiWord( s:string ); 
 

Just like pat.matchWord, except this routine does a case insensitive comparison.

procedure pat.getWordDelims( var cst:cset ); 
 

This function makes a copy of the internal WordDelims character set and places this copy in the specified cst parameter.

procedure pat.setWordDelims( cst:cset);
 

This function stores the value of the cst character set into the WordDelims character set. This allows you to change the WordDelims character set to your liking.

23.4.5 String Extraction Procedures

procedure pat.extract( s:string ); 
 

Whenever a pattern matching routine successfully matches zero or more characters in the string, the pattern matching routine returns a pointer to the start of the matched characters in EBX and a pointer to the position just beyond the last matched position in ESI. You may use the pat.a_extract procedure to create an HLA-compatible string of these matched characters. This routine will raise an exception if the destination string isn't big enough to hold the extracted characters.

procedure pat.a_extract( var s:string ); 
 

Whenever a pattern matching routine successfully matches zero or more characters in the string, the pattern matching routine returns a pointer to the start of the matched characters in EBX and a pointer to the position just beyond the last matched position in ESI. You may use the pat.a_extract procedure to create an HLA-compatible string of these matched characters. pat.a_extract will allocate storage for the string on the heap, copy the matched characters to this string, and then store a pointer to the new string in the string variable passed as a reference parameter to pat.a_extract.

Warning: pat.extract and pat.a_extract should only be called in the "success" section of a pat.match..pat.endmatch block. Any other invocation could create a problem. In general, you must ensure that EBX and ESI point at reasonable spots within the same string. Note that pattern match failure does not guarantee that EBX contains a reasonable value. Therefore, you should not use pat.extract or pat.a_extract at a point where string failure could have occurred unless you explicitly set up EBX (and, possibly, ESI) yourself.

23.4.6 Whitespace and End of String Matching Procedures

These convenient routines match a sequence of whitespace characters as well as the end of the current string. By default, these routines assume that whitespace consists of all the control characters, the ASCII space (#$20), and the del code (#$7f). You can change this definition using the pat.getWhiteSpace and pat.setWhiteSpace procedures.

procedure pat.getWhiteSpace( var cst:cset ); 
 

This function returns the current value of the internal WhiteSpace character set. It stores the result in the reference parameter.

procedure pat.setWhiteSpace( cst:cset); 
 

This procedure copies the specified character set to the internal WhiteSpace character set. All future whitespace matching procedures will use this new value when matching white space characters.

procedure pat.zeroOrMoreWS; 
 

This routine matches zero more more whitespace characters. This routine uses an "eager" matching algorithm.

procedure pat.oneOrMoreWS; 
 

This routine matches zero more more whitespace characters. This routine uses an "eager" matching algorithm.

procedure pat.WSorEOS; 
 

This routine matches a single whitespace character or the end of the string. It fails if there are characters left in the string and the character at the cursor position is not a white space character.

procedure pat.WSthenEOS; 
 

This routine matches zero or more white space characters that appear at the end of the current string. It fails if there are any other characters before the end of the string.

procedure pat.peekWS; 
 

This routine succeeds if the next character in the string is a whitespace character. However, it does not advance the cursor over the character.

procedure pat.peekWSorEOS; 
 

This routine succeeds if the next character in the string is a white space character or if there are no more characters in the string. It does not advance the cursor.

23.4.7 Match an Arbitrary Sequence of Characters

procedure pat.arb; 
 

This routine matches zero or more characters. It uses an "aggressive" or "eager" matching algorithm, immediately matching all the remaining characters in the string. If following matching routines fail, this routine backtracks one character at a time until reaching the initial starting position (in which case this routine fails) or the following matching routine(s) succeed.

procedure pat.l_arb; external;
 

This is a "lazy" or "deferred" version of the above routine. It matches zero characters and succeeds; if a following match routine fails, this routine backtracks by advancing the cursor one position for each failure. If this routine advances beyond the end of the string during backtracking, it reports failure.

23.4.8 Writing Your Own Pattern Matching Routines

Although HLA provides a wide variety of pattern matching functions, from which you can probably synthesize any pattern you desire, there are several reasons why you might want to write your own pattern matching routines. Some common reasons include: (1) You would like a more efficient pattern matching function than is possible by composing existing pattern matching functions. (2) You need a particular pattern matching routine to produce a side effect and the standard matching routines do not produce the desired side effect. A common example is a pattern matching routine that returns an attribute value for an item it matches. For example, a routine that matches a string of decimal digits may return the numeric equivalent of that string as an attribute of that pattern. (3) You need a pattern matching routine that considers other machine states (i.e., variable values) besides the string the pattern is processing. (4) You need to handle some context-sensitive issues. (5) You want to understand how the pattern matching algorithm works. Writing your own pattern matching functions can achieve all these goals and many more.

The first issue you must address when writing your own pattern matching routine is whether or not the routine supports back tracking. Generally, this decision depends upon whether the function matches strings that are always a fixed length or can match strings of differing lengths. For example, the pat.oneCset routine always matches a string of length one whereas the pat.zeroOrMoreCset function can match strings of any length. If a function can only match strings having a fixed length, then the function does not need to support back tracking. Generally, pattern matching functions that can match strings of varying lengths should support backtracking3. Since supporting back tracking is more work and less efficient, you should only support it when necessary.

Once you've decided that you're going to support back tracking in a matching function, the next issue that concerns you is whether the function supports eager evaluation or lazy/deferred evaluation. (Note: when writing general matching routines for library use, it's generally a good idea to supply two functions, one that supports eager evaluation and one that supports lazy/deferred evaluation.)

A function that supports eager evaluation tries to match the longest possible string when the program calls the function. If the function succeeds and a later matching functions fails (invoking the backtracking operation), then the matching function backs off the minimum number of characters that will still match. This process continues until the following code succeeds or the function backs off so much that it, too, fails.

If function that support lazy/deferred evaluations tries to match the shortest possible string. Once it matches the shortest string it can, it passes control on to the following pattern matching functions. If they fail and back tracking returns control to the function, it tries to match the next smallest string larger than the one it currently matches. This process repeats until the following match functions succeed or the current function fails to match anything.

Note that the choice of eager vs. lazy/deferred evaluation does not generally affect whether a pattern will match a given string4. It does, however, affect the efficiency of the pattern matching operation. Backtracking is a relatively slow operation. If an eager match causes the following pattern functions to fail until the current pattern matching function backs off to the shortest possible string it can match, the program will run much slower than one that uses lazy evaluation for the function (since it starts with the shortest possible string to begin with). On the other hand, if a function needs to match the longest possible string in order for the following matching functions to succeed, choosing lazy evaluation will run much more slowly than eager evaluation. Therefore, the choice of which form is best to use is completely data dependent. If you have no idea which evaluation form should be better, choose eager evaluation since it is more intuitive to those defining the pattern to match.

All pattern matching routines have two implicit parameters passed to them in the ESI and EDI registers. ESI is the current cursor position while EDI points at the byte immediately after the last character available for matching. That is, the characters between locations ESI and EDI-1 form the string to match against the pattern.

The primary purpose of a pattern matching function is to return "success" or "failure" depending upon whether the pattern matches the characters in the string (or however else you define "success" versus "failure"). In addition to returning success or failure, pattern matching functions must also return certain values in some of the registers. In particular, the function must preserve the value in EDI (that is, it must still point at the first byte beyond the end of the string to match). If the function succeeds, it must return EBX pointing at the start of the sequence it matched (i.e., EBX must contain the original value in ESI) and ESI must point at the first character beyond the string matched by the function (so the string matched is between addresses EBX and ESI-1). If the function fails, it must return the original values of ESI and EDI in these two registers. EBX's value is irrelevant if the function fails. Except for EBP, the routine need not preserve any other register values (and, in fact, a pattern matching function can use the other registers to return attribute values to the calling code)5.

Pattern matching routines that do not support backtracking are the easiest to create and understand. Therefore, it makes sense to begin with a discussion of those types of pattern matching routines.

A pattern matching routine that does not support backtracking succeeds by simply returning to its caller (with the registers containing the appropriate values noted above). If the function fails to match the characters between ESI and EDI-1, it must call the "pat._fail_" function passing the "pat.FailTo" object as its parameter, e.g.,

pat._fail_( pat.FailTo );
 

 

As a concrete example, consider the following implementation of the pat.matchStr function:


 
unit patterns;
 
#include( "pat.hhf" );
 

 
procedure pat.matchStr( s:string ); @nodisplay; @noframe;
 
begin matchStr;
 

 
    push( ebp );        // must do this ourselves since noframe
 
    mov( esp, ebp );    // is specified as an option.
 
    cld();
 
    
 
    // Move a copy of ESI into EBX since we need to return
 
    // the starting position in EBX if we succeed.
 

 
    mov( esi, ebx );
 

 
    // Compute the length of the remaining
 
    // characters in the sequence we are attempting
 
    // to match (i.e., EDI-ESI) and compare this against
 
    // the length of the string passed as a parameter.
 
    // If the parameter string is longer than the number
 
    // of characters left to match, then we can immediately
 
    // fail since there is no way the string is going to
 
    // to match the string parameter.
 

 
    mov( s, edx );
 
    mov( (type str.strRec [edx]).length, ecx );
 
    mov( edi, eax );
 
    sub( esi, eax );
 
    if( ecx > eax ) then
 

 
        // At this point, there aren't enough characters left
 
        // in the sequence to match s, so fail.
 
        
 
        pat._fail_( pat.FailTo );       
 

 
    endif;
 

 
    // Okay, compare the two strings up to the length of s
 
    // to see if they match.
 

 
    push( edi );
 
    mov( edx, edi );
 
    repe.cmpsb();
 
    pop( edi );
 
    if( @ne ) then
 

 
        // At this point, the strings are unequal, so fail.
 
        // Note that this code must restore ESI to its
 
        // original value if it returns failure.
 
        
 
        mov( ebx, esi );
 
        pat._fail_( pat.FailTo );       
 

 
    endif;
 
        
 
    // Since this routine doesn't have to handle backtracking,
 
    // a simple return indicates success.
 

 
    pop( ebp );
 
    ret();
 
        
 
end matchStr;
 
end patterns;
 

Program 4 Pat.matchstr Source Code

If your function needs to support back tracking, the code will be a little more complex. First of all, your function cannot return to its caller by using the RET instruction. To support backtracking, the function must leave its activation record on the stack when it returns. This is necessary so that when backtracking occurs, the function can pick up where it left off. It is up to the pat.match macro to clean up the stack after a sequence of pattern matching functions successfully match a string.

If a pattern matching function supports backtracking, it must preserve the values of ESP, ESI, and EDI upon initial entry into the code. It will also need to maintain the currrent cursor position during backtracking and it will need to reserve storage for a special "pat.FailRec" data structure. Therefore, almost every pattern matching routine you'll write that supports backtracking will have the following VAR objects:


 
var
 
	cursor:    misc.pChar; // Save last matched posn here.
 
	startPosn: misc.pChar; // Save start of str here.
 
	endStr:    misc.pChar; // End of string goes here.
 
	espSave:   dword;      // To clean stk after back trk.
 
	FailToSave:pat.FailRec;// Save global FailTo value here.
 

 

Warning: you must declare these variables in the VAR section; they must not be static objects.

Upon reentry from backtracking, the ESP register will not contain an appropriate value. It is your code's responsibility to clean up the stack when backtracking occurs. The easiest way to do this is to save a copy of ESP upon initial entry into your function (in the espSave variable above) and restore ESP from this value whenever backtracking returns control to your function (you'll see how this happens in a moment). Likewise, upon reentry into your function via backtracking, the registers are effectively scrambled. Therefore, you will need to save ESI's value into the startPosn variable and EDI's value into the endStr variable upon initial entry into the function. The startPosn variable contains the value that EBX must have whenever your function returns success. The cursor variable contains ESI's value after you've successfully matched some number of characters. This is the value you reload into ESI whenever backtracking occurs. The FailToSave data structure holds important pattern matching information. The pattern matching library automatically fills in this structure when you signal success; you are only responsible for supplying this storage, you do not have to initialize it.

You signal failure in a function that supports backtracking the same way you signaled failure in a routine that does not support backtracking: by invoking "pat._fail_( pat.FailTo );" Since your code is failing, the caller will clean up the stack (including removing the local variables you've just allocated and initialized). If the pattern matching system calls your pattern matching function after backtracking occurs, it will reenter your function at its standard entry point where you will, once again, allocate storage for the local variables above and initialize them as appropriate.

If your function succeeds, it usually signals success by invoking the pat._success_ macro. This macro invocation takes the following form:

pat._success_( FailToSave, FailToHere );
 

 

The first parameter is the pat.FailRec object you declared as a local variable in your function. The pat._success_ macro stores away important information into this object before returning control to the caller. The FailToHere symbol is a statement label in your function. If backtracking occurs, control transfers to this label in your function (i.e., this is the backtracking reentry point). The code at the FailToHere label must immediately reload ESP from espSave, EDI from endStr, EBX from startPosn, and ESI from cursor. Then it does whatever is necessary for the backtrack operation and attempts to succeed or fail again.

The pat._success_ macro (currently) takes the following form6:


 
	// The following macro is a utility for
 
	// the pattern matching procedures.
 
	// It saves the current global "FailTo"
 
	// value in the "FailRec" variable specified
 
	// as the first parameter and sets up
 
	// FailTo to properly return control into
 
	// the current procedure at the "FailTarget"
 
	// address.  Then it jumps indirectly through
 
	// the procedure's return address to transfer
 
	// control to the next (code sequential)
 
	// pattern matching routine.
 

 
	#macro _success_( _s_FTSave_, _s_FailTarget_ );
 
		
 
		// Preserve the old FailTo object in the local
 
		// FailTo variable.
 

 
		mov( pat.FailTo.ebpSave, _s_FTSave_.ebpSave );
 
		mov( pat.FailTo.jmpAdrs, _s_FTSave_.jmpAdrs );
 
		
 
		// Save current EBP and failto target address
 
		// in the global FailTo variable so backtracking
 
		// will return the the current routine.
 

 
		mov( ebp, pat.FailTo.ebpSave );
 
		mov( &_s_FailTarget_, pat.FailTo.jmpAdrs );
 

 
		// Push the return address onto the stack (so we
 
		// can return to the caller) and restore
 
		// EBP to the caller's value.  Then jump
 
		// back to the caller without cleaning up
 
		// the current routine's stack.
 

 
		push( [ebp+4] );
 
		mov( [ebp], ebp );
 
		ret();
 

 
	#endmacro
 

Program 5 Pat._success Macro Source Code

As you can see, this code copies the global pat.FailTo object into the FailToSave data structure you've created. The FailTo structure contains the EBP value and the reentry address of the most recent function that supports backtracking. You code must save these values in the event your code (ultimately) fails and needs to backtrack to some previous pattern matching function.

After preserving the old value of the global pat.FailTo variable, the code above copies EBP and the address of the FailToHere label you've specified into the global pat.FailTo object.

Finally, the code above returns to the user, without cleaning up the stack, by pushing the return address (so it's on the top of the stack) and restoring the caller's EBP value. The RET() instruction above returns control to the function's caller (note that the original return address is still on the stack, the pattern matching routines will never use it).

Should backtracking occur and the program reenters your pattern matching function, it will reenter at the address specified by the second parameter of the pat._success_ macro (as noted above). You should restore the appropriate register (as noted above) and use the value in the cursor variable to determine how to proceed with the backtracking operation. When doing eager evaluation, you will generally need to decrement the value obtained from cursor to back off on the length of the string your program has matched (failing if you decrement back to the value in startPosn). When doing lazy evaluation, you generally need to increment the value obtained from the cursor variable in order to match a longer string (failing if you increment cursor to the point it becomes equal to endStr).

When executing code in the reentry section of your procedure, the failure and success operations are a little different. Prior to failing, you must manually restore the value in pat.FailTo that pat._success_ saved into the FailToSave local variable. You must also restore ESI with the original starting position of the string. The following instruction sequence will accomplish this:


 
			// Need to restore FailTo address because it
 
			// currently points at us.  We want to jump
 
			// to the correct location.
 

 
			mov( startPosn, esi );
 
			mov( FailToSave.ebpSave, pat.FailTo.ebpSave );
 
			mov( FailToSave.jmpAdrs, pat.FailTo.jmpAdrs );
 
			pat._fail_( pat.FailTo );
 

 

Likewise, succeeding in the backtrack reentry section of your program is a little different. You do not want to invoke the pat._success_ macro because it will overwrite the FailToSave value with the global pat.FailTo. The global value, however, points at your routine; were you to overwrite this value you'd never be able to fail back to previous matching functions in the current pattern match. Therefore, you should always execute code like the following when succeeding in the reentry section of your code:


 
		mov( esi, cursor );  //Save current cursor value.
 
		push( [ebp+4] );     //Make a copy of the rtn adrs.
 
		mov( [ebp], ebp );   //Restore caller's EBP value.
 
		ret();               //Return to caller.
 

 

The following is the code for the pat.oneOrMoreCset routine (that does an eager evaluation) that demonstrates pattern matching with backtracking.


 
unit patterns;
 
#include( "pat.hhf" );
 

 
    
 
    
 
/************************************************************/
 
/*                                                          */
 
/* OneOrMoreCset-                                           */
 
/*                                                          */
 
/* Matches one or more characters in a string from          */
 
/* the specified character set.                             */
 
/*                                                          */
 
/* Disposition: Eager                                       */
 
/* BackTrackable:   Yes                                     */
 
/*                                                          */
 
/* Entry Parameters:                                        */
 
/*                                                          */
 
/*  ESI:    Pointer to sequence of characters to match.     */
 
/*  EDI:    Pointer to byte beyond last char to match.      */
 
/*  cst:    Character set to match with.                    */
 
/*                                                          */
 
/* Exit Parameters (if success):                            */
 
/*                                                          */
 
/*  EBX:    Points at the start of matched sequence.        */
 
/*  ESI:    Points at first character not in cst.           */
 
/*  EDI:    Unchanged from entry value.                     */
 
/*                                                          */
 
/* Exit Parameters (if failure):                            */
 
/*                                                          */
 
/*  EDI:    Unchanged from entry value.                     */
 
/*                                                          */
 
/* Unless noted, assume all other registers can be modified */
 
/* by this code.                                            */
 
/*                                                          */
 
/************************************************************/
 

 
        
 
procedure pat.oneOrMoreCset( cst:cset ); @nodisplay;
 
var
 
    cursor:     misc.pChar;     // Save last matched posn here.
 
    startPosn:  misc.pChar;     // Save start of str here.
 
    endStr:     misc.pChar;     // End of string goes here.
 
    espSave:    dword;          // To clean stk after back trk.
 
    FailToSave: pat.FailRec;    // Save global FailTo value here.
 
    
 
begin oneOrMoreCset;
 

 
    // If some routine after this one fails and transfers
 
    // control via backtracking to this code, the stack
 
    // will be a mess.  So save esp so we can clean up
 
    // the stack if backtracking is necessary.
 
    
 
    mov( esp, espSave );
 
    
 
    // Save the pointer to the start of the string
 
    // to match.  This is used as a "fence" value
 
    // to prevent backtracking past the start of
 
    // the string if things go really wrong.
 
    
 
    mov( esi, startPosn );
 
    mov( esi, ebx );
 

 
    // Save pointer to end of string to match.
 
    // This is needed to restore this value when
 
    // backtracking occurs.
 

 
    mov( edi, endStr );
 
    
 
    // Okay, eagerly match as many characters in
 
    // the character set as possible.
 
    
 
    xor( eax, eax );
 
    dec( esi );
 
    repeat
 
    
 
        inc( esi );                 // Move to next char in string.
 
        breakif( esi >= edi );      // Stop at end of string.
 
        mov( [esi], al );           // Get the char to test.
 
        bt( eax, (type dword cst)); // See if in cst.
 
        
 
    until( @nc );   // Carry is set if al in cst.
 
    
 
    // So we can easily back track, save a pointer
 
    // to the first non-matching character.
 
    
 
    mov( esi, cursor );
 
    
 
    
 
    // If we matched at least one character, then
 
    // succeed by jumping to the return address, without
 
    // cleaning up the stack (we need to leave our
 
    // activation record laying around in the event
 
    // backtracking is necessary).
 

 
    if( esi > ebx ) then
 
    
 
        pat._success_( FailToSave, FailToHere );
 
        
 
    endif;
 
    
 
    // If we get down here, we didn't match at
 
    // least one character.  So transfer control
 
    // to the previous routine that supported
 
    // backtracking.
 
    
 
    mov( startPosn, esi );
 
    pat._fail_( pat.FailTo );       
 

 

 

 

 
    // If someone after us fails and invokes
 
    // backtracking, control is transfered to
 
    // this point.  First, we need to restore
 
    // ESP to clean up the junk on the stack.
 
    // Then we back up one character, failing
 
    // if we move beyond the beginning of the
 
    // string.  If we don't fail, we jump to
 
    // the code following the call to this
 
    // routine (having backtracked one character).
 
    
 
    FailToHere:
 

 
        mov( espSave, esp );    // Clean up stack.
 
        
 
        mov( cursor, esi );     // Get last posn we matched.
 
        dec( esi );         // Back up to prev matched char.
 
        mov( endStr, edi );
 
        mov( startPosn, ebx );
 
        if( esi <= ebx ) then
 
        
 
            // We've backed up to the beginning of
 
            // the string.  So we won't be able to
 
            // match at least one character.        
 

 
            mov( ebx, esi );
 
            mov( FailToSave.ebpSave, pat.FailTo.ebpSave );
 
            mov( FailToSave.jmpAdrs, pat.FailTo.jmpAdrs );
 
            pat._fail_( pat.FailTo );
 
            
 
        endif;
 
    
 
        // If we drop down here, there is at least one
 
        // character left in the string that we've
 
        // matched, so call the next matching routine
 
        // (by jumping to the return address) to continue
 
        // the pattern match.
 
        
 
        mov( esi, cursor );
 
        mov( [ebp+4], eax );
 
        mov( [ebp], ebp );
 
        jmp( eax );
 

 
    
 
end oneOrMoreCset;
 

 

 
end patterns;
 

Program 6 Backtracking Demonstration (pat.oneOrMoreCset Source Code)

The following example code demonstrates the pat.l_OneOrMoreCset routine. This is the same routine as the code above except this code supports lazy/deferred evaluation rather than eager evaluation.


 
unit patterns;
 
#include( "pat.hhf" );
 

 
    
 
    
 
/************************************************************/
 
/*                                                          */
 
/* l_OneOrMoreCset-                                         */
 
/*                                                          */
 
/* Matches one or more characters in a string from          */
 
/* the specified character set.  Matches the shortest       */
 
/* possible string that yields (overall) success.           */
 
/*                                                          */
 
/* Disposition:     Deferred                                */
 
/* BackTrackable:   Yes                                     */
 
/*                                                          */
 
/* Entry Parameters:                                        */
 
/*                                                          */
 
/*  ESI:    Pointer to sequence of characters to match.     */
 
/*  EDI:    Pointer to byte beyond last char to match.      */
 
/*  cst:    Character set to match with.                    */
 
/*                                                          */
 
/* Exit Parameters (if success):                            */
 
/*                                                          */
 
/*  ESI:    Points at first character not in cst.           */
 
/*  EDI:    Unchanged from entry value.                     */
 
/*                                                          */
 
/* Exit Parameters (if failure):                            */
 
/*                                                          */
 
/*  EDI:    Unchanged from entry value.                     */
 
/*  ESI:    Unchanged from entry value.                     */
 
/*                                                          */
 
/* Unless noted, assume all other registers can be modified */
 
/* by this code.                                            */
 
/*                                                          */
 
/************************************************************/
 

 
        
 
        
 
procedure pat.l_OneOrMoreCset( cst:cset ); @nodisplay;
 
var
 
    cursor:     misc.pChar;     // Save last matched posn here.
 
    startPosn:  misc.pChar;     // Save start of str here.
 
    endStr:     misc.pChar;     // End of string goes here.
 
    espSave:    dword;          // To clean stk after back trk.
 
    FailToSave: pat.FailRec;    // Save global FailTo value here.
 
    
 
begin l_OneOrMoreCset;
 

 
    // If some routine after this one fails and transfers
 
    // control via backtracking to this code, the stack
 
    // will be a mess.  So save esp so we can clean up
 
    // the stack if backtracking is necessary.
 
    
 
    mov( esp, espSave );
 
    
 
    // Save the pointer to the start of the string
 
    // to match.  This is used as a "fence" value
 
    // to prevent backtracking past the start of
 
    // the string if things go really wrong.
 
    
 
    mov( esi, startPosn );
 
    mov( esi, ebx );
 

 
    // Save pointer to end of string to match.
 
    // This is needed to restore this value when
 
    // backtracking occurs.  If we're already
 
    // beyond the end of the chars to test, then
 
    // fail right away.
 

 
    mov( edi, endStr );
 
    if( esi >= edi ) then
 

 
        pat._fail_( pat.FailTo );
 

 
    endif;
 
    
 
    // Okay, this is a deferred version.  So match as
 
    // few characters as possible.  For this routine,
 
    // that means match exactly one character.
 
    
 
    xor( eax, eax );
 
    mov( [esi], al );           // Get the char to test.
 
    bt( eax, (type dword cst)); // See if in cst.
 
    if( @nc ) then
 

 
        pat._fail_( pat.FailTo );
 

 
    endif;
 

 
    // So we can easily back track, save a pointer
 
    // to the next character.
 
    
 
    inc( esi );
 
    mov( esi, cursor );
 
    
 
    // Save existing FailTo address and
 
    // point FailTo at our back tracking code,
 
    // then transfer control to the success
 
    // address (jump to our return address).
 

 
    pat._success_( FailToSave, FailToHere );
 

 

 

 

 
    // If someone after us fails and invokes
 
    // backtracking, control is transfered to
 
    // this point.  First, we need to restore
 
    // ESP to clean up the junk on the stack.
 
    // Then we need to advance one character
 
    // and see if the next char would match.
 
    
 
    FailToHere:
 

 
        mov( espSave, esp );        // Clean up stack.
 

 
        mov( cursor, esi );         // Get last posn we matched.
 
        mov( endStr, edi );         // Restore to original value.
 

 
        // If we've exceeded the maximum limit on the string,
 
        // or the character is not in cst, then fail.
 

 
        xor( eax, eax );
 
        if
 
        {
 
            cmp( esi, edi );
 
            jae true;
 
            mov( [esi], al );
 
            bt( eax, (type dword cst ));
 
            jc false;
 
        }
 

 
            // Need to restore FailTo address because it
 
            // currently points at us.  We want to jump
 
            // to the correct location.
 

 
            mov( startPosn, esi );
 
            mov( FailToSave.ebpSave, pat.FailTo.ebpSave );
 
            mov( FailToSave.jmpAdrs, pat.FailTo.jmpAdrs );
 
            pat._fail_( pat.FailTo );
 

 
        endif;
 

 
    
 
        // If we drop down here, there is at least one
 
        // character left in the string that we've
 
        // matched, so call the next matching routine
 
        // (by jumping to the return address) to continue
 
        // the pattern match.
 
        
 
        mov( startPosn, ebx );
 
        inc( esi );                     // Advanced to next posn
 
        mov( esi, cursor );             // save for backtracking,
 
        mov( [ebp+4], eax );            // and call next routine.
 
        mov( [ebp], ebp );
 
        jmp( eax );
 
    
 
end l_OneOrMoreCset;
 

 
end patterns;
 

Program 7 Backtracking Demonstration #2

1If an exception occurs, the exception handling code will clean up the stack, so exceptions are a legitimate way to prematurely leave a pat.match..pat.endmatch block.

2Note that it is okay to push data onto the stack, do some calculations, and then pop that data off the stack between calls to the pattern matching routines. However, you must ensure that the stack is unchanged since the last pattern matching routine (or since pat.match) or the pattern matching routines will malfunction.

3Although this is your decision. If for some reason you don't want to support backtracking in such functions, that is always an option you can choose.

4The one exception has to do with fences. If you set a fence after the pattern matching routine, then backtracking cannot return into the pattern matching function. In this one case, the choice of deferred vs. eager evaluation will have an impact on whether the whole pattern will match a given string.

5The HLA Standard Library Pattern Matching routines preserve EDX, so this is probably a good convention to follow so you don't surprise your users.

6This code was copied out of the "patterns.hhf" file at the time this document was written. You might want to take a look at the patterns.hhf header file to ensure that this code has not changed since this document was written.



TOC PREV NEXT INDEX