Python text scanner tutorial

    # Example 3
    from Plex import *

    letter = Range("AZaz")
    digit = Range("09")
    name = letter + Rep(letter | digit)
    number = Rep1(digit)
    space = Any(" \t\n")
    comment = Str("{") + Rep(AnyBut("}")) + Str("}")
    resword = Str("if", "then", "else", "end")

    lex = Lexicon([
        (resword,         TEXT),
        (name,            'ident'),
        (number,          'int'),
        (Any("+-*/=<>"),  TEXT),
        (space | comment, IGNORE)
    ])

This example introduces some more features:

Range() takes a string containing pairs of characters. It matches any single character which lies within one of the ranges defined by a pair.

Rep() matches zero or more repetitions of a pattern (as opposed to Rep1(), which matches one or more).

AnyBut(s) matches any single character (including a newline) which is not in the string s.

Patterns can be combined using the operators '+' and '|'. The expression p1 + p2 matches p1 followed by p2, and p1 | p2 matches either p1 or p2.

Str() can take multiple strings as arguments, in which case it matches any one of them.

The TEXT special action causes the matched text to be returned as the value of the token. Here it is used to arrange for the reserved words and operators to all have unique token values, without having to explicitly list them all as separate tokens.

Incidentally, this example also illustrates some of the advantages of using a constructor-function approach to building regular expressions as opposed to the traditional pattern-string syntax. Patterns can be broken down into readable chunks and the parts commented, and general Python coding techniques can be used to build them.

    # Example 4
    def begin_comment(scanner, text):
        scanner.nesting_level = scanner.nesting_level + 1

    def end_comment(scanner, text):
        scanner.nesting_level = scanner.nesting_level - 1

    def maybe_a_name(scanner, text):
        if scanner.nesting_level == 0:
            return 'ident'

    lex = Lexicon([
        (Str("(*"),  begin_comment),
        (Str("*)"),  end_comment),
        (name,       maybe_a_name),
        (space,      IGNORE)
    ])

    scn = Scanner(lex, ...)
    scn.nesting_level = 0

When an action procedure is called, it is passed the scanner which has just recognised the token, and the text which was matched. If the procedure returns anything other than None, it is returned as the value of the token. If it returns None, scanning continues as if the IGNORE action had been specified.

Here, the procedures begin_comment() and end_comment() maintain a count of the comment nesting level in an extra instance attribute attached to the Scanner. When something that might be an identifier is recognised, the maybe_a_name() procedure checks whether it's inside a comment. If not, it returns 'ident' as the value of the token; otherwise the default value of None is returned, and the token is ignored.

At this point you're probably thinking that the last bit is rather clumsy, and you'd be right. If we wanted to recognise more tokens than just identifiers, we'd have to have maybe_an_operator(), maybe_a_number(), maybe_a_keyword(), etc. Rather unwieldy! In the next example, we'll see a much better way.

    # Example 5
    lex = Lexicon([
        (name,       'ident'),
        (number,     'int'),
        (space,      IGNORE),
        (Str("(*"),  Begin('comment')),
        State('comment', [
            (Str("*)"), Begin('')),
            (AnyChar,   IGNORE)
        ])
    ])

The State() constructor introduces a new scanner state. It takes the place of a token in the token list, and takes two arguments: the name of the state, and another list of tokens. (In case you're wondering, that second token list can't contain further State() constructors; states cannot be nested.)

Initially, a newly-created Scanner is in the default state, whose name is the empty string. In this state, only the first four tokens will match. When the beginning of a comment is recognised, the action Begin('comment') is invoked. This is another special action, whose effect is to change the current state of the scanner. Now the scanner is in the state called 'comment', and will only recognise the two tokens belonging to that state. Their effect is to ignore everything up to the next end-comment marker, whereupon the default state is re-entered and normal scanning continues.

Note that the patterns which recognise the insides of the comment rely on the longest match feature of Plex: if more than one token matches at a given input position, and they match strings of different lengths, the longest one takes precedence.

The scanner-state technique makes recognising comments rather easy. Not trying to handle nesting, we could have recognised the whole comment using a single pattern, but it would be a rather tricky and complicated one. Using a separate state for comments, we avoid any such nightmare.

There is a second advantage. When Plex is scanning a token, it has to buffer up all the characters read until the whole token is recognised, in case the matched text is needed. In the case of a comment it is not, but Plex has no way of knowing that, so if you try to recognise the whole comment as a single pattern, Plex happily buffers it all up before throwing it away. If someone one day feeds you a comment a few megabytes long, all of it is held in memory first. In contrast, the above example skips over the comment without ever having to buffer more than a very short token at a time.

The same technique can be extended to handle the full range of Pascal-style comments. In Example 7, the scanner is a subclass of Scanner, and the comment actions are methods which track the nesting level:

    # Example 7
    class MyScanner(Scanner):

        def begin_comment(self, text):
            if self.nesting_level == 0:
                self.begin('comment')
            self.nesting_level = self.nesting_level + 1

        def end_comment(self, text):
            self.nesting_level = self.nesting_level - 1
            if self.nesting_level == 0:
                self.begin('')

        lexicon = Lexicon([
            (name,       'ident'),
            (number,     'int'),
            (space,      IGNORE),
            (Str("(*"),  begin_comment),
            State('comment', [
                (Str("(*"), begin_comment),
                (Str("*)"), end_comment),
                (AnyChar,   IGNORE)
            ])
        ])

        def __init__(self, file, name):
            Scanner.__init__(self, self.lexicon, file, name)
            self.nesting_level = 0

The trick here is that we construct the Lexicon inside the class scope of MyScanner, where the names begin_comment and end_comment are directly in scope and can be used as actions in the token list.
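For readers without Plex installed, the nested-comment technique discussed in this tutorial (a separate comment state plus a nesting counter) can be tried in plain Python. What follows is a minimal sketch, not Plex's implementation: the function name scan, the DEFAULT_TOKENS table and the use of the standard re module are all assumptions made for illustration, and a token value of None merely stands in for Plex's IGNORE action.

```python
import re

# Hypothetical stand-in for a lexicon: (pattern, token value) pairs.
# A value of None plays the role of an "ignore" action.
DEFAULT_TOKENS = [
    (re.compile(r"[A-Za-z][A-Za-z0-9]*"), "ident"),
    (re.compile(r"[0-9]+"), "int"),
    (re.compile(r"[ \t\n]+"), None),
]


def scan(text):
    """Yield (value, matched_text) pairs, skipping nested (* ... *) comments."""
    pos = 0
    nesting_level = 0  # 0 means the scanner is in its default state
    while pos < len(text):
        if text.startswith("(*", pos):
            nesting_level += 1      # enter, or go deeper into, the comment state
            pos += 2
        elif nesting_level > 0:
            if text.startswith("*)", pos):
                nesting_level -= 1  # default state resumes when this reaches 0
                pos += 2
            else:
                pos += 1            # discard one character inside the comment
        else:
            best = None
            for pattern, value in DEFAULT_TOKENS:
                m = pattern.match(text, pos)
                # Longest match wins; on a tie the earlier entry is kept.
                if m and (best is None or m.end() > best[0].end()):
                    best = (m, value)
            if best is None:
                raise ValueError(f"unrecognised input at position {pos}")
            m, value = best
            pos = m.end()
            if value is not None:
                yield (value, m.group())


tokens = list(scan("alpha (* outer (* inner *) still comment *) 42 beta7"))
print(tokens)
```

Inside a comment the sketch consumes one character at a time and never accumulates the comment text, which mirrors the buffering argument made for using a separate scanner state.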