Lexing with tokenize
Python's tokenize library
was designed to lex Python source code and turn its elements into tokens, but
because those tokens are fairly generic, the tokenizer can be used to lex
input other than Python source code too.
To begin tokenizing input, we import the generate_tokens function:
>>> from tokenize import generate_tokens
generate_tokens should be called with an argument that acts like the readline
method of a file object. To call it on a string, we create a StringIO instance
and pass its readline method:
>>> from StringIO import StringIO
>>> ts = list(generate_tokens(StringIO('3 * (4 + 5)').readline))
>>> ts
[(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'), (51, '*', (1, 2), (1, 3), '3 * (4 + 5)'), (51, '(', (1, 4), (1, 5), '3 * (4 + 5)'), (2, '4', (1, 5), (1, 6), '3 * (4 + 5)'), (51, '+', (1, 7), (1, 8), '3 * (4 + 5)'), (2, '5', (1, 9), (1, 10), '3 * (4 + 5)'), (51, ')', (1, 10), (1, 11), '3 * (4 + 5)'), (0, '', (2, 0), (2, 0), '')]
We'll use pprint to make this easier to read:
>>> from pprint import pformat
>>> print pformat(ts)
[(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'),
 (51, '*', (1, 2), (1, 3), '3 * (4 + 5)'),
 (51, '(', (1, 4), (1, 5), '3 * (4 + 5)'),
 (2, '4', (1, 5), (1, 6), '3 * (4 + 5)'),
 (51, '+', (1, 7), (1, 8), '3 * (4 + 5)'),
 (2, '5', (1, 9), (1, 10), '3 * (4 + 5)'),
 (51, ')', (1, 10), (1, 11), '3 * (4 + 5)'),
 (0, '', (2, 0), (2, 0), '')]
Each token is a 5-tuple: the token code, the token string, (row, column) tuples marking where the token begins and ends, and the line on which it was found.
A Token Class
To make the output from the tokenizer easier to work with, we'll create a Token class.
class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''
The name corresponding to the code in each tuple returned from the tokenizer
can be looked up in the tok_name dict of the standard-library
token module.
We define a name property to retrieve it:
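The property itself is not shown in the listing; a minimal sketch, assuming only the standard-library token module (the class so far is repeated here so the snippet runs standalone):

```python
from token import tok_name

class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''

    @property
    def name(self):
        # Look up the symbolic name (NUMBER, OP, ...) for this code.
        return tok_name[self.code]
```

For example, `Token(2, '3').name` is `'NUMBER'`, since code 2 is token.NUMBER.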
And we use this in the __unicode__ and __str__ methods:
    def __unicode__(self):
        pos = u'-'.join(u'%d,%d' % x for x in [self.start, self.stop])
        return u"%s %s '%s'" % (pos, self.name, self.value)
def test_strings():
    token = Token(51, '*')
    assert str(token) == "0,0-0,0 OP '*'"
    assert unicode(token) == u"0,0-0,0 OP '*'"
We implement a __repr__ method:
    def __repr__(self):
        args = (self.code, self.value, self.start, self.stop, self.line)
        return "Token(%r, %r, %r, %r, %r)" % args
And an __eq__ method:
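The __eq__ body is not listed either. The test below expects token1 == token2 even though their positions differ, so only code and value can take part in the comparison; one sketch (with __ne__ spelled out explicitly, since Python 2 does not derive it from __eq__):

```python
class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''

    def __eq__(self, other):
        # Position and line are ignored: two tokens are equal when they
        # have the same code and the same value.
        return (self.code, self.value) == (other.code, other.value)

    def __ne__(self, other):
        return not self == other
```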
def test_equality():
    token1 = Token(51, '*', (1, 1), (2, 1))
    token2 = Token(51, '*', (2, 1), (3, 1))
    token3 = Token(2, '3', (2, 1), (3, 1))
    assert token1 == token1
    assert token1 == token2
    assert token1 != token3
    assert token2 != token3
Here's the class being used to print the token information we lexed earlier:
>>> from token_class import Token
>>> print "\n".join((unicode(Token(*t)) for t in ts))
1,0-1,1 NUMBER '3'
1,2-1,3 OP '*'
1,4-1,5 OP '('
1,5-1,6 NUMBER '4'
1,7-1,8 OP '+'
1,9-1,10 NUMBER '5'
1,10-1,11 OP ')'
2,0-2,0 ENDMARKER ''
Now we can write a tokenize function that wraps up this behavior:
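The function body is not listed; a minimal sketch that reproduces, under Python 2, the session shown below. The Token class is abbreviated to just __init__ and __repr__ to keep the example self-contained, and the Python 3 home of StringIO is handled alongside the article's Python 2 import:

```python
try:
    from StringIO import StringIO   # Python 2, as used in the article
except ImportError:
    from io import StringIO         # Python 3
from tokenize import generate_tokens

class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''

    def __repr__(self):
        args = (self.code, self.value, self.start, self.stop, self.line)
        return "Token(%r, %r, %r, %r, %r)" % args

def tokenize(string):
    # Wrap the string in a file-like object and collect every token
    # as a Token instance.
    return [Token(*t) for t in generate_tokens(StringIO(string).readline)]
```

(On Python 3 the token codes and an extra trailing NEWLINE token differ slightly from the Python 2 output shown below, but the token values are the same.)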
>>> tokenize('3 * (4 + 5)')
[Token(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'), Token(51, '*', (1, 2), (1, 3), '3 * (4 + 5)'), Token(51, '(', (1, 4), (1, 5), '3 * (4 + 5)'), Token(2, '4', (1, 5), (1, 6), '3 * (4 + 5)'), Token(51, '+', (1, 7), (1, 8), '3 * (4 + 5)'), Token(2, '5', (1, 9), (1, 10), '3 * (4 + 5)'), Token(51, ')', (1, 10), (1, 11), '3 * (4 + 5)'), Token(0, '', (2, 0), (2, 0), '')]