Lexing with tokenize
Python's tokenize library
was designed to lex Python source code and turn its elements into tokens, but
because those tokens are fairly generic, the tokenizer can be used to lex
input other than Python source code too.
To begin tokenizing input, we import the generate_tokens function:
>>> from tokenize import generate_tokens
generate_tokens should be called with an argument that acts like the readline
method of a file object. To call it on a string, we create a StringIO instance
and pass its readline method:
>>> from StringIO import StringIO
>>> ts = list(generate_tokens(StringIO('3 * (4 + 5)').readline))
>>> ts
[(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'), (51, '*', (1, 2), (1, 3), '3 * (4 + 5)'), (51, '(', (1, 4), (1, 5), '3 * (4 + 5)'), (2, '4', (1, 5), (1, 6), '3 * (4 + 5)'), (51, '+', (1, 7), (1, 8), '3 * (4 + 5)'), (2, '5', (1, 9), (1, 10), '3 * (4 + 5)'), (51, ')', (1, 10), (1, 11), '3 * (4 + 5)'), (0, '', (2, 0), (2, 0), '')]
We'll use pprint to make this easier to read:
>>> from pprint import pformat
>>> print pformat(ts)
[(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'),
 (51, '*', (1, 2), (1, 3), '3 * (4 + 5)'),
 (51, '(', (1, 4), (1, 5), '3 * (4 + 5)'),
 (2, '4', (1, 5), (1, 6), '3 * (4 + 5)'),
 (51, '+', (1, 7), (1, 8), '3 * (4 + 5)'),
 (2, '5', (1, 9), (1, 10), '3 * (4 + 5)'),
 (51, ')', (1, 10), (1, 11), '3 * (4 + 5)'),
 (0, '', (2, 0), (2, 0), '')]
Each token is a 5-tuple: the token code, the token string, (row, column) tuples marking where the token begins and ends, and the line on which it was found.
A Token Class
To make the output from the tokenizer easier to work with, we'll create a Token class.
class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''
The name corresponding to the code in each tuple returned from the tokenizer
can be looked up in the tok_name dict of the standard-library
token module.
We define a name property to retrieve it:
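The property itself is not shown in the listing; a minimal sketch, assuming only the standard-library token module (the class so far is repeated here so the snippet runs standalone):

```python
from token import tok_name

class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''

    @property
    def name(self):
        # Look up the symbolic name (NUMBER, OP, ...) for this code.
        return tok_name[self.code]
```

For example, `Token(2, '3').name` is `'NUMBER'`, since code 2 is token.NUMBER.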
And we use this in the __unicode__ and __str__ methods:
    def __unicode__(self):
        pos = u'-'.join(u'%d,%d' % x for x in [self.start, self.stop])
        return u"%s %s '%s'" % (pos, self.name, self.value)
def test_strings():
    token = Token(51, '*')
    assert str(token) == "0,0-0,0 OP '*'"
    assert unicode(token) == u"0,0-0,0 OP '*'"
We implement a __repr__ method:
    def __repr__(self):
        args = (self.code, self.value, self.start, self.stop, self.line)
        return "Token(%r, %r, %r, %r, %r)" % args
And an __eq__ method:
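The __eq__ body is not listed either. The test below expects token1 == token2 even though their positions differ, so only code and value can take part in the comparison; one sketch (with __ne__ spelled out explicitly, since Python 2 does not derive it from __eq__):

```python
class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''

    def __eq__(self, other):
        # Position and line are ignored: two tokens are equal when they
        # have the same code and the same value.
        return (self.code, self.value) == (other.code, other.value)

    def __ne__(self, other):
        return not self == other
```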
def test_equality():
    token1 = Token(51, '*', (1, 1), (2, 1))
    token2 = Token(51, '*', (2, 1), (3, 1))
    token3 = Token(2, '3', (2, 1), (3, 1))
    assert token1 == token1
    assert token1 == token2
    assert token1 != token3
    assert token2 != token3
Here's the class being used to print the token information we lexed earlier:
>>> from token_class import Token
>>> print "\n".join((unicode(Token(*t)) for t in ts))
1,0-1,1 NUMBER '3'
1,2-1,3 OP '*'
1,4-1,5 OP '('
1,5-1,6 NUMBER '4'
1,7-1,8 OP '+'
1,9-1,10 NUMBER '5'
1,10-1,11 OP ')'
2,0-2,0 ENDMARKER ''
Now we can write a tokenize function that wraps up this behavior:
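The function body is not listed; a minimal sketch that reproduces, under Python 2, the session shown below. The Token class is abbreviated to just __init__ and __repr__ to keep the example self-contained, and the Python 3 home of StringIO is handled alongside the article's Python 2 import:

```python
try:
    from StringIO import StringIO   # Python 2, as used in the article
except ImportError:
    from io import StringIO         # Python 3
from tokenize import generate_tokens

class Token(object):
    def __init__(self, code, value, start=None, stop=None, line=None):
        self.code = code
        self.value = value
        self.start = start or (0, 0)
        self.stop = stop or (0, 0)
        self.line = line or ''

    def __repr__(self):
        args = (self.code, self.value, self.start, self.stop, self.line)
        return "Token(%r, %r, %r, %r, %r)" % args

def tokenize(string):
    # Wrap the string in a file-like object and collect every token
    # as a Token instance.
    return [Token(*t) for t in generate_tokens(StringIO(string).readline)]
```

(On Python 3 the token codes and an extra trailing NEWLINE token differ slightly from the Python 2 output shown below, but the token values are the same.)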
>>> tokenize('3 * (4 + 5)')
[Token(2, '3', (1, 0), (1, 1), '3 * (4 + 5)'), Token(51, '*', (1, 2), (1, 3), '3 * (4 + 5)'), Token(51, '(', (1, 4), (1, 5), '3 * (4 + 5)'), Token(2, '4', (1, 5), (1, 6), '3 * (4 + 5)'), Token(51, '+', (1, 7), (1, 8), '3 * (4 + 5)'), Token(2, '5', (1, 9), (1, 10), '3 * (4 + 5)'), Token(51, ')', (1, 10), (1, 11), '3 * (4 + 5)'), Token(0, '', (2, 0), (2, 0), '')]