Lexical Analysis

Wikipedia defines this as the “conversion of text into (semantically or syntactically) meaningful lexical tokens belonging to categories defined by a lexer” [1]. Urban Dictionary defines the word I’ve been using as a verb for the act of performing lexical analysis, ‘lexing’, as something completely different, and perhaps I should stop saying it [2].

If you read the book Compilers: Principles, Techniques, and Tools, 2nd Edition [3], section 1.2 describes lexical analysis as simply taking in a character stream and outputting a token stream. I highly recommend this book to anyone interested in compiler design. It should be considered fundamental reading for compiler study, as it covers everything needed to write a working compiler.

The algorithm for ‘lexing’ is trivial. Suppose you have a fixed-size string to perform lexical analysis on, such as int x = 1;, an index to the first character, and some simple rules to follow (a rough sketch in Go follows the list):

  1. If your index points to a text character, keep moving the index forward, building a string by appending each character you see, until you reach a non-textual character (or a character that can’t be used as part of an identifier, if you’re lexing a programming language).
  2. If your index points to whitespace, keep moving the index forward. If you care about whitespace, you’d also record it in some way.
  3. If your index points to a digit, keep moving the index forward, building a string representation of a number, until you meet a non-numeric character.
  4. If your index points to something non-textual and non-numeric, such as an operator like =, +, -, *, etc., emit a token representing that operator. You may need to ‘peek’ forward if you want to support operators more than one character long. Other characters like ; should be handled the same way and generate their own tokens.
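
To make these rules concrete, here is a rough, hand-rolled sketch of that loop in Go. The token names (IDENT, NUMBER, SYMBOL) and the choice to treat every remaining character as a single-character symbol are my own simplifications for illustration, not rules of any particular language.

package main

import (
	"fmt"
	"unicode"
)

// lexSimple walks the input with a single index and applies the four rules
// above: identifiers, whitespace, numbers, then single-character symbols.
func lexSimple(src string) []string {
	var tokens []string
	runes := []rune(src)
	i := 0
	for i < len(runes) {
		c := runes[i]
		switch {
		case unicode.IsLetter(c) || c == '_':
			// Rule 1: consume letters, digits and underscores into an identifier.
			start := i
			for i < len(runes) && (unicode.IsLetter(runes[i]) || unicode.IsDigit(runes[i]) || runes[i] == '_') {
				i++
			}
			tokens = append(tokens, fmt.Sprintf("IDENT(%s)", string(runes[start:i])))
		case unicode.IsSpace(c):
			// Rule 2: skip whitespace (record it here if your language cares about it).
			i++
		case unicode.IsDigit(c):
			// Rule 3: consume consecutive digits into a number.
			start := i
			for i < len(runes) && unicode.IsDigit(runes[i]) {
				i++
			}
			tokens = append(tokens, fmt.Sprintf("NUMBER(%s)", string(runes[start:i])))
		default:
			// Rule 4: anything else becomes a single-character token; peek ahead
			// here if you need multi-character operators like ==.
			tokens = append(tokens, fmt.Sprintf("SYMBOL(%c)", c))
			i++
		}
	}
	return tokens
}

func main() {
	// "int x = 1;" lexes to IDENT(int) IDENT(x) SYMBOL(=) NUMBER(1) SYMBOL(;)
	fmt.Println(lexSimple("int x = 1;"))
}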

(Urban Dictionary’s definition of ‘lexing’: “A jizz covered library”.)

Your requirements for tokens will be different depending on what your language looks like, so these rules might change. The underlying logic of this pattern is to maintain a pointer into the stream and decide whether you consume the current character as a token on its own, or remember it and keep consuming forward to realise a larger token.

Tutorials usually show code that, to me, seems impractical to maintain, such as nested switch statements. It doesn’t matter if you abstract this and create methods for handling operators of more than one character, or the various rules for identifiers, keywords and so on; it still ends with the developer going, “I want to add another blah, and that means I need another peek in this method, and another keyword in this dictionary…”

This is why some people opt for off-the-shelf tooling to solve this problem. They might rather write a single list of regexes and their associated token names. If you need a new token, you add it to that one list and don’t have to worry about fitting it into the lexer code.

You can see an example of Lex in action here. It appears very easy to use, but this can be even simpler: you don’t need an entire tool that analyses a separate file to construct the lexer for you. I don’t want to use an off-the-shelf tool, and I don’t really want to manage extra files just for the purpose of lexical analysis.

Here’s an example of what you could do in Golang for a regex-powered lexer:

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
)

// Token pairs a token name with the regex that recognises it. Patterns should
// be anchored to the start of the input (begin with ^) so that a match is
// always consumed from the front of the buffer.
type Token struct {
	Id    string
	Regex *regexp.Regexp
}

func LexAsync(
	input io.Reader,
	tokens []*Token,
	bufferSize int,
	results chan string,
	errs chan error) {
	// Close both channels on every exit path so callers ranging over them
	// always terminate.
	defer close(results)
	defer close(errs)

	if len(tokens) == 0 {
		return
	}

	in := bufio.NewReader(input)

	buffer := make([]byte, bufferSize)

	// Fill the initial window. io.ReadFull keeps reading until the buffer is
	// full or the stream ends, avoiding a short first read.
	count, err := io.ReadFull(in, buffer)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		errs <- err
		return
	}
	if count == 0 {
		return
	}

	// Keep lexing until the window contains nothing but zero padding, which
	// only happens once the reader is exhausted.
	for !isZeroed(buffer) {
		couldConsume := false
		// Tokens are tried from last to first, so later entries in the slice
		// take priority when more than one pattern could match.
		for i := len(tokens) - 1; i >= 0; i-- {
			token := tokens[i]
			if len(buffer) == 0 {
				break
			}

			// Try to match this token's pattern against the front of the
			// window (patterns are expected to be anchored with ^).
			match := token.Regex.Find(buffer)
			if match == nil {
				continue
			}

			couldConsume = true

			buffer, err = scrollInputBuffer(buffer, in, uint(len(match)))
			if err != nil {
				errs <- err
				break
			}
			
			results <- token.Id
			break
		}

		// Nothing matched: report the offending byte and skip past it so one
		// bad character doesn't stall the whole lexer.
		if !couldConsume {
			errs <- fmt.Errorf("UNKNOWN('%s')", string(buffer[0]))
			buffer, err = scrollInputBuffer(buffer, in, 1)
			if err != nil {
				errs <- err
				break
			}
		}
	}

}

// isZeroed reports whether the window contains only zero bytes, i.e. no
// unconsumed input remains.
func isZeroed(buffer []byte) bool {
	for _, b := range buffer {
		if b != 0 {
			return false
		}
	}
	return true
}

// scrollInputBuffer drops count consumed bytes from the front of the window
// and tops it up with fresh bytes from the reader, keeping the window size
// constant. Any shortfall at the end of the stream is left as trailing zero
// bytes, which isZeroed treats as exhausted input.
func scrollInputBuffer(buffer []byte, input *bufio.Reader, count uint) ([]byte, error) {
	newData := make([]byte, count)

	// io.ReadFull reads until newData is full or the stream ends, so a short
	// read mid-stream can't leave zero bytes in the middle of the window.
	_, err := io.ReadFull(input, newData)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, err
	}

	newBuffer := make([]byte, 0, len(buffer))
	newBuffer = append(newBuffer, buffer[count:]...)
	newBuffer = append(newBuffer, newData...)

	return newBuffer, nil
}

This code comes from a parser generator I’m building; it’s not perfect, and there is always room for improvement.

The intention is that a user calls it in a goroutine and listens on the results and errs channels as a character stream is fed into it. You create a slice of Token values, where Id is the name of the token to emit and Regex is the rule that matches it.
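
For example, driving the lexer might look roughly like this. The token names and patterns are my own illustrative choices, and I’m assuming the lexer code above sits in the same package; note that each pattern is anchored with ^ so it matches at the front of the buffer window.

package main

import (
	"fmt"
	"log"
	"regexp"
	"strings"
)

// Assumes Token and LexAsync from the snippet above live in this package.
func main() {
	// Illustrative token names and patterns of my own choosing.
	tokens := []*Token{
		{Id: "ID", Regex: regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*`)},
		{Id: "INT", Regex: regexp.MustCompile(`^[0-9]+`)},
		{Id: "ASSIGN", Regex: regexp.MustCompile(`^=`)},
		{Id: "SEMICOLON", Regex: regexp.MustCompile(`^;`)},
		{Id: "WHITESPACE", Regex: regexp.MustCompile(`^\s+`)},
	}

	results := make(chan string)
	errs := make(chan error)

	go LexAsync(strings.NewReader("int x = 1;"), tokens, 64, results, errs)

	// Drain errs on its own goroutine so an error can never block the lexer.
	go func() {
		for err := range errs {
			log.Println(err)
		}
	}()

	// LexAsync closes results once the input is exhausted.
	for id := range results {
		fmt.Println(id)
	}
}

For int x = 1; this prints ID, WHITESPACE, ID, WHITESPACE, ASSIGN, WHITESPACE, INT and SEMICOLON in that order, since whitespace is matched explicitly here rather than skipped.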

The buffer size is the window into the reader that a regex may match against. It imposes an arbitrary constraint on the maximum size of any single token match; a buffer size of 64, for example, means the lexer could not recognise an identifier longer than 64 bytes. This was done so that the regexes didn’t have to run across the entire stream for every token you might have. Keeping less in memory at any point in time is generally good for performance, but I have not yet done any benchmarking or logical optimizations on this code. This is late-night, coffee-fuelled code.

The reality is that you would need to extend the lexer for your own purposes; it has been stripped down for this post. In a real use case you might tie it into something like a symbol table, or emit token matches that carry both the token type and the matched value. That should be very easy to implement: change the results channel type from string to your custom type and modify the points where tokens are emitted in the lexer.
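
As a rough sketch of that change, with a hypothetical Lexeme type whose name and fields are mine rather than anything from the real project:

// Lexeme is a hypothetical richer result type that carries the matched text.
type Lexeme struct {
	Id    string // token name, e.g. "ID"
	Value string // the exact bytes that matched, as a string
}

// The results parameter would become `results chan Lexeme`, and each emit
// inside LexAsync would change from
//
//	results <- token.Id
//
// to
//
//	results <- Lexeme{Id: token.Id, Value: string(match)}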

The great thing about this is that the developer only has to maintain a list of tokens and their associated regex patterns. I’m happy with this for now; although I want something high-performance, in the case of lexical analysis I weigh runtime performance against the ease of updating the tokens, and that ease wins for now.