$$ \newcommand{\rmap}[3]{#1:#2\rightarrow #3} \newcommand{\lmap}[3]{#1:#2\leftarrow #3} \newcommand{\map}[3]{\rmap{#1}{#2}{#3}} \newcommand{\reals}[0]{\mathbb{R}} \newcommand{\xreals}[0]{\mathbb{R}\cup\{\infty\}} \newcommand{\ub}[1]{\rm{ub\ #1}} \newcommand{\lb}[1]{\rm{lb\ #1}} \newcommand{\glb}[1]{\rm{glb\ #1}} \newcommand{\lub}[1]{\rm{lub\ #1}} \newcommand{\ftom}[4]{\glb{ \left\{#2#1 |\, {}^*#2 = #3\ \rm{and}\ #2^* = #4\right\}}} \usepackage{mathrsfs} \newcommand{\alg}[1]{\mathscr{#1}} \newcommand{\complexes}[0]{\mathbb{C}} $$

A Simple Expression Parser

Abstract

Rather than the usual tokenizing and either precedence parsing or recursive descent parsing, expressions can be parsed, perhaps less efficiently but in a conceptually much simpler way, by a recursion, which converges because each step is guaranteed to remove at least one character from the original expression string.

Parsing

The algorithm takes a string, which is to say a list of characters, regarded as a list of characters and trees. Initially, it has only characters. Eventually, it has exactly one tree, if the expression was syntactically valid in the first place.

The base case of the recursion is then where the list has one tree, in which case that tree is returned. Precedence is reflected in the order in which structures are sought and condensed into trees.

First, function calls are detected, by looking for the first instance of alphabetic characters, followed by an open parenthesis. The matching close parenthesis is found by counting open and close parentheses encountered, until the counts match. The intervening list is recursively fed to the algorithm, the resulting tree is then a child of a new tree node labeled by the function name (the alphabet characters directly preceding that first open parenthesis), is spliced into the list where all of those characters had been, and the result is recursively fed back to the algorithm, and returned.

Next, parenthesized expressions are detected, in exactly the same way, except that there is no name, and no extra tree node. At this point, spaces can be removed.

Next, various operators are processed. This entails finding the operator, extracting its predecessor and its successor, and substituting a new tree node labeled by the operator, with its two arguments as children, and returning the result of running the algorithm on this list. First multiplication, then subtraction, and finally addition are processed in this way.

Notice that since each phase returns the result of calling the algorithm on its resulting list of characters and trees. This means that within any call of the algorithm, only the first thing detected will be handled, with everything else delegated recursively.

Check out the python reference implementation.

I'm Andrew Winkler. I hope you've enjoyed exploring these ideas with me. Thanks for your time. If you've got any questions, send an email to 4af502d7148512d4fee9@cloudmailin.net.

The Hertz Foundation has been a valuable support, not only during graduate school, but ever since. If you can, please consider supporting them with a donation.

How can I help support this work?

If you can support this work directly, please check out Patreon

As an Amazon Associate I earn from qualifying purchases.