It turns out that pattern matching and data structure decomposition are wonderful tools for parsing nonregular grammars (such as Markdown).
I’m working on writing a Markdown parser in the Elixir language. It’s something I do in new languages when I get the time, because parsing Markdown is nontrivial—the syntax is decidedly nonregular.
Most of the Markdown parsers out there use regular expressions to scan and transform their input. I started off doing the same, but quickly bumped into all the nastiness that this approach entails.
Then I realized that the so-called block-level Markdown elements can be parsed differently. Rather than try to do it in one pass, I’m taking two.
On the first pass, I just look at each line, categorizing it. Typical categories might be blank line, rule, heading, and so on. But there are some things that can’t be decided by looking at just one line. For example, a setext heading looks like this:
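For instance (the heading text here is just an illustration):

```

My Section Heading
==================

```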
That is, the heading is a blank line, a line of text, a line containing equals signs (or hyphens), and another blank line.
In the design I’m playing with, these kinds of headings (and other structures that require more context than a single line to parse) are recognized by the second pass through the lines.
The top level code looks something like this:
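Something like this minimal sketch (the module and function names are my assumptions, and only two line types are stubbed in):

```elixir
defmodule FirstPass do
  # Make the line list streamable, delimited by blank lines,
  # then categorize each line with type_of/1.
  def line_types(lines) do
    Stream.concat([ [""], lines, [""] ])
    |> Stream.map(&type_of/1)
    |> Enum.to_list()
  end

  # Just enough of type_of/1 to make the sketch run;
  # the real version has many more clauses.
  defp type_of(""),   do: [type: :blankline]
  defp type_of(line), do: [type: :textline, text: line]
end
```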
The first line of the function takes the list of lines and turns them into something that is streamable. At the same time, it prepends and appends a blank line, because empty lines act as delimiters for many block-level structures.
The `map` line applies the function `type_of` to each line.
First Pass—Assigning a Type to Each Line
The `type_of` function returns a description of each line. For example, a blank line will be returned as `[type: :blankline]`, and the atx-style heading `### Conclusion` will be returned as `[type: :heading, level: 3, text: "Conclusion"]`. The code that performs this mapping is:
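A sketch of how that mapping might look, with one function clause per line type (the module name and the exact clauses are my assumptions, not the original code):

```elixir
defmodule LineTyper do
  # A blank line
  def type_of(""), do: [type: :blankline]

  # A horizontal rule
  def type_of("***" <> _), do: [type: :rule]

  # An atx-style heading: the leading '#' characters give the
  # level (this sketch assumes a space after the #s)
  def type_of(line = "#" <> _) do
    [hashes, text] = String.split(line, " ", parts: 2)
    [type: :heading, level: String.length(hashes), text: text]
  end

  # ...a dozen or so more clauses, ending with a catch-all:
  def type_of(line), do: [type: :textline, text: line]
end
```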
There are probably 15 or so of these line types. Two that we’ll be using in a minute are `:setext_underline_heading`, a line containing only equals signs or hyphens, and `:textline`, a line containing text that doesn’t match any of the other types.
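Those two might be sketched like this (the regex and the module name are my assumptions):

```elixir
defmodule MoreTypes do
  # A line containing only equals signs or hyphens underlines
  # a setext heading; anything left over is plain text.
  def type_of(line) do
    if line =~ ~r/\A(=+|-+)\z/ do
      [type: :setext_underline_heading]
    else
      [type: :textline, text: line]
    end
  end
end
```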
Second Pass—Combining Lines into Blocks
So now we have a list of line types. We need to look for things such as the setext-style heading. And it turns out that structure decomposition lets us do this pretty painlessly. Here’s the code for the headings:
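Here is a reconstruction of the shape of that code (the worker name `_merge_compound` and the fixed heading level are my simplifications, not the original listing):

```elixir
defmodule Compound do
  def merge_compound(lines), do: _merge_compound(lines, [])
  defp _merge_compound([], result), do: Enum.reverse(result)

  defp _merge_compound([ [type: :blankline],
                         [type: :textline, text: heading],
                         [type: :setext_underline_heading],
                         [type: :blankline]
                         | rest ], result) do
    # Swap the four matched lines for blank/heading/blank.
    # (A fuller version would use the underline character to
    # choose between level 1 and level 2.)
    _merge_compound(rest, [ [type: :blankline],
                            [type: :heading, level: 1, text: heading],
                            [type: :blankline]
                            | result ])
  end

  # Any other line passes through unchanged.
  defp _merge_compound([line | rest], result) do
    _merge_compound(rest, [line | result])
  end
end
```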
The first function is just a convenience, calling the real worker with an additional parameter to hold the result.
Then we have `merge_compound`, which recursively processes the lines.
The interesting thing here is the function head that starts on line 5. Remember we said that a setext heading is a blank line, a line of text, the line of equals signs, and another blank line. See how structure decomposition and pattern matching let us express this directly in code:
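The pattern can even be exercised on its own. Here the same four-element head is matched directly against a literal list (the data is invented for illustration):

```elixir
# A list of typed lines, as produced by the first pass
lines = [
  [type: :blankline],
  [type: :textline, text: "My Heading"],
  [type: :setext_underline_heading],
  [type: :blankline],
  [type: :textline, text: "and so on"]
]

# The four-element pattern from the function head: it binds
# the heading text and the remaining lines in one match.
[ [type: :blankline],
  [type: :textline, text: heading],
  [type: :setext_underline_heading],
  [type: :blankline]
  | rest ] = lines
```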
This particular function will only be called if passed a list that starts with those four elements.
Then, having recognized it, the body of the function replaces the matched lines with new lines—a blank, a heading, and another blank. We add these three to the result list by calling ourselves recursively.
And so it goes for the other complex block types.
Pattern Matching Is Stream Parsing
In functional programming, your code is basically one function, `some_function`, applied to your input. You then use composition to break `some_function` into successively lower levels of abstraction until you are dealing with primitive values and built-in functions, at which point the decomposition can stop.
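As a sketch, with invented stage names (none of this is from the original), even a tiny transformer has that pipeline shape:

```elixir
defmodule Tiny do
  # Data flows in, is transformed stage by stage, and flows out.
  # Each stage is itself decomposable into smaller functions.
  def run(text) do
    text
    |> String.split("\n")
    |> Enum.map(&String.upcase/1)
    |> Enum.join("\n")
  end
end
```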
If you look at it this way, your program is basically a pipeline. A stream of data enters it, gets analyzed, broken apart, piped through a sequence of functions, mapped into something different, and reassembled into the output.
That’s where pattern matching comes in, destructuring the stream of data and selecting which functions to apply.
And that’s the cool part. If you think of your code this way, then you’re actually using pattern matching to parse the stream of data as it passes through. If you can express your input as some kind of grammar, then your program is something that transforms data that matches that grammar.
This isn’t just some kind of fancy abstraction, because knowing the grammar you are working with can greatly simplify your code. For example, if you know that your business rules are context free, then you also know that the code that parses and processes the corresponding data can operate in isolation—it is automatically decoupled.
I’d written code this way in Elixir, but hadn’t really thought about it in terms of parsing until now. Now I’m starting to see more and more of what I do in terms of grammars and production rules.
An Aside—Listening to the Angel on your Shoulder (or the Dog on Your Rug)
And I kept at it. I wasn’t going to give in. It didn’t feel right, but I WAS NOT GOING TO BE BEATEN.
In the old days, I’d describe my programming problems to a rubber duck. Nowadays I’ve graduated to Moose, a twelve-year-old black lab whose ability to look disgusted at the ways of people is unsurpassed. When I grumbled about the problems I was having to Moose, he simply raised an eyebrow, turned over, and went back to sleep. Which, of course, is shorthand for “if the code is telling you that it’s hard to do it this way, stop doing it this way.”
So I threw away perhaps a day’s work and reimplemented it using pattern matching and streams in an hour or two.
It just feels a lot better.
And that’s one of the reasons to get into the habit of practicing code kata. The fact that it took a day to recognize I was heading into a swamp is very sad, but the kata let me make that mistake in a noncritical environment.
I was doing the Markdown kata for the nth time. This time I learned not to fall into the rut of doing it the same way each time, and I came away promising myself I’d listen to that feeling of disquiet sooner in future.