Guile Parser Combinators implement (monadic) parser combinators. What it means in pratice is that you create procedures that parse most of the time strings and output a structured data, most likely s-expr. They are said to be combinators because you can compose parser procedures. The monadic part is an implementation detail mostly.
Say you want to parse some input text something like markdown and turn it into sxml.
Be aware that guile-parser-combinators (unlike guile-log) doesn't have an error handling machinery but since you compose small parser to make a bigger parser it's easy to debug the small units and have a working parser.
The first part of this article introduce guile-parser-combinators concepts and then apply them to build a csv parser.
Fire an REPL inside the created directory:
You are ready!
Three kinds of parsers
All parsers can return two kind of values ofa or a . have a which is the output of the parser and a which is what remains to be parsed. results have no value and no stream.
There is three kind of parsers in guile-parser-combinators. In the following I described them in the case where the input is strings and the final output is an s-expr:
plain parser:. This allows to recognize and split input string into small units. This is the most basic parser. Example plain parser is which will succeed if the input stream starts with a char, otherwise it fails. It will return as value and return the input stream queue.
combinator parser: control parsers but the literature says otheriwse. Example combinator parser is which will succeed if any of its input parser succeed. Which means in this case that the input stream must start with or or char.. Those takes as input a parser and output another parser. This might seem strange. I think it's better to call them
output builder:. Those are not really parser as they never consume the input stream but instead shake around value.
All those parsers are further explained in what follows.
I lied a little previously, I said that plain parser take as input string and output string. Actually it's just a (second hand) view to understand how parser works. The actual implementation takes as input a stream and outputs a.
For instance the following is a valid parser, that will return athat succeed when the first char in the stream is a :
You can test it with the following:
The only problem of this parser is that it doesn't handle the case where the input stream is empty. Nonetheless start to get the idea.
To parse achar using guile-parser-combinator, you have to use the procedure which returns a parser that succeed if the first item of the input stream is a and fails otherwise:
As you can see in result, 's value is the char. You can't check easily that consumed the input stream, you'll have to believe me for now.
Here's the implementation of, as you can see, it return a procedure:
There is various procedure that return plain parsers:
Let's try the last procedure:
There is alsowhich is not a procedure that returns a parser, but a plain parser. It will consume any char found in the input stream:
At this point, if things are still fuzzy, it's ok because you still don't know how given only control parser!, and plain parsers how you can parse something. Things will get more interesting once you know about
Combinator (control?) parsers are procedures similar in principle toexcept they take as input other parser and are tailored to compose them using semantic useful in the context of parsing.
Here is the full list of control parsers:
I guess now you get the point.
Output builder parsers are special parser that takes a parser as argument and process its output before returning. You can useor to do so.
For instance, here is an example use of:
As you can seereturns .
wrapping with a csv parser
csv is a notoriously difficult text file format to parse because it comes in much different flavor. In what follows we will describe a parser for somekind of csv format that is straightforward.
For our csv parser we want the following two unit tests to pass:
Otherwise said, a csv will be a multiline string with columns separated by semi-colon chars.
To be able to parse such a format, we need to introduce another parser combinator called. What it does is that it only execute if parser fails. This is useful because a lot of time they are control characters in the input stream that identifies the start or end of a given text unit. In the case of the csv parser there is two control chars: the semi-colon and newlines.
The implementation ofis simple! Here is it:
Basically what it does, is that it checks thatparser fails on input stream and in that case execute .
(Otherwise the big picture is that the caller, rewinds the stream until it can find a branch usingthat succeeds... otherwise said, parser combinators try every parser until it find a parser that succeed or it fails).
Anyway, this allows to check for the presence of control chars before parsing something. There is an similarand you might think that it's equally useful but in practice when parsing is much more useful. It has to do with the fact, that negation allows to capture much more logic that plain equality. Negation is very powerful.
Parsing a column
In csv a column looks likewhere the semi-colon marks the end of the column. So a column is made of its value in this case and its end marker the semi-colon. A such, a column value can be anything but a semi-colon. This is a first parser:
But this parser only parse a single char, let's repeat the same parser several times, one or more times using(we consider that everycolumn has at least one char value).
And there we have parsed the first column value "abc" but it's not properly packed as a string, to fix that we can usewhich will lift the 's value with :
We can make of this the first parser unit of our csv parser, as it really parse a single column value and properly packs it:
As you can see, there's still elements in the stream. The semi-colon control character is not parsed yet, but it's really part of the column. So what will do next is parse that control char and remove it fromusing . Try to guess how before reading what's follows.
So we need to chain ourwith a . But before that is a simple , we define it to make the parser more readable and redefine it terms of it because that's actually the real semantic of parsing a column value:
Check thatstill works as expected.
Like I said earlier we need to chainwith to build . For that we can use :
As you can see, now we have a list with two values in. We don't care about the semi-colon in the output. It's only present in the input to mark the end of column. So we away:
That's is it, we have the definition of:
As you might have guess we can repeat one or more timeto parse a single row:
Neat isn't it?
But this doesn't really every kind of row. Because there is two kind of rows:
The last row that ends with an
The other rows that ends with a newline
Let's define a:
Now we can (almost) define:
As you can see, there is some #t garbage in the's value. let's get rid of it using :
Wee!!! We have complete row! Guess what parser combinator you have to use to parse several rows...
Time is up:
Let's define a small csv document and try to parse it:
Ooops! There is a single list in the output; there's a bug in a parser! There is no such thing ascolumn. So there must be a bug in the parser... The solution is... to not parse columns starting with a newline! So let's use again:
Our final csv parser looks like the following:
There is a handyprocedure in guile-parser-combinators that allows to define the following procedure:
That's all folks!