This is, to me, an odd way to approach parsing. I get the impression the author is relatively inexperienced with Rust and the PL ideas it builds on.
A few notes:
* The AST would, I believe, be much simpler defined as an algebraic data type. It's not like the sqlite grammar is going to randomly grow new nodes that require the extensibility their convoluted encoding provides. The encoding they use looks like what someone familiar with OO, but not with algebraic data types, would come up with (see the sketch after these notes).
* "Macros work different in most languages. However they are used for mostly the same reasons: code deduplication and less repetition." That could be said for any abstraction mechanism, e.g. functions. The defining feature of macros is that they run at compile time.
* The work on parser combinators would be a good place to start to see how to structure parsing in a clean way.
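For what it's worth, here is a minimal sketch (illustrative types of my own, not the article's actual AST) of the ADT-style encoding meant in the first note:

    #![allow(dead_code)] // sketch only

    enum Expr {
        Literal(i64),
        Column(String),
        Binary { op: BinOp, lhs: Box<Expr>, rhs: Box<Expr> },
    }

    enum BinOp { Eq, And }

    enum Stmt {
        Select { columns: Vec<Expr>, from: String, filter: Option<Expr> },
        // ... further statement kinds become further variants
    }

    fn describe(stmt: &Stmt) -> String {
        // Exhaustive pattern matching replaces the downcasting/extensibility machinery.
        match stmt {
            Stmt::Select { from, .. } => format!("SELECT ... FROM {from}"),
        }
    }

    fn main() {
        let stmt = Stmt::Select {
            columns: vec![Expr::Column("name".into())],
            from: "users".into(),
            filter: None,
        };
        println!("{}", describe(&stmt));
    }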
> I get the impression the author is relatively inexperienced
The author never claimed to be an experienced programmer. The title of the blog is "Why I love ...". Your notes look fair to me, but calling out inexperience is unnecessary IMO. I love it if someone loves programming. I think that's great. Experience will come.
If someone didn't study the state of the art of tokenising and parsing and still wants to write about it, it's absolutely ok to call it out as being written by someone who has only a vague idea of what they're talking about.
I have the experience of writing parsers (lexers) in Ragel, using Go, Java, C++, and C. I must say, once you have some boilerplate generator in place, raw C is as good as the Rust code the author describes. Maybe even better, because of its simplicity. For example, this is most of the code necessary for a JSON parser: https://github.com/gritzko/librdx/blob/master/JSON.lex
In fact, that eBNF only produces the lexer. The parser part is not that impressive either, 120 LoC and quite repetitive https://github.com/gritzko/librdx/blob/master/JSON.c
So, I believe, a parser infrastructure evolves till it only needs eBNF to make a parser. That is the saturation point.
That repetitiveness can be seen as a downside, not a virtue. And I feel that Rust's ADTs make working with the resulting syntax tree much easier.
Though I agree that a little code generation and/or macro magic can make C significantly more workable.
I don't know. Having written a small parser [0] for Forsyth-Edwards chess notation [1], I think Haskell takes the cake here in terms of simplicity and legibility; it reads almost as clearly as BNF, and there is very little technical ceremony involved, letting you focus on the actual grammar of whatever it is you are trying to parse.
[0] https://github.com/ryandv/chesskell/blob/master/src/Chess/Fa...
[1] https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notati...
Haskell definitely takes the cake in terms of leveraging parser combinators, but you’re still stuck with Haskell to deal with the result.
That's what they call a "win-win".
For some of us, "being stuck with Haskell" isn't a problem.
For the rest, being stuck with real-world problems instead of self-inflicted ones is preferable :-)
$ echo "Haskell" | sed 's/ke/-ki'
Has-kill
$
| sed 's/k/sk/'
Has-skill
$
Write the full transform in Haskell?
Don't make it sound as if it's bad; it's actually superb on all these levels: the type level, the SMP runtime, and throughput.
But this is not unaided Haskell, it's a parser combinator library, isn't it?
Do you see an obvious reason why a similar approach won't work in Rust? E.g. winnow [1] seems to offer declarative enough style, and there are several more parser combinator libraries in Rust.
[1]: https://docs.rs/winnow/latest/winnow/
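    data Color = Color
      { r :: Word8
      , g :: Word8
      , b :: Word8
      } deriving Show

    hex_primary :: Parser Word8
    hex_primary = toWord8 <$> sat isHexDigit <*> sat isHexDigit
      where toWord8 a b = read ['0', 'x', a, b]

    hex_color :: Parser Color
    hex_color = do
      _ <- char '#'
      Color <$> hex_primary <*> hex_primary <*> hex_primary

Sure, it works in Rust, but it's a pretty far cry from being as simple or legible - there's a lot of extra boilerplate in the Rust.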
I think it's a stretch to call parser combinator code in Haskell simple or legible. Most Haskell code is simple and legible if you know enough Haskell to read it, but Haskell isn't exactly a simple or legible language.
Haskell demonstrates the use of parser combinators very well, but I'd still use parser combinators in another language. Parser combinators are implemented in plenty of languages, including Rust, and actually doing anything with the parsed output becomes a lot easier once you leave the Haskell domain.
The nom crate has an RGB parser example: https://docs.rs/nom/latest/nom/#example
It’s slightly longer, but more legible.
But it doesn't take much to go from 0 to a parser combinator library. I roll my own each year for advent of code. It starts at like 100 lines of code (which practically writes itself - very hard to stray outside of what the types enforce) and I grow it a bit over the month when I find missing niceties.
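As a rough illustration of how little it takes, here is a toy sketch of that starting point in Rust (my own code, not the commenter's library): a parser is just a function from input to an optional (remaining input, value) pair, and combinators are functions that build new parsers out of existing ones.

    // A parser takes a &str and, on success, returns the unconsumed rest plus a value.
    type Parser<'a, T> = Box<dyn Fn(&'a str) -> Option<(&'a str, T)> + 'a>;

    // Match one expected character.
    fn literal<'a>(expected: char) -> Parser<'a, char> {
        Box::new(move |input: &'a str| {
            let mut chars = input.chars();
            match chars.next() {
                Some(c) if c == expected => Some((chars.as_str(), c)),
                _ => None,
            }
        })
    }

    // Apply a parser zero or more times, collecting the results.
    // (Assumes `p` consumes input on success, otherwise this would loop forever.)
    fn many<'a, T: 'a>(p: Parser<'a, T>) -> Parser<'a, Vec<T>> {
        Box::new(move |mut input: &'a str| {
            let mut out = Vec::new();
            while let Some((rest, value)) = p(input) {
                input = rest;
                out.push(value);
            }
            Some((input, out))
        })
    }

    fn main() {
        let many_a = many(literal('a'));
        // Parses the leading run of 'a's and hands the rest back to the caller.
        assert_eq!(many_a("aaab"), Some(("b", vec!['a', 'a', 'a'])));
    }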
I wouldn't consider FEN a great parsing example, simply because it can be implemented in a simple function with a single loop.
Just a few days ago, I wrote a FEN "parser" for an experimental quad-bitboard implementation. It almost wrote itself.
P.S.: I am the author of chessIO on Hackage
So, just to kick this off: I wrote an eBPF disassembler and (half-hearted) emulator in Rust and I also found it a pleasant language to do parsing-type stuff in. But: I think the author cuts against their argument when they manage to necessitate a macro less than 1/6th of the way into their case study. A macro isn't quite code-gen, but it also doesn't quite feel like working idiomatically within the language, either.
Again: not throwing shade. I think this is a place where Rust is genuinely quite strong.
How can one define an infinite grammar in Rust?
E.g., a context-free rule S ::= abc|aabbcc|aaabbbccc|... can effectively parse a^N b^N c^N, which is the classic example of a context-sensitive language.
This is a simple example, but something like it does come up in practice, for example when a language allows the definition of new operators.
So, how does Rust handle that?
Using parser combinator library "nom", this should probably do what you'd want:
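    // (imports assumed for nom 7)
    use nom::IResult;
    use nom::character::complete::char;
    use nom::multi::many_m_n;
    use nom::sequence::tuple;

    fn parse_abc(input: &str, n: usize) -> IResult<&str, (Vec<char>, Vec<char>, Vec<char>)> {
        let (input, result) = tuple((
            many_m_n(n, n, char('a')),
            many_m_n(n, n, char('b')),
            many_m_n(n, n, char('c')),
        ))(input)?;
        Ok((input, result))
    }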
It parses (the beginning of) the input, ensuring `n` repetitions of 'a', 'b', and 'c'. Parse errors are reported through the return type, and the remaining characters are returned for the application to deal with as it sees fit.
https://play.rust-lang.org/?version=stable&mode=debug&editio...
In Haskell I think it's something like:
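    {-# LANGUAGE OverloadedStrings #-}
    import Data.Attoparsec.Text
    import qualified Data.Text as T

    type ParseError = String

    csgParse :: T.Text -> Either ParseError Int
    csgParse = eitherResult . parse parser where
      parser = do
        as <- many' $ char 'a'
        let n = length as
        count n $ char 'b'
        count n $ char 'c'
        char '\n'
        return n

    ghci> csgParse "aaabbbccc\n"
    Right 3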
Link us your eBPF disassembler if you can. Sounds cool.
It's not. If you wrote one, it'd be more interesting than mine.
Something that was hard when I wrote a full AST parser in Rust was representing a hierarchy of concrete AST types, with upcasting and downcasting. I was able to figure out a way, but it required some really weird type shenanigans (eg PhantomData) and some macros. Looks like they had to do crazy macros here too
Curious what the rest of the prior art looks like
Hmmm, yeah Rust’s ADTs and matching syntax would be great. Until you got to the up/down casting. I’m not experienced enough in Rust to know whether there are good ways to handle it. Dynamic traits maybe?
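For what it's worth, one common way to get that kind of downcasting with trait objects (a sketch of my own, not the article's approach) is to route through std::any::Any:

    use std::any::Any;

    // Every AST node exposes itself as &dyn Any so callers can downcast.
    trait AstNode: Any {
        fn as_any(&self) -> &dyn Any;
    }

    struct Literal { value: i64 }
    struct BinaryExpr { op: char, lhs: Box<dyn AstNode>, rhs: Box<dyn AstNode> }

    impl AstNode for Literal {
        fn as_any(&self) -> &dyn Any { self }
    }
    impl AstNode for BinaryExpr {
        fn as_any(&self) -> &dyn Any { self }
    }

    fn main() {
        // "Upcast" by storing concrete nodes behind the trait object...
        let node: Box<dyn AstNode> = Box::new(BinaryExpr {
            op: '+',
            lhs: Box::new(Literal { value: 1 }),
            rhs: Box::new(Literal { value: 2 }),
        });
        // ...and downcast back to the concrete type when needed.
        if let Some(bin) = node.as_any().downcast_ref::<BinaryExpr>() {
            let lhs = bin.lhs.as_any().downcast_ref::<Literal>();
            println!("op {} with lhs {:?}", bin.op, lhs.map(|l| l.value));
        }
    }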
Sorry to bother you, but would that be open-source by any chance? Is there any public repo available? Thank you.
Maybe it can work as a quick glimpse into how a parser and lexer can work in Rust: https://github.com/jmaczan/0x6b73746b
I wrote it a long time ago and it's not fully implemented, though.
Related, I love Rob Pike's talk about lexical Scanning in Go (2011).
Educational and elegant approach.
https://www.youtube.com/watch?v=HxaD_trXwRE
That talk is great, but I remember some discussion later about Go actually NOT using this technique because of goroutine scheduling overhead and/or inefficient memory allocation patterns? The best discussion I could find is [1].
Another great talk about making efficient lexers and parsers is Andrew Kelley's "Practical Data Oriented Design" [2]. Summary: "it explains various strategies one can use to reduce memory footprint of programs while also making the program cache friendly which increase throughput".
--
1: https://news.ycombinator.com/item?id=31649617
2: https://www.youtube.com/watch?v=IroPQ150F6c
Yeah I actually remember that too, this article mentions it:
Coroutines for Go - https://research.swtch.com/coro
The parallelism provided by the goroutines caused races and eventually led to abandoning the design in favor of the lexer storing state in an object, which was a more faithful simulation of a coroutine. Proper coroutines would have avoided the races and been more efficient than goroutines.
I feel like that talk has more to do with expressing concurrency, in problems where concurrency is a natural thing to think about, than it does with lexing.
So how do you debug code written with macros like this, or come into it as a new user of the codebase?
I’m imagining seeing the node! macro used, and seeing the macro definition, but still having a tough time knowing exactly what code is produced.
Do I just use the Example and see what type hints I get from it? Can I hover over it in my IDE and see an expanded version? Do I need to reference the compiled code to be sure?
(I do all my work in JS/TS so I don’t touch any macros; just curious about the workflow here!)
Run `cargo expand` and you'll see the resulting code.
Rust is really several languages: "vanilla" Rust, declarative macros, and proc macros. Each has a slightly different capability set and a different dialect. You get used to working with each in turn over time.
Also, unit tests are generally a good playground for understanding the impact of modifying a macro.
rust-analyzer, the Rust LSP used in e.g. VSCode, can expand declarative and proc macros recursively.
it isn't too bad, although the fewer proc macros in a code base, the better. declarative macros are slightly easier to grok, but much easier to maintain and test. (i feel the same way about opaque codegen in other languages.)
I think, macros aside, most of these features are ML-family language features as well. Rust stands out because it can implement them in an efficient, zero-overhead way.
I find Megaparsec in Haskell quite expressive, compared with my limited experience using nom in Rust.
I cannot agree less, C++ is the best and always will be. You youngsters made up this new dialect that can also compile with the C++ compiler. This is like people putting VS Code in dark mode thinking they're now also working in the Terminal like the Gods of Binary.
Rust being a dialect of C++ is certainly a novel take.
I've found that the logos crate is really nice for writing lexers in Rust:
https://docs.rs/logos/0.14.2/logos/
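A hypothetical toy example of what that can look like (my sketch, assuming logos 0.13+, where the derive plus the `skip`, `token`, and `regex` attributes work as below):

    use logos::Logos;

    #[derive(Logos, Debug, PartialEq)]
    #[logos(skip r"[ \t\r\n]+")] // skip whitespace between tokens
    enum Token {
        #[token("select")]
        Select,
        #[token("*")]
        Star,
        #[token("from")]
        From,
        #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
        Ident,
    }

    fn main() {
        let mut lex = Token::lexer("select * from users");
        // next() yields a Result per token; slice() is the matched text.
        while let Some(token) = lex.next() {
            println!("{:?}: {:?}", token, lex.slice());
        }
    }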
this is the third day in a row this article is being posted here.
this time it got traction. funny how HN works.
https://news.ycombinator.com/item?id=42055954
https://news.ycombinator.com/item?id=42058920
Does anyone have a good EBNF notation for SQLite? I tried to make a tree-sitter grammar, which produces C code and great Rust bindings. But SQLite uses the Lemon parser generator, and I'm not sure how to read the grammar from that.
The Lemon tool used by SQLite can output the grammar as an SQL database that you can manipulate. There is also https://github.com/ricomariani/CG-SQL-author, which goes way beyond that (you'd need to create the Rust generation yourself); you can play with it here with a Lua backend: https://mingodad.github.io/CG-SQL-Lua-playground/
Also, I'm collecting several LALR(1) grammars at https://mingodad.github.io/parsertl-playground/playground/ which is a Yacc/Lex-compatible online editor/interpreter that can generate EBNF for railroad diagrams, SQL, and C++ from the grammars. Select "SQLite3 parser (partially working)" from "Examples", then click "Parse" to see the parse tree for the content in "Input source".
I also created https://mingodad.github.io/plgh/json2ebnf.html to have a unified view of tree-sitter grammars, and https://mingodad.github.io/lua-wasm-playground/ where there is a Lua script to generate an alternative EBNF for writing tree-sitter grammars that can later be converted to the standard "grammar.js".
Not EBNF or anything standard, but possibly readable enough. It is an LR(1) grammar that was tested against all the test cases in SQLite's test suite at the time:
https://lrparsing.sourceforge.net/doc/examples/lrparsing-sql...
The grammar contains things you won't have seen before, like Prio(). Think of them as macros: it all gets translated to plain LR(1) productions, which you can ask it to print out, and those productions are simpler than EBNF.
Documentation on what the macros do, and how to get it to spit out the LR(1) productions, is here: https://lrparsing.sourceforge.net/doc/html/
It was used to do a task similar to the one the OP is attempting.
Perhaps this ANTLR v4 sqlite grammar? [1]
--
1: https://github.com/antlr/grammars-v4/tree/master/sql/sqlite
It looks pretty much like BNF. Not too far off, anyway. https://sqlite.org/src/doc/trunk/doc/lemon.html#syntax