Why I love Rust for tokenising and parsing

(xnacly.me)

133 points | by thunderbong 6 hours ago ago

48 comments

  • noelwelsh 2 hours ago

    This is, to me, an odd way to approach parsing. I get the impression the author is relatively inexperienced with Rust and the PL ideas it builds on.

    A few notes:

    * The AST would, I believe, be much simpler defined as algebraic data types. It's not like the sqlite grammar is going to randomly grow new nodes that require the extensibility their convoluted encoding provides. The encoding they use looks like what someone familiar with OO, but not with algebraic data types, would come up with. (A minimal sketch of the ADT approach follows these notes.)

    * "Macros work different in most languages. However they are used for mostly the same reasons: code deduplication and less repetition." That could be said for any abstraction mechanism. E.g. functions. The defining features of macros is they run at compile-time.

    * The work on parser combinators would be a good place to start to see how to structure parsing in a clean way.
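
    Here is that sketch: a hypothetical, stripped-down expression AST in Rust (not the article's actual node set, just the shape of the idea):

        // Hypothetical mini expression AST; a real SQL grammar has many more
        // node kinds, but each one is just another enum variant.
        enum Expr {
            Literal(Literal),
            Column(String),
            Unary { op: UnaryOp, operand: Box<Expr> },
            Binary { op: BinaryOp, lhs: Box<Expr>, rhs: Box<Expr> },
        }

        enum Literal {
            Null,
            Integer(i64),
            Text(String),
        }

        enum UnaryOp { Not, Negate }
        enum BinaryOp { Add, Sub, Eq, And, Or }

        // Consumers just pattern match; no visitors, trait objects, or downcasts.
        fn is_constant(expr: &Expr) -> bool {
            match expr {
                Expr::Literal(_) => true,
                Expr::Column(_) => false,
                Expr::Unary { operand, .. } => is_constant(operand),
                Expr::Binary { lhs, rhs, .. } => is_constant(lhs) && is_constant(rhs),
            }
        }

    The compiler's exhaustiveness checking then points at every match that needs updating if a node kind is ever added.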

    • huijzer 25 minutes ago

      > I get the impression the author is relatively inexperienced

      The author never claimed to be an experienced programmer. The title of the blog is "Why I love ...". Your notes look fair to me, but calling out inexperience is unnecessary IMO. I love it if someone loves programming. I think that's great. Experience will come.

      • guappa 2 minutes ago

        If someone didn't study the state of the art of tokenising and parsing and still wants to write about it, it's absolutely ok to call it out as being written by someone who has only a vague idea of what they're talking about.

  • gritzko an hour ago

    I have experience writing parsers (lexers) in Ragel, using Go, Java, C++, and C. I must say, once you have some boilerplate generator in place, raw C is as good as the Rust code the author describes. Maybe even better, because of its simplicity. For example, this is most of the code necessary for a JSON parser: https://github.com/gritzko/librdx/blob/master/JSON.lex

    In fact, that eBNF only produces the lexer. The parser part is not that impressive either, 120 LoC and quite repetitive https://github.com/gritzko/librdx/blob/master/JSON.c

    So, I believe, a parser infrastructure evolves till it only needs eBNF to make a parser. That is the saturation point.

    • dvdkon 36 minutes ago

      That repetitiveness can be seen as a downside, not a virtue. And I feel that Rust's ADTs make working with the resulting syntax tree much easier.

      Though I agree that a little code generation and/or macro magic can make C significantly more workable.

  • ryandv 4 hours ago

    I don't know. Having written a small parser [0] for Forsyth-Edwards chess notation [1], I find Haskell takes the cake here in terms of simplicity and legibility; it reads almost as clearly as BNF, and there is very little technical ceremony involved, letting you focus on the actual grammar of whatever it is you are trying to parse.

    [0] https://github.com/ryandv/chesskell/blob/master/src/Chess/Fa...

    [1] https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notati...

    • PittleyDunkin 3 hours ago

      Haskell definitely takes the cake in terms of leveraging parser combinators, but you’re still stuck with Haskell to deal with the result.

      • wesselbindt an hour ago

        That's what they call a "win-win".

      • pjmlp an hour ago

        For some of us, "being stuck with Haskell" isn't a problem.

        • anilakar 23 minutes ago

          For the rest, being stuck with real-world problems instead of self-inflicted ones is preferable :-)

      • fuzztester an hour ago

        $ echo "Haskell" | sed 's/ke/-ki/'

        Has-kill

        $

        • fuzztester an hour ago

          | sed 's/k/sk/'

          Has-skill

          $

          • orf 22 minutes ago

            Write the full transform in Haskell?

      • instig007 2 hours ago

        Don't make it sound as if it's bad; it's actually superb on all these levels: the type level, the SMP runtime, and throughput.

    • nine_k 3 hours ago

      But this is not unaided Haskell, it's a parser combinator library, isn't it?

      Do you see an obvious reason why a similar approach won't work in Rust? E.g. winnow [1] seems to offer declarative enough style, and there are several more parser combinator libraries in Rust.

      [1]: https://docs.rs/winnow/latest/winnow/

      • codebje 2 hours ago

            data Color = Color
                { r :: Word8
                , g :: Word8
                , b :: Word8
                } deriving Show
        
            hex_primary :: Parser Word8
            hex_primary = toWord8 <$> sat isHexDigit <*> sat isHexDigit
                where toWord8 a b = read ['0', 'x', a, b]
        
            hex_color :: Parser Color
            hex_color = do
                _ <- char '#'
                Color <$> hex_primary <*> hex_primary <*> hex_primary
        
        Sure, it works in Rust, but it's a pretty far cry from being as simple or legible - there's a lot of extra boilerplate in the Rust.

        • jeroenhd 43 minutes ago

          I think it's a stretch to call parser combinator code in Haskell simple or legible. Most Haskell code is simple and legible if you know enough Haskell to read it, but Haskell isn't exactly a simple or legible language.

          Haskell demonstrates the use of parser combinators very well, but I'd still use parser combinators in another language. Parser combinators are implemented in plenty of languages, including Rust, and actually doing anything with the parsed output becomes a lot easier once you leave the Haskell domain.

        • orf 20 minutes ago

          The nom crate has an RGB parser example: https://docs.rs/nom/latest/nom/#example

          It’s slightly longer, but more legible.
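
          From memory, that example looks roughly like the following (paraphrased, so treat the linked docs as canonical):

              use nom::{
                  bytes::complete::{tag, take_while_m_n},
                  combinator::map_res,
                  sequence::tuple,
                  IResult,
              };

              #[derive(Debug, PartialEq)]
              pub struct Color {
                  pub red: u8,
                  pub green: u8,
                  pub blue: u8,
              }

              fn from_hex(input: &str) -> Result<u8, std::num::ParseIntError> {
                  u8::from_str_radix(input, 16)
              }

              fn is_hex_digit(c: char) -> bool {
                  c.is_ascii_hexdigit()
              }

              // Exactly two hex digits, converted to a byte.
              fn hex_primary(input: &str) -> IResult<&str, u8> {
                  map_res(take_while_m_n(2, 2, is_hex_digit), from_hex)(input)
              }

              fn hex_color(input: &str) -> IResult<&str, Color> {
                  let (input, _) = tag("#")(input)?;
                  let (input, (red, green, blue)) =
                      tuple((hex_primary, hex_primary, hex_primary))(input)?;
                  Ok((input, Color { red, green, blue }))
              }

          Longer than the Haskell version, but each step is explicit and the IResult plumbing is uniform throughout.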

      • mrkeen an hour ago

        But it doesn't take much to go from 0 to a parser combinator library. I roll my own each year for advent of code. It starts at like 100 lines of code (which practically writes itself - very hard to stray outside of what the types enforce) and I grow it a bit over the month when I find missing niceties.
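
        As a rough illustration of how little is needed, a minimal sketch of the same idea in Rust (hypothetical, not any particular library):

            // A parser is just a function: input -> Option<(value, remaining input)>.
            type ParseResult<'a, T> = Option<(T, &'a str)>;

            // Consume one character that satisfies a predicate.
            fn satisfy<'a>(pred: impl Fn(char) -> bool) -> impl Fn(&'a str) -> ParseResult<'a, char> {
                move |input| {
                    let mut chars = input.chars();
                    match chars.next() {
                        Some(c) if pred(c) => Some((c, chars.as_str())),
                        _ => None,
                    }
                }
            }

            // Apply a parser zero or more times, collecting the results.
            fn many<'a, T>(
                p: impl Fn(&'a str) -> ParseResult<'a, T>,
            ) -> impl Fn(&'a str) -> ParseResult<'a, Vec<T>> {
                move |mut input| {
                    let mut out = Vec::new();
                    while let Some((v, rest)) = p(input) {
                        out.push(v);
                        input = rest;
                    }
                    Some((out, input))
                }
            }

            fn main() {
                let digits = many(satisfy(|c| c.is_ascii_digit()));
                assert_eq!(digits("123abc"), Some((vec!['1', '2', '3'], "abc")));
            }

        Alternation, sequencing, and mapping combinators follow the same pattern.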

    • lynx23 2 hours ago

      I wouldn't consider FEN a great parsing example, simply because it can be implemented in a simple function with a single loop.

      Just a few days ago, I wrote a FEN "parser" for an experimental quad-bitboard implementation. It almost wrote itself.

      P.S.: I am the author of chessIO on Hackage

  • tptacek 5 hours ago

    So, just to kick this off: I wrote an eBPF disassembler and (half-hearted) emulator in Rust and I also found it a pleasant language to do parsing-type stuff in. But: I think the author cuts against their argument when they manage to necessitate a macro less than 1/6th of the way into their case study. A macro isn't quite code-gen, but it also doesn't quite feel like working idiomatically within the language, either.

    Again: not throwing shade. I think this is a place where Rust is genuinely quite strong.

    • thesz 4 hours ago

      How can one define an infinite grammar in Rust?

      E.g., a context-free rule S ::= abc|aabbcc|aaabbbccc|... with infinitely many alternatives can effectively parse a^N b^N c^N, which is a classic example of a context-sensitive language.

      This is a simple example, but something like that can be seen in practice. One example is when a language allows the definition of operators.

      So, how does Rust handle that?

      • jeroenhd 12 minutes ago

        Using parser combinator library "nom", this should probably do what you'd want:

            use nom::{character::complete::char, multi::many_m_n, sequence::tuple, IResult};

            fn parse_abc(input: &str, n: usize) -> IResult<&str, (Vec<char>, Vec<char>, Vec<char>)> {
                // Exactly n 'a's, then n 'b's, then n 'c's.
                let (input, result) = tuple((
                    many_m_n(n, n, char('a')),
                    many_m_n(n, n, char('b')),
                    many_m_n(n, n, char('c')),
                ))(input)?;
                Ok((input, result))
            }

        
        It parses (the beginning of) the input, ensuring `n` repetitions of 'a', 'b', and 'c'. Parse errors are reported through the return type, and the remaining characters are returned for the application to deal with as it sees fit.

        https://play.rust-lang.org/?version=stable&mode=debug&editio...

      • ryandv 3 hours ago

        In Haskell I think it's something like:

            {-# LANGUAGE OverloadedStrings #-}
            import Data.Attoparsec.Text
            import qualified Data.Text as T
        
            type ParseError = String
        
            csgParse :: T.Text -> Either ParseError Int
            csgParse = eitherResult . parse parser where
                parser = do
                    as <- many' $ char 'a'
                    let n = length as
                    count n $ char 'b'
                    count n $ char 'c'
                    char '\n'
                    return n
        
            ghci> csgParse "aaabbbccc\n"
            Right 3

    • jamra 5 hours ago

      Link us your eBPF disassembler if you can. Sounds cool.

      • tptacek 5 hours ago

        It's not. If you wrote one, it'd be more interesting than mine.

  • brundolf 3 hours ago

    Something that was hard when I wrote a full AST parser in Rust was representing a hierarchy of concrete AST types, with upcasting and downcasting. I was able to figure out a way, but it required some really weird type shenanigans (e.g. PhantomData) and some macros. Looks like they had to do crazy macros here too.

    Curious what the rest of the prior art looks like.

    • elcritch 3 hours ago

      Hmmm, yeah, Rust's ADTs and match syntax would be great, until you got to the up/down casting. I'm not experienced enough in Rust to know if there are good ways to handle it. Dynamic traits (trait objects), maybe?
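
      For what it's worth, one way trait objects can give you downcasting is to route it through std::any::Any; a rough sketch with hypothetical node types (not the parent comment's code):

          use std::any::Any;

          // Every node exposes itself as &dyn Any so callers can try concrete types.
          trait Node: Any {
              fn as_any(&self) -> &dyn Any;
          }

          struct Ident { name: String }
          struct IntLit { value: i64 }

          impl Node for Ident {
              fn as_any(&self) -> &dyn Any { self }
          }
          impl Node for IntLit {
              fn as_any(&self) -> &dyn Any { self }
          }

          // "Downcast" from the abstract node back to a concrete type.
          fn describe(node: &dyn Node) -> String {
              if let Some(ident) = node.as_any().downcast_ref::<Ident>() {
                  format!("identifier {}", ident.name)
              } else if let Some(lit) = node.as_any().downcast_ref::<IntLit>() {
                  format!("integer literal {}", lit.value)
              } else {
                  "unknown node".to_string()
              }
          }

          fn main() {
              let nodes: Vec<Box<dyn Node>> = vec![
                  Box::new(Ident { name: "x".into() }),
                  Box::new(IntLit { value: 42 }),
              ];
              for node in &nodes {
                  println!("{}", describe(node.as_ref()));
              }
          }

      An enum per syntactic category usually ends up nicer in Rust, but this is roughly what "dynamic traits" buys you.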

    • ainiriand 2 hours ago

      Sorry to bother you, but would that be open-source by any chance? Is there any public repo available? Thank you.

  • hu3 5 hours ago

    Related: I love Rob Pike's talk "Lexical Scanning in Go" (2011).

    Educational and elegant approach.

    https://www.youtube.com/watch?v=HxaD_trXwRE

    • emmanueloga_ 4 hours ago

      That talk is great, but I remember some discussion later about Go actually NOT using this technique because of goroutine scheduling overhead and/or inefficient memory allocation patterns? The best discussion I could find is [1].

      Another great talk about making efficient lexers and parsers is Andrew Kelley's "Practical Data Oriented Design" [2]. Summary: it explains various strategies one can use to reduce the memory footprint of programs while also making them cache friendly, which increases throughput.

      --

      1: https://news.ycombinator.com/item?id=31649617

      2: https://www.youtube.com/watch?v=IroPQ150F6c

      • chubot 3 hours ago

        Yeah, I actually remember that too; this article mentions it:

        Coroutines for Go - https://research.swtch.com/coro

        The parallelism provided by the goroutines caused races and eventually led to abandoning the design in favor of the lexer storing state in an object, which was a more faithful simulation of a coroutine. Proper coroutines would have avoided the races and been more efficient than goroutines.

    • tptacek 3 hours ago

      I feel like that talk has more to do with expressing concurrency, in problems where concurrency is a natural thing to think about, than it does with lexing.

  • ketzo 4 hours ago

    So how do you debug code written with macros like this, or come into it as a new user of the codebase?

    I’m imagining seeing the node! macro used, and seeing the macro definition, but still having a tough time knowing exactly what code is produced.

    Do I just use the Example and see what type hints I get from it? Can I hover over it in my IDE and see an expanded version? Do I need to reference the compiled code to be sure?

    (I do all my work in JS/TS so I don’t touch any macros; just curious about the workflow here!)

    • schneems 4 hours ago

      Run:

          $ cargo expand
      
      And you’ll see the resulting code.

      Rust is really several languages: "vanilla" Rust, declarative macros, and proc macros. Each has a slightly different capability set and dialect. You get used to working with each in turn over time.

      Also, unit tests are generally a good playground for understanding the impact of modifying a macro.
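
      For example, a toy declarative macro plus a test makes a convenient sandbox (a hypothetical macro, not the article's node! macro):

          // Hypothetical toy macro: generates a plain struct from a field list.
          macro_rules! ast_struct {
              ($name:ident { $($field:ident : $ty:ty),* $(,)? }) => {
                  #[derive(Debug, PartialEq)]
                  pub struct $name {
                      $(pub $field: $ty,)*
                  }
              };
          }

          ast_struct!(Literal { value: i64 });

          #[cfg(test)]
          mod tests {
              use super::*;

              // Tweak the macro, then re-run `cargo test` or `cargo expand` to see the effect.
              #[test]
              fn generates_a_usable_struct() {
                  assert_eq!(Literal { value: 42 }, Literal { value: 42 });
              }
          }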

    • guitarbill 3 hours ago

      rust-analyzer, the Rust LSP used in e.g. VSCode, can expand declarative and proc macros recursively.

      it isn't too bad, although the fewer proc macros in a code base, the better. declarative macros are slightly easier to grok, but much easier to maintain and test. (i feel the same way about opaque codegen in other languages.)

  • WiSaGaN 4 hours ago

    I think that, except for macros, most of these features are ML-family language features as well. Rust stands out because it can implement them in an efficient, zero-overhead way.

  • kldx 4 hours ago

    I find Megaparsec in Haskell quite expressive, based on my limited experience using nom in Rust.

  • jurschreuder 2 hours ago

    I cannot agree less, C++ is the best and always will be. You youngsters made up this new dialect that can also compile with the C++ compiler. This is like people putting VS Code in dark mode thinking they're now also working in the Terminal like the Gods of Binary.

    • arlort 2 hours ago

      Rust being a dialect of c++ is certainly a novel take

  • sksxihve 5 hours ago

    I've found that the logos crate is really nice for writing lexers in Rust.

    https://docs.rs/logos/0.14.2/logos/
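
    A rough sketch of what a logos lexer looks like, from memory (check the linked docs for the exact attribute syntax):

        use logos::Logos;

        #[derive(Logos, Debug, PartialEq)]
        #[logos(skip r"[ \t\r\n]+")] // skip whitespace between tokens
        enum Token {
            #[token("(")]
            LParen,
            #[token(")")]
            RParen,
            // The callback turns the matched slice into the token's payload.
            #[regex(r"[0-9]+", |lex| lex.slice().parse::<i64>().ok())]
            Number(i64),
            #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
            Ident,
        }

        fn main() {
            let mut lexer = Token::lexer("(foo 42)");
            while let Some(token) = lexer.next() {
                println!("{:?} -> {:?}", lexer.slice(), token);
            }
        }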

  • omani 3 hours ago

    this is the third day in a row this article is being posted here.

    this time it got traction. funny how HN works.

    https://news.ycombinator.com/item?id=42055954

    https://news.ycombinator.com/item?id=42058920

  • jamra 5 hours ago

    Does anyone have a good EBNF notation for SQLite? I tried to make a tree-sitter grammar, which produces C code and great Rust bindings for it, but SQLite uses the Lemon parser generator and I'm not sure how to read the grammar from that.