I love regular expression derivatives. One neat thing about regular expression derivatives is they are continuation-passing style for regular expressions. The derivative is "what to do next" after seeing a character, which is the continuation of the re. It's a nice conceptual connection if you're into programming language theory.
Low-key hate the lack of capitalization on the blog, which made me stumble over every sentence start. Great blog post a bit marred by unnecessary divergence from standard written English.
The author mentions that they found Mamouras et al. (POPL 2024), but not the associated implementation. While the Rust implementation is not public, a Haskell implementation can be found here: https://github.com/Agnishom/lregex
Worth mentioning (haven’t checked if the paper talks about this) that while the industry mostly forgot about derivatives and extended REs (i.e. REs with intersection), academia did not. Unfortunately, there have been some pretty discouraging results: the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression[1], not just exponential as for normal REs. So there is a potential reason not to support intersections in one’s RE matcher, even if they are enticingly easy to implement in terms of derivatives (and even if I personally would like some experimentation in this direction).
> how does RE# find the leftmost-longest match efficiently? remember the bidirectional scanning we mentioned earlier - run the DFA right to left to find all possible match starts, then run a reversed DFA left to right to find the ends. the leftmost start paired with the rightmost end gives you leftmost-longest. two linear DFA scans, no backtracking, no ambiguity.
I'm pretty sure that should say "the leftmost start paired with the leftmost end".
This also implies that the algorithm has to scan the entire input to find the first match, and the article goes on to confirm this. So the algorithm is a poor choice if you just want the first match in a very long text. But if you want all matches it is very good.
Back in the Usenet days, questions came up all the time about matching substrings that do not contain whatever. It’s technically possible without an explicit NOT operator because regular languages are closed under complement — along with union, intersection, Kleene star, etc. — but a bear to get right by hand for even simple cases.
Unbounded lookarounds without performance penalty at search time are an exciting feature too.
Management has always behaved as if they repent having added F# to VS 2010, at least it hasn't yet suffered the same stagnation as VB, even C++/CLI was updated to C++20 (minus modules).
In any case, those of us that don't have issues with .NET, or Java (also cool to hate these days), get to play with F# and Scala, and feel no need to be amazed with Rust's type system inherited from ML languages.
It is yet another "Rust but with GC" that every couple of months pops up in some forums.
This is very interesting. I'm a bit skeptical about the benchmarks / performance claims because they seem almost too good to be true but even just the extended operators alone are a nice improvement over existing regex engines.
The post mentions they also have a native library implemented in Rust without dependencies but I couldn't find a link to it. Is that available somewhere? I would love to try it out in some of my projects but I don't use .NET so the NuGET package is of no use to me.
I will open source the rust engine soon as well, some time this month.
As for the benchmarks, it's the fastest for large patterns and lookarounds, where leftmost-longest lets you get away with less memory usage so we don't need to transition from DFA to NFA.
In the github readme benchmarks it's faster than the exponential implementations of .NET Compiled so the 35 000x can be an arbitrary multiplier, you can keep adding alternatives and make it 1000000x.
for simple string literals it will definitely lose to Hyperscan and Rust regex since they have a high effort left-to-right SIMD algorithm that we cannot easily use.
Yes, that's exactly what we did to be competitive in the benchmarks.
There's a lot of simple cases where you don't really need a regex engine at all.
integrating SearchValues as a multi-string prefix search is a bit harder since it doesn't expose which branch matched so we would be taking unnecessary steps.
Also .NET implementation of Hyperscan's Teddy algorithm only goes left to right.. if it went right to left it would make RE# much faster for these cases.
I second this request. It would be wonderful to be able to test the rust implementation since it easier to call from other languages in my typical setup. I have a couple uses cases I have never fully resolved, just implemented partial work arounds and accepted a restricted feature set. This would probably allow me to deal with them correctly.
F# is one of the biggest 'What could have beens'. Great language, that just didn't hit the right time, or reach critical mass of the gestalt of the community.
I love regular expression derivatives. One neat thing about regular expression derivatives is they are continuation-passing style for regular expressions. The derivative is "what to do next" after seeing a character, which is the continuation of the re. It's a nice conceptual connection if you're into programming language theory.
Low-key hate the lack of capitalization on the blog, which made me stumble over every sentence start. Great blog post a bit marred by unnecessary divergence from standard written English.
in what is it different ?
The author mentions that they found Mamouras et al. (POPL 2024), but not the associated implementation. While the Rust implementation is not public, a Haskell implementation can be found here: https://github.com/Agnishom/lregex
Finally, an article about humans programming some computers. Thank you!
Worth mentioning (haven’t checked if the paper talks about this) that while the industry mostly forgot about derivatives and extended REs (i.e. REs with intersection), academia did not. Unfortunately, there have been some pretty discouraging results: the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression[1], not just exponential as for normal REs. So there is a potential reason not to support intersections in one’s RE matcher, even if they are enticingly easy to implement in terms of derivatives (and even if I personally would like some experimentation in this direction).
[1] https://arxiv.org/abs/0802.2869
This is very impressive.
> how does RE# find the leftmost-longest match efficiently? remember the bidirectional scanning we mentioned earlier - run the DFA right to left to find all possible match starts, then run a reversed DFA left to right to find the ends. the leftmost start paired with the rightmost end gives you leftmost-longest. two linear DFA scans, no backtracking, no ambiguity.
I'm pretty sure that should say "the leftmost start paired with the leftmost end".
This also implies that the algorithm has to scan the entire input to find the first match, and the article goes on to confirm this. So the algorithm is a poor choice if you just want the first match in a very long text. But if you want all matches it is very good.
> I'm pretty sure that should say "the leftmost start paired with the leftmost end".
I’m pretty sure it shouldn’t, that would give you the leftmost shortest match instead of leftmost longest.
As originally written, doesn't it go from the start of the first match to the end of the last match? I feel like I'm missing something.
It goes from start of the first match to the longest "alive" end, in practice it will go to a dead state and return after finding the match end.
there's an implicit `.*` in front of the first pass but i felt it would've been a long tangent so i didn't want to get into it.
so given input 'aabbcc' and pattern `b+`,
first reverse pass (using `.*b+`) marks 'aa|b|bcc'<-
the forward pass starts from the first match:
'aa->b|b|cc' marking 2 ends
then enters a dead state after the first 'c' and returns the longest end: aa|bb|cc
i hope this explains it better
That’s beautiful work. Check out other examples in the interactive web app:
https://ieviev.github.io/resharp-webapp/
Back in the Usenet days, questions came up all the time about matching substrings that do not contain whatever. It’s technically possible without an explicit NOT operator because regular languages are closed under complement — along with union, intersection, Kleene star, etc. — but a bear to get right by hand for even simple cases.
Unbounded lookarounds without performance penalty at search time are an exciting feature too.
I've had nothing but great experience with F#. If it wasn't associated with Microsoft, it'd be more popular than haskell
I think if it weren't a 'first class' member of .NET ecosystem[0], no one would know F#. After all Haskell and Ocaml already exist.
[0]: my very charitable take, as MS obviously cares C# much much more than F#.
Management has always behaved as if they repent having added F# to VS 2010, at least it hasn't yet suffered the same stagnation as VB, even C++/CLI was updated to C++20 (minus modules).
In any case, those of us that don't have issues with .NET, or Java (also cool to hate these days), get to play with F# and Scala, and feel no need to be amazed with Rust's type system inherited from ML languages.
It is yet another "Rust but with GC" that every couple of months pops up in some forums.
> ...it'd be more popular than haskell
https://steve-yegge.blogspot.com/2010/12/haskell-researchers...
> But they all just skip the press releases and go straight to the not using it part
Lol
Did they ever get the full extra person who gives a shit?
You realize that Microsoft Research employed Simon for many many years?
Fantastic stuff!
FYI some code snippets are unreadable in 'light mode' ("what substrings does the regex (a|ab)+ match in the following input?")
ah thank you for letting me know, fixed it now!
This is very interesting. I'm a bit skeptical about the benchmarks / performance claims because they seem almost too good to be true but even just the extended operators alone are a nice improvement over existing regex engines.
The post mentions they also have a native library implemented in Rust without dependencies but I couldn't find a link to it. Is that available somewhere? I would love to try it out in some of my projects but I don't use .NET so the NuGET package is of no use to me.
There's currently only a string solver with the same core library, but not a full regex engine https://github.com/ieviev/cav25-resharp-smt
I will open source the rust engine soon as well, some time this month.
As for the benchmarks, it's the fastest for large patterns and lookarounds, where leftmost-longest lets you get away with less memory usage so we don't need to transition from DFA to NFA.
In the github readme benchmarks it's faster than the exponential implementations of .NET Compiled so the 35 000x can be an arbitrary multiplier, you can keep adding alternatives and make it 1000000x.
for simple string literals it will definitely lose to Hyperscan and Rust regex since they have a high effort left-to-right SIMD algorithm that we cannot easily use.
Would SearchValues<char> help there for a fallback to a SIMD optimized simple string literal search rather than the happy path?
Yes, that's exactly what we did to be competitive in the benchmarks.
There's a lot of simple cases where you don't really need a regex engine at all.
integrating SearchValues as a multi-string prefix search is a bit harder since it doesn't expose which branch matched so we would be taking unnecessary steps.
Also .NET implementation of Hyperscan's Teddy algorithm only goes left to right.. if it went right to left it would make RE# much faster for these cases.
I second this request. It would be wonderful to be able to test the rust implementation since it easier to call from other languages in my typical setup. I have a couple uses cases I have never fully resolved, just implemented partial work arounds and accepted a restricted feature set. This would probably allow me to deal with them correctly.
You got me at TalTech. Great job and the paper is high quality. I'll have to learn F# but I believe it is worth it.
F# is one of the biggest 'What could have beens'. Great language, that just didn't hit the right time, or reach critical mass of the gestalt of the community.
@burnsushi is that true?
Tentative doubt until/if he confirms. (but specifically in F# though)
Obligatory: Hitler reacts to functional programming
https://m.youtube.com/watch?v=ADqLBc1vFwI