Regular Expressions which query an Oracle

(arxiv.org)

28 points | by agnishom a day ago ago

11 comments

  • shayanjm 5 hours ago

    At what point do we drop the term "regular" expressions altogether for stuff like this? This is going to sound pedantic since I know that most popularly-used regex implementations are themselves non-regular, but I feel like we're just piling more and more stuff on top of good-old-regexes and trying to turn the concept into a catch-all for anything that does pattern matching on text.

    I guess it just feels icky that "regular expressions" has inherent meaning (i.e. can be represented entirely by a finite automaton) which has become completely diluted at this point.

    That rant aside, cool paper. The idea of bridging formal language theory with modern computational tooling feels timely. I think I would've liked to see more exploration of oracle-based costs, for instance:

    * What happens when oracle outputs are inconsistent/uncertain?

    * What happens as oracle interactions become more computationally expensive?

    • judofyr 4 hours ago

      Your rant is a bit unfounded for this paper as they do actually take completely standard regular expressions (no backtracking or anything like that) and extend it with one more construct. Calling it «semantic regular expressions» seems perfectly reasonable to me. What else to call it?

      As for outside of the computer science sphere (which I find is quite consistent in their terminology): I do agree that it seems like it’s a lost cause and «regex» is now synonymous for «pattern matching using this one specific syntax» :(

      • tgv 2 hours ago

        Regular has a meaning, and it isn't this. It's had this meaning since the 1950s. OTOH, these expressions do not generate a regular language. That's a good reason to me.

        They call it "semantic regular expression" because it apparently already is a lost cause. "Regular expressions with TMs embedded" doesn't quite have the same ring to it. Nobody would see it as a regexp.

        • judofyr an hour ago

          > OTOH, these expressions do not generate a regular language.

          Okay sure you're technically correct here, but only because these expressions generate a subset of a regular language. The LLM can only be invoked on a substring that can expressed as a regular expression, and then it's only used to remove strings from the language. Their results are based heavily on how regular expressions work. A "semantic context-free grammar" would have different type of characteristics and behavior.

          Maybe throwing in the word "extended" or "augmented" would be a bit more clear, but as I reader I definitely would expect "regular expression" to be part of the name.

        • taeric 2 hours ago

          To be fair, I could see this basically allowing a form of "back reference" lookup that lets you offload the back reference to other parts. For example, `/(.); \1 = (.)/.exec("foo; foo = anything")`, but instead of doing a back reference, you could have an oracle lookup.

          I haven't looked at the examples in this paper, yet. But I'm having fun imagining ways this could be used.

      • shayanjm 4 hours ago

        Nested querying is not something that is standard for regular grammars, amongst other aspects introduced in this paper that implicitly require things like memory (again, not standard for regular grammars)

    • krackers 3 hours ago

      >which has become completely diluted at this point.

      Isn't it common in TCS to consider the class X with oracle access to decide L though?

  • ashvardanian 4 hours ago

    There aren’t many papers on the subject, and I’m probably not great at spotting them. The linked paper also doesn’t mention much recent research, so it seems like an excellent opportunity to ask for the collective wisdom of the HN community - what are your favorite undervalued information retrieval papers after 2015?

  • larodi 4 hours ago

    This article lets an UTF8 symbol be something... so now we have all the existing math symbols + all UTF8 icons. Wonder what the papers be look like in 20 years.

  • nickpsecurity 3 hours ago

    They said they want the cost of invoking the oracle, the DB, to be low. Might be a good use for in-memory, embedded DB’s with a good API.

  • kevingadd 3 hours ago

    To me this feels kind of similar to the idea of verifying tokens in a compiler by querying an oracle. i.e. in the script compiler for my game engine, there's an additional pass at the end of compilation that checks things like filename literals or entity ID literals against a database of assets or a list of all the scripts that were compiled, so when I hit F5 in visual studio any typos in those literals are caught before I waste time booting the game up.

    It would be interesting to be able to do this eagerly during parsing, so it's neat to see people thinking about doing this at a regex level, though I'm having trouble thinking of specific cases where I would both want to use a regex and want to query an oracle from inside of it...