@fabiospampinato I've just added regexp-simpler-parser. GrepGrep looks awesome, but there are a ton of grep-like tools so I've kept the relevant section on that ultra slim.
I shouldn't be too opinionated about this stuff. :) I'm assuming you created an excellent design, and that you had at least slightly different principles and goals.
And to be clear, it's obviously not that bad if all I'm advocating is to change the property name from kind/subtype to format (or similar) in such cases. The main reason I'm suggesting this is so it's clear you can change or ignore any format value when working with an AST, which you can't do for a single-purpose kind/subtype without changing the meaning or breaking things.
Thanks for the details. :D JS regex AST design is a space I looked at when creating TS lib oniguruma-parser https://t.co/p2gsU0oL81 earlier this year.
It was my first time creating a true parser, so the implementation might be nontraditional and no doubt has things to improve. But I like to think it includes some good stuff. It parses Oniguruma regexes though, so there are a fair number of syntax/behavior differences and different considerations compared to JS RegExp.
If I recall, I wasn't a big fan of regjsparser's AST design and names. That said, I'm sure I took some bits of inspiration from it, as well as from the other three AST structures in the AST explorer I linked to. All of them had some things I thought should be better and/or simpler, more consistent, etc. If I recall, eslint-community/regexpp's AST design was fairly nice, but then made some strange decisions for new features from flag v. I tried to be quite thoughtful about the AST design I used in oniguruma-parser.
I don't have a UI for testing the parser, but if it's of interest, you can open the dev console at https://t.co/hN9uQYIPLf and enter ex: printAst(toOnigurumaAst(String.raw`[\w&&[^\d_]&&\p{a_hex}]?(?<yolo>(?!)|(?i:a\b)||)`))
> I like that mine uses "type" and "subtype", while regjsparser's uses "type" and "kind", which I find is much more confusing hierarchy-wise.
Good call-out. I'll think about adopting the name "subtype" as well. I'm not sure its implications are perfect in all cases, though. I'm definitely opposed to using "kind" (or subtype) for distinguishing different ways to format the same thing (ex: q, \x71, \161), which regjsparser does.
> I don't have a dedicated "value identifier" node type, for example for the name of named groups, it's just a string.
+1 to this.
> Both \uXXXX and \u{XXXX} emit the same node type for mine but not for regjsparser.
I think you made the right choice. Or rather, regjsparser shouldn't distinguish these via `kind`, since `kind` is also used for critical distinctions that are not about formatting. Distinguishing between \u0000, \u{0}, etc. is not a concern for most AST uses (though it's relevant for code generators, minifiers, etc.). See https://t.co/m3A1vszwH4 where I discuss distinguishing (and modifying) these kinds of things via a `format` prop (which would only be taken as a suggestion by the code generator, if applying the requested format would cause contextual problems).
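To make the `format` idea concrete, here's a rough sketch of a generator treating `format` as a suggestion only. The node shape and the `generateCharacter` function are hypothetical (not taken from regjsparser, oniguruma-parser, or any other library); the point is just that changing or ignoring `format` never changes what the node matches.

```javascript
// Hypothetical node shape (an assumption for illustration, not a real parser's AST):
//   { type: 'Character', value: <code point>, format?: 'hex' | 'unicode' | 'unicodeBraced' }
const toHex = (n, width) => n.toString(16).toUpperCase().padStart(width, '0');

function generateCharacter(node, { unicodeMode = false } = {}) {
  const { value, format } = node;
  // Honor the requested format only when it is valid in this context
  if (format === 'hex' && value <= 0xFF) return '\\x' + toHex(value, 2);
  if (format === 'unicodeBraced' && unicodeMode) return '\\u{' + toHex(value, 1) + '}';
  if (format === 'unicode' && value <= 0xFFFF) return '\\u' + toHex(value, 4);
  // Otherwise fall back to something that is always safe
  if (unicodeMode) return '\\u{' + toHex(value, 1) + '}';
  // Without flag u or v, emit one \uXXXX escape per UTF-16 code unit
  return String.fromCodePoint(value)
    .split('')
    .map(unit => '\\u' + toHex(unit.charCodeAt(0), 4))
    .join('');
}

console.log(generateCharacter({ type: 'Character', value: 0x71, format: 'hex' })); // \x71
console.log(generateCharacter({ type: 'Character', value: 0x71, format: 'unicode' })); // \u0071
// \u{...} needs flag u or v, so the suggestion is ignored here
console.log(generateCharacter({ type: 'Character', value: 0x71, format: 'unicodeBraced' })); // \u0071
```

Whichever format comes out, the generated source still matches "q", which is exactly why it shouldn't share a property with distinctions that change meaning.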
@fabiospampinato I mentioned earlier that browsers have changed their handling of octals over the years. You shouldn't take my word on it, except that that's definitely how browsers behaved back in 2010 when I wrote the post I linked to. :)
Don't get me started. But yeah, so long as you're treating the \1 in that position as a backreference (and not as an octal or identity escape), the fact that backreferences to nonparticipating (and not-yet-participating) groups match the empty string (instead of the standard regex behavior of failing to match and therefore triggering backtracking) is canonical JS regex handling from the beginning.
JS is alone in this terrible behavior, apart from flavors that have explicitly taken prior art from JS (e.g. .NET's non-default ECMAScript mode). IMO this behavior is the second biggest mistake in JS's regex design, after `/g`. It's unintuitive, nonportable, and breaks all kinds of useful things you could otherwise do with backreferences when pushing regex boundaries. In fact, ~15 years ago I convinced Brendan Eich of this - he was open to changing the spec and invited me to help work on bolder RegExp changes for ES6. But I couldn't at the time, and I think the opportunity for changing this has probably passed.
> it's called a backreference
I'd refer to it as a forward reference in that position. Forward references are not the only way that a backreference might refer to a nonparticipating group. It's the same thing conceptually as `(a)|\1` or `(.)??\1` (but NOT the same as `(.??)\1`).
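For anyone following along, a quick sketch of the cases above that you can paste into a console. The variable names are just mine; the comments describe JS behavior only (most other flavors would fail the first two matches).

```javascript
// Group 1 never participates when the second alternative is taken,
// yet \1 still succeeds in JS by matching the empty string.
const alternation = /^(?:(a)|\1b)$/;
console.log(alternation.test('b')); // true

// The lazy optional group is skipped, so \1 again refers to a
// nonparticipating group and matches the empty string.
const lazyOptional = /^(.)??\1$/;
console.log(lazyOptional.test('')); // true
console.log(lazyOptional.exec('')[1]); // undefined

// Here the group always participates; it just captures an empty string.
// This one matches in any flavor, which is why it's NOT the same case.
const participating = /^(.??)\1$/;
console.log(participating.test('')); // true
console.log(participating.exec('')[1] === ''); // true
```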
Forward references can sometimes be useful, if they're within a quantified group like `(\2x|(.))+`. But they're never useful in JS because they will always match the empty string. That's because of another JS-specific regex design mistake: backrefs in quantified groups like this are reset on each repetition. So that last regex is equivalent to `(x|(.))+`. 😢
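A small sketch of that reset behavior, in case it helps. The sample strings are arbitrary; the comments describe what JS engines do, and other flavors would give different captures for the same input.

```javascript
// Captures inside a quantified group are reset at the start of each repetition.
// The last repetition matches "b", so group 1 ends up unset.
console.log(/^(?:(a)|b)+$/.exec('ab')[1]); // undefined

// Because group 2 is always reset before \2 is reached, \2 matches the empty
// string and the two patterns below behave identically in JS.
const withForwardRef = /^(\2x|(.))+$/;
const withoutRef = /^(x|(.))+$/;

for (const input of ['aax', 'xx', 'abc', 'x']) {
  const a = withForwardRef.exec(input);
  const b = withoutRef.exec(input);
  console.log(input, JSON.stringify(a) === JSON.stringify(b)); // same result for each input
}
```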
Agree that we could do without it. \x00 is good enough. IMO JS made a mistake by keeping \0 while deprecating octals, especially since doing so came with an extra rule about what characters are allowed after \0. That made interpolation and escaping regex special characters more complicated.
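Here's a rough illustration of the interpolation problem. `tryCompile` is a throwaway helper I made up for the example; everything else is standard `RegExp`.

```javascript
const nul = '\x00';
console.log(/^\0$/.test(nul)); // true
console.log(/^\x00$/.test(nul)); // true

// Throwaway helper: returns the RegExp, or the error thrown while compiling.
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    return err;
  }
}

// Naively placing a literal digit after \0 changes what the escape means.
// With flag u, \0 followed by a digit is a syntax error:
console.log(tryCompile('\\0' + '1', 'u') instanceof SyntaxError); // true
// Without flag u, it silently becomes the legacy octal \01 (same as \x01):
console.log(tryCompile('^\\0' + '1$', '').test('\x01')); // true

// \x00 has a fixed length, so whatever follows it stays a separate token.
console.log(tryCompile('^\\x00' + '1$', 'u').test(nul + '1')); // true
```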
PS: Some regex flavors additionally allow bare \x (not followed by a hex digit or `{`) as an alternative to \x00, while other flavors treat bare \x as either an error or an identity escape. Identity escapes are another big regex syntax mistake. Glad that JS made its rules stricter with flags `u` and `v`.
Not if the leading digit is a zero. If it is, then up to three digits after the leading zero can be considered to be part of the octal. But this exception only applied outside of character classes. At least, that was the behavior in browsers at the time. (I wrote lots of tooling around this back then.)
Behavior details for the overlap between octals, backreferences, and identity escapes (\8\9) vary significantly across regex flavors and can be extremely complex and unintuitive. Browsers also used to have inconsistent handling for octals, and changed over the years to follow Annex B.
This old post of mine https://t.co/TbIQnttZmH from 2010 includes a section that documented the behavior back then:
- /a\1/: \1 is an octal.
- /(a)\1/: \1 is a backreference.
- /(a)[\1]/: \1 is an octal.
- /(a)\1\2/: \1 is a backreference; \2 is an octal.
- /(a)\01\001[\01\001]/: All occurrences of \01 and \001 are octals. However, according to the ES3+ specs, 0-9 following \0 should cause a SyntaxError.
- /(a)\0001[\0001]/: The \0001 outside the character class is an octal; but inside, the octal ends at the third zero (i.e., the character class matches character index zero or "1"). This regex is therefore equivalent to /(a)\x01[\x00\x31]/; although, as mentioned above, adherence to ES3 would change the meaning.
- /(a)\00001[\00001]/: Outside the character class, the octal ends at the fourth zero and is followed by a literal "1". Inside, the octal ends at the third zero and is followed by a literal "01".
- /\1(a)/: Given that, in JavaScript, backreferences to capturing groups that have not (yet) participated match the empty string, does this regex match "a" (i.e., \1 is treated as a backreference since a corresponding capturing group appears in the regex) or does it match "\x01a" (i.e., the \1 is treated as an octal since it appears before its corresponding group)? Unsurprisingly, browsers disagree.
- /(\2(a)){2}/: Now things get really hairy. Does this regex match "aa", "aaa", "\x02aaa", "2aaa", "\x02a\x02a", or "2a2a"? All of these options seem plausible, and browsers disagree on the correct choice.
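For comparison with that 2010 list, here's a sketch of how I'd expect the simpler cases to behave in engines that follow Annex B today (without flag u or v). Treat it as a check you can run yourself rather than as a statement about every engine; the `checks` table is just my own scaffolding.

```javascript
// Each entry: [regex, input that should match, note]
const checks = [
  [/^a\1$/, 'a\x01', 'no group 1 anywhere in the pattern, so \\1 is an octal'],
  [/^(a)\1$/, 'aa', 'group 1 exists, so \\1 is a backreference'],
  [/^(a)[\1]$/, 'a\x01', 'inside a character class, \\1 is an octal'],
  [/^(a)\1\2$/, 'aa\x02', '\\1 is a backreference; \\2 is an octal'],
  [/^\1(a)$/, 'a', 'group 1 appears later, so \\1 is a (forward) reference matching empty'],
];

for (const [regex, input, note] of checks) {
  console.log(regex.source, regex.test(input), note);
}

// With flag u, the ambiguity goes away: a numbered escape that doesn't
// refer to an existing group is a syntax error.
let threw = false;
try {
  new RegExp('a\\1', 'u');
} catch (err) {
  threw = err instanceof SyntaxError;
}
console.log(threw); // true
```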
Additional aspects to worry about in other regex flavors include whether octal escapes go up to \377 (\xFF) or \777 (\u01FF), whether octal values >= \200 refer to encoded bytes or code points, whether single-digit \1 to \9 should always be treated as backreferences if not followed by digits, etc.
And all this mess for a regex feature (octals) that no one ever uses, except sometimes \0.
@fabiospampinato If we were cleaning up useless/unused regex features in JS, in addition to removing `\cX`, I'd also remove the special meaning of `\b` within character classes to match a backspace control character, and remove all legacy support for octals.
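For reference, a tiny sketch of the two features in question as they work in JS today:

```javascript
// Inside a character class, \b means the backspace character (U+0008)
console.log(/^[\b]$/.test('\b')); // true
console.log(/^[\b]$/.test('\x08')); // true (same character)

// Outside a class, \b is a word boundary assertion and matches no character
console.log(/\bcat\b/.test('a cat')); // true
console.log(/^\b$/.test('\b')); // false

// \cX is a control character escape: \cJ is U+000A (line feed)
console.log(/^\cJ$/.test('\n')); // true
console.log(/^\cj$/.test('\n')); // true (lowercase letters work too)
```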
@fabiospampinato Probably you already know, but note that SpiderMonkey uses Irregexp from V8 as its regex implementation, so generally they should act the same.
Yeah, and it's supported in nearly every modern regex flavor. Only exceptions I can think of are RE2 and Python. Unlike JS, most flavors support more characters than just A-Za-z after `\c` (but the meaning isn't always consistent). And some like Oniguruma support uppercase-C `\C-x` and meta `\M-\C-x`. It gets weird, man.
I've needed this and variations fairly often when processing regexes (but not for perf).
> I *think* you can detect this without parsing (absent flag `v`)
Also not hard to deal with flag v. My ~600 byte https://t.co/HQl6GIJJF1 helps with processing JS regex syntax when you don't need a parser/AST. Has utils for finding unescaped instances of a regex pattern, with optional limit to inside/outside v-mode char classes.
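To show the general idea of doing this without a parser, here's a minimal hand-rolled sketch. It is NOT the library linked above or its API; `findUnescapedOutsideClass` is a hypothetical helper, and it assumes a pattern without flag v (so character classes can't nest).

```javascript
// Hypothetical helper: returns the indexes where `ch` appears unescaped and
// outside of character classes. Assumes no flag v (no nested classes).
function findUnescapedOutsideClass(pattern, ch) {
  const hits = [];
  let inClass = false;
  for (let i = 0; i < pattern.length; i++) {
    const c = pattern[i];
    if (c === '\\') {
      i++; // skip the escaped character
      continue;
    }
    if (inClass) {
      if (c === ']') inClass = false;
      continue;
    }
    if (c === '[') {
      inClass = true;
      continue;
    }
    if (c === ch) hits.push(i);
  }
  return hits;
}

// Only the final dot is an unescaped "." outside a character class
console.log(findUnescapedOutsideClass(String.raw`a\.[.]b.`, '.')); // [ 7 ]
```

Handling flag v mostly means tracking class nesting depth instead of a boolean.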
Aside: There may be interesting perf opportunities by making optimizations in V8/Irregexp that benefit Shiki. CC @antfu
Although VS Code hasn’t adopted Oniguruma-To-ES (and I’m not sure they should), Shiki has. Shiki is massively popular, and since earlier this year has recommended its “JS regex engine” (which wraps Oniguruma-To-ES) when using it in browsers. And since Shiki’s syntax highlighting exclusively uses TextMate grammars, I assume it’s among the most widespread libraries with its level of regex intensity and reliance on native JS regex perf.
The thing that stands out most to me as an obvious and possibly-quick win is if V8 were willing to get ahead of ES standards by adding atomic groups ‘(?>…)’ to Irregexp. There’s a stage 1 proposal at https://t.co/WhlRBfkHsg . Possessive quantifiers (from the same proposal) are great syntax sugar for atomic groups, but if they were left out initially, I think it would be non-controversial to add atomic groups since they share identical syntax and behavior across .NET, Perl, PCRE, Python, Ruby, Java, Oniguruma, Boost, etc. If Irregexp supported them, Oniguruma-To-ES/Shiki would use them to quickly start providing a perf boost to many sites that use Shiki for syntax highlighting.
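For context on what's possible without native support: a well-known way to emulate an atomic group in JS is a lookahead wrapping a capturing group, followed by a backreference. The sketch below only demonstrates the semantics with a toy pattern; I'm not presenting it as exactly what any particular library emits.

```javascript
// Plain group: a+ can backtrack and give back one "a", so the final "a" matches.
console.log(/^(?:a+)a$/.test('aaa')); // true

// Emulated atomic group, standing in for (?>a+):
// the lookahead captures "aaa" and can't be backtracked into once it succeeds,
// then \1 consumes exactly what was captured.
console.log(/^(?=(a+))\1a$/.test('aaa')); // false

// When no backtracking into the group is needed, both forms agree.
console.log(/^(?:a+)b$/.test('aaab')); // true
console.log(/^(?=(a+))\1b$/.test('aaab')); // true
```

The emulation costs an extra capturing group and a backreference per atomic group, which is part of why native `(?>…)` would be attractive.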