ECMAScript proposal: RegExp set notation + properties of strings

Authors

  • Markus Scherer
  • Mathias Bynens

Status

This proposal is at stage 2 of the TC39 process.

As of the May 25, 2021 TC39 meeting, this proposal officially subsumes the properties of strings proposal.

Summary

In ECMAScript regex character classes, we propose to add syntax & semantics for the following set operations:

  • difference/subtraction (in A but not in B)
  • intersection (in both A and B)
  • nested character classes (needed to enable the above)

In addition, by merging with the properties of strings proposal, we also propose to add certain Unicode properties of strings, and string literals in character classes.

Motivation

Many regular expression engines support named character properties, mostly reflecting Unicode character properties, to avoid hardcoding character classes that may require hundreds of ranges and that may change with new versions of Unicode.

However, a character property is often just a starting point. It is common to need additions (union), exceptions (subtraction), and “both this and that” (intersection). See the recommendation to support set operations in UTS #18: Unicode Regular Expressions.

ECMAScript regular expression patterns already support one set operation in limited form: one can create a union of characters, ranges, and classes, as long as those classes are CharacterClassEscapes like \s or \p{Decimal_Number}.

A web search for questions about regular expressions with such set operations reveals workarounds such as hardcoding the ranges resulting from set operations (losing the benefits of named properties) and lookahead assertions (which are unintuitive for this purpose and perform less well).
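
The lookahead workaround mentioned above can be sketched with syntax that ECMAScript already supports (the u flag and property escapes). This is only an illustration of the status quo; the variable names are made up for this example.

```javascript
// Union is already expressible today: whitespace or any decimal digit.
const spaceOrDigit = /^[\s\p{Decimal_Number}]$/u;

// Workaround for difference: a negative lookahead rejects ASCII digits
// before \p{Decimal_Number} consumes the character.
const nonAsciiDigit = /^(?![0-9])\p{Decimal_Number}$/u;

// Workaround for intersection: a positive lookahead requires the Greek
// script before \p{Letter} consumes the character.
const greekLetter = /^(?=\p{Script=Greek})\p{Letter}$/u;

console.log(nonAsciiDigit.test('\u0663')); // ARABIC-INDIC DIGIT THREE
console.log(nonAsciiDigit.test('3'));      // ASCII digit
console.log(greekLetter.test('\u03B1'));   // GREEK SMALL LETTER ALPHA
console.log(greekLetter.test('a'));        // LATIN SMALL LETTER A
```

The set operation is hidden inside an assertion here, which is the readability problem the proposal aims to remove.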

We propose adding syntax & semantics for difference and intersection, as well as nested character classes.

Proposed solution

We propose to extend the syntax for character classes to add support for set difference/subtraction, set intersection, and nested character classes.

High-level API

Within regular expression patterns, we propose enabling the following functionality. Several issues are not yet settled, including single vs. double punctuation and whether some distinct prefix is needed to avoid breaking existing expressions (see FAQ).

// difference/subtraction
[A--B]

// intersection
[A&&B]

// nested character class
[A--[0-9]]

Throughout these high-level examples, A and B can be thought of as placeholders for a character class (e.g. [a-z]) or a property escape (e.g. \p{ASCII}) and maybe (subject to discussion of specifics) single characters and/or character ranges. See the illustrative examples section for concrete real-world use cases.

Illustrative examples

Real-world usage examples from code using ICU’s UnicodeSet, which implements a pattern syntax similar to regex character classes (modified here to use \p{Perl syntax for properties} rather than [:POSIX syntax for properties:]; UnicodeSet supports both):

  • Code that looks for non-ASCII digits, to convert them to ASCII digits:

    [\p{Decimal_Number}--[0-9]]
    
  • Looking for spans of "word/identifier letters" of specific scripts:

    [\p{Script=Khmer}&&[\p{Letter}\p{Mark}\p{Number}]]
    
  • Looking for “breaking spaces”:

    [\p{White_Space}--\p{Line_Break=Glue}]
    

    Note that ECMAScript currently doesn’t support \p{Line_Break=…} — this is an illustrative example regardless.

  • Looking for emoji characters except for the ASCII ones:

    [\p{Emoji}--[#*0-9]]
    
    // …or…
    
    [\p{Emoji}--\p{ASCII}]
    
  • Looking for non-script-specific combining marks:

    [\p{Nonspacing_Mark}&&[\p{Script=Inherited}\p{Script=Common}]]
    
  • Looking for “invisible characters” except for ASCII space:

    [[\p{Other}\p{Separator}\p{White_Space}\p{Default_Ignorable_Code_Point}]--\x20]
    
  • Looking for “first letter in each script” starting from:

    [\P{NFC_Quick_Check=No}--\p{Script=Common}--\p{Script=Inherited}--\p{Script=Unknown}]
    

    Note that ECMAScript currently doesn’t support \p{NFC_Quick_Check=…} — this is an illustrative example regardless.

  • All Greek code points that are either a letter, a mark (e.g. diacritic), or a decimal number:

    [\p{Greek}&&[\p{Letter}\p{Mark}\p{Decimal_Number}]]
    
  • All code points, except for those in the “Other” General_Category, but add back control characters:

    [[\p{Any}--\p{Other}]\p{Control}]
    
  • All assigned code points, except for separators:

    [\p{Assigned}--\p{Separator}]
    
  • All right-to-left and Arabic Letter code points, but remove unassigned code points:

    [[\p{Bidi_Class=R}\p{Bidi_Class=AL}]--\p{Unassigned}]
    

    Note that ECMAScript currently doesn’t support \p{Bidi_Class=…} — this is an illustrative example regardless.

  • All right-to-left and Arabic Letter code points with General_Category “Letter”:

    [\p{Letter}&&[\p{Bidi_Class=R}\p{Bidi_Class=AL}]]
    

    Note that ECMAScript currently doesn’t support \p{Bidi_Class=…} — this is an illustrative example regardless.

  • All characters in the “Other” General_Category EXCEPT for format and control characters (or, equivalently, all surrogate, private use, and unassigned code points):

    [\p{Other}--\p{Format}--\p{Control}]
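
A minimal sketch of how two of these patterns might be used from JavaScript, assuming the double-punctuation syntax shown above and the v flag suggested in the FAQ (both are still under discussion in this explainer). The helper tryRegExp is hypothetical and only guards against engines that reject the flag or the syntax. The ICU-style \p{Greek} is written as \p{Script=Greek} here, which is the form ECMAScript property escapes use.

```javascript
// Returns a RegExp, or null when the engine rejects the pattern or flag
// (for example, an engine without support for the proposed v flag).
function tryRegExp(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (error) {
    if (error instanceof SyntaxError) return null;
    throw error;
  }
}

// Difference: decimal digits that are not ASCII digits.
const nonAsciiDigit = tryRegExp('^[\\p{Decimal_Number}--[0-9]]$', 'v');

// Intersection: Greek letters, marks, and decimal numbers.
const greekWordChar = tryRegExp(
  '^[\\p{Script=Greek}&&[\\p{Letter}\\p{Mark}\\p{Decimal_Number}]]$',
  'v'
);

if (nonAsciiDigit && greekWordChar) {
  console.log(nonAsciiDigit.test('\u0663')); // ARABIC-INDIC DIGIT THREE
  console.log(nonAsciiDigit.test('3'));      // ASCII digit
  console.log(greekWordChar.test('\u03B1')); // GREEK SMALL LETTER ALPHA
  console.log(greekWordChar.test('a'));      // LATIN SMALL LETTER A
} else {
  console.log('This engine does not accept the proposed syntax.');
}
```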
    

FAQ

Is the new syntax backwards-compatible? Do we need another regular expression flag?

It is an explicit goal of this proposal to not break backwards compatibility. Concretely, we don’t want to change behavior of any regular expression pattern that currently does not throw an exception. There needs to be some way to indicate that the new syntax is in use.

We considered 4 options:

  • A new flag outside the expression itself.
  • A modifier inside the expression, of the form (?L) where L is one ASCII letter. (Several regex engines support various modifiers like this.)
  • A prefix like \U… that is not valid under the current u flag (Unicode mode) – but note that \U without the u flag is just the same as U itself.
    • (Banning the use of unknown escape sequences in u RegExps was a conscious choice, made to enable this kind of extension.)
  • A prefix like (?[ that is not valid in existing patterns regardless of flags.
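
The difference between the two prefix options can be checked in current engines. The sketch below uses a hypothetical helper, compiles, to show that \U is only rejected under the u flag, while (?[ is rejected with and without it.

```javascript
// Reports whether a pattern source compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Without the u flag, \U is an identity escape and simply matches "U".
const backslashUWithoutFlag = compiles('\\U', '');
const matchesPlainU = new RegExp('^\\U$').test('U');

// With the u flag, unknown escapes such as \U are a SyntaxError.
const backslashUWithFlag = compiles('\\U', 'u');

// (?[ is not a valid group opener in either mode.
const parenBracketWithoutFlag = compiles('(?[a])', '');
const parenBracketWithFlag = compiles('(?[a])', 'u');

console.log(backslashUWithoutFlag, matchesPlainU, backslashUWithFlag);
console.log(parenBracketWithoutFlag, parenBracketWithFlag);
```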

The idea to use a prefix was suggested in an early TC39 meeting, so we were working with variations of that, for example:

UnicodeCharacterClass = '\UniSet{' ClassContents '}'

However, we found that this is not very developer-friendly.

In particular, one would have to write the prefix and use the u flag. Waldemar pointed out that the prefix looks like it should be enough, and therefore a developer may well accidentally omit adding the u flag. Although this aspect could be addressed by using a more complicated prefix that is currently invalid with and without the u flag (like (?[), doing so would come at the cost of readability.

Also, the use of a backslash-letter prefix would want to enclose the new syntax in {curly braces} because other such syntax (\p{property}, \u{12345}, …) uses curly braces – but not using [square brackets] for the outermost level of a character class looks strange.

Finally, when an expression has several new-syntax character classes, the prefix would have to be used on each one, which is clunky.

An in-expression modifier is an attractive alternative, but ECMAScript does not yet use any such modifiers.

Therefore, a new flag is the simplest, most user-friendly, and syntactically and semantically cleanest way to indicate the new character class syntax. It should imply and build on the u flag.

We suggest using the flag v, as the next letter after u.

We also suggest that the proposed properties of strings require use of this same new flag.

In other words, the new flag would indicate several connected changes related to properties and character classes:

  • properties of strings
  • character classes may contain multi-character-string elements, via string literals or certain properties
  • nested classes
  • set operators
  • simpler parsing of dashes and square brackets

For more discussion see issue 2.

What’s the precedent in other RegExp flavors?

Several other regex engines support some or all of the proposed extensions in some form:

language/implementation union subtraction intersection nested classes symmetric difference
ICU regex
java.util.regex.Pattern 🤷 *
Perl (“experimental feature available starting in 5.18”)
.Net
XML Schema
Apache Xerces2 XPath regex
Python regex module (not built-in "re")
Ruby Regexp
ECMAScript prior to this proposal
ECMAScript with this proposal

* Subtraction is documented as intersection with negation. With only support for negation + nested classes, you already have the functional equivalent of intersection & subtraction: [^[^ab][^cd]] === [[ab]&&[cd]] and [^[^ab][cd]] === [[ab]--[cd]]. This is just not very readable. For this reason, our proposal includes dedicated syntax for intersection and subtraction as well.
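
The two equivalences in the footnote follow from De Morgan's laws. As a sanity check, they can be modeled with plain JavaScript Sets over a small, made-up universe of characters; this models the set algebra only and does not exercise any regex engine.

```javascript
// A tiny hypothetical universe standing in for "all characters".
const universe = new Set('abcdef');

const union = (x, y) => new Set([...x, ...y]);
const complement = (x) => new Set([...universe].filter((c) => !x.has(c)));
const intersection = (x, y) => new Set([...x].filter((c) => y.has(c)));
const difference = (x, y) => new Set([...x].filter((c) => !y.has(c)));
const sameSet = (x, y) => x.size === y.size && [...x].every((c) => y.has(c));

const A = new Set('abc');
const B = new Set('bcd');

// [^[^A][^B]] is the complement of (not-A union not-B), i.e. A && B.
const intersectionViaNegation = complement(union(complement(A), complement(B)));

// [^[^A]B] is the complement of (not-A union B), i.e. A -- B.
const differenceViaNegation = complement(union(complement(A), B));

console.log(sameSet(intersectionViaNegation, intersection(A, B)));
console.log(sameSet(differenceViaNegation, difference(A, B)));
```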

These all differ somewhat in syntax and semantics (e.g. operator precedence).

How does this interact with properties of strings a.k.a. the sequence properties proposal?

We described the exact interactions between the two proposals on the path to stage 2. (See issue #3 for background.)

We propose to require the new flag in order to enable properties-of-strings as well as allowing new-syntax character classes to contain multi-character-string elements (from string literals or properties-of-strings used inside a class).

Can a property of strings change into a property of characters, or vice versa?

Short answer: no.

Long answer: We brought this up with the Unicode Technical Committee (UTC) in May 2019 (see L2/19-168 + meeting notes), and later (in April 2021) proposed a concrete new stability policy (see L2/21-091 + meeting notes). The UTC reached consensus to approve our proposal. The domain of a normative or informative Unicode property must never change. In particular, a property of characters must never be changed into a property of strings, and vice versa.

Can a property or character class match an infinite set of strings?

Short answer: no.

This proposal, just like the original properties of strings proposal, adds support for certain properties of strings, each of which expands to a finite, well-defined set of strings (Basic_Emoji also applies to many single characters). It also adds syntax for character classes with explicitly enumerated strings, which likewise creates a finite set. This is a natural extension from finite properties of characters and finite character classes/sets of characters.

For example, in UTS #51 there is a very clear distinction between

  1. an emoji zwj sequence, defined via a regular expression that matches an infinite set of strings
  2. the RGI emoji ZWJ sequence set (= the RGI_Emoji_ZWJ_Sequence property) which is a finite set of strings listed in a data file

It is theoretically possible to support named matchers for infinite sets of strings, that is, a kind of named sub-regular-expression. That is decidedly not part of this proposal, nor is any speculation about possible syntax and semantics of such hypothetical expressions part of this proposal.

There is enough reserved syntax (e.g., curly braces) to enable wide-ranging extensions in the future, but we don’t plan to build something specific into the proposed spec changes.

What’s the match order for character classes containing strings?

This proposal ensures longest strings are matched first, so that a prefix like 'xy' does not hide a longer string like 'xyz'. For example, the pattern [a-c\q{W|xy|xyz}] applies to the strings 'a', 'b', 'c', 'W', 'xy', and 'xyz'. This pattern behaves like xyz|xy|a|b|c|W or xyz|xy|[a-cW].
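
The effect of longest-first ordering can be reproduced today with ordinary alternation. The sketch below is an emulation under stated assumptions, not the proposed matching algorithm: the helpers escapeForRegExp and longestFirstAlternation are hypothetical, and string length is measured in UTF-16 code units for simplicity.

```javascript
// Escape pattern syntax characters so each string is matched literally.
const escapeForRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build an alternation with longer strings placed before shorter ones.
function longestFirstAlternation(strings) {
  const sorted = [...strings].sort((a, b) => b.length - a.length);
  return sorted.map(escapeForRegExp).join('|');
}

const alternation = longestFirstAlternation(['W', 'xy', 'xyz']);

// Emulates [a-c\q{W|xy|xyz}] as xyz|xy|W|[a-c].
const longestFirst = new RegExp('(?:' + alternation + '|[a-c])', 'u');

// A naive ordering lets the prefix 'xy' hide the longer string 'xyz'.
const prefixFirst = /xy|xyz|W|[a-c]/u;

console.log(alternation);
console.log(longestFirst.exec('xyz')[0]);
console.log(prefixFirst.exec('xyz')[0]);
```

Because alternation in ECMAScript tries alternatives from left to right, the order of the strings decides which one wins, which is why an unordered set needs a defined rule such as longest first.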

Matching the longest strings first is key to the integration with properties of strings like \p{RGI_Emoji}. A Unicode property defines a set of characters/strings in the mathematical sense; in particular, no order. Thus, there is no order of the strings in e.g. [\p{RGI_Emoji}--\q{🇧🇪}] that we could preserve.

For more details on the rationale for matching longest strings first, see issue #25.

How does subtraction behave in the case of A--B where B is not a proper subset of A?

As mentioned in the answer to the previous question, according to both the current ECMAScript specification and other regular expression implementations, character classes are mathematical sets. As such, the removal of strings that are not present in the original set is not an error, but rather a no-op. Example (note that RGI_Emoji includes the string 🇧🇪, but RGI_Emoji_ZWJ_Sequence does not):

# Proper subset.
[\p{RGI_Emoji}--\q{🇧🇪}]
# Not a proper subset.
[\p{RGI_Emoji_ZWJ_Sequence}--\q{🇧🇪}]

It would be confusing and counterproductive if one of these patterns threw an exception.
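
A minimal sketch of the intended behavior, assuming an engine that accepts the syntax above together with the v flag suggested earlier in this FAQ. The helper tryRegExp is hypothetical and returns null where the syntax is not accepted; the emoji are written as escapes to keep the source ASCII-only.

```javascript
// Returns a RegExp, or null when the engine rejects the pattern or flag.
function tryRegExp(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (error) {
    if (error instanceof SyntaxError) return null;
    throw error;
  }
}

const FLAG_BELGIUM = '\u{1F1E7}\u{1F1EA}';
const FLAG_FRANCE = '\u{1F1EB}\u{1F1F7}';
const WOMAN_TECHNOLOGIST = '\u{1F469}\u200D\u{1F4BB}';

// The Belgian flag is in RGI_Emoji, so the subtraction removes it.
const rgiWithoutBelgium = tryRegExp(
  '^[\\p{RGI_Emoji}--\\q{' + FLAG_BELGIUM + '}]$', 'v');

// The Belgian flag is not a ZWJ sequence, so the subtraction is a no-op
// and the pattern still compiles.
const zwjWithoutBelgium = tryRegExp(
  '^[\\p{RGI_Emoji_ZWJ_Sequence}--\\q{' + FLAG_BELGIUM + '}]$', 'v');

if (rgiWithoutBelgium && zwjWithoutBelgium) {
  console.log(rgiWithoutBelgium.test(FLAG_BELGIUM));
  console.log(rgiWithoutBelgium.test(FLAG_FRANCE));
  console.log(zwjWithoutBelgium.test(WOMAN_TECHNOLOGIST));
} else {
  console.log('This engine does not accept the proposed syntax.');
}
```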

Several of the real-world illustrative examples in this explainer rely on this useful A--B pattern, and it is crucial that we support it. See issue #32 for more background.

What about symmetric difference?

We considered also proposing an operator for symmetric difference (see issue #5), but we did not find a good use case and wanted to keep the proposal simple.

Instead, we are proposing to reserve doubled ASCII punctuation and symbols for future use. That will allow for future proposals to add ~~ for example, as suggested in UTS #18, for symmetric difference.

Does this proposal affect ECMAScript lexing?

No. It’s an explicit goal of our proposal that a correct ECMAScript lexer before this proposal remains a correct ECMAScript lexer after this proposal.

TC39 meeting notes + slides

Specification

(We developed the draft spec changes in a Google Doc, but all of the changes from there are now in the pull request. There are just a few discussion threads left in the doc that we will resolve. Please review the pull request and the HTML diffs.)

Implementations