Regular Expression

RegExp Constructor

In ES5, the RegExp constructor accepts its arguments in two ways.

In the first case, the first argument is a string, and the second argument specifies the modifiers (flags) of the regular expression.

var regex = new RegExp("xyz", "i");
// Equivalent to
var regex = /xyz/i;

In the second case, the argument is a regular expression, and a copy of the original regular expression is returned.

var regex = new RegExp(/xyz/i);
// Equivalent to
var regex = /xyz/i;

However, ES5 does not allow a second argument to add modifiers in this case; doing so throws an error.

var regex = new RegExp(/xyz/, "i");
// Uncaught TypeError: Cannot supply flags when constructing one RegExp from another

ES6 changed this behavior. If the first argument of the RegExp constructor is a regular expression object, the second argument can be used to specify modifiers. The returned regular expression ignores the modifiers of the original and uses only the newly specified ones.

new RegExp(/abc/gi, "i").flags;
// "i"

In the above code, the modifiers of the original regular expression object are gi, which are overridden by the second argument i.
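As a small illustration of the ES6 behavior described above, the following sketch copies a regular expression while replacing its modifiers. The variable names are illustrative only.

```javascript
const original = /abc/gi;

// ES6: the second argument replaces the original modifiers; it does not merge with them.
const stickyCopy = new RegExp(original, "y");

stickyCopy.source; // "abc"
stickyCopy.flags; // "y"
original.flags; // "gi" (the original object is left untouched)
stickyCopy === original; // false, a new object is created
```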

String methods that use regular expressions

Before ES6, string objects had four methods that could take a regular expression: match(), replace(), search() and split().

In ES6, all four methods internally call RegExp instance methods, so that all regex-related behavior is defined on the RegExp object.

- String.prototype.match calls RegExp.prototype[Symbol.match]
- String.prototype.replace calls RegExp.prototype[Symbol.replace]
- String.prototype.search calls RegExp.prototype[Symbol.search]
- String.prototype.split calls RegExp.prototype[Symbol.split]
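One consequence of this delegation is that any object with a Symbol.match method can be passed to String.prototype.match. The sketch below uses a hypothetical containsFoo object to show the dispatch; it is only an illustration, not a recommended pattern.

```javascript
// Hypothetical matcher object: String.prototype.match delegates to its Symbol.match method.
const containsFoo = {
  [Symbol.match](str) {
    return str.indexOf("foo") === -1 ? null : ["foo"];
  },
};

const hit = "a foo b".match(containsFoo); // ["foo"]
const miss = "bar".match(containsFoo); // null

// Calling the RegExp method directly is equivalent to "a foo b".match(/o+/).
const direct = RegExp.prototype[Symbol.match].call(/o+/, "a foo b");
direct[0]; // "oo"
```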

u modifier

ES6 adds the u modifier to regular expressions, meaning "Unicode mode", to correctly handle Unicode characters greater than \uFFFF. In other words, characters encoded as four bytes in UTF-16 are handled correctly.

/^\uD83D/u.test('\uD83D\uDC2A') // false
/^\uD83D/.test('\uD83D\uDC2A') // true

In the above code, \uD83D\uDC2A is a four-byte UTF-16 encoding that represents a single character. However, ES5 does not support four-byte UTF-16 encodings and treats it as two characters, so the second line of code returns true. With the u modifier, ES6 recognizes it as one character, so the first line of code returns false.

Once the u modifier is added, the behavior of the following regular expression features changes.

(1) Dot character

The dot (.) character in regular expressions means any single character except a newline. The dot cannot match Unicode characters with a code point greater than 0xFFFF unless the u modifier is added.

var s ='𠮷';

/^.$/.test(s) // false
/^.$/u.test(s) // true

The above code indicates that if the u modifier is not added, the regular expression will consider the string to be two characters, and the match will fail.

(2) Unicode character notation

ES6 adds a brace notation for Unicode characters. The regular expression must carry the u modifier for the braces to be recognized; otherwise they are interpreted as a quantifier.

/\u{61}/.test('a') // false
/\u{61}/u.test('a') // true
/\u{20BB7}/u.test('𠮷') // true

The above code shows that without the u modifier, the regular expression does not recognize the \u{61} notation and instead treats it as matching 61 consecutive u characters.

(3) Quantifier

After using the u modifier, all quantifiers will correctly identify Unicode characters with code points greater than 0xFFFF.

/a{2}/.test('aa') // true
/a{2}/u.test('aa') // true
/𠮷{2}/.test('𠮷𠮷') // false
/𠮷{2}/u.test('𠮷𠮷') // true

(4) Predefined patterns

The u modifier also affects whether predefined patterns correctly recognize Unicode characters with a code point greater than 0xFFFF.

/^\S$/.test('𠮷') // false
/^\S$/u.test('𠮷') // true

The \S in the above code is a predefined pattern that matches all non-whitespace characters. Only with the u modifier can it correctly match Unicode characters whose code points are greater than 0xFFFF.

Using this, you can write a function that correctly returns the length of the string.

function codePointLength(text) {
  var result = text.match(/[\s\S]/gu);
  return result ? result.length : 0;
}

var s = "𠮷𠮷";

s.length; // 4
codePointLength(s); // 2

(5) i modifier

Some Unicode characters have different code points but look very similar. For example, \u004B and \u212A are both an uppercase K.

/[a-z]/i.test('\u212A') // false
/[a-z]/iu.test('\u212A') // true

In the above code, without the u modifier, the non-standard K character (\u212A, the Kelvin sign) is not recognized.

(6) Escaping

Without the u modifier, escapes that the regular expression grammar does not define (such as an escaped comma, \,) are silently ignored; in u mode they cause an error.

/\,/ // /\,/
/\,/u // SyntaxError

In the above code, without the u modifier, the backslash before the comma has no effect. With the u modifier, an error is reported.
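Because an invalid escape in u mode is reported when the pattern is compiled, it can be detected at runtime by constructing the pattern from a string. The helper name compiles below is hypothetical and used only for this sketch.

```javascript
// Hypothetical helper: true if the pattern compiles with the given flags,
// false if the engine rejects it with a SyntaxError.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

compiles("\\,", ""); // true: the undefined escape is tolerated
compiles("\\,", "u"); // false: rejected in Unicode mode
compiles("\\.", "u"); // true: \. is a defined escape
```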

RegExp.prototype.unicode property

The new unicode attribute of regular instance objects indicates whether the u modifier is set.

const r1 = /hello/;
const r2 = /hello/u;

r1.unicode; // false
r2.unicode; // true

In the above code, whether the regular expression is set with the u modifier can be seen from the unicode attribute.

y modifier

In addition to the u modifier, ES6 also adds a y modifier to regular expressions, called the "sticky" modifier.

The y modifier works similarly to the g modifier: it is also a global match, and each match starts from the position after the previous successful match. The difference is that the g modifier only requires a match somewhere in the remaining string, while the y modifier requires the match to start at the first remaining position, which is what "sticky" means.

var s = "aaa_aa_a";
var r1 = /a+/g;
var r2 = /a+/y;

r1.exec(s); // ["aaa"]
r2.exec(s); // ["aaa"]

r1.exec(s); // ["aa"]
r2.exec(s); // null

The above code has two regular expressions, one with the g modifier and the other with the y modifier. Each is executed twice. On the first execution both behave the same, and the remaining string is _aa_a in both cases. Since the g modifier has no position requirement, the second execution returns a result, whereas the y modifier requires the match to start at the head of the remaining string, so it returns null.

If you change the regular expression to ensure that the head matches every time, the y modifier will return the result.

var s = "aaa_aa_a";
var r = /a+_/y;

r.exec(s); // ["aaa_"]
r.exec(s); // ["aa_"]

Each time the above code matches, it starts from the head of the remaining string.

Use the lastIndex attribute to better illustrate the y modifier.

const REGEX = /a/g;

// Start matching from position 2 (the character y)
REGEX.lastIndex = 2;

// The match succeeds
const match = REGEX.exec("xaya");

// The match was found at position 3
match.index; // 3

// The next match starts at position 4
REGEX.lastIndex; // 4

// No match from position 4 onward
REGEX.exec("xaya"); // null

In the above code, the lastIndex attribute specifies where each search starts; with the g modifier, the search begins at that position and moves forward until a match is found.

The y modifier also respects the lastIndex attribute, but requires the match to be found exactly at the position specified by lastIndex.

const REGEX = /a/y;

// Specify to start matching from position 2
REGEX.lastIndex = 2;

// Position 2 is not an a, so the sticky match fails
REGEX.exec("xaya"); // null

// Specify to start matching from position 3
REGEX.lastIndex = 3;

// Position 3 is an a, so the sticky match succeeds
const match = REGEX.exec("xaya");
match.index; // 3
REGEX.lastIndex; // 4
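The combination of lastIndex and the y modifier can be wrapped in a small helper that asks whether a pattern matches at one exact position. The function name matchAt is an assumption made for this sketch; it relies on the ES6 constructor behavior of replacing modifiers when copying a regular expression.

```javascript
// Hypothetical helper: does `pattern` match `str` exactly at index `pos`?
function matchAt(pattern, str, pos) {
  // Work on a sticky copy so the caller's regex and its lastIndex are not modified.
  const flags = pattern.flags.includes("y") ? pattern.flags : pattern.flags + "y";
  const sticky = new RegExp(pattern, flags);
  sticky.lastIndex = pos;
  return sticky.exec(str);
}

matchAt(/a/, "xaya", 2); // null, position 2 holds "y"
matchAt(/a/, "xaya", 3); // ["a"], with index 3
```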

In effect, the y modifier implies the head-matching anchor ^ at the current position.

/b/y.exec("aba");
// null

In the above code, the match cannot start at the head of the string, so null is returned. The y modifier was designed to make the head-matching anchor ^ effective in global matching.

The following is an example of the replace method of a string object.

const REGEX = /a/gy;
"aaxa".replace(REGEX, "-"); // '--xa'

In the above code, the last a will not be replaced because it does not appear at the head of the next match.

With only the y modifier, the match method returns just the first match. It must be combined with the g modifier to return all matches.

"a1a2a3".match(/a\d/y); // ["a1"]
"a1a2a3".match(/a\d/gy); // ["a1", "a2", "a3"]

One application of the y modifier is to extract tokens from strings. The y modifier ensures that there are no missing characters between matches.

const TOKEN_Y = /\s*(\+|[0-9]+)\s*/y;
const TOKEN_G = /\s*(\+|[0-9]+)\s*/g;

tokenize(TOKEN_Y, "3 + 4");
// ['3', '+', '4']
tokenize(TOKEN_G, "3 + 4");
// ['3', '+', '4']

function tokenize(TOKEN_REGEX, str) {
  let result = [];
  let match;
  while ((match = TOKEN_REGEX.exec(str))) {
    result.push(match[1]);
  }
  return result;
}

In the above code, if there are no illegal characters in the string, the extraction results of y modifier and g modifier are the same. However, once illegal characters appear, the behavior of the two is different.

tokenize(TOKEN_Y, "3x + 4");
// ['3']
tokenize(TOKEN_G, "3x + 4");
// ['3','+', '4']

In the above code, the g modifier skips over illegal characters, but the y modifier does not, which makes errors easy to detect.
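Building on this, a tokenizer can report exactly where the illegal character is instead of silently stopping. The following is a minimal sketch using the same token pattern; the function name tokenizeStrict and the error message are assumptions for illustration.

```javascript
// Hypothetical strict tokenizer: throws at the first position where no token matches.
function tokenizeStrict(str) {
  const TOKEN = /\s*(\+|[0-9]+)\s*/y;
  const tokens = [];
  let pos = 0;
  while (pos < str.length) {
    TOKEN.lastIndex = pos; // the sticky match must start exactly here
    const match = TOKEN.exec(str);
    if (match === null) {
      throw new SyntaxError("Unexpected character at position " + pos);
    }
    tokens.push(match[1]);
    pos = TOKEN.lastIndex; // continue right after the consumed token
  }
  return tokens;
}

tokenizeStrict("3 + 4"); // ['3', '+', '4']
// tokenizeStrict("3x + 4") throws a SyntaxError pointing at position 1
```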

RegExp.prototype.sticky property

To go with the y modifier, ES6 regular expression instances have a sticky attribute, which indicates whether the y modifier is set.

var r = /hello\d/y;
r.sticky; // true

RegExp.prototype.flags property

ES6 adds the flags attribute to regular expressions, which will return the modifiers of the regular expression.

// ES5 source attribute
// Returns the body of the regular expression
/abc/ig.source
// "abc"

// ES6 flags attribute
// Returns the modifiers of the regular expression
/abc/ig.flags
// 'gi'
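The source and flags attributes together are enough to rebuild a regular expression with an extra modifier. The sketch below adds the g modifier to an existing expression; the helper name toGlobal is hypothetical.

```javascript
// Hypothetical helper: return `re` itself if it is already global,
// otherwise a copy with the g modifier added.
function toGlobal(re) {
  return re.flags.includes("g") ? re : new RegExp(re.source, re.flags + "g");
}

const once = /a/i;
const all = toGlobal(once);

all.flags; // "gi" (flags are always reported in a fixed order)
"aAa".replace(once, "-"); // "-Aa"
"aAa".replace(all, "-"); // "---"
```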

s modifier: dotAll mode

In regular expressions, the dot (.) is a special character that represents any single character, but there are two exceptions. One is a four-byte UTF-16 character, which can be solved with the u modifier; the other is a line terminator character.

A line terminator is a character that marks the end of a line. The following four characters are line terminators.

- U+000A line feed (\n)
- U+000D carriage return (\r)
- U+2028 line separator
- U+2029 paragraph separator

/foo.bar/.test("foo\nbar");
// false

In the above code, because . does not match \n, the regular expression returns false.

However, in many cases we want the dot to match any single character. One workaround is the following.

/foo[^]bar/.test("foo\nbar");
// true

Since this workaround is not very intuitive, ES2018 introduced the s modifier, which makes . match any single character.

/foo.bar/s.test("foo\nbar"); // true

This is called the dotAll mode, that is, dots represent all characters. Therefore, the regular expression also introduces a dotAll attribute, which returns a boolean value, indicating whether the regular expression is in the dotAll mode.

const re = /foo.bar/s;
// Another way of writing
// const re = new RegExp('foo.bar','s');

re.test("foo\nbar"); // true
re.dotAll; // true
re.flags; // 's'

The /s modifier and the multi-line modifier /m do not conflict. When the two are used together, . matches all characters, while ^ and $ match the beginning and end of each line.
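A short example can make the interaction of the two modifiers concrete. The sample string below is made up for illustration; the pattern needs ^ and $ to anchor at line boundaries (m) and the dot to cross a line break (s).

```javascript
const lines = "x\na\nb\ny";

/^a.b$/.test(lines); // false
/^a.b$/m.test(lines); // false: ^ and $ work per line, but . cannot match \n
/^a.b$/s.test(lines); // false: . matches \n, but ^ only matches at the start of the string
/^a.b$/ms.test(lines); // true: both conditions are met
```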

Lookbehind assertions

Previously, regular expressions in JavaScript supported only lookahead and negative lookahead assertions, not lookbehind or negative lookbehind. ES2018 introduced lookbehind assertions, which are already supported by V8 engine version 4.9 (Chrome 62).

A "lookahead assertion" means that x matches only if it comes before y, and is written as /x(?=y)/. For example, to match only the number before a percent sign, write /\d+(?=%)/. A "negative lookahead assertion" means that x matches only if it does not come before y, and is written as /x(?!y)/. For example, to match only numbers that are not before a percent sign, write /\d+(?!%)/.

/\d+(?=%)/.exec('100% of US presidents have been male') // ["100"]
/\d+(?!%)/.exec("that's all 44 of them") // ["44"]

If the regular expressions for the two strings above are swapped, the results are different. Note also that the parenthesized part of the lookahead assertion ((?=%)) is not included in the returned result.

A "lookbehind assertion" is the reverse of a lookahead assertion: x matches only if it comes after y, and is written as /(?<=y)x/. For example, to match only the digits after a dollar sign, write /(?<=\$)\d+/. A "negative lookbehind assertion" is the reverse of a negative lookahead assertion: x matches only if it does not come after y, and is written as /(?<!y)x/. For example, to match only numbers that are not after a dollar sign, write /(?<!\$)\d+/.

/(?<=\$)\d+/.exec('Benjamin Franklin is on the $100 bill') // ["100"]
/(?<!\$)\d+/.exec("it's worth about €90") // ["90"]

In the above example, the parenthesized part of the lookbehind assertion ((?<=\$)) is also not included in the returned result.

The following example uses a lookbehind assertion for string replacement.

const RE_DOLLAR_PREFIX = /(?<=\$)foo/g;
"$foo %foo foo".replace(RE_DOLLAR_PREFIX, "bar");
//'$bar %foo foo'

In the above code, only foo after the dollar sign will be replaced.
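Lookbehind and lookahead assertions can also be combined to match a position rather than any characters. As a sketch, the hypothetical helper below inserts thousands separators into a string that is assumed to consist only of digits.

```javascript
// Hypothetical helper; assumes `digits` contains only the characters 0-9.
function groupThousands(digits) {
  // Match the empty position that has a digit before it (lookbehind)
  // and a whole number of 3-digit groups after it up to the end (lookahead).
  return digits.replace(/(?<=\d)(?=(?:\d{3})+$)/g, ",");
}

groupThousands("1234567"); // "1,234,567"
groupThousands("1000"); // "1,000"
groupThousands("123"); // "123"
```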

A lookbehind assertion /(?<=y)x/ is implemented by first matching x and then going back to the left to match y. This "right first, then left" execution order is the opposite of all other regular expression operations and leads to some unexpected behavior.

First, capture groups inside a lookbehind assertion produce different results from the normal case.

/(?<=(\d+)(\d+))$/.exec('1053') // ["", "1", "053"]
/^(\d+)(\d+)$/.exec('1053') // ["1053", "105", "3"]

In the above code, two groups need to be captured. Without a lookbehind assertion, the first parenthesis is greedy and the second parenthesis can only capture one character, so the result is 105 and 3. With a lookbehind assertion, the execution order is from right to left, so the second parenthesis is greedy and the first parenthesis can only capture one character, giving 1 and 053.

Second, backreferences inside a lookbehind assertion are also in reverse order and must be placed before the corresponding parenthesis.

/(?<=(o)d\1)r/.exec('hodor') // null
/(?<=\1d(o))r/.exec('hodor') // ["r", "o"]

In the above code, if the backreference (\1) in the lookbehind assertion is placed after the parenthesized group, no match is found; it must be placed before the group. This is because the engine scans from left to right to find r and then evaluates the lookbehind assertion from right to left, so the group (o) is captured before \1 is resolved.

Unicode property classes

ES2018 introduced a new notation, \p{...} and \P{...}, which allows regular expressions to match all characters that have a given Unicode property.

const regexGreekSymbol = /\p{Script=Greek}/u;
regexGreekSymbol.test("π"); // true

In the above code, \p{Script=Greek} specifies to match a Greek letter, so matching π succeeds.

A Unicode property class specifies a property name and a property value.

\p{UnicodePropertyName=UnicodePropertyValue}

For some properties, you can write only the property name or only the property value.

\p{UnicodePropertyName}
\p{UnicodePropertyValue}

\P{…} is the negation of \p{…}: it matches characters that do not satisfy the condition.

Note that these two classes only work in Unicode mode, so the u modifier is required when using them. Without the u modifier, \p and \P are not recognized as property classes (engines typically treat them as the literal characters p and P), so the pattern will not match as intended.

Since Unicode has so many attributes, the expressive power of this new class is very strong.

const regex = /^\p{Decimal_Number}+$/u;
regex.test("𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼"); // true

In the above code, the property class matches all decimal digits; as you can see, decimal digits in various styles are all matched successfully.

\p{Number} can even match Roman numerals.

// match all numbers
const regex = /^\p{Number}+$/u;
regex.test("²³¹¼½¾"); // true
regex.test("㉛㉜㉝"); // true
regex.test("ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫ"); // true

Here are some other examples.

// match all spaces
\p{White_Space}

// Match word characters in any script, equivalent to a Unicode version of \w
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

// Match non-word characters in any script, equivalent to a Unicode version of \W
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

// match Emoji
/\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu

// match all arrow characters
const regexArrows = /^\p{Block=Arrows}+$/u;
regexArrows.test('←↑→↓↔↕↖↗↘↙⇏⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙⇧⇩') // true
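The value-only shorthand mentioned earlier is common for the General_Category property, for example \p{L} for letters and \p{Lu} for uppercase letters. The sketch below uses escape sequences for the non-ASCII test characters so that it does not depend on file encoding; the variable names are illustrative.

```javascript
// \p{Lu}: General_Category = Uppercase_Letter, written with the value only.
const startsUpper = /^\p{Lu}/u;

startsUpper.test("\u00C9mile"); // true, É is an uppercase letter
startsUpper.test("\u00E9mile"); // false, é is lowercase
startsUpper.test("\u03A9mega"); // true, Ω is an uppercase Greek letter

// \p{L}: any letter, in any script.
const letters = "a1 \u03B22 \u5B573".match(/\p{L}/gu);
letters.length; // 3
```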

Named group matching

Introduction

Regular expressions use parentheses for group matching.

const RE_DATE = /(\d{4})-(\d{2})-(\d{2})/;

In the above code, there are three sets of parentheses in the regular expression. Using the exec method, these three sets of matching results can be extracted.

const RE_DATE = /(\d{4})-(\d{2})-(\d{2})/;

const matchObj = RE_DATE.exec("1999-12-31");
const year = matchObj[1]; // 1999
const month = matchObj[2]; // 12
const day = matchObj[3]; // 31

One problem with group matching is that the meaning of each group is not easy to see, and groups can only be referenced by a numeric index (such as matchObj[1]). If the order of the groups changes, the indexes in the referencing code must be updated as well.

ES2018 introduced named capture groups, which let you give each group a name, making the code easier to read and the groups easier to reference.

const RE_DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

const matchObj = RE_DATE.exec("1999-12-31");
const year = matchObj.groups.year; // "1999"
const month = matchObj.groups.month; // "12"
const day = matchObj.groups.day; // "31"

In the above code, a named group is written by adding "question mark + angle brackets + group name" (?<year>) at the start of the pattern inside the parentheses. The group name can then be read from the groups property of the result returned by the exec method. The numeric index (matchObj[1]) remains valid as well.

Named group matching is equivalent to adding an ID to each group of matching, which is convenient for describing the purpose of the matching. If the order of the groups changes, there is no need to change the processing code after the match.

If the named group does not match, the corresponding groups object property will be undefined.

const RE_OPT_A = /^(?<as>a+)?$/;
const matchObj = RE_OPT_A.exec("");

matchObj.groups.as; // undefined
"as" in matchObj.groups; // true

In the above code, the named group as finds no match, so matchObj.groups.as is undefined, but the key as always exists in groups.
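Because an unmatched named group yields undefined rather than a missing key, a destructuring default is a convenient way to fill it in. The following sketch parses a time with an optional seconds part; the pattern and the function name parseTime are assumptions for illustration.

```javascript
const RE_TIME = /^(?<hours>\d{2}):(?<minutes>\d{2})(?::(?<seconds>\d{2}))?$/;

// Hypothetical helper: returns the parts as strings, or null if the input does not match.
function parseTime(str) {
  const matchObj = RE_TIME.exec(str);
  if (matchObj === null) return null;
  // groups.seconds is undefined when the optional part is absent,
  // so the destructuring default "00" applies.
  const { hours, minutes, seconds = "00" } = matchObj.groups;
  return { hours, minutes, seconds };
}

parseTime("12:30"); // { hours: "12", minutes: "30", seconds: "00" }
parseTime("12:30:45"); // { hours: "12", minutes: "30", seconds: "45" }
parseTime("noon"); // null
```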

Destructuring assignment and replacement

With named group matching, you can use destructuring assignment to assign variables directly from the match result.

let {
  groups: { one, two },
} = /^(?<one>.*):(?<two>.*)$/u.exec("foo:bar");
one; // foo
two; // bar

When replacing strings, use $<group name> to refer to the named group.

let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;

"2015-01-02".replace(re, "$<day>/$<month>/$<year>");
// '02/01/2015'

In the above code, the second parameter of the replace method is a string, not a regular expression.

The second parameter of the replace method can also be a function. The parameter sequence of the function is as follows.

"2015-01-02".replace(
  re,
  (
    matched, // The entire matching result 2015-01-02
    capture1, // The first group matches 2015
    capture2, // The second group matches 01
    capture3, // The third group matches 02
    position, // The position where the match starts 0
    S, // Original string 2015-01-02
    groups // An object composed of named groups {year, month, day}
  ) => {
    let { day, month, year } = groups;
    return `${day}/${month}/${year}`;
  }
);

With named groups, one more parameter is appended to the end of the function's existing parameter list: an object made up of the named groups. This object can be destructured directly inside the function.

Backreferences

If you want to refer to a "named group match" inside the regular expression, you can use \k<group name>.

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;
RE_TWICE.test("abc!abc"); // true
RE_TWICE.test("abc!ab"); // false

Numeric backreferences (\1) are still valid.

const RE_TWICE = /^(?<word>[a-z]+)!\1$/;
RE_TWICE.test("abc!abc"); // true
RE_TWICE.test("abc!ab"); // false

These two reference syntaxes can also be used at the same time.

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>!\1$/;
RE_TWICE.test("abc!abc!abc"); // true
RE_TWICE.test("abc!abc!ab"); // false
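A typical use of a named backreference is finding an accidentally repeated word. The sketch below is a simplified illustration (it only handles words made of \w characters); the names RE_REPEAT and findRepeatedWord are hypothetical.

```javascript
// A word, some whitespace, then the same word again (referenced by name).
const RE_REPEAT = /\b(?<word>\w+)\s+\k<word>\b/;

// Hypothetical helper: returns the first repeated word, or null if there is none.
function findRepeatedWord(text) {
  const matchObj = RE_REPEAT.exec(text);
  return matchObj ? matchObj.groups.word : null;
}

findRepeatedWord("this is is a test"); // "is"
findRepeatedWord("this is a test"); // null
```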

Match indices

The start and end positions of a match are not very convenient to obtain. The result returned by the exec() method of a regular expression instance has an index attribute giving the start position of the whole match, but when the pattern contains groups, the start position of each group is hard to get.

A proposal, standardized in ES2022, adds an indices attribute to the result of the exec() method, from which the start and end positions of each match can be obtained. In the standardized form, indices is only populated when the regular expression has the d modifier.

const text = "zabbcdef";
const re = /ab/d;
const result = re.exec(text);

result.index; // 1
result.indices; // [[1, 3]]

In the above example, the exec() method returns result. Its index property is the start position of the whole match (ab), and its indices property is an array whose members are arrays holding the start and end positions of each match. Since the regular expression in this example has no groups, the indices array has only one member, which says that the whole match starts at position 1 and ends at position 3.

Note that the start position is included in the match, but the end position is not. For example, the match ab occupies positions 1 and 2 of the original string (counting from 0), so the end position is 3.

If the regular expression contains group matches, the array corresponding to the indices attribute will contain multiple members, providing the start and end positions of each group match.

const text = "zabbcdef";
const re = /ab+(cd)/d; // the d modifier enables indices
const result = re.exec(text);

result.indices; // [[1, 6], [4, 6]]

In the above example, the regular expression contains one group, so the indices array has two members: the first holds the start and end positions of the whole match (abbcd), and the second holds the start and end positions of the group match (cd).

The following is an example of multiple group matching.

const text = "zabbcdef";
const re = /ab+(cd(ef))/d; // the d modifier enables indices
const result = re.exec(text);

result.indices; // [[1, 8], [4, 8], [6, 8]]

In the above example, the regular expression contains two group matches, so the indices attribute array has three members.

If the regular expression contains a named group match, the indices attribute array will also have a groups attribute. This attribute is an object from which the start position and end position of the named group match can be obtained.

const text = "zabbcdef";
const re = /ab+(?<Z>cd)/d; // the d modifier enables indices
const result = re.exec(text);

result.indices.groups; // {Z: [4, 6]}

In the above example, the indices.groups property of the result returned by the exec() method is an object that provides the start position and end position of the named group matching Z.

If the group matching is not successful, the corresponding member of the indices property array is undefined, and the corresponding member of the indices.groups property object is also undefined.

const text = "zabbcdef";
const re = /ab+(?<Z>ce)?/d; // the d modifier enables indices
const result = re.exec(text);

result.indices[1]; // undefined
result.indices.groups["Z"]; // undefined

In the above example, since the group matching is unsuccessful, the group matching members corresponding to the indices property array and the indices.groups property object are both undefined.
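As a usage sketch, the positions of a named group can be used to slice or mark up the original string. This assumes an engine that implements the finalized feature, in which indices is only available when the regular expression carries the d modifier; the sample text and variable names are made up for illustration.

```javascript
const pairRe = /(?<key>\w+)=(?<value>\w+)/d;
const line = "set color=red now";
const pairMatch = pairRe.exec(line);

// Start (inclusive) and end (exclusive) positions of the named group "value".
const [start, end] = pairMatch.indices.groups.value; // [10, 13]
line.slice(start, end); // "red"

// Build a marker line that underlines the group in the original string.
const marker = " ".repeat(start) + "^".repeat(end - start);
```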

String.prototype.matchAll()

If a regular expression has multiple matches in a string, the usual approach is to use the g or y modifier and retrieve them one at a time in a loop.

var regex = /t(e)(st(\d?))/g;
var string = "test1test2test3";

var matches = [];
var match;
while ((match = regex.exec(string))) {
  matches.push(match);
}

matches;
// [
// ["test1", "e", "st1", "1", index: 0, input: "test1test2test3"],
// ["test2", "e", "st2", "2", index: 5, input: "test1test2test3"],
// ["test3", "e", "st3", "3", index: 10, input: "test1test2test3"]
//]

In the above code, the while loop takes out each round of regular matching, a total of three rounds.

ES2020 added the String.prototype.matchAll() method, which retrieves all matches at once. However, it returns an iterator, not an array.

const string = "test1test2test3";
const regex = /t(e)(st(\d?))/g;

for (const match of string.matchAll(regex)) {
  console.log(match);
}
// ["test1", "e", "st1", "1", index: 0, input: "test1test2test3"]
// ["test2", "e", "st2", "2", index: 5, input: "test1test2test3"]
// ["test3", "e", "st3", "3", index: 10, input: "test1test2test3"]

In the above code, since string.matchAll(regex) returns an iterator, its results can be consumed with a for...of loop. Compared with returning an array, the advantage of returning an iterator is that it saves resources when there are a large number of matches.

Converting an iterator to an array is simple: use either the ... operator or the Array.from() method.

// Method 1 of converting to an array
[...string.matchAll(regex)];

// Method 2 of converting to an array
Array.from(string.matchAll(regex));
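matchAll() combines naturally with named groups and the mapping argument of Array.from(). The sketch below collects key/value pairs into an object; the input string and names are made up for illustration. It also shows that matchAll() throws a TypeError when given a regular expression without the g modifier.

```javascript
const input = "a=1, b=22, c=333";
const RE_PAIR = /(?<key>\w+)=(?<value>\d+)/g;

// Array.from accepts the iterator plus a mapping function applied to each match.
const entries = Array.from(input.matchAll(RE_PAIR), (m) => [
  m.groups.key,
  Number(m.groups.value),
]);
const settings = Object.fromEntries(entries); // { a: 1, b: 22, c: 333 }

// Without the g modifier, matchAll() refuses to run.
let threwTypeError = false;
try {
  input.matchAll(/a/);
} catch (e) {
  threwTypeError = e instanceof TypeError; // true
}
```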