New String Methods

This chapter introduces the new methods of string objects.

String.fromCodePoint()

ES5 provides the String.fromCharCode() method, which returns the character corresponding to a Unicode code point. However, this method cannot handle code points greater than 0xFFFF.

String.fromCharCode(0x20bb7);
// "ஷ"

In the code above, String.fromCharCode() cannot handle code points greater than 0xFFFF, so 0x20BB7 overflows: the leading digit 2 is discarded, and the method returns the character for code point U+0BB7 instead of the character for U+20BB7.

ES6 provides the String.fromCodePoint() method, which handles code points greater than 0xFFFF and so makes up for this shortcoming of String.fromCharCode(). Functionally, it is the inverse of the codePointAt() method described later in this chapter.

String.fromCodePoint(0x20bb7);
// "𠮷"
String.fromCodePoint(0x78, 0x1f680, 0x79) === "x\uD83D\uDE80y";
// true

In the code above, when String.fromCodePoint() is given multiple arguments, the resulting characters are combined into a single string, which is returned.

Note that fromCodePoint() is defined on the String object itself, whereas codePointAt() is defined on string instances.
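
As a rough illustration of the difference described above, the following sketch builds the same character in three ways. The variable names are only illustrative.

```javascript
// Build U+20BB7 directly from its code point (ES6).
const viaCodePoint = String.fromCodePoint(0x20bb7);

// Build it the ES5 way, from its two UTF-16 code units (a surrogate pair).
const viaSurrogates = String.fromCharCode(0xd842, 0xdfb7);

// fromCharCode() keeps only the low 16 bits, so 0x20bb7 becomes 0x0bb7.
const truncated = String.fromCharCode(0x20bb7);

console.log(viaCodePoint === viaSurrogates); // true
console.log(truncated === "\u0BB7"); // true
console.log(viaCodePoint.codePointAt(0) === 0x20bb7); // true
```

The last line shows the round trip: codePointAt() recovers the value that was passed to fromCodePoint().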

String.raw()

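ES6 also adds a raw() method to the native String object. It returns a string in which every backslash is kept as a literal character rather than being interpreted as the start of an escape sequence (written as an ordinary string literal, each backslash therefore appears escaped, with another backslash in front of it). It is mostly used to process template strings.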

String.raw`Hi\n${2 + 3}!`;
// Returns "Hi\\n5!" when written as a string literal; printed, it reads Hi\n5!

String.raw`Hi\u000A!`;
// Returns "Hi\\u000A!" when written as a string literal; printed, it reads Hi\u000A!

If a backslash in the original string is already escaped, String.raw() escapes it again.

String.raw`Hi\\n`;
// Returns "Hi\\\\n"

String.raw`Hi\\n` === "Hi\\\\n"; // true

String.raw() can serve as the basic method for processing template strings: it substitutes all variables and escapes the backslashes, so that the result is ready for further use as a string.

String.raw() is essentially an ordinary function, except that it is designed to be used as a tag function for template strings. If it is called as an ordinary function, its first argument must be an object with a raw property, and the value of that property must be an array corresponding to the parsed pieces of the template string.

// `foo${1 + 2}bar`
// Equivalent to
String.raw({ raw: ["foo", "bar"] }, 1 + 2); // "foo3bar"

In the above code, the first parameter of the String.raw() method is an object, and its raw property is equivalent to the array obtained after parsing the original template string.

As a function, String.raw() is implemented roughly as follows.

String.raw = function (strings, ...values) {
  let output = "";
  let index;
  for (index = 0; index < values.length; index++) {
    output += strings.raw[index] + values[index];
  }

  output += strings.raw[index];
  return output;
};
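
To see the effect in practice, the short sketch below uses the built-in String.raw on a Windows-style path and then calls it as an ordinary function. The path and variable names are made up for illustration.

```javascript
// In an ordinary template string, \t and \n are interpreted as tab and newline.
const cooked = `C:\temp\new`;

// With String.raw, the backslashes stay in the result as literal characters.
const raw = String.raw`C:\temp\new`;

console.log(cooked.length); // 9
console.log(raw.length); // 11
console.log(raw === "C:\\temp\\new"); // true

// Called as an ordinary function: the pieces and the values are interleaved.
console.log(String.raw({ raw: ["a", "b", "c"] }, 1, 2)); // "a1b2c"
```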

Instance method: codePointAt()

Inside JavaScript, characters are stored in UTF-16 format, with each character fixed at 2 bytes. Characters that need 4 bytes of storage (those with a Unicode code point greater than 0xFFFF) are treated by JavaScript as two characters.

var s = "𠮷";

s.length; // 2
s.charAt(0); //''
s.charAt(1); //''
s.charCodeAt(0); // 55362
s.charCodeAt(1); // 57271

In the code above, the code point of the Chinese character "𠮷" (note that this is not the character "吉" as in "吉祥", meaning auspicious) is 0x20BB7, and its UTF-16 encoding is 0xD842 0xDFB7 (decimal 55362 57271), which takes 4 bytes to store. JavaScript does not handle such 4-byte characters correctly: the string length is reported as 2, charAt() cannot read the whole character, and charCodeAt() can only return the values of the first two bytes and the last two bytes separately.

ES6 provides the codePointAt() method, which correctly handles characters stored in 4 bytes and returns the code point of a character.

let s = "𠮷a";

s.codePointAt(0); // 134071
s.codePointAt(1); // 57271

s.codePointAt(2); // 97

The parameter of codePointAt() is the position of the character in the string (starting from 0). In the code above, JavaScript treats "𠮷a" as three characters. At the first position, codePointAt() correctly recognizes "𠮷" and returns its decimal code point 134071 (hexadecimal 20BB7). At the second position (the last two bytes of "𠮷") and the third position (the character "a"), codePointAt() returns the same result as charCodeAt().

In short, codePointAt() correctly returns the code point of a 4-byte (32-bit) UTF-16 character. For ordinary characters stored in two bytes, it returns the same result as charCodeAt().

The codePointAt() method returns the decimal value of the code point. If you want the hexadecimal value, you can use the toString() method to convert it.

let s = "𠮷a";

s.codePointAt(0).toString(16); // "20bb7"
s.codePointAt(2).toString(16); // "61"

You may have noticed that the position passed to codePointAt() is still not right. In the code above, the character a is really the second character of s, so its index ought to be 1, yet 2 must be passed to codePointAt(). One way around this is the for...of loop, which correctly recognizes 32-bit UTF-16 characters.

let s = "𠮷a";
for (let ch of s) {
  console.log(ch.codePointAt(0).toString(16));
}
// 20bb7
// 61

Another approach is to use the spread operator (...) to expand the string into an array of characters.

let arr = [..."𠮷a"]; // arr.length === 2
arr.forEach((ch) => console.log(ch.codePointAt(0).toString(16)));
// 20bb7
// 61

The codePointAt() method is the easiest way to test whether a character consists of two bytes or four bytes.

function is32Bit(c) {
  return c.codePointAt(0) > 0xffff;
}

is32Bit("𠮷"); // true
is32Bit("a"); // false
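
The for...of technique above can be wrapped in a small helper. The following is a minimal sketch; codePoints is a hypothetical name, not a built-in method.

```javascript
// Hypothetical helper: list the code points of a string, one per character,
// by iterating with for...of (which walks code points, not 2-byte units).
function codePoints(str) {
  const result = [];
  for (const ch of str) {
    result.push(ch.codePointAt(0));
  }
  return result;
}

const s = "𠮷a";
console.log(s.length); // 3 (2-byte units)
console.log(codePoints(s).length); // 2 (characters)
console.log(codePoints(s).map((cp) => cp.toString(16))); // ["20bb7", "61"]
```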

Instance method: normalize()

Many European languages have tone marks and accent marks. Unicode offers two ways to represent them. One is to provide the accented character directly, such as Ǒ (\u01D1). The other is to provide combining characters: the base character and the accent mark are two characters that combine into one, such as O (\u004F) and ˇ (\u030C) combining into Ǒ (\u004F\u030C).

These two representations are visually and semantically equivalent, but JavaScript does not recognize them as equal.

"\u01D1" === "\u004F\u030C"; //false

"\u01D1".length; // 1
"\u004F\u030C".length; // 2

The code above shows that JavaScript treats the combined sequence as two characters, so the two representations compare as unequal.

ES6 provides the normalize() method of string instances, which is used to unify the different representation methods of characters into the same form. This is called Unicode normalization.

"\u01D1".normalize() === "\u004F\u030C".normalize();
// true

The normalize() method accepts one parameter that specifies the normalization form. The four possible values are as follows.

- NFC, the default, stands for "Normalization Form Canonical Composition". It returns the composed form of multiple simple characters. "Canonical equivalence" means visual and semantic equivalence.
- NFD stands for "Normalization Form Canonical Decomposition". Under canonical equivalence, it returns the simple characters into which a composed character decomposes.
- NFKC stands for "Normalization Form Compatibility Composition". It returns the composed character. "Compatibility equivalence" means semantically equivalent but not visually equivalent, such as "囍" and "喜喜". (This is only an illustration; the normalize() method does not recognize Chinese characters.)
- NFKD stands for "Normalization Form Compatibility Decomposition". Under compatibility equivalence, it returns the simple characters into which a composed character decomposes.

"\u004F\u030C".normalize("NFC").length; // 1
"\u004F\u030C".normalize("NFD").length; // 2

The above code indicates that the NFC parameter returns the composite form of the character, and the NFD parameter returns the decomposed form of the character.

However, the normalize() method currently cannot recognize compositions of three or more characters. In that case you still have to fall back on regular expressions and test by Unicode code point range.
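
A typical use of normalization is to compare text that may arrive in different forms. The sketch below assumes a hypothetical helper named sameText; the ligature example uses U+FB01, which has a compatibility decomposition into the two letters "fi".

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log("\u01D1" === "\u004F\u030C"); // false
console.log(sameText("\u01D1", "\u004F\u030C")); // true

// The canonical form leaves the ligature alone; the compatibility form splits it.
console.log("\uFB01".normalize("NFC") === "\uFB01"); // true
console.log("\uFB01".normalize("NFKC") === "fi"); // true
```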

Instance methods: includes(), startsWith(), endsWith()

Traditionally, JavaScript had only the indexOf() method for determining whether one string is contained in another. ES6 adds three new methods.

- includes(): returns a boolean indicating whether the argument string was found.
- startsWith(): returns a boolean indicating whether the argument string is at the start of the original string.
- endsWith(): returns a boolean indicating whether the argument string is at the end of the original string.

let s = "Hello world!";

s.startsWith("Hello"); // true
s.endsWith("!"); // true
s.includes("o"); // true

All three methods accept a second parameter that indicates where the search starts.

let s = "Hello world!";

s.startsWith("world", 6); // true
s.endsWith("Hello", 5); // true
s.includes("Hello", 6); // false

The code above shows that, when the second parameter n is used, endsWith() behaves differently from the other two methods: it applies to the first n characters, whereas the other two apply from position n to the end of the string.
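
One way to picture the second parameter is to rewrite each call with slice(). The sketch below is only an illustration of that equivalence for this particular string.

```javascript
const s = "Hello world!";

// The ES5 idiom and the ES6 method answer the same question.
console.log(s.indexOf("world") !== -1); // true
console.log(s.includes("world")); // true

// startsWith(x, n) looks at the string from position n onward.
console.log(s.startsWith("world", 6) === s.slice(6).startsWith("world")); // true

// endsWith(x, n) looks at the first n characters only.
console.log(s.endsWith("Hello", 5) === s.slice(0, 5).endsWith("Hello")); // true
```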

Instance method: repeat()

The repeat() method returns a new string consisting of the original string repeated n times.

"x".repeat(3); // "xxx"
"hello".repeat(2); // "hellohello"
"na".repeat(0); // ""

If the parameter is a fraction, it is truncated to an integer (rounded toward zero).

"na".repeat(2.9); // "nana"

If the parameter of repeat() is negative or Infinity, a RangeError is thrown.

"na".repeat(Infinity);
// RangeError
"na".repeat(-1);
// RangeError

However, if the parameter is a fraction between 0 and -1, it is treated as 0, because truncation happens first: a fraction between 0 and -1 truncates to -0, which repeat() treats as 0.

"na".repeat(-0.9); // ""

The parameter NaN is equivalent to 0.

"na".repeat(NaN); // ""

If the parameter of repeat is a string, it will be converted to a number first.

"na".repeat("na"); // ""
"na".repeat("3"); // "nanana"

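A common practical use of repeat() is to build indentation or separator lines. The following is a minimal sketch; indent is a hypothetical helper, and the two-space step is an arbitrary choice.

```javascript
// Hypothetical helper: indent every line of a block of text by `level` steps
// of two spaces each.
function indent(text, level) {
  const prefix = "  ".repeat(level);
  return text
    .split("\n")
    .map((line) => prefix + line)
    .join("\n");
}

console.log(JSON.stringify(indent("a\nb", 2))); // "    a\n    b"
console.log("-".repeat(10)); // ----------
```
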
Instance methods: padStart(), padEnd()

ES2017 introduced string padding. If a string is shorter than the specified length, it is padded at the start or the end: padStart() pads at the start, and padEnd() pads at the end.

"x".padStart(5, "ab"); //'ababx'
"x".padStart(4, "ab"); //'abax'

"x".padEnd(5, "ab"); //'xabab'
"x".padEnd(4, "ab"); //'xaba'

In the code above, padStart() and padEnd() each accept two parameters. The first is the maximum length up to which padding applies (that is, the length of the padded result), and the second is the string used for padding.

If the length of the original string is equal to or greater than the maximum length, no padding is applied and the original string is returned.

"xxx".padStart(2, "ab"); //'xxx'
"xxx".padEnd(2, "ab"); //'xxx'

If the padding string and the original string together exceed the maximum length, the padding string is truncated to fit.

"abc".padStart(10, "0123456789");
// '0123456abc'

If the second parameter is omitted, spaces are used for padding by default.

"x".padStart(4); // '   x'
"x".padEnd(4); // 'x   '

A common use of padStart() is to pad numbers to a fixed number of digits. The following code generates 10-digit numeric strings.

"1".padStart(10, "0"); // "0000000001"
"12".padStart(10, "0"); // "0000000012"
"123456".padStart(10, "0"); // "0000123456"

Another use is to hint at the expected string format.

"12".padStart(10, "YYYY-MM-DD"); // "YYYY-MM-12"
"09-12".padStart(10, "YYYY-MM-DD"); // "YYYY-09-12"

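The fixed-width idea above extends naturally to clock times and simple aligned listings. The sketch below is one possible way to do it; formatTime and the sample rows are hypothetical.

```javascript
// Hypothetical helper: format hours, minutes and seconds as hh:mm:ss.
function formatTime(hours, minutes, seconds) {
  return [hours, minutes, seconds]
    .map((n) => String(n).padStart(2, "0"))
    .join(":");
}

console.log(formatTime(7, 5, 9)); // "07:05:09"

// padEnd() can line up the first column of a simple listing.
const rows = [["id", "1"], ["name", "x"]];
const lines = rows.map(([key, value]) => key.padEnd(6, ".") + value);
console.log(lines); // ["id....1", "name..x"]
```
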
Instance methods: trimStart(), trimEnd()

ES2019 added the trimStart() and trimEnd() methods to string instances. They behave like trim(): trimStart() removes whitespace from the start of the string, and trimEnd() removes whitespace from the end. Both return a new string and do not modify the original.

const s = "  abc  ";

s.trim(); // "abc"
s.trimStart(); // "abc  "
s.trimEnd(); // "  abc"

In the code above, trimStart() removes only the leading whitespace and keeps the trailing whitespace. trimEnd() behaves analogously.

Besides the space character, these two methods also remove other invisible whitespace at the start (or end) of the string, such as tabs and line breaks.

Browsers also provide two additional methods: trimLeft() is an alias of trimStart(), and trimRight() is an alias of trimEnd().
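
The handling of tabs and line breaks can be checked with a short sketch. JSON.stringify is used here only to make the invisible characters visible in the output.

```javascript
const padded = "\t  abc \n";

console.log(JSON.stringify(padded.trimStart())); // "abc \n"
console.log(JSON.stringify(padded.trimEnd())); // "\t  abc"
console.log(JSON.stringify(padded.trim())); // "abc"

// The original string is unchanged.
console.log(padded.length); // 8
```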

Instance method: matchAll()

The matchAll() method returns all matches of a regular expression in the current string. For details, see the chapter "Regular Expression Extensions".
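
As a brief taste before that chapter, here is a minimal sketch. It assumes a regular expression with the g modifier, which matchAll() requires; the sample text is made up.

```javascript
const text = "a1b22c333";

// matchAll() returns an iterator of match objects; spread it into an array.
const matches = [...text.matchAll(/\d+/g)];

console.log(matches.map((m) => m[0])); // ["1", "22", "333"]
console.log(matches.map((m) => m.index)); // [1, 3, 6]
```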

Instance method: replaceAll()

Historically, the string instance method replace() could replace only the first match.

"aabbcc".replace("b", "_");
//'aa_bcc'

In the above example, replace() only replaces the first b with an underscore.

If you want to replace all matches, you have to use the g modifier of regular expressions.

"aabbcc".replace(/b/g, "_");
//'aa__cc'

Regular expressions are not that convenient or intuitive, after all. ES2021 introduced the replaceAll() method, which replaces all matches at once.

"aabbcc".replaceAll("b", "_");
//'aa__cc'

It is used in the same way as replace(): it returns a new string and does not change the original string.

String.prototype.replaceAll(searchValue, replacement);

In the code above, searchValue is the search pattern, which can be a string or a global regular expression (one with the g modifier).

If searchValue is a regular expression without the g modifier, replaceAll() will report an error. This is different from replace().

// No error
"aabbcc".replace(/b/, "_");

// Throws an error
"aabbcc".replaceAll(/b/, "_");

In the above example, /b/ without the g modifier will cause replaceAll() to report an error.
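
The difference can be observed with a small try/catch wrapper. In the sketch below, errorNameOf is a hypothetical helper that reports the name of the error a call throws, or null if it throws nothing.

```javascript
// Hypothetical helper: report whether a call throws, and with which error type.
function errorNameOf(fn) {
  try {
    fn();
    return null;
  } catch (err) {
    return err.name;
  }
}

console.log(errorNameOf(() => "aabbcc".replace(/b/, "_"))); // null
console.log(errorNameOf(() => "aabbcc".replaceAll(/b/, "_"))); // "TypeError"
console.log(errorNameOf(() => "aabbcc".replaceAll(/b/g, "_"))); // null
```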

The second parameter of replaceAll(), replacement, is a string containing the replacement text. Some special patterns can be used in it.

- $&: the matched substring.
- $`: the text before the match.
- $': the text after the match.
- $n: the content of the nth captured group, where n is a natural number starting from 1. This only takes effect when the first parameter is a regular expression.
- $$: a literal dollar sign $.

Here are some examples.

// $& represents the matched string, which is `b` itself
// So the returned result is consistent with the original string
"abbc".replaceAll("b", "$&");
//'abbc'

// $` represents the string before the matching result
// For the first `b`, $` refers to `a`
// For the second `b`, $` refers to `ab`
"abbc".replaceAll("b", "$`");
//'aaabc'

// $' represents the string after the matching result
// For the first `b`, $' refers to `bc`
// For the second `b`, $' refers to `c`
"abbc".replaceAll("b", `$'`);
//'abccc'

// $1 represents the first group match of the regular expression, referring to `ab`
// $2 represents the second group match of the regular expression, referring to `bc`
"abbc".replaceAll(/(ab)(bc)/g, "$2$1");
//'bcab'

// $$ refers to $
"abc".replaceAll("b", "$$");
//'a$c'

In addition to a string, the second parameter replacement of replaceAll() can also be a function. The return value of the function will replace the text matched by the first parameter searchValue.

"aabbcc".replaceAll("b", () => "_");
//'aa__cc'

In the above example, the second parameter of replaceAll() is a function whose return value will replace all matches of b.

This replacement function can accept multiple parameters. The first parameter is the matched text, and the parameters that follow are the captured groups (one parameter per group). After those come two more parameters: the second-to-last is the position of the match within the whole string, and the last is the original string.

const str = "123abc456";
const regex = /(\d+)([a-z]+)(\d+)/g;

function replacer(match, p1, p2, p3, offset, string) {
  return [p1, p2, p3].join("-");
}

str.replaceAll(regex, replacer);
// 123-abc-456

In the example above, the regular expression has three groups, so the first parameter of the replacer() function, match, is the matched text (that is, the string 123abc456), and the following three parameters p1, p2 and p3 are the three captured groups in order.
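
The position parameter is easiest to see with a string pattern, where there are no groups. The following sketch uses made-up sample strings and parameter names.

```javascript
// With a string pattern there are no groups, so the replacer receives
// (match, offset, string).
const withOffsets = "aabbcc".replaceAll("b", (match, offset) => `[${offset}]`);
console.log(withOffsets); // "aa[2][3]cc"

// With a regular expression, the groups come between match and offset.
const swapped = "2021-03 2022-11".replaceAll(
  /(\d{4})-(\d{2})/g,
  (match, year, month) => `${month}/${year}`
);
console.log(swapped); // "03/2021 11/2022"
```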