String

Overview

Definition

A string is zero or more characters arranged together, enclosed in single or double quotation marks.

'abc';
"abc";

Double quotation marks can be used inside single quotation marks. Single quotation marks can be used inside double quotation marks.

'key = "value"';
"It's a long journey";

Both of the above are legal strings.

To use a single quotation mark inside a single-quoted string, escape the inner quotation mark with a backslash. The same applies to double quotation marks inside a double-quoted string.

'Did she say \'Hello\'?';
// "Did she say 'Hello'?"

"Did she say \"Hello\"?";
// "Did she say "Hello"?"

Because HTML attribute values use double quotation marks, many projects adopt the convention that JavaScript strings use only single quotation marks. This tutorial abides by this convention. Of course, it is perfectly fine to use only double quotation marks. The important thing is to stick to one style rather than switching between single and double quotes from one string to the next.

By default, a string must be written on a single line; splitting it across several lines causes an error.

'a
b
c'
// SyntaxError: Unexpected token ILLEGAL

The above code divides a string into three lines, and JavaScript will report an error.

If a long string must be split across several lines, end each line with a backslash.

var longString =
  "Long \
long \
long \
string";

longString;
// "Long long long string"

The above code shows that, with a backslash at the end of each line, a string that would otherwise be written on one line can be split across several. The result is still a single line, exactly as if it had been written on one line. Note that the backslash must be immediately followed by the line break, with no other characters (such as spaces) in between; otherwise an error is reported.

The concatenation operator (+) can join several single-line strings, so a long string can be split across lines in the source while still producing a single line of output.

var longString = "Long " + "long " + "long " + "string";
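
As a quick check of the two approaches above, the following sketch builds the same string once with backslash line continuations and once with the concatenation operator, then compares the results. The variable names viaBackslash and viaConcat are illustrative only.

```javascript
// Line continuation: each backslash must be the last character on its line.
var viaBackslash = "Long \
long \
long \
string";

// Concatenation: each piece carries its own trailing space.
var viaConcat = "Long " +
  "long " +
  "long " +
  "string";

console.log(viaBackslash === viaConcat); // true
console.log(viaBackslash.indexOf("\n")); // -1, no line break ends up in the string
```

Concatenation is usually the safer choice, since a stray space after a continuation backslash is easy to miss.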

If you want to output a multi-line string, there is a workaround using multi-line comments.

(function () {
  /*
line 1
line 2
line 3
*/
}
  .toString()
  .split("\n")
  .slice(1, -1)
  .join("\n"));
// "line 1
// line 2
// line 3"

In the above example, the output string is multiple lines.
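
A more conventional way to get the same result, sketched here as an alternative to the comment trick rather than as the recommended method, is to join the lines with the \n escape described in the next section. The names lines and multiLine are illustrative.

```javascript
// Each array element is one line of the final string.
var lines = ["line 1", "line 2", "line 3"];

// join("\n") puts a line break between neighbouring elements.
var multiLine = lines.join("\n");

console.log(multiLine);
// line 1
// line 2
// line 3

console.log(multiLine.split("\n").length); // 3
```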

Escaping

The backslash (\) has a special meaning in a string and is used to represent some special characters, so it is also called an escape character.

The special characters that need to be escaped with a backslash are mainly the following.

- \0: null (\u0000)
- \b: backspace (\u0008)
- \f: form feed (\u000C)
- \n: newline (\u000A)
- \r: carriage return (\u000D)
- \t: tab (\u0009)
- \v: vertical tab (\u000B)
- \': single quote (\u0027)
- \": double quote (\u0022)
- \\: backslash (\u005C)

Preceding these characters with a backslash indicates a special meaning.

console.log("1\n2");
// 1
// 2

In the above code, \n represents a line break, so the output is split across two lines.
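
The code points listed above can be checked directly. The sketch below pairs each escape sequence with the code point given in the list and confirms, via charCodeAt(), that each escape produces exactly one character with that value. The names checks and allMatch are illustrative.

```javascript
// Each entry: [string written with an escape, expected code point].
var checks = [
  ["\0", 0x0000],
  ["\b", 0x0008],
  ["\f", 0x000c],
  ["\n", 0x000a],
  ["\r", 0x000d],
  ["\t", 0x0009],
  ["\v", 0x000b],
  ["\'", 0x0027],
  ["\"", 0x0022],
  ["\\", 0x005c]
];

// Every escape should yield a single character with the listed code point.
var allMatch = checks.every(function (pair) {
  return pair[0].length === 1 && pair[0].charCodeAt(0) === pair[1];
});

console.log(allMatch); // true
```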

There are three special uses of backslashes.

(1) \HHH

The backslash is followed by three octal digits (000 to 377), representing one character. HHH corresponds to the Unicode code point of the character; for example, \251 represents the copyright symbol. This form can therefore only represent 256 characters. Note that octal escapes are a syntax error in strict mode.

(2) \xHH

\x is followed by two hexadecimal digits (00 to FF), representing one character. HH corresponds to the Unicode code point of the character; for example, \xA9 represents the copyright symbol. This form can also only represent 256 characters.

(3) \uXXXX

\u is followed by four hexadecimal digits (0000 to FFFF), representing one character. XXXX corresponds to the Unicode code point of the character; for example, \u00A9 represents the copyright symbol.

The following are examples of these three escape forms.

"\251"; // "©"
"\xA9"; // "©"
"\u00A9"; // "©"

"\172" === "z"; // true
"\x7A" === "z"; // true
"\u007A" === "z"; // true

If a backslash is placed in front of a non-special character, the backslash is ignored.

"\a";
// "a"

In the above code, a is an ordinary character, so the backslash before it has no special meaning and is dropped automatically.

To include a literal backslash in the content of a string, escape it with another backslash.

"Prev \\ Next";
// "Prev \ Next"

Strings and arrays

Strings can be regarded as character arrays, so the square bracket operator of the array can be used to return the characters at a certain position (position numbering starts from 0).

var s = "hello";
s[0]; // "h"
s[1]; // "e"
s[4]; // "o"

// Use the square bracket operator directly on the string
"hello"[1]; // "e"

If the index in the square brackets is outside the string (negative, or equal to or greater than its length), or is not a number at all, undefined is returned.

"abc"[3]; // undefined
"abc"[-1]; // undefined
"abc"["x"]; // undefined

However, that is where the similarity between strings and arrays ends. In fact, you cannot change a single character in a string.

var s = "hello";

delete s[0];
s; // "hello"

s[1] = "a";
s; // "hello"

s[5] = "!";
s; // "hello"

The above code shows that a single character in a string cannot be changed, added, or deleted; these operations fail silently (in strict mode they throw a TypeError instead).
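
Since a string cannot be modified in place, the usual approach is to build a new string. The sketch below uses a hypothetical helper, replaceAt (not a built-in method), that assembles the result from slices of the original.

```javascript
// Hypothetical helper: returns a NEW string with the character at `index`
// replaced by `replacement`. The original string is left untouched.
function replaceAt(str, index, replacement) {
  if (index < 0 || index >= str.length) {
    return str; // out of range: nothing to replace
  }
  return str.slice(0, index) + replacement + str.slice(index + 1);
}

var s = "hello";
var t = replaceAt(s, 1, "a");

console.log(t); // "hallo"
console.log(s); // "hello" (unchanged)

// "Adding" a character likewise means creating a new string.
console.log(s + "!"); // "hello!"
```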

length property

The length property returns the length of the string, and this property cannot be changed.

var s = "hello";
s.length; // 5

s.length = 3;
s.length; // 5

s.length = 7;
s.length; // 5

The above code means that the length property of the string cannot be changed, but no error will be reported.
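
The silent failure applies to non-strict code. As a small sketch, the hypothetical function assignLengthStrict below runs the same assignment in strict mode, where it throws a TypeError. To obtain a shorter string, create a new one with slice() instead.

```javascript
// Hypothetical helper: attempts the assignment inside a strict-mode function
// and reports what happened.
function assignLengthStrict() {
  "use strict";
  var s = "hello";
  try {
    s.length = 3; // writing to a read-only property of a primitive
    return "no error";
  } catch (e) {
    return e instanceof TypeError ? "TypeError" : "other error";
  }
}

console.log(assignLengthStrict()); // "TypeError"

// Shortening a string means building a new one.
var shortened = "hello".slice(0, 3);
console.log(shortened); // "hel"
```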

Character set

JavaScript uses the Unicode character set. Inside the JavaScript engine, all characters are represented by Unicode.

JavaScript not only stores characters in Unicode, but also allows direct use of Unicode code points to represent characters in the program, that is, the characters are written in the form of \uxxxx, where xxxx represents the Unicode code point of the character. For example, \u00A9 represents the copyright symbol.

var s = "\u00A9";
s; // "©"

When parsing the code, JavaScript will automatically recognize whether a character is represented in literal form or in Unicode form. When outputting to the user, all characters will be converted into literal form.

var f\u006F\u006F = "abc";
foo; // "abc"

In the above code, the variable name foo is written with Unicode escapes on the first line and in literal form on the second line. JavaScript recognizes both as the same identifier.

We also need to know that each character is stored in a 16-bit (2 bytes) UTF-16 format inside JavaScript. In other words, the unit character length of JavaScript is fixed at 16 bits, which is 2 bytes.

However, UTF-16 has two lengths: characters with code points between U+0000 and U+FFFF are 16 bits long (2 bytes); characters with code points between U+10000 and U+10FFFF are 32 bits long (4 bytes), where the first two bytes fall between 0xD800 and 0xDBFF and the last two bytes fall between 0xDC00 and 0xDFFF. For example, the character corresponding to code point U+1D306 is 𝌆, which is written in UTF-16 as 0xD834 0xDF06.
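
The split into two 16-bit values follows a fixed arithmetic rule. The sketch below uses a hypothetical helper, toSurrogatePair, which is only meaningful for code points between U+10000 and U+10FFFF, to reproduce the 0xD834 0xDF06 pair from the example.

```javascript
// Hypothetical helper: splits a code point in the range U+10000..U+10FFFF
// into its two UTF-16 code units (high surrogate, low surrogate).
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;               // 20-bit value
  var high = 0xd800 + Math.floor(offset / 0x400); // top 10 bits
  var low = 0xdc00 + (offset % 0x400);            // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x1d306);

console.log(pair[0].toString(16)); // "d834"
console.log(pair[1].toString(16)); // "df06"

// The two units together form the same string as the escaped pair.
console.log(String.fromCharCode(pair[0], pair[1]) === "\uD834\uDF06"); // true
```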

JavaScript's support for UTF-16 is incomplete. For historical reasons, it only handles two-byte characters and does not recognize four-byte characters. When the first version of JavaScript was released, Unicode code points only went up to U+FFFF, so two bytes were enough to represent any character. Unicode later grew to include more characters, and four-byte encodings appeared, but by then the JavaScript standard had been finalized with a uniform character length of two bytes, so four-byte characters cannot be recognized as single characters. The four-byte character 𝌆 mentioned above is correctly treated as one character by the browser, but JavaScript treats it as two.

"𝌆".length; // 2

In the above code, JavaScript considers the length of 𝌆 to be 2, not 1.

To sum up, for characters with code points between U+10000 and U+10FFFF, JavaScript always considers them to be two characters (the length property is 2). Keep this in mind when processing such text: the string length that JavaScript reports may not match the number of characters.
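
One way to work around this, sketched below under the assumption that only well-formed surrogate pairs should be merged, is to walk the string and count a high surrogate followed by a low surrogate as a single character. The function name countCodePoints is hypothetical.

```javascript
// Hypothetical helper: counts characters, treating each valid surrogate pair
// (high unit 0xD800-0xDBFF followed by low unit 0xDC00-0xDFFF) as one.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate, it belongs to the same character
      }
    }
    count++;
  }
  return count;
}

var symbol = "\uD834\uDF06"; // the character U+1D306 from the text above

console.log(symbol.length);                        // 2
console.log(countCodePoints(symbol));              // 1
console.log(countCodePoints("a" + symbol + "b"));  // 3
```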

Base64 transcoding

Sometimes, the text contains some non-printable symbols, such as symbols with ASCII codes 0 to 31 that cannot be printed. In this case, you can use Base64 encoding to convert them into printable characters. Another scenario is that sometimes binary data needs to be transmitted in text format, then Base64 encoding can also be used.

Base64 is an encoding method that can convert any value into printable text made up of 64 characters: 0-9, A-Z, a-z, + and /. The main purpose of using it is not encryption, but avoiding special characters and simplifying program processing.

JavaScript natively provides two methods related to Base64.

- btoa(): converts a value to Base64 encoding
- atob(): converts Base64 encoding back to the original value

var string = "Hello World!";
btoa(string); // "SGVsbG8gV29ybGQh"
atob("SGVsbG8gV29ybGQh"); // "Hello World!"

Note that these two methods do not work with non-ASCII text such as Chinese characters: btoa() throws an error for any character whose code point is above U+00FF.

btoa("你好"); // throws an error
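
To see where the limit lies, the sketch below wraps btoa() in a hypothetical helper, canBtoa, that reports whether the call succeeds. It assumes a global btoa() is available, as in browsers and recent versions of Node.js.

```javascript
// Hypothetical helper: returns true if btoa() accepts the string,
// false if it throws.
function canBtoa(str) {
  try {
    btoa(str);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(canBtoa("Hello World!")); // true
console.log(canBtoa("\u00FF"));       // true, highest code point btoa() accepts
console.log(canBtoa("\u0100"));       // false, first code point that is rejected
console.log(canBtoa("\u4F60\u597D")); // false, the two Chinese characters 你好
```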

To convert non-ASCII characters to Base64 encoding, add a transcoding step in between before using these two methods.

function b64Encode(str) {
  return btoa(encodeURIComponent(str));
}

function b64Decode(str) {
  return decodeURIComponent(atob(str));
}

b64Encode("你好"); // "JUU0JUJEJUEwJUU1JUE1JUJE"
b64Decode("JUU0JUJEJUEwJUU1JUE1JUJE"); // "你好"