HTML Character Encoding


Web pages contain a large amount of text, and a browser must know how that text is encoded in order to decode and display it correctly.

Normally, when a server sends an HTML page to a browser, it declares the page's encoding in an HTTP response header.

Content-Type: text/html; charset=UTF-8

In the header above, the Content-Type field first declares that the data sent by the server is of type text/html (i.e. an HTML web page), and then declares that the text encoding of the page is UTF-8.
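
As a rough illustration of how the two parts of this header value can be pulled apart, the following Python sketch uses the standard library's email.message.Message class, which understands MIME-style headers. The helper name parse_content_type is made up for this example, and real browsers use their own parsing logic.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_type() returns the lowercased "type/subtype" part;
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)
```

If the header carries no charset parameter, the second element of the returned tuple is None, and the encoding has to be found some other way.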

The encoding is also declared a second time inside the page itself, using the <meta> tag.

<meta charset="UTF-8" />
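
To show why this in-page declaration is useful, here is a minimal Python sketch that looks for it in the raw bytes of a page and then decodes the page accordingly. The function name sniff_meta_charset, the regular expression, and the choice to scan only the first 1024 bytes are assumptions made for illustration. Browsers follow a more elaborate detection algorithm.

```python
import re

# Matches <meta charset="..."> with or without quotes, case-insensitively.
META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta charset> tag, or a default."""
    match = META_CHARSET.search(raw[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


page = '<meta charset="UTF-8" /><p>中</p>'.encode("utf-8")
encoding = sniff_meta_charset(page)
text = page.decode(encoding)
print(encoding, text)
```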

Numeric representation of characters

Web pages can use many different encodings, but the most common by far is UTF-8. UTF-8 is one way of encoding the Unicode character set, which is designed to include every character in the world and currently contains more than 100,000 characters.

Each character has a Unicode number, called a code point. If you know the code point, you can find out what character it is. For example, the code point of the English letter a is 97 in decimal (61 in hexadecimal), and the code point of the Chinese character "中" is 20013 in decimal (4e2d in hexadecimal).
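
The relationship between a character, its code point, and its UTF-8 bytes can be checked with a short Python sketch. The helper name describe is hypothetical. ord() and chr() are Python built-ins that convert between characters and code points.

```python
def describe(ch: str):
    """Return (decimal code point, hexadecimal code point, UTF-8 bytes)."""
    code_point = ord(ch)
    return code_point, format(code_point, "x"), ch.encode("utf-8")


for ch in ("a", "中"):
    print(ch, describe(ch))

# chr() goes the other way: from a code point back to the character.
print(chr(97), chr(0x4E2D))
```

Note that the code point is a property of the character in Unicode, while the bytes depend on the encoding: "中" has one code point but occupies three bytes in UTF-8.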

Not every Unicode character can be written directly in HTML, for the following reasons.

(1) Some Unicode characters have no printable form. For example, the newline character, whose code point is 10 in decimal (A in hexadecimal), has no literal glyph.

(2) The less-than sign (<) and greater-than sign (>) are used to delimit HTML tags, so when these two symbols are needed as ordinary text they must be prevented from being interpreted as tags.

(3) Because there are so many Unicode characters, no input method allows all of them to be typed directly. In other words, no single keyboard can enter every symbol.

(4) A web page cannot mix encodings, so a page encoded in UTF-8 cannot directly contain characters written in another encoding.

HTML solves these problems by allowing Unicode code points to be used to represent characters, and the browser will automatically convert the code points to the corresponding characters.

The code point representation of a character is &#N; (decimal, where N is the code point) or &#xN; (hexadecimal, where N is the code point). For example, the character a can be written as &#97; (decimal) or &#x61; (hexadecimal), and the character "中" can be written as &#20013; (decimal) or &#x4e2d; (hexadecimal). The browser converts them automatically.

<p>中</p>
<!-- is equivalent to -->
<p>&#20013;</p>
<!-- is equivalent to -->
<p>&#x4e2d;</p>

In the above code, characters can be represented directly, or using decimal code points or hexadecimal code points.
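
The conversion in both directions can be sketched in Python. The two helper functions below are hypothetical names introduced for this example. html.unescape() is part of the Python standard library and decodes numeric references in roughly the way a browser would.

```python
import html


def to_decimal_refs(text: str) -> str:
    """Write every character as a decimal reference, &#N;."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_hex_refs(text: str) -> str:
    """Write every character as a hexadecimal reference, &#xN;."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)


decimal_form = to_decimal_refs("a中")
hex_form = to_hex_refs("a中")
print(decimal_form, hex_form)

# Decoding either form gives back the original characters.
print(html.unescape(decimal_form), html.unescape(hex_form))
```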

Note that HTML tags themselves cannot be written using code points; if they are, the browser treats them as text content to be displayed rather than as tags. For example, if <p> is written as <&#112;> or &#60;&#112;&#62;, the browser no longer considers it a tag and instead displays the literal text <p>.
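
This behaviour can be observed with Python's html.parser module, used here as a stand-in for a browser (an assumption: it is not a full browser parser, but it treats this case the same way). The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Record which start tags and which text the parser sees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def tags_and_text(markup: str):
    parser = TagAndTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags, "".join(parser.text_parts)


print(tags_and_text("<p>hi</p>"))      # a real tag plus its text
print(tags_and_text("&#60;p&#62;hi"))  # no tag at all, only text
```

In the second call the parser reports no tags, and the decoded references become part of the text.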

Entity representation of characters

The inconvenience of the numeric representation is that you have to know the code point of each character, which is hard to remember. To allow for quick input, HTML provides easy-to-remember names for some special characters, allowing them to be represented by their names, called entity representations.

The entity is written as &name;, where name is the name of the character. Here are some of these special characters, and their corresponding entities.

  • <: &lt;
  • >: &gt;
  • ": &quot;
  • ': &apos;
  • &: &amp;
  • ©: &copy;
  • #: &num;
  • §: &sect;
  • ¥: &yen;
  • $: &dollar;
  • £: &pound;
  • ¢: &cent;
  • %: &percnt;
  • *: &ast;
  • @: &commat;
  • ^: &Hat;
  • ±: &plusmn;
  • Non-breaking space: &nbsp;

Note that the last item in the list above is a space character (specifically the non-breaking space, U+00A0), which also has an entity representation.
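
For a quick check of how entities map to characters, the Python standard library offers html.escape() and html.unescape(). The sketch below is only an illustration and the sample strings are made up. Note that html.escape() replaces only the characters that are significant in markup, not every character that has a named entity.

```python
import html

# Escape text so that it is displayed literally instead of parsed as markup.
raw = '<a href="x">Tom & Jerry</a>'
escaped = html.escape(raw)
print(escaped)

# Decode named entities back into characters.
symbols = html.unescape("&copy; &plusmn; &yen; &ast; &commat; &Hat;")
print(symbols)

# &nbsp; decodes to U+00A0 (non-breaking space), not the ordinary space U+0020.
nbsp = html.unescape("&nbsp;")
print(repr(nbsp))
```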

Both the numeric representation and the entity representation make it possible to write characters that cannot be entered in the normal way, getting around the restrictions described above. This is why the technique is known as character "escaping".