URL Introduction

Overview

URL stands for "Uniform Resource Locator", commonly called a "web address". It is the Internet address of a resource. Below is a typical URL.

https://www.example.com/path/index.html

A resource can be understood simply as any file that can be accessed over the Internet, such as a web page, image, audio file, video, or JavaScript script. A resource can be retrieved from the Internet only if its URL is known.

Any resource that can be accessed over the Internet has a corresponding URL. One URL corresponds to one resource, but the same resource may be reachable through multiple URLs.

URLs are the foundation of the Internet. The Internet is "interconnected" because web pages can reference other URLs through "links". By clicking a link, the user can jump from one URL to another and visit different websites.

Components of the URL

A URL consists of several parts. The following is a fairly complex URL; real URLs usually do not have this many parts.

https://www.example.com:80/path/to/myfile.html?key1=value1&key2=value2#anchor

Let's take a look at the various parts of this URL.
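As a quick preview of the parts described below, here is a minimal sketch that splits the example URL with Python's standard urllib.parse module. The variable names are only illustrative, and the library's attribute names (scheme, hostname, fragment) differ slightly from the terms used in this chapter.

```python
from urllib.parse import urlsplit

url = "https://www.example.com:80/path/to/myfile.html?key1=value1&key2=value2#anchor"
parts = urlsplit(url)

print(parts.scheme)    # https  (the protocol)
print(parts.hostname)  # www.example.com  (the host)
print(parts.port)      # 80
print(parts.path)      # /path/to/myfile.html
print(parts.query)     # key1=value1&key2=value2
print(parts.fragment)  # anchor
```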

Protocol

The protocol (also called the scheme) is the method the browser uses to request resources from the server. In the example above it is the https:// part, which means that the HTTPS protocol is used.

The Internet supports many protocols, so a URL must state which one it uses. If you omit the protocol and type www.example.com directly into the browser's address bar, the browser supplies one for you. Traditionally this default was HTTP, that is, http://www.example.com, although many current browsers try HTTPS first. HTTPS is the encrypted version of HTTP, and for security reasons more and more websites use it.

The protocol names of HTTP and HTTPS are followed by a colon and two slashes (://). Other protocols may differ. For example, the email protocol mailto: has only a colon after its name, as in mailto:foo@example.com.
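The difference between the two forms can be seen with a small sketch using Python's urllib.parse. In the mailto: case there is no // part, so the parser reports an empty network location and puts the address in the path.

```python
from urllib.parse import urlsplit

web = urlsplit("https://www.example.com/path/index.html")
mail = urlsplit("mailto:foo@example.com")

print(web.scheme, web.netloc)  # https www.example.com
print(mail.scheme)             # mailto
print(repr(mail.netloc))       # ''
print(mail.path)               # foo@example.com
```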

Host

The host is the name of the website or server where the resource is located, usually a domain name. The host in the example above is www.example.com.

Some hosts do not have a domain name but only an IP address, such as 192.168.2.15. This situation often occurs in local area networks.
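A minimal sketch of telling the two kinds of host apart, using Python's standard ipaddress module. The function name host_kind and its return strings are hypothetical, chosen only for this illustration.

```python
import ipaddress
from urllib.parse import urlsplit

def host_kind(url):
    """Report whether the host of a URL is an IP address or a domain name."""
    host = urlsplit(url).hostname
    if host is None:
        return "no host"
    try:
        ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP
    except ValueError:
        return "domain name"
    return "IP address"

print(host_kind("http://192.168.2.15/index.html"))   # IP address
print(host_kind("https://www.example.com/"))         # domain name
```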

Port

The same domain name may host several websites at the same time, distinguished by port. A port is an integer that can be understood simply as the visitor telling the server which website they want to visit. The default port is 80 for HTTP and 443 for HTTPS. If the port is omitted, the browser connects to the default port for the protocol.

The port immediately follows the domain name, separated by a colon, such as www.example.com:80.
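The following sketch shows how an omitted port might be filled in, assuming the conventional defaults of 80 for HTTP and 443 for HTTPS. The helper effective_port and the DEFAULT_PORTS table are hypothetical names for this illustration; urllib.parse itself simply reports None when no port is written.

```python
from urllib.parse import urlsplit

# Conventional default ports (an assumption of this sketch, limited to two protocols).
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the explicit port if present, otherwise the protocol's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://www.example.com/"))        # 80
print(effective_port("https://www.example.com/"))       # 443
print(effective_port("https://www.example.com:8080/"))  # 8080
```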

Path

The path is the location of the resource within the website. For example, the path /path/index.html points to the web page file index.html in the /path subdirectory of the website.

In the early days of the Internet, a path was a physical location that actually existed on the server. Now that servers can simulate these locations, a path is often just a virtual location.

The path may contain only a directory and no file name, such as /foo/, and even the trailing slash can be omitted. In this case the server usually serves the index.html file in that directory by default (that is, the request is equivalent to /foo/index.html), but it may also handle the request differently (for example, by listing all the files in the directory), depending on the server settings. Generally speaking, visiting www.example.com is likely to return the web page file www.example.com/index.html.
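The default behavior described above can be sketched as a small mapping function. This is only an illustration of one possible server rule, not how any particular server is implemented: the function resolve_index, the known_dirs set, and the index.html default are all assumptions of the sketch.

```python
def resolve_index(path, known_dirs, index_name="index.html"):
    """Map a directory-style request path to its index file (hypothetical rule)."""
    if path == "":
        path = "/"                          # an empty path means the site root
    if path.endswith("/"):
        return path + index_name            # /foo/ -> /foo/index.html
    if path in known_dirs:
        return path + "/" + index_name      # /foo  -> /foo/index.html
    return path                             # already names a file

known_dirs = {"/foo"}  # directories this imaginary server knows about
print(resolve_index("/foo/", known_dirs))          # /foo/index.html
print(resolve_index("/foo", known_dirs))           # /foo/index.html
print(resolve_index("", known_dirs))               # /index.html
print(resolve_index("/foo/bar.html", known_dirs))  # /foo/bar.html
```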

Query parameters

Query parameters are additional information provided to the server. They come after the path, separated from it by ?. In the example above they are ?key1=value1&key2=value2.

There can be one or more groups of query parameters. Each group is a key-value pair consisting of a key name and a key value joined by an equal sign (=). For example, key1=value1 is a key-value pair in which key1 is the key name and value1 is the key value.

Multiple groups of parameters are joined with &, as in key1=value1&key2=value2.
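A short sketch of reading and rebuilding query parameters with Python's urllib.parse. Note that parse_qs returns a list for each key, because the same key name may appear more than once in a query string.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit

url = "https://www.example.com/path/to/myfile.html?key1=value1&key2=value2"
query = urlsplit(url).query

pairs = parse_qsl(query)   # ordered list of (key, value) pairs
print(pairs)               # [('key1', 'value1'), ('key2', 'value2')]

as_dict = parse_qs(query)  # each key maps to a list of values
print(as_dict)             # {'key1': ['value1'], 'key2': ['value2']}

rebuilt = urlencode(pairs) # join the pairs again with = and &
print(rebuilt)             # key1=value1&key2=value2
```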

Anchor

The anchor (also called the fragment) is a position inside the web page. It is written as # followed by the anchor name and placed at the end of the URL, such as #anchor. After the browser loads the page, it automatically scrolls to that position.

The anchor name is given by the id attribute of an element in the web page. For details, see the chapter "Element Properties".
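A minimal sketch of separating the anchor from the rest of the URL with urllib.parse.urldefrag. The anchor is handled by the browser after the page loads, so the part before # is what identifies the page itself.

```python
from urllib.parse import urldefrag

page_url, anchor = urldefrag("https://www.example.com/path/index.html#anchor")

print(page_url)  # https://www.example.com/path/index.html
print(anchor)    # anchor
```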

URL characters

Only the following characters can be used freely in the various components of a URL.

- 26 English letters (both uppercase and lowercase)
- 10 Arabic numerals
- Hyphen (-)
- Period (.)
- Underscore (_)
- Tilde (~)

In addition, 18 characters are reserved characters of the URL. They have a special meaning in particular positions. For example, the question mark (?) marks the beginning of the query parameters, so a question mark placed in the path would be read as the start of the query and the URL would be parsed incorrectly. To use a reserved character as ordinary data in another part of the URL, you must use its escaped form.

A URL character is escaped by writing a percent sign (%) followed by the hexadecimal ASCII code of the character. The following are the 18 reserved characters and their escaped forms.

- !: %21
- #: %23
- $: %24
- &: %26
- ': %27
- (: %28
- ): %29
- *: %2A
- +: %2B
- ,: %2C
- /: %2F
- :: %3A
- ;: %3B
- =: %3D
- ?: %3F
- @: %40
- [: %5B
- ]: %5D

For example, if the file name of a web page is foo?bar.html, that is, the file name itself contains a question mark, then it must be written as foo%3Fbar.html in the URL.

The legal characters of a URL can also be escaped in this way, although this is not recommended. For example, the hexadecimal ASCII code of the letter a is 61, so its escaped form is %61. Thus www.apple.com can be written as www.%61pple.com, and browsers will still recognize it.

Note that the escaped form of a space is %20. File names that contain spaces must be escaped this way.
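The escaping rule can be checked with urllib.parse.quote and unquote. This is a sketch: passing safe="" asks quote to escape every reserved character, whereas by default it leaves / untouched. The name RESERVED is just a local variable for this illustration.

```python
from urllib.parse import quote, unquote

# The 18 reserved characters listed above.
RESERVED = "!#$&'()*+,/:;=?@[]"
table = {ch: quote(ch, safe="") for ch in RESERVED}
print(len(table))  # 18
print(table["?"])  # %3F
print(table["*"])  # %2A

print(quote("foo?bar.html", safe=""))  # foo%3Fbar.html
print(quote("my file.html", safe=""))  # my%20file.html
print(unquote("www.%61pple.com"))      # www.apple.com
```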

Other characters that are neither legal characters nor reserved characters (such as Chinese characters) do not, in theory, need to be escaped by hand. They can be written directly in the URL, as in www.example.com/中国.html, and the browser will escape them automatically before sending the request to the server. The escaping uses the hexadecimal UTF-8 encoding of each character: the digits are taken two at a time, and a percent sign (%) is placed in front of each group.

For example, the hexadecimal UTF-8 encoding of the Chinese character 中 is e4b8ad. Split into groups of two digits, its escaped form is %e4%b8%ad. In other words, wherever 中 appears in the URL, it is sent as %e4%b8%ad. Therefore, a request for www.example.com/中国.html actually uses the following URL.

www.example.com/%e4%b8%ad%e5%9b%bd.html

In the URL above, the escaped form of 中 is %e4%b8%ad and the escaped form of 国 is %e5%9b%bd.
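The UTF-8 escaping rule can be reproduced by hand in a few lines. The helper percent_encode_utf8 is a hypothetical name for this sketch. The standard library's quote does the same job but emits uppercase hexadecimal digits, which are equivalent.

```python
from urllib.parse import quote, unquote

def percent_encode_utf8(text):
    """Escape every UTF-8 byte of text as % followed by two hex digits."""
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))

print(percent_encode_utf8("中"))  # %e4%b8%ad
print(percent_encode_utf8("国"))  # %e5%9b%bd

# quote leaves legal characters and / alone and escapes the rest.
print(quote("/中国.html"))                   # /%E4%B8%AD%E5%9B%BD.html
print(unquote("/%e4%b8%ad%e5%9b%bd.html"))  # /中国.html
```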

Absolute URL and relative URL

There are two types of URLs: absolute URLs and relative URLs.

An absolute URL locates a resource using only the URL itself. This means that the URL must contain the complete information about the resource, including the protocol, host, and path. The previous examples are all absolute URLs.

A relative URL does not contain all the information needed to locate the resource. It must be combined with the location of the current web page. For example, suppose the URL of the current web page is https://www.example.com/path/index.html and the page refers to a resource with the URL a.html. This is a relative URL, because a.html alone is not enough to locate the resource. The browser assumes that a.html is in the same subdirectory as the current page, and so obtains the absolute URL https://www.example.com/path/a.html.

If a relative URL starts with a slash (/), it is resolved from the root directory of the website. Otherwise, it is resolved from the current directory. For example, the relative URL /foo/bar.html refers to the foo subdirectory of the website root directory, while foo/bar.html refers to the foo subdirectory of the current directory.

URLs can also use two special abbreviations to indicate specific locations.

- .: indicates the current directory, such as ./a.html (the file a.html in the current directory)
- ..: indicates the parent directory, such as ../a.html (the file a.html in the parent directory)

These two abbreviations can be chained. For example, ../../ refers to the directory two levels up.

Absolute URLs can also use these two abbreviations. For example, www.example.com/./index.html is equivalent to www.example.com/index.html, because . here refers to the current directory of the root directory, which is the root directory itself.
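The resolution rules for relative URLs can be tried out with urllib.parse.urljoin, which combines a base URL with a relative one in much the same way a browser does. This is a sketch: the variable current stands for the URL of the current web page from the earlier example.

```python
from urllib.parse import urljoin

current = "https://www.example.com/path/index.html"

print(urljoin(current, "a.html"))         # https://www.example.com/path/a.html
print(urljoin(current, "foo/bar.html"))   # https://www.example.com/path/foo/bar.html
print(urljoin(current, "/foo/bar.html"))  # https://www.example.com/foo/bar.html
print(urljoin(current, "./a.html"))       # https://www.example.com/path/a.html
print(urljoin(current, "../a.html"))      # https://www.example.com/a.html
```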

<base>

The <base> tag specifies the base against which all relative URLs in the web page are resolved. A web page can have only one <base> tag, and it must be placed inside <head>. It is a standalone tag with no closing tag. The following is an example.

<head>
  <base href="https://www.example.com/files/" target="_blank" />
</head>

The href attribute of the <base> tag gives the base URL, and the target attribute specifies how links are opened (see the chapter "Links"). With the base https://www.example.com/files/, the relative URL foo.html is converted into the absolute URL https://www.example.com/files/foo.html.

Note that the <base> tag must have at least one of the href attribute or the target attribute.

<base href="http://foo.com/app/" />
<base target="_blank" />

Once <base> is set, it applies to the entire web page. To make a particular link behave differently, you must use an absolute URL for it instead of a relative one. Pay special attention to anchors: a link such as #anchor is now resolved against <base>, not against the URL of the current web page.
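The effect of <base> on relative URLs and anchors can be imitated with urllib.parse.urljoin. This is only a sketch of the resolution rule, not of browser internals. The variable page_url is an assumed address for the current web page, and base_href is the value from the example above.

```python
from urllib.parse import urljoin

page_url = "https://www.example.com/path/index.html"  # assumed current page
base_href = "https://www.example.com/files/"          # value of <base href>

# With <base>, relative URLs and anchors resolve against base_href.
print(urljoin(base_href, "foo.html"))  # https://www.example.com/files/foo.html
print(urljoin(base_href, "#anchor"))   # https://www.example.com/files/#anchor

# Without <base>, the same anchor would stay on the current page.
print(urljoin(page_url, "#anchor"))    # https://www.example.com/path/index.html#anchor
```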