Character encodings in HTML
HTML (Hypertext Markup Language) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "
charset" in the Hypertext Transfer Protocol (HTTP)
Content-Type header, which would typically look like this:
This method gives the HTTP server a convenient way to alter document's encoding according to content negotiation; certain HTTP server software can do it, for example Apache with the module
For HTML it is possible to include this information inside the
head element near the top of the document:
As the character encoding cannot be known until this[clarification needed] declaration is parsed, there can be a problem knowing which encoding is used for the declaration itself. The main principle is that the declaration shall be encoded in pure ASCII, and therefore (if the declaration is inside the file) the encoding needs to be an ASCII extension. In order to allow encodings not backwards compatible with ASCII, browsers must be able to parse declarations in such encodings. Examples of such encodings are UTF-16BE and UTF-16LE.
As of HTML5 the recommended charset is UTF-8. An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
For ASCII-compatible character encodings the consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly—in some cases, always—require characters outside that range. In CJK environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override incorrect charset label manually as well.
It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
The WHATWG Encoding Standard, referenced by recent HTML standards (the current WHATWG HTML Living Standard, as well as the formerly competing W3C HTML 5.0 and 5.1) specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings. The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use UTF-8 exclusively.
Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard:
The following additional encodings are listed in the Encoding Standard, and support for them is therefore also required:
The following encodings are listed as explicit examples of forbidden encodings:
The standard also defines a "replacement" decoder, which maps all content labelled as certain encodings to the replacement character (�), refusing to process it at all. This is intended to prevent attacks (e.g. cross site scripting) which may exploit a difference between the client and server in what encodings are supported in order to mask malicious content. Although the same security concern applies to ISO-2022-JP and UTF-16, which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them since they are comparatively more frequently used in deployed content. The following encodings receive this treatment:
In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.
where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
Not all web browsers or email clients used by receivers of HTML documents, or text editors used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters they cannot render.
For codes from 0 to 127, the original 7-bit ASCII standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using character entity names. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.
Character entity references can also have the format
&name; where name is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as
λ in an HTML document. The character entity references
& are predefined in HTML and SGML, because
& are already used to delimit markup. This notably did not include XML's
' (') entity prior to HTML5. For a list of all named HTML character entity references along with the versions in which they were introduced, see List of XML and HTML character entity references.
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native Unicode encoding like UTF-8 is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as cross-site scripting. If HTML attributes are left unquoted, certain characters, most importantly whitespace, such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:
All other character entity references have to be defined before they can be used. For example, use of
é (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the
x in hexadecimal numeric references be in lowercase: for example
ਛ rather than
ਛ. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.