[html] Which characters need to be escaped in HTML?

Are they the same as XML, perhaps plus the space one (&nbsp;)?

I've found some huge lists of HTML escape characters but I don't think they must be escaped. I want to know what needs to be escaped.

The answer is


It depends upon the context. Some possible contexts in HTML:

  • document body
  • inside common attributes
  • inside script tags
  • inside style tags
  • several more!

See OWASP's Cross Site Scripting Prevention Cheat Sheet, especially the "Why Can't I Just HTML Entity Encode Untrusted Data?" and "XSS Prevention Rules" sections. However, it's best to read the whole document.


The exact answer depends on the context. In general, these characters must not be present (HTML 5.2 §3.2.4.2.5):

Text nodes and attribute values must consist of Unicode characters, must not contain U+0000 characters, must not contain permanently undefined Unicode characters (noncharacters), and must not contain control characters other than space characters. This specification includes extra constraints on the exact value of Text nodes and attribute values depending on their precise context.
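As a rough illustration of the quoted constraint, the following Python sketch lists the characters in a string that the rule forbids. The function names are hypothetical, and the sketch assumes that "control characters" means Unicode category Cc and that "space characters" means space, tab, LF, FF and CR.

```python
import unicodedata

# Assumption: the "space characters" allowed by the quoted rule.
SPACE_CHARACTERS = {" ", "\t", "\n", "\f", "\r"}


def is_noncharacter(code_point):
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def forbidden_characters(text):
    """Return the characters in text that the quoted rule does not allow."""
    bad = []
    for ch in text:
        if ch in SPACE_CHARACTERS:
            continue
        code_point = ord(ch)
        # U+0000 is itself a control character; it is listed explicitly
        # only to mirror the wording of the rule.
        if (code_point == 0
                or is_noncharacter(code_point)
                or unicodedata.category(ch) == "Cc"):
            bad.append(ch)
    return bad
```

An empty result only means the string passes this general rule; the context-specific constraints still apply.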

For elements in HTML, the constraints of the Text content model also depend on the kind of element. For instance, an "<" inside a textarea element does not need to be escaped in HTML because textarea is an escapable raw text element.

These restrictions are scattered across the specification. E.g., attribute values (§8.1.2.3) must not contain an ambiguous ampersand and must be either (i) empty, (ii) within single quotes (and thus must not contain the U+0027 APOSTROPHE character '), (iii) within double quotes (and thus must not contain the U+0022 QUOTATION MARK character "), or (iv) unquoted, with the following restrictions:

... must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
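The unquoted-attribute restrictions quoted above can be sketched as a small Python check. The function name is hypothetical, and the sketch does not look for ambiguous ampersands, which need a separate test.

```python
# Characters the quoted rule forbids in an unquoted attribute value:
# the space characters, both quotes, =, <, > and the grave accent.
FORBIDDEN_UNQUOTED = set(" \t\n\f\r\"'=<>`")


def can_be_unquoted(value):
    """Return True if value may be written as an unquoted attribute value."""
    if value == "":
        return False  # an unquoted value must not be the empty string
    return not any(ch in FORBIDDEN_UNQUOTED for ch in value)
```

When the check fails, the simplest option is to quote the value and escape the matching quotation mark.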


Basically, three characters should always be escaped in HTML and XML files so that they do not interact with the rest of the markup. As you might expect, two of them are the tag delimiters, < and >. The three are:

 1)  &lt; (<)
    
 2)  &gt; (>)
    
 3)  &amp; (&)

You may also escape the double quote (") as &quot; and the single quote (') as &apos;
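A minimal sketch of these replacements in Python, using the standard library's html.escape. The wrapper names are hypothetical. Note that html.escape writes the single quote as the numeric reference &#x27; rather than &apos;.

```python
import html


def escape_text(value):
    """Escape &, < and > for use in ordinary element content."""
    return html.escape(value, quote=False)


def escape_attribute(value):
    """Escape &, <, >, the double quote and the single quote
    for use inside a quoted attribute value."""
    return html.escape(value, quote=True)


print(escape_text('<b> & "q"'))
print(escape_attribute('<a href="x">Tom & Jerry\'s</a>'))
```

These helpers cover element content and quoted attributes only; they are not suitable for content placed inside script or style elements.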

Avoid putting dynamic content in <script> and <style>. These rules do not apply to them. For example, if you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialisation.
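The JSON case can be sketched in Python as follows. The function name is hypothetical, and the sketch deviates from the text in one place: it writes < as \u003c instead of \x3c, on the assumption that the output should stay valid JSON as well as valid JavaScript (\x3c is a JavaScript-only escape).

```python
import json


def json_for_script(value):
    """Serialise value as JSON that is safe to place inside a script element."""
    # ensure_ascii=False keeps non-ASCII characters literal, so the two
    # line-separator characters have to be escaped by hand afterwards.
    text = json.dumps(value, ensure_ascii=False)
    return (
        text.replace("<", "\\u003c")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )


payload = json_for_script({"s": "</script>\u2028"})
print(payload)
```

Because the escapes sit inside JSON string literals, parsing the payload gives back the original value unchanged.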

HTML Escape Characters: Complete List: http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php

So you need to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is also the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. But if you don’t want to terminate the attribute value there, escape the quotation mark.
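For a double-quoted attribute, this reduces to two replacements. A minimal Python sketch follows; the function name is hypothetical, and it escapes every ampersand rather than only those that could begin a character reference, which is always safe.

```python
def quote_attribute(value):
    """Return value wrapped in double quotes, escaped for that context."""
    # Escape & first so the ampersands introduced by &quot; are not re-escaped.
    escaped = value.replace("&", "&amp;").replace('"', "&quot;")
    return f'"{escaped}"'


print(quote_attribute('a "b" & c'))
```

A single quote needs no escaping here, because only the matching double quote can end the value.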

Changing to UTF-8 means re-saving your file:

Using the character encoding UTF-8 for your page means that you can avoid the need for most escapes and just work with characters. Note, however, that to change the encoding of your document, it is not enough to just change the encoding declaration at the top of the page or on the server. You need to re-save your document in that encoding. For help understanding how to do that with your application read Setting encoding in web authoring applications.
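If the authoring application cannot re-save the file, the re-encoding step can be sketched in Python. The function name is hypothetical, and the source encoding is an assumption that you have to supply correctly yourself. The encoding declaration in the page or on the server still has to be updated separately.

```python
def resave_as_utf8(path, source_encoding):
    """Rewrite the file at path, converting it from source_encoding to UTF-8."""
    # newline="" leaves the existing line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

For example, resave_as_utf8("page.html", "latin-1") would convert a hypothetical Latin-1 file in place. Keep a backup, since a wrong source encoding silently corrupts the text.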

Invisible or ambiguous characters:

A particularly useful role for escapes is to represent characters that are invisible or ambiguous in presentation.

One example would be Unicode character U+200F RIGHT-TO-LEFT MARK. This character can be used to clarify directionality in bidirectional text (e.g. when using the Arabic or Hebrew scripts). It has no graphic form, however, so it is difficult to see where these characters are in the text, and if they are lost or forgotten they could create unexpected results during later editing. Using &rlm; (or its numeric character reference equivalent &#x200F;) instead makes it very easy to spot these characters.

An example of an ambiguous character is U+00A0 NO-BREAK SPACE. This type of space prevents line breaking, but it looks just like any other space when used as a character. Using &nbsp; makes it quite clear where such spaces appear in the text.
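A small Python sketch of this idea replaces a chosen set of invisible or ambiguous characters with numeric character references. The function name and the choice of characters are assumptions for illustration.

```python
# Assumption: the characters we want to make visible in the source.
INVISIBLE = {
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u00a0",  # NO-BREAK SPACE
}


def show_invisible(text):
    """Replace each character in INVISIBLE with a hexadecimal character reference."""
    return "".join(
        f"&#x{ord(ch):X};" if ch in INVISIBLE else ch
        for ch in text
    )


print(show_invisible("a\u00a0b\u200f"))  # prints a&#xA0;b&#x200F;
```

The output renders the same in a browser, but the special characters are now easy to spot when editing the source.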

