Do I really need to encode '&' as '&amp;'?

Question

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.

http://validator.w3.org is giving me this:

    & did not start a character reference. (& probably should have been escaped as &amp;.)

Do I really need to do &amp;? I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.

User · Answer

Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.

Encoding & as &amp; under all circumstances is, for me, an easier rule to live by, reducing the likelihood of errors and failures.

Compare the following: which is easier? Which is easier to bugger up?

Methodology 1

- Write some content which includes ampersand characters.
- Encode them all.

Methodology 2 (with a grain of salt, please)

- Write some content which includes ampersand characters.
- On a case-by-case basis, look at each ampersand and determine which of these applies:
  - It is isolated, and as such unambiguously an ampersand, e.g. "volt & amp". In that case don't bother encoding it.
  - It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve, e.g. "amp&volt". In that case don't bother encoding it.
  - It is not isolated, and ambiguous, e.g. "volt&amp". Encode it.
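To make the three cases in Methodology 2 concrete, here is a minimal Python sketch. It assumes that html.unescape from the standard library is a fair stand-in for how an HTML5 parser decodes text content (it does not model the slightly different rules for attribute values), and the variable names are only illustrative.

```python
import html

# The three example strings from Methodology 2, left unencoded.
cases = ["volt & amp", "amp&volt", "volt&amp"]

# Assumption: html.unescape approximates HTML5 decoding of text content.
decoded = {raw: html.unescape(raw) for raw in cases}

for raw, shown in decoded.items():
    print(f"{raw!r} is displayed as {shown!r}")

# Methodology 1: encode every ampersand first; decoding then always
# gives back exactly what was written.
round_trip = [html.unescape(html.escape(raw)) for raw in cases]
print(round_trip == cases)
```

Only the third string changes when left unencoded, which is the one case Methodology 2 tells you to encode; Methodology 1 gets the same result without having to look at each ampersand.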

User · Answer

Could you show us what your title actually is? When I submit

    <!DOCTYPE html>
    <html>
    <title>Dolce & Gabbana</title>
    <body>
    <p>am i allowed loose & mpersands?</p>
    </body>
    </html>

to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s.

User · Answer

It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.

For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.

For your own static HTML, sure, you could skip it, but it's so trivial to include proper escaping that there's no good reason to avoid it.
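As a rough illustration of the forum-subject case, the sketch below escapes a user-provided subject before placing it in a title element. The helper name render_title and the sample subject are made up for this example; html.escape is the Python standard-library function.

```python
import html


def render_title(subject: str) -> str:
    """Build a <title> element from untrusted text (hypothetical helper)."""
    return f"<title>{html.escape(subject)}</title>"


# Example input where an ampersand and a semicolon happen to form an entity.
user_subject = "Fish &copy; chips"

unsafe = f"<title>{user_subject}</title>"  # a browser would decode &copy;
safe = render_title(user_subject)          # the ampersand becomes &amp;

print(unsafe)
print(safe)
```

With the escaped version, the reader sees the subject exactly as the user typed it, whatever semicolons it contains.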

User · Answer

I think this has turned into more of a question of "why follow the spec when browsers don't care?" Here is my generalized answer:

Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary, and where we don't have to figure out why our layouts break in a particular browser, or how to work around that.

Specifically, if HTML5 does not require using &amp; in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.

User · Answer

HTML5 rules are different from HTML4. Escaping is not required in HTML5, unless the ampersand and the text after it look like a character reference, as can happen with a URL parameter name. "&copy=2" is still a problem, for example, since &copy is the copyright symbol.

However, it seems to me that it's harder work to decide whether or not to encode depending on the following text. So the easiest path is probably to encode all the time.
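The "&copy=2" problem can be reproduced with a short Python sketch. It assumes html.unescape as an approximation of HTML5 decoding for text content; browsers apply an extra exception inside attribute values for references that lack a semicolon, which this sketch does not model. The sample URL is invented for illustration.

```python
import html

raw = "page?id=1&copy=2"  # hypothetical query string, left unescaped

# Decoded as text content: the "&copy" prefix is taken as the copyright sign.
mangled = html.unescape(raw)

# Escaped first, then decoded: the original string survives intact.
intact = html.unescape(html.escape(raw))

print(mangled)
print(intact)
```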

User · Answer

Update (March 2020): The W3C validator no longer complains about escaping URLs.

I was checking why image URLs need escaping, hence tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. (PS: I guess it will be unescaped when it's consumed, since URLs need &. Can anyone clarify?)

    <img alt="" src="foo?bar=qut&qux=fop" />

The validator's message:

    An entity reference was found in the document, but there is no reference by that name defined. Often this is caused by misspelling the reference name, unencoded ampersands, or by leaving off the trailing semicolon (;). The most common cause of this error is unencoded ampersands in URLs as described by the WDG in "Ampersands in URLs".

    Entity references start with an ampersand (&) and end with a semicolon (;). If you want to use a literal ampersand in your document you must encode it as "&amp;" (even inside URLs!). Be careful to end entity references with a semicolon or your entity reference may get interpreted in connection with the following text. Also keep in mind that named entity references are case-sensitive; &Aelig; and &aelig; are different characters.

    If this error appears in some markup generated by PHP's session handling code, this article has explanations and solutions to your problem.
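On the question of whether the URL is unescaped again when it is consumed: the HTML parser decodes &amp; in the attribute value back to & before the URL is used, so the request still carries a plain ampersand. A minimal Python sketch of that round trip follows; the URL is modeled on the img example above, and html.unescape is assumed here as a stand-in for the parser's decoding step.

```python
import html
from urllib.parse import urlencode

# Build the query string; urlencode joins parameters with a raw '&'.
query = urlencode({"bar": "qut", "qux": "fop"})
url = f"foo?{query}"

# Escape the whole URL when writing it into an HTML attribute.
tag = f'<img alt="" src="{html.escape(url, quote=True)}" />'
print(tag)

# What the parser hands on after decoding the attribute value.
decoded_src = html.unescape(html.escape(url, quote=True))
print(decoded_src)
```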

User · Answer

Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as "Do I really need to encode '&' as '&'?"

If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.

User · Answer

In HTML, a & marks the beginning of a reference, either a character reference or an entity reference. From that point on the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That's the normal behavior.

But if the reference name, or just the reference-opening &, is followed by white space or another delimiter like ", ', <, > or &, the ending ; and even a reference to represent a plain & can be omitted:

    <p title="&amp;">foo &amp; bar</p>
    <p title="&amp">foo &amp bar</p>
    <p title="&">foo & bar</p>

Only in these cases can the ending ; or even the reference itself be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.

But the specification recommends always using a reference like the character reference &#38; or the entity reference &amp; to avoid confusion:

    Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

User · Answer

If & is used in HTML then you should escape it.

If & is used in JavaScript strings, e.g. alert('This & that'); or document.href, you don't need to escape it.

If you're using document.write then you should escape it, e.g. document.write('<p>this &amp; that</p>');

User · Answer

A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like

    <div style="..." ... style="...">

When faced with a repeated style attribute, IE combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to

    <div style="...; ...">

and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)

User · Answer

If the user passes it to you, or it will wind up in a URL, you need to escape it.

If it appears in static text on a page? All browsers will get this one right either way, so you don't need to worry much about it, since it will work.

User · Answer

If you're really talking about the static text

    <title>Foo & Bar</title>

stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.

However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):

If you don't escape a simple &, then chances are you also don't escape a &amp; or a &nbsp; or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are susceptible to XSS attacks.

In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone & unescaped.
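The point that one escaping step covers both the harmless & and the dangerous cases can be sketched in Python as follows. The untrusted string is invented for illustration (it reuses the attacker URL from the text above), and html.escape is the standard-library function; a real application would normally rely on its template engine's auto-escaping instead.

```python
import html

# Hypothetical untrusted text mixing a plain ampersand with injected markup.
untrusted = 'Tom & Jerry <script src="http://attacker.com/evil.js"></script>'

# One escaping call handles &, <, > and quotes alike.
fragment = f"<p>{html.escape(untrusted)}</p>"
print(fragment)

# Decoding the escaped text gives back the original characters for display.
displayed = html.unescape(html.escape(untrusted))
```

The escaped fragment contains no script element, and the visitor still sees the text exactly as it was submitted.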

User · Answer

Yes, you should try to serve valid code if possible.

Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.

Some examples where browsers are likely to react differently are if you put elements inside a table but outside the table cells, or if you nest links inside each other.

For your specific example it's not likely to cause any problems, but error correction in the browser might, for example, cause the browser to change from standards-compliant mode into quirks mode, which could make your layout break down completely.

So, you should correct errors like this in the code, if for nothing else than to keep the error list in the validator short, so that you can spot more serious problems.

User · Answer

Yes. Just as the error said, in HTML, attributes are #PCDATA, meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and, if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as &amp; and everything will be fine.

HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.

Keep this point in mind: if you're not escaping & to &amp;, it's bad enough for data that you create (where the code could very well be invalid), but you might also not be escaping tag delimiters, which is a huge problem for user-submitted data and could very well lead to HTML and script injection, cookie stealing and other exploits.

Please just escape your code. It will save you a lot of trouble in the future.

User · Answer

The link has a fairly good example of when and why you may need to escape & to &amp;:

https://jsfiddle.net/vh2h7usk/1/

Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in &amp; and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly.
