What are invalid characters in XML?

Question

I am working with some XML that holds strings like:

    <node>This is a string</node>

Some of the strings that I am passing to the nodes will have characters like &, etc.:

    <node>This is a string & so is this</node>

This is not valid due to &. I cannot wrap these strings in CDATA as they need to be as they are. I tried looking for a list of characters that cannot be put in XML nodes without being in a CDATA section.

Can someone point me in the direction of one or provide me with a list of illegal characters?

User · Answer

Has anyone tried System.Security.SecurityElement.Escape(yourstring)? This will replace invalid XML characters in a string with their valid equivalent.

User · Answer

The only illegal characters are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed'). They're escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away, so you don't have to worry about it.
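
To make the "use a library" advice concrete, here is a minimal Java sketch using the JDK's StAX writer. The class and method names (WriterEscapingDemo, buildNode) are made up for illustration, and the exact serialization can differ between StAX implementations; the point is only that the writer, not the caller, takes care of & and the attribute quotes.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class WriterEscapingDemo {
    // Builds <node attr="...">...</node> and leaves all escaping to the writer.
    static String buildNode(String attrValue, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("node");
            w.writeAttribute("attr", attrValue); // attribute value is escaped by the writer
            w.writeCharacters(text);             // text content is escaped by the writer
            w.writeEndElement();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildNode("say \"hi\"", "This is a string & so is this"));
    }
}
```

With the JDK's default implementation, the & in the text should come out as &amp; without any manual replacing.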

User · Answer

- ampersand (&) is escaped to &amp;
- double quotes (") are escaped to &quot;
- single quotes (') are escaped to &apos;
- less than (<) is escaped to &lt;
- greater than (>) is escaped to &gt;

In C#, use System.Security.SecurityElement.Escape or System.Net.WebUtility.HtmlEncode to escape these illegal characters.

    string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A  0x09 0x0A <node>";
    string encodedXml1 = System.Security.SecurityElement.Escape(xml);
    string encodedXml2 = System.Net.WebUtility.HtmlEncode(xml);

encodedXml1:

    "&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"

encodedXml2:

    "&lt;node&gt;it&#39;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"
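
If no library is at hand, the five replacements listed above are simple enough to write by hand. A rough Java sketch follows; XmlEntityEscaper is a made-up name, and it only covers the five predefined entities, it does not remove control characters.

```java
final class XmlEntityEscaper {
    // Replaces the five characters that have predefined XML entities.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("<node>it's my \"node\" & i like it</node>"));
    }
}
```

Going character by character avoids the classic mistake of chained replace calls, where a later replacement re-escapes the & introduced by an earlier one.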

User · Answer

In addition to potame's answer, if you do want to escape using a CDATA block:

If you put your text in a CDATA block then you don't need to use escaping. In that case you can use any character that is a valid XML character.

Note: on top of that, you're not allowed to use the ]]> character sequence, because it would match the end of the CDATA block.

If there are still invalid characters (e.g. control characters), then it is probably better to use some kind of encoding (e.g. base64).
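
A small Java sketch of both ideas. Splitting ]]> across two CDATA sections is a common workaround rather than something this answer prescribes, and the names CdataWrapper, wrap and toBase64 are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class CdataWrapper {
    // Wraps text in CDATA. Every "]]>" is split so that "]]" ends one
    // section and ">" starts the next one.
    static String wrap(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Fallback for content with characters that are not valid XML at all.
    static String toBase64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(wrap("a ]]> b"));
        System.out.println(toBase64("\u0001"));
    }
}
```

Note that wrap does nothing about control characters; those still have to be removed or encoded, which is what the base64 fallback is for.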

User · Answer

Another easy way to escape potentially unwanted XML / XHTML chars in C# is:

    WebUtility.HtmlEncode(stringWithStrangeChars)

User · Answer

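This is C# code to remove the invalid XML characters from a string and return a new valid string.

    using System.Text.RegularExpressions;

    public static string CleanInvalidXmlChars(string text)
    {
        // From the XML spec, valid chars:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        // .NET regexes work on UTF-16 code units and \u takes exactly four hex digits,
        // so \u10000-\u10FFFF cannot be written directly. Instead, well-formed
        // surrogate pairs are kept and only unpaired surrogates are removed.
        string re = @"[^\x09\x0A\x0D\x20-\uFFFD]"
                  + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]"
                  + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])";
        return Regex.Replace(text, re, "");
    }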
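
For comparison, the same cleanup can be sketched in Java by walking code points instead of using a regex, which sidesteps the surrogate issue. XmlCharCleaner and its methods are illustrative names, not part of any library.

```java
final class XmlCharCleaner {
    // XML 1.0 Char production, expressed on Unicode code points.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point that is not a valid XML 1.0 character.
    // An unpaired surrogate shows up as its own code point and is dropped too.
    static String clean(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlCharCleaner::isXml10Char)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("a\u0000b\uFFFEc"));
    }
}
```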

User · Answer

For Java folks, Apache has a utility class (StringEscapeUtils) with a helper method, escapeXml, which can be used for escaping characters in a string using XML entities.

User · Answer

In the Woodstox XML processor, invalid characters are classified by this code:

    if (c == 0) {
        throw new IOException("Invalid null character in text to output");
    }
    if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
        String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
        if (mXml11) {
            msg += " (can only be output using character entity)";
        }
        throw new IOException(msg);
    }
    if (c > 0x10FFFF) {
        throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
    }
    /*
     * Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
     * Ascii)?
     */
    if (c >= SURR1_FIRST && c <= SURR2_LAST) {
        throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
    }
    throw new IOException("Invalid XML character (0x" + Integer.toHexString(c) + ") in text to output");

Source from here.

User · Answer

OK, let's separate the question into the characters that:

- aren't valid at all in any XML document.
- need to be escaped.

The answer provided by dolmen in "What are invalid characters in XML" is still valid but needs to be updated with the XML 1.1 specification.

1. Invalid characters

The characters described here are all the characters that are allowed to be inserted in an XML document.

1.1. In XML 1.0

Reference: see XML recommendation 1.0, §2.2 Characters.

The global list of allowed characters is:

    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Basically, the control characters and characters outside the Unicode ranges are not allowed. This also means that using, for example, the character reference &#x3; is forbidden.

1.2. In XML 1.1

Reference: see XML recommendation 1.1, §2.2 Characters, and 1.3 Rationale and list of changes for XML 1.1.

The global list of allowed characters is:

    [2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    [2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

This revision of the XML recommendation has extended the allowed characters so that control characters are allowed, and takes into account a new revision of the Unicode standard, but these are still not allowed: NUL (x00), xFFFE, xFFFF...

However, the use of control characters and undefined Unicode characters is discouraged.

Note also that not all parsers take this into account, and XML documents with control characters may be rejected.

2. Characters that need to be escaped (to obtain a well-formed document)

- The < must be escaped with a &lt; entity, since it is assumed to be the beginning of a tag.
- The & must be escaped with a &amp; entity, since it is assumed to be the beginning of an entity reference.
- The > should be escaped with a &gt; entity. It is not mandatory (it depends on the context) but it is strongly advised to escape it.
- The ' should be escaped with a &apos; entity. This is mandatory in attributes delimited by single quotes, but it is strongly advised to always escape it.
- The " should be escaped with a &quot; entity. This is mandatory in attributes delimited by double quotes, but it is strongly advised to always escape it.
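
To make the difference between the two versions concrete, the productions above can be transcribed into Java predicates. This is only a sketch of the grammar rules; XmlCharClasses and its method names are invented for the example.

```java
final class XmlCharClasses {
    // XML 1.0, production [2] Char
    static boolean isChar10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1, production [2] Char
    static boolean isChar11(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1, production [2a] RestrictedChar
    static boolean isRestricted11(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        int cp = 0x3; // the character behind &#x3;
        System.out.println(isChar10(cp) + " " + isChar11(cp) + " " + isRestricted11(cp));
    }
}
```

For 0x3 this gives: not a Char in XML 1.0, a Char in XML 1.1, and a RestrictedChar there. NUL and xFFFE fail both Char checks.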

User · Answer

Another way to remove incorrect XML chars in C# is using XmlConvert.IsXmlChar (available since .NET Framework 4.0):

    // requires: using System.Linq;
    public static string RemoveInvalidXmlChars(string content)
    {
        return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
    }

or you may check that all characters are XML-valid:

    public static bool CheckValidXmlChars(string content)
    {
        return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
    }

For example, the vertical tab symbol (\v) is not valid for XML: it is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.
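
The vertical tab case is easy to try out from the parsing side. Here is a Java sketch using the JDK's built-in DOM parser; VerticalTabCheck and parses are made-up names, and the behaviour shown is that of the JDK's default parser, other parsers may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class VerticalTabCheck {
    // Returns true if the JDK's default XML parser accepts the document.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors, no stderr output
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an invalid XML character
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>" + (char) 0x0B + "</a>")); // vertical tab
        System.out.println(parses("<a>\t</a>"));                  // ordinary tab
    }
}
```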

User · Answer

The predeclared characters are:

    & < > " '

See "What are the special characters in XML?" for more information.

User · Answer

In summary, valid characters in the text are:

- tab, line-feed and carriage-return.
- all non-control characters are valid except & and <.
- > is not valid if following ]].

Sections 2.2 and 2.4 of the XML specification provide the answer in detail:

Characters

    Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646

Character data

    The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
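
Read literally, the quoted rules allow a very minimal escaper for element text: always escape & and <, and escape > only when it directly follows ]]. A Java sketch under that reading (MinimalTextEscaper is an invented name; this is for text content only, not attribute values, and it does not filter control characters):

```java
final class MinimalTextEscaper {
    // Escapes only what section 2.4 requires in element content.
    static String escapeText(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') {
                sb.append("&amp;");
            } else if (c == '<') {
                sb.append("&lt;");
            } else if (c == '>' && i >= 2 && s.charAt(i - 1) == ']' && s.charAt(i - 2) == ']') {
                sb.append("&gt;"); // ">" closing a literal "]]>" in content
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("a > b & c < d ]]> e"));
    }
}
```

In practice, escaping every > is simpler and just as valid; the sketch only shows where the specification draws the line.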

User · Answer

"XmlWriter and lower ASCII characters" worked for me:

    // Removes the control characters that XML 1.0 does not allow.
    string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");

User · Answer

For XSL (on really lazy days) I use:

    capture="&amp;(?!amp;)" capturereplace="&amp;amp;"

to translate all &-signs that aren't followed by amp; to proper ones.

We have cases where the input is in CDATA but the system which uses the XML doesn't take it into account. It's a sloppy fix, beware.
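
The same negative-lookahead trick, sketched in Java rather than in the XSL tooling above (BareAmpersandFixer is a made-up name). The example input also shows why it is sloppy: any other entity, such as &lt;, gets escaped a second time.

```java
final class BareAmpersandFixer {
    // Replaces every "&" that is not already the start of "&amp;".
    static String fix(String s) {
        return s.replaceAll("&(?!amp;)", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(fix("a & b &amp; c &lt; d"));
    }
}
```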

User · Answer

The list of valid characters is in the XML specification:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
