ParseError: not well-formed (invalid token) using cElementTree

Question

I receive XML strings from an external source that can contain unsanitized user-contributed content. The following XML string gave a ParseError in cElementTree:

    >>> print repr(s)
    '<Comment>dddddddd\x08\x08\x08\x08\x08\x08</Comment>'
    >>> import xml.etree.cElementTree as ET
    >>> ET.XML(s)
    Traceback (most recent call last):
      File "<pyshell#4>", line 1, in <module>
        ET.XML(s)
      File "<string>", line 106, in XML
    ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

User · Answer

None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree, as follows:

    from bs4 import BeautifulSoup

    with open("data/myfile.xml") as fp:
        soup = BeautifulSoup(fp, "xml")

Then you can search the tree as:

    soup.find_all("mytag")

User · Answer

I was stuck on a similar problem and finally figured out the root cause in my particular case. If you read the data from multiple XML files that lie in the same folder, you will also parse the .DS_Store file. Before parsing, add this condition:

    for file in files:
        if file.endswith('.xml'):
            run_your_code(file)  # placeholder: your parsing logic goes here

This trick helped me as well.
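
The same check can be folded into the directory walk itself. This is a minimal sketch, assuming Python 3; the helper name `parse_xml_folder` is hypothetical and only for illustration. Files that do not end in `.xml` (such as `.DS_Store`) are skipped before the parser ever sees them:

```python
import os
import xml.etree.ElementTree as ET


def parse_xml_folder(folder):
    """Parse every *.xml file in `folder` (hypothetical helper, for illustration)."""
    roots = {}
    for name in sorted(os.listdir(folder)):
        # Skip .DS_Store and anything else that is not an XML file.
        if not name.endswith(".xml"):
            continue
        roots[name] = ET.parse(os.path.join(folder, name)).getroot()
    return roots
```

The function returns a dict that maps each file name to its parsed root element, so a stray non-XML file can no longer trigger the "invalid token" error.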

User · Answer

This is most probably an encoding error. For example, I had an XML file encoded in UTF-8-BOM (checked from the Notepad++ Encoding menu) and got a similar error message.

The workaround (Python 3.6):

    import io
    from xml.etree import ElementTree as ET

    with io.open(file, 'r', encoding='utf-8-sig') as f:
        contents = f.read()
        tree = ET.fromstring(contents)

Check the encoding of your XML file. If it is using a different encoding, change 'utf-8-sig' accordingly.
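
To see what the `utf-8-sig` codec changes, here is a small sketch (assuming Python 3; the sample document is made up for illustration). It builds the bytes of a BOM-prefixed file in memory and decodes them both ways:

```python
import codecs
import xml.etree.ElementTree as ET

# Bytes as they would sit on disk in a "UTF-8 with BOM" file (made-up sample).
raw = codecs.BOM_UTF8 + "<root>ok</root>".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
text_plain = raw.decode("utf-8")

# "utf-8-sig" strips the BOM, so the text starts with "<".
text_sig = raw.decode("utf-8-sig")

root = ET.fromstring(text_sig)
```

With `utf-8-sig` the decoded text begins directly with the root tag, which is what the parser expects.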

User · Answer

A solution for a gotcha for me, using Python's ElementTree. This has the invalid token error:

    # -*- coding: utf-8 -*-
    import xml.etree.ElementTree as ET

    xml = u"""<?xml version='1.0' encoding='utf8'?>
    <osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="..." /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""

    xmltest = ET.fromstring(xml.encode("utf-8"))

However, it works with the addition of a hyphen in the encoding type:

    <?xml version='1.0' encoding='utf-8'?>

Most odd. Someone found this footnote in the Python docs:

> The encoding string included in XML output should conform to the appropriate standards. For example, "UTF-8" is valid, but "UTF8" is not.

User · Answer

What helped me with that error was Juan's answer: https://stackoverflow.com/a/20204635/4433222. But it wasn't enough. After struggling, I found out that the XML file needs to be saved with UTF-8 without BOM encoding. The solution wasn't working for "normal" UTF-8.

User · Answer

It seems to complain about \x08; you will need to escape that.

Edit: Or you can have the parser ignore the errors using recover:

    from lxml import etree
    parser = etree.XMLParser(recover=True)
    etree.fromstring(xmlstring, parser=parser)

User · Answer

I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:

    import re

    def ParseXmlTagContents(source, tag, tagContentsRegex):
        openTagString = "<" + tag + ">"
        closeTagString = "</" + tag + ">"
        found = re.search(openTagString + tagContentsRegex + closeTagString, source)
        if found:
            # regs[0] holds the (start, end) span of the whole match
            start = found.regs[0][0]
            end = found.regs[0][1]
            return source[start+len(openTagString):end-len(closeTagString)]
        return ""

Example usage would be:

    <?xml version="1.0" encoding="utf-16"?>
    <parentNode>
        <childNode>123</childNode>
    </parentNode>

    ParseXmlTagContents(xmlString, "childNode", "[0-9]+")

User · Answer

After lots of searching through the entire WWW, I only found out that you have to escape certain characters if you want your XML parser to work. Here's how I did it, and it worked for me:

    escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)

And use it like you'd normally do:

    ET.XML(escape_illegal_xml_characters(my_xml_string))

instead of:

    ET.XML(my_xml_string)
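
The same idea can be written as a named function with a precompiled pattern. This is a sketch assuming Python 3; the names `_ILLEGAL_XML_CHARS` and `strip_illegal_xml_characters` are made up for illustration. Note that it removes the offending characters rather than escaping them, just as the lambda does:

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow: C0 controls except tab, LF and CR,
# plus surrogates and the two noncharacters U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_characters(text):
    """Return `text` with characters that XML 1.0 forbids removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# Made-up sample: a vertical tab (\x0b) inside element text.
cleaned = strip_illegal_xml_characters("<a>x\x0by</a>")
root = ET.XML(cleaned)
```

Tab, line feed and carriage return are left alone because XML permits them.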

User · Answer

I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title.

    import xml.etree.ElementTree as ET
    parser = ET.XMLParser(encoding="utf-8")
    tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered.

User · Answer

lxml solved the issue in my case:

    from lxml import etree

    for _, ele in etree.iterparse(xml_file, tag='tag_i_wanted', encoding='utf-8'):
        print ele.tag, ele.text

In another case:

    parser = etree.XMLParser(recover=True)
    tree = etree.parse(xml_file, parser=parser)
    tags_needed = tree.iter('TAG_NAME')

Thanks to theeastcoastwest. (Python 2.7)

User · Answer

The only thing that worked for me was to add the mode and encoding while opening the file, like below:

    with open(filenames[0], mode='r', encoding='utf-8') as f:
        readFile()

Otherwise it was failing every time with the invalid token error if I simply did this:

    f = open(filenames[0], 'r')
    readFile()

User · Answer

This code snippet worked for me. I had an issue parsing a batch of XML files; I had to parse them as 'iso-8859-5':

    import xml.etree.ElementTree as ET

    tree = ET.parse(filename, parser=ET.XMLParser(encoding='iso-8859-5'))

User · Answer

See this answer to another question and the according part of the XML spec.

The backspace U+0008 is an invalid character in XML documents. It cannot occur plainly, and XML 1.0 does not accept the character reference &#8; for it either (only XML 1.1 allows that form).

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
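
A short sketch of that replacement, assuming Python 3 and `xml.etree.ElementTree` (cElementTree is the Python 2 name), applied to the string from the question. Deleting the backspaces is one possible choice of replacement, not the only one:

```python
import xml.etree.ElementTree as ET

s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08</Comment>"

try:
    ET.XML(s)
    parsed_raw = True
except ET.ParseError:
    # The raw backspace characters make the document not well-formed.
    parsed_raw = False

# Replacing \x08 (here: deleting it) yields a string the parser accepts.
root = ET.XML(s.replace("\x08", ""))
```

If the backspaces carry meaning for your application, substitute a visible placeholder in `replace` instead of the empty string.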
