[python] ParseError: not well-formed (invalid token) using cElementTree

I receive xml strings from an external source that can contains unsanitized user contributed content.

The following xml string gave a ParseError in cElementTree:

>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

This question is related to python parsing elementtree

The answer is


See this answer to another question and the according part of the XML spec.

The backspace U+0008 is an invalid character in XML documents. It must be represented as escaped entity &#8; and cannot occur plainly.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.


I have been in stuck with similar problem. Finally figured out the what was the root cause in my particular case. If you read the data from multiple XML files that lie in same folder you will parse also .DS_Store file. Before parsing add this condition

for file in files:
    if file.endswith('.xml'):
       run_your_code...

This trick helped me as well


What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222 But wasn't enough - after struggling I found out that an XML file needs to be saved with UTF-8 without BOM encoding.

The solution wasn't working for "normal" UTF-8.


The only thing that worked for me is I had to add mode and encoding while opening the file like below:

with open(filenames[0], mode='r',encoding='utf-8') as f:
     readFile()

Otherwise it was failing every time with invalid token error if I simply do this:

 f = open(filenames[0], 'r')
 readFile()

This is most probably an encoding error. For example I had an xml file encoded in UTF-8-BOM (checked from the Notepad++ Encoding menu) and got similar error message.

The workaround (Python 3.6)

import io
from xml.etree import ElementTree as ET

with io.open(file, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    tree = ET.fromstring(contents)

Check the encoding of your xml file. If it is using different encoding, change the 'utf-8-sig' accordingly.


None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

from bs4 import BeautifulSoup

with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')

Then you can search the tree as:

soup.find_all('mytag')

After lots of searching through the entire WWW, I only found out that you have to escape certain characters if you want your XML parser to work! Here's how I did it and worked for me:

escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)

And use it like you'd normally do:

ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)

I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...


This code snippet worked for me. I have an issue with the parsing batch of XML files. I had to encode them to 'iso-8859-5'

import xml.etree.ElementTree as ET

tree = ET.parse(filename, parser = ET.XMLParser(encoding = 'iso-8859-5'))

A solution for gottcha for me, using Python's ElementTree... this has the invalid token error:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="????? ?? ??? ???? ?????? ??????????? ????? ?? ????? ???" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""

xmltest = ET.fromstring(xml.encode("utf-8"))

However, it works with the addition of a hyphen in the encoding type:

<?xml version='1.0' encoding='utf-8'?>

Most odd. Someone found this footnote in the python docs:

The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.


lxml solved the issue, in my case

from lxml import etree

for _, elein etree.iterparse(xml_file, tag='tag_i_wanted', unicode='utf-8'):
    print(ele.tag, ele.text)  

in another case,

parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_file, parser=parser)
tags_needed = tree.iter('TAG NAME')

Thanks to theeastcoastwest

Python 2.7


I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single xml node I gave in and wrote my function to do so:

def ParseXmlTagContents(source, tag, tagContentsRegex):
    openTagString = "<"+tag+">"
    closeTagString = "</"+tag+">"
    found = re.search(openTagString + tagContentsRegex + closeTagString, source)
    if found:   
        start = found.regs[0][0]
        end = found.regs[0][1]
        return source[start+len(openTagString):end-len(closeTagString)]
    return ""

Example usage would be:

<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>

ParseXmlTagContents(xmlString, "childNode", "[0-9]+")

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to parsing

Got a NumberFormatException while trying to parse a text file for objects Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) Python/Json:Expecting property name enclosed in double quotes Correctly Parsing JSON in Swift 3 How to get response as String using retrofit without using GSON or any other library in android UIButton action in table view cell "Expected BEGIN_OBJECT but was STRING at line 1 column 1" How to convert an XML file to nice pandas dataframe? How to extract multiple JSON objects from one file? How to sum digits of an integer in java?

Examples related to elementtree

Use xml.etree.ElementTree to print nicely formatted xml files Convert Python ElementTree to string Parsing XML with namespace in Python via 'ElementTree' ParseError: not well-formed (invalid token) using cElementTree Parsing XML in Python using ElementTree example