Python 2
The error is caused because ElementTree did not expect to find non-ASCII strings set the XML when trying to write it out. You should use Unicode strings for non-ASCII instead. Unicode strings can be made either by using the u
prefix on strings, i.e. u'€'
or by decoding a string with mystr.decode('utf-8')
using the appropriate encoding.
The best practice is to decode all text data as it's read, rather than decoding mid-program. The io
module provides an open()
method which decodes text data to Unicode strings as it's read.
ElementTree will be much happier with Unicodes and will properly encode it correctly when using the ET.write()
method.
Also, for best compatibility and readability, ensure that ET encodes to UTF-8 during write()
and adds the relevant header.
Presuming your input file is UTF-8 encoded (0xC2
is common UTF-8 lead byte), putting everything together, and using the with
statement, your code should look like:
with io.open('myText.txt', "r", encoding='utf-8') as f:
data = f.read()
root = ET.Element("add")
doc = ET.SubElement(root, "doc")
field = ET.SubElement(doc, "field")
field.set("name", "text")
field.text = data
tree = ET.ElementTree(root)
tree.write("output.xml", encoding='utf-8', xml_declaration=True)
Output:
<?xml version='1.0' encoding='utf-8'?>
<add><doc><field name="text">data€</field></doc></add>