In a text file, there is a string "I don't like this".
However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I read the file with:
f1 = open(file1, "r")
text = f1.read()
Now, is it possible to read the file in such a way that the resulting string is "I don't like this", instead of "I don\xe2\x80\x98t like this"?
Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode (and vice versa) conversion?
There are a few points to consider.
The sequence \u2018 appears only in the representation (repr) of a unicode string in Python; the string itself contains the single character ‘. For example, if you write:
>>> text = u'‘'
>>> print repr(text)
u'\u2018'
Now if you simply want to print the unicode string prettily, just use unicode's encode method:
>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this
To make sure that every line from any file is read as unicode, you'd better use the codecs.open function instead of just open, which allows you to specify the file's encoding:
>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this
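The byte string shown in the question is the UTF-8 encoding of the same text, so decoding it gives the unicode string with U+2018 in it. A minimal Python 3 sketch of that round trip (the variable names are illustrative, not from the question):

```python
# Python 3 sketch: the bytes from the question are the UTF-8 form of U+2018.
raw = b"I don\xe2\x80\x98t like this"   # what a byte-oriented read returns

text = raw.decode("utf-8")              # bytes -> text
assert text == "I don\u2018t like this"

# Encoding goes the other way and reproduces the original bytes.
assert text.encode("utf-8") == raw

# The quote is three bytes on disk but one character in the decoded string.
print(len(raw), len(text))
```

Nothing is lost or altered by reading the file; only the representation differs.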
There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:
>>> print repr(text)
'I don\\u2018t like this'
This actually happened to me once before. You can use the unicode_escape codec to decode the string to unicode and then encode it to any format you want:
>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this
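For Python 3, where str has no decode method, the same idea can be sketched by starting from bytes. This assumes the data really contains a literal backslash followed by u2018; note that unicode_escape treats any non-escaped bytes as Latin-1, so it is only safe when the rest of the data is plain ASCII.

```python
# Python 3 sketch: the input holds a literal backslash-u sequence, not U+2018.
raw = b"I don\\u2018t like this"
assert b"\\u2018" in raw          # six ASCII bytes: \ u 2 0 1 8

# unicode_escape interprets the escape sequence and yields real text.
uni = raw.decode("unicode_escape")
assert uni == "I don\u2018t like this"

# From here the text can be encoded to whatever the output needs.
utf8_bytes = uni.encode("utf-8")
print(utf8_bytes)
```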
I am not sure about the errors="ignore" option, but it seems to work for files with strange Unicode characters. Be aware that it silently drops any bytes that are not valid UTF-8.
with open(fName, "rb") as fData:
    lines = fData.read().splitlines()
lines = [line.decode("utf-8", errors="ignore") for line in lines]
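To see what the errors argument actually does, here is a small Python 3 sketch with a deliberately invalid byte (the sample data is made up for illustration). "ignore" drops the bad byte, "replace" substitutes U+FFFD, and the default strict mode raises.

```python
# 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
data = b"caf\xe9 ok"

ignored = data.decode("utf-8", errors="ignore")    # bad byte silently dropped
replaced = data.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

try:
    data.decode("utf-8")                           # default is errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

If losing data silently is a concern, "replace" at least leaves a visible marker where the bad bytes were.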
It is also possible to read an encoded text file with Python 3's built-in open, which accepts an encoding argument:
f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()
With this variation, there is no need to import any additional libraries.
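A self-contained Python 3 sketch of that approach, writing a temporary file first so it can be run as is (the file name sample.txt and the temporary directory are assumptions for the example):

```python
import os
import tempfile

# Create a throwaway file containing the text from the question.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("I don\u2018t like this")

# Text mode with an explicit encoding returns decoded text.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode returns the raw UTF-8 bytes instead.
with open(path, "rb") as f:
    raw = f.read()

print(repr(text), repr(raw))
```

Using with also closes the file automatically, so no explicit close() call is needed.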
Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.
You'll have to google for "iconvcodec", since the module seems not to be supported anymore and I can't find a canonical home page for it.
>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"
Alternatively, you can use the iconv command-line utility to clean up your file:
$ xxd foo
0000000: e280 980a ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a '.
Actually, U+2018 is the Unicode representation of the special character ‘ . If you want, you can convert instances of that character to U+0027 with this code:
text = text.replace(u"\u2018", "'")
In addition, what are you using to write the file? f1.read() should return a string that looks like this:
'I don\xe2\x80\x98t like this'
If it's returning this string, the file is being written incorrectly:
'I don\u2018t like this'
But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').
If you're trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.
punctuation = {
    u'\u2018': "'",
    u'\u2019': "'",
}
# items() works on both Python 2 and Python 3
for src, dest in punctuation.items():
    text = text.replace(src, dest)
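In Python 3 the same mapping idea can be done in a single pass with the built-in str.translate. This is only a sketch: the table below covers four common curly quotes, and the names PUNCTUATION and to_ascii_quotes are made up for the example.

```python
# Translation table: curly quotes -> plain ASCII quotes.
PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii_quotes(text):
    """Return text with the mapped punctuation replaced; other characters are untouched."""
    return text.translate(PUNCTUATION)


print(to_ascii_quotes("I don\u2018t like this"))
```

Characters that are not in the table pass through unchanged, so the table can be extended as new punctuation turns up in the input.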
There are an awful lot of punctuation characters in unicode, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you're reading.
This is Python's way of showing you unicode strings in their escaped representation. But I think you should be able to print the string on the screen or write it to a new file without any problems.
>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this
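The same distinction exists in Python 3, where the built-in ascii() gives the escaped representation while the string itself still holds the single character. A short sketch:

```python
text = "I don\u2018t like this"

# ascii() returns a printable representation with non-ASCII characters escaped.
escaped = ascii(text)
assert escaped == "'I don\\u2018t like this'"

# The string itself contains one character for the quote, not six.
assert text[5] == "\u2018"
print(escaped)
```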
Source: Stackoverflow.com