Convert a Unicode string to a string in Python (containing extra symbols)

Question

How do you convert a Unicode string (containing extra characters such as non-ASCII symbols) into a Python string?

User · Answer

Well, if you're willing and ready to switch to Python 3 (which you may not be, due to the backwards incompatibility with some Python 2 code), you don't have to do any converting: all text in Python 3 is represented with Unicode strings, which also means the u'<text>' syntax is no longer needed. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

Of course, if you're currently using Python 3, then the problem is likely something to do with how you're attempting to save the text to a file.
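To make the text-versus-data split concrete, here is a minimal sketch assuming Python 3. The sample string is purely illustrative and does not come from the question.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "\u00a310 and \u20ac5"   # illustrative sample: a pound sign and a euro sign
data = text.encode("utf-8")     # str -> bytes

print(type(text).__name__)      # str
print(type(data).__name__)      # bytes

# Decoding with the same codec gives back the original text.
restored = data.decode("utf-8")
print(restored == text)         # True
```

The only point at which an encoding has to be chosen is the boundary between the two types, for example when writing to a file or a socket.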

User · Answer

    >>> text = u'abcd'
    >>> str(text)
    'abcd'

This works only if the string contains nothing but ASCII characters.
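A rough sketch of how the ASCII-only condition could be checked up front, assuming Python 3. The helper name `to_ascii_bytes` is hypothetical and not part of any library.

```python
def to_ascii_bytes(text):
    """Return text as ASCII bytes, or None if it has non-ASCII characters."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character has no ASCII representation.
        return None


print(to_ascii_bytes("abcd"))        # b'abcd'
print(to_ascii_bytes("abc\u20ac"))   # None (the euro sign is not ASCII)
```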

User · Answer

Here is an example:

    >>> u = u'€€€'
    >>> s = u.encode('utf8')
    >>> s
    '\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

User · Answer

Here is some example code:

    import unicodedata

    raw_text = u"here   6757 dfgdfg"
    convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii', 'ignore')

User · Answer

You can encode to ASCII if you don't need to translate the non-ASCII characters:

    >>> a = u"aaaàçççñññ"
    >>> type(a)
    <type 'unicode'>
    >>> a.encode('ascii', 'ignore')
    'aaa'
    >>> a.encode('ascii', 'replace')
    'aaa???????'
    >>>
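Besides 'ignore' and 'replace', the standard codecs accept a couple of other error handlers that keep some trace of the dropped characters. A small sketch, assuming Python 3 (where encode returns bytes); the sample string is illustrative.

```python
a = "aaa\u00e0\u00e7\u00f1"   # 'aaa' followed by a-grave, c-cedilla, n-tilde

ignored = a.encode("ascii", "ignore")              # drop what cannot be encoded
replaced = a.encode("ascii", "replace")            # substitute '?'
xml_refs = a.encode("ascii", "xmlcharrefreplace")  # numeric character references
escaped = a.encode("ascii", "backslashreplace")    # Python-style escapes

print(ignored)    # b'aaa'
print(replaced)   # b'aaa???'
print(xml_refs)   # b'aaa&#224;&#231;&#241;'
print(escaped)    # b'aaa\\xe0\\xe7\\xf1'
```

The last two handlers are lossless in the sense that the original code points can still be recovered from the output.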

User · Answer

If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes per code point, depending on the character), etc. To convert that string into a particular encoding, you can use:

    >>> s = u'\x9c10'
    >>> s.encode('utf8')
    '\xc2\x9c10'
    >>> s.encode('utf16')
    '\xff\xfe\x9c\x001\x000\x00'

This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.

When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:

    import codecs
    f = codecs.open('path/to/file.txt', 'w', 'utf8')
    f.write(my_unicode_string)  # Stored on disk as UTF-8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn't a problem; otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
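The Python 3 behaviour described in the last paragraph could look roughly like this. The scratch path and the sample text are assumptions made for illustration only.

```python
import os
import tempfile

my_unicode_string = "caf\u00e9 \u20ac10"   # illustrative text with two non-ASCII characters

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Text mode plus an explicit encoding: str in, UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(my_unicode_string)

# Binary mode shows what was actually stored.
with open(path, "rb") as f:
    raw = f.read()

# Reading back in text mode with the same encoding restores the str.
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(raw)                                  # b'caf\xc3\xa9 \xe2\x82\xac10'
print(round_tripped == my_unicode_string)   # True
```

Passing the encoding explicitly avoids depending on the platform default, which is not UTF-8 everywhere.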

User · Answer

No answer worked for my case, where I had a string variable containing Unicode escape sequences, and none of the encode/decode approaches explained here did the job.

If I run this in a terminal:

    echo "no me llama mucho la atenci\u00f3n"

or in python3:

    >>> print("no me llama mucho la atenci\u00f3n")

the output is correct:

    output: no me llama mucho la atención

But scripts that loaded this string into a variable (so that it held the literal characters \u00f3 rather than ó) didn't work.

This is what worked in my case, in case it helps anybody:

    import json

    string_to_convert = r"no me llama mucho la atenci\u00f3n"
    print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))

    output: "no me llama mucho la atención"

Note that json.dumps adds the surrounding double quotes; json.loads alone returns the bare text.
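A stripped-down sketch of the same idea, assuming Python 3 and assuming the variable really holds a literal backslash followed by u00f3 (hence the raw string). It only works when the text contains no unescaped double quotes or stray backslashes, since the wrapped value must be a valid JSON string.

```python
import json

# Raw string: the variable contains the six characters \u00f3,
# not the single character they stand for.
string_to_convert = r"no me llama mucho la atenci\u00f3n"

# Wrap the text in double quotes so it parses as a JSON string literal;
# the JSON parser then resolves the \uXXXX escape.
decoded = json.loads('"%s"' % string_to_convert)

print(decoded)   # no me llama mucho la atención
```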

User · Answer

The file contains a Unicode-escaped string:

    "message": "\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f ..."

What worked for me:

    f = open("56ad62-json.log", encoding="utf-8")
    qq = f.readline()

    print(qq)
    {"log":"\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\""}

    (qq.encode().decode("unicode-escape").encode().decode("unicode-escape"))
    # '{"log":"message": "Авторизация пользователя"}\n'
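The double decode is needed because each escape is preceded by two backslashes. A minimal sketch of that step, assuming Python 3; the helper `decode_escapes` and the shortened sample line are hypothetical.

```python
def decode_escapes(text, passes=1):
    """Apply the unicode_escape codec to text the given number of times."""
    for _ in range(passes):
        # encode("ascii") is deliberate: unicode_escape treats its input
        # bytes as Latin-1, so non-ASCII input would be mangled. Failing
        # loudly on such input is safer than silently corrupting it.
        text = text.encode("ascii").decode("unicode_escape")
    return text


# Raw string: every escape starts with two literal backslashes.
line = r"\\u0410\\u0432\\u0442\\u043e"

once = decode_escapes(line, passes=1)    # backslash pairs collapse to one
twice = decode_escapes(line, passes=2)   # escapes become Cyrillic letters

print(once)    # \u0410\u0432\u0442\u043e
print(twice)   # Авто
```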

User · Answer

There is a library that can help with Unicode issues called ftfy. It has made my life easier.

Example 1

    import ftfy
    print(ftfy.fix_text(u'    nicode'))

    output --> nicode

Example 2 - UTF-8

    import ftfy
    print(ftfy.fix_text('\xe2\x80\xa2'))

    output --> •

Example 3 - Unicode code point

    import ftfy
    print(ftfy.fix_text(u'\u2026'))

    output --> …

https://ftfy.readthedocs.io/en/latest/

    pip install ftfy

https://pypi.org/project/ftfy/

User · Answer

See unicodedata.normalize:

    title = u"Klüft skräms inför på fédéral électoral große"
    import unicodedata
    unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
    'Kluft skrams infor pa federal electoral groe'
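A variant of the same technique that stays in text form, sketched for Python 3. The helper `strip_accents` is a hypothetical name, and the sample title is written with escapes so the snippet does not depend on the source file's encoding.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after NFKD decomposition; keep everything else."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


title = ("Kl\u00fcft skr\u00e4ms inf\u00f6r p\u00e5 "
         "f\u00e9d\u00e9ral \u00e9lectoral gro\u00dfe")

stripped = strip_accents(title)
ascii_only = unicodedata.normalize("NFKD", title).encode("ascii", "ignore")

print(stripped)     # Kluft skrams infor pa federal electoral große
print(ascii_only)   # b'Kluft skrams infor pa federal electoral groe'
```

The difference is visible in the last word: characters with no decomposition, such as ß, survive when only combining marks are removed, but are dropped outright by encode('ascii', 'ignore').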
