Why declare unicode by string in python?

Question

I'm still learning Python and I have a doubt.

In Python 2.6.x I usually declare the encoding in the file header like this (as in PEP 0263):

    # -*- coding: utf-8 -*-

After that, my strings are written as usual:

    a = "A normal string without declared Unicode"

But every time I see a Python project's code, the encoding is not declared in the header. Instead, it is declared on every string, like this:

    a = u"A string with declared Unicode"

What's the difference? What's the purpose of this? I know Python 2.6.x uses ASCII encoding by default, but it can be overridden by the header declaration, so what's the point of the per-string declaration?

Addendum: It seems I've mixed up file encoding with string encoding. Thanks for explaining it.

User · Answer

That doesn't set the format of the string; it sets the format of the file. Even with that header, "hello" is a byte string, not a Unicode string. To make it Unicode, you're going to have to use u"hello" everywhere. The header is just a hint of what format to use when reading the .py file.
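The distinction can be sketched in a few lines. This is an illustration only, written so that it also runs under Python 3, where the roles are reversed (unprefixed literals are text and b"" marks bytes); the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# The coding line above only tells Python how to decode this source file.
# Whether a literal is bytes or text is decided by its prefix.

byte_string = b"hello"   # bytes: what an unprefixed "hello" is in Python 2
text_string = u"hello"   # text: Python 2's unicode type, plain str in Python 3

# The two literals have different types, header or no header.
different_types = type(byte_string) is not type(text_string)

# Going from bytes to text means decoding with a named encoding.
decoded = byte_string.decode("ascii")

print(different_types)
print(decoded == text_string)
```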

User · Answer

The header defines the encoding of the source code itself, not of the resulting strings at runtime. Putting a non-ASCII character in a Python script without an encoding declaration in the header raises a SyntaxError in Python 2.5 and later (Python 2.3 and 2.4 only issued a warning).
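One way to see that the header applies to the source bytes is to hand Python the raw bytes of a tiny script, with and without a declaration. This is a sketch that assumes Python 3, where the default source encoding is UTF-8 rather than ASCII and the resulting literal is a text string; try_compile is a hypothetical helper, not part of any library.

```python
# Source code as raw bytes, the way it would sit on disk.
# 0x81 is the cp437 byte for "ü"; on its own it is not valid UTF-8.
undeclared = b"s = '\x81ber'\n"
declared = b"# coding: cp437\n" + undeclared


def try_compile(source_bytes):
    """Return a code object, or None if Python rejects the source bytes."""
    try:
        return compile(source_bytes, "<example>", "exec")
    except (SyntaxError, ValueError):
        return None


# Without a declaration the stray byte cannot be decoded, so compiling fails.
rejected = try_compile(undeclared) is None

# With the declaration, the same bytes are decoded as cp437 before parsing.
namespace = {}
exec(try_compile(declared), namespace)

print(rejected)
print(repr(namespace["s"]))
```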

User · Answer

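I made the following module, called unicoder, to be able to do the transformation on variables:

    import sys
    import os

    def ustr(string):
        # Wrap the text in a u"..." literal and write it to a temporary module.
        string = 'u"%s"' % string

        with open('_unicoder.py', 'w') as script:
            script.write('# -*- coding: utf-8 -*-\n')
            script.write('_ustr = %s' % string)

        # Importing the temporary module makes Python decode the literal.
        import _unicoder
        value = _unicoder._ustr

        # Clean up the module and its files ('del' is the Windows shell command).
        del _unicoder
        del sys.modules['_unicoder']

        os.system('del _unicoder.py')
        os.system('del _unicoder.pyc')

        return value

Then in your program you could do the following:

    # -*- coding: utf-8 -*-
    from unicoder import ustr

    txt = 'Hello, Unicode World'
    txt = ustr(txt)

    print type(txt)  # <type 'unicode'>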
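For comparison, the same conversion can usually be done without a temporary module by decoding the byte string directly. This is a sketch, not the answer author's code: to_text is a hypothetical helper name, and it assumes the bytes are UTF-8 unless told otherwise.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"Hello, Unicode World"
txt = to_text(raw)

# Non-ASCII bytes decode the same way, as long as the encoding is right.
umlaut = to_text(b"\xc3\xbcber")

print(type(txt))
print(repr(umlaut))
```

Unlike the file-based approach, this does not depend on shell commands and copes with quotes or backslashes in the input.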

User · Answer

As others have said, # coding: specifies the encoding the source file is saved in. Here are some examples to illustrate this.

A file saved on disk as cp437 (my console encoding), but with no encoding declared:

    b = 'über'
    u = u'über'
    print b,repr(b)
    print u,repr(u)

Output:

      File "C:\ex.py", line 1
    SyntaxError: Non-ASCII character '\x81' in file C:\ex.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Output of the file with # coding: cp437 added:

    über '\x81ber'
    über u'\xfcber'

At first, Python didn't know the encoding and complained about the non-ASCII character. Once it knew the encoding, the byte string got the bytes that were actually on disk. For the Unicode string, Python read \x81, knew that in cp437 that was a ü, and decoded it into the Unicode code point for ü, which is U+00FC. When the byte string was printed, Python sent the hex value 81 to the console directly. When the Unicode string was printed, Python correctly detected my console encoding as cp437 and translated Unicode ü to the cp437 value for ü.

Here's what happens with a file declared and saved in UTF-8:

    ├╝ber '\xc3\xbcber'
    über u'\xfcber'

In UTF-8, ü is encoded as the hex bytes C3 BC, so the byte string contains those bytes, but the Unicode string is identical to the first example. Python read the two bytes and decoded them correctly. Python printed the byte string incorrectly, because it sent the two UTF-8 bytes representing ü directly to my cp437 console.

Here the file is declared cp437, but saved in UTF-8:

    ├╝ber '\xc3\xbcber'
    ├╝ber u'\u251c\u255dber'

The byte string still got the bytes on disk (UTF-8 hex bytes C3 BC), but Python interpreted them as two cp437 characters instead of a single UTF-8-encoded character. Those two characters were translated to Unicode code points, and everything prints incorrectly.
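The three cases can be reproduced without saving any files, by encoding and decoding the text by hand. This is a sketch of what the decoding step does to the bytes, not of the console output; the variable names are illustrative.

```python
# What an editor writes to disk for the text "über" in each encoding.
saved_as_cp437 = u"\xfcber".encode("cp437")   # one byte, 0x81, for the ü
saved_as_utf8 = u"\xfcber".encode("utf-8")    # two bytes, 0xC3 0xBC, for the ü

# Declared encoding matches the saved encoding: decoding recovers the text.
case_cp437 = saved_as_cp437.decode("cp437")
case_utf8 = saved_as_utf8.decode("utf-8")

# Declared cp437 but saved as UTF-8: each UTF-8 byte becomes its own
# cp437 character, so the ü turns into two box-drawing characters.
case_mismatch = saved_as_utf8.decode("cp437")

print(repr(saved_as_cp437), repr(saved_as_utf8))
print(repr(case_cp437), repr(case_utf8), repr(case_mismatch))
```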

User · Answer

Those are two different things, as others have mentioned.

When you specify # -*- coding: utf-8 -*-, you're telling Python that the source file you've saved is UTF-8. The default for Python 2 is ASCII (for Python 3 it's UTF-8). This just affects how the interpreter reads the characters in the file.

In general, it's probably not the best idea to embed high Unicode characters into your file no matter what the encoding is; you can use Unicode string escapes, which work in either encoding.

When you declare a string with a u in front, like u'This is a string', it tells the Python compiler that the string is Unicode, not bytes. This is handled mostly transparently by the interpreter; the most obvious difference is that you can now embed Unicode characters in the string (that is, u'\u2665' is now legal). You can use from __future__ import unicode_literals to make it the default.

This only applies to Python 2; in Python 3 the default is Unicode, and you need to specify a b in front (like b'These are bytes') to declare a sequence of bytes.
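A short sketch of the escape approach: the source stays pure ASCII, so it reads the same under any ASCII-compatible source encoding, and the u prefix is accepted by both Python 2 and Python 3.3+. The variable names are illustrative.

```python
# A Unicode escape instead of a literal heart character in the source.
heart = u"\u2665"

# Text becomes bytes only when it is encoded explicitly.
heart_bytes = heart.encode("utf-8")

# A bytes literal, marked with the b prefix.
data = b"These are bytes"

print(len(heart))
print(repr(heart_bytes))
print(type(data))
```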
