How do I check if a string is unicode or ascii?

Question

What do I have to do in Python to figure out which encoding a string has?

User · Answer

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.

In Python 3:

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.
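
As a minimal sketch of the isinstance approach on Python 3 (the helper name classify is hypothetical, chosen only for illustration), a check that separates text from bytes might look like this:

```python
def classify(obj):
    """Return 'text', 'bytes', or 'other' for a Python 3 object."""
    if isinstance(obj, str):
        return "text"       # str holds Unicode text in Python 3
    if isinstance(obj, (bytes, bytearray)):
        return "bytes"      # raw bytes with no encoding attached
    return "other"


print(classify("abc"))    # text
print(classify(b"abc"))   # bytes
print(classify(42))       # other
```

Note that this only reports the Python type. It says nothing about which characters the text contains or which encoding the bytes were produced with.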

How to tell if a byte string is valid utf-8 or ascii

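You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid. The session below is from Python 2; Python 3 prints 'Ü' instead of u'\xdc' for the successful decode.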

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
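
The decode test can be wrapped in a small helper. This is only a sketch; the name is_valid is an assumption made for illustration and is not part of any library:

```python
def is_valid(data, encoding):
    """Return True if the byte string decodes cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b'\xc3\x9c'                # UTF-8 representation of the letter 'Ü'
print(is_valid(u_umlaut, 'utf-8'))    # True
print(is_valid(u_umlaut, 'ascii'))    # False
print(is_valid(b'abc', 'ascii'))      # True
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the one the author intended. For example, latin-1 accepts every possible byte sequence.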

User · Answer

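Use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (unicode) string

Inside the six library it is represented as:

if PY3:
    string_types = str,
else:
    string_types = basestring,

The same if/else in six also defines text_type, which is str on Python 3 and unicode on Python 2.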

User · Answer

Note that on Python 3, it's not really fair to say any of the following: that strs are UTF-x for any x (e.g. UTF-8), that strs are Unicode, or that strs are ordered collections of Unicode characters.

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, it's not as simple to answer this question as you might imagine.

An obvious way to test for ASCII-compatible strings is by an attempted encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method to distinguish them is used.
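
The attempted-encode test for both cases can be sketched as one Python 3 helper. The function name can_encode is hypothetical and exists only for this example:

```python
def can_encode(text, encoding):
    """Return True if the str can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("Hello there!", "ascii"))             # True
print(can_encode("Hello there... \u2603!", "ascii"))   # False: snowman is not ASCII
print(can_encode("Hello there... \u2603!", "utf-8"))   # True
print(can_encode("\udcc3", "utf-8"))                   # False: lone surrogate
```

On Python 3.7 and later, str.isascii() gives the same answer as the ASCII case without the try/except.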

User · Answer

This may help someone else. I started out testing for the string type of the variable s, but for my application, it made more sense to simply return s as utf-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend for it to be Python version agnostic without a version test or importing six. Please comment with improvements to the sample code below to help other people.

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s  # assume it was already utf-8
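
For comparison, here is a Python 3 only sketch of the same idea. It is not the answer author's code; the name to_utf8_bytes is hypothetical, and it assumes that any bytes passed in are already UTF-8 without verifying that:

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (Python 3 only)."""
    if isinstance(value, bytes):
        return value                      # assumed to be UTF-8 already; not checked
    if isinstance(value, str):
        return value.encode("utf-8")
    return str(value).encode("utf-8")     # numbers and other objects via str()


print(to_utf8_bytes("\xdc"))    # b'\xc3\x9c'
print(to_utf8_bytes(3.5))       # b'3.5'
print(to_utf8_bytes(b"abc"))    # b'abc'
```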

User · Answer

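One simple approach is to check whether unicode is a builtin function. If so, you're in Python 2, and a plain string literal is a byte string (str) rather than unicode. To ensure everything is in unicode one can do:

try:
    import builtins                  # Python 3
except ImportError:
    import __builtin__ as builtins   # Python 2

i = 'cats'
if 'unicode' in dir(builtins):  # True in Python 2, False in Python 3
    i = unicode(i)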

User · Answer

If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s,bytes) or isinstance(s,unicode) without wrapping them in either try/except or a python version test, because unicode is undefined in Python 3, and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for str, so the check matches every byte string).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type, instead of comparing the type itself. Here's an example:

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Those are both unpythonic, and most of the time there's probably a better way.
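
One possible alternative, sketched here under the assumption that a try/except NameError at import time is acceptable, is to bind the type names once and use ordinary isinstance checks afterwards. The names text_type, binary_type, and to_native_str are chosen for this example; text_type and binary_type mirror the naming used by six:

```python
try:
    text_type = unicode      # Python 2: the name exists
    binary_type = str
except NameError:
    text_type = str          # Python 3: str is the text type
    binary_type = bytes


def to_native_str(s, encoding="ascii"):
    """Convert bytes (Python 3) or unicode (Python 2) to the native str type."""
    if isinstance(s, str):
        return s                      # already the native str
    if isinstance(s, text_type):
        return s.encode(encoding)     # only reachable on Python 2 (unicode)
    if isinstance(s, binary_type):
        return s.decode(encoding)     # only reachable on Python 3 (bytes)
    raise TypeError("expected a string, got %r" % type(s))


print(to_native_str(b"abc"))   # abc
print(to_native_str("abc"))    # abc
```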

User · Answer

In Python 3.x, all strings are sequences of Unicode characters, and doing the isinstance check for str (which means unicode string by default) should suffice.

isinstance(x, str)

With regards to Python 2.x, most people seem to be using an if statement that has two checks: one for str and one for unicode. If you want to check if you have a "string-like" object all with one statement though, you can do the following:

isinstance(x, basestring)

User · Answer

For py2/py3 compatibility simply use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (unicode) string

User · Answer

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print("ordinary string")
    elif isinstance(s, unicode):
        print("unicode string")
    else:
        print("not a string")

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.

User · Answer

Unicode is not an encoding - to quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.

User · Answer

You could use Universal Encoding Detector, but be aware that it will just give you a best guess, not the actual encoding, because it's impossible to know the encoding of a string "abc", for example. You will need to get encoding information elsewhere, e.g. the HTTP protocol uses the Content-Type header for that.
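
As a rough illustration of getting the encoding from metadata rather than guessing, the sketch below reads the charset parameter from a Content-Type header value using the standard library's email.message.Message. The helper names charset_from_content_type and decode_body and the UTF-8 fallback are assumptions made for this example:

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(body, content_type, fallback="utf-8"):
    """Decode bytes using the declared charset, or the fallback if none is declared."""
    charset = charset_from_content_type(content_type, fallback)
    return body.decode(charset)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))   # iso-8859-1
print(charset_from_content_type("text/html"))                       # None
print(decode_body(b"\xdc", "text/plain; charset=ISO-8859-1"))       # Ü
```

If the header names a charset that Python does not know, bytes.decode raises LookupError, so real code would need to handle that case as well as a wrongly declared charset.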
