How to get string objects instead of Unicode from JSON

Question

I'm using Python 2 to parse JSON from ASCII-encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.

Is it possible to get string objects instead of Unicode ones?

Example:

    >>> import json
    >>> original_list = ['a', 'b']
    >>> json_list = json.dumps(original_list)
    >>> json_list
    '["a", "b"]'
    >>> new_list = json.loads(json_list)
    >>> new_list
    [u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update: This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution for today is to use a recent version of Python, i.e. Python 3 and forward.

User · Answer

As Mark (Amery) correctly notes: using PyYAML's deserializer on a JSON dump works only if you have ASCII only, at least out of the box.

Two quick comments on the PyYAML approach:

- NEVER use yaml.load on data from the field. It is a feature(!) of YAML to execute arbitrary code hidden within the structure.
- You can make it work for non-ASCII as well via this:

      def to_utf8(loader, node):
          return loader.construct_scalar(node).encode('utf-8')
      yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)

But performance-wise it bears no comparison to Mark Amery's answer. Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

    dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
    dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

So deserialization, including fully walking the tree and encoding, is well within the order of magnitude of json's C-based implementation. I find this remarkably fast, and it is also more robust than yaml.load on deeply nested structures. It is also less prone to security errors, looking at yaml.load.

While I would appreciate a pointer to a C-only converter, the byteify function should be the default answer.

This holds especially true if your JSON structure is from the field, containing user input, because then you probably need to walk over your structure anyway, independent of your desired internal data structures ("unicode sandwich" or byte strings only).

Why? Unicode normalisation. For the unaware: take a painkiller and read up on it.

So by using the byteify recursion you kill two birds with one stone:

- you get your bytestrings from nested JSON dumps
- you get user input values normalised, so that you find the stuff in your storage

In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC, but that is heavily dependent on the sample data, I guess.
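The normalisation point above can be illustrated with a small sketch. It assumes UTF-8 as the target encoding; the helper name `normalize_and_encode` and the sample strings are made up for illustration and are not part of the answer.

```python
import unicodedata


def normalize_and_encode(text, encoding="utf-8"):
    """Normalise text to NFC, then encode it (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).encode(encoding)


# Two spellings of the same visible name:
decomposed = u"Jose\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = u"Jos\u00e9"   # single code point U+00E9

# The raw strings differ, so a naive lookup in storage would miss.
print(decomposed == precomposed)

# After NFC normalisation both encode to the same byte string.
print(normalize_and_encode(decomposed) == normalize_and_encode(precomposed))
```

Normalising inside the same tree walk that does the encoding means user-supplied keys and values compare equal regardless of how the client composed them.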

User · Answer

There's no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings:

    def byteify(input):
        if isinstance(input, dict):
            return {byteify(key): byteify(value)
                    for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
        else:
            return input

Just call this on the output you get from a json.load or json.loads call.

A couple of notes:

- To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
- Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.
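As a usage illustration, here is a minimal sketch of the same recursive idea written so that it runs on both Python 2.7 and Python 3. The `text_type` alias and the sample document are assumptions added for this example; on Python 3 the results are `bytes` objects, which is rarely what you want there.

```python
import json

# unicode on Python 2, str on Python 3 (alias introduced for this sketch)
text_type = type(u"")


def byteify(value):
    """Recursively encode every text string in a decoded JSON value as UTF-8."""
    if isinstance(value, dict):
        return {byteify(k): byteify(v) for k, v in value.items()}
    if isinstance(value, list):
        return [byteify(item) for item in value]
    if isinstance(value, text_type):
        return value.encode("utf-8")
    # numbers, booleans and None pass through unchanged
    return value


decoded = json.loads('{"name": "\\u00d6sterreich", "tags": ["a", "b"], "n": 1}')
result = byteify(decoded)
print(result)
```

Keys are converted as well as values, which matters when the decoded dict is later passed as `**kwargs` on Python 2.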

User · Answer

I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON:

    if NAME_CLASS_MAP.has_key(cls):
        kwargs = {}
        for i in obj.keys():
            kwargs[str(i)] = obj[i]
        o = NAME_CLASS_MAP[cls](**kwargs)
        o.save()

kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs).

Not as robust as the solution from Wells, but much smaller.

User · Answer

Mike Brennan's answer is close, but there is no reason to re-traverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:

"object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority."

With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:

    def deunicodify_hook(pairs):
        new_pairs = []
        for key, value in pairs:
            if isinstance(value, unicode):
                value = value.encode('utf-8')
            if isinstance(key, unicode):
                key = key.encode('utf-8')
            new_pairs.append((key, value))
        return dict(new_pairs)

    In [52]: open('test.json').read()
    Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'

    In [53]: json.load(open('test.json'))
    Out[53]: {u'1': u'hello', u'abc': [1, 2, 3], u'boo': [1, u'hi', u'moo', {u'5': u'some'}], u'def': {u'hi': u'mom'}}

    In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
    Out[54]: {'1': 'hello', 'abc': [1, 2, 3], 'boo': [1, 'hi', 'moo', {'5': 'some'}], 'def': {'hi': 'mom'}}

Notice that I never have to call the hook recursively, since every object will get handed to the hook when you use the object_pairs_hook. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don't have to recurse to make it happen.

EDIT: A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:

    for key, value in pairs:

to

    for key, value in pairs.iteritems():

Then use object_hook instead of object_pairs_hook:

    In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
    Out[66]: {'1': 'hello', 'abc': [1, 2, 3], 'boo': [1, 'hi', 'moo', {'5': 'some'}], 'def': {'hi': 'mom'}}

Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which, if you were parsing a huge document, might be worthwhile.
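To see why the hook needs no recursion of its own, the following sketch records the order in which the decoder hands objects to `object_pairs_hook`. The hook name `recording_hook` and the sample document are made up for illustration; only `json.loads` and its `object_pairs_hook` parameter come from the standard library.

```python
import json

calls = []


def recording_hook(pairs):
    # pairs is a list of (key, value) tuples for one JSON object;
    # record just the keys so the call order is easy to read
    calls.append([key for key, _ in pairs])
    return dict(pairs)


doc = '{"outer": {"inner": {"leaf": 1}}, "items": [{"x": 2}]}'
result = json.loads(doc, object_pairs_hook=recording_hook)

for keys in calls:
    print(keys)
```

Nested objects, including those inside lists, reach the hook before the object that contains them, so by the time an outer object is processed its nested dicts have already been through the hook.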

User · Answer

Here is a recursive encoder written in C: https://github.com/axiros/nested_encode

The performance overhead for "average" structures is around 10% compared to json.loads.

    python speed.py
    json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster...
    json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster...
    time overhead in percent: 9%

using this test structure:

    import json, nested_encode, time

    s = """
    {
      "firstName": "Jos\\u0301",
      "lastName": "Smith",
      "isAlive": true,
      "age": 25,
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "\\u00d6sterreich",
        "state": "NY",
        "postalCode": "10021-3100"
      },
      "phoneNumbers": [
        {
          "type": "home",
          "number": "212 555-1234"
        },
        {
          "type": "office",
          "number": "646 555-4567"
        }
      ],
      "children": [],
      "spouse": null,
      "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
    }
    """

    t1 = time.time()
    for i in xrange(10000):
        u = json.loads(s)
    dt_json = time.time() - t1

    t1 = time.time()
    for i in xrange(10000):
        b = nested_encode.encode_nested(json.loads(s))
    dt_json_enc = time.time() - t1

    print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
    print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

    print "time overhead in percent: %i%%" % (100 * (dt_json_enc - dt_json) / dt_json)

User · Answer

That's because JSON makes no distinction between string objects and unicode objects. They're all strings in JavaScript.

I think JSON is right to return unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact unicode objects (i.e. JSON (JavaScript) strings can store any kind of unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit, since the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice:

    >>> nl = json.loads(js)
    >>> nl
    [u'a', u'b']
    >>> nl = [s.encode('utf-8') for s in nl]
    >>> nl
    ['a', 'b']

User · Answer

This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.

    def _parseJSON(self, obj):
        newobj = {}

        for key, value in obj.iteritems():
            key = str(key)

            if isinstance(value, dict):
                newobj[key] = self._parseJSON(value)
            elif isinstance(value, list):
                if key not in newobj:
                    newobj[key] = []
                    for i in value:
                        newobj[key].append(self._parseJSON(i))
            elif isinstance(value, unicode):
                val = str(value)
                if val.isdigit():
                    val = int(val)
                else:
                    try:
                        val = float(val)
                    except ValueError:
                        val = str(val)
                newobj[key] = val

        return newobj

Just pass it a JSON object like so:

    obj = json.loads(content, parse_float=float, parse_int=int)
    obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.

User · Answer

I'm afraid there's no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the JSON spec specifically mentions that "A string is a collection of zero or more Unicode characters"; support for unicode is assumed as part of the format itself. simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid, sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
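The facade idea could look roughly like the sketch below. Everything here is hypothetical: `encode_text_args` is an illustrative wrapper, and `legacy_length` merely stands in for a library function that insists on byte strings. It only handles top-level arguments; nested containers would still need a tree walk.

```python
import functools

# unicode on Python 2, str on Python 3 (alias introduced for this sketch)
text_type = type(u"")


def encode_text_args(func, encoding="utf-8"):
    """Wrap func so that text arguments are encoded before the call."""
    def enc(value):
        return value.encode(encoding) if isinstance(value, text_type) else value

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        new_args = [enc(a) for a in args]
        new_kwargs = {k: enc(v) for k, v in kwargs.items()}
        return func(*new_args, **new_kwargs)
    return wrapper


def legacy_length(data):
    # stand-in for an old library call that only accepts byte strings
    if not isinstance(data, bytes):
        raise TypeError("byte string required")
    return len(data)


safe_length = encode_text_args(legacy_length)
print(safe_length(u"\u00d6l"))
```

Wrapping at the boundary keeps the rest of the program working with unicode text and confines the encoding decision to one place.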

User · Answer

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML, it works nicely:

    >>> import json
    >>> import yaml
    >>> list_org = ['a', 'b']
    >>> list_dump = json.dumps(list_org)
    >>> list_dump
    '["a", "b"]'
    >>> json.loads(list_dump)
    [u'a', u'b']
    >>> yaml.safe_load(list_dump)
    ['a', 'b']

Notes

Some things to note though:

- I get string objects because all my entries are ASCII encoded. If I used unicode encoded entries, I would get them back as unicode objects; there is no conversion!
- You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
- If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function.

I used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

User · Answer

With Python 3.6, sometimes I still run into this problem. For example, when getting a response from a REST API and loading the response text to JSON, I still get the unicode strings. Found a simple solution using json.dumps():

    response_message = json.loads(json.dumps(response.text))
    print(response_message)
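A word of caution on this round trip, as a minimal sketch: `response_text` below is a made-up stand-in for `response.text`, assumed to hold a JSON document as a string. Dumping a string and loading it again only serialises and deserialises that one string, so the JSON inside it is never parsed.

```python
import json

# hypothetical stand-in for response.text from an HTTP client
response_text = '{"message": "hello"}'

# dumps() wraps the whole text in a JSON string literal; loads() unwraps it again
round_tripped = json.loads(json.dumps(response_text))

# parsing the text directly yields the actual data structure
parsed = json.loads(response_text)

print(type(round_tripped).__name__)
print(type(parsed).__name__)
```

If printing the round-tripped value looks "clean", that is likely because `print` shows a string's contents rather than its repr, not because any conversion took place.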

User · Answer

Just use pickle instead of json for dump and load, like so:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d, open("testjson.txt", "w"))

    print json.load(open("testjson.txt", "r"))

    pickle.dump(d, open("testpickle.txt", "w"))

    print pickle.load(open("testpickle.txt", "r"))

The output it produces is (strings and integers are handled correctly):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}
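The type-preserving behaviour of pickle can be sketched in memory as follows; the sample dict, including the extra `pair` entry, is an assumption for illustration. Two caveats apply: this only helps when you control both the writing and the reading side (it does not parse existing JSON files), and unpickling data from an untrusted source can execute arbitrary code.

```python
import json
import pickle

d = {"field1": "value1", "field2": 2, "pair": (1, 2)}

# pickle stores Python objects with their exact types
via_pickle = pickle.loads(pickle.dumps(d))

# JSON only knows its own types: tuples come back as lists,
# and on Python 2 str values come back as unicode
via_json = json.loads(json.dumps(d))

print(via_pickle)
print(via_json)
```

When writing pickles to disk, opening the file in binary mode ("wb" and "rb") avoids platform-dependent newline translation, and it is required on Python 3.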

User · Answer

So, I've run into the same problem. Guess what was the first Google result.

Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for typesafe JSON conversion: json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict indexes though.

    # removes any objects, turns unicode back into str
    def filter_data(obj):
        if type(obj) in (int, float, str, bool):
            return obj
        elif type(obj) == unicode:
            return str(obj)
        elif type(obj) in (list, tuple, set):
            obj = list(obj)
            for i, v in enumerate(obj):
                obj[i] = filter_data(v)
        elif type(obj) == dict:
            for i, v in obj.iteritems():
                obj[i] = filter_data(v)
        else:
            print "invalid object in data, converting to string"
            obj = str(obj)
        return obj

User · Answer

I had a JSON dict as a string. The keys and values were unicode objects, like in the following example:

    myStringDict = "{u'key':u'value'}"

I could use the byteify function suggested above by converting the string to a dict object using ast.literal_eval(myStringDict).

User · Answer

I rewrote Wells's _parse_json() to handle cases where the JSON object itself is an array (my use case).

    def _parseJSON(self, obj):
        if isinstance(obj, dict):
            newobj = {}
            for key, value in obj.iteritems():
                key = str(key)
                newobj[key] = self._parseJSON(value)
        elif isinstance(obj, list):
            newobj = []
            for value in obj:
                newobj.append(self._parseJSON(value))
        elif isinstance(obj, unicode):
            newobj = str(obj)
        else:
            newobj = obj
        return newobj

User · Answer

I've adapted the code from the answer of Mark Amery, particularly in order to get rid of isinstance for the pros of duck-typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that:

"If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences."

Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ascii, incorrectly I think; it is dependent on the codepage setting); cp1250, used e.g. in Windows (sometimes referred to as ansi, dependent on the locale settings); and iso-8859-2, sometimes used on HTTP servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.

    # coding: utf-8
    """
    This file should be encoded correctly with utf-8.
    """
    import json

    def encode_items(input, encoding='utf-8'):
        u"""original from: https://stackoverflow.com/a/13101776/611007
        adapted by SO/u/611007 (20150623)
        >>>
        >>> ## run this with `python -m doctest <this file>.py` from command line
        >>>
        >>> txt = u"Tüskéshátú kígyóbűvölő"
        >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
        >>> txt3 = u"uúuutifu"
        >>> txt4 = b'u\\xfauutifu'
        >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
        >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
        >>> txt4u = txt4.decode('cp1250')
        >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
        >>> txt5 = b"u\\xc3\\xbauutifu"
        >>> txt5u = txt5.decode('utf-8')
        >>> txt6 = u"u\\u251c\\u2551uutifu"
        >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
        >>> assert txt == there_and_back_again(txt)
        >>> assert txt == there_and_back_again(txt2)
        >>> assert txt3 == there_and_back_again(txt3)
        >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
        >>> assert txt3 == txt4u, (txt3, txt4u)
        >>> assert txt3 == there_and_back_again(txt5)
        >>> assert txt3 == there_and_back_again(txt5u)
        >>> assert txt3 == there_and_back_again(txt4u)
        >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
        >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
        >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
        >>> assert {'a': txt2.encode('utf-8')} == encode_items({'a': txt}, encoding='utf-8')
        >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
        >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
        >>> assert [{'a': txt2.encode('utf-8')}] == encode_items([{'a': txt}], encoding='utf-8')
        >>> assert {'b': {'a': txt2.encode('utf-8')}} == encode_items({'b': {'a': txt}}, encoding='utf-8')
        """
        try:
            # duck-typing: anything with iteritems is treated as a mapping
            input.iteritems
            return {encode_items(k, encoding): encode_items(v, encoding)
                    for (k, v) in input.iteritems()}
        except AttributeError:
            if isinstance(input, unicode):
                return input.encode(encoding)
            elif isinstance(input, str):
                return input
            try:
                # any other iterable is converted element by element
                iter(input)
                return [encode_items(e, encoding) for e in input]
            except TypeError:
                return input

    def alt_dumps(obj, **kwargs):
        """
        >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
        '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
        """
        if 'ensure_ascii' in kwargs:
            del kwargs['ensure_ascii']
        return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

I'd also like to highlight the answer of Jarret Hardie, which references the JSON spec, quoting:

"A string is a collection of zero or more Unicode characters"

In my use case I had files with JSON. They are utf-8 encoded files. ensure_ascii results in properly escaped but not very readable JSON files; that is why I've adapted Mark Amery's answer to fit my needs.

The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful for someone.

User · Answer

You can use the object_hook parameter for json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will always pass the object_hook dicts only, and it will recursively pass in nested dicts, so you don't have to recurse into nested dicts yourself. I don't think I would convert unicode strings to numbers like Wells shows. If it's a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external lib expects.

So, for example:

    def _decode_list(data):
        rv = []
        for item in data:
            if isinstance(item, unicode):
                item = item.encode('utf-8')
            elif isinstance(item, list):
                item = _decode_list(item)
            elif isinstance(item, dict):
                item = _decode_dict(item)
            rv.append(item)
        return rv

    def _decode_dict(data):
        rv = {}
        for key, value in data.iteritems():
            if isinstance(key, unicode):
                key = key.encode('utf-8')
            if isinstance(value, unicode):
                value = value.encode('utf-8')
            elif isinstance(value, list):
                value = _decode_list(value)
            elif isinstance(value, dict):
                value = _decode_dict(value)
            rv[key] = value
        return rv

    obj = json.loads(s, object_hook=_decode_dict)

User · Answer

A solution with object_hook

    import json

    def json_load_byteified(file_handle):
        return _byteify(
            json.load(file_handle, object_hook=_byteify),
            ignore_dicts=True
        )

    def json_loads_byteified(json_text):
        return _byteify(
            json.loads(json_text, object_hook=_byteify),
            ignore_dicts=True
        )

    def _byteify(data, ignore_dicts=False):
        # if this is a unicode string, return its string representation
        if isinstance(data, unicode):
            return data.encode('utf-8')
        # if this is a list of values, return list of byteified values
        if isinstance(data, list):
            return [_byteify(item, ignore_dicts=True) for item in data]
        # if this is a dictionary, return dictionary of byteified keys and values
        # but only if we haven't already byteified it
        if isinstance(data, dict) and not ignore_dicts:
            return {
                _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
                for key, value in data.iteritems()
            }
        # if it's anything else, return it in its original form
        return data

Example usage:

    >>> json_loads_byteified('{"Hello": "World"}')
    {'Hello': 'World'}
    >>> json_loads_byteified('"I am a top-level string"')
    'I am a top-level string'
    >>> json_loads_byteified('7')
    7
    >>> json_loads_byteified('["I am inside a list"]')
    ['I am inside a list']
    >>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
    [[[[[[[['I am inside a big nest of lists']]]]]]]]
    >>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
    {'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
    >>> json_load_byteified(open('somefile.json'))
    {'more json': 'from a file'}

How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer decodes the JSON text fully first with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

- A copy of the entire decoded structure gets created in memory.
- If your JSON object is really deeply nested (500 levels or more), then you'll hit Python's maximum recursion depth.

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

"object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders."

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.

Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
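The need for that final call on the overall result can be checked with a short sketch. The `hook` function and the sample inputs are made up for illustration; the point is only that `object_hook` fires for JSON objects and for nothing else.

```python
import json

seen = []


def hook(d):
    # called once per decoded JSON object (dict), innermost first
    seen.append(d)
    return d


# No JSON objects in these documents, so the hook is never invoked
json.loads('"top-level string"', object_hook=hook)
json.loads('[1, 2, ["nested", "list"]]', object_hook=hook)
count_without_objects = len(seen)

# Three JSON objects here: {"a": 1}, {"c": 2} and {"b": {...}}
json.loads('[{"a": 1}, {"b": {"c": 2}}]', object_hook=hook)

print(count_without_objects, len(seen))
```

Strings and lists that are not inside any dict therefore stay untouched by the hook, which is why a wrapper has to process the top-level value once more after decoding.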

User · Answer

There exists an easy work-around.

TL;DR: Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.

While not a "perfect" answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7:

    import json, ast
    d = { 'field' : 'value' }
    print "JSON Fail: ", json.loads(json.dumps(d))
    print "AST Win:", ast.literal_eval(json.dumps(d))

gives:

    JSON Fail:  {u'field': u'value'}
    AST Win: {'field': 'value'}

This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
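One limit of this work-around is worth a sketch: `ast.literal_eval` parses Python literals, not JSON, so documents containing `true`, `false` or `null` are rejected. The sample data and the `literal_eval_ok` flag below are made up for illustration.

```python
import ast
import json

# A document made only of strings happens to be a valid Python literal too
simple = json.dumps({"field": "value"})
print(ast.literal_eval(simple))

# JSON spells booleans and null differently from Python
doc = json.dumps({"isAlive": True, "spouse": None})
print(doc)

try:
    ast.literal_eval(doc)
    literal_eval_ok = True
except ValueError:
    # 'true' and 'null' are bare names, which literal_eval refuses
    literal_eval_ok = False

print(literal_eval_ok)
```

So the trick is limited to JSON whose values are strings, numbers, lists and objects only.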

User · Answer

Check out this answer to a similar question, which states:

"The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output."

For example, try this:

    print mail_accounts[0]["i"]

You won't see a u.

User · Answer

Support Python 2 & 3 using a hook (from https://stackoverflow.com/a/33571117/558397):

    import requests
    import six
    from six import iteritems

    requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
    r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

    def _byteify(data):
        # if this is a unicode string, return its string representation
        if isinstance(data, six.string_types):
            return str(data.encode('utf-8').decode())

        # if this is a list of values, return list of byteified values
        if isinstance(data, list):
            return [_byteify(item) for item in data]

        # if this is a dictionary, return dictionary of byteified keys and values
        # but only if we haven't already byteified it
        if isinstance(data, dict):
            return {
                _byteify(key): _byteify(value) for key, value in iteritems(data)
            }
        # if it's anything else, return it in its original form
        return data

    w = r.json(object_hook=_byteify)
    print(w)

Returns:

    {'three': '', 'key': 'value', 'one': 'two'}

User · Answer

The gotcha is that simplejson and json are two different modules, at least in the manner they deal with unicode. You have json in Python 2.6+, and this gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.
