[python] How to get string objects instead of Unicode from JSON?

As Mark (Amery) correctly notes: Using PyYaml's deserializer on a json dump works only if you have ASCII only. At least out of the box.

Two quick comments on the PyYaml approach:

  1. NEVER use yaml.load on data from the field. Its a feature(!) of yaml to execute arbitrary code hidden within the structure.

  2. You can make it work also for non ASCII via this:

    def to_utf8(loader, node):
        return loader.construct_scalar(node).encode('utf-8')
    yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
    

But performance wise its of no comparison to Mark Amery's answer:

Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

So deserialization including fully walking the tree and encoding, well within the order of magnitude of json's C based implementation. I find this remarkably fast and its also more robust than the yaml load at deeply nested structures. And less security error prone, looking at yaml.load.

=> While I would appreciate a pointer to a C only based converter the byteify function should be the default answer.

This holds especially true if your json structure is from the field, containing user input. Because then you probably need to walk anyway over your structure - independent on your desired internal data structures ('unicode sandwich' or byte strings only).

Why?

Unicode normalisation. For the unaware: Take a painkiller and read this.

So using the byteify recursion you kill two birds with one stone:

  1. get your bytestrings from nested json dumps
  2. get user input values normalised, so that you find the stuff in your storage.

In my tests it turned out that replacing the input.encode('utf-8') with a unicodedata.normalize('NFC', input).encode('utf-8') was even faster than w/o NFC - but thats heavily dependent on the sample data I guess.

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to json

Use NSInteger as array index Uncaught SyntaxError: Unexpected end of JSON input at JSON.parse (<anonymous>) HTTP POST with Json on Body - Flutter/Dart Importing json file in TypeScript json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190) Angular 5 Service to read local .json file How to import JSON File into a TypeScript file? Use Async/Await with Axios in React.js Uncaught SyntaxError: Unexpected token u in JSON at position 0 how to remove json object key and value.?

Examples related to serialization

laravel Unable to prepare route ... for serialization. Uses Closure TypeError: Object of type 'bytes' is not JSON serializable Best way to save a trained model in PyTorch? Convert Dictionary to JSON in Swift Java: JSON -> Protobuf & back conversion Understanding passport serialize deserialize How to generate serial version UID in Intellij Parcelable encountered IOException writing serializable object getactivity() Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects Cannot deserialize the JSON array (e.g. [1,2,3]) into type ' ' because type requires JSON object (e.g. {"name":"value"}) to deserialize correctly

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to python-2.x

Combine several images horizontally with Python How to print variables without spaces between values How to return dictionary keys as a list in Python? How to add an element to the beginning of an OrderedDict? How to read a CSV file from a URL with Python? Malformed String ValueError ast.literal_eval() with String representation of Tuple Relative imports for the billionth time How do you use subprocess.check_output() in Python? write() versus writelines() and concatenated strings How to select a directory and store the location using tkinter in Python