As Mark (Amery) correctly notes: Using PyYaml's deserializer on a json dump works only if you have ASCII only. At least out of the box.
Two quick comments on the PyYaml approach:
NEVER use yaml.load on data from the field. Its a feature(!) of yaml to execute arbitrary code hidden within the structure.
You can make it work also for non ASCII via this:
def to_utf8(loader, node):
return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
But performance wise its of no comparison to Mark Amery's answer:
Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):
dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
dt[byteify recursion(Mark Amery)] =~ 5 * dt[j]
So deserialization including fully walking the tree and encoding, well within the order of magnitude of json's C based implementation. I find this remarkably fast and its also more robust than the yaml load at deeply nested structures. And less security error prone, looking at yaml.load.
=> While I would appreciate a pointer to a C only based converter the byteify function should be the default answer.
This holds especially true if your json structure is from the field, containing user input. Because then you probably need to walk anyway over your structure - independent on your desired internal data structures ('unicode sandwich' or byte strings only).
Why?
Unicode normalisation. For the unaware: Take a painkiller and read this.
So using the byteify recursion you kill two birds with one stone:
In my tests it turned out that replacing the input.encode('utf-8') with a unicodedata.normalize('NFC', input).encode('utf-8') was even faster than w/o NFC - but thats heavily dependent on the sample data I guess.