Binary Data in JSON String. Something better than Base64

Question

The JSON format natively doesn't support binary data. The binary data has to be escaped so that it can be placed into a string element (i.e. zero or more Unicode chars in double quotes using backslash escapes) in JSON.

An obvious method to escape binary data is to use Base64. However, Base64 has a high processing overhead. It also expands 3 bytes into 4 characters, which increases the data size by around 33%.

One use case for this is the v0.8 draft of the CDMI cloud storage API specification. You create data objects via a REST web service using JSON, e.g.

    PUT /MyContainer/BinaryObject HTTP/1.1
    Host: cloud.example.com
    Accept: application/vnd.org.snia.cdmi.dataobject+json
    Content-Type: application/vnd.org.snia.cdmi.dataobject+json
    X-CDMI-Specification-Version: 1.0

    {
        "mimetype" : "application/octet-stream",
        "metadata" : [ ],
        "value" :   "TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
        IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
        dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
        dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
        ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4="
    }

Are there better ways and standard methods to encode binary data into JSON strings?

User · Answer

My solution now: XHR2 using ArrayBuffer. The ArrayBuffer, as a binary sequence, contains multipart content (video, audio, graphics, text and so on) with multiple content types, all in one response. Modern browsers provide DataView, StringView and Blob for handling the different components. See also http://rolfrost.de/video.html for more details.

User · Answer

I ran into the same problem, and thought I'd share a solution: multipart/form-data.

By sending a multipart form you first send your JSON meta-data as a string, and then separately send the images, wavs, etc. as raw binary, indexed by the Content-Disposition name.

Here's a nice tutorial on how to do this in obj-c, and here is a blog article that explains how to partition the string data with the form boundary, and separate it from the binary data.

The only change you really need to make is on the server side: you will have to capture your meta-data, which should reference the POSTed binary data appropriately (by using a Content-Disposition boundary).

Granted, it requires additional work on the server side, but if you are sending many images or large images, this is worth it. Combine this with gzip compression if you want.

IMHO sending base64 encoded data is a hack; the multipart/form-data RFC was created for issues such as this: sending binary data in combination with text or meta-data.
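To make the layout concrete, here is a minimal sketch of such a request body assembled by hand in Node.js. The part names `metadata` and `image`, the file name and the boundary string are assumptions chosen for illustration, not something the answer prescribes; a real client would normally let its HTTP library pick the boundary.

```javascript
// Sketch: a multipart/form-data body with a JSON part followed by a raw binary part.
// Part names, file name and boundary are illustrative assumptions.
function buildMultipartBody(boundary, metadata, fileName, fileBytes) {
  const head = Buffer.from(
    `--${boundary}\r\n` +
      'Content-Disposition: form-data; name="metadata"\r\n' +
      "Content-Type: application/json\r\n\r\n" +
      JSON.stringify(metadata) + "\r\n" +
      `--${boundary}\r\n` +
      `Content-Disposition: form-data; name="image"; filename="${fileName}"\r\n` +
      "Content-Type: application/octet-stream\r\n\r\n",
    "utf8"
  );
  const tail = Buffer.from(`\r\n--${boundary}--\r\n`, "utf8");
  // The binary part is copied byte for byte: no Base64 and no escaping.
  return Buffer.concat([head, Buffer.from(fileBytes), tail]);
}

// Bytes that would all need escaping or two-byte encoding inside a JSON string.
const payload = Uint8Array.from([0x00, 0xff, 0x80, 0x0a, 0x22, 0x5c]);
const body = buildMultipartBody(
  "----example-boundary",
  { title: "photo", part: "image" }, // the JSON refers to the binary part by name
  "photo.bin",
  payload
);
console.log(body.length - payload.length, "bytes of framing, independent of payload size");
```

The framing cost is a fixed number of bytes per part, so for large payloads the relative overhead approaches zero, whereas Base64 always adds about a third.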

User · Answer

yEnc might work for you:

http://en.wikipedia.org/wiki/Yenc

"yEnc is a binary-to-text encoding scheme for transferring binary files in [text]. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit Extended ASCII encoding method. yEnc's overhead is often (if each byte value appears approximately with the same frequency on average) as little as 1–2%, compared to 33%–40% overhead for 6-bit encoding methods like uuencode and Base64. ... By 2003 yEnc became the de facto standard encoding system for binary files on Usenet."

However, yEnc is an 8-bit encoding, so storing it in a JSON string has the same problems as storing the original binary data: doing it the naïve way means about a 100% expansion, which is worse than base64.

User · Answer

Since you're looking for the ability to shoehorn binary data into a strictly text-based and very limited format, I think Base64's overhead is minimal compared to the convenience you're expecting to maintain with JSON. If processing power and throughput are a concern, then you'd probably need to reconsider your file formats.

User · Answer

There are 94 Unicode characters which can be represented as one byte according to the JSON spec (if your JSON is transmitted as UTF-8). With that in mind, I think the best you can do space-wise is base85, which represents four bytes as five characters. However, this is only a 7% improvement over base64, it's more expensive to compute, and implementations are less common than for base64, so it's probably not a win.

You could also simply map every input byte to the corresponding character in U+0000-U+00FF, then do the minimum encoding required by the JSON standard to pass those characters. The advantage here is that the required decoding is nil beyond builtin functions, but the space efficiency is bad: a 105% expansion (if all input bytes are equally likely) vs. 25% for base85 or 33% for base64.

Final verdict: base64 wins, in my opinion, on the grounds that it's common, easy, and not bad enough to warrant replacement.

See also: Base91 and Base122
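The expansion figures can be checked with a few lines of Node.js. This is only a sketch under stated assumptions: it takes each of the 256 byte values exactly once as a stand-in for uniformly distributed input, uses `JSON.stringify` as "the minimum encoding required by the JSON standard", and computes the base85 size arithmetically because Node has no built-in base85 codec.

```javascript
// Each of the 256 byte values once, i.e. a uniform distribution.
const bytes = Uint8Array.from({ length: 256 }, (_, i) => i);

// Naive scheme: byte n becomes code point U+00nn, then JSON-escape and UTF-8 encode.
const asString = String.fromCharCode(...bytes);
const jsonLiteral = JSON.stringify(asString);
const naiveSize = Buffer.byteLength(jsonLiteral, "utf8") - 2; // minus the two surrounding quotes

// Base64 via Buffer; base85 computed as 4 input bytes -> 5 output characters.
const base64Size = Buffer.from(bytes).toString("base64").length;
const base85Size = Math.ceil(bytes.length / 4) * 5;

// Expansion in percent relative to the raw input size.
const expansion = (size) => Math.round((size / bytes.length - 1) * 100);
console.log({
  naive: expansion(naiveSize),
  base64: expansion(base64Size),
  base85: expansion(base85Size),
});
```

The naive figure is dominated by two effects: every byte from 128 to 255 costs two bytes in UTF-8, and most control characters cost six bytes as `\u00XX` escapes.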

User · Answer

I dug a little deeper (during an implementation of base128) and found that when we send characters whose codes are greater than 127, the browser (Chrome) in fact sends TWO bytes instead of one :(. The reason is that JSON by default uses UTF-8, in which characters with codes above 127 are encoded as two bytes, as was mentioned in chmike's answer. I tested it this way: type chrome://net-export/ in the Chrome URL bar, select "Include raw bytes", start capturing, send the POST requests (using the snippet at the bottom), stop capturing and save the JSON file with the raw request data. Then we look inside that JSON file:

We can find our base64 request by searching for the string 4142434445464748494a4b4c4d4e, which is the hex coding of ABCDEFGHIJKLMN, and we will see "byte_count": 639 for it.

We can find our above127 request by searching for the string C2BCC2BDC380C381C382C383C384C385C386C387C388C389C38AC38B; these are the UTF-8 hex codes in the request for the characters ¼½ÀÁÂÃÄÅÆÇÈÉÊË (however, the ascii hex codes of these characters are c1c2c3c4c5c6c7c8c9cacbcccdce). Here "byte_count": 703, so it is 64 bytes longer than the base64 request, because characters with codes above 127 are encoded as 2 bytes in the request :(

So in fact we gain nothing by sending characters with codes >127 :(. For base64 strings we do not observe such negative behaviour (probably for base85 too, I didn't check it). One possible solution to this problem is to send the data in the binary part of a POST multipart/form-data request, as described in Ælex's answer (however, in that case we usually don't need any base coding at all).

An alternative approach may rely on mapping each two-byte portion of the data onto one valid UTF-8 character, using something like base65280 / base65k, but it would probably be less effective than base64 due to the UTF-8 specification.

    // Sends 64 characters that each take one byte in UTF-8.
    function postBase64() {
      let formData = new FormData();
      let req = new XMLHttpRequest();

      formData.append("base64ch", 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/');
      req.open("POST", '/testBase64ch');
      req.send(formData);
    }

    // Sends 64 characters with codes above 127, each taking two bytes in UTF-8.
    function postAbove127() {
      let formData = new FormData();
      let req = new XMLHttpRequest();

      formData.append("above127", "¼½ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüý");
      req.open("POST", '/testAbove127');
      req.send(formData);
    }

    <button onclick=postBase64()>POST base64 chars</button>
    <button onclick=postAbove127()>POST chars with codes>127</button>
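The two hex strings from the capture can be reproduced without a browser. The following Node.js sketch (an illustration, not part of the original test) encodes the same 14 leading characters as UTF-8 and prints their hex form; the characters are built from their code points so the example does not depend on the file's own encoding.

```javascript
const ascii = "ABCDEFGHIJKLMN";
// U+00BC, U+00BD, then U+00C0..U+00CB: the 14 characters whose UTF-8 hex shows up in the capture.
const above127 = String.fromCharCode(
  0xbc,
  0xbd,
  ...Array.from({ length: 12 }, (_, i) => 0xc0 + i)
);

const asciiHex = Buffer.from(ascii, "utf8").toString("hex");
const above127Hex = Buffer.from(above127, "utf8").toString("hex").toUpperCase();

// Same number of characters, but twice as many bytes on the wire for the second string.
console.log(ascii.length, "chars,", Buffer.byteLength(ascii, "utf8"), "bytes:", asciiHex);
console.log(above127.length, "chars,", Buffer.byteLength(above127, "utf8"), "bytes:", above127Hex);
```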

User · Answer

Just to add another option that we low level dinosaur programmers use...

An old school method that's been around since three years after the dawn of time would be the Intel HEX format. It was established in 1973, and the UNIX epoch started on January 1, 1970.

Is it more efficient? No.
Is it a well established standard? Yes.
Is it human readable like JSON? Yes-ish, and a lot more readable than most any binary solution.

The JSON would look like:

    {
        "data": [
            ":10010000214601360121470136007EFE09D2190140",
            ":100110002146017E17C20001FF5F16002148011928",
            ":10012000194E79234623965778239EDA3F01B2CAA7",
            ":100130003F0156702B5E712B722B732146013421C7",
            ":00000001FF"
        ]
    }
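For readers who have not met the format: each record is a colon followed by hex pairs for the byte count, a 16-bit address, a record type, the data and a checksum. A minimal sketch of producing and checking such records follows; the function names are made up for this example, and only data records (type 00) with 16-bit addresses are handled.

```javascript
// Checksum of an Intel HEX record: two's complement of the sum of all preceding bytes, modulo 256.
function intelHexChecksum(bytes) {
  const sum = bytes.reduce((acc, b) => acc + b, 0);
  return (0x100 - (sum & 0xff)) & 0xff;
}

const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join("");

// Build a data record (type 00) for up to 255 bytes at a 16-bit address.
function dataRecord(address, data) {
  const fields = [data.length, (address >> 8) & 0xff, address & 0xff, 0x00, ...data];
  return ":" + toHex([...fields, intelHexChecksum(fields)]);
}

// Check the structure and checksum of any single record.
function isValidRecord(record) {
  if (!/^:([0-9A-Fa-f]{2}){5,}$/.test(record)) return false;
  const bytes = record.slice(1).match(/../g).map((pair) => parseInt(pair, 16));
  const checksum = bytes.pop();
  return bytes[0] === bytes.length - 4 && intelHexChecksum(bytes) === checksum;
}

const firstRow = [
  0x21, 0x46, 0x01, 0x36, 0x01, 0x21, 0x47, 0x01,
  0x36, 0x00, 0x7e, 0xfe, 0x09, 0xd2, 0x19, 0x01,
];
console.log(dataRecord(0x0100, firstRow));
console.log(isValidRecord(":00000001FF"));
```

Since every byte becomes two hex digits plus per-record framing, the encoded size is more than double the input, which is the efficiency cost the answer concedes.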

User · Answer

BSON (Binary JSON) may work for you. http://en.wikipedia.org/wiki/BSON

Edit: FYI, the .NET library json.net supports reading and writing BSON if you are looking for some C# server side love.

User · Answer

While it is true that base64 has a ~33% expansion rate, it is not necessarily true that the processing overhead is significantly more than this: it really depends on the JSON library/toolkit you are using. Encoding and decoding are simple, straightforward operations, and they can even be optimized with respect to character encoding (as JSON only supports UTF-8/16/32), since base64 characters are always single-byte for JSON String entries. For example, on the Java platform there are libraries that can do the job rather efficiently, so that the overhead is mostly due to the expanded size.

I agree with two earlier answers:

base64 is a simple, commonly used standard, so it is unlikely you will find something better specifically for use with JSON (base-85 is used by PostScript etc., but the benefits are at best marginal when you think about it).
compression before encoding (and after decoding) may make lots of sense, depending on the data you use.

User · Answer

Smile format

It is compact and very fast to encode and decode.
Speed comparison (Java based, but meaningful nevertheless): https://github.com/eishay/jvm-serializers/wiki/
It is also an extension to JSON that allows you to skip base64 encoding for byte arrays.

Smile encoded strings can be gzipped when space is critical.

User · Answer

The problem with UTF-8 is that it is not the most space efficient encoding. Also, some random binary byte sequences are invalid UTF-8. So you can't just interpret a random binary byte sequence as UTF-8 data, because it will be invalid UTF-8. The benefit of this constraint on the UTF-8 encoding is that it makes it robust: it is possible to locate where multi-byte chars start and end, whatever byte we start looking at.

As a consequence, while encoding a byte value in the range [0..127] needs only one byte in UTF-8, encoding a byte value in the range [128..255] requires 2 bytes! Worse than that: in JSON, control chars, " and \ are not allowed to appear in a string. So the binary data would require some transformation to be properly encoded.

Let's see. If we assume uniformly distributed random byte values in our binary data then, on average, half of the bytes would be encoded in one byte and the other half in two bytes. The UTF-8 encoded binary data would have 150% of the initial size.

Base64 encoding grows only to 133% of the initial size. So Base64 encoding is more efficient.

What about using another Base encoding? In UTF-8, encoding the 128 ASCII values is the most space efficient. In 8 bits you can store 7 bits. So if we cut the binary data into 7-bit chunks and store each of them in one byte of a UTF-8 encoded string, the encoded data would grow only to 114% of the initial size. Better than Base64. Unfortunately we can't use this easy trick, because JSON doesn't allow some ASCII chars. The 33 control characters of ASCII ([0..31] and 127) and the " and \ must be excluded. This leaves us only 128-35 = 93 chars.

So in theory we could define a Base93 encoding which would grow the encoded size to 8/log2(93) = 8*log10(2)/log10(93) = 122%. But a Base93 encoding would not be as convenient as a Base64 encoding. Base64 requires cutting the input byte sequence into 6-bit chunks, for which simple bitwise operations work well. Besides, 133% is not much more than 122%.

This is why I came independently to the common conclusion that Base64 is indeed the best choice to encode binary data in JSON. My answer presents a justification for it. I agree it isn't very attractive from the performance point of view, but consider also the benefit of using JSON, with its human readable string representation that is easy to manipulate in all programming languages.

If performance is critical, then a pure binary encoding should be considered as a replacement for JSON. But with JSON, my conclusion is that Base64 is the best.
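The percentages in this answer follow from one formula: if every output byte is drawn from an alphabet of N symbols, it carries log2(N) bits, so the encoded size is 8/log2(N) times the input. A small Node.js sketch of that arithmetic, with helper names invented for the example, also recomputes the 150% figure by measuring the UTF-8 length of every code point from U+0000 to U+00FF (JSON escaping is deliberately ignored there, as in the answer).

```javascript
// Encoded size relative to the input when each output byte carries log2(alphabetSize) bits.
const relativeSize = (alphabetSize) => 8 / Math.log2(alphabetSize);
const percent = (ratio) => Math.round(ratio * 100);

// Average UTF-8 length of the code points U+0000..U+00FF (one per byte value, no JSON escaping).
let totalBytes = 0;
for (let value = 0; value < 256; value++) {
  totalBytes += Buffer.byteLength(String.fromCharCode(value), "utf8");
}
const rawUtf8 = totalBytes / 256;

console.log("raw bytes as UTF-8:", percent(rawUtf8) + "%");
console.log("Base64 :", percent(relativeSize(64)) + "%");
console.log("7-bit  :", percent(relativeSize(128)) + "%");
console.log("Base93 :", percent(relativeSize(93)) + "%");
```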

User · Answer

If you deal with bandwidth problems, try to compress the data on the client side first, then base64 it.

A nice example of such magic is at http://jszip.stuartk.co.uk/, and more discussion of this topic is at JavaScript implementation of Gzip.
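A minimal sketch of the compress-then-encode order, assuming Node.js and its built-in `zlib` module rather than the browser library linked above; `pack` and `unpack` are names chosen for this example.

```javascript
const zlib = require("zlib");

// Compress first, then Base64-encode the compressed bytes for use in a JSON string.
function pack(bytes) {
  return zlib.gzipSync(Buffer.from(bytes)).toString("base64");
}

// Reverse order on the receiving side: Base64-decode, then decompress.
function unpack(text) {
  return zlib.gunzipSync(Buffer.from(text, "base64"));
}

// Highly repetitive sample data, chosen so that compression has something to work with.
const sample = Buffer.from("binary data in JSON ".repeat(200), "utf8");
const plainBase64 = sample.toString("base64");
const packed = pack(sample);

const message = JSON.stringify({ encoding: "gzip+base64", value: packed });
console.log(plainBase64.length, "chars without compression,", packed.length, "with");
console.log(unpack(JSON.parse(message).value).equals(sample));
```

The gain depends entirely on the data: formats that are already compressed, such as JPEG or MP3, usually shrink very little, so the Base64 overhead remains.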

User · Answer

Refer to: http://snia.org/sites/default/files/Multi-part%20MIME%20Extension%20v1.0g.pdf

It describes a way to transfer binary data between a CDMI client and server using "CDMI content type" operations without requiring base64 conversion of the binary data.

If you can use a "Non-CDMI content type" operation, it is ideal for transferring "data" to and from an object. Metadata can then be added to or retrieved from the object later, as a subsequent "CDMI content type" operation.

User · Answer

(Edit 7 years later: Google Gears is gone. Ignore this answer.)

The Google Gears team ran into the lack-of-binary-data-types problem and has attempted to address it:

Blob API

"JavaScript has a built-in data type for text strings, but nothing for binary data. The Blob object attempts to address this limitation."

Maybe you can weave that in somehow.

User · Answer

The data type really matters. I have tested different scenarios for sending the payload from a RESTful resource. For encoding I used Base64 (Apache) and for compression GZIP (java.util.zip.*). The payload contains information about a film, an image and an audio file. I compressed and encoded the image and audio files, which drastically degraded the performance. Encoding before compression turned out well. Image and audio content were sent as encoded and compressed bytes.

User · Answer

Just to add the resource and complexity standpoint to the discussion: since PUT/POST and PATCH are used for storing new resources and altering them, one should remember that the content transferred is an exact representation of the content that is stored and that is received by issuing a GET operation.

A multi-part message is often used as a savior, but for simplicity's sake, and for more complex tasks, I prefer the idea of giving the content as a whole. It is self-explanatory and it is simple.

And yes, JSON is somewhat crippling here, but in the end JSON itself is verbose, and the overhead of mapping to Base64 is far too small to matter.

To use multi-part messages correctly, one has to either dismantle the object to send, use a property path as the parameter name for automatic recombination, or create another protocol/format just to express the payload.

While I also like the BSON approach, it is not as widely and easily supported as one would like.

Basically, we are just missing something here, but embedding binary data as base64 is well established and the way to go unless you have really identified the need to do a real binary transfer (which is rarely the case).
