[algorithm] Why do we use Base64?

Wikipedia says

Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport.

But is it not that data is always stored/transmitted in binary because the memory that our machines have store binary and it just depends how you interpret it? So, whether you encode the bit pattern 010011010110000101101110 as Man in ASCII or as TWFu in Base64, you are eventually going to store the same bit pattern.

If the ultimate encoding is in terms of zeros and ones and every machine and media can deal with them, how does it matter if the data is represented as ASCII or Base64?

What does it mean "media that are designed to deal with textual data"? They can deal with binary => they can deal with anything.


Thanks everyone, I think I understand now.

When we send over data, we cannot be sure that the data would be interpreted in the same format as we intended it to be. So, we send over data coded in some format (like Base64) that both parties understand. That way even if sender and receiver interpret same things differently, but because they agree on the coded format, the data will not get interpreted wrongly.

From Mark Byers example

If I want to send

Hello
world!

One way is to send it in ASCII like

72 101 108 108 111 10 119 111 114 108 100 33

But byte 10 might not be interpreted correctly as a newline at the other end. So, we use a subset of ASCII to encode it like this

83 71 86 115 98 71 56 115 67 110 100 118 99 109 120 107 73 61 61

which at the cost of more data transferred for the same amount of information ensures that the receiver can decode the data in the intended way, even if the receiver happens to have different interpretations for the rest of the character set.

This question is related to algorithm character-encoding binary ascii base64

The answer is


Encoding binary data in XML

Suppose you want to embed a couple images within an XML document. The images are binary data, while the XML document is text. But XML cannot handle embedded binary data. So how do you do it?

One option is to encode the images in base64, turning the binary data into text that XML can handle.

Instead of:

<images>
  <image name="Sally">{binary gibberish that breaks XML parsers}</image>
  <image name="Bobby">{binary gibberish that breaks XML parsers}</image>
</images>

you do:

<images>
  <image name="Sally" encoding="base64">j23894uaiAJSD3234kljasjkSD...</image>
  <image name="Bobby" encoding="base64">Ja3k23JKasil3452AsdfjlksKsasKD...</image>
</images>

And the XML parser will be able to parse the XML document correctly and extract the image data.


One example of when I found it convenient was when trying to embed binary data in XML. Some of the binary data was being misinterpreted by the SAX parser because that data could be literally anything, including XML special characters. Base64 encoding the data on the transmitting end and decoding it on the receiving end fixed that problem.


Here is a summary of my understanding after reading what others have posted:

Important!

Base64 encoding is not meant to provide security

Base64 encoding is not meant to compress data

Why do we use Base64

Base64 is a text representation of data that consists of only 64 characters which are the alphanumeric characters (lowercase and uppercase), +, / and =. These 64 characters are considered ‘safe’, that is, they can not be misinterpreted by legacy computers and programs unlike characters such as <, > \n and many others.


Why/ How do we use Base64 encoding?

Base64 is one of the binary-to-text encoding scheme having 75% efficiency. It is used so that typical binary data (such as images) may be safely sent over legacy "not 8-bit clean" channels. In earlier email networks (till early 1990s), most email messages were plain text in the 7-bit US-ASCII character set. So many early comm protocol standards were designed to work over "7-bit" comm links "not 8-bit clean". Scheme efficiency is the ratio between number of bits in the input and the number of bits in the encoded output. Hexadecimal (Base16) is also one of the binary-to-text encoding scheme with 50% efficiency.

Base64 Encoding Steps (Simplified):

  1. Binary data is arranged in continuous chunks of 24 bits (3 bytes) each.
  2. Each 24 bits chunk is grouped in to four parts of 6 bit each.
  3. Each 6 bit group is converted into their corresponding Base64 character values, i.e. Base64 encoding converts three octets into four encoded characters. The ratio of output bytes to input bytes is 4:3 (33% overhead).
  4. Interestingly, the same characters will be encoded differently depending on their position within the three-octet group which is encoded to produce the four characters.
  5. The receiver will have to reverse this process to recover the original message.

Base64 instead of escaping special characters

I'll give you a very different but real example: I write javascript code to be run in a browser. HTML tags have ID values, but there are constraints on what characters are valid in an ID.

But I want my ID to losslessly refer to files in my file system. Files in reality can have all manner of weird and wonderful characters in them from exclamation marks, accented characters, tilde, even emoji! I cannot do this:

<div id="/path/to/my_strangely_named_file!@().jpg">
    <img src="http://myserver.com/path/to/my_strangely_named_file!@().jpg">
    Here's a pic I took in Moscow.
</div>

Suppose I want to run some code like this:

# ERROR
document.getElementById("/path/to/my_strangely_named_file!@().jpg");

I think this code will fail when executed.

With Base64 I can refer to something complicated without worrying about which language allows what special characters and which need escaping:

document.getElementById("18GerPD8fY4iTbNpC9hHNXNHyrDMampPLA");

Unlike using an MD5 or some other hashing function, you can reverse the encoding to find out what exactly the data was that actually useful.

I wish I knew about Base64 years ago. I would have avoided tearing my hair out with ‘encodeURIComponent’ and str.replace(‘\n’,’\\n’)

SSH transfer of text:

If you're trying to pass complex data over ssh (e.g. a dotfile so you can get your shell personalizations), good luck doing it without Base 64. This is how you would do it with base 64 (I know you can use SCP, but that would take multiple commands - which complicates key bindings for sshing into a server):


In addition to the other (somewhat lengthy) answers: even ignoring old systems that support only 7-bit ASCII, basic problems with supplying binary data in text-mode are:

  • Newlines are typically transformed in text-mode.
  • One must be careful not to treat a NUL byte as the end of a text string, which is all too easy to do in any program with C lineage.

What does it mean "media that are designed to deal with textual data"?

That those protocols were designed to handle text (often, only English text) instead of binary data (like .png and .jpg images).

They can deal with binary => they can deal with anything.

But the converse is not true. A protocol designed to represent text may improperly treat binary data that happens to contain:

  • The bytes 0x0A and 0x0D, used for line endings, which differ by platform.
  • Other control characters like 0x00 (NULL = C string terminator), 0x03 (END OF TEXT), 0x04 (END OF TRANSMISSION), or 0x1A (DOS end-of-file) which may prematurely signal the end of data.
  • Bytes above 0x7F (if the protocol that was designed for ASCII).
  • Byte sequences that are invalid UTF-8.

So you can't just send binary data over a text-based protocol. You're limited to the bytes that represent the non-space non-control ASCII characters, of which there are 94. The reason Base 64 was chosen was that it's faster to work with powers of two, and 64 is the largest one that works.

One question though. How is that systems still don't agree on a common encoding technique like the so common UTF-8?

On the Web, at least, they mostly have. A majority of sites use UTF-8.

The problem in the West is that there is a lot of old software that ass-u-me-s that 1 byte = 1 character and can't work with UTF-8.

The problem in the East is their attachment to encodings like GB2312 and Shift_JIS.

And the fact that Microsoft seems to have still not gotten over having picked the wrong UTF encoding. If you want to use the Windows API or the Microsoft C runtime library, you're limited to UTF-16 or the locale's "ANSI" encoding. This makes it painful to use UTF-8 because you have to convert all the time.


What does it mean "media that are designed to deal with textual data"?

Back in the day when ASCII ruled the world dealing with non-ASCII values was a headache. People jumped through all sorts of hoops to get these transferred over the wire without losing out information.


Why not look to the RFC that currently defines Base64?

Base encoding of data is used in many situations to store or transfer
data in environments that, perhaps for legacy reasons, are restricted to US-ASCII [1] data.Base encoding can also be used in new applications that do not have legacy restrictions, simply because it makes it possible to manipulate objects with text editors.

In the past, different applications have had different requirements and thus sometimes implemented base encodings in slightly different ways. Today, protocol specifications sometimes use base encodings in general, and "base64" in particular, without a precise description or reference. Multipurpose Internet Mail Extensions (MIME) [4] is often used as a reference for base64 without considering the consequences for line-wrapping or non-alphabet characters. The purpose of this specification is to establish common alphabet and encoding considerations. This will hopefully reduce ambiguity in other documents, leading to better interoperability.

Base64 was originally devised as a way to allow binary data to be attached to emails as a part of the Multipurpose Internet Mail Extensions.


It is more that the media validates the string encoding, so we want to ensure that the data is acceptable by a handling application (and doesn't contain a binary sequence representing EOL for example)

Imagine you want to send binary data in an email with encoding UTF-8 -- The email may not display correctly if the stream of ones and zeros creates a sequence which isn't valid Unicode in UTF-8 encoding.

The same type of thing happens in URLs when we want to encode characters not valid for a URL in the URL itself:

http://www.foo.com/hello my friend -> http://www.foo.com/hello%20my%20friend

This is because we want to send a space over a system that will think the space is smelly.

All we are doing is ensuring there is a 1-to-1 mapping between a known good, acceptable and non-detrimental sequence of bits to another literal sequence of bits, and that the handling application doesn't distinguish the encoding.

In your example, man may be valid ASCII in first form; but often you may want to transmit values that are random binary (ie sending an image in an email):

MIME-Version: 1.0
Content-Description: "Base64 encode of a.gif"
Content-Type: image/gif; name="a.gif"
Content-Transfer-Encoding: Base64
Content-Disposition: attachment; filename="a.gif"

Here we see that a GIF image is encoded in base64 as a chunk of an email. The email client reads the headers and decodes it. Because of the encoding, we can be sure the GIF doesn't contain anything that may be interpreted as protocol and we avoid inserting data that SMTP or POP may find significant.


Media that is designed for textual data is of course eventually binary as well, but textual media often use certain binary values for control characters. Also, textual media may reject certain binary values as non-text.

Base64 encoding encodes binary data as values that can only be interpreted as text in textual media, and is free of any special characters and/or control characters, so that the data will be preserved across textual media as well.


Most computers store data in 8-bit binary format, but this is not a requirement. Some machines and transmission media can only handle 7 bits (or maybe even lesser) at a time. Such a medium would interpret the stream in multiples of 7 bits, so if you were to send 8-bit data, you won't receive what you expect on the other side. Base-64 is just one way to solve this problem: you encode the input into a 6-bit format, send it over your medium and decode it back to 8-bit format at the receiving end.


Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to character-encoding

Changing PowerShell's default output encoding to UTF-8 JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10) Change the encoding of a file in Visual Studio Code What is the difference between utf8mb4 and utf8 charsets in MySQL? How to open html file? All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"? UTF-8 output from PowerShell ERROR 1115 (42000): Unknown character set: 'utf8mb4' "for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte How to make php display \t \n as tab and new line instead of characters

Examples related to binary

Difference between opening a file in binary vs text Remove 'b' character do in front of a string literal in Python 3 Save and retrieve image (binary) from SQL Server using Entity Framework 6 bad operand types for binary operator "&" java C++ - Decimal to binary converting Converting binary to decimal integer output How to convert string to binary? How to convert 'binary string' to normal string in Python3? Read and write to binary files in C? Convert to binary and keep leading zeros in Python

Examples related to ascii

Detect whether a Python string is a number or a letter Is there any ASCII character for <br>? UnicodeEncodeError: 'ascii' codec can't encode character at special name Replace non-ASCII characters with a single space Convert ascii value to char What's the difference between ASCII and Unicode? Invisible characters - ASCII How To Convert A Number To an ASCII Character? Convert ascii char[] to hexadecimal char[] in C Convert character to ASCII numeric value in java

Examples related to base64

How to convert an Image to base64 string in java? How to convert file to base64 in JavaScript? How to convert Base64 String to javascript file object like as from file input form? How can I encode a string to Base64 in Swift? ReadFile in Base64 Nodejs Base64: java.lang.IllegalArgumentException: Illegal character Converting file into Base64String and back again Convert base64 string to image How to encode text to base64 in python Convert base64 string to ArrayBuffer