[python] Why do I need 'b' to encode a string with Base64?

Following this python example, I encode a string as Base64 with:

>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'

But, if I leave out the leading b:

>>> encoded = base64.b64encode('data to be encoded')

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\base64.py", line 56, in b64encode
   raise TypeError("expected bytes, not %s" % s.__class__.__name__)
   TypeError: expected bytes, not str

Why is this?

This question is related to python python-3.x base64

The answer is


If the string is Unicode the easiest way is:

import base64                                                        

a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))

# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'

b = base64.b64decode(a).decode("utf-8", "ignore")                    

print(b)
# b :complex string: ñáéíóúÑ

If the data to be encoded contains "exotic" characters, I think you have to encode in "UTF-8"

encoded = base64.b64encode (bytes('data to be encoded', "utf-8"))

There is all you need:

expected bytes, not str

The leading b makes your string binary.

What version of Python do you use? 2.x or 3.x?

Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x


Short Answer

You need to push a bytes-like object (bytes, bytearray, etc) to the base64.b64encode() method. Here are two ways:

>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Or with a variable:

>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Why?

In Python 3, str objects are not C-style character arrays (so they are not byte arrays), but rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and default in Python 3) is utf-8, especially since it is backwards compatible with ASCII (although, as are most widely-used encodings). That is what is happening when you take a string and call the .encode() method on it: Python is interpreting the string in utf-8 (the default encoding) and providing you the array of bytes that it corresponds to.

Base-64 Encoding in Python 3

Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.

base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based off of the mathematical construct of radix-64 or base-64 number system, but they are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.

In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th '=' symbol is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or 4 bits in the last chunk.

Example:

>>> data = b'test'
>>> for byte in data:
...     print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>

If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):

base-2:  01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10:                            1952805748
base-64:  B      0      Z      X      N      0

base64 encoding, however, will re-group this data thusly:

base-2:  011101  000110  010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10:     29       6      21     51     29      0
base-64:      d       G       V      z      d      A

So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.

Let's test this to see if I am being dishonest:

>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='

Why use base64 encoding?

Let's say I have to send some data to someone via email, like this data:

>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
   
>>> print(data)
b'\x04msg\x08\x08\x08   '
>>>

There are two problems I planted:

  1. If I tried to send that email in Unix, the email would send as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
  2. Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the EOF character there the end user wouldn't be able to translate from the text on screen to the real, raw data.

This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-3.x

Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation Replace specific text with a redacted version using Python Upgrade to python 3.8 using conda "Permission Denied" trying to run Python on Windows 10 Python: 'ModuleNotFoundError' when trying to import module from imported package What is the meaning of "Failed building wheel for X" in pip install? How to downgrade python from 3.7 to 3.6 I can't install pyaudio on Windows? How to solve "error: Microsoft Visual C++ 14.0 is required."? Iterating over arrays in Python 3 How to upgrade Python version to 3.7?

Examples related to base64

How to convert an Image to base64 string in java? How to convert file to base64 in JavaScript? How to convert Base64 String to javascript file object like as from file input form? How can I encode a string to Base64 in Swift? ReadFile in Base64 Nodejs Base64: java.lang.IllegalArgumentException: Illegal character Converting file into Base64String and back again Convert base64 string to image How to encode text to base64 in python Convert base64 string to ArrayBuffer