Unicode is an agreed-upon scheme for representing characters (including distinct lower-case and upper-case letters, and control characters such as new line and carriage return) and other "things" (e.g. emojis) as numbers, and ultimately as bits. A computer is no less capable of storing a Unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ASCII representation (a different series of bits), or any other representation (series of bits).
For communication to take place, the parties involved must agree on which representation will be used.
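To make that concrete, here is a small illustration in Python 3 (chosen here purely as an example language): the same series of bits yields different characters depending on which representation the reader assumes.

```python
# The same bits are meaningless until both parties agree on a representation.
data = "café".encode("utf-8")      # four characters become five bytes

print(data)                        # b'caf\xc3\xa9'
print(data.decode("utf-8"))        # café   (reading with the agreed representation)
print(data.decode("latin-1"))      # cafÃ©  (same bits, a different agreement)
```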
Because Unicode seeks to represent all the possible characters (and other "things") used in inter-human and inter-computer communication, it must cover a far larger set than systems of representation such as ASCII, and therefore needs more than one byte for many characters. To simplify matters, and to accommodate historical usage, Unicode text is almost always converted to a byte-oriented encoding, most commonly UTF-8 (which is backward compatible with ASCII), for the purpose of storing characters in files.
It is not that the in-memory Unicode representation cannot be written to a file, or transmitted through a communications channel, as-is; it is simply that, by convention, it is not.
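As an illustration of that conversion step, here is a short Python 3 sketch; out.txt is just an example filename, and UTF-8 is assumed as the target encoding:

```python
text = "naïve"

# UTF-8 can represent any Unicode character as a sequence of bytes.
print(text.encode("utf-8"))        # b'na\xc3\xafve'

# ASCII covers only a 128-character subset, so the conversion can fail.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print(err)                     # 'ascii' codec can't encode character '\xef' ...

# Writing to a text-mode file performs the same conversion implicitly.
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)                  # the characters are encoded to bytes on the way to disk
```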
The term "string," is not precisely defined. "String," in its common usage, refers to a set of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A "byte string" is a set of characters stored using a representation that uses eight bits (eight bits being referred to as a byte). Since, these days, computers use the unicode system (characters represented by a variable number of bytes) to store characters in memory, and byte strings (characters represented by single bytes) to store characters to files, a conversion must be used before characters represented in memory will be moved into storage in files.