What is the difference between UTF-8 and ISO-8859-1?

Question

User · Answer

Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding; the latter is a single-byte, fixed-length encoding. Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0-127 get encoded identically; code points 128-255 differ by becoming a 2-byte sequence in UTF-8, whereas they are single bytes in Latin-1.
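To make the byte-level difference concrete, here is a minimal Python 3 sketch that encodes the same characters both ways. The helper name `encoded_pair` and the sample characters are illustrative choices, not part of the answer above.

```python
# Compare how the two encodings serialize the same code points.
samples = ["A", "é", "ÿ"]  # U+0041, U+00E9, U+00FF


def encoded_pair(ch):
    """Return (latin1_bytes, utf8_bytes) for a single character."""
    return ch.encode("iso-8859-1"), ch.encode("utf-8")


for ch in samples:
    latin1, utf8 = encoded_pair(ch)
    # "A" is one identical byte in both; the other two need two bytes in UTF-8.
    print(f"U+{ord(ch):04X}  latin-1={latin1.hex()}  utf-8={utf8.hex()}")
```

Only the ASCII character comes out as the same byte sequence in both encodings.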

User · Answer

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.

User · Answer

My reason for researching this question was to find out in what way the two are compatible. Latin-1 (ISO-8859-1) text is 100% compatible with a UTF-8 datastore in the sense that every Latin-1 character can be stored there. ASCII characters are stored as single bytes; the extended characters (128-255) take two bytes each in UTF-8, so the data has to be converted rather than copied byte for byte.

Going the other way, from UTF-8 to the Latin-1 charset, may or may not work. If there are any characters beyond code point 255, they cannot be stored in a Latin-1 datastore.
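The two directions can be sketched in Python 3 as follows. The function names `fits_in_latin1` and `latin1_to_utf8` are hypothetical helpers invented for this illustration; a real datastore would do the conversion in its own driver or column settings.

```python
def fits_in_latin1(text):
    """True if every character has a code point of 255 or below."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def latin1_to_utf8(data):
    """Transcode Latin-1 bytes to UTF-8 bytes; every byte value is valid input."""
    return data.decode("iso-8859-1").encode("utf-8")


print(latin1_to_utf8(b"caf\xe9"))   # the single byte E9 becomes the pair C3 A9
print(fits_in_latin1("café"))       # all code points are <= 255
print(fits_in_latin1("price: 5€"))  # U+20AC lies outside Latin-1
```

Latin-1 to UTF-8 always succeeds but may grow the data; UTF-8 to Latin-1 needs a check like the one above first.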

User · Answer

One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80-0x9F, where ISO 8859-1 has the C1 control codes and Windows-1252 has useful visible characters instead.

For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, "…").

The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way: the HTML spec says that all encodings in the Encoding spec must be supported, and no more.

Also of interest: HTML numeric character references essentially use Windows-1252 for 8-bit values rather than Unicode code points. Per https://html.spec.whatwg.org/#numeric-character-reference-end-state, &#x85; will produce U+2026 rather than U+0085.
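Outside the browser the distinction still matters. As a small illustration in Python 3, whose `iso-8859-1` codec implements the strict ISO mapping and whose `cp1252` codec is Windows-1252, the same byte decodes differently; the standard library's `html.unescape` applies the HTML rules for numeric character references. The variable names are illustrative only.

```python
import html

raw = b"\x85"

# Python's "iso-8859-1" codec is the strict ISO mapping: byte 0x85 -> U+0085.
as_iso = raw.decode("iso-8859-1")

# "cp1252" is Python's name for Windows-1252: byte 0x85 -> U+2026.
as_cp1252 = raw.decode("cp1252")

# html.unescape follows the HTML5 rules for numeric character references.
from_charref = html.unescape("&#x85;")

print(f"U+{ord(as_iso):04X}", f"U+{ord(as_cp1252):04X}", f"U+{ord(from_charref):04X}")
```

So when handling bytes that a web page or HTTP response labelled iso-8859-1, decoding them as `cp1252` is probably closer to what a browser would do, though that is a judgment call for the specific data source.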

User · Answer

UTF

UTF is a family of multi-byte encoding schemes for representing Unicode code points. The original UTF-8 design could address up to 2^31 (roughly 2 billion) values. As standardized today, UTF-8 is a flexible encoding that uses between 1 and 4 bytes, which leaves room for up to 2^21 (roughly 2 million) values, of which Unicode uses only the range up to U+10FFFF.

Long story short: any character with a code point of 127 or below (that is, 7-bit-safe ASCII) is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes.

ISO-8859

ISO-8859 is a family of single-byte encoding schemes used to represent alphabets that fit within the range 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1, aka "Latin-1". As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of which part is used.

The drawback of this encoding scheme is its inability to accommodate languages comprised of more than 128 symbols, or to safely display more than one family of symbols at one time. ISO-8859 encodings have also fallen out of favor with the rise of UTF. The ISO working group in charge of the standard disbanded in 2004, leaving maintenance up to its parent subcommittee.
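Both halves of this answer can be checked with a short Python 3 sketch. It first measures how many bytes UTF-8 uses at the boundaries of each length class, then decodes one byte under three ISO-8859 parts. The helper `utf8_length` and the choice of byte 0xE4 are assumptions made for the illustration.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# The last code point of each length class and the first of the next one.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

# One byte value, three ISO-8859 parts, three different characters.
byte = b"\xe4"
by_part = {
    part: byte.decode(part)
    for part in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}
print(by_part)
```

The second half shows why text in an ISO-8859 encoding is ambiguous without a label: the byte alone does not say whether it is Latin, Cyrillic, or Greek.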

User · Answer

ASCII: 7 bits, 128 code points.
ISO-8859-1: 8 bits, 256 code points.
UTF-8: 8-32 bits (1-4 bytes), 1,112,064 code points.

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

    #!/usr/bin/env python3

    c = chr(0xa9)  # U+00A9 COPYRIGHT SIGN
    print(c)
    print(c.encode('utf-8'))
    print(c.encode('iso-8859-1'))

Output:

    ©
    b'\xc2\xa9'
    b'\xa9'
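As a follow-up sketch in the same Python 3 style, this shows what the incompatibility looks like when bytes are decoded with the wrong codec. The variable names are illustrative.

```python
utf8_bytes = "©".encode("utf-8")         # two bytes: C2 A9
latin1_bytes = "©".encode("iso-8859-1")  # one byte: A9

# UTF-8 bytes read as Latin-1: no error, but each byte becomes its own character.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)

# Latin-1 bytes read as UTF-8: 0xA9 is a continuation byte with no lead byte.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

The first mistake is silent and produces garbled text; the second one at least fails loudly.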

User · Answer

ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some languages of the Western world. Even for many supported languages, some characters are missing. If you create a text file in this encoding and try to copy and paste some Chinese characters into it, you will see weird results. In other words, don't use it. Unicode has taken over the world, and UTF-8 is pretty much the standard these days unless you have legacy reasons (like HTTP headers, which need to be compatible with everything).
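The "weird results" can be reproduced with a minimal Python 3 sketch. The helper `encode_lossy` is a hypothetical name; it uses the standard `errors="replace"` handler, which is one of several ways an editor or tool might deal with characters the target encoding lacks.

```python
def encode_lossy(text):
    """Encode to Latin-1, substituting '?' for unrepresentable characters."""
    return text.encode("iso-8859-1", errors="replace")


print(encode_lossy("中文 and café"))  # the Chinese characters are lost
print(encode_lossy("cœur"))           # œ (U+0153) is not in Latin-1 either
```

The second line illustrates the point about missing characters: the French ligature œ falls outside the 256 characters Latin-1 can represent.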

User · Answer

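From another perspective: files that both the UTF-8 and ASCII decoders fail to read because they contain a byte 0xC0 seem to get read properly as ISO-8859-1. That is because ISO-8859-1 assigns a character to every byte value, so decoding never fails. The caveat, of course, is that the file shouldn't actually contain multi-byte Unicode characters, since those would be decoded into the wrong characters without any error.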
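A short Python 3 sketch of that behaviour, using a made-up byte string and a hypothetical helper `try_decode`:

```python
data = b"\xc0 la carte"  # 0xC0 is "À" in Latin-1 and never valid in UTF-8


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid for that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


for name in ("ascii", "utf-8", "iso-8859-1"):
    print(name, repr(try_decode(data, name)))

# The caveat: genuine UTF-8 input also "succeeds" as Latin-1, but comes out garbled.
print(repr(try_decode("é".encode("utf-8"), "iso-8859-1")))
```

A successful Latin-1 decode therefore says nothing about whether Latin-1 was the right encoding.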
