What's the difference between UTF-8 and UTF-8 without BOM?

Question

What's different between UTF-8 and UTF-8 without a BOM? Which is better?

User · Answer

UTF-8 with a BOM is better if you use UTF-8 in HTML files and if you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page.

That is my opinion (30 years of computing and IT industry).

User · Answer

BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, HTML file, JSON response, RSS, etc.) and causes the kind of embarrassments like the recent encoding issue experienced during the talk of Obama on Twitter.

It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.

User · Answer

The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

"Q. How I should deal with BOMs?

A. Here are some guidelines to follow:

1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

2. Some protocols allow optional BOMs in the case of untagged text. In those cases:
   - Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
   - Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used."

User · Answer

Quoted at the bottom of the Wikipedia page on BOM (http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2):

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

User · Answer

One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8, you will get the response:

#!/bin/bash: No such file or directory

in response to the shebang line specifying which shell you wish to use:

#!/bin/bash

If you save as UTF-8, no BOM (say in BBEdit), all will be well.
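A minimal Python sketch of why the shebang stops working: the kernel looks for the two bytes `#!` at offset 0, and a UTF-8 BOM pushes them to offset 3. The helper names `starts_with_shebang` and `strip_utf8_bom` are made up for this illustration; only `codecs.BOM_UTF8` and the `utf-8-sig` codec come from the standard library.

```python
import codecs

SHEBANG = b"#!"  # the two-byte magic number checked at the very start of a script


def starts_with_shebang(data: bytes) -> bool:
    """Return True if the first two bytes are the '#!' magic number."""
    return data.startswith(SHEBANG)


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present, otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = "#!/bin/bash\necho hello\n"
with_bom = script.encode("utf-8-sig")  # the utf-8-sig codec prepends EF BB BF
without_bom = script.encode("utf-8")   # plain UTF-8, no BOM

print(starts_with_shebang(with_bom))                  # False
print(starts_with_shebang(without_bom))               # True
print(starts_with_shebang(strip_utf8_bom(with_bom)))  # True
```

Stripping the three bytes before saving is the same as choosing "UTF-8, no BOM" in the editor.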

User · Answer

I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.

User · Answer

It should be noted that for some files you must not have the BOM even on Windows. Examples are SQL*plus or VBScript files. If such files contain a BOM, you get an error when you try to execute them.

User · Answer

From http://en.wikipedia.org/wiki/Byte-order_mark:

"The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in."

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abg-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.

User · Answer

UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.

User · Answer

When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a text file in Windows or Linux, one of the easiest things imaginable, is usually not UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in the IDE and "Create New File") or adding the BOM at the beginning of the file:

$file = "\xEF\xBB\xBF" . $string;

I was not able to save the French letters in an XML file.

User · Answer

This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code EF BB BF.

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.

Why is a BOM not recommended?

1. Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
2. It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found).

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.
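The guessing order described above (trust a BOM if present, otherwise validate as UTF-8, otherwise fall back) could be sketched in Python roughly as follows. The function name `guess_text` and the choice of `cp1252` as the legacy fallback are assumptions made for this illustration, not something the answer prescribes.

```python
import codecs
from typing import Tuple


def guess_text(data: bytes, legacy_encoding: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes of unknown encoding; return (text, label of the guess made).

    Hypothetical helper: the fallback code page is an assumption and should
    match whatever legacy encoding your data source really uses.
    """
    # 1. A leading EF BB BF makes UTF-8 very likely; strip it before decoding.
    if data.startswith(codecs.BOM_UTF8):
        body = data[len(codecs.BOM_UTF8):]
        # errors="replace" keeps going if the rest turns out to be garbled.
        return body.decode("utf-8", errors="replace"), "utf-8 (BOM)"
    # 2. No BOM: strict validation tells us whether the bytes are valid UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # 3. Not valid UTF-8: fall back to the assumed legacy code page.
        return data.decode(legacy_encoding, errors="replace"), legacy_encoding


print(guess_text(b"\xef\xbb\xbfabc"))         # ('abc', 'utf-8 (BOM)')
print(guess_text("café".encode("utf-8")))     # ('café', 'utf-8')
print(guess_text("café".encode("cp1252")))    # ('café', 'cp1252')
```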

User · Answer

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

"2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 16.8, Specials, for more information."
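As a quick illustration, here is how the single character U+FEFF comes out in different encoding forms. The helper `hex_bytes` is a name invented for this sketch; the codec names are standard Python ones.

```python
BOM_CHAR = "\ufeff"  # the byte order mark is the code point U+FEFF


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'EF BB BF'."""
    return " ".join(f"{b:02X}" for b in data)


# UTF-8 has a single byte order, so the BOM is always the same three bytes.
print(hex_bytes(BOM_CHAR.encode("utf-8")))      # EF BB BF

# In UTF-16 the byte order of the same character reveals the endianness.
print(hex_bytes(BOM_CHAR.encode("utf-16-be")))  # FE FF
print(hex_bytes(BOM_CHAR.encode("utf-16-le")))  # FF FE
```

Only the UTF-16 lines carry byte-order information; in UTF-8 the three bytes can act as a signature at most.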

User · Answer

Here are examples of BOM usage that actually cause real problems, and yet many people don't know about them.

BOM breaks scripts

Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter all start with a shebang line which looks like one of those:

#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node

It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.

See Wikipedia, article: Shebang, section: Magic number:

"The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8." [emphasis added]

BOM is illegal in JSON

See RFC 7159, Section 8.1:

"Implementations MUST NOT add a byte order mark to the beginning of a JSON text."

BOM is redundant in JSON

Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and endianness used in any JSON stream (see this answer for details).

BOM breaks JSON parsers

Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627.

Determining the encoding and endianness of JSON, examining the first four bytes for the NUL byte:

00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8

Now, if the file starts with BOM it will look like this:

00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8

Note that:

1. UTF-32BE doesn't start with three NULs, so it won't be recognized
2. UTF-32LE: the first byte is not followed by three NULs, so it won't be recognized
3. UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
4. UTF-16LE has only one NUL in the first four bytes, so it won't be recognized

Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.

Additionally, if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.

Other data formats

BOM in JSON is not needed, is illegal and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it then, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need it; just don't call it JSON then.

For data formats other than JSON, take a look at what the data really looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make it more complicated and error prone.

Other uses of BOM

As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.
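The NUL-pattern table from RFC 4627 can be turned into a few lines of Python to see how a BOM defeats it. This is only a sketch: `detect_json_encoding` is a hypothetical name, and treating inputs shorter than four bytes as UTF-8 is a simplification made for this example.

```python
import codecs


def detect_json_encoding(data: bytes) -> str:
    """Guess a JSON text's encoding from the NUL pattern of its first four bytes.

    Follows the table in RFC 4627, section 3. Inputs shorter than four
    bytes are treated as UTF-8 (a simplification for this sketch).
    """
    if len(data) < 4:
        return "utf-8"
    b0, b1, b2, b3 = data[:4]
    if b0 == 0 and b1 == 0 and b2 == 0 and b3 != 0:
        return "utf-32-be"  # 00 00 00 xx
    if b0 == 0 and b1 != 0 and b2 == 0 and b3 != 0:
        return "utf-16-be"  # 00 xx 00 xx
    if b0 != 0 and b1 == 0 and b2 == 0 and b3 == 0:
        return "utf-32-le"  # xx 00 00 00
    if b0 != 0 and b1 == 0 and b2 != 0 and b3 == 0:
        return "utf-16-le"  # xx 00 xx 00
    return "utf-8"          # xx xx xx xx


text = '{"a": 1}'

# Without a BOM the pattern identifies each encoding correctly.
print(detect_json_encoding(text.encode("utf-16-le")))  # utf-16-le
print(detect_json_encoding(text.encode("utf-32-be")))  # utf-32-be

# With a BOM in front, the NUL pattern is destroyed and the guess is wrong.
bom_utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(detect_json_encoding(bom_utf16_le))              # utf-8
```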

User · Answer

UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.

Don't make anything expect a BOM for UTF-8.

User · Answer

UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.

If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8, or Notepad++ with UTF-8 with BOM), Excel opens it fine.

Use of the BOM character in Unicode text files is discussed in RFC 3629, "UTF-8, a transformation format of ISO 10646", November 2003, at http://tools.ietf.org/html/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)
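If the CSV is generated from code, the BOM can be added at write time instead of re-saving in an editor. A minimal Python sketch, assuming the standard `csv` module and the `utf-8-sig` codec; the function name `csv_bytes_for_excel` and the sample rows are invented for this example.

```python
import csv
import io


def csv_bytes_for_excel(rows) -> bytes:
    """Serialize rows to CSV and encode them as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline="")      # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    # The utf-8-sig codec prepends EF BB BF when encoding.
    return buffer.getvalue().encode("utf-8-sig")


data = csv_bytes_for_excel([["name", "city"], ["Јован", "Београд"]])

print(data[:3] == b"\xef\xbb\xbf")  # True: the file starts with the BOM
# Reading back with utf-8-sig strips the BOM again.
print(data.decode("utf-8-sig").splitlines()[0])  # name,city
```

Writing `data` to disk in binary mode (or opening the file with `encoding="utf-8-sig"` in text mode) gives the same bytes that Notepad produces when saving as UTF-8 with BOM.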

User · Answer

As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer destroying the layout, again. After fiddling with the linked CSS files for hours to no avail I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.

Also, I just found this in Wikipedia:

"The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns."

User · Answer

The other excellent answers already answered that:

- There is no official difference between UTF-8 and BOM-ed UTF-8
- A BOM-ed UTF-8 string will start with the three following bytes: EF BB BF
- Those bytes, if present, must be ignored when extracting the string from the file/stream.

But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

For example, the data [EF BB BF 41 42 43] could either be:

- The legitimate ISO-8859-1 string "ï»¿ABC"
- The legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file content by looking at the first bytes, you should not rely on this, as shown by the example above.

Encodings should be known, not divined.
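The ambiguity of those six bytes is easy to reproduce in Python. The variable names are only for this illustration; the codecs (`iso-8859-1`, `utf-8`, `utf-8-sig`) are standard ones.

```python
data = bytes.fromhex("EF BB BF 41 42 43")

# Read as ISO-8859-1, every byte is a character, so the BOM bytes are text.
as_latin1 = data.decode("iso-8859-1")
print(as_latin1)            # ï»¿ABC

# Read as UTF-8 with signature handling, the first three bytes are dropped.
as_utf8 = data.decode("utf-8-sig")
print(as_utf8)              # ABC

# Plain "utf-8" keeps the BOM as an invisible U+FEFF character in the string.
as_utf8_raw = data.decode("utf-8")
print(len(as_utf8_raw))     # 4
```

Both decodings succeed without error, which is exactly why the bytes alone cannot settle the question.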

User · Answer

Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

"The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8."

Argument for NOT using a BOM:

"The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding."

Argument FOR using a BOM:

"The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not because of the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM.

In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file."

On which is better, WITH or WITHOUT the BOM:

"The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature.""

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic as it is for other applications.

† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.

User · Answer

What's different between UTF-8 and UTF-8 without BOM?

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
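A validity check of this kind needs nothing more than a strict decode. The sketch below uses a hypothetical helper name, `looks_like_utf8`, and Greek text in ISO-8859-7 as an arbitrary example of legacy 8-bit data.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence (strict decoding)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


greek = "αβγ"

print(looks_like_utf8(greek.encode("utf-8")))       # True:  CE B1 CE B2 CE B3
print(looks_like_utf8(greek.encode("iso-8859-7")))  # False: E1 E2 E3 is not valid UTF-8
print(looks_like_utf8(b"plain ASCII"))              # True:  ASCII is a subset of UTF-8
```

The legacy bytes fail because E1 announces a three-byte sequence and must be followed by continuation bytes in the range 80 to BF, which E2 is not.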

User · Answer

There are at least three problems with putting a BOM in UTF-8 encoded files.

1. Files that hold no text are no longer empty because they always contain the BOM.
2. Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
3. It is not possible to concatenate several files together because each file now has a BOM at the beginning.

And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:

- It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
- It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
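The three problems can be reproduced in a few lines of Python. This is a sketch using the standard `utf-8-sig` codec to stand in for an editor that always writes a BOM; `concat_without_inner_boms` is a hypothetical helper showing one way to work around the concatenation issue.

```python
import codecs

# 1. An "empty" file saved with a BOM is three bytes long.
empty_with_bom = "".encode("utf-8-sig")
print(len(empty_with_bom))                  # 3

# 2. Pure ASCII text is no longer pure ASCII once the BOM is in front.
ascii_with_bom = "abc".encode("utf-8-sig")
print(all(b < 128 for b in ascii_with_bom)) # False

# 3. Naive concatenation leaves a BOM in the middle of the result.
part1 = "first\n".encode("utf-8-sig")
part2 = "second\n".encode("utf-8-sig")
naive = (part1 + part2).decode("utf-8-sig") # only the leading BOM is stripped
print("\ufeff" in naive)                    # True


def concat_without_inner_boms(parts) -> bytes:
    """Join byte strings, dropping a leading UTF-8 BOM from every part."""
    bom = codecs.BOM_UTF8
    return b"".join(p[len(bom):] if p.startswith(bom) else p for p in parts)


print(concat_without_inner_boms([part1, part2]).decode("utf-8") == "first\nsecond\n")  # True
```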

User · Answer

Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems.

It turns out that a file saved as UTF-8 with a signature (BOM) gets a red dot character at its start when reviewing a pull request, which can be quite annoying.

If you hover over it, it will show a character like "ufeff". Sourcetree does not show these kinds of byte marks, so they will most likely end up in your pull requests. That should be OK, because that's how Visual Studio 2017 encodes new files now, so maybe Bitbucket should ignore this or show it in another way. More info here: Red dot marker BitBucket diff view
