[excel] What charset does Microsoft Excel use when saving files?

I have a Java app that reads CSV files created in Excel (e.g. Excel 2007). Does anyone know what charset MS Excel uses when saving these files?

I would have guessed one of:

  • windows-1255 (Cp1255)
  • ISO-8859-1
  • UTF8

but I am unable to decode extended characters (e.g. French accented letters) using any of these charsets.
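
For reference, a quick way to check each guess is to decode the same bytes with every candidate and eyeball the accented characters (a sketch only; the file name is hypothetical):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetGuess {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("export.csv"));
        for (String name : new String[] {"windows-1255", "ISO-8859-1", "UTF-8"}) {
            // Whichever charset shows é/à/ç intact is the one Excel actually used.
            System.out.println("--- " + name + " ---");
            System.out.println(new String(raw, Charset.forName(name)));
        }
    }
}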

Tags: excel, encoding, character-encoding

Answers:


The Russian edition offers CSV, CSV (Macintosh) and CSV (DOS).

When saving in plain CSV, it uses windows-1251.

I just tried to save the French word Résumé along with Russian text; it was saved in hex as 52 3F 73 75 6D 3F, 3F being the ASCII code for a question mark.

When I opened the CSV file, the word had, of course, become unreadable (R?sum?).


While it is true that exporting an Excel file containing special characters to CSV can be a pain, there is a simple workaround: copy/paste the cells into a Google Docs spreadsheet and then save it as CSV from there.


From memory, Excel uses the machine-specific ANSI encoding. So this would be Windows-1252 for an EN-US installation, 1251 for Russian, etc.
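
In a Java app that means passing the suspected ANSI code page explicitly when opening the file. A minimal sketch (the file name is hypothetical; substitute windows-1251, windows-1250, etc. to match the machine that produced the file):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadExcelCsv {
    public static void main(String[] args) throws IOException {
        // windows-1252 is the ANSI code page on EN-US (and most Western European) machines.
        Charset ansi = Charset.forName("windows-1252");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("export.csv"), ansi)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // accented characters (é, è, ç, ...) should now decode correctly
            }
        }
    }
}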


Excel 2010 saves a UTF-16/UCS-2 TSV file if you select File > Save As > Unicode Text (.txt). The file is forcibly suffixed ".txt", which you can rename to ".tsv".

If you need CSV, you can then convert the TSV file in a text editor like Notepad++, UltraEdit, Crimson Editor etc., replacing tabs with semicolons, commas or the like. Note that, e.g. for reading into a DB table, TSV often works fine already (and it is often easier to read manually).

If you need a different encoding such as UTF-8, use one of the above-mentioned editors to convert the file.
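
If the conversion should happen in code instead of an editor, here is a small sketch of the same idea in Java (file names are hypothetical; the "Unicode Text" export carries a BOM, which Java's UTF-16 decoder honours):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class TsvToCsv {
    public static void main(String[] args) throws IOException {
        // Read the UTF-16 (BOM-prefixed) TSV produced by Excel's "Unicode Text" export.
        List<String> lines = Files.readAllLines(Paths.get("export.txt"), StandardCharsets.UTF_16);

        // Naive tab-to-semicolon conversion; assumes fields contain no embedded tabs or semicolons.
        List<String> csv = lines.stream()
                .map(line -> line.replace('\t', ';'))
                .collect(Collectors.toList());

        // Write the result out as UTF-8 CSV.
        Files.write(Paths.get("export.csv"), csv, StandardCharsets.UTF_8);
    }
}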


cp1250 is used extensively in Microsoft Office documents, including Word and Excel 2003.

http://en.wikipedia.org/wiki/Windows-1250

A simple way to confirm this would be to:

  1. Create a spreadsheet with higher order characters, e.g. "Veszprém" in one of the cells;
  2. Use your favourite scripting language to parse and decode the spreadsheet;
  3. Look at what your script produces when you print out the decoded data.

Example Perl script:

#!perl

use strict;
use warnings;

use Spreadsheet::ParseExcel::Simple;
use Encode qw( decode );

my $file = "my_spreadsheet.xls";

# Read the workbook and grab the first sheet.
my $xls   = Spreadsheet::ParseExcel::Simple->read( $file );
my $sheet = [ $xls->sheets ]->[0];

# Walk the rows; if the bytes really are cp1250, the decoded
# output should show "Veszprém" (etc.) correctly.
while ($sheet->has_data) {

    my @data = $sheet->next_row;

    for my $datum ( @data ) {
        print decode( 'cp1250', $datum );
    }

}


OOXML files like those that come from Excel 2007 are encoded in UTF-8, according to Wikipedia. I don't know about CSV files, but it stands to reason they would use the same encoding...


I had a similar problem last week. I received a number of CSV files with varying encodings. Before importing them into the database I used the chardet library to automatically sniff out the correct encoding.

Chardet is a port of Mozilla's character-detection engine and, if the sample is large enough (one accented character will not do), it works really well.
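
chardet itself is a Python library; if the detection has to happen inside the Java app, the juniversalchardet port of the same Mozilla engine can be used in much the same way (a sketch, assuming the juniversalchardet jar is on the classpath; the file name is hypothetical):

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectCsvEncoding {
    public static void main(String[] args) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        try (FileInputStream in = new FileInputStream("export.csv")) {
            int nread;
            // Feed bytes to the detector until it is confident or the file ends.
            while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        // Returns e.g. "WINDOWS-1252" or "UTF-8", or null if the sample was too small.
        String encoding = detector.getDetectedCharset();
        System.out.println(encoding != null ? encoding : "encoding not detected");
    }
}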


You can create the CSV file using UTF-8 with a BOM (https://en.wikipedia.org/wiki/Byte_order_mark).

The first three bytes are the BOM (0xEF, 0xBB, 0xBF), followed by the UTF-8 content.
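
In Java this just means emitting U+FEFF before the content when writing UTF-8, so that Excel recognises the file as Unicode (a sketch; the file name and data are made up):

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteCsvWithBom {
    public static void main(String[] args) throws IOException {
        try (Writer writer = Files.newBufferedWriter(Paths.get("out.csv"), StandardCharsets.UTF_8)) {
            writer.write('\uFEFF');            // the BOM: encoded as 0xEF 0xBB 0xBF in UTF-8
            writer.write("Name;City\n");
            writer.write("Résumé;Veszprém\n"); // accented text now opens correctly in Excel
        }
    }
}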


Waking up this old thread... We are now in 2017, and Excel is still unable to save a simple spreadsheet to CSV while preserving the original encoding... Just amazing.

Luckily Google Docs lives in the right century. The solution for me is simply to open the spreadsheet with Google Docs, then download it again as CSV. The result is a correctly encoded CSV file (with all strings encoded in UTF-8).


You could use this Visual Studio VB.Net code to get the encoding:

' StreamReader's second argument enables byte-order-mark detection.
Dim strEncodingName As String = String.Empty
Dim myStreamRdr As System.IO.StreamReader = New System.IO.StreamReader(myFileName, True)
' The encoding is only resolved after some of the file has been read;
' files without a BOM fall back to the reader's default (UTF-8).
Dim myString As String = myStreamRdr.ReadToEnd()
strEncodingName = myStreamRdr.CurrentEncoding.EncodingName
