[windows] What encoding/code page is cmd.exe using?

Yes, it’s frustrating—sometimes type and other programs print gibberish, and sometimes they do not.

First of all, Unicode characters will only display if the current console font contains the characters. So use a TrueType font like Lucida Console instead of the default Raster Font.

But if the console font doesn’t contain the character you’re trying to display, you’ll see question marks instead of gibberish. When you get gibberish, there’s more going on than just font settings.

When programs use standard C-library I/O functions like printf, the program’s output encoding must match the console’s output encoding, or you will get gibberish. chcp shows and sets the current codepage. All output using standard C-library I/O functions is treated as if it is in the codepage displayed by chcp.

Matching the program’s output encoding with the console’s output encoding can be accomplished in two different ways:

  • A program can get the console’s current codepage using chcp or GetConsoleOutputCP, and configure itself to output in that encoding, or

  • You or a program can set the console’s current codepage using chcp or SetConsoleOutputCP to match the default output encoding of the program.

However, programs that use Win32 APIs can write UTF-16LE strings directly to the console with WriteConsoleW. This is the only way to get correct output without setting codepages. And even when using that function, if a string is not in the UTF-16LE encoding to begin with, a Win32 program must pass the correct codepage to MultiByteToWideChar. Also, WriteConsoleW will not work if the program’s output is redirected; more fiddling is needed in that case.

type works some of the time because it checks the start of each file for a UTF-16LE Byte Order Mark (BOM), i.e. the bytes 0xFF 0xFE. If it finds such a mark, it displays the Unicode characters in the file using WriteConsoleW regardless of the current codepage. But when typeing any file without a UTF-16LE BOM, or for using non-ASCII characters with any command that doesn’t call WriteConsoleW—you will need to set the console codepage and program output encoding to match each other.


How can we find this out?

Here’s a test file containing Unicode characters:

ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    aezznl
Russian   ??????? ???
CJK       ??

Here’s a Java program to print out the test file in a bunch of different Unicode encodings. It could be in any programming language; it only prints ASCII characters or encoded bytes to stdout.

import java.io.*;

public class Foo {

    private static final String BOM = "\ufeff";
    private static final String TEST_STRING
        = "ASCII     abcde xyz\n"
        + "German    äöü ÄÖÜ ß\n"
        + "Polish    aezznl\n"
        + "Russian   ??????? ???\n"
        + "CJK       ??\n";

    public static void main(String[] args)
        throws Exception
    {
        String[] encodings = new String[] {
            "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };

        for (String encoding: encodings) {
            System.out.println("== " + encoding);

            for (boolean writeBom: new Boolean[] {false, true}) {
                System.out.println(writeBom ? "= bom" : "= no bom");

                String output = (writeBom ? BOM : "") + TEST_STRING;
                byte[] bytes = output.getBytes(encoding);
                System.out.write(bytes);
                FileOutputStream out = new FileOutputStream("uc-test-"
                    + encoding + (writeBom ? "-bom.txt" : "-nobom.txt"));
                out.write(bytes);
                out.close();
            }
        }
    }
}

The output in the default codepage? Total garbage!

Z:\andrew\projects\sx\1259084>chcp
Active code page: 850

Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII     abcde xyz
German    +ñ+Â++ +ä+û+£ +ƒ
Polish    -à-Ö+¦+++ä+é
Russian   ð¦ð¦ð¦ð¦ð¦ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢
= bom
´++ASCII     abcde xyz
German    +ñ+Â++ +ä+û+£ +ƒ
Polish    -à-Ö+¦+++ä+é
Russian   ð¦ð¦ð¦ð¦ð¦ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢
== UTF-16LE
= no bom
A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h         ????z?|?D?B?
 R u s s i a n       0?1?2?3?4?5?6?  M?N?O?
 C J K               `O}Y
 = bom
 ¦A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h         ????z?|?D?B?
 R u s s i a n       0?1?2?3?4?5?6?  M?N?O?
 C J K               `O}Y
 == UTF-16BE
= no bom
 A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h        ?????z?|?D?B
 R u s s i a n      ?0?1?2?3?4?5?6  ?M?N?O
 C J K              O`Y}
= bom
¦  A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h        ?????z?|?D?B
 R u s s i a n      ?0?1?2?3?4?5?6  ?M?N?O
 C J K              O`Y}
== UTF-32LE
= no bom
A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                   ??  ??  z?  |?  D?  B?
   R   u   s   s   i   a   n               0?  1?  2?  3?  4?  5?  6?      M?  N
?  O?
   C   J   K                               `O  }Y
   = bom
 ¦  A   S   C   I   I                       a   b   c   d   e       x   y   z

   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                   ??  ??  z?  |?  D?  B?
   R   u   s   s   i   a   n               0?  1?  2?  3?  4?  5?  6?      M?  N
?  O?
   C   J   K                               `O  }Y
   == UTF-32BE
= no bom
   A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                  ??  ??  ?z  ?|  ?D  ?B
   R   u   s   s   i   a   n              ?0  ?1  ?2  ?3  ?4  ?5  ?6      ?M  ?N
  ?O
   C   J   K                              O`  Y}
= bom
  ¦    A   S   C   I   I                       a   b   c   d   e       x   y   z

   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                  ??  ??  ?z  ?|  ?D  ?B
   R   u   s   s   i   a   n              ?0  ?1  ?2  ?3  ?4  ?5  ?6      ?M  ?N
  ?O
   C   J   K                              O`  Y}

However, what if we type the files that got saved? They contain the exact same bytes that were printed to the console.

Z:\andrew\projects\sx\1259084>type *.txt

uc-test-UTF-16BE-bom.txt


¦  A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h        ?????z?|?D?B
 R u s s i a n      ?0?1?2?3?4?5?6  ?M?N?O
 C J K              O`Y}

uc-test-UTF-16BE-nobom.txt


 A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h        ?????z?|?D?B
 R u s s i a n      ?0?1?2?3?4?5?6  ?M?N?O
 C J K              O`Y}

uc-test-UTF-16LE-bom.txt


ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    aezznl
Russian   ??????? ???
CJK       ??

uc-test-UTF-16LE-nobom.txt


A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h         ????z?|?D?B?
 R u s s i a n       0?1?2?3?4?5?6?  M?N?O?
 C J K               `O}Y

uc-test-UTF-32BE-bom.txt


  ¦    A   S   C   I   I                       a   b   c   d   e       x   y   z

   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                  ??  ??  ?z  ?|  ?D  ?B
   R   u   s   s   i   a   n              ?0  ?1  ?2  ?3  ?4  ?5  ?6      ?M  ?N
  ?O
   C   J   K                              O`  Y}

uc-test-UTF-32BE-nobom.txt


   A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                  ??  ??  ?z  ?|  ?D  ?B
   R   u   s   s   i   a   n              ?0  ?1  ?2  ?3  ?4  ?5  ?6      ?M  ?N
  ?O
   C   J   K                              O`  Y}

uc-test-UTF-32LE-bom.txt


 A S C I I           a b c d e   x y z
 G e r m a n         ä ö ü   Ä Ö Ü   ß
 P o l i s h         a e z z n l
 R u s s i a n       ? ? ? ? ? ? ?   ? ? ?
 C J K               ? ?

uc-test-UTF-32LE-nobom.txt


A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       -   Í   _       ¯
   P   o   l   i   s   h                   ??  ??  z?  |?  D?  B?
   R   u   s   s   i   a   n               0?  1?  2?  3?  4?  5?  6?      M?  N
?  O?
   C   J   K                               `O  }Y

uc-test-UTF-8-bom.txt


´++ASCII     abcde xyz
German    +ñ+Â++ +ä+û+£ +ƒ
Polish    -à-Ö+¦+++ä+é
Russian   ð¦ð¦ð¦ð¦ð¦ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢

uc-test-UTF-8-nobom.txt


ASCII     abcde xyz
German    +ñ+Â++ +ä+û+£ +ƒ
Polish    -à-Ö+¦+++ä+é
Russian   ð¦ð¦ð¦ð¦ð¦ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢

The only thing that works is UTF-16LE file, with a BOM, printed to the console via type.

If we use anything other than type to print the file, we get garbage:

Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON
 ¦A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   - Í _   ¯
 P o l i s h         ????z?|?D?B?
 R u s s i a n       0?1?2?3?4?5?6?  M?N?O?
 C J K               `O}Y
         1 file(s) copied.

From the fact that copy CON does not display Unicode correctly, we can conclude that the type command has logic to detect a UTF-16LE BOM at the start of the file, and use special Windows APIs to print it.

We can see this by opening cmd.exe in a debugger when it goes to type out a file:

enter image description here

After type opens a file, it checks for a BOM of 0xFEFF—i.e., the bytes 0xFF 0xFE in little-endian—and if there is such a BOM, type sets an internal fOutputUnicode flag. This flag is checked later to decide whether to call WriteConsoleW.

But that’s the only way to get type to output Unicode, and only for files that have BOMs and are in UTF-16LE. For all other files, and for programs that don’t have special code to handle console output, your files will be interpreted according to the current codepage, and will likely show up as gibberish.

You can emulate how type outputs Unicode to the console in your own programs like so:

#include <stdio.h>
#define UNICODE
#include <windows.h>

static LPCSTR lpcsTest =
    "ASCII     abcde xyz\n"
    "German    äöü ÄÖÜ ß\n"
    "Polish    aezznl\n"
    "Russian   ??????? ???\n"
    "CJK       ??\n";

int main() {
    int n;
    wchar_t buf[1024];

    HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);

    n = MultiByteToWideChar(CP_UTF8, 0,
            lpcsTest, strlen(lpcsTest),
            buf, sizeof(buf));

    WriteConsole(hConsole, buf, n, &n, NULL);

    return 0;
}

This program works for printing Unicode on the Windows console using the default codepage.


For the sample Java program, we can get a little bit of correct output by setting the codepage manually, though the output gets messed up in weird ways:

Z:\andrew\projects\sx\1259084>chcp 65001
Active code page: 65001

Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    aezznl
Russian   ??????? ???
CJK       ??
? ???
CJK       ??
 ??
?
?
= bom
ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    aezznl
Russian   ??????? ???
CJK       ??
?? ???
CJK       ??
  ??
?
?
== UTF-16LE
= no bom
A S C I I           a b c d e   x y z
…

However, a C program that sets a Unicode UTF-8 codepage:

#include <stdio.h>
#include <windows.h>

int main() {
    int c, n;
    UINT oldCodePage;
    char buf[1024];

    oldCodePage = GetConsoleOutputCP();
    if (!SetConsoleOutputCP(65001)) {
        printf("error\n");
    }

    freopen("uc-test-UTF-8-nobom.txt", "rb", stdin);
    n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
    fwrite(buf, sizeof(buf[0]), n, stdout);

    SetConsoleOutputCP(oldCodePage);

    return 0;
}

does have correct output:

Z:\andrew\projects\sx\1259084>.\test
ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    aezznl
Russian   ??????? ???
CJK       ??

The moral of the story?

  • type can print UTF-16LE files with a BOM regardless of your current codepage
  • Win32 programs can be programmed to output Unicode to the console, using WriteConsoleW.
  • Other programs which set the codepage and adjust their output encoding accordingly can print Unicode on the console regardless of what the codepage was when the program started
  • For everything else you will have to mess around with chcp, and will probably still get weird output.

Examples related to windows

"Permission Denied" trying to run Python on Windows 10 A fatal error occurred while creating a TLS client credential. The internal error state is 10013 How to install OpenJDK 11 on Windows? I can't install pyaudio on Windows? How to solve "error: Microsoft Visual C++ 14.0 is required."? git clone: Authentication failed for <URL> How to avoid the "Windows Defender SmartScreen prevented an unrecognized app from starting warning" XCOPY: Overwrite all without prompt in BATCH Laravel 5 show ErrorException file_put_contents failed to open stream: No such file or directory how to open Jupyter notebook in chrome on windows Tensorflow import error: No module named 'tensorflow'

Examples related to command-line

Git is not working after macOS Update (xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools) Flutter command not found Angular - ng: command not found how to run python files in windows command prompt? How to run .NET Core console app from the command line Copy Paste in Bash on Ubuntu on Windows How to find which version of TensorFlow is installed in my system? How to install JQ on Mac by command-line? Python not working in the command line of git bash Run function in script from command line (Node JS)

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space