[perl] Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?


Commenting @tchrist got too long, so I'm adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see situation pretty similarly, but our conclusions are completely in opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.

tchrist pointed to many aspects to cover, I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I have not so much knowledge to argue with that. So, I stick to live examples.

I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems, it just worked. Maybe there are some limitation somewhere deeper, but at start, all I tested worked as I expected.

Shouldn't that be a goal in modern Perl 5 too? I stress it more: I'm not suggesting UTF-8 as the default character set for core Perl, I suggest the possibility to trigger it with a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean?. So, it is pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. Some way tchrist is right and answers my questions: new projects don't attract UTF-8 because it is too complicated in Perl 5.

This question is related to perl unicode utf-8

The answer is


I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data, Unicode, ISO-8859-1, or many other things, your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default setup inside your program that's going to tell the stuff outside your program how to act.

You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 if the thing reading from them can't handle UTF-8. Perl has no way to know those things. That's why you are the programmer.

I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode stuff in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You're not going to be able to use it effectively otherwise.


We're all in agreement that it is a difficult problem for many reasons, but that's precisely the reason to try to make it easier on everybody.

There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".

As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make doing common problems easier. That's the reason that we're programmers.

If utf8::all doesn't do something you think it should, let's improve it to make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.

`


You should enable the unicode strings feature, and this is the default if you use v5.14;

You should not really use unicode identifiers esp. for foreign code via utf8 as they are insecure in perl5, only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html

Regarding utf8 for your filehandles/streams: You need decide by yourself the encoding of your external data. A library cannot know that, and since not even libc supports utf8, proper utf8 data is rare. There's more wtf8, the windows aberration of utf8 around.

BTW: Moose is not really "Modern Perl", they just hijacked the name. Moose is perfect Larry Wall-style postmodern perl mixed with Bjarne Stroustrup-style everything goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, horrible fields syntax, and a very immature naive implementation which is 10x slower than a proper implementation. cperl and perl6 are the true modern perls, where form follows function, and the implementation is reduced and optimized.


While reading this thread, I often get the impression that people are using "UTF-8" as a synonym to "Unicode". Please make a distinction between Unicode's "Code-Points" which are an enlarged relative of the ASCII code and Unicode's various "encodings". And there are a few of them, of which UTF-8, UTF-16 and UTF-32 are the current ones and a few more are obsolete.

Please, UTF-8 (as well as all other encodings) exists and have meaning in input or in output only. Internally, since Perl 5.8.1, all strings are kept as Unicode "Code-points". True, you have to enable some features as admiringly covered previously.


There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".

tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.

This post aims to solve that first problem

When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.

Marvelous.

Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like: perl -E 'while(<>){ chomp; say length }' and type in ???? and you get 12... not the correct answer, 4.

That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.

That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy of way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.

(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)

If we rewrite our one-liner:

perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'

We type in ???? and get "4" as the result. Success.

That, right there, is the solution to 99% of Unicode problems in Perl.

The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.

The other half of the problem is getting data out of your program. That's easy to; you just say use Encode qw(encode), decide what the encoding your data will be in (UTF-8 to terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.

This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.

To summarize: encode all outputs and decode all inputs.

Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content, not content, which is just the octet stream that it got from the server.) Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.

Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!

The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = '??', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.

The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "????", we got 12. Perl assumed that we were operating on the Latin-1 string "æå­åã" (which is 12 characters, some of which are non-printing).

This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.

People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.

Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:

while(<>){
    chomp;
    say "$_ ";
}

And then run on some UTF-8 encoded data, like:

perl poo.pl input-data.txt

It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!

But nope, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes in the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get garbage ("ð©") instead of the poo. This will lead you to believe that decoding the input file is the wrong thing to do. It's not.

The problem is that the poo is being implicitly upgraded as latin-1. If you use utf8 to make the literal text instead of binary, then it will work again!

(That's the number one problem I see when helping people with Unicode. They did part right and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to repair it, it breaks. Don't worry; if you are adding encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)

That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support among all popular programming languages. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16 encoded file. So make it safe now, and save yourself the headache of trashing your users' data!

The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding it is. But that's why you get the big bucks :)


There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).


Examples related to perl

The program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing while starting Apache server on my computer "End of script output before headers" error in Apache Perl - Multiple condition if statement without duplicating code? How to decrypt hash stored by bcrypt Split a string into array in Perl Turning multiple lines into one comma separated line String compare in Perl with "eq" vs "==" how to remove the first two columns in a file using shell (awk, sed, whatever) Find everything between two XML tags with RegEx Difference between \w and \b regular expression meta characters

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8