Why does modern Perl avoid UTF-8 by default?

Question

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with a big perspective) should make their software UTF-8 proof from scratch. Still I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?

Commenting @tchrist got too long, so I'm adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but this is why we (Perl users and coders) need some layer (or pragma) which makes UTF-8 handling as easy as it must be nowadays.

tchrist pointed to many aspects to cover; I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is not one single way "to enable UTF-8". I do not have enough knowledge to argue with that. So, I stick to live examples.

I played around with Rakudo and UTF-8 was just there as I needed. I didn't have any problems, it just worked. Maybe there are some limitations somewhere deeper, but at the start, all I tested worked as I expected.

Shouldn't that be a goal in modern Perl 5 too? I stress it more: I'm not suggesting UTF-8 as the default character set for core Perl; I suggest the possibility to trigger it with a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I did not find how and where to hook Unicode support. It was so time-consuming that I found it easier to go the old way. Now I saw here there was a bounty to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean? So, it is a pretty new framework, but using it with UTF-8 needs deep knowledge of its internals. It is like a big red sign: STOP, don't use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running against walls. In some ways tchrist is right and answers my questions: new projects don't attract UTF-8 because it is too complicated in Perl 5.

User · Answer

There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).

User · Answer

Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF-8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF-8. Both these are global effects, not lexical ones.

At the top of your source file (program, module, library, dohickey), prominently assert that you are running perl version 5.12 or better via:

    use v5.12;  # minimal for unicode string feature
    use v5.14;  # optimal for unicode string feature

Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

    use warnings;
    use warnings qw( FATAL utf8 );

Declare that this source unit is encoded as UTF-8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:

    use utf8;

Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF-8 unless you tell it otherwise. That way you do not affect other modules' or other programs' code.

    use open qw( :encoding(UTF-8) :std );

Enable named characters via \N{CHARNAME}.

    use charnames qw( :full :short );

If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:

    binmode(DATA, ":encoding(UTF-8)");

There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal to "make everything just work with UTF-8", albeit for a somewhat weakened sense of those terms.

One other pragma, although it is not Unicode related, is:

    use autodie;

It is strongly recommended.
My own boilerplate these days tends to look like this:

    use 5.014;

    use utf8;
    use strict;
    use autodie;
    use warnings;
    use warnings    qw< FATAL  utf8     >;
    use open        qw< :std  :utf8     >;
    use charnames   qw< :full >;
    use feature     qw< unicode_strings >;

    use File::Basename      qw< basename >;
    use Carp                qw< carp croak confess cluck >;
    use Encode              qw< encode decode >;
    use Unicode::Normalize  qw< NFD NFC >;

    END { close STDOUT }

    if (grep /\P{ASCII}/ => @ARGV) {
        @ARGV = map { decode("UTF-8", $_) } @ARGV;
    }

    $0 = basename($0);  # shorter messages
    $| = 1;

    binmode(DATA, ":utf8");

    # give a full stack dump on any untrapped exceptions
    local $SIG{__DIE__} = sub {
        confess "Uncaught exception: @_" unless $^S;
    };

    # now promote run-time warnings into stack-dumped
    #   exceptions *unless* we're in an try block, in
    #   which case just cluck the stack dump instead
    local $SIG{__WARN__} = sub {
        if ($^S) { cluck   "Trapped warning: @_" }
        else     { confess "Deadly warning: @_"  }
    };

    while (<>) {
        chomp;
        $_ = NFD($_);
        ...
    } continue {
        say NFC($_);
    }

    __END__

Saying that "Perl should [somehow!] enable Unicode by default" doesn't even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it's also how those characters all interact in many, many ways.

Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to "upgrade" to your spiffy new Brave New World modernity.

It is way way way more complicated than people pretend. I've thought about this a huge, whole lot over the past few years. I would love to
be shown that I am wrong. But I don't think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you'll break either your own code or somebody else's. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.

Perl goes out of its way to make Unicode easy, far more than anything else I've ever used. If you think this is bad, try something else for a while. Then come back to Perl: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make Perl better at these things.

At a minimum, here are some things that would appear to be required for Perl to "enable Unicode by default", as you put it:

- All Perl source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPT=-Mutf8.
- The Perl DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").
- Program arguments to Perl scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPT=-CA.
- The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.
- Any other handles opened by Perl should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D or with i and o for particular ones of these; export PERL5OPT=-CD would work. That makes -CSAD for all of them.
- Cover both bases plus all the streams you open with export PERL5OPT=-Mopen=:utf8,:std. See uniquote.
- You don't want to miss UTF-8 encoding errors. Try export PERL5OPT=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to
:utf8.
- Code points between 128 and 255 should be understood by Perl to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPT=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPT=-Mv5.12 or better will also get that.
- Named Unicode characters are not by default enabled, so add export PERL5OPT=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.
- You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPT=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There's no I/O layer for these yet that I'm aware of, but see nfc, nfd, nfkd, and nfkc.
- String comparisons in Perl using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b). Might as well add that to your export PERL5OPT=-MUnicode::Collate. You can cache the key for binary comparisons.
- Perl built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.
- If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function because Perl's built-in atoi(3) isn't currently clever enough.
- You are going to have filesystem issues. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.
- All your Perl code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as
a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.
- Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.
- Code that uses [a-zA-Z] is even worse. And it can't use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!
- If you are looking for Perl variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn't thinking about the punctuation variables or package variables.
- If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.
- If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!
- If you don't know when and whether to call Unicode::Stringprep, then you had better learn.
- Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the Perl built-ins.
- Sometimes that's still not enough, and you need the Unicode::Collate::Locale module instead, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b). Consider that Unicode::Collate::->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don't use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It's tough, I tell you. You can play with ucsort to test some of these things out.
- Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string "niño". Its NFD form (which you had darned well better have remembered to put it in) becomes "nin\x{303}o". Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won't be able to do something like (?=[aeiou])\X either, because even in NFD a code point like 'ø' does not decompose! However, it will test equal to an 'o' using the UCA comparison I just showed you. You can't rely on NFD, you have to rely on UCA.

And that's not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their Perl code will be broken.

- Code that assumes it can open a text file without specifying the encoding is broken.
- Code that assumes the default encoding is some sort of native platform encoding is broken.
- Code that assumes that web pages in Japanese or Chinese take up less space in UTF-16 than in UTF-8 is wrong.
- Code that assumes Perl uses UTF-8 internally is wrong.
- Code that assumes that encoding errors will always raise an exception is wrong.
- Code that assumes Perl code points are limited to 0x10_FFFF is wrong.
- Code that assumes you can set $/ to something that will work with any valid line separator is wrong.
- Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those.
- Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example,
"ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.
- Code that assumes changing the case doesn't change the length of the string is broken.
- Code that assumes there are only two cases is broken. There's also titlecase.
- Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.
- Code that assumes that case is never locale-dependent is broken.
- Code that assumes Unicode gives a fig about POSIX locales is broken.
- Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.
- Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.
- Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.
- Code that assumes dashes, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.
- Code that assumes every code point takes up no more than one print column is broken.
- Code that assumes that all \p{Mark} characters take up zero print columns is broken.
- Code that assumes that characters which look alike are alike is broken.
- Code that assumes that characters which do not look alike are not alike is broken.
- Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.
- Code that assumes \X can never start with a \p{Mark} character is wrong.
- Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.
- Code that assumes
that it cannot use "\x{FFFF}" is wrong.
- Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn't: it encodes to a single code point.
- Code that transcodes from UTF-16 or UTF-32 with leading BOMs into UTF-8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.
- Code that assumes CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.
- Code that assumes characters like > always point to the right and < always point to the left is wrong, because they in fact do not.
- Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don't.
- Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct taped.)
- Code that assumes that all \p{Math} code points are visible characters is wrong.
- Code that assumes \w contains only letters, digits, and underscores is wrong.
- Code that assumes that ^ and ~ are punctuation marks is wrong.
- Code that assumes that ü has an umlaut is wrong.
- Code that believes things like ₨ contain any letters in them is wrong.
- Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.
- Code that believes that \p{InLatin} is almost ever useful is almost certainly wrong.
- Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.
- Code that believes someone's name can only contain certain characters is stupid, offensive, and wrong.
- Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. I'm not even positive they should even be allowed to see again, since it obviously hasn't done them much good so far.
- Code that believes there's some way to pretend textfile encodings don't exist is broken and dangerous. Might as well poke the other eye out, too.
- Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.
- Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.
- Code that believes you can use printf widths to pad and justify Unicode data is broken and wrong.
- Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you'll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!
- Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.
- Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.
- Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You'd be surprised.
- Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.
- People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ALLCAPS telegraph at 40
characters per line and hand-delivered by a courier. STOP.

I don't know how much more "default Unicode in Perl" you can get than what I've written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.

As you see, there are far too many Unicode things that you really do have to worry about for there to ever exist any such thing as "default to Unicode".

What you're going to discover, just as we did back in Perl 5.8, is that it is simply impossible to impose all these things on code that hasn't been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.

And even once you do, there are still critical issues that require a great deal of thought to get right. There is no switch you can flip. Nothing but brain, and I mean real brain, will suffice here. There's a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21st century, and you cannot wish Unicode away by willful ignorance.

You have to learn it. Period. It will never be so easy that "everything just works", because that will guarantee that a lot of things don't work, which invalidates the assumption that there can ever be a way to "make it all work".

You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.

As just one example, canonical ordering is going to cause some real headaches. "\x{F5}" (õ), "o\x{303}" (õ), "o\x{303}\x{304}" (ȭ), and "o\x{304}\x{303}" (ō̃) should all match 'õ', but how in the world are you going to do that? This is harder than it looks, but it's something you need to account for.

If there's one thing I know about Perl, it is what its Unicode bits do and do not do, and this
thing I promise you: there is no Unicode magic bullet.

You cannot just change some defaults and get smooth sailing. It's true that I run Perl with PERL_UNICODE set to "SA", but that's all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.

Good luck, have a nice day, and hope this helps!
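A few of the claims in this answer can be checked with core modules only. What follows is a minimal sketch, not the author's code: the variable names are made up for illustration, and it assumes a perl of 5.14 or later with the bundled Unicode::Normalize and Unicode::Collate. It covers the case change that alters string length, the NFD form of "niño", grapheme clusters via \X, and primary-strength collation treating canonically different spellings as equal.

```perl
use v5.14;              # enables strict and the unicode_strings feature
use warnings;
use Unicode::Normalize qw(NFD NFC);
use Unicode::Collate;

# Changing case can change length: U+00DF (sharp s) uppercases to "SS".
my $upper = uc "\xDF";

# "niño" in NFD: the tilde becomes a separate combining mark (U+0303).
my $nfd = NFD("ni\x{F1}o");

# \X matches one grapheme cluster, so "n" plus its tilde stay together.
my @clusters = $nfd =~ /\X/g;

# At primary strength (level 1) the collator ignores marks, so the
# precomposed and decomposed spellings compare equal, and so do two
# different orderings of the same pair of combining marks.
my $collator      = Unicode::Collate->new(level => 1);
my $same_spelling = $collator->eq("\x{F5}", "o\x{303}");
my $same_order    = $collator->eq("o\x{303}\x{304}", "o\x{304}\x{303}");

# Only ASCII is printed here, so no output layer is needed.
printf "uc length: %d, code points in NFD: %d, grapheme clusters: %d\n",
    length($upper), length($nfd), scalar(@clusters);
```

Note that plain eq on those same strings would report them as different, which is the point of the canonical ordering example above: equality has to go through normalization or the collator, not through byte-for-byte comparison.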

User · Answer

There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".

tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.

This post aims to solve that first problem.

When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.

Marvelous.

Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like:

    perl -E 'while(<>){ chomp; say length }'

and type in 文字化け and you get 12... not the correct answer, 4.

That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.

That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.

(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)

If we rewrite our one-liner:

    perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'

We type in 文字化け and get "4" as the result. Success.

That, right there, is the solution to 99% of Unicode problems in Perl.

The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.

The other half of the problem is getting data out of your program. That's easy too; you just say use Encode qw(encode), decide what encoding your data will be in (UTF-8 to terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.

This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.

To summarize: encode all outputs and decode all inputs.

Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content, not content, which is just the octet stream that it got from the server). Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.

Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere.
That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!

The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ほげ', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.

The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the Latin-1 string "æååã" (which is 12 characters, some of which are non-printing).

This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.

People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.

Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:

    while(<>){
        chomp;
        say "$_ 💩";
    }

And then run on some UTF-8 encoded data, like:

    perl poo.pl input-data.txt

It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!

But nope, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes in the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get garbage ("ð©") instead of the poo. This will lead you to believe that decoding the input file is the wrong thing to do. It's not.

The problem is that the poo is being implicitly upgraded as Latin-1. If you use utf8 to make the literal text instead of binary, then it will work again!

(That's the number one problem I see when helping people with Unicode. They did part right and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to repair it, it breaks. Don't worry; if you are adding encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)

That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support among all popular programming languages. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16 encoded file. So make it safe now, and save yourself the headache of trashing your users' data!

The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding it is. But that's why you get the big bucks :)
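The decode-in, encode-out rule and the half-fixed failure mode can be reproduced without a terminal or an input file. This is a rough sketch under a few assumptions: the input octets are written as hex escapes (twelve octets forming four UTF-8 characters) rather than read from STDIN, and all variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Twelve octets: the UTF-8 encoding of a four-character string.
my $octets = "\xE6\x96\x87\xE5\xAD\x97\xE5\x8C\x96\xE3\x81\x91";

# Undecoded, Perl counts octets; decoded, it counts characters.
my $octet_count = length $octets;
my $text        = decode('UTF-8', $octets);
my $char_count  = length $text;

# Correct: append a character (PILE OF POO as an escape, so this file
# stays pure ASCII and needs no "use utf8"), then encode on the way out.
my $good = encode('UTF-8', $text . "\x{1F4A9}");

# Half-fixed: decoded text joined with the same character given as raw
# UTF-8 octets. Perl treats each of those four octets as a Latin-1
# character, so encoding turns each one into two octets of garbage.
my $bad = encode('UTF-8', $text . "\xF0\x9F\x92\xA9");

printf "octets: %d, characters: %d, good output: %d octets, bad output: %d octets\n",
    $octet_count, $char_count, length($good), length($bad);
```

The bad output is four octets longer than the good one because the literal was never decoded, which is the "you just haven't fixed it enough" situation described above.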

User · Answer

I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data, Unicode, ISO-8859-1, or many other things, your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default setup inside your program that's going to tell the stuff outside your program how to act.

You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 if the thing reading from them can't handle UTF-8. Perl has no way to know those things. That's why you are the programmer.

I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode stuff in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You're not going to be able to use it effectively otherwise.
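To make the point about input handles concrete, here is a small sketch that reads the same octets through two different layers. The sample data and the read_with_layer helper are hypothetical, chosen only for illustration; an in-memory filehandle stands in for a real file so the example is self-contained.

```perl
use strict;
use warnings;

# Five octets: "café" encoded as UTF-8 (assumed sample data).
my $octets = "caf\xC3\xA9";

# Hypothetical helper: read one line from an in-memory handle
# through the named PerlIO layer.
sub read_with_layer {
    my ($layer, $bytes) = @_;
    open(my $fh, "<:$layer", \$bytes)
        or die "cannot open in-memory handle: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# The programmer, not Perl, has to know which of these is true.
my $right = read_with_layer('encoding(UTF-8)',      $octets);
my $wrong = read_with_layer('encoding(ISO-8859-1)', $octets);

printf "UTF-8 layer: %d characters, Latin-1 layer: %d characters\n",
    length($right), length($wrong);
```

Neither call fails or warns. The Latin-1 reading simply produces five characters instead of four, which is why a wrong guess garbles data silently.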

User · Answer

While reading this thread, I often get the impression that people are using "UTF-8" as a synonym for "Unicode". Please make a distinction between Unicode's "code points", which are an enlarged relative of the ASCII code, and Unicode's various "encodings". And there are a few of them, of which UTF-8, UTF-16 and UTF-32 are the current ones and a few more are obsolete.

Please, UTF-8 (as well as all other encodings) exists and has meaning in input or in output only. Internally, since Perl 5.8.1, all strings are kept as Unicode "code points". True, you have to enable some features, as admirably covered previously.
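The distinction between a code point and its encodings is easy to see with the core Encode module. A minimal sketch, with variable names chosen here for illustration: one character is encoded three ways and the octet counts are compared.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One code point: U+20AC EURO SIGN. Inside Perl it is one character.
my $char = "\x{20AC}";

# The same code point takes a different number of octets in each
# encoding; the encodings only matter at the input/output boundary.
my %octet_count;
for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
    $octet_count{$encoding} = length encode($encoding, $char);
}

printf "%s: %d octets\n", $_, $octet_count{$_} for sort keys %octet_count;
```

The big-endian variants are used so that no byte order mark is added to the output, which keeps the counts easy to compare.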

User · Answer

You should enable the unicode_strings feature, and this is the default if you use v5.14;

You should not really use Unicode identifiers, esp. for foreign code via utf8, as they are insecure in perl5; only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html

Regarding utf8 for your filehandles/streams: you need to decide by yourself the encoding of your external data. A library cannot know that, and since not even libc supports utf8, proper utf8 data is rare. There's more wtf8, the Windows aberration of utf8, around.

BTW: Moose is not really "Modern Perl", they just hijacked the name. Moose is perfect Larry Wall-style postmodern perl mixed with Bjarne Stroustrup-style everything goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, horrible fields syntax, and a very immature naive implementation which is 10x slower than a proper implementation. cperl and perl6 are the true modern perls, where form follows function, and the implementation is reduced and optimized.

User · Answer

We're all in agreement that it is a difficult problem for many reasons, but that's precisely the reason to try to make it easier on everybody.

There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".

As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make doing common problems easier. That's the reason that we're programmers.

If utf8::all doesn't do something you think it should, let's improve it to make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.
