What is the proper way to URL encode Unicode characters?

Question

I know of the non-standard %uxxxx scheme, but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.

Some interesting examples:

The heart character (♥). If I type this into my browser:

    http://www.google.com/search?q=♥

then copy and paste it, I see this URL:

    http://www.google.com/search?q=%E2%99%A5

which makes it seem like Firefox (or Safari) is doing this:

    urllib.quote_plus(x.encode('latin-1'))
    '%E2%99%A5'

which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character (…).

If I type the URL

    http://www.google.com/search?q=…

into my browser then copy and paste, I get

    http://www.google.com/search?q=%E2%80%A6

back, which seems to be the result of doing

    urllib.quote_plus(x.encode('utf-8'))

which makes sense since … can't be encoded with Latin-1.

But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1, since this seems to be ambiguous:

    In [67]: u'…'.encode('utf-8').decode('latin-1')
    Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'

works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.

What's the right thing to be doing with the special characters I need to deal with?

User · Answer

IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update past RFCs). The "%uXXXX" scheme is a non-standard extension to allow Unicode in some situations, but is not universally implemented by everyone. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before then being percent-encoded.
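
As a rough illustration of "UTF-8 first, then percent-encode", here is a minimal sketch using Python 3's standard urllib.parse module (the question's snippets use the older Python 2 urllib). The variable names are only for illustration.

```python
from urllib.parse import quote, quote_plus, unquote

heart = "\u2665"     # the heart character from the question
ellipsis = "\u2026"  # the triple dot character from the question

# quote() encodes the text as UTF-8 by default, then percent-encodes each byte.
encoded_heart = quote(heart)
encoded_ellipsis = quote(ellipsis)
print(encoded_heart)     # %E2%99%A5
print(encoded_ellipsis)  # %E2%80%A6

# quote_plus() additionally turns spaces into '+', as used in query strings.
query = quote_plus("I \u2665 URLs")
print(query)             # I+%E2%99%A5+URLs

# Decoding with the same charset (UTF-8 is also unquote's default) round-trips.
assert unquote(encoded_heart) == heart
```

Both outputs match the URLs the question reports getting back from the browser.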

User · Answer

The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. This is a guess that if the server sends us "text/xml; charset=iso-8859-1", then they expect responses back in the same format.

If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).

The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.

It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded as. For instance, in Tomcat, you have to call request.setCharacterEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks.)
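
To make the ambiguity concrete, the sketch below (Python 3, urllib.parse) encodes the same text the way a browser might for a Latin-1 page and for a UTF-8 page, then decodes the UTF-8 form with both charsets. The sample word "café" is an assumption chosen because it is representable in both encodings; it does not come from the question.

```python
from urllib.parse import quote_plus, unquote_plus

text = "caf\u00e9"  # "café": representable in both Latin-1 and UTF-8

# What a browser might submit for a form on a page declared with each charset.
as_latin1 = quote_plus(text, encoding="iso-8859-1")
as_utf8 = quote_plus(text, encoding="utf-8")
print(as_latin1)  # caf%E9
print(as_utf8)    # caf%C3%A9

# The percent-encoded bytes carry no charset label, so the server has to pick one.
right = unquote_plus(as_utf8, encoding="utf-8")
wrong = unquote_plus(as_utf8, encoding="iso-8859-1")
assert right == "caf\u00e9"          # café
assert wrong == "caf\u00c3\u00a9"    # cafÃ© (UTF-8 bytes misread as Latin-1)

# Characters outside Latin-1, such as the heart, cannot be encoded that way at all.
try:
    quote_plus("\u2665", encoding="iso-8859-1")
    latin1_handles_heart = True
except UnicodeEncodeError:
    latin1_handles_heart = False
print(latin1_handles_heart)  # False
```

Neither decode raises an error, which is why the receiving side has to be told the charset rather than discover it.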

User · Answer

The first question is: what are your needs? UTF-8 encoding is a pretty good compromise between taking text created with a cheap editor and support for a wide variety of languages.

In regards to the browser identifying the encoding, the response (from the web server) should tell the browser the encoding. Still, most browsers will attempt to guess, because this is either missing or wrong in so many cases. They guess by reading some amount of the result stream to see if there is a character that does not fit in the default encoding.

Currently all browsers (I did not check this, but it is pretty close to true) use utf-8 as the default. So use utf-8 unless you have a compelling reason to use one of the many other encoding schemes.
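
The "see whether the bytes fit the default encoding" idea can be sketched for percent-encoded values as follows. This is an illustrative heuristic only, not a description of how any particular browser works; the function name decode_with_guess and the Latin-1 fallback are assumptions.

```python
from urllib.parse import unquote_to_bytes

def decode_with_guess(percent_encoded: str, fallback: str = "iso-8859-1") -> str:
    """Try strict UTF-8 first; fall back to another charset if the bytes do not fit."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

# Valid UTF-8, so it decodes to the triple dot character.
assert decode_with_guess("%E2%80%A6") == "\u2026"

# A lone 0xE9 byte is not valid UTF-8, so the Latin-1 fallback yields "café".
assert decode_with_guess("caf%E9") == "caf\u00e9"
```

The guess can still be wrong: every byte sequence is valid Latin-1, and some Latin-1 text happens to be valid UTF-8 as well, so declaring the charset explicitly remains the safer option.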

User · Answer

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

"The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected."

It seems like, because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
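
The rule in the quoted passage can be spelled out by hand. The sketch below is a deliberately simple version for a single URI component; percent_encode is a hypothetical helper name, and in practice urllib.parse.quote does the same job.

```python
import string

# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def percent_encode(text: str) -> str:
    """Leave unreserved characters alone; UTF-8 encode and percent-encode the rest."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 byte of the character.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

print(percent_encode("a b\u2026"))  # a%20b%E2%80%A6
print(percent_encode("A-z_0.9~"))   # A-z_0.9~
```

Note that this encodes reserved delimiters such as "/" and "?" too, so it is meant for one value at a time (for example a single query parameter), not for a whole URL.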

User · Answer

IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts, including HTTP. Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.
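
A simplified sketch of that IRI-to-URI step in Python 3 is shown below. The mapping is defined in RFC 3987, section 3.1; this version is an approximation under stated assumptions: iri_to_uri is a hypothetical helper, it drops any userinfo part, and it relies on Python's built-in "idna" codec for the host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: URI delimiters plus '%' so existing escapes survive.
SAFE = "/?%:@!$&'()*+,;=~-._"

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (simplified; ignores userinfo)."""
    parts = urlsplit(iri)
    # Host names are converted with IDNA (Punycode), not percent-encoding.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = f"{host}:{parts.port}" if parts.port is not None else host
    # Everything else is encoded as UTF-8 and then percent-encoded.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

uri = iri_to_uri("http://www.google.com/search?q=\u2665")
print(uri)  # http://www.google.com/search?q=%E2%99%A5
```

Because '%' is treated as safe, passing an already-converted URI through the function leaves it unchanged rather than double-encoding it.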
