Unicode characters in URLs

Question

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?

Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent-encoded to be standards compliant. My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.

All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers: URLs getting copy-pasted into text files, e-mails, or even web sites with a different encoding; HTTP client libraries; exotic browsers and RSS readers.

Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?

Is there some magic way of serving nice-looking URLs in HTML, such as http://www.example.com/düsseldorf?neighbourhood=Lörick, that can be copy-pasted with the special characters intact, but work correctly when re-used in older clients?

User · Answer

Use percent-encoded form. Some (mainly old) computers, running Windows XP for example, do not support Unicode, but rather ISO encodings. That is the reason percent-encoded URLs were invented. Also, if you give a user a URL printed on paper that contains characters which cannot be easily typed, that user may have a hard time typing it (or just ignore it). Percent-encoded form can even be used on many of the oldest machines that ever existed (although they don't support the internet, of course).

There is a downside, though: percent-encoded characters are longer than the original ones, possibly resulting in really long URLs. Just try to ignore it, or use a URL shortener. I would recommend goo.gl in this case, which makes a 13-character URL. If you don't want to register for a Google account, try bit.ly, which makes slightly longer URLs of 14 characters.
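To put a rough number on that length downside, here is a minimal Python sketch using the standard library's urllib.parse.quote. The helper name encoded_growth is made up for illustration, and the sample path is the one from the question.

```python
from urllib.parse import quote


def encoded_growth(path: str) -> tuple:
    """Return (original length, percent-encoded length), both in characters."""
    # quote() encodes the text as UTF-8 and escapes every non-ASCII byte as %XX.
    encoded = quote(path, safe="/")
    return len(path), len(encoded)


if __name__ == "__main__":
    # Each "ü" is one character but two UTF-8 bytes, so it becomes six characters.
    print(encoded_growth("/düsseldorf"))
```

Characters outside Latin scripts usually take three or four UTF-8 bytes each, so a URL written entirely in such a script grows by a correspondingly larger factor.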

User · Answer

Use percent encoding. Modern browsers will take care of display & paste issues and make it human-readable. E.g. pages under http://ko.wikipedia.org/wiki/ with Korean titles.

Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
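What the browser does for display can be approximated in a few lines. This is only a sketch of the idea, not how any particular browser implements it; display_form is a hypothetical helper name, and the URL is the question's example in encoded form.

```python
from urllib.parse import unquote


def display_form(uri: str) -> str:
    """Decode percent-escapes as UTF-8 to get a human-readable string.

    For display only: decoding can also turn escaped delimiters such as
    %2F back into "/", so the result should not be used as the link target.
    """
    return unquote(uri)  # unquote() assumes UTF-8 by default


if __name__ == "__main__":
    print(display_form("http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick"))
```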

User · Answer

Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:

http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

However, the server doesn't actually care if you get the part after the identifier wrong, so a URL that starts with http://stackoverflow.com/questions/2742852/ and continues with arbitrary non-ASCII text also works.

So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course, this probably only works in somewhat specialised circumstances.
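A server-side lookup along these lines could be sketched as follows. The route pattern and the function name question_id are assumptions for illustration, not Stack Overflow's actual implementation; the point is only that the slug never takes part in the lookup.

```python
import re
from typing import Optional

# Hypothetical route: /questions/<numeric id>/<optional slug, ignored>
_QUESTION_PATH = re.compile(r"^/questions/([0-9]+)(?:/.*)?$", re.DOTALL)


def question_id(path: str) -> Optional[int]:
    """Return the numeric id from the path, ignoring whatever follows it."""
    match = _QUESTION_PATH.match(path)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # The slug may be correct, garbled, or missing; the id alone selects the page.
    print(question_id("/questions/2742852/unicode-characters-in-urls"))
    print(question_id("/questions/2742852/%E3%81%93%3F%3F"))
    print(question_id("/questions/2742852"))
```

A real site would typically redirect to the canonical slug after the lookup, so that garbled links still converge on one URL.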

User · Answer

For me this is the correct way. This just worked:

    <?php $linker = rawurldecode($link); // decoded copy, used for display only ?>
    <a href="<?php echo $link; ?>" target="_blank"><?php echo $linker; ?></a>

This worked, and now links are displayed properly. The link in question starts with http://newspaper.annahar.com/article/121638- and continues with an Arabic article title.

Link found on: http://www.galeriejaninerubeiz.com/newsite/news

User · Answer

What Tgr said. Background:

http://www.example.com/düsseldorf?neighbourhood=Lörick

That's not a URI. But it is an IRI.

You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.

To encode an IRI into a URI, take the path and query parts, UTF-8-encode them, then percent-encode the non-ASCII bytes:

http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick

If there are non-ASCII characters in the hostname part of the IRI (an internationalized domain name), they have to be encoded using Punycode instead.

Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia have been using this for years for article titles under http://en.wikipedia.org/wiki/.

The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is... well, you know.
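The encoding steps described in this answer can be sketched in Python with only the standard library. This is a simplified illustration under stated assumptions, not a complete RFC 3987 implementation: the function name iri_to_uri is made up, userinfo and IPv6 literals are not handled, and Python's built-in "idna" codec implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: RFC 3986 sub-delims, ":" and "@", plus "%"
# so that escapes already present are not encoded a second time.
_PATH_SAFE = "/%:@!$&'()*+,;="
_QUERY_SAFE = _PATH_SAFE + "?"


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)

    # Hostname: Punycode via the stdlib "idna" codec.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    # Note: any userinfo ("user:pass@") is dropped in this sketch.
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Path, query, fragment: UTF-8 encode, then percent-encode non-ASCII bytes.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_PATH_SAFE),
        quote(parts.query, safe=_QUERY_SAFE),
        quote(parts.fragment, safe=_QUERY_SAFE),
    ))


if __name__ == "__main__":
    print(iri_to_uri("http://www.example.com/düsseldorf?neighbourhood=Lörick"))
```

Keeping "%" in the safe set makes the function idempotent, so running it over a URI that is already encoded leaves it unchanged. The trade-off is that a literal percent sign in the input is not escaped.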

User · Answer

While all of these comments are true, note that since ICANN has approved Arabic, Persian, and Chinese characters for registration in domain names, all of the browser makers (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs without any encoding, and such URLs should be searchable by Google and others. So this issue should resolve itself soon.

User · Answer

Not sure if it is a good idea, but as mentioned in other comments and as I interpret it, many Unicode chars are valid in HTML5 URLs.

E.g., the href docs at http://www.w3.org/TR/html5/links.html#attr-hyperlink-href say:

"The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces."

Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:

"ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD."

The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:

"If c is not a URL code point and not "%", parse error."

Also, the validator at http://validator.w3.org/ passes for URLs containing such non-ASCII characters, and does not pass for URLs with characters like spaces, e.g. "a b".

Related: Which characters make a URL invalid?
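The quoted definition translates fairly directly into a membership test. The following Python sketch uses the ranges as quoted in this answer (the current WHATWG text may differ), and the name is_url_code_point is chosen here for illustration.

```python
import string

# ASCII part of the definition: alphanumerics plus the listed punctuation.
_ASCII_URL_CHARS = frozenset(
    string.ascii_letters + string.digits + "!$&'()*+,-./:;=?@_~"
)

# Non-ASCII ranges as quoted above, with inclusive bounds.
_RANGES = [(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
# Planes 1 to 16 each end at xFFFD; plane 14 starts at U+E1000 in the quote.
for plane in range(0x1, 0x11):
    start = 0xE1000 if plane == 0xE else plane << 16
    _RANGES.append((start, (plane << 16) | 0xFFFD))


def is_url_code_point(ch: str) -> bool:
    """Return True if the single character ch is a URL code point."""
    if ch in _ASCII_URL_CHARS:
        return True
    cp = ord(ch)  # raises TypeError unless ch is exactly one character
    return any(lo <= cp <= hi for lo, hi in _RANGES)


if __name__ == "__main__":
    for sample in ("ü", " ", "%"):
        print(repr(sample), is_url_code_point(sample))
```

Note that "%" is deliberately not a URL code point; the parsing step quoted above treats it as a separate case.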
