[html] Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?

Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent encoded to be standards compliant.

My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.

All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:

  • URLs getting copy+pasted into text files, E-Mails, even Web sites with a different encoding
  • HTTP Client libraries
  • Exotic browsers, RSS readers

Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?

Is there some magic way of serving nice-looking URLs in HTML

http://www.example.com/düsseldorf?neighbourhood=Lörick

that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?

This question is related to html url unicode utf-8

The answer is


Not sure if it is a good idea, but as mentioned in other comments and as I interpret it, many Unicode chars are valid in HTML5 URLs.

E.g., href docs say http://www.w3.org/TR/html5/links.html#attr-hyperlink-href:

The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.

Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:

ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:

If c is not a URL code point and not "%", parse error.

Also the validator http://validator.w3.org/ passes for URLs like "??", and does not pass for URLs with characters like spaces "a b"

Related: Which characters make a URL invalid?


As all of these comments are true, you should note that as far as ICANN approved Arabic (Persian) and Chinese characters to be registered as Domain Name, all of the browser-making companies (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs without any encoding, and those should be searchable by Google, etc.

So this issue will resolve ASAP.


What Tgr said. Background:

http://www.example.com/düsseldorf?neighbourhood=Lörick

That's not a URI. But it is an IRI.

You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.

To encode an IRI into a URI, take the path and query parts, UTF-8-encode them then percent-encode the non-ASCII bytes:

http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick

If there are non-ASCII characters in the hostname part of the IRI, eg. http://??.???/, they have be encoded using Punycode instead.

Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia have been using this for years, eg.:

http://en.wikipedia.org/wiki/?

The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...

...well, you know.


Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:

http://stackoverflow.com/questions/2742852/unicode-characters-in-urls

However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:

http://stackoverflow.com/questions/2742852/?????????????????

So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...


For me this is the correct way, This just worked:

    $linker = rawurldecode("$link");
    <a href="<?php echo $link;?>"   target="_blank"><?php echo $linker ;?></a>

This worked, and now links are displayed properly:

http://newspaper.annahar.com/article/121638-????--????-???-??-??????-?????-????-??????-??????-????-??????-?????-????????

Link found on:

http://www.galeriejaninerubeiz.com/newsite/news


Use percent-encoded form. Some (mainly old) computers running Windows XP for example do not support Unicode, but rather ISO encodings. That is the reason percent-encoded URLs were invented. Also, if you give a URL printed on paper to a user, containing characters that cannot be easily typed, that user may have a hard time typing it (or just ignore it). Percent-encoded form can even be used in many of the oldest machines that ever existed (although they don't support internet of course).

There is a downside though, as percent-encoded characters are longer than the original ones, thus possibly resulting in really long URLs. But just try to ignore it, or use a URL shortener (I would recommend goo.gl in this case, which makes a 13-character long URL). Also, if you don't want to register for a Google account, try bit.ly (bit.ly makes slightly longer URLs, with the length being 14 characters).


Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment

Examples related to url

What is the difference between URL parameters and query strings? Allow Access-Control-Allow-Origin header using HTML5 fetch API File URL "Not allowed to load local resource" in the Internet Browser Slack URL to open a channel from browser Getting absolute URLs using ASP.NET Core How do I load an HTTP URL with App Transport Security enabled in iOS 9? Adding form action in html in laravel React-router urls don't work when refreshing or writing manually URL for public Amazon S3 bucket How can I append a query parameter to an existing URL?

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8