[unicode] What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.

Some interesting examples:

The heart character. If I type this into my browser:

http://www.google.com/search?q=?

Then copy and paste it, I see this URL

http://www.google.com/search?q=%E2%99%A5

which makes it seem like Firefox (or Safari) is doing this.

urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'

which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.

If I type the URL

http://www.google.com/search?q=…

into my browser then copy and paste, I get

http://www.google.com/search?q=%E2%80%A6

back. Which seems to be the result of doing

urllib.quote_plus(x.encode("utf-8"))

which makes sense since … can't be encoded with Latin-1.

But then its not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.

Since this seems to be ambiguous:

In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'

works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.

What's the right thing to be doing with the special characters I need to deal with?

The answer is


I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.


IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update past RFCs). The "%uXXXX" scheme is a non-standard extension to allow Unicode in some situations, but is not universally implemented by everyone. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before then being percent-encoded.


IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts -- including HTTP.

Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.


The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. This is a guess that if the server sends us "text/xml; charset=iso-8859-1", then they expect responses back in the same format.

If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).

The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.

It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded as--- for instance, in Tomcat, you have to call request.setEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks)


The first question is what are your needs? UTF-8 encoding is a pretty good compromise between taking text created with a cheap editor and support for a wide variety of languages. In regards to the browser identifying the encoding, the response (from the web server) should tell the browser the encoding. Still most browsers will attempt to guess, because this is either missing or wrong in so many cases. They guess by reading some amount of the result stream to see if there is a character that does not fit in the default encoding. Currently all browser(? I did not check this, but it is pretty close to true) use utf-8 as the default.

So use utf-8 unless you have a compelling reason to use one of the many other encoding schemes.


Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8

Examples related to character-encoding

Changing PowerShell's default output encoding to UTF-8 JsonParseException : Illegal unquoted character ((CTRL-CHAR, code 10) Change the encoding of a file in Visual Studio Code What is the difference between utf8mb4 and utf8 charsets in MySQL? How to open html file? All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"? UTF-8 output from PowerShell ERROR 1115 (42000): Unknown character set: 'utf8mb4' "for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte How to make php display \t \n as tab and new line instead of characters

Examples related to urlencode

How to URL encode in Python 3? How to encode a URL in Swift Swift - encode URL Java URL encoding of query string parameters Objective-C and Swift URL encoding How to URL encode a string in Ruby Should I URL-encode POST data? How to set the "Content-Type ... charset" in the request header using a HTML link Parsing GET request parameters in a URL that contains another URL How to urlencode a querystring in Python?

Examples related to web-standards

input type="submit" Vs button tag are they interchangeable? Create a HTML table where each TR is a FORM Valid content-type for XML, HTML and XHTML documents Is it a good practice to use an empty URL for a HTML form's action attribute? (action="") Is a DIV inside a TD a bad idea? What is the proper way to URL encode Unicode characters? How does one target IE7 and IE8 with valid CSS? Is there a query language for JSON? Is it valid to have a html form inside another html form? What's the difference between ISO 8601 and RFC 3339 Date Formats?