Characters allowed in a URL

Question

Does anyone know the full list of characters that can be used within a GET without being encoded? At the moment I am using A-Z, a-z and 0-9, but I am looking to find out the full list. I am also interested in whether a specification has been released for the upcoming addition of Chinese and Arabic URLs, as obviously that will have a big impact on my question.

User · Answer

If you'd like to give your users a special kind of experience, you could use pushState to bring a wide range of characters into the browser's URL:

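// Build a string of 250 consecutive UTF-16 code units, starting at 250 * tt.
var tt = 168;
var u = "";
for (var i = 0; i < 250; i++) {
  var x = i + 250 * tt; // code units 42000 through 42249
  console.log(x);
  u += String.fromCharCode(x);
}
// Push a relative URL made of the start offset followed by those characters.
history.pushState({}, "", 250 * tt + u);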

User · Answer

These are listed in RFC 3986. See the Collected ABNF for URI to see what is allowed where, and the regex for parsing/validation.
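As a rough illustration of that regex, here is a minimal sketch using the expression given in RFC 3986, Appendix B. It only splits a URI reference into components and does not validate the characters inside them. The helper name splitUri is made up for this example.

```javascript
// Regular expression from RFC 3986, Appendix B. It splits a URI reference
// into components; it does not check that each component is well-formed.
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUri is a hypothetical helper name used only for this illustration.
function splitUri(uri) {
  // Every group is optional, so the expression matches any string.
  const m = URI_PARTS.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],     // undefined when there is no "?"
    fragment: m[9],  // undefined when there is no "#"
  };
}

// Example URI taken from the same appendix of the RFC.
console.log(splitUri("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
```

For actual validation you would still need the ABNF rules for each component; the expression above is deliberately permissive.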

User · Answer

The upcoming change is for Chinese and Arabic domain names, not URIs. The internationalised URIs are called IRIs and are defined in RFC 3987. However, having said that, I'd recommend not doing this yourself but relying on an existing, tested library, since there are lots of choices of URI encoding/decoding, and of what is considered safe by specification versus what is safe in actual use (browsers).
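To illustrate the "use an existing library" advice, here is a small sketch that relies on JavaScript's built-in encodeURIComponent and URL API instead of a hand-written character list. The sample value and the example.com address are assumptions made up for the demonstration.

```javascript
// Hypothetical parameter value: a space, reserved characters and a non-ASCII letter.
const value = "a b&c=d/é";

// encodeURIComponent leaves only A-Z a-z 0-9 - _ . ! ~ * ' ( ) unescaped
// and percent-encodes everything else as UTF-8 octets.
const encoded = encodeURIComponent(value);
console.log(encoded);                               // a%20b%26c%3Dd%2F%C3%A9
console.log(decodeURIComponent(encoded) === value); // true

// The URL API encodes query values itself, so no manual escaping is needed.
const url = new URL("http://example.com/search");
url.searchParams.set("q", value);
console.log(url.href);
console.log(url.searchParams.get("q") === value);   // true
```

Note that the two built-ins do not agree on every character (for example, how a space is written in a query string), which is exactly the "specification versus actual use" gap mentioned above.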

User · Answer

I tested it by requesting my website (Apache) with all available chars on my German keyboard as a URL parameter:

http://example.com/?^1234567890ß´qwertzuiopü+asdfghjklöä#<yxcvbnm,.-°!"§$%&/()=? `QWERTZUIOPÜ*ASDFGHJKLÖÄ\'>YXCVBNM;:_²³{[]}\|µ@€~

These were not encoded:

^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ,.-!/()=?`*;:_{}[]\|~

Not encoded after urlencode():

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_

Not encoded after rawurlencode():

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~

Note: Before PHP 5.3.0, rawurlencode() encoded ~ because of RFC 1738. But this was replaced by RFC 3986, so it's safe to use now. But I do not understand why, for example, {} are encoded by rawurlencode(), because they are not mentioned in RFC 3986.

An additional test I made was regarding auto-linking in mail texts. I tested Mozilla Thunderbird, aol.com, outlook.com, gmail.com, gmx.de and yahoo.de, and they fully linked URLs containing these chars:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~+#,%&=*;:@

Of course the ? was linked too, but only if it was used once.

Some people would now suggest using only the rawurlencode() chars, but have you ever heard of anyone having problems opening these websites?

Asterisk
http://wayback.archive.org/web/*/http://google.com

Colon
https://en.wikipedia.org/wiki/Wikipedia:About

Plus
https://plus.google.com/+google

At sign, Colon, Comma and Exclamation mark
https://www.google.com/maps/place/USA/@36.2218457,...

Because of that, these chars should be usable unencoded without problems. Of course you should not use & or ; because of encoding sequences like &amp;. The same reason is valid for %, as it is used to encode chars in general. And for =, as it assigns a value to a parameter name.

Finally, I would say it's OK to use these unencoded:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-_~!+,*:@

But if you expect randomly generated URLs, you should not use .!, because those mark the end of a sentence and some mail apps will not auto-link the last char of the URL. Example:

Visit http://example.com/foo=bar!

User · Answer

From RFC 1738:

"Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."

User · Answer

The full list of the 66 unreserved characters is in RFC 3986, here: http://tools.ietf.org/html/rfc3986#section-2.3

This is any character in the following regex set:

[A-Za-z0-9_.\-~]
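The count of 66 can be checked by testing every ASCII character against the unreserved set from RFC 3986 section 2.3 (letters, digits, hyphen, period, underscore and tilde). This is only a quick sanity check; the variable names are made up for the example.

```javascript
// Unreserved characters per RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
const UNRESERVED = /^[A-Za-z0-9_.\-~]$/;

// Collect every ASCII character that matches the set.
const unreserved = [];
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (UNRESERVED.test(ch)) unreserved.push(ch);
}

console.log(unreserved.length);   // 66 (26 + 26 + 10 + 4)
console.log(unreserved.join("")); // the characters in ASCII order
```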

User · Answer

EDIT: As @Jukka K. Korpela correctly points out, RFC 1738 was updated by RFC 3986. This has expanded and clarified the characters valid for host; unfortunately it's not easily copied and pasted, but I'll do my best.

In first matched order:

host        = IP-literal / IPv4address / reg-name

IP-literal  = "[" ( IPv6address / IPvFuture  ) "]"

IPvFuture   = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

IPv6address =                            6( h16 ":" ) ls32
            /                       "::" 5( h16 ":" ) ls32
            / [               h16 ] "::" 4( h16 ":" ) ls32
            / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
            / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
            / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
            / [ *4( h16 ":" ) h16 ] "::"              ls32
            / [ *5( h16 ":" ) h16 ] "::"              h16
            / [ *6( h16 ":" ) h16 ] "::"

ls32        = ( h16 ":" h16 ) / IPv4address
            ; least-significant 32 bits of address

h16         = 1*4HEXDIG
            ; 16 bits of address represented in hexadecimal

IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

dec-octet   = DIGIT                 ; 0-9
            / %x31-39 DIGIT         ; 10-99
            / "1" 2DIGIT            ; 100-199
            / "2" %x30-34 DIGIT     ; 200-249
            / "25" %x30-35          ; 250-255

reg-name    = *( unreserved / pct-encoded / sub-delims )

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"     <---This seems like a practical shortcut, most closely resembling original answer

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

pct-encoded = "%" HEXDIG HEXDIG

Original answer from RFC 1738 specification:

"Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."

^ obsolete since 1998.

User · Answer

RFC 3986 defines two sets of characters you can use in a URI:

Reserved Characters: :/?#[]@!$&'()*+,;=

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.

Unreserved Characters: A-Za-z0-9-_.~

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved.
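The two sets from RFC 3986 (gen-delims ":/?#[]@", sub-delims "!$&'()*+,;=", and the unreserved letters, digits, "-", ".", "_", "~") can be turned into a small lookup. This is just a sketch; the function name classify and the "other" label are my own choices, not anything the RFC defines.

```javascript
// Character sets as listed in RFC 3986 sections 2.2 and 2.3.
const GEN_DELIMS = ":/?#[]@";
const SUB_DELIMS = "!$&'()*+,;=";
const UNRESERVED_CHAR = /^[A-Za-z0-9\-._~]$/;

// classify is a hypothetical helper: it labels a single character.
function classify(ch) {
  if (ch.length !== 1) throw new Error("expected a single character");
  if (UNRESERVED_CHAR.test(ch)) return "unreserved";
  if (GEN_DELIMS.includes(ch)) return "gen-delim";
  if (SUB_DELIMS.includes(ch)) return "sub-delim";
  return "other"; // must be percent-encoded ("%" itself only introduces an escape)
}

for (const ch of ["a", "~", "/", "&", " ", "{"]) {
  console.log(JSON.stringify(ch), classify(ch));
}
```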

User · Answer

The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding).

http://en.wikipedia.org/wiki/Percent-encoding#Types_of_URI_characters says these are RFC 3986 unreserved characters (sec. 2.3) as well as reserved characters (sec. 2.2) if they need to retain their special meaning. And also a percent character as part of a percent-encoding.
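That rule (reserved, unreserved, or a percent sign followed by two hex digits) fits into one regular expression. The sketch below is a character-level check only: it does not verify that the reserved characters appear in sensible positions, and the function name usesOnlyUriCharacters is invented for the example.

```javascript
// One alternative per allowed unit: a reserved or unreserved character,
// or a well-formed percent-encoded octet ("%" HEXDIG HEXDIG).
const URI_CHARS = /^(?:[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*$/;

// Hypothetical helper: true when the string contains nothing outside that set.
function usesOnlyUriCharacters(s) {
  return URI_CHARS.test(s);
}

console.log(usesOnlyUriCharacters("http://example.com/a%20b?x=1#top")); // true
console.log(usesOnlyUriCharacters("http://example.com/a b"));           // false, raw space
console.log(usesOnlyUriCharacters("100%"));                             // false, dangling "%"
```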
