Which characters make a URL invalid?

Question

Which characters make a URL invalid? Are these valid URLs?

example.com/file[/].html
http://example.com/file[/].html

User · Answer

I am implementing a request and response reader/writer for old HTTP (0.9, 1.0, 1.1). The Request-URI is the most problematic part. You can't just use RFC 1738, 2396 or 3986 as they are, because many old HTTP clients and servers allow more characters.

So I did some research based on accidentally published web server access logs, looking at lines of the form "GET URI HTTP/1.0" 200. I found that several non-standard characters, among them <, > and ", are often used in URIs. These characters were described in RFC 1738 as unsafe. If you want to be compatible with all old HTTP clients and servers, you have to allow these characters in the request URI.

More information about this research is in oghttp-request-collector.

User · Answer

In general, URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following 84 characters:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=

Note that this list doesn't state where in the URI these characters may occur.

Any other character needs to be encoded with percent-encoding (%hh). Each part of the URI has further restrictions on which characters need to be represented by a percent-encoded word.
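A rough way to apply that list is sketched below. The names URI_CHARS and disallowed_chars are made up for this illustration, and the check only answers "does this string contain a character that can never appear raw in a URI?". It says nothing about whether each character sits in a position where it is allowed.

```python
import re
import string

# The 84 characters RFC 3986 permits somewhere in a URI.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
URI_CHARS = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS)

# A percent-encoded octet: "%" followed by two hex digits.
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def disallowed_chars(candidate):
    """Return the characters that may not appear raw anywhere in a URI."""
    # Drop well-formed percent-encodings first, so only a stray "%" is reported.
    stripped = PCT_ENCODED.sub("", candidate)
    return sorted({ch for ch in stripped if ch not in URI_CHARS})


if __name__ == "__main__":
    print(disallowed_chars("http://example.com/a b|c"))
```

An empty result means every character is either in the 84-character set or part of a percent-encoding, so it is a necessary condition for validity, not a sufficient one.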

User · Answer

In your supplementary question you asked if www.example.com/file[/].html is a valid URL.

That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).

If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no, because the square bracket characters aren't valid there.

The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name).

It's worth reading RFC 3986 carefully if you want to understand the issue fully.
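Both points (the missing scheme and the bracketed IPv6 host) can be seen with Python's standard urllib.parse.urlsplit. This is only a sketch: urlsplit is a lenient splitter rather than a validator, so it will not reject square brackets in the path by itself. The helper name has_scheme is an assumption made for this example.

```python
from urllib.parse import urlsplit


def has_scheme(candidate):
    """True when the string starts with a URI scheme such as "http:"."""
    return urlsplit(candidate).scheme != ""


# An IPv6 literal is the one place where "[" and "]" belong: the host.
ipv6_url = urlsplit("http://[2001:db8:85a3::8a2e:370:7334]/foo/bar")

if __name__ == "__main__":
    print(has_scheme("www.example.com/file[/].html"))
    print(has_scheme("http://www.example.com/file[/].html"))
    print(ipv6_url.hostname, ipv6_url.path)
```

The hostname attribute returns the address without its brackets, which shows that the brackets are delimiters of the host component rather than data.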

User · Answer

Not really an answer to your question, but validating URLs is a serious p.i.t.a. In my experience you're probably better off validating just the domain name and leaving the query part of the URL alone.

You could also resort to requesting the URL and seeing if it returns a valid response, but that might be too much for such a simple task.

Regular expressions to detect URLs are abundant; google it.

User · Answer

To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.

There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases but are marked as "unwise" or "unsafe". Explanations for why the characters are restricted are clearly spelled out in RFC-1738 (URLs) and RFC-2396 (URIs). Note that the newer RFC-3986 (update to RFC-1738) defines which characters are allowed in a given context, but the older spec offers a simpler and more general description of which characters are not allowed, with the following rules.

Excluded US-ASCII characters disallowed within the URI syntax:

    control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
    space       = <US-ASCII coded character 20 hexadecimal>
    delims      = "<" | ">" | "#" | "%" | <">

The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, "#" and "%" are reserved characters that must be used in a specific context.

The unwise characters are allowed but may cause problems:

    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Characters that are reserved within a query component and/or have special meaning within a URI/URL:

    reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username, so it could be something like ftp://user@hostname/ where the '@' character has special meaning.

Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:

    http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg

Some of the character restrictions for URIs and URLs are programming language-dependent. For example, the '|' (0x7C) character, although only marked as "unwise" in the URI spec, will throw a URISyntaxException in the Java java.net.URI constructor, so a URL like http://api.google.com/q?exp=a|b is not allowed and must be encoded instead as http://api.google.com/q?exp=a%7Cb if using Java with a URI object instance.
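The RFC 2396 classes above translate almost directly into lookup sets. The sketch below is illustrative only: classify and problem_chars are hypothetical helper names, and problem_chars deliberately skips "#" and "%" because, as noted, those are legal when used as the fragment delimiter and the escape marker.

```python
# Character classes as listed in RFC 2396 (sections 2.2 and 2.4.3).
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
SPACE = {" "}
DELIMS = set('<>#%"')
UNWISE = set("{}|\\^[]`")
RESERVED = set(";/?:@&=+$,")

CLASSES = [
    ("control", CONTROL),
    ("space", SPACE),
    ("delims", DELIMS),
    ("unwise", UNWISE),
    ("reserved", RESERVED),
]


def classify(ch):
    """Name the RFC 2396 class a character belongs to, or "other"."""
    for name, members in CLASSES:
        if ch in members:
            return name
    return "other"


def problem_chars(url):
    """Map each excluded or unwise character found in url to its class."""
    found = {}
    for ch in url:
        kind = classify(ch)
        # "#" and "%" are context-dependent, so they are not flagged here.
        if kind in ("control", "space", "delims", "unwise") and ch not in "#%":
            found[ch] = kind
    return found


if __name__ == "__main__":
    print(problem_chars("http://api.google.com/q?exp=a|b"))
```

Reserved characters such as "$" are reported by classify but not by problem_chars, since whether they need encoding depends on the component they appear in.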

User · Answer

All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.

All other characters can be used in a URL provided that they are "URL encoded" first. This involves replacing the invalid character with a specific "code" (usually in the form of the percent symbol (%) followed by a hexadecimal number).

This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.
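As a small illustration of that encoding step, Python's standard library exposes it as urllib.parse.quote and reverses it with unquote. The sample strings are made up; safe="" is passed so that even "/" is encoded, which is usually what you want for a single path segment or query value.

```python
from urllib.parse import quote, unquote

raw = "a b|c"

# Each disallowed character becomes "%" plus the hex value of its byte.
encoded = quote(raw, safe="")

# Non-ASCII characters are encoded as their UTF-8 bytes, one "%hh" per byte.
encoded_accent = quote("é", safe="")

# Decoding restores the original text.
decoded = unquote(encoded)

if __name__ == "__main__":
    print(encoded, encoded_accent, decoded)
```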

User · Answer

Several Unicode character ranges are valid in HTML5, although it might still not be a good idea to use them.

E.g., the href docs at http://www.w3.org/TR/html5/links.html#attr-hyperlink-href say:

    "The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces."

Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which says it aims to:

    "Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process."

That document defines URL code points as:

    "ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD."

The term "URL code points" is then used in the statement:

    "If c is not a URL code point and not "%", parse error."

in several parts of the parsing algorithm, including the scheme, authority, relative path, query and fragment states: so basically the entire URL.

Also, the validator at http://validator.w3.org/ passes URLs made up of non-ASCII characters from those ranges, and does not pass URLs with characters like spaces, e.g. "a b".

Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since the class "URL code points" is used at key points of the algorithm, it gives a good idea of what you can use or not.

See also: Unicode characters in URLs
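The "URL code points" definition quoted above is mechanical enough to turn into a membership test. The following is a sketch built only from the ranges listed in this answer, not from the current text of the living standard. The function names is_url_code_point and non_url_code_points are invented for the example.

```python
import string

# ASCII punctuation named in the quoted definition.
ASCII_URL_PUNCT = "!$&'()*+,-./:;=?@_~"

# Code point ranges from the quoted definition.
RANGES = [(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
# Planes 1 to 13: U+n0000 to U+nFFFD.
RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
RANGES += [(0xE1000, 0xEFFFD), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

ASCII_ALLOWED = set(string.ascii_letters + string.digits + ASCII_URL_PUNCT)


def is_url_code_point(ch):
    """True if the single character ch is a URL code point as quoted above."""
    if ch in ASCII_ALLOWED:
        return True
    code = ord(ch)
    return any(low <= code <= high for low, high in RANGES)


def non_url_code_points(text):
    """Characters that would trigger "parse error": not a code point, not "%"."""
    return [ch for ch in text if ch != "%" and not is_url_code_point(ch)]


if __name__ == "__main__":
    print(non_url_code_points("a b"))
```

As with the validator behaviour described above, a string of non-ASCII letters passes this test while a string containing a space does not.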

User · Answer

Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses like:

    https://en.wikipedia.org/wiki/Möbius_strip

or

    https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en

First, a digression into terminology. What are these addresses? Are they valid URLs?

Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since URLs are a type of URI). Per the terminology of the 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.

Per the modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously be called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.

What characters are allowed under the WHATWG Living Standard?

Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are:

    URL code points and percent-encoded bytes.

What are "URL code points"?

    The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.

(Note that the list of "URL code points" doesn't include %, but that %s are allowed in "URL units" if they're part of a percent-encoding sequence.)

The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units are allowed or some even more restrictive set of characters.

What characters were allowed under the old RFCs?

For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.

First of all, we have two types of RFC 3986 reserved characters:

    :/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
    !$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).

Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)

RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:

    abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~

Finally, the % character itself is allowed for percent-encodings.

That leaves only the following ASCII characters that are forbidden from appearing in a URL:

    The control characters (chars 0-1F and 7F), including new line, tab, and carriage return.
    The space character and "<>\^`{|}

Every other character from ASCII can legally feature in a URL.

Then RFC 3987 extends that set of unreserved characters with the following Unicode character ranges:

    %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
    / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
    / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
    / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
    / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
    / %xD0000-DFFFD / %xE1000-EFFFD

These block choices from the old spec seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written.

Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
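The IRI-to-URI conversion mentioned in the terminology digression (percent-encode every non-ASCII character) can be sketched in a few lines. This is a simplification under stated assumptions: iri_to_uri is a hypothetical helper, it encodes each non-ASCII character as its UTF-8 bytes, and it makes no attempt to treat internationalized host names specially.

```python
def iri_to_uri(iri):
    """Percent-encode the UTF-8 bytes of every non-ASCII character in iri."""
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII is passed through untouched, including reserved characters.
            pieces.append(ch)
        else:
            pieces.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(pieces)


if __name__ == "__main__":
    print(iri_to_uri("https://en.wikipedia.org/wiki/Möbius_strip"))
```

The result contains only ASCII, so it falls within what RFC 3986 calls a URI, while the input would be a URL only under the WHATWG definition.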

User · Answer

I needed to choose a character on which to split URLs in a string, so I decided to build the list of characters that cannot be found in a URL myself:

    >>> allowed = "-_.~!*'();:@&=+$,/?%#[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    >>> from string import printable
    >>> # sorted() keeps the output stable; plain set order is arbitrary
    >>> ''.join(sorted(set(printable) - set(allowed)))
    '\t\n\x0b\x0c\r "<>\\^`{|}'

So, the possible choices are the newline, tab, space, backslash and "<>^`{|}. I guess I'll go with the space or newline.

User · Answer

I came up with a couple of regular expressions for PHP that will convert URLs in text to anchor tags.

First it converts all www. URLs to http://, then it converts all URLs matching https?:// to <a href="..."> HTML links.

    $string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$2</a>',
        preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string)
    );
