What are the safe characters for making URLs?

Question

I am making a website with articles, and I need the articles to have "friendly" URLs, based on the title.

For example, if the title of my article is "Article Test", I would like the URL to be http://www.example.com/articles/article_test.

However, article titles (as any string) can contain multiple special characters that would not be possible to put literally in my URL. For instance, I know that ? or # need to be replaced, but I don't know all the others.

What characters are permissible in URLs? What is safe to keep?

User · Accepted Answer

To quote section 2.3 of RFC 3986:

"Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde."

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

Note that RFC 3986 lists fewer reserved punctuation marks than the older RFC 2396.
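
As a rough illustration of the quoted rule (not part of the original answer), the Python sketch below percent-encodes everything outside the unreserved set. The helper names `encode_for_url` and `is_unreserved_only` are made up for this example; `urllib.parse.quote` is the standard-library call, and it leaves `~` unencoded on Python 3.7 and later.

```python
from urllib.parse import quote

# The unreserved set from RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def encode_for_url(text: str) -> str:
    """Percent-encode every character outside the unreserved set."""
    # safe="" stops quote() from leaving "/" unencoded, which is its default.
    return quote(text, safe="")


def is_unreserved_only(text: str) -> bool:
    """Return True if text needs no percent-encoding at all."""
    return all(ch in UNRESERVED for ch in text)


print(encode_for_url("Article Test"))      # Article%20Test
print(is_unreserved_only("article_test"))  # True
```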

User · Answer

The format for a URI is defined in RFC 3986. See section 3.3 for details.

User · Answer

You are best keeping only some characters (whitelist) instead of removing certain characters (blacklist).

You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:

- Lower case letters (convert upper case to lower)
- Numbers, 0 through 9
- A dash - or underscore _
- Tilde ~

Everything else has a potentially special meaning. For example, you may think you can use +, but it can be replaced with a space. & is dangerous, too, especially if using some rewrite rules.

As with the other comments, check out the standards and specifications for complete details.
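
A minimal Python sketch of that whitelist idea, assuming characters off the list are simply dropped (the answer does not say whether to drop or replace them). The name `keep_whitelisted` is hypothetical.

```python
import string

# Whitelist from the answer above: lower-case letters, digits, "-", "_" and "~".
ALLOWED = set(string.ascii_lowercase + string.digits + "-_~")


def keep_whitelisted(title: str) -> str:
    """Lower-case the title and drop every character not on the whitelist."""
    return "".join(ch for ch in title.lower() if ch in ALLOWED)


print(keep_whitelisted("Article_Test+1&2"))  # article_test12
```

Spaces are dropped as well, so replace them with a dash or underscore first if word boundaries should survive.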

User · Answer

    unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

User · Answer

Between 3 and 50 characters. Can contain lowercase letters, numbers and these special characters: dot (.), dash (-), underscore (_) and at sign (@).
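
One way to check a candidate string against that rule is a regular expression. This is only a sketch; `matches_rule` is a made-up name.

```python
import re

# 3 to 50 characters drawn from a-z, 0-9, ".", "-", "_" and "@".
RULE = re.compile(r"[a-z0-9._@-]{3,50}")


def matches_rule(value: str) -> bool:
    """Return True if the whole string satisfies the length and character rule."""
    return RULE.fullmatch(value) is not None


print(matches_rule("article-test"))  # True
print(matches_rule("Article Test"))  # False
```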

User · Answer

Always Safe

In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.

    A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;

Sometimes Safe

Only safe when used within specific URL components; use with care.

    Paths:     + & =
    Queries:   ? /
    Fragments: ? / # + & =

Never Safe

According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:

    <space> <control-characters> <extended-ascii> <unicode>
    % < > [ ] { } | \ ^

If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).

Keep Context in Mind

Even if valid per the specification, a URL can still be "unsafe", depending on context, such as a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when not used as delimiters. Correct handling of these cases is generally up to your scripts and can be worked around, but it's something to keep in mind.
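
To illustrate the point about components, here is a small Python sketch that encodes each path segment and each query value separately, so a delimiter inside user text cannot change the URL's structure. `build_url` and its parameters are invented for the example; `quote` and `urlencode` are the standard `urllib.parse` functions.

```python
from urllib.parse import quote, urlencode


def build_url(base: str, segments: list, params: dict) -> str:
    """Join path segments and query parameters, encoding each on its own."""
    # safe="" percent-encodes "/" too, so a segment cannot add path levels.
    path = "/".join(quote(seg, safe="") for seg in segments)
    url = f"{base}/{path}"
    if params:
        # quote_via=quote encodes spaces as %20 instead of "+".
        url += "?" + urlencode(params, quote_via=quote)
    return url


print(build_url("http://www.example.com",
                ["articles", "AC/DC & friends"],
                {"q": "a&b=c"}))
# http://www.example.com/articles/AC%2FDC%20%26%20friends?q=a%26b%3Dc
```

This encodes more than the specification strictly requires (for example, "&" is legal in a path), which trades shorter URLs for fewer surprises.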

User · Answer

I found it very useful to encode my URL to a safe one when I was returning a value through Ajax/PHP to a URL which was then read by the page again.

PHP output with URL encoding for the special character &:

    // PHP returning the success information of an Ajax request.
    // rawurlencode() would cover every reserved character, not just "&".
    echo str_replace('&', '%26', $_POST['name']) . " category was changed";

    // JavaScript sending the value to the URL
    window.location.href = 'time.php?return=updated&val=' + msg;

    // JavaScript/PHP executing the function and printing the value from the URL,
    // now including the text that would normally be lost because of the reserved & character.
    setTimeout("infoApp('updated', '<?php echo $_GET['val']; ?>');", 360);
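
The effect described here can be reproduced in a few lines of Python. This is a sketch using the standard `urllib.parse` module rather than the PHP/JavaScript from the answer; the message text is a made-up example, while `time.php`, `return` and `val` come from the snippet above.

```python
from urllib.parse import parse_qs, quote, urlsplit

message = "Books & Music category was changed"

# Unencoded: "&" is read as a parameter delimiter and the value is cut short.
raw_url = "time.php?return=updated&val=" + message
raw_val = parse_qs(urlsplit(raw_url).query)["val"][0]

# Encoded: quote() turns "&" into %26, so the full text survives.
safe_url = "time.php?return=updated&val=" + quote(message, safe="")
safe_val = parse_qs(urlsplit(safe_url).query)["val"][0]

print(repr(raw_val))   # everything after "&" is gone
print(repr(safe_val))  # the full message
```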

User · Answer

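Looking at RFC3986 - Uniform Resource Identifier (URI): Generic Syntax, your question revolves around the path component of a URI.

      foo://example.com:8042/over/there?name=ferret#nose
      \_/   \______________/\_________/ \_________/ \__/
       |           |            |            |        |
    scheme     authority       path        query   fragment
       |   _____________________|__
      / \ /                        \
      urn:example:animal:ferret:nose

Citing section 3.3, valid characters for a URI segment are of type pchar:

    pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

Which breaks down to:

    ALPHA / DIGIT / "-" / "." / "_" / "~"
    pct-encoded
    "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
    ":" / "@"

Or in other words: within a path segment you may use letters, digits and the punctuation listed above literally. The delimiters /, ?, #, [ and ] must be percent-encoded there, and so must anything else outside this grammar, such as spaces and control characters.

This understanding is backed by RFC1738 - Uniform Resource Locators (URL).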
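
The pchar grammar translates fairly directly into a regular expression. The Python sketch below checks a single path segment against it; `is_valid_segment` is a hypothetical helper name.

```python
import re

# segment = *pchar, with
# pchar   = unreserved / pct-encoded / sub-delims / ":" / "@"
SEGMENT = re.compile(
    r"(?:[A-Za-z0-9\-._~]"   # unreserved
    r"|%[0-9A-Fa-f]{2}"      # pct-encoded
    r"|[!$&'()*+,;=]"        # sub-delims
    r"|[:@])*"               # ":" / "@"
)


def is_valid_segment(segment: str) -> bool:
    """Return True if segment is a syntactically valid RFC 3986 path segment."""
    return SEGMENT.fullmatch(segment) is not None


print(is_valid_segment("article_test"))  # True
print(is_valid_segment("a b"))           # False
```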

User · Answer

I had a similar problem. I wanted to have pretty URLs and reached the conclusion that I have to allow only letters, digits, - and _ in URLs.

That is fine, but then I wrote some nice regex and realized that it treats all non-ASCII (UTF-8) characters as non-letters in .NET, and I was stuck. This appears to be a known problem for the .NET regex engine. So I got to this solution:

    using System.Text.RegularExpressions;

    private static string GetTitleForUrlDisplay(string title)
    {
        if (!string.IsNullOrEmpty(title))
        {
            return Regex.Replace(
                Regex.Replace(title, @"[^A-Za-z0-9_-]", new MatchEvaluator(CharacterTester))
                    .Replace(' ', '-').TrimStart('-').TrimEnd('-'),
                "[-]+", "-").ToLower();
        }
        return string.Empty;
    }

    /// <summary>
    /// All characters that do not match the pattern will get to this method, i.e. useful for Unicode characters, because
    /// the .NET implementation of regex does not handle Unicode characters. So we use char.IsLetterOrDigit(), which works nicely, and we
    /// return what we approve and return - for everything else.
    /// </summary>
    /// <param name="m"></param>
    /// <returns></returns>
    private static string CharacterTester(Match m)
    {
        string x = m.ToString();
        if (x.Length > 0 && char.IsLetterOrDigit(x[0]))
        {
            return x.ToLower();
        }
        else
        {
            return "-";
        }
    }
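
For comparison, a rough Python counterpart of the same idea, not a port of the C# above. It assumes `str.isalnum()` is an acceptable stand-in for `char.IsLetterOrDigit` (the two differ on a few numeric characters), and `unicode_slug` is a made-up name.

```python
import re


def unicode_slug(title: str) -> str:
    """Keep Unicode letters and digits, turn everything else into hyphens."""
    # str.isalnum() is Unicode-aware, so accented letters are kept.
    mapped = "".join(
        ch.lower() if ch.isalnum() or ch in "_-" else "-"
        for ch in title
    )
    # Collapse runs of hyphens and trim them from both ends.
    return re.sub(r"-+", "-", mapped).strip("-")


print(unicode_slug("Article Test!"))  # article-test
```

Non-ASCII letters kept this way still have to be percent-encoded (as UTF-8) when the URL is actually sent, which browsers and most HTTP libraries do automatically.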

User · Answer

There are two sets of characters you need to watch out for: reserved and unsafe.

The reserved characters are:

- ampersand ("&")
- dollar ("$")
- plus sign ("+")
- comma (",")
- forward slash ("/")
- colon (":")
- semi-colon (";")
- equals ("=")
- question mark ("?")
- 'At' symbol ("@")
- pound ("#")

The characters generally considered unsafe are:

- space (" ")
- less than and greater than ("<>")
- open and close brackets ("[]")
- open and close braces ("{}")
- pipe ("|")
- backslash ("\")
- caret ("^")
- percent ("%")

I may have forgotten one or more, which leads to me echoing Carl V's answer. In the long run you are probably better off using a "white list" of allowed characters and then encoding the string rather than trying to stay abreast of characters that are disallowed by servers and systems.
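
The two lists above can be turned into a quick check that reports which problem characters a title contains. This is only a sketch; the function name and the returned dictionary layout are assumptions made for the example.

```python
# Character sets copied from the two lists in the answer above.
RESERVED = set("&$+,/:;=?@#")
UNSAFE = set(" <>[]{}|\\^%")


def find_problem_characters(title: str) -> dict:
    """Return the reserved and unsafe characters that occur in title."""
    return {
        "reserved": sorted(set(title) & RESERVED),
        "unsafe": sorted(set(title) & UNSAFE),
    }


print(find_problem_characters("Q&A: 100% safe?"))
# {'reserved': ['&', ':', '?'], 'unsafe': [' ', '%']}
```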

User · Answer

From the context you describe, I suspect that what you're actually trying to make is something called an 'SEO slug'. The best general known practice for those is:

1. Convert to lower-case
2. Convert entire sequences of characters other than a-z and 0-9 to one hyphen (-) (not underscores)
3. Remove 'stop words' from the URL, i.e. not-meaningfully-indexable words like 'a', 'an', and 'the'; Google 'stop words' for extensive lists

So, as an example, an article titled "The Usage of !@%$* to Represent Swearing In Comics" would get a slug of "usage-represent-swearing-comics".
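
Those three steps could be sketched in Python roughly as follows. The stop-word list here is a tiny illustrative one chosen for the example, not an authoritative list, and `seo_slug` is a hypothetical name.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def seo_slug(title: str) -> str:
    """Build a lower-case, hyphen-separated slug without stop words."""
    # Steps 1 and 2: lower-case, then split on runs of anything but a-z and 0-9.
    words = re.split(r"[^a-z0-9]+", title.lower())
    # Step 3: drop stop words (and the empty strings split() leaves at the ends).
    kept = [w for w in words if w and w not in STOP_WORDS]
    return "-".join(kept)


print(seo_slug("An Introduction to the URL Standard"))
# introduction-url-standard
```

With this stop-word list, the example title from the answer produces usage-represent-swearing-comics.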

User · Answer

I think you're looking for something like "URL encoding": encoding a URL so that it's "safe" to use on the web. Here's a reference for that: HTML URL Encoding Reference.

If you don't want any special characters, just remove any that require URL encoding.

User · Answer

From an SEO perspective, hyphens are preferred over underscores. Convert to lowercase, remove all apostrophes, then replace all non-alphanumeric strings of characters with a single hyphen. Trim excess hyphens off the start and finish.
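
A minimal Python sketch of those steps, in the order given. `hyphen_slug` is a made-up name, and treating the typographic apostrophe (U+2019) like the plain one is an assumption beyond what the answer says.

```python
import re


def hyphen_slug(title: str) -> str:
    """Lower-case, strip apostrophes, hyphenate, and trim."""
    slug = title.lower()
    # Remove apostrophes outright so "don't" becomes "dont", not "don-t".
    slug = slug.replace("'", "").replace("\u2019", "")
    # Replace each run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Trim excess hyphens off the start and finish.
    return slug.strip("-")


print(hyphen_slug("Don't Panic: A Hitchhiker's Guide!"))
# dont-panic-a-hitchhikers-guide
```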
