Getting parts of a URL (Regex)

Question

Given the URL (single line):

    http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions?

1. The subdomain (test)
2. The domain (example.com)
3. The path without the file (/dir/subdir/)
4. The file (file.html)
5. The path with the file (/dir/subdir/file.html)
6. The URL without the path (http://test.example.com)
7. (add any other that you think would be useful)

The regex should work correctly even if I enter the following URL:

    http://example.example.com/example/example/example.html

User · Answer

I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:

    var a = document.createElement('a');
    a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

    ['href', 'protocol', 'host', 'hostname', 'port', 'pathname', 'search', 'hash'].forEach(function (k) {
        console.log(k + ':', a[k]);
    });

    /* Output:
    href: http://www.example.com:123/foo/bar.html?fox=trot#foo
    protocol: http:
    host: www.example.com:123
    hostname: www.example.com
    port: 123
    pathname: /foo/bar.html
    search: ?fox=trot
    hash: #foo
    */

User · Answer

I was trying to solve this in JavaScript, which should be handled by:

    var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

    {
        "hash": "#foobar/bing/bo@ng?bang",
        "search": "?foo=bar&bingobang=&king=kong@kong.com",
        "pathname": "/path/wah@t/foo.js",
        "port": "890",
        "hostname": "example.com",
        "host": "example.com:890",
        "password": "b",
        "username": "a",
        "protocol": "http:",
        "origin": "http://example.com:890",
        "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
    }

However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:

    [regex not preserved; only the fragment 0-9 survives]

Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.

The parts are in this order:

    var keys = [
        "href",      // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
        "origin",    // http://user:pass@host.com:81
        "protocol",  // http:
        "username",  // user
        "password",  // pass
        "host",      // host.com:81
        "hostname",  // host.com
        "port",      // 81
        "pathname",  // /directory/file.ext
        "search",    // ?query=1
        "hash"       // #anchor
    ];

There is also a small library which wraps it and provides query params: https://github.com/sadams/lite-url (also available on bower). If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.
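For what it's worth, here is a minimal sketch of how the parts asked for in the question could be derived from the standard URL object instead of a regex. It assumes an environment where `URL` is a global (current browsers, Node.js); the helper name `splitUrl` and the names of the returned fields are made up for illustration.

```javascript
// Sketch: derive the question's parts from the standard URL object.
// splitUrl and its field names are illustrative, not part of any library.
function splitUrl(input) {
  const url = new URL(input); // throws a TypeError on an unparsable URL
  const lastSlash = url.pathname.lastIndexOf("/");
  return {
    urlWithoutPath: url.origin,                            // scheme + host (+ port)
    hostname: url.hostname,
    pathWithFile: url.pathname,
    pathWithoutFile: url.pathname.slice(0, lastSlash + 1), // keeps the trailing slash
    file: url.pathname.slice(lastSlash + 1),               // empty if the path ends in "/"
    search: url.search,
    hash: url.hash,
  };
}

console.log(splitUrl("http://test.example.com/dir/subdir/file.html"));
```

Splitting the hostname into subdomain and domain is deliberately left out here, since that cannot be done reliably from the string alone.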

User · Answer

Try the following:

    ^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files etc.

I found it from a quick google search: http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

User · Answer

I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language: https://gist.github.com/voodooGQ/4057330

User · Answer

The best answer suggested here didn't work for me because my URLs also contain a port. However, modifying it to the following regex worked for me:

    ^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:\d+)?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

User · Answer

I built this one. It is very permissive: it is not meant to validate a URL, just to split it.

    ^((http[s]?):\/\/)?([a-zA-Z0-9-.]*)?([\/]?[^?#\n]*)?([?]?[^?#\n]*)?([#]?[^?#\n]*)$

match 1: full protocol with :// (http or https)
match 2: protocol without ://
match 3: host
match 4: slug
match 5: param
match 6: anchor

It works with:

    http://
    https://
    www.demo.com
    /slug
    ?foo=bar
    #anchor
    https://demo.com
    https://demo.com/
    https://demo.com/slug
    https://demo.com/slug/foo
    https://demo.com/?foo=bar
    https://demo.com/?foo=bar#anchor
    https://demo.com/?foo=bar&bar=foo#anchor
    https://www.greate-demo.com/

It crashes on:

    #anchor ?toto

User · Answer

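Subdomain and domain are difficult because the subdomain can have several parts, as can the top-level domain: http://sub1.sub2.domain.co.uk/

    the path without the file: [regex not preserved, begins with http]
    the file:                  [regex not preserved, begins with http]
    the path with the file:    [regex not preserved, begins with http]
    the URL without the path:  [regex not preserved, begins with http]

(Markdown isn't very friendly to regexes)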

User · Answer

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

User · Answer

You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The difficult task is breaking the host into subdomain, domain name and TLD.

There is no standard way to do this, and simple string parsing or a regex cannot produce the correct result. At first I used a regex function, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs: once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.

The list needs to be maintained, however, since new TLDs keep being added. As far as I know, publicsuffix.org maintains the latest list, and you can use the domainname-parser tools from Google Code to parse the public suffix list and get the subdomain, domain and TLD easily through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.

This answer is also helpful: Get the subdomain from a URL

CaLLMeLaNN
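To make the suffix-list idea concrete, here is a rough sketch in JavaScript. Everything in it is an assumption for illustration: `SUFFIXES` is a tiny hand-written stand-in for the real public suffix list (which is far larger and has wildcard and exception rules that this ignores), and `splitHost` is a made-up helper, not the DomainName API mentioned above.

```javascript
// Sketch only: SUFFIXES is a hypothetical, hand-written stand-in for the
// real public suffix list; wildcard and exception rules are not handled.
const SUFFIXES = new Set(["com", "org", "se", "uk", "co.uk"]);

function splitHost(hostname) {
  const labels = hostname.toLowerCase().split(".");
  // Walk from the longest candidate suffix to the shortest, so that
  // "co.uk" is found before "uk". Start at 1 so that at least one
  // label is left over to serve as the domain name.
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join(".");
    if (SUFFIXES.has(suffix)) {
      return {
        subdomain: labels.slice(0, i - 1).join("."),
        domain: labels[i - 1] + "." + suffix,
        tld: suffix,
      };
    }
  }
  return null; // no known suffix, e.g. "localhost"
}

console.log(splitHost("sub1.sub2.domain.co.uk"));
console.log(splitHost("test.example.com"));
```

The point of the design is that the split is driven entirely by data (the suffix set), which is why no single regex can get it right.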

User · Answer

Java offers a URL class that will do this: Query URL Objects.

On a side note, PHP offers parse_url().

User · Answer

Here is one that is complete, and doesn't rely on any protocol.

    function getServerURL(url) {
        var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
        console.log(m[1]); // Remove this
        return m[1];
    }

    getServerURL("http://dev.test.se");
    getServerURL("http://dev.test.se/");
    getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js");
    getServerURL("//");
    getServerURL("www.dev.test.se/sdas/dsads");
    getServerURL("www.dev.test.se/");
    getServerURL("www.dev.test.se?abc=32");
    getServerURL("www.dev.test.se#abc");
    getServerURL("//dev.test.se?sads");
    getServerURL("http://www.dev.test.se#321");
    getServerURL("http://localhost:8080?sads");
    getServerURL("https://localhost:8080?sdsa");

Prints:

    http://dev.test.se
    http://dev.test.se
    //ajax.googleapis.com
    //
    www.dev.test.se
    www.dev.test.se
    www.dev.test.se
    www.dev.test.se
    //dev.test.se
    http://www.dev.test.se
    http://localhost:8080
    https://localhost:8080

User · Answer

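I found that the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

1. It cannot handle a port number.
2. The hash part is broken.

The following is a modified version:

    ^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

Positions of the parts are as follows:

    int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

    function getFileName(path) {
        return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
    }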

User · Answer

    [regex not preserved; its named groups, written (?P<name>...), are in order: scheme (https?|ftp), username, password, hostname, port, path, filename, query, fragment]

From my answer on a similar question. It works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, and fragment identifiers being broken).

User · Answer

I tried a few of these, but they didn't cover my needs, especially the highest voted one, which didn't catch a URL without a path (http://example.com/). Also, the lack of group names made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

So this is my version, slightly modified, with the source being the highest voted version here:

    ^((?P<protocol>http[s]?|ftp):\/)?\/?((?P<host>[^\/\s]+)(?P<path>(\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?)$

User · Answer

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

    https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

    ^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegEx positions:

    url:      RegExp['$&'],
    protocol: RegExp.$2,
    host:     RegExp.$3,
    path:     RegExp.$4,
    file:     RegExp.$6,
    query:    RegExp.$7,
    hash:     RegExp.$8

You could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:

    /*
        ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
    */
    proto $1
    host $2
    port $3
    the-rest $4

and then further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.
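As an illustration of the two-step approach (a coarse split first, then a second pass over "the rest"), something like the following could work in JavaScript. The function name `coarseSplit`, the field names, and the second regex are my own assumptions; only the first pattern follows the coarse proto/host/port/the-rest idea described above, with the slashes escaped for a JavaScript literal. It does not handle user:password@ credentials in the host part.

```javascript
// Step 1: coarse split into proto, host, port and "the rest".
// Step 2: split "the rest" into path, query and hash.
// coarseSplit and the second pattern are illustrative assumptions.
function coarseSplit(url) {
  const m = /^(.*:)\/\/([A-Za-z0-9\-.]+)(:[0-9]+)?(.*)$/.exec(url);
  if (!m) return null;
  const [, proto, host, port, rest] = m;

  // Path is everything up to the first "?" or "#"; query and hash are optional.
  const r = /^([^?#]*)(\?[^#]*)?(#.*)?$/.exec(rest);
  return {
    proto,                            // e.g. "https:"
    host,
    hostLabels: host.split("."),      // the host is "." delimited
    port: port ? port.slice(1) : "",  // drop the leading ":"
    path: r[1],
    query: r[2] || "",
    hash: r[3] || "",
  };
}

console.log(coarseSplit("https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash"));
```

Keeping each pattern small makes it much easier to see which part failed when a URL does not match.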

User · Answer

I needed a regular expression to match all URLs and made this one:

    [regex not preserved; only the fragment 0-9 survives]

It matches all URLs, any protocol, even URLs like

    ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

    ["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

A URL like

    mailto://admin@www.cs.server.com

looks like this:

    ["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]

User · Answer

I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9

    The numbers in the second line above are only to assist readability;
    they indicate the reference points for each subexpression (i.e., each
    paired parenthesis). We refer to the value matched for subexpression
    <n> as $<n>. For example, matching the above expression to

        http://www.ics.uci.edu/pub/ietf/uri/#Related

    results in the following subexpression matches:

        $1 = http:
        $2 = http
        $3 = //www.ics.uci.edu
        $4 = www.ics.uci.edu
        $5 = /pub/ietf/uri/
        $6 = <undefined>
        $7 = <undefined>
        $8 = #Related
        $9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript.
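For anyone who wants to try this in JavaScript, here is a small sketch with the forward slashes escaped. The pattern is the one from RFC 3986, Appendix B; the wrapper function `parseUri` and the field names it returns are my own choice, mapping groups 2, 4, 5, 7 and 9 to scheme, authority, path, query and fragment as the RFC describes.

```javascript
// RFC 3986 Appendix B pattern, with "/" escaped for a JavaScript regex literal.
const RFC3986 = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// parseUri is an illustrative wrapper, not part of the RFC.
function parseUri(uri) {
  // Every part of the pattern is optional, so exec() always returns a match.
  const m = RFC3986.exec(uri);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],     // undefined when there is no "?"
    fragment: m[9],  // undefined when there is no "#"
  };
}

console.log(parseUri("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
```

Note that this only splits a URI into its five generic components; it does not validate them or break the authority into user, host and port.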

User · Answer

I tried this regex for parsing URL partitions:

    [regex not preserved; it begins ^((http[s]?|ftp) ...]

URL: https://www.google.com/my/path/sample/asd-dsa/this?key1=value1&key2=value2

Matches:

    Group 1.    0-7     https:/
    Group 2.    0-5     https
    Group 3.    8-22    www.google.com
    Group 6.    22-50   /my/path/sample/asd-dsa/this
    Group 7.    22-46   /my/path/sample/asd-dsa/
    Group 8.    46-50   this
    Group 9.    50-74   ?key1=value1&key2=value2
    Group 10.   51-74   key1=value1&key2=value2

User · Answer

I propose a much more readable solution (in Python, but it applies to any regex):

    import re

    def url_path_to_dict(path):
        # One named group per URL part, one part per line
        pattern = (r'^'
                   r'((?P<schema>.+?)://)?'
                   r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
                   r'(?P<host>.*?)'
                   r'(:(?P<port>\d+?))?'
                   r'(?P<path>/.*?)?'
                   r'(?P<query>[?].*?)?'
                   r'$'
                   )
        regex = re.compile(pattern)
        m = regex.match(path)
        d = m.groupdict() if m is not None else None

        return d

    def main():
        print(url_path_to_dict('http://example.example.com/example/example/example.html'))

Prints:

    {
        'host': 'example.example.com',
        'user': None,
        'path': '/example/example/example.html',
        'query': None,
        'password': None,
        'port': None,
        'schema': 'http'
    }
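The same named-group idea carries over to JavaScript, where named capture groups are written `(?<name>...)` and are available from ES2018 on. The sketch below is my own adaptation for illustration, not the answer's code: `URL_PATTERN` and `urlToParts` are made-up names, and the pattern is built from strings so that each part sits on its own line.

```javascript
// Illustrative adaptation: one named group per URL part, one part per line.
// Requires a JavaScript engine with named capture groups (ES2018+).
const URL_PATTERN = new RegExp(
  "^" +
  "(?:(?<schema>.+?)://)?" +
  "(?:(?<user>.+?)(?::(?<password>.*?))?@)?" +
  "(?<host>.*?)" +
  "(?::(?<port>\\d+?))?" +
  "(?<path>/.*?)?" +
  "(?<query>[?].*?)?" +
  "$"
);

function urlToParts(url) {
  const m = URL_PATTERN.exec(url);
  // Groups that did not take part in the match come back as undefined.
  return m ? { ...m.groups } : null;
}

console.log(urlToParts("http://example.example.com/example/example/example.html"));
```

Like the pattern it mirrors, this is permissive: it splits a string rather than validating it, and it has no group for a fragment.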

User · Answer

This improved version should work as reliably as a parser         Applies to URI  not just URL or URN           http   en wikipedia org wiki Uniform Resource Identifier Relationship to URL and URN             http   labs apache org webarch uri rfc rfc3986 html regexp                                                                                          http   en wikipedia org wiki URI scheme Generic syntax                matches the entire uri        1 matches scheme  ftp  http  mailto  mshelp  ymsgr  etc         2 matches authority  host  user pwd host  etc         3 matches path        4 matches query  http GET REST api  etc         5 matches fragment  html anchor  etc              Match specific schemes  non-optional authority  disallow white-space so can delimit in text  and allow  www   w o scheme       Note the schemes must match     s               s                             www     s          s        schemes        s           s              s          S                 Validate the authority with an orthogonal RegExp  so the RegExp above won   t fail to match any valid urls     function uriRegExp  flags  schemes     null    noSubMatches     false                if   schemes            schemes        s               else if   RegExp       s                s               test  schemes              throw TypeError   expected URI schemes          return noSubMatches   new RegExp      www       s            s          schemes           s          s              s           S      flags              new RegExp         www       s            s            schemes             s            s                s             S       flags               http   en wikipedia org wiki URI scheme Official IANA-registered schemes    function uriSchemesRegExp              return  about callto ftp gtalk http https irc ircs javascript mailto mshelp sftp ssh steam tel view-source ymsgr

User · Answer

The regex to do full parsing is quite horrendous  I ve included named backreferences for legibility  and broken each part into separate lines  but it still looks like this         P lt protocol gt  w                            P lt host gt        amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                      P lt port gt  0-9                P lt path gt        amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                     P lt file gt        amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                      P lt querystring gt        amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                      P lt fragment gt          The thing that requires it to be so verbose is that except for the protocol or the port  any of the parts can contain HTML entities  which makes delineation of the fragment quite tricky  So in the last few cases - the host  path  file  querystring  and fragment  we allow either any html entity or any character that isn t a   or    The regex for an html entity looks like this    htmlentity     amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z        When that is extracted  I used a mustache syntax to represent it   it becomes a bit more legible         P lt protocol gt    ht f tps   w                            P lt host gt      htmlentity                   P lt port gt  0-9                P lt path gt      htmlentity                  P lt file gt      htmlentity                   P lt querystring gt      htmlentity                    P lt fragment gt          In JavaScript  of course  you can t use named backreferences  so the regex becomes        w                                 amp    amp apos gt lt nbsp quot bull hellip  lr  ds 
quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                      0-9                     amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                          amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                           amp    amp apos gt lt nbsp quot bull hellip  lr  ds quo  mn dash permil    1-9  0-9  1 3   A-Za-z  0-9A-Za-z                             and in each match  the protocol is  1  the host is  2  the port is  3  the path  4  the file  5  the querystring  6  and the fragment  7

User · Answer

I know you're claiming language-agnostic on this, but can you tell us what you're using, just so we know what regex capabilities you have?

If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

    (?:SOMESTUFF)

You'd still have to copy and paste (and slightly modify) the regex into multiple places, but this makes sense: you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

    https?

would match 'http' or 'https' just fine.
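A quick way to see both points is to run them. The sketch below is only an illustration in JavaScript; the two patterns are deliberately simplified stand-ins, not hometoast's expression.

```javascript
// Capturing vs non-capturing: the same match, but different group numbering.
const withGroup = /^(https?):\/\/([^\/]+)/.exec("https://test.example.com/dir");
const withoutGroup = /^(?:https?):\/\/([^\/]+)/.exec("https://test.example.com/dir");

console.log(withGroup[1], withGroup[2]); // protocol is group 1, host is group 2
console.log(withoutGroup[1]);            // protocol is not captured, host is group 1

// The "?" quantifies only the single character before it, so no brackets are needed.
console.log(/^https?$/.test("http"), /^https?$/.test("https"), /^https?$/.test("httpss"));
```

With the non-capturing form, each variant of the regex exposes only the group you actually care about.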

User · Answer

    String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

    String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

    System.out.println("1: " + s.replaceAll(regex, "$1"));
    System.out.println("2: " + s.replaceAll(regex, "$2"));
    System.out.println("3: " + s.replaceAll(regex, "$3"));
    System.out.println("4: " + s.replaceAll(regex, "$4"));

Will provide the following output:

    1: https://
    2: www.thomas-bayer.com
    3: /
    4: axis2/services/BLZService?wsdl

If you change the URL to

    String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";

the output will be the following:

    1: https://
    2: www.thomas-bayer.com
    3: ?
    4: wsdl=qwerwer&ttt=888

Enjoy.
Yosi Lev

User · Answer

None of the above worked for me. Here's what I ended up using:

    ^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?

User · Answer

USING REGEX        Parse URL to get information        param   url     the URL string to parse     return  parsed  the URL parsed or null     var UrlParser   function  url         use strict        var regx                                                                                                                 0-9                                                                           matches   regx exec url           parser   null       if  null     matches            parser                 href                matches 0               withoutHash         matches 1               url                 matches 2               origin              matches 3               protocol            matches 4               protocolseparator   matches 5               credhost            matches 6               cred                matches 7               user                matches 8               pass                matches 9               host                matches 10               hostname            matches 11               port                matches 12               pathname            matches 13               segment1            matches 14               segment2            matches 15               search              matches 16               hash                matches 17                        return parser      var parsedURL UrlParser url   console log parsedURL

User · Answer

A regexp to get the URL path without the file:

    url = 'http://domain/dir1/dir2/somefile'
    url.scan(/^(http:\/\/[^\/]+)((?:\/[^\/]+)+(?=\/))?\/?(?:[^\/]+)?$/i).to_s

It can be useful for adding a relative path to this URL.

User · Answer

Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case I think it wouldn't help (unless I copy and paste the same regex into all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although, +1 for hometoast.
