[regex] how to get domain name from URL

How can I fetch a domain name from a URL String?

Examples:

+----------------------+------------+
| input                | output     |
+----------------------+------------+
| www.google.com       | google     |
| www.mail.yahoo.com   | mail.yahoo |
| www.mail.yahoo.co.in | mail.yahoo |
| www.abc.au.uk        | abc        |
+----------------------+------------+

Related:

This question is related to regex url

The answer is


I once had to write such a regex for a company I worked for. The solution was this:

  • Get a list of every ccTLD and gTLD available. Your first stop should be IANA. The list from Mozilla looks great at first sight, but lacks ac.uk for example so for this it is not really usable.
  • Join the list like the example below. A warning: Ordering is important! If org.uk would appear after uk then example.org.uk would match org instead of example.

Example regex:

.*([^\.]+)(com|net|org|info|coop|int|co\.uk|org\.uk|ac\.uk|uk|__and so on__)$

This worked really well and also matched weird, unofficial top-levels like de.com and friends.

The upside:

  • Very fast if regex is optimally ordered

The downside of this solution is of course:

  • Handwritten regex which has to be updated manually if ccTLDs change or get added. Tedious job!
  • Very large regex so not very readable.

It is not possible without using a TLD list to compare with as their exist many cases like http://www.db.de/ or http://bbc.co.uk/ that will be interpreted by a regex as the domains db.de (correct) and co.uk (wrong).

But even with that you won't have success if your list does not contain SLDs, too. URLs like http://big.uk.com/ and http://www.uk.com/ would be both interpreted as uk.com (the first domain is big.uk.com).

Because of that all browsers use Mozilla's Public Suffix List:

https://en.wikipedia.org/wiki/Public_Suffix_List

You can use it in your code by importing it through this URL:

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Feel free to extend my function to extract the domain name, only. It won't use regex and it is fast:

http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878


Use this (.)(.*?)(.) then just extract the leading and end points. Easy, right?


#!/usr/bin/perl -w
use strict;

my $url = $ARGV[0];
if($url =~ /([^:]*:\/\/)?([^\/]*\.)*([^\/\.]+)\.[^\/]+/g) {
  print $3;
}

I know the question is seeking a regex solution but in every attempt it won't work to cover everything

I decided to write this method in Python which only works with urls that have a subdomain (i.e. www.mydomain.co.uk) and not multiple level subdomains like www.mail.yahoo.com

def urlextract(url):
  url_split=url.split(".")
  if len(url_split) <= 2:
      raise Exception("Full url required with subdomain:",url)
  return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

For a certain purpose I did this quick Python function yesterday. It returns domain from URL. It's quick and doesn't need any input file listing stuff. However, I don't pretend it works in all cases, but it really does the job I needed for a simple text mining script.

Output looks like this :

http://www.google.co.uk => google.co.uk
http://24.media.tumblr.com/tumblr_m04s34rqh567ij78k_250.gif => tumblr.com

def getDomain(url):    
        parts = re.split("\/", url)
        match = re.match("([\w\-]+\.)*([\w\-]+\.\w{2,6}$)", parts[2]) 
        if match != None:
            if re.search("\.uk", parts[2]): 
                match = re.match("([\w\-]+\.)*([\w\-]+\.[\w\-]+\.\w{2,6}$)", parts[2])
            return match.group(2)
        else: return ''  

Seems to work pretty well.
However, it has to be modified to remove domain extensions on output as you wished.


Basically, what you want is:

google.com        -> google.com    -> google
www.google.com    -> google.com    -> google
google.co.uk      -> google.co.uk  -> google
www.google.co.uk  -> google.co.uk  -> google
www.google.org    -> google.org    -> google
www.google.org.uk -> google.org.uk -> google

Optional:

www.google.com     -> google.com    -> www.google
images.google.com  -> google.com    -> images.google
mail.yahoo.co.uk   -> yahoo.co.uk   -> mail.yahoo
mail.yahoo.com     -> yahoo.com     -> mail.yahoo
www.mail.yahoo.com -> yahoo.com     -> mail.yahoo

You don't need to construct an ever-changing regex as 99% of domains will be matched properly if you simply look at the 2nd last part of the name:

(co|com|gov|net|org)

If it is one of these, then you need to match 3 dots, else 2. Simple. Now, my regex wizardry is no match for that of some other SO'ers, so the best way I've found to achieve this is with some code, assuming you've already stripped off the path:

 my @d=split /\./,$domain;                # split the domain part into an array
 $c=@d;                                   # count how many parts
 $dest=$d[$c-2].'.'.$d[$c-1];             # use the last 2 parts
 if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
   $dest=$d[$c-3].'.'.$dest;              # if so, add a third part
 };
 print $dest;                             # show it

To just get the name, as per your question:

 my @d=split /\./,$domain;                # split the domain part into an array
 $c=@d;                                   # count how many parts
 if ($d[$c-2]=~m/(co|com|gov|net|org)/) { # is the second-last part one of these?
   $dest=$d[$c-3];                        # if so, give the third last
   $dest=$d[$c-4].'.'.$dest if ($c>3);    # optional bit
 } else {
   $dest=$d[$c-2];                        # else the second last
   $dest=$d[$c-3].'.'.$dest if ($c>2);    # optional bit 
 };
 print $dest;                             # show it

I like this approach because it's maintenance-free. Unless you want to validate that it's actually a legitimate domain, but that's kind of pointless because you're most likely only using this to process log files and an invalid domain wouldn't find its way in there in the first place.

If you'd like to match "unofficial" subdomains such as bozo.za.net, or bozo.au.uk, bozo.msf.ru just add (za|au|msf) to the regex.

I'd love to see someone do all of this using just a regex, I'm sure it's possible.


import urlparse

GENERIC_TLDS = [
    'aero', 'asia', 'biz', 'com', 'coop', 'edu', 'gov', 'info', 'int', 'jobs', 
    'mil', 'mobi', 'museum', 'name', 'net', 'org', 'pro', 'tel', 'travel', 'cat'
    ]

def get_domain(url):
    hostname = urlparse.urlparse(url.lower()).netloc
    if hostname == '':
        # Force the recognition as a full URL
        hostname = urlparse.urlparse('http://' + uri).netloc

    # Remove the 'user:passw', 'www.' and ':port' parts
    hostname = hostname.split('@')[-1].split(':')[0].lstrip('www.').split('.')

    num_parts = len(hostname)
    if (num_parts < 3) or (len(hostname[-1]) > 2):
        return '.'.join(hostname[:-1])
    if len(hostname[-2]) > 2 and hostname[-2] not in GENERIC_TLDS:
        return '.'.join(hostname[:-1])
    if num_parts >= 3:
        return '.'.join(hostname[:-2])

This code isn't guaranteed to work with all URLs and doesn't filter those that are grammatically correct but invalid like 'example.uk'.

However it'll do the job in most cases.


So if you just have a string and not a window.location you could use...

String.prototype.toUrl = function(){

if(!this && 0 < this.length)
{
    return undefined;
}
var original = this.toString();
var s = original;
if(!original.toLowerCase().startsWith('http'))
{
    s = 'http://' + original;
}

s = this.split('/');

var protocol = s[0];
var host = s[2];
var relativePath = '';

if(s.length > 3){
    for(var i=3;i< s.length;i++)
    {
        relativePath += '/' + s[i];
    }
}

s = host.split('.');
var domain = s[s.length-2] + '.' + s[s.length-1];    

return {
    original: original,
    protocol: protocol,
    domain: domain,
    host: host,
    relativePath: relativePath,
    getParameter: function(param)
    {
        return this.getParameters()[param];
    },
    getParameters: function(){
        var vars = [], hash;
        var hashes = this.original.slice(this.original.indexOf('?') + 1).split('&');
        for (var i = 0; i < hashes.length; i++) {
            hash = hashes[i].split('=');
            vars.push(hash[0]);
            vars[hash[0]] = hash[1];
        }
        return vars;
    }
};};

How to use.

var str = "http://en.wikipedia.org/wiki/Knopf?q=1&t=2";
var url = str.toUrl;

var host = url.host;
var domain = url.domain;
var original = url.original;
var relativePath = url.relativePath;
var paramQ = url.getParameter('q');
var paramT = url.getParamter('t');

Just for knowledge:

'http://api.livreto.co/books'.replace(/^(https?:\/\/)([a-z]{3}[0-9]?\.)?(\w+)(\.[a-zA-Z]{2,3})(\.[a-zA-Z]{2,3})?.*$/, '$3$4$5');

# returns livreto.co 

  1. how is this

    =((?:(?:(?:http)s?:)?\/\/)?(?:(?:[a-zA-Z0-9]+)\.?)*(?:(?:[a-zA-Z0-9]+))\.[a-zA-Z0-9]{2,3}) (you may want to add "\/" to end of pattern

  2. if your goal is to rid url's passed in as a param you may add the equal sign as the first char, like:

    =((?:(?:(?:http)s?:)?//)?(?:(?:[a-zA-Z0-9]+).?)*(?:(?:[a-zA-Z0-9]+)).[a-zA-Z0-9]{2,3}/)

    and replace with "/"

The goal of this example to get rid of any domain name regardless of the form it appears in. (i.e. to ensure url parameters don't incldue domain names to avoid xss attack)


I don't know of any libraries, but the string manipulation of domain names is easy enough.

The hard part is knowing if the name is at the second or third level. For this you will need a data file you maintain (e.g. for .uk is is not always the third level, some organisations (e.g. bl.uk, jet.uk) exist at the second level).

The source of Firefox from Mozilla has such a data file, check the Mozilla licensing to see if you could reuse that.


Extracting the Domain name accurately can be quite tricky mainly because the domain extension can contain 2 parts (like .com.au or .co.uk) and the subdomain (the prefix) may or may not be there. Listing all domain extensions is not an option because there are hundreds of these. EuroDNS.com for example lists over 800 domain name extensions.

I therefore wrote a short php function that uses 'parse_url()' and some observations about domain extensions to accurately extract the url components AND the domain name. The function is as follows:

function parse_url_all($url){
    $url = substr($url,0,4)=='http'? $url: 'http://'.$url;
    $d = parse_url($url);
    $tmp = explode('.',$d['host']);
    $n = count($tmp);
    if ($n>=2){
        if ($n==4 || ($n==3 && strlen($tmp[($n-2)])<=3)){
            $d['domain'] = $tmp[($n-3)].".".$tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-3)];
        } else {
            $d['domain'] = $tmp[($n-2)].".".$tmp[($n-1)];
            $d['domainX'] = $tmp[($n-2)];
        }
    }
    return $d;
}

This simple function will work in almost every case. There are a few exceptions, but these are very rare.

To demonstrate / test this function you can use the following:

$urls = array('www.test.com', 'test.com', 'cp.test.com' .....);
echo "<div style='overflow-x:auto;'>";
echo "<table>";
echo "<tr><th>URL</th><th>Host</th><th>Domain</th><th>Domain X</th></tr>";
foreach ($urls as $url) {
    $info = parse_url_all($url);
    echo "<tr><td>".$url."</td><td>".$info['host'].
    "</td><td>".$info['domain']."</td><td>".$info['domainX']."</td></tr>";
}
echo "</table></div>";

The output will be as follows for the URL's listed:

enter image description here

As you can see, the domain name and the domain name without the extension are consistently extracted whatever the URL that is presented to the function.

I hope that this helps.


/^(?:www\.)?(.*?)\.(?:com|au\.uk|co\.in)$/

/^(?:https?:\/\/)?(?:www\.)?([^\/]+)/i

/[^w{3}\.]([a-zA-Z0-9]([a-zA-Z0-9\-]{0,65}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}/gim

usage of this javascript regex ignores www and following dot, while retaining the domain intact. also properly matches no www and cc tld


There are two ways

Using split

Then just parse that string

var domain;
//find & remove protocol (http, ftp, etc.) and get domain
if (url.indexOf('://') > -1) {
    domain = url.split('/')[2];
} if (url.indexOf('//') === 0) {
    domain = url.split('/')[2];
} else {
    domain = url.split('/')[0];
}

//find & remove port number
domain = domain.split(':')[0];

Using Regex

 var r = /:\/\/(.[^/]+)/;
 "http://stackoverflow.com/questions/5343288/get-url".match(r)[1] 
 => stackoverflow.com

Hope this helps


A little late to the party, but:

_x000D_
_x000D_
const urls = [_x000D_
  'www.abc.au.uk',_x000D_
  'https://github.com',_x000D_
  'http://github.ca',_x000D_
  'https://www.google.ru',_x000D_
  'http://www.google.co.uk',_x000D_
  'www.yandex.com',_x000D_
  'yandex.ru',_x000D_
  'yandex'_x000D_
]_x000D_
_x000D_
urls.forEach(url => console.log(url.replace(/.+\/\/|www.|\..+/g, '')))
_x000D_
_x000D_
_x000D_


You need a list of what domain prefixes and suffixes can be removed. For example:

Prefixes:

  • www.

Suffixes:

  • .com
  • .co.in
  • .au.uk