Get protocol + host name from URL

Question

In my Django app, I need to get the host name from the referrer in request.META.get('HTTP_REFERER') along with its protocol, so that from URLs like:

    https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1
    https://stackoverflow.com/questions/1234567/blah-blah-blah-blah
    http://www.example.com
    https://www.other-domain.com/whatever/blah/blah/?v1=0&v2=blah+blah ...

I should get:

    https://docs.google.com/
    https://stackoverflow.com/
    http://www.example.com
    https://www.other-domain.com/

I looked over other related questions and found out about urlparse, but that didn't do the trick, since it returns only the host name without the protocol:

    >>> urlparse(request.META.get('HTTP_REFERER')).hostname
    'docs.google.com'

User · Answer

Here is a slightly improved version:

    urls = [
        "http://stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
        "Stackoverflow.com:8080/some/folder?test=/questions/9626535/get-domain-name-from-url",
        "http://stackoverflow.com/some/folder?test=/questions/9626535/get-domain-name-from-url",
        "https://StackOverflow.com:8080?test=/questions/9626535/get-domain-name-from-url",
        "stackoverflow.com?test=questions&v=get-domain-name-from-url",
    ]
    for url in urls:
        spltAr = url.split("://")
        # Use the part after "://" when a scheme is present, else the whole string.
        i = (0, 1)[len(spltAr) > 1]
        # Drop the query string, the path and the port, then normalize case.
        dm = spltAr[i].split("?")[0].split("/")[0].split(":")[0].lower()
        print(dm)

Output:

    stackoverflow.com
    stackoverflow.com
    stackoverflow.com
    stackoverflow.com
    stackoverflow.com

Fiddle: https://pyfiddle.io/fiddle/23e4976e-88d2-4757-993e-532aa41b7bf0/?i=true

User · Answer

Python 3, using urlsplit:

    from urllib.parse import urlsplit
    url = "http://stackoverflow.com/questions/9626535/get-domain-name-from-url"
    base_url = "{0.scheme}://{0.netloc}/".format(urlsplit(url))
    print(base_url)
    # http://stackoverflow.com/
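
As a quick check against the URLs from the question, the same urlsplit approach can be wrapped in a small function. This is only a sketch: the helper name `base_url` is made up for illustration, and it assumes every input is an absolute URL that includes a scheme.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Return 'scheme://netloc/' for an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    return "{0.scheme}://{0.netloc}/".format(parts)


# Sample URLs taken from the question.
urls = [
    "https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1",
    "https://stackoverflow.com/questions/1234567/blah-blah-blah-blah",
    "http://www.example.com",
    "https://www.other-domain.com/whatever/blah/blah/?v1=0&v2=blah+blah",
]

for u in urls:
    print(base_url(u))
```

Note that this always appends a trailing slash, including for http://www.example.com, where the question's expected output has none.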

User · Answer

https://github.com/john-kurkowski/tldextract

This is a more verbose version of urlparse. It detects domains and subdomains for you.

From their documentation:

    >>> import tldextract
    >>> tldextract.extract('http://forums.news.cnn.com/')
    ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
    >>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
    ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
    >>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
    ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')

ExtractResult is a namedtuple, so it's simple to access the parts you want.

    >>> ext = tldextract.extract('http://forums.bbc.co.uk')
    >>> ext.domain
    'bbc'
    >>> '.'.join(ext[:2]) # rejoin subdomain and domain
    'forums.bbc'

User · Answer

It could be solved with re.search():

    import re
    url = 'https://docs.google.com/spreadsheet/ccc?key=blah-blah-blah-blah#gid=1'
    result = re.search(r'^http[s]*:\/\/[\w\.]*', url).group()
    print(result)

    # result
    'https://docs.google.com'
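
One caveat worth illustrating: the character class `[\w\.]` does not include hyphens, so a host such as www.other-domain.com from the question gets cut short. The sketch below compares that pattern with a looser variant; the names `ORIGINAL`, `VARIANT` and `origin_or_none` are made up here, and the variant is just one possible alternative, not a full URL validator.

```python
import re

# The pattern from the answer above: stops at the first character
# that is not a word character or a dot (for example a hyphen or a colon).
ORIGINAL = re.compile(r'^http[s]*:\/\/[\w\.]*')

# An assumed alternative: take everything up to the first '/', '?' or '#'.
VARIANT = re.compile(r'^https?://[^/?#]+')


def origin_or_none(url):
    """Return the matched 'scheme://host' or None when nothing matches."""
    match = VARIANT.search(url)
    return match.group() if match else None


url = 'https://www.other-domain.com/whatever/blah/blah/?v1=0&v2=blah+blah'
print(ORIGINAL.search(url).group())  # https://www.other
print(origin_or_none(url))           # https://www.other-domain.com
```

Checking for `None` before calling `.group()` also avoids an AttributeError when the referrer is not an http(s) URL.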

User · Answer

If you know your URL is valid and always includes a scheme, then this will work every time:

    domain = "http://google.com".split("://")[1].split("/")[0]

User · Answer

You can simply use urljoin with the relative root '/' as the second argument:

    import urllib.parse

    url = 'https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url'
    root_url = urllib.parse.urljoin(url, '/')
    print(root_url)

User · Answer

Pure string operations :)

    >>> url = "http://stackoverflow.com/questions/9626535/get-domain-name-from-url"
    >>> url.split("//")[-1].split("/")[0].split('?')[0]
    'stackoverflow.com'
    >>> url = "stackoverflow.com/questions/9626535/get-domain-name-from-url"
    >>> url.split("//")[-1].split("/")[0].split('?')[0]
    'stackoverflow.com'
    >>> url = "http://foo.bar?haha/whatever"
    >>> url.split("//")[-1].split("/")[0].split('?')[0]
    'foo.bar'

That's all, folks.
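
A case where this kind of splitting can go wrong is a URL that contains "//" again later, for instance inside a query string. The sketch below uses a made-up URL and two hypothetical helpers, `host_last` and `host_first`, to show the difference between taking the last "//" piece and splitting only once.

```python
def host_last(url):
    """The expression from the answer above: takes the piece after the LAST '//'."""
    return url.split("//")[-1].split("/")[0].split("?")[0]


def host_first(url):
    """Assumed variant: split on the FIRST '//' only."""
    return url.split("//", 1)[-1].split("/")[0].split("?")[0]


# Made-up URL whose query string contains another '//'.
tricky = "http://a.example/x?next=//b.example/y"

print(host_last(tricky))   # b.example
print(host_first(tricky))  # a.example
```

Even the second variant is still a heuristic: a scheme-less URL with "//" in its query would be misread in the same way.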

User · Answer

Is there anything wrong with pure string operations?

    url = 'http://stackoverflow.com/questions/9626535/get-domain-name-from-url'
    parts = url.split('//', 1)
    print(parts[0] + '//' + parts[1].split('/', 1)[0])
    # http://stackoverflow.com

If you prefer having a trailing slash appended, extend this script a bit like so:

    parts = url.split('//', 1)
    base = parts[0] + '//' + parts[1].split('/', 1)[0]
    # Append '/' only if the original URL has one right after the host.
    print(base + (len(url) > len(base) and url[len(base)] == '/' and '/' or ''))

That can probably be optimized a bit ...

User · Answer

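This is a bit obtuse, but uses urlparse in both directions:

    from urllib.parse import urlparse, urlunparse
    # Python 2: from urlparse import urlparse, urlunparse

    def uri2schemehostname(uri):
        # Keep scheme and netloc; blank out path, params, query and fragment.
        return urlunparse(urlparse(uri)[:2] + ("",) * 4)

That odd ("",) * 4 bit is there because urlunparse expects a sequence of exactly len(ParseResult._fields) = 6 items.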

User · Answer

I know it's an old question, but I too encountered it today. Solved it with a one-liner (url is the string to process):

    import re
    result = re.sub(r'(.*://)?([^/?]+).*', r'\g<1>\g<2>', url)

User · Answer

You should be able to do it with urlparse (docs: python2, python3):

    from urllib.parse import urlparse
    # from urlparse import urlparse  # Python 2
    parsed_uri = urlparse('http://stackoverflow.com/questions/1234567/blah-blah-blah-blah')
    result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    print(result)

    # gives
    'http://stackoverflow.com/'
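
In the Django setting from the question, the Referer header may be missing or may not be an absolute URL, so it can help to guard for that before formatting. The following is a minimal sketch: `referer_origin` is a hypothetical helper, and it takes any dict-like object such as request.META so that it can be tried without Django installed.

```python
from urllib.parse import urlparse


def referer_origin(meta):
    """Return 'scheme://netloc/' from meta['HTTP_REFERER'], or None.

    `meta` is assumed to be dict-like, e.g. Django's request.META.
    """
    referer = meta.get("HTTP_REFERER")
    if not referer:
        return None  # header absent or empty
    parsed = urlparse(referer)
    if not parsed.scheme or not parsed.netloc:
        return None  # not an absolute URL
    return "{uri.scheme}://{uri.netloc}/".format(uri=parsed)


# Usage with plain dicts standing in for request.META:
print(referer_origin({"HTTP_REFERER": "https://docs.google.com/spreadsheet/ccc?key=blah#gid=1"}))
print(referer_origin({}))
```

Inside a view, the call would look like `referer_origin(request.META)`.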

User · Answer

The standard library function urllib.parse.urlsplit() is all you need. Here is an example for Python 3:

    >>> import urllib.parse
    >>> o = urllib.parse.urlsplit('https://user:pass@www.example.com:8080/dir/page.html?q1=test&q2=a2#anchor1')
    >>> o.scheme
    'https'
    >>> o.netloc
    'user:pass@www.example.com:8080'
    >>> o.hostname
    'www.example.com'
    >>> o.port
    8080
    >>> o.path
    '/dir/page.html'
    >>> o.query
    'q1=test&q2=a2'
    >>> o.fragment
    'anchor1'
    >>> o.username
    'user'
    >>> o.password
    'pass'
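
Since netloc still carries the user name and password, one option is to rebuild the protocol and host from hostname and port instead. This is a sketch under stated assumptions: the helper name `origin_without_credentials` is made up, and IPv6 literals are not handled (hostname drops their brackets).

```python
from urllib.parse import urlsplit


def origin_without_credentials(url):
    """Return 'scheme://host[:port]' with any user:pass@ part removed.

    Assumption: the host is a regular name or IPv4 address, not an IPv6 literal.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = "{}:{}".format(host, parts.port)
    return "{}://{}".format(parts.scheme, host)


url = "https://user:pass@www.example.com:8080/dir/page.html?q1=test&q2=a2#anchor1"
print(origin_without_credentials(url))  # https://www.example.com:8080
```

Keep in mind that hostname is returned in lower case, so the result may differ in case from the input.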

User · Answer

If the URL contains fewer than three slashes, you've already got just the protocol and host; otherwise, we can extract the part that sits between "://" and the next slash:

    import re

    link = 'http://forum.unisoftdev.com/something'

    slash_count = len(re.findall("/", link))
    print(slash_count)  # output: 3

    if slash_count > 2:
        regex = r'\:\/\/(.*?)\/'
        pattern = re.compile(regex)
        path = re.findall(pattern, link)

        print(path)

User · Answer

To get the domain/hostname and the Origin*:

    url = 'https://stackoverflow.com/questions/9626535/get-protocol-host-name-from-url'
    hostname = url.split('/')[2]  # stackoverflow.com
    origin = '/'.join(url.split('/')[:3])  # https://stackoverflow.com

*Origin is used in XMLHttpRequest headers.

User · Answer

In Python 2 (in Python 3, urljoin lives in urllib.parse):

    >>> import urlparse
    >>> url = 'http://stackoverflow.com/questions/1234567/blah-blah-blah-blah'
    >>> urlparse.urljoin(url, '/')
    'http://stackoverflow.com/'
