[python] How to validate a URL in Python? (Malformed or not)

I get a URL from the user and I have to reply with the fetched HTML.

How can I check whether the URL is malformed or not?

For example :

url = 'google'             # Malformed
url = 'google.com'         # Malformed
url = 'http://google.com'  # Valid
url = 'http://google'      # Malformed


Answers:


I landed on this page trying to figure out a sane way to validate strings as "valid" URLs. I'm sharing my Python 3 solution here; no extra libraries required.

See https://docs.python.org/2/library/urlparse.html if you are using Python 2.

See https://docs.python.org/3/library/urllib.parse.html if you are using Python 3, as I am.

import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

Output:

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')
ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.
'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
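
For example (my own quick check, reusing the URLs from the question):

print(is_valid('http://google.com'))  # True
print(is_valid('google.com'))         # False: urlparse sees no scheme
print(is_valid('http://google'))      # True: scheme and netloc are both present, so this check accepts it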

All of the above solutions treat a string like "http://www.google.com/path,www.yahoo.com/path" as valid. The following regex-based solution is much stricter about what it accepts:

import re

# URL-link validation (raw strings so the regex escapes survive on Python 3)
ip_middle_octet = r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    r"(?P<private_ip>"
    # IP address exclusion
    # private & local networks
    r"(?:localhost)|"
    r"(?:(?:10|127)" + ip_middle_octet + r"{2}" + ip_last_octet + r")|"
    r"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + r")|"
    r"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + r"))"
    r"|"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?P<public_ip>"
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"" + ip_middle_octet + r"{2}"
    r"" + ip_last_octet + r")"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query string
    r"(?:\?\S*)?"
    r"$",
    re.UNICODE | re.IGNORECASE
)


def url_validate(url):
    """URL string validation; returns a match object or None."""
    return URL_PATTERN.match(url)
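
A quick sanity check against the question's examples (my own addition, not part of the original answer):

print(bool(url_validate('http://google.com')))  # True
print(bool(url_validate('http://google')))      # False: the pattern requires a TLD
print(bool(url_validate('google.com')))         # False: the scheme is mandatory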

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).


Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)

If you set verify_exists to True, it will actually verify that the URL exists; otherwise it will just check whether it is formed correctly. (Note: verify_exists was deprecated in Django 1.4 and removed in Django 1.5, so on modern Django use plain URLValidator().)
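
On modern Django you can wrap the same idea in a small helper; a minimal sketch of my own (the name is_valid_url is hypothetical):

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

def is_valid_url(url):
    # True if Django's URLValidator accepts the string, False otherwise
    try:
        URLValidator()(url)
        return True
    except ValidationError:
        return False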

Edit: ah yeah, this question is a duplicate of: How can I check if a URL exists with Django's validators?


A True or False version, based on @DMfll's answer:

try:
    # Python 2
    from urlparse import urlparse
except ImportError:
    # Python 3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'

def uri_validator(x):
    try:
        result = urlparse(x)
        # note: requiring result.path means URLs without a path (e.g. 'http://google.com') fail
        return all([result.scheme, result.netloc, result.path])
    except (AttributeError, TypeError, ValueError):  # e.g. non-string input
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))

Gives:

True
False
False
False

Nowadays, I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

And this is how it looks:

from urllib.parse import urlparse

def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

Just use is_url("http://www.asdf.com").
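
For instance (my own quick check):

print(is_url('http://www.asdf.com'))  # True
print(is_url('asdf'))                 # False: neither scheme nor netloc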

Hope it helps!


Note: lepl is no longer supported, sorry (you're welcome to use it, and I think the code below works, but it's not going to get updates).

RFC 3696 (http://www.faqs.org/rfcs/rfc3696.html) defines how to do this (for HTTP URLs and email). I implemented its recommendations in Python using lepl (a parser library). See http://acooke.org/lepl/rfc3696.html

To use it:

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

EDIT

As pointed out by @Kwame, the code below treats a URL as valid even when the .com or .co etc. part is absent.

Also, as pointed out by @Blaise, a URL like https://www.google is syntactically valid; to check whether it actually resolves, you need to do a DNS lookup separately.

This is simple and works:

min_attr contains the basic set of components that must be present for a URL to be considered valid, i.e. the scheme part (http) and the netloc part (google.com).

result.scheme stores the scheme ('http', without the ://) and

result.netloc stores the network location, i.e. the domain name ('google.com').

from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        # valid only if every attribute in min_attr has a non-empty value
        return all(getattr(result, attr) for attr in min_attr)
    except (AttributeError, TypeError, ValueError):
        return False

all() returns True only if every element it is given is truthy. So the URL is reported valid exactly when both result.scheme and result.netloc have a value.
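
For example (my own quick check against the question's inputs):

print(url_check('http://google.com'))  # True
print(url_check('google.com'))         # False: urlparse finds no scheme
print(url_check('http://google'))      # True: scheme and netloc are both present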


Not directly relevant, but often it's required to identify whether some token CAN be a URL, not necessarily a 100% correctly formed one (i.e., the https part may be omitted and so on). I've read this post and did not find the solution, so I am posting my own here for the sake of completeness.

def get_domain_suffixes():
    import requests
    # build the set of known top-level suffixes from the public suffix list
    res = requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst = set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains = line.split('.')
            cand = domains[-1]
            if cand:
                lst.add('.' + cand)
    return tuple(sorted(lst))

domain_suffixes = get_domain_suffixes()

def reminds_url(txt: str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    """
    # look only at the part before the first slash
    ltext = txt.lower().split('/')[0]
    return ltext.startswith(('http', 'www', 'ftp')) or ltext.endswith(domain_suffixes)
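
For instance (my own checks; the exact results depend on the suffix list downloaded at runtime):

print(reminds_url('yandex.ru.com/somepath'))  # True: ends with a known suffix
print(reminds_url('www.example/page'))        # True: starts with 'www'
print(reminds_url('just a sentence'))         # False, assuming no suffix matches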

Validate URL with urllib and Django-like regex

The Django URL validation regex was actually pretty good but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url
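
For example (my own quick check against the question's inputs):

for candidate in ('http://google.com', 'http://google', 'google.com'):
    try:
        print(validate_url(candidate), 'is valid')
    except Exception as e:
        print(e)

# http://google.com is valid
# URL domain malformed (domain=google)
# No URL scheme specified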

Explanation

  • The code only validates the scheme and netloc part of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() in the two according parts which are then matched with the corresponding regex terms.)
  • The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc (see the snippet after this list), e.g.:

    https://www.google.com:80/search?q=python
    ^^^^^   ^^^^^^^^^^^^^^^^^
      |             |      
      |             +-- netloc (aka "domain" in my code)
      +-- scheme
    
  • IPv4 addresses are also validated
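
To see the scheme/netloc split concretely, here is a quick snippet of my own:

from urllib.parse import urlparse

parts = urlparse('https://www.google.com:80/search?q=python')
print(parts.scheme)  # https
print(parts.netloc)  # www.google.com:80 (port included)
print(parts.path)    # /search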

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

  • Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
  • Add and not is_valid_ipv6(domain) to the last if
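
If you'd rather not copy the regex, the standard-library ipaddress module can stand in for it; this is my own sketch, not part of the original answer:

import ipaddress

def is_valid_ipv6(ip):
    # True only for syntactically valid IPv6 addresses; note that a netloc
    # like '[::1]:80' needs its brackets and port stripped before this check
    try:
        return isinstance(ipaddress.ip_address(ip), ipaddress.IPv6Address)
    except ValueError:
        return False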

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:
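
For instance, a few checks of my own against the DOMAIN_FORMAT pattern defined above:

import re

for domain in ('www.google.com', 'www.google.com:80', 'localhost:8080', '-invalid-.com'):
    print(domain, '->', bool(re.fullmatch(DOMAIN_FORMAT, domain)))

# www.google.com -> True
# www.google.com:80 -> True
# localhost:8080 -> True
# -invalid-.com -> False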