It's an old question, and a lot of good things have happened since then. Here are my two cents on this topic:
To accurately track the visited pages, you have to normalize the URI first. The normalization algorithm includes multiple steps (a PHP sketch covering them follows this list):

- Sort the query parameters, so that the following two requests are treated as the same page:

      GET http://www.example.com/query?id=111&cat=222
      GET http://www.example.com/query?cat=222&id=111

- Convert the empty path.
  Example: http://example.org → http://example.org/
- Capitalize percent-encoding. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive.
  Example: http://example.org/a%c2%B1b → http://example.org/a%C2%B1b
- Remove unnecessary dot-segments.
  Example: http://example.org/../a/b/../c/./d.html → http://example.org/a/c/d.html
- Possibly some other normalization rules.
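Here is a minimal sketch of such a normalization routine in plain PHP, with no external libraries. The function name `normalizeUri()` is made up for illustration, and it deliberately ignores ports, fragments, and IDN handling:

```php
<?php
// A minimal URI normalization sketch: sorts query parameters, converts the
// empty path, uppercases percent-encoding, and removes dot-segments.
// normalizeUri() is a hypothetical helper, not part of any crawler library.

function normalizeUri(string $uri): string
{
    $parts  = parse_url($uri);
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower($parts['host'] ?? '');

    // Convert the empty path to "/".
    $path = $parts['path'] ?? '/';

    // Remove unnecessary dot-segments ("." and "..").
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($segments);
        } else {
            $segments[] = $segment;
        }
    }
    $path = '/' . implode('/', $segments);

    // Capitalize percent-encoding triplets, e.g. %c2 -> %C2.
    $path = preg_replace_callback('/%[0-9a-f]{2}/i', function ($m) {
        return strtoupper($m[0]);
    }, $path);

    // Sort the query parameters so their order does not matter.
    $query = '';
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    return $scheme . '://' . $host . $path . $query;
}

// Both forms normalize to http://www.example.com/query?cat=222&id=111
echo normalizeUri('http://www.example.com/query?id=111&cat=222'), PHP_EOL;
echo normalizeUri('http://www.example.com/query?cat=222&id=111'), PHP_EOL;
// Prints http://example.org/a/c/d.html
echo normalizeUri('http://example.org/../a/b/../c/./d.html'), PHP_EOL;
```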
Not only the `<a>` tag has an `href` attribute; the `<area>` tag has it too (https://html.com/tags/area/). If you don't want to miss anything, you have to scrape `<area>` tags as well.
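A short sketch of link extraction that picks up both tags, using PHP's built-in DOM extension. The helper `extractLinks()` is hypothetical, and resolving relative URLs against the base URL is left out for brevity:

```php
<?php
// Extract href values from both <a> and <area> elements.

function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed HTML

    $xpath = new DOMXPath($dom);
    $links = [];

    // Select every <a> and <area> element that carries an href attribute.
    foreach ($xpath->query('//a[@href] | //area[@href]') as $node) {
        $links[] = $node->getAttribute('href');
    }

    return array_unique($links);
}

$html = '<a href="/page1">link</a><map><area href="/page2" shape="rect"></map>';
print_r(extractLinks($html)); // ["/page1", "/page2"]
```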
Track crawling progress. If the website is small, it is not a problem, but it can be very frustrating to crawl half of a large site and then have the crawl fail. Consider using a database or a filesystem to store the progress.
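One way to do this is sketched below with SQLite via PDO; the `pages` table and its columns are illustrative, not taken from any particular library:

```php
<?php
// Persist crawl progress with SQLite so a crashed crawl can be resumed.

$db = new PDO('sqlite:' . __DIR__ . '/crawl-progress.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS pages (
    uri     TEXT PRIMARY KEY,
    visited INTEGER NOT NULL DEFAULT 0
)');

// Queue a (normalized) URI; duplicates are ignored thanks to the primary key.
function queueUri(PDO $db, string $uri): void
{
    $db->prepare('INSERT OR IGNORE INTO pages (uri) VALUES (?)')
       ->execute([$uri]);
}

// Mark a URI as visited once it has been downloaded and parsed.
function markVisited(PDO $db, string $uri): void
{
    $db->prepare('UPDATE pages SET visited = 1 WHERE uri = ?')
       ->execute([$uri]);
}

// Fetch the next unvisited URI, or null when the crawl is complete.
function nextUri(PDO $db): ?string
{
    $uri = $db->query('SELECT uri FROM pages WHERE visited = 0 LIMIT 1')
              ->fetchColumn();
    return $uri === false ? null : $uri;
}
```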
Be kind to the site owners. If you are ever going to use your crawler outside of your own website, you have to use delays. Without delays, the script is too fast and might significantly slow down some small sites; from a sysadmin's perspective, it looks like a DoS attack. A fixed delay between requests will do the trick.
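Something along these lines is enough; the one-second delay and the `$uris` queue are placeholder assumptions, and a real crawler should also respect robots.txt:

```php
<?php
// A polite request loop with a fixed delay between requests.

$delayMicroseconds = 1000000; // 1 second; tune this to the target site

$uris = ['http://example.org/']; // replace with the queue produced by your crawler

foreach ($uris as $uri) {
    $html = file_get_contents($uri); // or a cURL/Guzzle request in a real crawler
    // ... parse the page, extract links, record progress ...
    usleep($delayMicroseconds);      // be kind: wait before the next request
}
```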
If you don't want to deal with that, try Crawlzone and let me know your feedback. Also, check out the article I wrote a while back https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm