How to detect search engine bots with PHP

Question

How can one detect search engine bots using PHP?

User · Answer

Check `$_SERVER['HTTP_USER_AGENT']` for some of the strings listed here: http://www.useragentstring.com/pages/useragentstring.php

Or, more specifically for crawlers: http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to, say, log the number of visits of the most common search engine crawlers, you could use:

    $interestingCrawlers = array('google', 'yahoo');
    $pattern = '/(' . implode('|', $interestingCrawlers) . ')/';
    $matches = array();
    // The user agent is lowercased first, so no case-insensitive flag is needed.
    $numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches);
    if ($numMatches > 0) // Found a match
    {
        // $matches[1] contains the text that matched, either 'google' or 'yahoo'
    }
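The counting idea above could be wrapped in a small function. This is only a sketch: `matchCrawler()` and the in-memory `$counts` array are hypothetical names introduced for illustration, and a real log would need persistent storage (a file or database), which is not shown.

```php
<?php
// Hypothetical helper: returns the lowercase name of the first listed crawler
// found in the user agent, or null when none of them matches.
function matchCrawler(string $userAgent, array $interestingCrawlers): ?string
{
    // Escape each name so it is treated literally inside the pattern.
    $escaped = array_map(function ($name) {
        return preg_quote($name, '/');
    }, $interestingCrawlers);
    $pattern = '/(' . implode('|', $escaped) . ')/i';

    if (preg_match($pattern, $userAgent, $matches) === 1) {
        return strtolower($matches[1]);
    }
    return null;
}

// Usage: tally hits per crawler (in memory only, for illustration).
$counts = [];
$name = matchCrawler($_SERVER['HTTP_USER_AGENT'] ?? '', ['google', 'yahoo']);
if ($name !== null) {
    $counts[$name] = ($counts[$name] ?? 0) + 1;
}
```

Returning the matched name rather than a boolean keeps the per-crawler tally a one-liner at the call site.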

User · Answer

I'm using this code, and it works pretty well. It makes it very easy to see which user agents have visited your site. The code opens a file and appends the user agent to it. You can check this file each day by going to yourdomain.com/useragent.txt, learn about new user agents, and add them to the condition of your if clause.

    $user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
    if (!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)) {
        // the user agent does not match the known bots:
        // do what you need

        // Append the user agent to useragent.txt. Check this file each day to learn
        // about new user agents, then add them to the condition of the if clause above.
        if ($user_agent != "") {
            $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt");
            fwrite($myfile, $user_agent);
            $user_agent = "\n";
            fwrite($myfile, $user_agent);
            fclose($myfile);
        }
    }

This is the content of useragent.txt:

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
    mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
    mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
    mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
    mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
    mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
    mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
    mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
    mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
    mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
    mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
    mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
    mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
    zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
    mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
    mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
    sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
    mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
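The sample log above repeats the same user agents many times. If only new user agents are of interest, a variant could skip entries that are already in the file. This is a sketch under assumptions: `logNewUserAgent()` is a hypothetical name, and re-reading the whole file on each request is only reasonable while the log stays small.

```php
<?php
// Hypothetical helper: appends $userAgent to $logFile unless it is empty or
// already listed. Returns true when a new line was written.
function logNewUserAgent(string $userAgent, string $logFile): bool
{
    $userAgent = strtolower(trim($userAgent));
    if ($userAgent === '') {
        return false;
    }

    // Read the known entries; treat a missing or unreadable file as empty.
    $known = is_file($logFile)
        ? (file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [])
        : [];
    if (in_array($userAgent, $known, true)) {
        return false;
    }

    // LOCK_EX reduces the chance of interleaved writes from concurrent requests.
    return file_put_contents($logFile, $userAgent . "\n", FILE_APPEND | LOCK_EX) !== false;
}

// Usage: logNewUserAgent($_SERVER['HTTP_USER_AGENT'] ?? '', 'useragent.txt');
```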

User · Answer

I made one good and fast function for this:

    function is_bot() {
        if (isset($_SERVER['HTTP_USER_AGENT'])) {
            return preg_match('/rambler|abacho|acoi|accona|aspseek|altavista|estyle|scrubby|lycos|geona|ia_archiver|alexa|sogou|skype|facebook|twitter|pinterest|linkedin|naver|bing|google|yahoo|duckduckgo|yandex|baidu|teoma|xing|java\/1\.7\.0_45|bot|crawl|slurp|spider|mediapartners|\sask\s|\saol\s/i', $_SERVER['HTTP_USER_AGENT']);
        }
        return false;
    }

This covers 99% of all possible bots, search engines, etc.

User · Answer

If you really need to detect Google's bots, you should never rely on the "user agent" or the "IP" address alone, because the "user agent" can be changed. According to what Google says in "Verifying Googlebot", to verify Googlebot as the caller:

1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
2. Verify that the domain name is in either googlebot.com or google.com.
3. Run a forward DNS lookup on the domain name retrieved in step 1, using the host command on the retrieved domain name. Verify that it is the same as the original accessing IP address from your logs.

Here is my tested code:

    <?php
    $remote_add = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($remote_add);
    // The leading dot makes sure that e.g. "fakegoogle.com" does not pass the check.
    $googlebot = '.googlebot.com';
    $google = '.google.com';
    // Comparing the reversed strings at position 0 tests that $hostname ends with the domain.
    // Note: this covers steps 1 and 2 only; the forward lookup of step 3 is not performed here.
    if (stripos(strrev($hostname), strrev($googlebot)) === 0 or stripos(strrev($hostname), strrev($google)) === 0)
    {
        // add your code
    }
    ?>

In this code we check "hostname", which should contain "googlebot.com" or "google.com" at the end. It is really important to check for the exact domain at the end of "hostname", not merely for the name appearing somewhere inside it.

I hope you enjoy.
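The "ends with the exact domain" test from step 2 can also be written without reversing strings. A minimal sketch, assuming a hypothetical helper name `hostBelongsTo()`; it only checks the host name and does not perform any DNS lookup itself.

```php
<?php
// Hypothetical helper: true when $hostname is one of $domains or a subdomain of one.
function hostBelongsTo(string $hostname, array $domains): bool
{
    // Normalise case and drop a trailing dot (fully qualified form).
    $hostname = strtolower(rtrim($hostname, '.'));
    foreach ($domains as $domain) {
        $domain = strtolower($domain);
        $suffix = '.' . $domain;
        // Requiring the leading dot rejects look-alikes such as "fakegooglebot.com".
        if ($hostname === $domain || substr($hostname, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Usage: hostBelongsTo(gethostbyaddr($ip), ['googlebot.com', 'google.com'])
```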

User · Answer

You could analyse the user agent (`$_SERVER['HTTP_USER_AGENT']`) or compare the client's IP address (`$_SERVER['REMOTE_ADDR']`) with a list of IP addresses of search engine bots.
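Comparing the client IP with a list usually means matching it against address ranges rather than single addresses. The following is a sketch only: `ipv4InCidr()` and `ipInAnyRange()` are hypothetical names, the ranges in the usage line are documentation placeholders rather than real crawler ranges, and the code assumes IPv4 on a 64-bit PHP build.

```php
<?php
// Hypothetical helper: true when IPv4 address $ip lies inside $cidr, e.g. "192.0.2.0/24".
function ipv4InCidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (isset($parts[1]) && !ctype_digit($parts[1])) {
        return false; // malformed prefix length
    }
    $bits = isset($parts[1]) ? (int) $parts[1] : 32;
    $ipLong = ip2long($ip);
    $netLong = ip2long($parts[0]);
    if ($ipLong === false || $netLong === false || $bits > 32) {
        return false;
    }
    // Build the network mask; a /0 prefix matches every address.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($netLong & $mask);
}

// Hypothetical helper: true when $ip falls into at least one of the given ranges.
function ipInAnyRange(string $ip, array $ranges): bool
{
    foreach ($ranges as $cidr) {
        if (ipv4InCidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// Usage with placeholder ranges (replace with ranges published by the search engine):
// ipInAnyRange($_SERVER['REMOTE_ADDR'], ['192.0.2.0/24', '198.51.100.128/25']);
```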

User · Answer

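    <?php // IPCLOACK HOOK
    if (CLOAKING_LEVEL != 4) {
        $lastupdated = date("Ymd", filemtime(FILE_BOTS));
        if ($lastupdated != date("Ymd")) {
            $lists = array(
                'http://labs.getyacg.com/spiders/google.txt',
                'http://labs.getyacg.com/spiders/inktomi.txt',
                'http://labs.getyacg.com/spiders/lycos.txt',
                'http://labs.getyacg.com/spiders/msn.txt',
                'http://labs.getyacg.com/spiders/altavista.txt',
                'http://labs.getyacg.com/spiders/askjeeves.txt',
                'http://labs.getyacg.com/spiders/wisenut.txt',
            );
            $opt = '';
            foreach ($lists as $list) {
                // fetch() is not a PHP built-in; it is presumably a helper defined elsewhere in YACG.
                $opt .= fetch($list);
            }
            // Collapse blank lines in the downloaded lists.
            $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
            $fp = fopen(FILE_BOTS, "w");
            fwrite($fp, $opt);
            fclose($fp);
        }
        $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
        $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
        $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        $host = strtolower(gethostbyaddr($ip));
        $file = implode(" ", file(FILE_BOTS));
        $exp = explode(".", $ip);
        $class = $exp[0] . '.' . $exp[1] . '.' . $exp[2] . '.';
        $threshold = CLOAKING_LEVEL;
        $cloak = 0;
        // The original joined these checks with &&, which could only be true if the
        // host name contained all three names at once; || matches any one of them.
        if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
            $cloak++;
        }
        if (stristr($file, $class)) {
            $cloak++;
        }
        if (stristr($file, $agent)) {
            $cloak++;
        }
        if (strlen($ref) > 0) {
            $cloak = 0;
        }

        if ($cloak >= $threshold) {
            $cloakdirective = 1;
        } else {
            $cloakdirective = 0;
        }
    }
    ?>

That would be the ideal way to cloak for spiders. It's from an open source script called YACG (http://getyacg.com).

It needs a bit of work, but it's definitely the way to go.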

User · Answer

Here's a Search Engine Directory of Spider names.

Then you use `$_SERVER['HTTP_USER_AGENT']` to check if the agent is said spider:

    if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
    {
        // what to do
    }

User · Answer

You can check whether it's a search engine with this function:

    <?php
    function crawlerDetect($USER_AGENT)
    {
        $crawlers = array(
            'Google' => 'Google',
            'MSN' => 'msnbot',
            'Rambler' => 'Rambler',
            'Yahoo' => 'Yahoo',
            'AbachoBOT' => 'AbachoBOT',
            'accoona' => 'Accoona',
            'AcoiRobot' => 'AcoiRobot',
            'ASPSeek' => 'ASPSeek',
            'CrocCrawler' => 'CrocCrawler',
            'Dumbot' => 'Dumbot',
            'FAST-WebCrawler' => 'FAST-WebCrawler',
            'GeonaBot' => 'GeonaBot',
            'Gigabot' => 'Gigabot',
            'Lycos spider' => 'Lycos',
            'MSRBOT' => 'MSRBOT',
            'Altavista robot' => 'Scooter',
            'AltaVista robot' => 'Altavista',
            'ID-Search Bot' => 'IDBot',
            'eStyle Bot' => 'eStyle',
            'Scrubby robot' => 'Scrubby',
            'Facebook' => 'facebookexternalhit',
        );

        // The original searched for the whole user agent inside the imploded crawler
        // list, which practically never matches. Search for each crawler token inside
        // the user agent instead (case-insensitive).
        foreach ($crawlers as $needle) {
            if (stripos($USER_AGENT, $needle) !== false) {
                return true;
            }
        }
        return false;
    }
    ?>

Then you can use it like:

    <?php
    $USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
    if (crawlerDetect($USER_AGENT)) return "no need to lang redirection";
    ?>

User · Answer

I'm using this to detect bots:

    if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
        // is bot
    }

In addition, I use a whitelist to block unwanted bots:

    if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
        // allowed bot
    }

An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. And as no one solves this captcha, I know it does not produce false positives. So the bot detection seems to work perfectly.

Note: My whitelist is based on Facebook's robots.txt.
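The two checks above can be combined into one decision. The sketch below is an illustration, not the author's code: `classifyVisitor()` and the two constants are hypothetical names, and both patterns are shortened versions of the lists in the answer.

```php
<?php
// Shortened, illustrative versions of the two patterns from the answer.
const BOT_PATTERN = '/bot|crawl|curl|spider|slurp|libwww-perl|facebookexternalhit/i';
const ALLOWED_PATTERN = '/bingbot|facebookexternalhit|googlebot|msnbot|slurp|yandex/i';

// Hypothetical helper: returns 'human', 'allowed_bot' or 'unwanted_bot'.
function classifyVisitor(string $userAgent): string
{
    // Anything that does not look like a bot is treated as a regular visitor.
    if (!preg_match(BOT_PATTERN, $userAgent)) {
        return 'human';
    }
    // Bots on the whitelist pass; all others would be sent to the captcha.
    return preg_match(ALLOWED_PATTERN, $userAgent) ? 'allowed_bot' : 'unwanted_bot';
}

// Usage: classifyVisitor($_SERVER['HTTP_USER_AGENT'] ?? '')
```

Since both checks rely on the user agent only, a client that fakes a whitelisted name would still be classified as an allowed bot.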

User · Answer

I use the following code, which seems to be working fine:

    function _bot_detected() {
        return (
            isset($_SERVER['HTTP_USER_AGENT'])
            && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
        );
    }

Update 16-06-2017: added mediapartners (see https://support.google.com/webmasters/answer/1061943?hl=en).

User · Answer

Use the Device Detector open source library; it offers an isBot() function: https://github.com/piwik/device-detector

User · Answer

I use this function. Part of the regex comes from PrestaShop, but I added some more bots to it.

    public function isBot()
    {
        $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat\.ru\/Bot|m2e/i';
        $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? FALSE : $_SERVER['HTTP_USER_AGENT'];
        // A missing user agent is treated as a bot as well.
        $isBot = !$userAgent || preg_match($bot_regex, $userAgent);

        return $isBot;
    }

Anyway, take care: some bots use a browser-like user agent to fake their identity. (I get many Russian IPs that behave this way on my site.)

One distinctive feature of most bots is that they don't carry any cookie, so no session is attached to them. (I am not sure how, but this is for sure the best way to track them.)

User · Answer

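    function bot_detected() {
        // The closing parenthesis of the if condition was missing in the original.
        if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])) {
            return true;
        }
        else {
            return false;
        }
    }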

User · Answer

For Google I'm using this method:

    function is_google() {
        $ip   = $_SERVER['REMOTE_ADDR'];
        $host = gethostbyaddr( $ip );
        if ( strpos( $host, '.google.com' ) !== false || strpos( $host, '.googlebot.com' ) !== false ) {

            $forward_lookup = gethostbyname( $host );

            if ( $forward_lookup == $ip ) {
                return true;
            }

            return false;
        } else {
            return false;
        }
    }

    var_dump( is_google() );

Credits: https://support.google.com/webmasters/answer/80553

User · Answer

Because any client can set the user agent to whatever they want, looking for 'Googlebot', 'bingbot', etc. is only half the job.

The second part is verifying the client's IP. In the old days this required maintaining IP lists, and the lists you find online tend to be outdated. The top search engines officially support verification through DNS, as explained by Google (https://support.google.com/webmasters/answer/80553) and Bing (http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26).

First, perform a reverse DNS lookup of the client IP. For Google this yields a host name under googlebot.com; for Bing it's under search.msn.com. Then, because anyone could set such a reverse DNS record on their own IP, you need to verify with a forward DNS lookup on that host name. If the resulting IP is the same as the site visitor's, you can be sure it's a crawler from that search engine.

I've written a library in Java that performs these checks for you. Feel free to port it to PHP. It's on GitHub: https://github.com/optimaize/webcrawler-verifier
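The reverse-then-forward lookup described above could be sketched in PHP as follows. This is not a port of the Java library: `verifyCrawler()` and `CRAWLER_DOMAINS` are hypothetical names, the domain list only contains the host name suffixes mentioned in this thread, and the optional resolver parameters exist purely so the logic can be exercised without real DNS queries. `gethostbynamel()` resolves IPv4 addresses only, so IPv6 clients are not covered.

```php
<?php
// Host name suffixes taken from the answers in this thread (an assumption, not a complete list).
const CRAWLER_DOMAINS = [
    'google' => ['googlebot.com', 'google.com'],
    'bing'   => ['search.msn.com'],
];

// Hypothetical helper: returns the engine key ('google', 'bing') when $ip passes
// the reverse and forward DNS checks, or null otherwise.
function verifyCrawler(string $ip, ?callable $reverse = null, ?callable $forward = null): ?string
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Reverse lookup: gethostbyaddr() returns the unmodified IP on failure
    // and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return null;
    }
    $host = strtolower(rtrim($host, '.'));

    foreach (CRAWLER_DOMAINS as $engine => $domains) {
        foreach ($domains as $domain) {
            $suffix = '.' . $domain;
            if (substr($host, -strlen($suffix)) !== $suffix) {
                continue;
            }
            // Forward lookup: the host name must resolve back to the original IP.
            $ips = $forward($host);
            return (is_array($ips) && in_array($ip, $ips, true)) ? $engine : null;
        }
    }
    return null;
}

// Usage: $engine = verifyCrawler($_SERVER['REMOTE_ADDR']);
```

DNS lookups can be slow, so in practice the result would likely be cached per IP, and the check run only when the user agent already claims to be a crawler.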

User · Answer

100% working bot detector. It is working on my website successfully.

    function isBotDetected() {

        if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat\.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
        ) {
            return true; // 'Above given bots detected'
        }

        return false;

    } // End :: isBotDetected()
