[php] How do I make a simple crawler in PHP?

It's an old question. A lot of good things happened since then. Here are my two cents on this topic:

  1. To accurately track the visited pages you have to normalize URI first. The normalization algorithm includes multiple steps:

    • Sort query parameters. For example, the following URIs are equivalent after normalization: GET http://www.example.com/query?id=111&cat=222 GET http://www.example.com/query?cat=222&id=111
    • Convert the empty path. Example: http://example.org ? http://example.org/

    • Capitalize percent encoding. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive. Example: http://example.org/a%c2%B1b ? http://example.org/a%C2%B1b

    • Remove unnecessary dot-segments. Example: http://example.org/../a/b/../c/./d.html ? http://example.org/a/c/d.html

    • Possibly some other normalization rules

  2. Not only <a> tag has href attribute, <area> tag has it too https://html.com/tags/area/. If you don't want to miss anything, you have to scrape <area> tag too.

  3. Track crawling progress. If the website is small, it is not a problem. Contrarily it might be very frustrating if you crawl half of the site and it failed. Consider using a database or a filesystem to store the progress.

  4. Be kind to the site owners. If you are ever going to use your crawler outside of your website, you have to use delays. Without delays, the script is too fast and might significantly slow down some small sites. From sysadmins perspective, it looks like a DoS attack. A static delay between the requests will do the trick.

If you don't want to deal with that, try Crawlzone and let me know your feedback. Also, check out the article I wrote a while back https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm