It's an old question, and a lot of good things have happened since then. Here are my two cents on this topic:
To accurately track the visited pages, you have to normalize the URI first. The normalization algorithm includes multiple steps (a PHP sketch covering them follows this list):

- Sort the query parameters, so that the following two requests are treated as the same page:

      GET http://www.example.com/query?id=111&cat=222
      GET http://www.example.com/query?cat=222&id=111

- Convert the empty path.
  Example: http://example.org → http://example.org/
- Capitalize percent-encoding. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive.
  Example: http://example.org/a%c2%B1b → http://example.org/a%C2%B1b
- Remove unnecessary dot-segments.
  Example: http://example.org/../a/b/../c/./d.html → http://example.org/a/c/d.html
- Possibly some other normalization rules.
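Here is a minimal sketch of such a normalization routine in plain PHP, with no external libraries. The function name `normalizeUri()` is made up for illustration, and it deliberately ignores ports, fragments, and IDN handling:

```php
<?php
// A minimal URI normalization sketch: sorts query parameters, converts the
// empty path, uppercases percent-encoding, and removes dot-segments.
// normalizeUri() is a hypothetical helper, not part of any crawler library.

function normalizeUri(string $uri): string
{
    $parts  = parse_url($uri);
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower($parts['host'] ?? '');

    // Convert the empty path to "/".
    $path = $parts['path'] ?? '/';

    // Remove unnecessary dot-segments ("." and "..").
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($segments);
        } else {
            $segments[] = $segment;
        }
    }
    $path = '/' . implode('/', $segments);

    // Capitalize percent-encoding triplets, e.g. %c2 -> %C2.
    $path = preg_replace_callback('/%[0-9a-f]{2}/i', function ($m) {
        return strtoupper($m[0]);
    }, $path);

    // Sort the query parameters so their order does not matter.
    $query = '';
    if (isset($parts['query'])) {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    return $scheme . '://' . $host . $path . $query;
}

// Both forms normalize to http://www.example.com/query?cat=222&id=111
echo normalizeUri('http://www.example.com/query?id=111&cat=222'), PHP_EOL;
echo normalizeUri('http://www.example.com/query?cat=222&id=111'), PHP_EOL;
// Prints http://example.org/a/c/d.html
echo normalizeUri('http://example.org/../a/b/../c/./d.html'), PHP_EOL;
```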
Not only the `<a>` tag has an `href` attribute; the `<area>` tag has it too (https://html.com/tags/area/). If you don't want to miss anything, you have to scrape `<area>` tags as well.
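A short sketch of link extraction that picks up both tags, using PHP's built-in DOM extension. The helper `extractLinks()` is hypothetical, and resolving relative URLs against the base URL is left out for brevity:

```php
<?php
// Extract href values from both <a> and <area> elements.

function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed HTML

    $xpath = new DOMXPath($dom);
    $links = [];

    // Select every <a> and <area> element that carries an href attribute.
    foreach ($xpath->query('//a[@href] | //area[@href]') as $node) {
        $links[] = $node->getAttribute('href');
    }

    return array_unique($links);
}

$html = '<a href="/page1">link</a><map><area href="/page2" shape="rect"></map>';
print_r(extractLinks($html)); // ["/page1", "/page2"]
```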
Track crawling progress. If the website is small, it is not a problem, but it can be very frustrating to crawl half of a large site and then have the crawl fail. Consider using a database or a filesystem to store the progress.
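One way to do this is sketched below with SQLite via PDO; the `pages` table and its columns are illustrative, not taken from any particular library:

```php
<?php
// Persist crawl progress with SQLite so a crashed crawl can be resumed.

$db = new PDO('sqlite:' . __DIR__ . '/crawl-progress.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS pages (
    uri     TEXT PRIMARY KEY,
    visited INTEGER NOT NULL DEFAULT 0
)');

// Queue a (normalized) URI; duplicates are ignored thanks to the primary key.
function queueUri(PDO $db, string $uri): void
{
    $db->prepare('INSERT OR IGNORE INTO pages (uri) VALUES (?)')
       ->execute([$uri]);
}

// Mark a URI as visited once it has been downloaded and parsed.
function markVisited(PDO $db, string $uri): void
{
    $db->prepare('UPDATE pages SET visited = 1 WHERE uri = ?')
       ->execute([$uri]);
}

// Fetch the next unvisited URI, or null when the crawl is complete.
function nextUri(PDO $db): ?string
{
    $uri = $db->query('SELECT uri FROM pages WHERE visited = 0 LIMIT 1')
              ->fetchColumn();
    return $uri === false ? null : $uri;
}
```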
Be kind to the site owners. If you are ever going to use your crawler outside of your own website, you have to use delays. Without delays, the script is too fast and might significantly slow down some small sites; from a sysadmin's perspective, it looks like a DoS attack. A fixed delay between requests will do the trick.
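Something along these lines is enough; the one-second delay and the `$uris` queue are placeholder assumptions, and a real crawler should also respect robots.txt:

```php
<?php
// A polite request loop with a fixed delay between requests.

$delayMicroseconds = 1000000; // 1 second; tune this to the target site

$uris = ['http://example.org/']; // replace with the queue produced by your crawler

foreach ($uris as $uri) {
    $html = file_get_contents($uri); // or a cURL/Guzzle request in a real crawler
    // ... parse the page, extract links, record progress ...
    usleep($delayMicroseconds);      // be kind: wait before the next request
}
```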
If you don't want to deal with that, try Crawlzone and let me know your feedback. Also, check out the article I wrote a while back https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm