[html] Options for HTML scraping?

I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.


I made a very nice library, Internet Tools, for web scraping.

The idea is to match a template against the web page, which extracts all the data from the page and also validates that the page structure is unchanged.

So you can just take the HTML of the web page you want to process, remove all dynamic or irrelevant content, and annotate the interesting parts.

E.g. the HTML for a new question on the stackoverflow.com index page is:

<div id="question-summary-11326954" class="question-summary narrow">

    <!-- skipped, this is getting too long -->

    <div class="summary">

        <h3><a title="Some times my tree list have vertical scroll ,then I scrolled very fast and the tree list shivered .Have any solution for this.
" class="question-hyperlink" href="/questions/11326954/about-scroll-bar-issue-in-tree">About Scroll bar issue in Tree</a></h3>

    <!-- skipped -->

    </div>
</div>

So you just remove that particular id, title, and summary text to create a template that reads all new questions into title, summary, and link arrays:

 <t:loop>
   <div class="question-summary narrow">
     <div class="summary">
       <h3>
          <a class="question-hyperlink">
            {title:=text(), summary:=@title, link:=@href}
          </a>
       </h3>
     </div>
   </div>
 </t:loop>

And of course it also supports the basic techniques: CSS 3 selectors, XPath 2, and XQuery 1 expressions.

The only problem is that I was foolish enough to make it a Free Pascal library. But there is also a language-independent web demo.


I've been using Feedity - http://feedity.com for some of the scraping work (and conversion into RSS feeds) at my library. It works well for most webpages.


For more complex scraping applications, I would recommend the IRobotSoft web scraper. It is free software dedicated to screen scraping. It has a strong query language for HTML pages, and it provides a very simple web recording interface that will free you from much programming effort.


I have used LWP and HTML::TreeBuilder with Perl and have found them very useful.

LWP (short for libwww-perl) lets you connect to websites and scrape the HTML; you can get the module from CPAN, and the O'Reilly book about it seems to be available online.

TreeBuilder allows you to construct a tree from the HTML, and documentation and source are available in HTML::TreeBuilder - Parser that builds a HTML syntax tree.

There might still be too much heavy lifting to do with this approach, though. I have not looked at the Mechanize module suggested by another answer, so I may well do that.


For those that would prefer a graphical workflow tool, RapidMiner (FOSS) has a nice web crawling and scraping facility.

Here's a series of videos:

http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html


Implementations of the HTML5 parsing algorithm: html5lib (Python, Ruby), Validator.nu HTML Parser (Java, JavaScript; C++ in development), Hubbub (C), Twintsam (C#; upcoming).


I found HTMLSQL to be a ridiculously simple way to screenscrape. It takes literally minutes to get results with it.

The queries are super-intuitive - like:

SELECT title from img WHERE $class == 'userpic'

There are now some other alternatives that take the same approach.


Scraping Stack Overflow is especially easy with Shoes and Hpricot.

require 'hpricot'

Shoes.app :title => "Ask Stack Overflow", :width => 370 do
  SO_URL = "http://stackoverflow.com"
  stack do
    stack do
      caption "What is your question?"
      flow do
        @lookup = edit_line "stackoverflow", :width => "-115px"
        button "Ask", :width => "90px" do
          # Fetch the search results page and parse it with Hpricot
          download SO_URL + "/search?s=" + @lookup.text do |s|
            doc = Hpricot(s.response.body)
            @rez.clear()
            # Keep only links that point at question pages
            (doc/:a).each do |l|
              href = l["href"]
              if href.to_s =~ /\/questions\/[0-9]+/ then
                @rez.append do
                  para(link(l.inner_text) { visit(SO_URL + href) })
                end
              end
            end
            @rez.show()
          end
        end
      end
    end
    stack :margin => 25 do
      background white, :radius => 20
      @rez = stack do
      end
    end
    @rez.hide()
  end
end

You would be a fool not to use Perl.. Here come the flames..

Bone up on the following modules and ginsu any scrape around.

use LWP
use HTML::TableExtract
use HTML::TreeBuilder
use HTML::Form
use Data::Dumper

I would first find out whether the site(s) in question provide an API or RSS feeds for accessing the data you require.


I know and love Screen-Scraper.

Screen-Scraper is a tool for extracting data from websites. Screen-Scraper automates:

* Clicking links on websites
* Entering data into forms and submitting
* Iterating through search result pages
* Downloading files (PDF, MS Word, images, etc.)

Common uses:

* Download all products, records from a website
* Build a shopping comparison site
* Perform market research
* Integrate or migrate data

Technical:

* Graphical interface--easy automation
* Cross platform (Linux, Mac, Windows, etc.)
* Integrates with most programming languages (Java, PHP, .NET, ASP, Ruby, etc.)
* Runs on workstations or servers

Three editions of screen-scraper:

* Enterprise: The most feature-rich edition of screen-scraper. All capabilities are enabled.
* Professional: Designed to be capable of handling most common scraping projects.
* Basic: Works great for simple projects, but not nearly as many features as its two older brothers.

I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or 'script-like' in nature, but jQuery selectors + real JavaScript/DOM is a lifesaver on more complicated (or malformed) pages.


The recent talk by Dav Glass, Welcome to the Jungle! (YUIConf 2011 opening keynote), shows how you can use YUI 3 on Node.js to do client-side-like programming (with DOM selectors instead of string processing) on the server. It is very impressive.


I've had mixed results in .NET using SgmlReader which was originally started by Chris Lovett and appears to have been updated by MindTouch.


Scrubyt uses Ruby and Hpricot to do nice and easy web scraping. I wrote a scraper for my university's library service using this in about 30 minutes.


I've used Beautiful Soup a lot with Python. It is much better than regular expression checking, because it works like using the DOM, even if the HTML is poorly formatted. You can quickly find HTML tags and text with simpler syntax than regular expressions. Once you find an element, you can iterate over it and its children, which is more useful for understanding the contents in code than it is with regular expressions. I wish Beautiful Soup existed years ago when I had to do a lot of screenscraping -- it would have saved me a lot of time and headache since HTML structure was so poor before people started validating it.
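
For example, here's a minimal sketch of that style of navigation using the current bs4 package (the markup and class names below are just made up for illustration):

from bs4 import BeautifulSoup

html = """
<div class="question-summary">
  <h3><a class="question-hyperlink" href="/questions/2861">Options for HTML scraping?</a></h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find tags by name and attributes instead of writing regular expressions
for link in soup.find_all("a", class_="question-hyperlink"):
    print(link.get_text(), link["href"])

# Once you have an element, you can walk its children (and parents, siblings, ...)
summary = soup.find("div", class_="question-summary")
for child in summary.find_all(True, recursive=False):
    print(child.name)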


Well, if you want it done from the client side using only a browser, you have jcrawl.com. After you have designed your scraping service in the web application (http://www.jcrawl.com/app.html), you only need to add the generated script to an HTML page to start using/presenting your data.

All the scraping logic happens in the browser via JavaScript. I hope you find it useful. Click this link for a live example that extracts the latest news from Yahoo tennis.


Another tool for .NET is MhtBuilder


'Simple HTML DOM Parser' is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, then you will find yourself at home.

Find it here

There is also a blog post about it here.


Although it was designed for .NET web testing, I've been using the WatiN framework for this purpose. Since it is DOM-based, it is pretty easy to capture HTML, text, or images. Recently, I used it to dump a list of links from a MediaWiki All Pages namespace query into an Excel spreadsheet. The following VB.NET code fragment is pretty crude, but it works.


Sub GetLinks(ByVal PagesIE As IE, ByVal MyWorkSheet As Excel.Worksheet)

    ' XLRowCounterInt is a running row counter, assumed to be declared at module level
    Dim PagesLink As Link
    ' Copy each link's text and URL into the next spreadsheet row
    For Each PagesLink In PagesIE.TableBodies(2).Links
        With MyWorkSheet
            .Cells(XLRowCounterInt, 1) = PagesLink.Text
            .Cells(XLRowCounterInt, 2) = PagesLink.Url
        End With
        XLRowCounterInt = XLRowCounterInt + 1
    Next
End Sub

I do a lot of advanced web scraping, so I wanted to have total control over my stack and understand the limitations. This webscraping library is the result.


I use Hpricot on Ruby. As an example, this is a snippet of code that I use to retrieve all book titles from the six pages of my HireThings account (as they don't seem to provide a single page with this information):

pagerange = 1..6
proxy = Net::HTTP::Proxy(proxy, port, user, pwd)
proxy.start('www.hirethings.co.nz') do |http|
  pagerange.each do |page|
    resp, data = http.get "/perth_dotnet?page=#{page}" 
    if resp.class == Net::HTTPOK
      (Hpricot(data)/"h3 a").each { |a| puts a.innerText }
    end
  end
end 

It's pretty much complete. All that comes before this are library imports and the settings for my proxy.


Regular expressions work pretty well for HTML scraping as well ;-) Though after looking at Beautiful Soup, I can see why this would be a valuable tool.


For Perl, there's WWW::Mechanize.


SharpQuery

It's basically jQuery for C#. It depends on HTML Agility Pack for parsing the HTML.


The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: You feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML specific, so it would be good for scraping any other plaintext content as well. I've used it also for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
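
As a rough sketch of the idea (the method names below are how I remember the templatemaker API, so treat them as assumptions rather than a reference):

from templatemaker import Template

t = Template()
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

# The template learned so far, with '!' marking the variable "holes"
print(t.as_text('!'))                        # '<b>! and !</b>'

# Extract the variable data from a new page with the same structure
print(t.extract('<b>larry and curly</b>'))   # ('larry', 'curly')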


In the .NET world, I recommend the HTML Agility Pack. Not nearly as simple as some of the options above (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.

http://www.codeplex.com/htmlagilitypack


In Java, you can use TagSoup.


There is this solution too: netty HttpClient


BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options and is a lot more Pythonic. If you want to try Ruby, they ported BeautifulSoup, calling it RubyfulSoup, but it hasn't been updated in a while.

Other useful tools are HTMLParser and sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter/exit a tag and encounter HTML text. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files and creating a DOM tree would be slow and expensive.
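
As a minimal sketch of that callback style, here is a tiny link collector using the standard-library parser (html.parser is where HTMLParser lives in Python 3; sgmllib only exists in Python 2):

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag as the parser streams through the document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed('<p><a href="/questions/2861">Options for HTML scraping?</a></p>')
print(parser.links)   # ['/questions/2861']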

Regular expressions aren't really necessary. BeautifulSoup accepts regular expressions, so if you need their power you can use them there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser in Python, let me know.


The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and its pretty-printing of the in-memory XML structure. It also supports parsing broken HTML. And I don't think you can find other Python libraries/bindings that parse XML faster than lxml.
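
A small sketch of what that looks like with lxml.html (the input snippet is deliberately sloppy to show the repair):

from lxml import etree, html

# Parse deliberately broken HTML; lxml fills in the missing closing tags
tree = html.fromstring("<ul><li>one<li>two</ul>")

# XPath support
print(tree.xpath("//li/text()"))                       # ['one', 'two']

# Pretty-print the repaired in-memory structure
print(etree.tostring(tree, pretty_print=True).decode())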


You probably have this much already, but I think this is what you are trying to do:

from __future__ import with_statement
import re, os

# Fetch the profile page (the cookie value here is a placeholder)
os.system('wget --no-cookies --header "Cookie: soba=(SeCreTCODe)" http://stackoverflow.com/users/30/myProfile.html')

# Read the downloaded page into one string
profile = ""
with open("myProfile.html") as f:
    for line in f:
        profile += line

# The reputation count is found here
p = re.compile(r'summarycount">(\d+)</div>')
m = p.search(profile)
print m.group(1)

# Speak the result and clean up
os.system("espeak \"Rep is at " + m.group(1) + " points\"")
os.remove("myProfile.html")

I like Google Spreadsheets' ImportXML(URL, XPath) function.

It will repeat cells down the column if your XPath expression returns more than one value.

You can have up to 50 importxml() functions on one spreadsheet.

RapidMiner's Web Plugin is also pretty easy to use. It can do POSTs, accepts cookies, and can set the user agent.


Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:

  • mechanize: similar to Perl's WWW::Mechanize. Gives you a browser-like object to interact with web pages.
  • lxml: Python binding to libxml2. Supports various options to traverse and select elements (e.g. XPath and CSS selection).
  • scrapemark: a high-level library using templates to extract information from HTML.
  • pyquery: allows you to make jQuery-like queries on XML documents (see the small sketch after this list).
  • scrapy: a high-level scraping and web crawling framework. It can be used to write spiders, for data mining, and for monitoring and automated testing.
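
As a small taste of the pyquery style mentioned above, here is a sketch (the markup and class name are just illustrative):

from pyquery import PyQuery as pq

doc = pq('<div><h3><a class="question-hyperlink" href="/questions/2861">Options for HTML scraping?</a></h3></div>')

# jQuery-like selection on the parsed document
print(doc('a.question-hyperlink').text())
for a in doc('a.question-hyperlink'):
    print(a.get('href'), a.text)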

Another option for Perl would be Web::Scraper, which is based on Ruby's Scrapi. In a nutshell, with nice and concise syntax, you can get a robust scraper directly into data structures.



I've had some success with HtmlUnit, in Java. It's a simple framework for writing unit tests on web UIs, but it's equally useful for HTML scraping.


Why has no one mentioned JSOUP yet for Java? http://jsoup.org/


When it comes to extracting data from an HTML document on the server-side, Node.js is a fantastic option. I have used it successfully with two modules called request and cheerio.

You can see an example of how it works here.

