Scraping html tables into R data frames using the XML package

Question

How do I scrape HTML tables using the XML package? Take, for example, this Wikipedia page on the Brazilian soccer team. I would like to read it in R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?

User · Accepted Answer

Or, a shorter approach:

    library(XML)
    library(RCurl)
    library(rlist)

    # Note: theurl ends up holding the downloaded page source, not the URL itself
    theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",
                     .opts = list(ssl.verifypeer = FALSE))
    tables <- readHTMLTable(theurl)
    # Drop NULL entries (tables that could not be converted)
    tables <- list.clean(tables, fun = is.null, recursive = FALSE)
    n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The table picked is the longest one on the page:

    tables[[which.max(n.rows)]]
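To try the "pick the longest table" idea without a network connection, here is a minimal sketch that parses an inline HTML string instead of the Wikipedia page. The two tables and their values are made up for illustration, and `n_rows` and `longest` are just names chosen here.

```r
library(XML)

# Two small inline tables stand in for a downloaded page (made-up data).
html <- "
<html><body>
  <table>
    <tr><th>Team</th><th>Played</th></tr>
    <tr><td>A</td><td>3</td></tr>
  </table>
  <table>
    <tr><th>Team</th><th>Played</th><th>Won</th></tr>
    <tr><td>A</td><td>3</td><td>2</td></tr>
    <tr><td>B</td><td>5</td><td>1</td></tr>
  </table>
</body></html>"

# Parse the string as HTML and convert every <table> to a data frame.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Count rows per table and keep the table with the most rows.
n_rows <- vapply(tables, nrow, integer(1))
longest <- tables[[which.max(n_rows)]]
```

Picking by row count is a heuristic: it works when the table you want is also the largest one on the page, which may not hold for other pages.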

User · Answer

rvest, along with xml2, is another popular package for parsing HTML web pages.

    library(rvest)

    theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
    file <- read_html(theurl)
    tables <- html_nodes(file, "table")
    table1 <- html_table(tables[4], fill = TRUE)

The syntax is easier to use than that of the XML package, and for most web pages the package provides all of the options one needs.
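Selecting a table by position (the fourth table here) can break when the page layout changes. As a hedged alternative, the sketch below selects by CSS class instead, using an inline HTML fragment so it runs offline. It assumes rvest 1.0.0 or later (for `minimal_html()` and `html_element()`); the class name `wikitable sortable` comes from the other answers on this page, and the two rows are taken from the sample output shown there.

```r
library(rvest)

# Inline fragment standing in for the downloaded page.
page <- minimal_html('
  <table class="wikitable sortable">
    <tr><th>Opponent</th><th>Played</th></tr>
    <tr><td>Argentina</td><td>94</td></tr>
    <tr><td>Paraguay</td><td>72</td></tr>
  </table>')

# Select the first table carrying both classes, then convert it.
tbl_node <- html_element(page, "table.wikitable.sortable")
df <- html_table(tbl_node)
```

On the live page several tables may share these classes, so you may still need `html_elements()` plus a further filter to find the one you want.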

User · Answer

    library(RCurl)
    library(XML)

    # Download page using RCurl
    # You may need to set proxy details, etc., in the call to getURL
    theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
    webpage <- getURL(theurl)
    # Process escape characters
    webpage <- readLines(tc <- textConnection(webpage)); close(tc)

    # Parse the html tree, ignoring errors on the page
    pagetree <- htmlTreeParse(webpage, error = function(...) {})

    # Navigate your way through the tree. It may be possible to do this
    # more efficiently using getNodeSet
    body <- pagetree$children$html$children$body
    divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
    tables <- divbodyContent$children[names(divbodyContent) == "table"]

    # In this case, the required table is the only one with class "wikitable sortable"
    tableclasses <- sapply(tables, function(x) x$attributes["class"])
    thetable <- tables[which(tableclasses == "wikitable sortable")]$table

    # Get column headers
    headers <- thetable$children[[1]]$children
    columnnames <- unname(sapply(headers, function(x) x$children$text$value))

    # Get rows from table
    content <- c()
    for (i in 2:length(thetable$children)) {
      tablerow <- thetable$children[[i]]$children
      opponent <- tablerow[[1]]$children[[2]]$children$text$value
      others <- unname(sapply(tablerow[-1], function(x) x$children$text$value))
      content <- rbind(content, c(opponent, others))
    }

    # Convert to data frame
    colnames(content) <- columnnames
    as.data.frame(content)

Edited to add: sample output

                         Opponent Played Won Drawn Lost Goals for Goals against  % Won
        1               Argentina     94  36    24   34       148           150  38.3%
        2                Paraguay     72  44    17   11       160            61  61.1%
        3                 Uruguay     72  33    19   20       127            93  45.8%

User · Answer

Another option using XPath:

    library(RCurl)
    library(XML)

    theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
    webpage <- getURL(theurl)
    webpage <- readLines(tc <- textConnection(webpage)); close(tc)

    pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)

    # Extract table header and contents
    tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
    results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

    # Convert character vector to data frame
    content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

    # Clean up the results.
    # The exact gsub pattern did not survive in this copy; stripping
    # non-breaking spaces ("\u00a0") is an assumption.
    content[, 1] <- gsub("\u00a0", "", content[, 1])
    tablehead <- gsub("\u00a0", "", tablehead)
    names(content) <- tablehead

Produces this result:

    > head(content)
       Opponent Played Won Drawn Lost Goals for Goals against % Won
    1 Argentina     94  36    24   34       148           150 38.3%
    2  Paraguay     72  44    17   11       160            61 61.1%
    3   Uruguay     72  33    19   20       127            93 45.8%
    4     Chile     64  45    12    7       147            53 70.3%
    5      Peru     39  27     9    3        83            27 69.2%
    6    Mexico     36  21     6    9        69            34 58.3%
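The XPath approach hard-codes `ncol = 8`, which breaks if the table gains or loses a column. A small variation, sketched here on an inline HTML string so it runs offline, takes the column count from the number of header cells instead. The three-column table is a cut-down stand-in using values from the output above, and `xp` is just a helper name used in this sketch.

```r
library(XML)

# Cut-down stand-in for the Wikipedia table (three columns, two rows).
html <- '<html><body>
<table class="wikitable sortable">
  <tr><th>Opponent</th><th>Played</th><th>Won</th></tr>
  <tr><td>Argentina</td><td>94</td><td>36</td></tr>
  <tr><td>Paraguay</td><td>72</td><td>44</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# "//th" and "//td" below the table match cells whether or not
# the rows are wrapped in <tbody>.
xp <- "//table[@class='wikitable sortable']"
tablehead <- xpathSApply(doc, paste0(xp, "//th"), xmlValue)
results <- xpathSApply(doc, paste0(xp, "//td"), xmlValue)

# Derive the number of columns from the header instead of hard-coding it.
content <- as.data.frame(
  matrix(results, ncol = length(tablehead), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(content) <- tablehead
```

This still assumes a simple rectangular table: cells with `rowspan` or `colspan`, or header cells inside body rows, would throw the reshaping off.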
