How can I convert an HTML table to CSV?

Question

How do I convert the contents of an HTML table (<table>) to CSV format? Is there a library or Linux program that does this? This is similar to copying tables in Internet Explorer and pasting them into Excel.

User · Answer

Just to add to these answers (as I've recently been attempting a similar thing): if Google Spreadsheets is your spreadsheet program of choice, simply do these two things:

1. Strip everything out of your HTML file around the table opening/closing tags and resave it as another HTML file.
2. Import that HTML file directly into Google Spreadsheets and you'll have your information beautifully imported. (Top tip: if you used inline styles in your table, they will be imported as well.)

Saved me loads of time and figuring out different conversions.
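
If you'd rather not do step 1 by hand, here is a minimal Python sketch of it. The helper name extract_tables is made up for illustration, and it assumes the page has no nested tables (the non-greedy regex would cut a nested table short):

```python
import re

# Hypothetical helper: keep only the <table>...</table> blocks of a page.
# Assumes tables are not nested; a real HTML parser is safer for messy pages.
TABLE_RE = re.compile(r"<table\b.*?</table\s*>", re.IGNORECASE | re.DOTALL)

def extract_tables(html):
    """Return every table block found in html, one per line."""
    return "\n".join(TABLE_RE.findall(html))

page = "<html><body><p>intro</p><table><tr><td>1</td></tr></table><p>outro</p></body></html>"
print(extract_tables(page))
```

Save the returned text as a new .html file and import that file as described in step 2.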

User · Answer

Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan or rowspan. Enjoy.

    # Python 2 module name; on Python 3 use: from html.parser import HTMLParser
    from HTMLParser import HTMLParser
    import sys
    import re


    class HTMLTableParser(HTMLParser):
        def __init__(self, row_delim="\n", cell_delim="\t"):
            HTMLParser.__init__(self)
            self.despace_re = re.compile(r'\s+')
            self.data_interrupt = False
            self.first_row = True
            self.first_cell = True
            self.in_cell = False
            self.row_delim = row_delim
            self.cell_delim = cell_delim

        def handle_starttag(self, tag, attrs):
            self.data_interrupt = True
            if tag == "table":
                self.first_row = True
                self.first_cell = True
            elif tag == "tr":
                if not self.first_row:
                    sys.stdout.write(self.row_delim)
                self.first_row = False
                self.first_cell = True
                self.data_interrupt = False
            elif tag == "td" or tag == "th":
                if not self.first_cell:
                    sys.stdout.write(self.cell_delim)
                self.first_cell = False
                self.data_interrupt = False
                self.in_cell = True

        def handle_endtag(self, tag):
            self.data_interrupt = True
            if tag == "td" or tag == "th":
                self.in_cell = False

        def handle_data(self, data):
            if self.in_cell:
                if self.data_interrupt:
                    sys.stdout.write(" ")
                sys.stdout.write(self.despace_re.sub(' ', data).strip())
                self.data_interrupt = False


    parser = HTMLTableParser()
    parser.feed(sys.stdin.read())

User · Answer

Here are a few options:

- http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn
- http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
- How can I scrape an HTML table to CSV?
- https://addons.mozilla.org/en-US/firefox/addon/1852

User · Answer

Here's a Ruby script that uses nokogiri (http://nokogiri.rubyforge.org/nokogiri/):

    require 'nokogiri'

    doc = Nokogiri::HTML(table_string)

    doc.xpath('//table//tr').each do |row|
      row.xpath('td').each do |cell|
        print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
      end
      print "\n"
    end

Worked for my basic test case.

User · Answer

OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then Paste Special in OpenOffice.org. It will query you for the file type, one of which should be HTML. Select that and voilà!

User · Answer

This is based on atomicules' answer but more succinct, and it also processes th (header) cells as well as td cells. I also added the strip method to get rid of the extra whitespace.

    CSV.open("output.csv", 'w') do |csv|
      doc.xpath('//table//tr').each do |row|
        csv << row.xpath('th|td').map { |cell| cell.text.strip }
      end
    end

Wrapping the code inside the CSV block ensures that the file will be closed properly.

If you just want the text and don't need to write it to a file, you can use this:

    doc.xpath('//table//tr').inject('') do |result, row|
      result << row.xpath('th|td').map { |cell| cell.text.strip }.to_csv
    end

User · Answer

Here is an example using pQuery and Spreadsheet::WriteExcel:

    use strict;
    use warnings;

    use Spreadsheet::WriteExcel;
    use pQuery;

    my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
    my $sheet    = $workbook->add_worksheet;
    my $row = 0;

    pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
        my $col = 0;
        pQuery( $_ )->find( 'td' )->each( sub{
            $sheet->write( $row, $col++, $_->innerHTML );
        });
        $row++;
    });

    $workbook->close;

The example simply extracts all tr tags that it finds into an Excel file. You can easily tailor it to pick up a specific table, or even trigger a new Excel file per table tag.

Further things to consider:

- You may want to pick up th tags to create Excel header(s).
- You may have issues with rowspan & colspan.

To see if rowspan or colspan is being used you can:

    pQuery( $data )->find( 'td' )->each( sub{
        my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
    });
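
For comparison, here is a rough Python sketch (swapped in for the Perl above, standard library only) of one way to act on a colspan value once you have read it: pad the row with empty cells so the columns stay aligned. The class name SpanAwareTableParser is invented for this example, rowspan is not handled, and it assumes every cell has an explicit closing tag:

```python
from html.parser import HTMLParser

class SpanAwareTableParser(HTMLParser):
    """Collect rows of cell text, padding cells that carry a colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text chunks of the current cell, or None
        self._span = 1      # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            value = dict(attrs).get("colspan") or "1"
            self._span = int(value) if value.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace, then pad for the spanned columns.
            self._row.append(" ".join("".join(self._cell).split()))
            self._row.extend([""] * (self._span - 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

html = ("<table><tr><th colspan='2'>Name</th><th>Age</th></tr>"
        "<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr></table>")
parser = SpanAwareTableParser()
parser.feed(html)
print(parser.rows)
```

Whether a spanned cell should be padded with blanks or have its text repeated depends on what the spreadsheet is for; blanks are just the simpler choice here.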

User · Answer

Sorry for resurrecting an ancient thread, but I recently wanted to do this, and I wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.

The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...

    curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n\r' | sed 's/<\/TR[^>]*>/\n/Ig' | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

As you can see, I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.

Here's the explanation.

Get the contents of the URL using cURL, dumping stderr to null (no progress meter):

    curl "http://www.webpagewithtableinit.com/" 2>/dev/null

I only want table elements (return only lines with TABLE, TR, TH, TD tags):

    | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

Remove any whitespace at the beginning of the line:

    | sed 's/^[\ \t]*//g'

Remove newlines:

    | tr -d '\n\r'

Replace </TR> with a newline:

    | sed 's/<\/TR[^>]*>/\n/Ig'

Remove TABLE and TR tags:

    | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig'

Remove ^<TD>, ^<TH>, </TD>$, </TH>$:

    | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig'

Replace </TD><TD> with a comma:

    | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.

Hope this helps someone!
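
On the comma problem: if dropping out of pure shell is acceptable, Python's standard csv module does the quoting for you. A minimal sketch (the helper name rows_to_csv and the sample rows are made up for illustration):

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize a list of rows to CSV text, quoting only where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()

# A cell with a comma and a cell with double quotes both get quoted.
rows = [["name", "note"], ["Smith, John", 'said "hi"']]
print(rows_to_csv(rows))
```

Cells that contain the delimiter or a double quote are wrapped in double quotes, and embedded double quotes are doubled, which is the convention most spreadsheet programs expect.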

User · Answer

This method is not really a library OR a program, but for ad hoc conversions you can:

- put the HTML for a table in a text file called something.xls
- open it with a spreadsheet
- save it as CSV

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet. But you probably would prefer a Perl or Ruby script...

User · Answer

I'm not sure if there is a pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.
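
The same parser-plus-CSV-writer pairing can be sketched in Python using only the standard library (html.parser standing in for HTML::Parser and csv for Text::CSV). The names RowCollector and html_table_to_csv are invented for this example, and it assumes well-formed markup with explicit closing tags and no nested tables:

```python
import csv
import io
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text chunks while inside a cell, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

def html_table_to_csv(html):
    """Parse html and return its table rows as CSV text."""
    collector = RowCollector()
    collector.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(collector.rows)
    return out.getvalue()

sample = ("<table><tr><th>City</th><th>Pop.</th></tr>"
          "<tr><td>Paris, FR</td><td>2.1M</td></tr></table>")
print(html_table_to_csv(sample))
```

Splitting parsing from writing means the CSV library handles all quoting, so the parser only has to worry about finding cells.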

User · Answer

Based on audiodude's answer, but simplified by using the built-in CSV library:

    require 'nokogiri'
    require 'csv'

    doc = Nokogiri::HTML(table_string)
    csv = CSV.open("output.csv", 'w')

    doc.xpath('//table//tr').each do |row|
        tarray = [] # temporary array
        row.xpath('td').each do |cell|
            tarray << cell.text # Build array of that row of data.
        end
        csv << tarray # Write that row out to csv file
    end

    csv.close

I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write this out as an array to the csv file in one step. But I could only figure out doing it by iterating over each cell and building the temporary array of each cell's content.

User · Answer

With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.

User · Answer

Here's a simple solution without any external lib: https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/

It works for me without any issue.

User · Answer

Read HTML File and Use Ruby's CSV and nokogiri to Output to .csv

Based on @audiodude's answer but modified in the following ways:

- Reads from a file to get the HTML. This is handy for long HTML tables, but easily modified to just use a static String if your HTML table is small.
- Uses CSV's built-in library for converting an Array into a CSV row.
- Outputs to a .csv file instead of just printing to STDOUT.
- Gets both the table headers (th) and the table body (td).

    # Convert HTML table to CSV format.

    require "nokogiri"
    require "csv"

    html_file_path = ""
    html_string = File.read( html_file_path )
    doc = Nokogiri::HTML( html_string )

    CSV.open( Rails.root.join( Time.zone.now.to_s( :file ) + ".csv" ), "wb" ) do |csv|
      doc.xpath( "//table//tr" ).each do |row|
        csv << row.xpath( "th|td" ).collect( &:text ).collect( &:strip )
      end
    end

User · Answer

This is a very old thread, but maybe someone like me will bump into it. I have made some additions to audiodude's script so that it reads the HTML from a file instead of having it pasted into the code, and added another parameter that controls printing of the header lines.

The script should be run like this:

    ruby <script_name> <file_name> [<print_headers>]

The code is:

    require 'nokogiri'

    print_header_lines = ARGV[1]

    File.open(ARGV[0]) do |f|

      table_string = f
      doc = Nokogiri::HTML(table_string)

      doc.xpath('//table//tr').each do |row|
        if print_header_lines
          row.xpath('th').each do |cell|
            print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
          end
        end
        row.xpath('td').each do |cell|
          print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
        end
        print "\n"
      end
    end

User · Answer

Assuming that you've designed an HTML page containing a table, I would recommend this solution. Worked like a charm for me:

    $(document).ready(() => {
      $("#buttonExport").click(e => {
        // Getting values of current time for generating the file name
        const dateTime = new Date();
        const day      = dateTime.getDate();
        const month    = dateTime.getMonth() + 1;
        const year     = dateTime.getFullYear();
        const hour     = dateTime.getHours();
        const minute   = dateTime.getMinutes();
        const postfix  = `${day}.${month}.${year}_${hour}.${minute}`;

        // Creating a temporary HTML link element (they support setting file names)
        const downloadElement = document.createElement('a');

        // Getting data from our `div` that contains the HTML table
        const dataType  = 'data:application/vnd.ms-excel';
        const tableDiv  = document.getElementById('divData');
        const tableHTML = tableDiv.outerHTML.replace(/ /g, '%20');

        // Setting the download source
        downloadElement.href = `${dataType},${tableHTML}`;

        // Setting the file name
        downloadElement.download = `exported_table_${postfix}.xls`;

        // Trigger the download
        downloadElement.click();

        // Just in case, prevent default behaviour
        e.preventDefault();
      });
    });

Courtesy: http://www.kubilayerdogan.net/?p=218

You can edit the file format to .csv here:

    downloadElement.download = `exported_table_${postfix}.csv`;

User · Answer

Here's an updated version of Yuvai's answer, which properly handles fields that require quoting (i.e. fields that contain commas in the data, double quotes, or span multiple lines):

    #!/usr/bin/env python3
    from html.parser import HTMLParser
    import sys
    import re

    class HTMLTableParser(HTMLParser):
        def __init__(self, row_delim="\n", cell_delim=","):
            HTMLParser.__init__(self)
            self.despace_re = re.compile(r"\s+")
            self.data_interrupt = False
            self.first_row = True
            self.first_cell = True
            self.in_cell = False
            self.row_delim = row_delim
            self.cell_delim = cell_delim
            self.quote_buffer = False
            self.buffer = None

        def handle_starttag(self, tag, attrs):
            self.data_interrupt = True
            if tag == "table":
                self.first_row = True
                self.first_cell = True
            elif tag == "tr":
                if not self.first_row:
                    sys.stdout.write(self.row_delim)
                self.first_row = False
                self.first_cell = True
                self.data_interrupt = False
            elif tag == "td" or tag == "th":
                if not self.first_cell:
                    sys.stdout.write(self.cell_delim)
                self.first_cell = False
                self.data_interrupt = False
                self.in_cell = True
            elif tag == "br":
                # A line break inside a cell forces quoting of that cell.
                self.quote_buffer = True
                # Guard against a <br> that arrives before any cell text.
                self.buffer = (self.buffer or "") + self.row_delim

        def handle_endtag(self, tag):
            self.data_interrupt = True
            if tag == "td" or tag == "th":
                self.in_cell = False
            if self.buffer is not None:
                # Quote if needed...
                if self.quote_buffer or self.cell_delim in self.buffer or "\"" in self.buffer:
                    # Need to quote! First, double every double quote.
                    self.buffer = self.buffer.replace("\"", "\"\"")
                    self.buffer = "\"{0}\"".format(self.buffer)
                sys.stdout.write(self.buffer)
                self.quote_buffer = False
                self.buffer = None

        def handle_data(self, data):
            if self.in_cell:
                if self.data_interrupt:
                    sys.stdout.write(" ")
                if self.buffer is None:
                    self.buffer = ""
                self.buffer += self.despace_re.sub(" ", data).strip()
                self.data_interrupt = False

    parser = HTMLTableParser()
    parser.feed(sys.stdin.read())

One enhancement for this script could be to add support for specifying a different line delimiter (or auto-calculating the platform-correct one), and a different column delimiter.
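
As a rough illustration of that enhancement, here is a small sketch that hands both delimiters to Python's csv module and falls back to os.linesep when no row delimiter is given. The function name write_rows and its parameters are assumptions for this example, and it starts from rows that have already been parsed rather than from the parser above:

```python
import csv
import io
import os

def write_rows(rows, stream, cell_delim=",", row_delim=None):
    """Write rows to stream with configurable cell and row delimiters."""
    if row_delim is None:
        # Assumed default: the current platform's line separator.
        row_delim = os.linesep
    writer = csv.writer(stream, delimiter=cell_delim, lineterminator=row_delim)
    writer.writerows(rows)

buf = io.StringIO()
write_rows([["a", "b;c"], ["1", "2"]], buf, cell_delim=";", row_delim="\n")
print(buf.getvalue())
```

When writing to a real file, open it with newline="" so the line terminator chosen here is not translated a second time by Python's text layer.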
