[html] How can I convert an HTML table to CSV?

How do I convert the contents of an HTML table (<table>) to CSV format? Is there a library or linux program that does this? This is similar to copy tables in Internet Explorer, and pasting them into Excel.

This question is related to html csv html-table

The answer is


This is based on atomicules' answer but more succinct and also processes th (header) cells as well as td cells. I also added the strip method to get rid of the extra whitespaces.

CSV.open("output.csv", 'w') do |csv|
  doc.xpath('//table//tr').each do |row|
    csv << row.xpath('th|td').map {|cell| cell.text.strip}
  end
end

Wrapping the code inside the CSV block ensures that the file will be closed properly.


If you just want the text and don't need to write it to a file, you can use this:

doc.xpath('//table//tr').inject('') do |result, row|
  result << row.xpath('th|td').map {|cell| cell.text.strip}.to_csv
end

Here is an example using pQuery and Spreadsheet::WriteExcel:

use strict;
use warnings;

use Spreadsheet::WriteExcel;
use pQuery;

my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet    = $workbook->add_worksheet;
my $row = 0;

pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
    my $col = 0;
    pQuery( $_ )->find( 'td' )->each( sub{
        $sheet->write( $row, $col++, $_->innerHTML );
    });
    $row++;
});

$workbook->close;

The example simply extracts all tr tags that it finds into an excel file. You can easily tailor it to pick up specific table or even trigger a new excel file per table tag.

Further things to consider:

  • You may want to pick up td tags to create excel header(s).
  • And you may have issues with rowspan & colspan.

To see if rowspan or colspan is being used you can:

pQuery( $data )->find( 'td' )->each( sub{ 
    my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});

This method is not really a library OR a program, but for ad hoc conversions you can

  • put the HTML for a table in a text file called something.xls
  • open it with a spreadsheet
  • save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.

But you probably would prefer a Perl or Ruby script...


Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/

require 'nokogiri'

doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
  end
  print "\n"
end

Worked for my basic test case.


Based on audiodude's answer, but simplified by using the built-in CSV library

require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')

doc.xpath('//table//tr').each do |row|
    tarray = [] #temporary array
    row.xpath('td').each do |cell|
        tarray << cell.text #Build array of that row of data.
    end
    csv << tarray #Write that row out to csv file
end

csv.close

I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write this out as an array to the csv file in one step. But I could only figure out doing it by iterating over each cell and building the temporary array of each cell's content.


Sorry for resurrecting an ancient thread, but I recently wanted to do this, but I wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.

The below was bashed out very quickly, and so could be made much more elegant, but I'm just getting started really with sed/awk etc...

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig'  | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.

Here's the explanation:

Get the Contents of the URL using cURL, dump stderr to null (no progress meter)

curl "http://www.webpagewithtableinit.com/" 2>/dev/null 

.

I only want Table elements (return only lines with TABLE,TR,TH,TD tags)

| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

.

Remove any Whitespace at the beginning of the line.

| sed 's/^[\ \t]*//g' 

.

Remove newlines

| tr -d '\n\r' 

.

Replace </TR> with newline

| sed 's/<\/TR[^>]*>/\n/Ig'  

.

Remove TABLE and TR tags

| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' 

.

Remove ^<TD>, ^<TH>, </TD>$, </TH>$

| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' 

.

Replace </TD><TD> with comma

| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

.

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.

Hope this helps someone!



OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then Paste Special in OpenOffice.org. It will query you for the file type, one of which should be HTML. Select that and voila!


Here's an updated version of Yuvai's answer, which properly handles fields that require quoting (i.e. fields that contain commas in the data, double quotes, or span multiple lines)

#!/usr/bin/env python3
from html.parser import HTMLParser
import sys
import re

class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim=","):
        HTMLParser.__init__(self)
        self.despace_re = re.compile("\s+")
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim
        self.quote_buffer = False
        self.buffer = None

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True
        elif tag == "br":
            self.quote_buffer = True
            self.buffer += self.row_delim

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False
        if self.buffer != None:
            # Quote if needed...
            if self.quote_buffer or self.cell_delim in self.buffer or "\"" in self.buffer:
                # Need to quote! First, replace all double-quotes with quad-quotes
                self.buffer = self.buffer.replace("\"", "\"\"")
                self.buffer = "\"{0}\"".format(self.buffer)
            sys.stdout.write(self.buffer)
            self.quote_buffer = False
            self.buffer = None

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            if self.buffer == None:
                self.buffer = ""
            self.buffer += self.despace_re.sub(" ", data).strip()
            self.data_interrupt = False

parser = HTMLTableParser() 
parser.feed(sys.stdin.read())

One enhancement for this script could be to add support for specifying a different line delimiter (or auto-calculate the platform-correct one), and a different column delimiter.


Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan or rowspan. Enjoy.

from HTMLParser import HTMLParser
import sys
import re


class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False


parser = HTMLTableParser() 
parser.feed(sys.stdin.read()) 

I'm not sure if there is pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.


Here a simple solution without any external lib:

https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/

It works for me without any issue


With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.


Just to add to these answers (as i've recently been attempting a similar thing) - if Google spreadsheets is your spreadsheeting program of choice. Simply do these two things.

1. Strip everything out of your html file around the Table opening/closing tags and resave it as another html file.

2. Import that html file directly into google spreadsheets and you'll have your information beautifully imported (Top tip: if you used inline styles in your table, they will be imported as well!)

Saved me loads of time and figuring out different conversions.


This is a very old thread, but may be someone like me will bump into it. I have made some additions for the audiodude's script to read the html from file instead adding it to the code, and another parameter that controls printing of the header lines.

the script should be run like that

ruby <script_name> <file_name> [<print_headers>]

the code is:

require 'nokogiri'

print_header_lines = ARGV[1]

File.open(ARGV[0]) do |f|

  table_string=f
  doc = Nokogiri::HTML(table_string)

  doc.xpath('//table//tr').each do |row|
    if print_header_lines
      row.xpath('th').each do |cell|
        print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
      end
    end
    row.xpath('td').each do |cell|
      print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
    end
    print "\n"
  end
end

Read HTML File and Use Ruby's CSV and nokogiri to Output to .csv.

Based on @audiodude's answer but modified in the following ways:

  • Reads from a file to get the HTML. This is handy for long HTML tables, but easily modified to just use a static String if your HTML table is small.
  • Uses CSV's built-in library for converting an Array into a CSV row.
  • Outputs to a .csv file instead of just printing to STDOUT.
  • Gets both the table headers (th) and the table body (td).
# Convert HTML table to CSV format.

require "nokogiri"

html_file_path = ""

html_string = File.read( html_file_path )

doc = Nokogiri::HTML( html_string )

CSV.open( Rails.root.join( Time.zone.now.to_s( :file ) + ".csv" ), "wb" ) do |csv|
  doc.xpath( "//table//tr" ).each do |row|
    csv << row.xpath( "th|td" ).collect( &:text ).collect( &:strip )
  end
end

Assuming that you've designed an HTML page containing a table, I would recommend this solution. Worked like charm for me:

$(document).ready(() => {
  $("#buttonExport").click(e => {
    // Getting values of current time for generating the file name
    const dateTime = new Date();
    const day      = dateTime.getDate();
    const month    = dateTime.getMonth() + 1;
    const year     = dateTime.getFullYear();
    const hour     = dateTime.getHours();
    const minute   = dateTime.getMinutes();
    const postfix  = `${day}.${month}.${year}_${hour}.${minute}`;

    // Creating a temporary HTML link element (they support setting file names)
    const downloadElement = document.createElement('a');

    // Getting data from our `div` that contains the HTML table
    const dataType  = 'data:application/vnd.ms-excel';
    const tableDiv  = document.getElementById('divData');
    const tableHTML = tableDiv.outerHTML.replace(/ /g, '%20');

    // Setting the download source
    downloadElement.href = `${dataType},${tableHTML}`;

    // Setting the file name
    downloadElement.download = `exported_table_${postfix}.xls`;

    // Trigger the download
    downloadElement.click();

    // Just in case, prevent default behaviour
    e.preventDefault();
  });
});

Courtesy: http://www.kubilayerdogan.net/?p=218

You can edit the file format to .csv here:

downloadElement.download = `exported_table_${postfix}.csv`;

Examples related to html

Embed ruby within URL : Middleman Blog Please help me convert this script to a simple image slider Generating a list of pages (not posts) without the index file Why there is this "clear" class before footer? Is it possible to change the content HTML5 alert messages? Getting all files in directory with ajax DevTools failed to load SourceMap: Could not load content for chrome-extension How to set width of mat-table column in angular? How to open a link in new tab using angular? ERROR Error: Uncaught (in promise), Cannot match any routes. URL Segment

Examples related to csv

Pandas: ValueError: cannot convert float NaN to integer Export result set on Dbeaver to CSV Convert txt to csv python script How to import an Excel file into SQL Server? "CSV file does not exist" for a filename with embedded quotes Save Dataframe to csv directly to s3 Python Data-frame Object has no Attribute (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape How to write to a CSV line by line? How to check encoding of a CSV file

Examples related to html-table

Table column sizing DataTables: Cannot read property 'length' of undefined TypeError: a bytes-like object is required, not 'str' in python and CSV How to get the <td> in HTML tables to fit content, and let a specific <td> fill in the rest How to stick table header(thead) on top while scrolling down the table rows with fixed header(navbar) in bootstrap 3? Sorting table rows according to table header column using javascript or jquery How to make background of table cell transparent How to auto adjust table td width from the content bootstrap responsive table content wrapping How to print table using Javascript?