[python] Pandas: Looking up the list of sheets in an excel file

The new version of Pandas uses the following interface to load Excel files:

read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

but what if I don't know the sheets that are available?

For example, I am working with excel files that the following sheets

Data 1, Data 2 ..., Data N, foo, bar

but I don't know N a priori.

Is there any way to get the list of sheets from an excel document in Pandas?

This question is related to python excel pandas openpyxl xlrd

The answer is


from openpyxl import load_workbook

sheets = load_workbook(excel_file, read_only=True).sheetnames

For a 5MB Excel file I'm working with, load_workbook without the read_only flag took 8.24s. With the read_only flag it only took 39.6 ms. If you still want to use an Excel library and not drop to an xml solution, that's much faster than the methods that parse the whole file.


This is the fastest way I have found, inspired by @divingTobi's answer. All The answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.

from zipfile import ZipFile
from bs4 import BeautifulSoup  # you also need to install "lxml" for the XML parser

with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]


If you:

  • care about performance
  • don't need the data in the file at execution time.
  • want to go with conventional libraries vs rolling your own solution

Below was benchmarked on a ~10Mb xlsx, xlsb file.

xlsx, xls

from openpyxl import load_workbook

def get_sheetnames_xlsx(filepath):
    wb = load_workbook(filepath, read_only=True, keep_links=False)
    return wb.sheetnames

Benchmarks: ~ 14x speed improvement

# get_sheetnames_xlsx vs pd.read_excel
225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

xlsb

from pyxlsb import open_workbook

def get_sheetnames_xlsb(filepath):
  with open_workbook(filepath) as wb:
     return wb.sheets

Benchmarks: ~ 56x speed improvement

# get_sheetnames_xlsb vs pd.read_excel
96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:


You should explicitly specify the second parameter (sheetname) as None. like this:

 df = pandas.read_excel("/yourPath/FileName.xlsx", None);

"df" are all sheets as a dictionary of DataFrames, you can verify it by run this:

df.keys()

result like this:

[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']

please refer pandas doc for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html


I have tried xlrd, pandas, openpyxl and other such libraries and all of them seem to take exponential time as the file size increase as it reads the entire file. The other solutions mentioned above where they used 'on_demand' did not work for me. If you just want to get the sheet names initially, the following function works for xlsx files.

def get_sheet_details(file_path):
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()

    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],
                'name': sheet['@name']
            }
            sheets.append(sheet_details)

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.

Benchmarking: (On a 6mb xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

Since my requirement was just reading the sheet names, the unnecessary overhead of reading the entire time was bugging me so I took this route instead.


Building on @dhwanil_shah 's answer, you do not need to extract the whole file. With zf.open it is possible to read from a zipped file directly.

import xml.etree.ElementTree as ET
import zipfile

def xlsxSheets(f):
    zf = zipfile.ZipFile(f)

    f = zf.open(r'xl/workbook.xml')

    l = f.readline()
    l = f.readline()
    root = ET.fromstring(l)
    sheets=[]
    for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
        sheets.append(c.attrib['name'])
    return sheets

The two consecutive readlines are ugly, but the content is only in the second line of the text. No need to parse the whole file.

This solution seems to be much faster than the read_excel version, and most likely also faster than the full extract version.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to excel

Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Converting unix time into date-time via excel How to increment a letter N times per iteration and store in an array? 'Microsoft.ACE.OLEDB.16.0' provider is not registered on the local machine. (System.Data) How to import an Excel file into SQL Server? Copy filtered data to another sheet using VBA Better way to find last used row Could pandas use column as index? Check if a value is in an array or not with Excel VBA How to sort dates from Oldest to Newest in Excel?

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to openpyxl

How to save a new sheet in an existing excel file, using Pandas? How to install Openpyxl with pip No module named 'openpyxl' - Python 3.4 - Ubuntu Pandas: Looking up the list of sheets in an excel file Is it possible to get an Excel document's row count without loading the entire document into memory? openpyxl - adjust column width size

Examples related to xlrd

xlrd.biffh.XLRDError: Excel xlsx file; not supported Edit existing excel workbooks and sheets with xlrd and xlwt Read Excel File in Python Pandas: Looking up the list of sheets in an excel file python xlrd unsupported format, or corrupt file. writing to existing workbook using xlwt Create a .csv file with values from a Python list