[python] Data structure for maintaining tabular data in memory?

My scenario is as follows: I have a table of data (handful of fields, less than a hundred rows) that I use extensively in my program. I also need this data to be persistent, so I save it as a CSV and load it on start-up. I choose not to use a database because every option (even SQLite) is an overkill for my humble requirement (also - I would like to be able to edit the values offline in a simple way, and nothing is simpler than notepad).

Assume my data looks as follows (in the file it's comma separated without titles, this is just an illustration):

 Row  | Name     | Year   | Priority
------------------------------------
 1    | Cat      | 1998   | 1
 2    | Fish     | 1998   | 2
 3    | Dog      | 1999   | 1 
 4    | Aardvark | 2000   | 1
 5    | Wallaby  | 2000   | 1
 6    | Zebra    | 2001   | 3

Notes:

  1. Row may be a "real" value written to the file or just an auto-generated value that represents the row number. Either way it exists in memory.
  2. Names are unique.

Things I do with the data:

  1. Look-up a row based on either ID (iteration) or name (direct access).
  2. Display the table in different orders based on multiple field: I need to sort it e.g. by Priority and then Year, or Year and then Priority, etc.
  3. I need to count instances based on sets of parameters, e.g. how many rows have their year between 1997 and 2002, or how many rows are in 1998 and priority > 2, etc.

I know this "cries" for SQL...

I'm trying to figure out what's the best choice for data structure. Following are several choices I see:

List of row lists:

a = []
a.append( [1, "Cat", 1998, 1] )
a.append( [2, "Fish", 1998, 2] )
a.append( [3, "Dog", 1999, 1] )
...

List of column lists (there will obviously be an API for add_row etc):

a = []
a.append( [1, 2, 3, 4, 5, 6] )
a.append( ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"] )
a.append( [1998, 1998, 1999, 2000, 2000, 2001] )
a.append( [1, 2, 1, 1, 1, 3] )

Dictionary of columns lists (constants can be created to replace the string keys):

a = {}
a['ID'] = [1, 2, 3, 4, 5, 6]
a['Name'] = ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"] 
a['Year'] = [1998, 1998, 1999, 2000, 2000, 2001] 
a['Priority'] = [1, 2, 1, 1, 1, 3] 

Dictionary with keys being tuples of (Row, Field):

Create constants to avoid string searching
NAME=1
YEAR=2
PRIORITY=3

a={}
a[(1, NAME)] = "Cat"
a[(1, YEAR)] = 1998
a[(1, PRIORITY)] = 1
a[(2, NAME)] = "Fish"
a[(2, YEAR)] = 1998
a[(2, PRIORITY)] = 2
...

And I'm sure there are other ways... However each way has disadvantages when it comes to my requirements (complex ordering and counting).

What's the recommended approach?

EDIT:

To clarify, performance is not a major issue for me. Because the table is so small, I believe almost every operation will be in the range of milliseconds, which is not a concern for my application.

This question is related to python data-structures

The answer is


First, given that you have a complex data retrieval scenario, are you sure even SQLite is overkill?

You'll end up having an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQLite, paraphrasing Greenspun's Tenth Rule.

That said, you are very right in saying that choosing a single data structure will impact one or more of searching, sorting or counting, so if performance is paramount and your data is constant, you could consider having more than one structure for different purposes.

Above all, measure what operations will be more common and decide which structure will end up costing less.


I personally would use the list of row lists. Because the data for each row is always in the same order, you can easily sort by any of the columns by simply accessing that element in each of the lists. You can also easily count based on a particular column in each list, and make searches as well. It's basically as close as it gets to a 2-d array.

Really the only disadvantage here is that you have to know in what order the data is in, and if you change that ordering, you'll have to change your search/sorting routines to match.

Another thing you can do is have a list of dictionaries.

rows = []
rows.append({"ID":"1", "name":"Cat", "year":"1998", "priority":"1"})

This would avoid needing to know the order of the parameters, so you can look through each "year" field in the list.


Have a Table class whose rows is a list of dict or better row objects

In table do not directly add rows but have a method which update few lookup maps e.g. for name if you are not adding rows in order or id are not consecutive you can have idMap too e.g.

class Table(object):
    def __init__(self):
        self.rows =  []# list of row objects, we assume if order of id
        self.nameMap = {} # for faster direct lookup for row by name

    def addRow(self, row):
        self.rows.append(row)
        self.nameMap[row['name']] = row

    def getRow(self, name):
        return self.nameMap[name]


table = Table()
table.addRow({'ID':1,'name':'a'})

I personally wrote a lib for pretty much that quite recently, it is called BD_XML

as its most fundamental reason of existence is to serve as a way to send data back and forth between XML files and SQL databases.

It is written in Spanish (if that matters in a programming language) but it is very simple.

from BD_XML import Tabla

It defines an object called Tabla (Table), it can be created with a name for identification an a pre-created connection object of a pep-246 compatible database interface.

Table = Tabla('Animals') 

Then you need to add columns with the agregar_columna (add_column) method, with can take various key word arguments:

  • campo (field): the name of the field

  • tipo (type): the type of data stored, can be a things like 'varchar' and 'double' or name of python objects if you aren't interested in exporting to a data base latter.

  • defecto (default): set a default value for the column if there is none when you add a row

  • there are other 3 but are only there for database tings and not actually functional

like:

Table.agregar_columna(campo='Name', tipo='str')
Table.agregar_columna(campo='Year', tipo='date')
#declaring it date, time, datetime or timestamp is important for being able to store it as a time object and not only as a number, But you can always put it as a int if you don't care for dates
Table.agregar_columna(campo='Priority', tipo='int')

Then you add the rows with the += operator (or + if you want to create a copy with an extra row)

Table += ('Cat', date(1998,1,1), 1)
Table += {'Year':date(1998,1,1), 'Priority':2, Name:'Fish'}
#…
#The condition for adding is that is a container accessible with either the column name or the position of the column in the table

Then you can generate XML and write it to a file with exportar_XML (export_XML) and escribir_XML (write_XML):

file = os.path.abspath(os.path.join(os.path.dirname(__file__), 'Animals.xml'))
Table.exportar_xml()
Table.escribir_xml(file)

And then import it back with importar_XML (import_XML) with the file name and indication that you are using a file and not an string literal:

Table.importar_xml(file, tipo='archivo')
#archivo means file

Advanced

This are ways you can use a Tabla object in a SQL manner.

#UPDATE <Table> SET Name = CONCAT(Name,' ',Priority), Priority = NULL WHERE id = 2
for row in Table:
    if row['id'] == 2:
        row['Name'] += ' ' + row['Priority']
        row['Priority'] = None
print(Table)

#DELETE FROM <Table> WHERE MOD(id,2) = 0 LIMIT 1
n = 0
nmax = 1
for row in Table:
    if row['id'] % 2 == 0:
        del Table[row]
        n += 1
        if n >= nmax: break
print(Table)

this examples assume a column named 'id' but can be replaced width row.pos for your example.

if row.pos == 2:

The file can be download from:

https://bitbucket.org/WolfangT/librerias


A very old question I know but...

A pandas DataFrame seems to be the ideal option here.

http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html

From the blurb

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure

http://pandas.pydata.org/