[database-design] How to Store Historical Data

Some co-workers and I got into a debate on the best way to store historical data. Currently, for some systems, I use a separate table to store historical data, and I keep an original table for the current, active record. So, let's say I have table FOO. Under my system, all active records will go in FOO, and all historical records will go in FOO_Hist. Many different fields in FOO can be updated by the user, so I want to keep an accurate account of everything updated. FOO_Hist holds the exact same fields as FOO with the exception of an auto-incrementing HIST_ID. Every time FOO is updated, I perform an insert statement into FOO_Hist similar to: insert into FOO_HIST select * from FOO where id = @id.
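
For illustration, the copy could also be automated with an update trigger along these lines (just a sketch; ID, NAME, and VALUE stand in for FOO's real columns, and HIST_ID is assumed to be an identity column that fills itself in):

    -- Hypothetical sketch: after each update, copy the row's current state
    -- into FOO_Hist (mirrors the insert statement above).
    CREATE TRIGGER trg_FOO_History ON FOO
    AFTER UPDATE
    AS
    BEGIN
        INSERT INTO FOO_Hist (ID, NAME, VALUE)
        SELECT i.ID, i.NAME, i.VALUE
        FROM INSERTED i;
    END;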

My co-worker says that this is bad design because I shouldn't have an exact copy of a table for historical reasons and should just insert another record into the active table with a flag indicating that it's for historical purposes.

Is there a standard for dealing with historical data storage? It seems to me that I don't want to clutter my active records with all of my historical records in the same table considering that it may be well over a million records (I'm thinking long term).

How do you or your company handle this?

I'm using MS SQL Server 2008, but I'd like to keep the answer generic and independent of any particular DBMS.

This question is related to: database-design, versioning

The answers are:


You can create a materialized/indexed view on the table. Based on your requirements you can do a full or partial refresh of the view. Please see this for creating a materialized view and log: How to create materialized views in SQL Server?
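
For reference, an indexed view in SQL Server is created roughly like this (a sketch; dbo.FOO and its columns are placeholders):

    -- Sketch of an indexed view (SQL Server's flavor of a materialized view).
    -- SCHEMABINDING plus the unique clustered index are what materialize it.
    CREATE VIEW dbo.vFOO
    WITH SCHEMABINDING
    AS
    SELECT ID, NAME, VALUE
    FROM dbo.FOO;
    GO
    CREATE UNIQUE CLUSTERED INDEX IX_vFOO ON dbo.vFOO (ID);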


In SQL Server 2016 and above, there is a new feature called Temporal Tables that aims to solve this challenge with minimal effort from the developer. The concept of a temporal table is similar to Change Data Capture (CDC), with the difference that a temporal table abstracts most of the things that you had to do manually if you were using CDC.
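
A minimal sketch of a system-versioned temporal table (the column names here are illustrative, not from the original question):

    -- SQL Server 2016+: the engine maintains dbo.FOO_Hist automatically
    -- on every UPDATE and DELETE against dbo.FOO.
    CREATE TABLE dbo.FOO
    (
        ID        int           NOT NULL PRIMARY KEY CLUSTERED,
        NAME      nvarchar(100) NULL,
        ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.FOO_Hist));

    -- Query current and historical data together, as of a point in time.
    SELECT * FROM dbo.FOO FOR SYSTEM_TIME AS OF '2019-01-01';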


The real question is: do you need to use historical data and active data together for reporting? If so, keep them in one table, partition it, and create a view of the active records to use in active queries. If you only need to look at the historical data occasionally (to research legal issues or some such), then put it in a separate table.
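
As a sketch of the single-table variant, the view over active rows could be as simple as this (the IsActive flag is an assumption; any "current record" marker works):

    -- Hypothetical view exposing only the active rows of a combined table.
    CREATE VIEW dbo.FOO_Active
    AS
    SELECT ID, NAME, VALUE
    FROM dbo.FOO
    WHERE IsActive = 1;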


You could just partition the tables, no?

"Partitioned Table and Index Strategies Using SQL Server 2008 When a database table grows in size to the hundreds of gigabytes or more, it can become more difficult to load new data, remove old data, and maintain indexes. Just the sheer size of the table causes such operations to take much longer. Even the data that must be loaded or removed can be very sizable, making INSERT and DELETE operations on the table impractical. The Microsoft SQL Server 2008 database software provides table partitioning to make such operations more manageable."


I don't think there is a particular standard way of doing it, but I thought I would throw in a possible method. I work in Oracle, and our in-house web application framework uses XML for storing application data.

We use something called a Master-Detail model that at its simplest consists of:

Master table, for example called Widgets, often just containing an ID. It will often contain data that won't change over time / isn't historical.

Detail / History table, for example called Widget_Details, containing at least:

  • ID - primary key. Detail/historical ID
  • MASTER_ID - for example in this case called 'WIDGET_ID', this is the FK to the Master record
  • START_DATETIME - timestamp indicating the start of that database row
  • END_DATETIME - timestamp indicating the end of that database row
  • STATUS_CONTROL - single char column indicating the status of the row. 'C' indicates current; NULL or 'A' would be historical/archived. We only use this because we can't index on END_DATETIME being NULL
  • CREATED_BY_WUA_ID - stores the ID of the account that caused the row to be created
  • XMLDATA - stores the actual data

So essentially, an entity starts with one row in the master and one row in the detail, the detail row having a NULL end date and a STATUS_CONTROL of 'C'. When an update occurs, the current detail row is updated to have an END_DATETIME of the current time and its STATUS_CONTROL is set to NULL (or 'A' if preferred). A new row is created in the detail table, still linked to the same master, with STATUS_CONTROL 'C', the ID of the person making the update, and the new data stored in the XMLDATA column.
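
A rough sketch of that update step in SQL (the sequence name, bind variables, and column list are made up for illustration):

    -- Close the current detail row...
    UPDATE Widget_Details
       SET END_DATETIME   = SYSTIMESTAMP,
           STATUS_CONTROL = NULL
     WHERE WIDGET_ID = :widget_id
       AND STATUS_CONTROL = 'C';

    -- ...and open a new one carrying the updated XML payload.
    INSERT INTO Widget_Details
        (ID, WIDGET_ID, START_DATETIME, END_DATETIME,
         STATUS_CONTROL, CREATED_BY_WUA_ID, XMLDATA)
    VALUES
        (widget_details_seq.NEXTVAL, :widget_id, SYSTIMESTAMP, NULL,
         'C', :user_id, :new_xml);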

This is the basis of our historical model. The create/update logic is handled in an Oracle PL/SQL package, so you simply pass the function the current ID, your user ID, and the new XML data, and internally it does all the updating/inserting of rows to represent that in the historical model. The start and end times indicate the period during which that row in the table is active.

Storage is cheap; we don't generally DELETE data and prefer to keep an audit trail. This allows us to see what our data looked like at any given time. By indexing on status_control = 'C' or using a view, clutter isn't really a problem. Obviously your queries need to take into account that you should always use the current (NULL end_datetime and status_control = 'C') version of a record.


I know this is an old post, but I just wanted to add a few points. The standard for such problems is whatever works best for the situation. Understanding the need for such storage, and the potential use of the historical/audit/change-tracking data, is very important.

Audit (security purposes): Use a common table for all your auditable tables. Define a structure to store the column name, the before value, and the after value.
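
A sketch of such a shared audit table (all names and types are placeholders):

    -- One audit table shared by every auditable table: which table/column
    -- changed, from what to what, by whom, and when.
    CREATE TABLE AUDIT_LOG
    (
        AUDIT_ID    bigint        IDENTITY(1,1) PRIMARY KEY,
        TABLE_NAME  sysname       NOT NULL,
        COLUMN_NAME sysname       NOT NULL,
        RECORD_ID   varchar(100)  NOT NULL,
        OLD_VALUE   nvarchar(max) NULL,
        NEW_VALUE   nvarchar(max) NULL,
        CHANGED_BY  nvarchar(100) NOT NULL,
        CHANGED_AT  datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
    );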

Archive/Historical: for cases like tracking previous addresses, phone numbers, etc., creating a separate table FOO_HIST is better if your active transaction table schema does not change significantly in the future (that is, if your history table has to have the same structure). If you anticipate table normalization, datatype changes, or the addition/removal of columns, store your historical data in XML format. Define a table with the following columns (ID, Date, Schema Version, XMLData). This will easily handle schema changes, but you have to deal with XML, and that could introduce a level of complication for data retrieval.
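
A sketch of that XML-based history table (names are placeholders):

    -- Schema-change-tolerant history: the whole row is serialized into XMLDATA,
    -- and SCHEMA_VERSION records which shape of FOO the payload corresponds to.
    CREATE TABLE FOO_HIST_XML
    (
        ID             bigint    IDENTITY(1,1) PRIMARY KEY,
        RECORD_ID      int       NOT NULL,
        CHANGE_DATE    datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
        SCHEMA_VERSION int       NOT NULL,
        XMLDATA        xml       NOT NULL
    );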


Another option is to archive the operational data on a [daily|hourly|whatever] basis. Most database engines support the extraction of the data into an archive.

Basically, the idea is to create a scheduled Windows or CRON job that

  1. determines the current tables in the operational database
  2. selects all data from every table into a CSV or XML file
  3. compresses the exported data to a ZIP file, preferably with the timestamp of the generation in the file name for easier archiving.

Many SQL database engines come with a tool that can be used for this purpose. For example, when using MySQL on Linux, the following command can be used in a CRON job to schedule the extraction:

mysqldump --all-databases --xml --lock-tables=false -ppassword | gzip -c | cat > /media/bak/servername-$(date +%Y-%m-%d)-mysql.xml.gz

I think your approach is correct. The historical table should be a copy of the main table without indexes; make sure you have an update timestamp in the table as well.

If you try the other approach, you will soon face problems:

  • maintenance overhead
  • more flags in selects
  • queries slowdown
  • growth of tables, indexes

You can use the MS SQL Server Auditing feature. From SQL Server 2012 onward, this feature is available in all editions:

http://technet.microsoft.com/en-us/library/cc280386.aspx
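
For orientation, setting up SQL Server Audit looks roughly like this (the audit name, file path, and target table are placeholders):

    -- Server-level audit destination (run in master), then a database audit
    -- specification that records UPDATEs and DELETEs against dbo.FOO.
    CREATE SERVER AUDIT FooAudit
        TO FILE (FILEPATH = 'C:\AuditLogs\');
    ALTER SERVER AUDIT FooAudit WITH (STATE = ON);

    CREATE DATABASE AUDIT SPECIFICATION FooAuditSpec
        FOR SERVER AUDIT FooAudit
        ADD (UPDATE, DELETE ON dbo.FOO BY public)
        WITH (STATE = ON);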


Just wanted to add an option that I started using because I use Azure SQL and the multiple-table approach was way too cumbersome for me. I added an insert/update/delete trigger on my table and then converted the before/after change to JSON using the "FOR JSON AUTO" feature.

SET @beforeJson = (SELECT * FROM DELETED FOR JSON AUTO)
SET @afterJson  = (SELECT * FROM INSERTED FOR JSON AUTO)

That returns a JSON representation of the record before/after the change. I then store those values in a history table with a timestamp of when the change occurred (I also store the ID of the record concerned). Using the serialization process, I can control how data is backfilled in the case of schema changes.
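
For context, a minimal sketch of such a trigger (the table, columns, and history table are assumptions, not the poster's actual schema):

    -- Serialize the before/after state of the affected rows to JSON and
    -- store both alongside a change timestamp.
    CREATE TRIGGER trg_FOO_JsonHistory ON dbo.FOO
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        DECLARE @beforeJson nvarchar(max) = (SELECT * FROM DELETED FOR JSON AUTO);
        DECLARE @afterJson  nvarchar(max) = (SELECT * FROM INSERTED FOR JSON AUTO);

        INSERT INTO dbo.FOO_JsonHistory (ChangedAt, BeforeJson, AfterJson)
        VALUES (SYSUTCDATETIME(), @beforeJson, @afterJson);
    END;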

I learned about this from this link here