[database-design] How to Store Historical Data

Some co-workers and I got into a debate on the best way to store historical data. Currently, for some systems, I use a separate table to store historical data, and I keep an original table for the current, active record. So, let's say I have table FOO. Under my system, all active records will go in FOO, and all historical records will go in FOO_Hist. Many different fields in FOO can be updated by the user, so I want to keep an accurate account of everything updated. FOO_Hist holds the exact same fields as FOO with the exception of an auto-incrementing HIST_ID. Every time FOO is updated, I perform an insert statement into FOO_Hist similar to: insert into FOO_HIST select * from FOO where id = @id.
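
For illustration, the copy could also be automated with an update trigger along these lines (just a sketch; ID, NAME, and VALUE stand in for FOO's real columns, and HIST_ID is assumed to be an identity column that fills itself in):

    -- Hypothetical sketch: after each update, copy the row's current state
    -- into FOO_Hist (mirrors the insert statement above).
    CREATE TRIGGER trg_FOO_History ON FOO
    AFTER UPDATE
    AS
    BEGIN
        INSERT INTO FOO_Hist (ID, NAME, VALUE)
        SELECT i.ID, i.NAME, i.VALUE
        FROM INSERTED i;
    END;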

My co-worker says that this is bad design because I shouldn't have an exact copy of a table for historical reasons and should just insert another record into the active table with a flag indicating that it's for historical purposes.

Is there a standard for dealing with historical data storage? It seems to me that I don't want to clutter my active records with all of my historical records in the same table considering that it may be well over a million records (I'm thinking long term).

How do you or your company handle this?

I'm using MS SQL Server 2008, but I'd like to keep the answer generic and independent of any particular DBMS.

This question is related to: database-design, versioning

The answers are:


You can create a materialized/indexed view on the table. Based on your requirements you can do a full or partial refresh of the view. Please see this for creating a materialized view and log: How to create materialized views in SQL Server?
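
For reference, an indexed view in SQL Server is created roughly like this (a sketch; dbo.FOO and its columns are placeholders):

    -- Sketch of an indexed view (SQL Server's flavor of a materialized view).
    -- SCHEMABINDING plus the unique clustered index are what materialize it.
    CREATE VIEW dbo.vFOO
    WITH SCHEMABINDING
    AS
    SELECT ID, NAME, VALUE
    FROM dbo.FOO;
    GO
    CREATE UNIQUE CLUSTERED INDEX IX_vFOO ON dbo.vFOO (ID);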


In SQL Server 2016 and above, there is a new feature called Temporal Tables that aims to solve this challenge with minimal effort from the developer. The concept of a temporal table is similar to Change Data Capture (CDC), with the difference that a temporal table abstracts most of the things that you had to do manually if you were using CDC.
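
A minimal sketch of a system-versioned temporal table (the column names here are illustrative, not from the original question):

    -- SQL Server 2016+: the engine maintains dbo.FOO_Hist automatically
    -- on every UPDATE and DELETE against dbo.FOO.
    CREATE TABLE dbo.FOO
    (
        ID        int           NOT NULL PRIMARY KEY CLUSTERED,
        NAME      nvarchar(100) NULL,
        ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.FOO_Hist));

    -- Query current and historical data together, as of a point in time.
    SELECT * FROM dbo.FOO FOR SYSTEM_TIME AS OF '2019-01-01';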


The real question is: do you need to use historical data and active data together for reporting? If so, keep them in one table, partition it, and create a view of the active records to use in active queries. If you only need to look at the historical data occasionally (to research legal issues or some such), then put it in a separate table.
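
As a sketch of the single-table variant, the view over active rows could be as simple as this (the IsActive flag is an assumption; any "current record" marker works):

    -- Hypothetical view exposing only the active rows of a combined table.
    CREATE VIEW dbo.FOO_Active
    AS
    SELECT ID, NAME, VALUE
    FROM dbo.FOO
    WHERE IsActive = 1;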


You could just partition the tables, no?

"Partitioned Table and Index Strategies Using SQL Server 2008 When a database table grows in size to the hundreds of gigabytes or more, it can become more difficult to load new data, remove old data, and maintain indexes. Just the sheer size of the table causes such operations to take much longer. Even the data that must be loaded or removed can be very sizable, making INSERT and DELETE operations on the table impractical. The Microsoft SQL Server 2008 database software provides table partitioning to make such operations more manageable."


I don't think there is a particular standard way of doing it, but I thought I would throw in a possible method. I work in Oracle, and our in-house web application framework uses XML for storing application data.

We use something called a Master-Detail model that at its simplest consists of:

Master table, for example called Widgets, often just containing an ID. It will often contain data that won't change over time / isn't historical.

Detail / History table, for example called Widget_Details, containing at least:

  • ID - primary key. Detail/historical ID
  • MASTER_ID - for example in this case called 'WIDGET_ID', this is the FK to the Master record
  • START_DATETIME - timestamp indicating the start of that database row
  • END_DATETIME - timestamp indicating the end of that database row
  • STATUS_CONTROL - single char column indicating the status of the row. 'C' indicates current; NULL or 'A' would be historical/archived. We only use this because we can't index on END_DATETIME being NULL
  • CREATED_BY_WUA_ID - stores the ID of the account that caused the row to be created
  • XMLDATA - stores the actual data

So essentially, an entity starts with one row in the master and one row in the detail, the detail row having a NULL end date and a STATUS_CONTROL of 'C'. When an update occurs, the current detail row is updated to have an END_DATETIME of the current time and its STATUS_CONTROL is set to NULL (or 'A' if preferred). A new row is created in the detail table, still linked to the same master, with STATUS_CONTROL 'C', the ID of the person making the update, and the new data stored in the XMLDATA column.
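
A rough sketch of that update step in SQL (the sequence name, bind variables, and column list are made up for illustration):

    -- Close the current detail row...
    UPDATE Widget_Details
       SET END_DATETIME   = SYSTIMESTAMP,
           STATUS_CONTROL = NULL
     WHERE WIDGET_ID = :widget_id
       AND STATUS_CONTROL = 'C';

    -- ...and open a new one carrying the updated XML payload.
    INSERT INTO Widget_Details
        (ID, WIDGET_ID, START_DATETIME, END_DATETIME,
         STATUS_CONTROL, CREATED_BY_WUA_ID, XMLDATA)
    VALUES
        (widget_details_seq.NEXTVAL, :widget_id, SYSTIMESTAMP, NULL,
         'C', :user_id, :new_xml);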

This is the basis of our historical model. The create/update logic is handled in an Oracle PL/SQL package, so you simply pass the function the current ID, your user ID, and the new XML data, and internally it does all the updating/inserting of rows to represent that in the historical model. The start and end times indicate the period during which that row in the table is active.

Storage is cheap; we don't generally DELETE data and prefer to keep an audit trail. This allows us to see what our data looked like at any given time. By indexing on status_control = 'C' or using a view, clutter isn't really a problem. Obviously your queries need to take into account that you should always use the current (NULL end_datetime and status_control = 'C') version of a record.


I know this is an old post, but I just wanted to add a few points. The standard for such problems is whatever works best for the situation. Understanding the need for such storage, and the potential use of the historical/audit/change-tracking data, is very important.

Audit (security purposes): Use a common table for all your auditable tables. Define a structure to store the column name, the before value, and the after value.
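
A sketch of such a shared audit table (all names and types are placeholders):

    -- One audit table shared by every auditable table: which table/column
    -- changed, from what to what, by whom, and when.
    CREATE TABLE AUDIT_LOG
    (
        AUDIT_ID    bigint        IDENTITY(1,1) PRIMARY KEY,
        TABLE_NAME  sysname       NOT NULL,
        COLUMN_NAME sysname       NOT NULL,
        RECORD_ID   varchar(100)  NOT NULL,
        OLD_VALUE   nvarchar(max) NULL,
        NEW_VALUE   nvarchar(max) NULL,
        CHANGED_BY  nvarchar(100) NOT NULL,
        CHANGED_AT  datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
    );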

Archive/Historical: for cases like tracking previous addresses, phone numbers, etc., creating a separate table FOO_HIST is better if your active transaction table schema does not change significantly in the future (that is, if your history table has to have the same structure). If you anticipate table normalization, datatype changes, or the addition/removal of columns, store your historical data in XML format. Define a table with the following columns (ID, Date, Schema Version, XMLData). This will easily handle schema changes, but you have to deal with XML, and that could introduce a level of complication for data retrieval.
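
A sketch of that XML-based history table (names are placeholders):

    -- Schema-change-tolerant history: the whole row is serialized into XMLDATA,
    -- and SCHEMA_VERSION records which shape of FOO the payload corresponds to.
    CREATE TABLE FOO_HIST_XML
    (
        ID             bigint    IDENTITY(1,1) PRIMARY KEY,
        RECORD_ID      int       NOT NULL,
        CHANGE_DATE    datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
        SCHEMA_VERSION int       NOT NULL,
        XMLDATA        xml       NOT NULL
    );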


Another option is to archive the operational data on a [daily|hourly|whatever] basis. Most database engines support the extraction of the data into an archive.

Basically, the idea is to create a scheduled Windows or CRON job that

  1. determines the current tables in the operational database
  2. selects all data from every table into a CSV or XML file
  3. compresses the exported data to a ZIP file, preferably with the timestamp of the generation in the file name for easier archiving.

Many SQL database engines come with a tool that can be used for this purpose. For example, when using MySQL on Linux, the following command can be used in a CRON job to schedule the extraction:

mysqldump --all-databases --xml --lock-tables=false -ppassword | gzip -c | cat > /media/bak/servername-$(date +%Y-%m-%d)-mysql.xml.gz

I think your approach is correct. The historical table should be a copy of the main table without indexes; make sure you have an update timestamp in the table as well.

If you try the other approach, you will soon face problems:

  • maintenance overhead
  • more flags in selects
  • queries slowdown
  • growth of tables, indexes

You can use the MS SQL Server Auditing feature. From SQL Server 2012 onward, this feature is available in all editions:

http://technet.microsoft.com/en-us/library/cc280386.aspx
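
For orientation, setting up SQL Server Audit looks roughly like this (the audit name, file path, and target table are placeholders):

    -- Server-level audit destination (run in master), then a database audit
    -- specification that records UPDATEs and DELETEs against dbo.FOO.
    CREATE SERVER AUDIT FooAudit
        TO FILE (FILEPATH = 'C:\AuditLogs\');
    ALTER SERVER AUDIT FooAudit WITH (STATE = ON);

    CREATE DATABASE AUDIT SPECIFICATION FooAuditSpec
        FOR SERVER AUDIT FooAudit
        ADD (UPDATE, DELETE ON dbo.FOO BY public)
        WITH (STATE = ON);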


Just wanted to add an option that I started using because I use Azure SQL and the multiple-table approach was way too cumbersome for me. I added an insert/update/delete trigger on my table and then converted the before/after change to JSON using the "FOR JSON AUTO" feature.

SET @beforeJson = (SELECT * FROM DELETED FOR JSON AUTO)
SET @afterJson  = (SELECT * FROM INSERTED FOR JSON AUTO)

That returns a JSON representation of the record before/after the change. I then store those values in a history table with a timestamp of when the change occurred (I also store the ID of the record concerned). Using the serialization process, I can control how data is backfilled in the case of schema changes.
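
For context, a minimal sketch of such a trigger (the table, columns, and history table are assumptions, not the poster's actual schema):

    -- Serialize the before/after state of the affected rows to JSON and
    -- store both alongside a change timestamp.
    CREATE TRIGGER trg_FOO_JsonHistory ON dbo.FOO
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        DECLARE @beforeJson nvarchar(max) = (SELECT * FROM DELETED FOR JSON AUTO);
        DECLARE @afterJson  nvarchar(max) = (SELECT * FROM INSERTED FOR JSON AUTO);

        INSERT INTO dbo.FOO_JsonHistory (ChangedAt, BeforeJson, AfterJson)
        VALUES (SYSUTCDATETIME(), @beforeJson, @afterJson);
    END;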

I learned about this from this link here