[sql] Sample database for exercise

I would like to play with a larger database to test my knowledge of SQL.

Is there a huge .sql file out there that I can use to play with SQL queries?

This question is related to: sql, test-data, sample-data

The answers are below.


This is what I am using for learning SQL: employees-db

It is a sample database with an integrated test suite, used to test your applications and database servers.

3rd party edit

According to launchpad.net, the database has moved to GitHub.

The database contains about 300,000 employee records with 2.8 million salary entries. The export data is 167 MB, which is not huge, but heavy enough to be non-trivial for testing.

The data was generated, and as such there are inconsistencies and subtle problems. Rather than removing them, we decided to leave the contents untouched, and use these issues as data cleaning exercises.
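Once loaded, a couple of quick queries make it easy to confirm the import worked. A minimal sketch, assuming a MySQL server and the employees.sql dump from that repository (the employees and salaries table names come from that project):

    -- load the dump first, e.g. from the shell:  mysql -u root -p < employees.sql
    use employees;
    select count(*) from employees;   -- roughly 300,000 rows
    select count(*) from salaries;    -- roughly 2.8 million rows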


If you want a big database of real data to play with, you could sign up for the Netflix Prize contest and get access to their data, which is pretty large (a few gigs of entries).

3rd party edit

The URL above no longer contains the dataset (October 2016). The Wikipedia page about the Netflix Prize reports that a lawsuit regarding privacy concerns was settled.


This is an online database rather than a download, but you can try the Stack Overflow database: https://data.stackexchange.com/stackoverflow/query/new

You can also download its dumps here:

https://archive.org/download/stackexchange
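To get a feel for it, here is a minimal sketch of a query you could paste into the Data Explorer; it assumes the public schema's Posts table with its Title, Score, and PostTypeId columns (the Data Explorer runs on SQL Server, hence the TOP syntax):

    -- the ten highest-scoring questions (PostTypeId = 1 marks questions)
    select top 10 Title, Score
    from Posts
    where PostTypeId = 1
    order by Score desc;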


Check out CodePlex for Microsoft SQL Server Community Projects & Samples

3rd party edit

On top of the link above you might look at


Why not download the English Wikipedia? There are compressed SQL files of various sizes, and it should certainly be large enough for you.

The main articles are in XML, so inserting them into the db is a bit more of a problem, but you might find there are other files there that suit you. For example, the inter-page links SQL file is 2.3 GB compressed. Have a look at https://en.wikipedia.org/wiki/Wikipedia:Database_download for more info.
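If you go that route, loading one of the SQL dumps is mostly a matter of piping it into the server. A minimal sketch, assuming MySQL and a downloaded pagelinks dump; the exact file name depends on the dump date, pagelinks is the MediaWiki table it populates, and wikidb here is just a placeholder database name:

    -- from the shell:  gunzip < enwiki-latest-pagelinks.sql.gz | mysql -u root -p wikidb
    -- then a quick sanity check inside MySQL:
    use wikidb;
    select count(*) from pagelinks;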

Oskar


You want huge?

Here's a small table:

    create table foo (
        id   int not null primary key auto_increment,
        crap varchar(2000)
    );

    insert into foo (crap) values ('');

    -- each time you run the next line, the number of rows in foo doubles
    insert into foo (crap) select crap from foo;

Run it twenty more times and you have over a million rows to play with.

Yes, if he's looking for lots of relations to navigate, this is not the answer. But if by huge he means enough data to test performance and his ability to optimize, this will do it. I did exactly this (and then updated the rows with random values) to test a potential answer I had for another question. (And I didn't answer it, because I couldn't come up with better performance than what that asker already had.)
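For reference, filling those rows with random content is a one-liner in the same spirit; a rough sketch, assuming MySQL, where md5(rand()) just yields an arbitrary string per row:

    -- give every row a different, throwaway value
    update foo set crap = md5(rand());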

Had he asked for "complex", I'd have given a different answer. To me, "huge" implies "lots of rows".

Because you don't need huge to play with tables and relations. Consider a table, by itself, with no nullable columns. How many different kinds of rows can there be? Only one, since all columns must have some value, as none can be null.

Every nullable column multiplies the number of different kinds of rows possible by two: a row where that column is null, and a row where it isn't.

Now consider the table, not in isolation. Consider a table that is a child table: for every child that has an FK to its parent (that is, a many-to-one relation), there can be 0, 1, or many children. So we multiply the count we got in the previous step by three (no rows for zero, one row for exactly one, two rows for many). For any grandparent to which the parent is itself a many, multiply by another three.

For many-to-many relations, we can have no relation, a one-to-one, a one-to-many, a many-to-one, or a many-to-many. So for each many-to-many we can reach in a graph from the table, we multiply the rows by nine -- just like two one-to-manys. If the many-to-many also carries data, we also multiply by its nullability number.

Tables that we can't reach in our graph -- those we have no direct or indirect FK to -- don't multiply the rows in our table.

By recursively multiplying across each table we can reach, we can come up with the number of rows needed to provide one of each "kind", and we need no more than those to test every possible relation in our schema. And we're nowhere near huge.
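As a small worked example of that counting, with a made-up schema (the customer and orders names are purely illustrative), following the multipliers above:

    -- a parent with one nullable column, and a child with two nullable columns
    -- plus a single many-to-one FK to the parent
    create table customer (
        id    int not null primary key,
        phone varchar(20)                 -- nullable: x2
    );
    create table orders (
        id          int not null primary key,
        customer_id int not null,         -- many-to-one to customer: x3
        note        varchar(100),         -- nullable: x2
        shipped_at  date,                 -- nullable: x2
        foreign key (customer_id) references customer(id)
    );
    -- kinds of rows to cover: 2 x 2 (orders' own nullable columns)
    --                       x 2     (the parent's nullable column)
    --                       x 3     (zero, one, or many orders per customer)
    --                       = 24 representative rows, nowhere near huge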