[sql-server] SQL Server insert if not exists best practice

Normalizing your operational tables as suggested by Transact Charlie, is a good idea, and will save many headaches and problems over time - but there are such things as interface tables, which support integration with external systems, and reporting tables, which support things like analytical processing; and those types of tables should not necessarily be normalized - in fact, very often it is much, much more convenient and performant for them to not be.

In this case, I think Transact Charlie's proposal for your operational tables is a good one.

But I would add an index (not necessarily unique) to CompetitorName in the Competitors table to support efficient joins on CompetitorName for the purposes of integration (loading of data from external sources), and I would put an interface table into the mix: CompetitionResults.

CompetitionResults should contain whatever data your competition results have in it. The point of an interface table like this one is to make it as quick and easy as possible to truncate and reload it from an Excel sheet or a CSV file, or whatever form you have that data in.

That interface table should not be considered part of the normalized set of operational tables. Then you can join with CompetitionResults as suggested by Richard, to insert records into Competitors that don't already exist, and update the ones that do (for example if you actually have more information about competitors, like their phone number or email address).

One thing I would note - in reality, Competitor Name, it seems to me, is very unlikely to be unique in your data. In 200,000 competitors, you may very well have 2 or more David Smiths, for example. So I would recommend that you collect more information from competitors, such as their phone number or an email address, or something which is more likely to be unique.

Your operational table, Competitors, should just have one column for each data item that contributes to a composite natural key; for example it should have one column for a primary email address. But the interface table should have a slot for old and new values for a primary email address, so that the old value can be use to look up the record in Competitors and update that part of it to the new value.

So CompetitionResults should have some "old" and "new" fields - oldEmail, newEmail, oldPhone, newPhone, etc. That way you can form a composite key, in Competitors, from CompetitorName, Email, and Phone.

Then when you have some competition results, you can truncate and reload your CompetitionResults table from your excel sheet or whatever you have, and run a single, efficient insert to insert all the new competitors into the Competitors table, and single, efficient update to update all the information about the existing competitors from the CompetitionResults. And you can do a single insert to insert new rows into the CompetitionCompetitors table. These things can be done in a ProcessCompetitionResults stored procedure, which could be executed after loading the CompetitionResults table.

That's a sort of rudimentary description of what I've seen done over and over in the real world with Oracle Applications, SAP, PeopleSoft, and a laundry list of other enterprise software suites.

One last comment I'd make is one I've made before on SO: If you create a foreign key that insures that a Competitor exists in the Competitors table before you can add a row with that Competitor in it to CompetitionCompetitors, make sure that foreign key is set to cascade updates and deletes. That way if you need to delete a competitor, you can do it and all the rows associated with that competitor will get automatically deleted. Otherwise, by default, the foreign key will require you to delete all the related rows out of CompetitionCompetitors before it will let you delete a Competitor.

(Some people think non-cascading foreign keys are a good safety precaution, but my experience is that they're just a freaking pain in the butt that are more often than not simply a result of an oversight and they create a bunch of make work for DBA's. Dealing with people accidentally deleting stuff is why you have things like "are you sure" dialogs and various types of regular backups and redundant data sources. It's far, far more common to actually want to delete a competitor, whose data is all messed up for example, than it is to accidentally delete one and then go "Oh no! I didn't mean to do that! And now I don't have their competition results! Aaaahh!" The latter is certainly common enough, so, you do need to be prepared for it, but the former is far more common, so the easiest and best way to prepare for the former, imo, is to just make foreign keys cascade updates and deletes.)