[database] Database development mistakes made by application developers

What are common database development mistakes made by application developers?


The answer is


Not understanding how a DBMS works under the hood.

You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.

Specifically:

  1. Do you know what a Clustered Index is? Did you think about it when you designed your schema?

  2. Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?

  3. So great, you have indexes. How big is 1 row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't it's useless as an index.

  4. Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.

  5. Do you understand the Query Cache? Do you know what makes a query un-cachable?

  6. Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to InnoDB.
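
To make points 2 and 4 concrete, here is a minimal sketch of reading a query plan before and after adding an index. It uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN (the output format differs, the habit is the same). The orders table and the index name are made up for illustration.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total REAL)"
)

query = "SELECT total FROM orders WHERE customer_id = ?"

# No index yet: the planner has to scan the whole table.
before = plan(conn, query, (42,))

# Hypothetical composite index holding every column the query touches.
conn.execute(
    "CREATE INDEX idx_orders_customer_total ON orders (customer_id, total)"
)

# Same query, now answered from the index alone (a covering index).
covered = plan(conn, query, (42,))

# This query needs `status`, which is not in the index, so the index
# locates the rows but the table must still be read for each of them.
not_covered = plan(conn, "SELECT status FROM orders WHERE customer_id = ?", (42,))

print(before)
print(covered)
print(not_covered)
```

The exact wording of the plan varies between SQLite versions, so treat the printed text as something to read, not something to parse.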


This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.

Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBAs know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of sync with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.

And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to do maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".


Not using parameterized queries. They're pretty handy in stopping SQL Injection.

This is a specific example of not sanitizing input data, mentioned in another answer.
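
A minimal sketch of the difference, using Python's sqlite3 module (the users table and the malicious input are made up for illustration; other drivers use a different placeholder style, but the idea carries over):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [("alice", "a1"), ("bob", "b2")]
)

# Input as it might arrive from a form field.
malicious = "nobody' OR '1'='1"

# Unsafe: the input is spliced into the SQL text and changes the query's logic.
unsafe_sql = "SELECT name FROM users WHERE name = '" + malicious + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input travels as a bound value and is never parsed as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The concatenated version matches every row in the table; the parameterized version matches none, because no user is literally named that string.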


Blaming the db engine when the query that ran sooo fast on your development machine blows up and chokes once you throw some traffic at the application.


Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different tables as needed. You also need to make sure you are not taking normalization too far, as that will hurt performance.


15 - Using some crazy construct and application logic instead of a simple COALESCE.
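
For anyone who hasn't met it, COALESCE returns its first non-NULL argument. A small sketch with Python's sqlite3 module (the contacts table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, mobile TEXT, landline TEXT)"
)
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [(1, "555-0100", None), (2, None, "555-0199"), (3, None, None)],
)

# One expression replaces an if/else chain in application code:
# take the mobile number, else the landline, else a fixed label.
best_phone = conn.execute(
    "SELECT id, COALESCE(mobile, landline, 'no phone') "
    "FROM contacts ORDER BY id"
).fetchall()

print(best_phone)
```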


1) Poor understanding of how to properly interact between Java and the database.

2) Over parsing, improper or no reuse of SQL

3) Failing to use BIND variables

4) Implementing procedural logic in Java when SQL set logic in the database would have worked (better).

5) Failing to do any reasonable performance or scalability testing prior to going into production

6) Using Crystal Reports and failing to set the schema name properly in the reports

7) Implementing SQL with Cartesian products due to ignorance of the execution plan (did you even look at the EXPLAIN PLAN?)


Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.


  1. Using an ORM to do bulk updates
  2. Selecting more data than needed. Again, typically done when using an ORM
  3. Firing SQL statements in a loop.
  4. Not having good test data and noticing performance degradation only on live data.

Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
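
A minimal sketch of what a declared relationship buys you, using Python's sqlite3 module. The customers and orders tables are made up for illustration, and note that SQLite specifically only enforces foreign keys once the PRAGMA below is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless asked.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id))"
)

conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders VALUES (1, 1)")  # parent exists: accepted

orphan_rejected = False
try:
    # Customer 999 does not exist, so the database refuses the row.
    conn.execute("INSERT INTO orders VALUES (2, 999)")
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orphan_rejected, order_count)
```

Without the REFERENCES clause the second insert would succeed silently, and somebody gets to clean up the orphans later.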


The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database, and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess and inefficient and difficult to clean up or simplify later on.


Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).


Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
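
One way to build this into a habit, sketched with Python's sqlite3 module: run the SELECT with the exact same WHERE clause, look at what comes back, and only commit the DELETE if it touched the number of rows you previewed. The sessions table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER, expired INTEGER)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [(1, 10, 1), (2, 10, 0), (3, 11, 1), (4, 12, 0)],
)
conn.commit()

# The same WHERE clause is shared by the preview and the delete.
where = "expired = 1"

# Step 1: preview exactly which rows would go away.
to_delete = conn.execute(
    "SELECT id, user_id FROM sessions WHERE " + where + " ORDER BY id"
).fetchall()

# Step 2: delete inside a transaction and compare against the preview.
cur = conn.execute("DELETE FROM sessions WHERE " + where)
if cur.rowcount == len(to_delete):
    conn.commit()
else:
    conn.rollback()  # something changed in between: back out

remaining = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
print(to_delete, remaining)
```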


Not having an understanding of the database's concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However, applications designed without proper consideration for hotspots, resource contention and correct operation (assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
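
The "is what you just read still valid?" problem can be sketched with a version column (often called optimistic locking). This is only one of several approaches and is my own illustration, not something the answer prescribes; the accounts table and the withdraw helper are hypothetical, written against Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100, 1)")
conn.commit()

def withdraw(conn, account_id, amount, balance_seen, version_seen):
    """Apply the update only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (balance_seen - amount, account_id, version_seen),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means our read was stale

# Two workers read the same snapshot of the row...
balance, version = conn.execute(
    "SELECT balance, version FROM accounts WHERE id = 1"
).fetchone()

# ...and both try to write based on it.
first = withdraw(conn, 1, 30, balance, version)
second = withdraw(conn, 1, 50, balance, version)

final_balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1"
).fetchone()[0]
print(first, second, final_balance)
```

The second write is refused instead of silently overwriting the first, and the caller can re-read and retry.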


  • Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
  • Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.

Many developers tend to execute multiple queries against the database (often querying one or two tables), extract the results and perform simple operations in Java/C/C++, all of which could have been done with a single SQL statement.

Many developers often don't realize that in development environments the database and app servers are on their laptops, but in a production environment the database and app servers will be on different machines. Hence every query carries additional network overhead for the data passed between the app server and the database server. I have been amazed at the number of database calls made from the app server to the database server to render one page to the user!
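
A small sketch of the pattern, counting statements sent to the database. It uses Python's sqlite3 module with made-up authors and books tables; with an in-process database the round trips cost almost nothing, which is exactly why the problem stays hidden until production.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ann"), (2, "Ben")])
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, 1, "A1"), (2, 1, "A2"), (3, 2, "B1")],
)

def titles_query_per_author(conn):
    """One query for the authors, then one more query per author."""
    calls = 1
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        calls += 1
        result[name] = [title for (title,) in rows]
    return result, calls

def titles_single_query(conn):
    """The same data fetched with one join."""
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a "
        "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1

looped, looped_calls = titles_query_per_author(conn)
joined, joined_calls = titles_single_query(conn)
print(looped_calls, joined_calls, looped == joined)
```

With two authors that is three statements versus one; with a thousand rows on a page it is a thousand and one network round trips versus one. (An inner join drops authors without books; use a LEFT JOIN if those must appear.)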


For SQL-based databases:

  1. Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
  2. Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
  3. Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
  4. Not reorganizing (i.e. unloading, dropping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
  5. Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
  6. Choosing the wrong datatype for a column!
  7. Not choosing a proper column name.
  8. Not adding new columns at the end of the table.
  9. Not creating proper indexes to support frequently used queries.
  10. Creating indexes on columns with few possible values and creating unnecessary indexes.
    ...more to be added.

I'd like to add: Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.

Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.


Poor Performance Caused by Correlated Subqueries

Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.

Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.

select e.first_name, e.last_name
from employee e
where e.start_date > 
        (select max(ds.transaction_date)
         from daily_sales ds
         where ds.store_id = e.store_id and
               ds.total < 10000)

The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.

select e.first_name, e.last_name
from employee e,
     (select ds.store_id,
             max(ds.transaction_date) transaction_date
      from daily_sales ds
      where ds.total < 10000
      group by ds.store_id) dsx
where e.store_id = dsx.store_id and
      e.start_date > dsx.transaction_date

In this example, the query in the from clause is now an inline-view (again some Oracle specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster, and its advantage over the first query would grow as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of the stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.

I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
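
Besides comparing plans, it is worth checking that the rewrite returns the same rows. Here is a rough sketch of that check using Python's sqlite3 module instead of Oracle, with the two queries from above translated to explicit JOIN syntax and a handful of invented rows (dates stored as ISO text so they compare correctly).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (first_name TEXT, last_name TEXT,"
    " store_id INTEGER, start_date TEXT)"
)
conn.execute(
    "CREATE TABLE daily_sales (store_id INTEGER, transaction_date TEXT, total INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        (1, "2020-01-10", 5000), (1, "2020-03-01", 20000),
        (2, "2020-02-15", 8000), (2, "2020-02-20", 9000),
        (3, "2020-01-01", 50000),  # store 3 never had a day under 10000
    ],
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        ("Ann", "A", 1, "2020-02-01"), ("Bob", "B", 1, "2020-01-05"),
        ("Cy", "C", 2, "2020-02-18"), ("Di", "D", 2, "2020-03-01"),
        ("Ed", "E", 3, "2020-02-01"),
    ],
)

correlated = conn.execute(
    "SELECT e.first_name, e.last_name FROM employee e "
    "WHERE e.start_date > (SELECT MAX(ds.transaction_date) "
    "                      FROM daily_sales ds "
    "                      WHERE ds.store_id = e.store_id AND ds.total < 10000) "
    "ORDER BY e.last_name"
).fetchall()

inline_view = conn.execute(
    "SELECT e.first_name, e.last_name FROM employee e "
    "JOIN (SELECT ds.store_id, MAX(ds.transaction_date) AS transaction_date "
    "      FROM daily_sales ds WHERE ds.total < 10000 "
    "      GROUP BY ds.store_id) dsx ON e.store_id = dsx.store_id "
    "WHERE e.start_date > dsx.transaction_date "
    "ORDER BY e.last_name"
).fetchall()

print(correlated == inline_view, inline_view)
```

Note the store with no qualifying sales day: the correlated form compares against NULL and the join form finds no match, so both drop that employee.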


There is one thing I might add: learn to use analytic functions such as RANK and DENSE_RANK with OVER (PARTITION BY ...) (for Oracle). They are absolutely essential for complex queries.
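
A quick taste of what these do, sketched with Python's sqlite3 module rather than Oracle. This assumes the bundled SQLite is 3.25 or newer, which is when window functions were added; the sales table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300), ("east", "bob", 500), ("east", "cy", 500),
        ("west", "di", 200), ("west", "ed", 100),
    ],
)

# Rank each rep within their own region, without a self-join or a loop.
# RANK leaves a gap after ties; DENSE_RANK does not.
ranked = conn.execute(
    "SELECT region, rep, amount,"
    "       RANK()       OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,"
    "       DENSE_RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS dense "
    "FROM sales ORDER BY region, rnk, rep"
).fetchall()

for row in ranked:
    print(row)
```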

Other advice is, if possible, to have a dedicated database developer in your development team who is expert in SQL, database modelling, tuning, etc. (Not a DBA though). Such skill is a great asset.


Here is a link to video called ‘Classic Database Development Mistakes and five ways to overcome them’ by Scott Walz


  • Very large transactions, inserting/updating a lot of data and then reloading it. Basically this is down to not considering the multi-user environment the database works in.

  • Overuse of functions, specifically as results in selects and in where clauses which causes the function to be called over and over again for the results. This, I think, fits under the general case of them trying to work in the procedural fashion they're more used to rather than use SQL to its full advantage.


  1. Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.

  2. Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.

  3. Failure to properly discern between work that should be done in the database, and work that should be done in the app.

  4. Not validating backups, or not backing up.

  5. Embedding raw SQL in their code.


In my experience:
Not communicating with experienced DBAs.


  • Not taking a backup before fixing some issue inside production database.

  • Using DDL commands on stored objects (like tables, views) in stored procedures.

  • Fear of using stored proc or fear of using ORM queries wherever the one is more efficient/appropriate to use.

  • Ignoring the use of a database profiler, which can tell you exactly what your ORM query is being converted into finally and hence verify the logic or even for debugging when not using ORM.


a) Hardcoding query values in string
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application

I have seen both.


Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)


If you are using replication (MySQL), the following functions are unsafe unless you are using row-based replication.

USER(), CURRENT_USER() (or CURRENT_USER), UUID(), VERSION(), LOAD_FILE(), and RAND()

See: http://dev.mysql.com/doc/refman/5.1/en/replication-features-functions.html


Not using indexes.


Key database design and programming mistakes made by developers

  • Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity make it hard for third parties to work with the data and can substantially increase the system's life cycle costs. Reporting and MIS tend to be a poor cousin in application design and are only done as an afterthought.

  • Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.

  • Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.

  • Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.

  • Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.


Not paying enough attention to managing database connections in your application. Then you find out the application, the computer, the server, and the network are clogged.


Well, I would have to say that the biggest mistake application developers make is not properly normalizing the database.

As an application developer myself, I realize the importance of proper database structure, normalization, and maintenance; I have spent countless hours educating myself on database structure and administration. In my experience, whenever I start working with a different developer, I usually have to restructure the entire database and update the app to suit because it is usually malformed and defective.

For example, I started working with a new project where the developer asked me to implement Facebook Connect on the site. I cracked open the database to see what I had to work with and saw that every little bit of information about any given user was crammed into one table. It took me six hours to write a script that would organize the table into four or five separate tables and another two to get the app to use those tables. Please, normalize your databases! It will make everything else less of a headache.


I think the biggest mistake that developers and DBAs make is believing too much in conventions. What I mean is that conventions are only guidelines that will work in most cases, but not necessarily always. A great example is normalization and foreign keys. I know most people won't like this, but normalization can add complexity and cost performance as well, so if there is no reason to move a phone number to a phones table, don't do it. Foreign keys are great in most cases, but if you are trying to create something that can work by itself when needed, the foreign key will be a problem in the future, and you also lose performance. Anyway, as I said, rules and conventions are there to guide; they should always be thought of but not necessarily implemented. Analysis of each case is what should always be done.


1 - Unnecessarily using a function on a value in a where clause, with the result that the index is not used.

Example:

where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate

instead of

where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1

And to a lesser extent: Not adding functional indexes to those values that need them...
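
You can watch this happen in a query plan. The sketch below uses Python's sqlite3 module, with strftime standing in for Oracle's to_char and a made-up events table; the exact plan text depends on the SQLite version, but the scan versus search distinction is the point.

```python
import sqlite3

def plan(conn, sql, params=()):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, some_date TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX idx_events_date ON events (some_date)")

# Function wrapped around the column: the index on some_date cannot be
# used to narrow the range, so every row is visited.
wrapped = plan(
    conn,
    "SELECT payload FROM events "
    "WHERE strftime('%Y%m%d', some_date) BETWEEN ? AND ?",
    ("20240101", "20240131"),
)

# Bare column compared against converted parameters: the index is usable.
bare = plan(
    conn,
    "SELECT payload FROM events WHERE some_date >= ? AND some_date < ?",
    ("2024-01-01", "2024-02-01"),
)

print(wrapped)
print(bare)
```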

2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
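
A minimal sketch of a check constraint doing its job, using Python's sqlite3 module and a hypothetical order_lines table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines ("
    " id INTEGER PRIMARY KEY,"
    " quantity INTEGER NOT NULL CHECK (quantity > 0))"
)

conn.execute("INSERT INTO order_lines (quantity) VALUES (3)")  # valid

bad_row_rejected = False
try:
    # No application bug, import script or ad-hoc fix can sneak this in.
    conn.execute("INSERT INTO order_lines (quantity) VALUES (-1)")
except sqlite3.IntegrityError:
    bad_row_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
print(bad_row_rejected, row_count)
```

The invariant now holds for every writer, not just the ones that go through your validation code.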

3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.

Think of this: a table without data is very cheap to redesign. A table with a couple of million records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.

4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.


I hate it when developers use nested select statements or even functions that return the result of a select statement inside the "SELECT" portion of a query.

I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although @adam has a similar issue indicated.

Example:

SELECT
    (SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
    ,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
    MyTable c

In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.

Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become dramatically more costly as more data is added to the tables.

A better (not necessarily perfect) example would be something like:

SELECT
     s.SomeValue As FirstVal
    ,o.OtherValue As SecondVal
FROM
    MyTable c
    LEFT JOIN (
        SELECT SomeDate, MAX(SomeValue) as SomeValue
        FROM SomeTable 
        GROUP BY SomeDate
     ) s ON c.Date = s.SomeDate
    LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria

This allows the database optimizer to join the data sets together rather than requerying for each record from the main table. When I have to fix code where this problem has been created, I usually end up increasing the speed of the queries by 100% or more while simultaneously reducing CPU and memory usage.


Using Excel for storing (huge amounts of) data.

I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65,536 on previous versions of Excel).


Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.


Biggest mistake is having a loop in the code updating or inserting data when a simple set-based solution would do the trick much faster, and much more simply.
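
To illustrate, here is a sketch of the same price change done both ways with Python's sqlite3 module. The products table and both helper functions are made up; prices are kept in integer cents so the two results can be compared exactly.

```python
import sqlite3

def make_db():
    """Build a small throwaway database with identical starting data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        " id INTEGER PRIMARY KEY, category TEXT, price_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(1, "book", 1000), (2, "book", 2500), (3, "toy", 700)],
    )
    conn.commit()
    return conn

def add_surcharge_loop(conn, category, cents):
    """Row-by-row: one SELECT, then one UPDATE per matching row."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM products WHERE category = ?", (category,)
        ).fetchall()
    ]
    for product_id in ids:
        conn.execute(
            "UPDATE products SET price_cents = price_cents + ? WHERE id = ?",
            (cents, product_id),
        )
    conn.commit()
    return 1 + len(ids)  # statements sent to the database

def add_surcharge_set(conn, category, cents):
    """Set-based: a single UPDATE describes the whole change."""
    conn.execute(
        "UPDATE products SET price_cents = price_cents + ? WHERE category = ?",
        (cents, category),
    )
    conn.commit()
    return 1

def prices(conn):
    return conn.execute(
        "SELECT id, price_cents FROM products ORDER BY id"
    ).fetchall()

loop_db, set_db = make_db(), make_db()
loop_statements = add_surcharge_loop(loop_db, "book", 100)
set_statements = add_surcharge_set(set_db, "book", 100)
print(loop_statements, set_statements, prices(loop_db) == prices(set_db))
```

Both end in the same state, but the loop's statement count grows with the number of rows while the set-based version stays at one.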


  1. Not using version control on the database schema
  2. Working directly against a live database
  3. Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
  4. Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance

Over-use and/or dependence on stored procedures.

Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers, (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.

Stored procedures are useful where it has actually been proven that some real technical factor necessitates their use (for example, performance and security), such as keeping aggregation/filtering of large data sets "close to the data".

I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficulty of introducing effective unit testing to TSQL, lack of encapsulation and poor tools (debuggers, editors).

Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".

These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).