[sql] How does database indexing work?

Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?

For information on queries to index a field, check out How do I index a database column.

This question is related to sql database performance indexing database-indexes

The answer is

Just a quick suggestion.. As indexing costs you additional writes and storage space, so if your application requires more insert/update operation, you might want to use tables without indexes, but if it requires more data retrieval operations, you should go for indexed table.

Now, let’s say that we want to run a query to find all the details of any employees who are named ‘Abc’?

SELECT * FROM Employee 
WHERE Employee_Name = 'Abc'

What would happen without an index?

Database software would literally have to look at every single row in the Employee table to see if the Employee_Name for that row is ‘Abc’. And, because we want every row with the name ‘Abc’ inside it, we can not just stop looking once we find just one row with the name ‘Abc’, because there could be other rows with the name Abc. So, every row up until the last row must be searched – which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name ‘Abc’. This is what is called a full table scan

How a database index can help performance

The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined. An index is a data structure (most commonly a B- tree) that stores the values for a specific column in a table.

How does B-trees index work?

The reason B- trees are the most popular data structure for indexes is due to the fact that they are time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B- tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS’s, you can actually specify which data structure you want your database to use when you create the index itself.

How does a hash table index work?

The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index.

For instance, the query we discussed earlier could benefit from a hash index created on the Employee_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like “Abc => 0x28939", where 0x28939 is a reference to the table row where Abc is stored in memory. Looking up a value like “Abc” in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of “Abc” in the Employee_Name column.

The disadvantages of a hash index

Hash tables are not sorted data structures, and there are many types of queries which hash indexes can not even help with. For instance, suppose you want to find out all of the employees who are less than 40 years old. How could you do that with a hash table index? Well, it’s not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality

What exactly is inside a database index? So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.

How does a database know when to use an index? When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Abc’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table.

What is the cost of having a database index?

It takes up space – and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up to the minute data as whatever is in the table column(s) that the index covers.

As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.

See also

  1. What columns generally make good indexes?
  2. How do database indexes work

An index is just a data structure that makes the searching faster for a specific column in a database. This structure is usually a b-tree or a hash table but it can be any other logic structure.

The first time I read this it was very helpful to me. Thank you.

Since then I gained some insight about the downside of creating indexes: if you write into a table (UPDATE or INSERT) with one index, you have actually two writing operations in the file system. One for the table data and another one for the index data (and the resorting of it (and - if clustered - the resorting of the table data)). If table and index are located on the same hard disk this costs more time. Thus a table without an index (a heap) , would allow for quicker write operations. (if you had two indexes you would end up with three write operations, and so on)

However, defining two different locations on two different hard disks for index data and table data can decrease/eliminate the problem of increased cost of time. This requires definition of additional file groups with according files on the desired hard disks and definition of table/index location as desired.

Another problem with indexes is their fragmentation over time as data is inserted. REORGANIZE helps, you must write routines to have it done.

In certain scenarios a heap is more helpful than a table with indexes,

e.g:- If you have lots of rivalling writes but only one nightly read outside business hours for reporting.

Also, a differentiation between clustered and non-clustered indexes is rather important.

Helped me:- What do Clustered and Non clustered index actually mean?

Simple Description!

The index is nothing but a data structure that stores the values for a specific column in a table. An index is created on a column of a table.

Example: We have a database table called User with three columns – Name, Age and Address. Assume that the User table has thousands of rows.

Now, let’s say that we want to run a query to find all the details of any users who are named 'John'. If we run the following query:

WHERE Name = 'John'

The database software would literally have to look at every single row in the User table to see if the Name for that row is ‘John’. This will take a long time.

This is where index helps us: index is used to speed up search queries by essentially cutting down the number of records/rows in a table that needs to be examined.

How to create an index:

CREATE INDEX name_index
ON User (Name)

An index consists of column values(Eg: John) from one table, and those values are stored in a data structure.

So now the database will use the index to find employees named John because the index will presumably be sorted alphabetically by the Users name. And, because it is sorted, it means searching for a name is a lot faster because all names starting with a “J” will be right next to each other in the index!

Just think of Database Index as Index of a book.

If you have a book about dogs and you want to find an information about let's say, German Shepherds, you could of course flip through all the pages of the book and find what you are looking for - but this of course is time consuming and not very fast.

Another option is that, you could just go to the Index section of the book and then find what you are looking for by using the Name of the entity you are looking ( in this instance, German Shepherds) and also looking at the page number to quickly find what you are looking for.

In Database, the page number is referred to as a pointer which directs the database to the address on the disk where entity is located. Using the same German Shepherd analogy, we could have something like this (“German Shepherd”, 0x77129) where 0x77129 is the address on the disk where the row data for German Shepherd is stored.

In short, an index is a data structure that stores the values for a specific column in a table so as to speed up query search.

Classic example "Index in Books"

Consider a "Book" of 1000 pages, divided by 10 Chapters, each section with 100 pages.

Simple, huh?

Now, imagine you want to find a particular Chapter that contains a word "Alchemist". Without an index page, you have no other option than scanning through the entire book/Chapters. i.e: 1000 pages.

This analogy is known as "Full Table Scan" in database world.

enter image description here

But with an index page, you know where to go! And more, to lookup any particular Chapter that matters, you just need to look over the index page, again and again, every time. After finding the matching index you can efficiently jump to that chapter by skipping the rest.

But then, in addition to actual 1000 pages, you will need another ~10 pages to show the indices, so totally 1010 pages.

Thus, the index is a separate section that stores values of indexed column + pointer to the indexed row in a sorted order for efficient look-ups.

Things are simple in schools, isn't it? :P

Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to database

Implement specialization in ER diagram phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' cannot be loaded Room - Schema export directory is not provided to the annotation processor so we cannot export the schema SQL Query Where Date = Today Minus 7 Days MySQL Error: : 'Access denied for user 'root'@'localhost' SQL Server date format yyyymmdd How to create a foreign key in phpmyadmin WooCommerce: Finding the products in database TypeError: tuple indices must be integers, not str

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to indexing

numpy array TypeError: only integer scalar arrays can be converted to a scalar index How to print a specific row of a pandas DataFrame? What does 'index 0 is out of bounds for axis 0 with size 0' mean? How does String.Index work in Swift Pandas KeyError: value not in index Update row values where certain condition is met in pandas Pandas split DataFrame by column value Rebuild all indexes in a Database How are iloc and loc different? pandas loc vs. iloc vs. at vs. iat?

Examples related to database-indexes

How does database indexing work?