This article explains the concept of an index and Primary Key and its utility in databases.
Every time a query that contains a WHERE clause is issued, MySQL has to do a full scan of the table to find matching rows. A linear scan of the table isn’t very efficient. Adding an index to the table can significantly speed-up search for matching rows. An index’s primary purpose is to provide an ordered representation of indexed data. An index works similarly to how a book index works. Say if you are looking for a particular word, you can scan the index and find the page numbers where the word appears. This saves you time as now you don’t have to flip through every page in the book. Since a book index is sorted in alphabetical order, you can visually very quickly narrow down the portion of the index to inspect. If you are looking for a topic starting with the letter ‘M’, you can jump to the part of the index with all the topics starting with the letter ‘M’. However, the trade-off is that the index itself takes up a few pages of the book.
-- The article queries are reproduced below for convenient copy/paste into the terminal.
-- Query 1
SHOW INDEX FROM Actors;
-- Query 2ANALYZE TABLE Actors;
SHOW INDEX FROM Actors;
-- Query 3
INSERT INTO Actors (Id, FirstName, SecondName,DoB, Gender, MaritalStatus, NetWorthInMillions)
INSERT INTO Actors (Id, FirstName, SecondName,DoB, Gender, MaritalStatus, NetWorthInMillions)
INSERT INTO Actors (Id, FirstName, SecondName,DoB, Gender, MaritalStatus, NetWorthInMillions)
This Articles Contents
SHOW INDEX FROM tablename;
We can use the following query to display the indexes on a table:
SHOW INDEX FROM Actors;
The cardinality shows the number of unique values for the primary key. It’s also described as an estimate of the number of unique values in the index and may not be exact for smaller tables. Since we have eleven actors in the Actors table we have cardinality as 11. However, if you execute the above query in the console, the cardinality may not come out to be 11. Cardinality is counted based on statistics stored as integers, so the value is not necessarily exact even for small tables. However, if you execute the following command and then check for cardinality, it should come out to be exact.
ANALYZE TABLE Actors;
SHOW INDEX FROM Actors;
All columns that make up the primary key must be non-null.
Index
Before proceeding further let’s understand the finer details and the general concept of an index. There are two kinds of indexes:
- Clustered
- Non-clustered
Clustered Index
When we insert rows into a table they are written to the underlying physical storage medium such as the hard disk. In the case of a clustered index, the table rows are sorted and kept in a B-tree structure (or an R- tree in the case of a spatial index). The physical arrangement of the rows on the disk mimics the logical order defined by the index. There’s no separate data-structure that holds the table’s rows. It is a way to inform the database system to cluster values physically close to each other on the disk that are close to each other in the index order.
B-trees and their modified version, B+ trees, are common data-structures used to store data on disks. Very large tables can’t fit into the main memory of the computer and only portions of such tables are brought from the hard disk to main memory. Searching for the required data becomes easy with B-trees as they guarantee a fixed number of disk reads since they are balanced. A B-tree consists of a root node, branch nodes, and then leaf nodes. An example B-tree looks like as follows:
In a B+ tree, the data only lives in the leaf nodes. The root and the internal nodes contain the key on which the B+ tree is sorted. MySQL stores rows in an entity called a page. A page is the smallest unit of data that a database can write to or read from a disk. A page contains rows and forms the leaf node of the B+ tree. An approximation of how the rows from the Actors table will appear as in a B+ tree with the first name as the key is shown below:
The above diagram shows the nodes are linked to each other via next and back pointers. Once we seek the desired data, we can easily navigate the index using these pointers. This is especially useful in range queries. For instance, if we wanted to select all the actors with names greater than Kylie in alphabetical order, we’ll seek to the leaf node with the key-value Kylie, and then blindly follow the forward pointers until the end of the index to gather all the rows satisfying the query.
Remember the database works with pages and not rows. When a page is read into memory, all the rows in that page are loaded into memory. In MySQL the default size of a page is fixed at 16KB though it is configurable. A collection of pages forms an extent and a collection of extents forms a segment. Segments in turn form a tablespace. A tablespace consists of tables and their associated indexes. There is a tablespace called the system tablespace, and in older versions of MySQL, all user tables were also part of the system tablespace. With later MySQL versions a configuration can be specified to have a separate tablespace for each user table. Conceptually, the page, extent, and segment are related as shown below:
We’ll skip the details on pages, extents, etc., as we learned enough to continue our discussion on indexes. Rows are physically written out in the order of the chosen primary key. A primary key uniquely identifies each row in a table. Let’s take the example of our Actors table. We define the ID as the primary key constraint. The rows are stored in the order of the ID column. In other words, the clustered index is the table. As a test,we’ll add three rows to our Actors table. The row we add first will have a higher ID than the second and third rows. Next, we’ll execute a select query and examine the order of the rows returned.
INSERT INTO Actors (Id, FirstName, SecondName,DoB, Gender, MaritalStatus, NetWorthInMillions) VALUES (15, "First","Row", "1999-01-01", "Male", "Single",0.00);
INSERT INTO Actors (Id, FirstName, SecondName,DoB, Gender, MaritalStatus, NetWorthInMillions) VALUES (13, "Second","Row", "1999-01-01", "Male", "Single",0.00);
INSERT INTO Actors (Id, FirstName, SecondName,DoB, Gender, MaritalStatus, NetWorthInMillions) VALUES (12, "Third","Row", "1999-01-01", "Male", "Single",0.00);
Note that the first row we add appears last and the last row we add appears earlier than the first and the second rows. This is because the rows are retrieved in the order of the index. The leaf nodes of a clustered index contain the actual data called a page. We have learned that pages form the leaf nodes of the tree stored on the disk. Now we’ll see how rows are stored within each page.
Page
The pages of the B-tree split and merge as required. If a page is full and a new key is inserted, the existing page splits. Similarly, if enough rows are deleted from a page it may get merged with another. There are other intricate details around how pages are managed, but they aren’t important for this introductory course on MySQL. Within a leaf-node page, the records or rows exist as a singly linked-list. The linked list enforces the index order on the rows. A new row is added to the available free space within the page. The existing records in the page aren’t moved around in case the new row appears between existing rows in index order. Instead the new row is placed in the free space available within the page and the linked list pointers are manipulated to fix the order. This is depicted below:
A clustered index doesn’t imply that the data rows are contiguously stored on the hard disk, or in other words, the pages aren’t contiguous on the physical disk. The location on the disk where a row gets written to is the realm of the OS. A clustered index only ensures that the physical and logical order the rows appear in is the same.
Every table is stored as a clustered index with the primary key as the sort key in MySQL when the database engine is selected as InnoDB. A database engine is the software module that a database management system uses to create, read, update, and delete data from a database. In case of MySQL, we have InnoDB and MyISAM as examples of two popular storage engines. You can view the available storage engines as follows:
SHOW ENGINES;
MySQL creates a clustered index on the primary key. If no primary key is defined it looks for the first UNIQUE index with all the columns that form the key set as NOT NULL. If no UNIQUE index is available, MySQL generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing row ID values. The rows are ordered by the ID that gets assigned to each row. This also implies that there can only be one clustered index per table as the table rows can only be arranged in one order on the disk. All other indexes are secondary indexes.
Non-clustered Index
In the case of the clustered index, we saw that leaf-nodes consisted of pages that contained the actual data. On the contrary, in a non-clustered index, the leaf-nodes don’t hold the actual data, rather, a pointer to data stored elsewhere on the disk. If the selected database engine is MyISAM then the rows aren’t stored in sorted order; rather they appear as a heap without any ordering. Such a table is called a heap-table since it contains an unordered pile of data. The MyISAM database engine is based on ISAM (Indexed Sequential Access Method), an indexing algorithm developed by IBM that allows retrieving information from large sets of data in a fast way. In the image below, the Actors table appears as a heap and a non-clustered index contains pointers to rows.
The rows appear in the insertion order when the engine is MyISAM. All indexes are secondary indexes when using MyISAM as the database engine, even the primary index is a unique non-clustered index named primary. However, in case of InnoDB, the secondary index’s leaf nodes don’t point to the rows as they do in MyISAM. Remember that pages can be split and merged, and the rows can be physically moved around on the disk.
This could be catastrophic if the leaf nodes of an InnoDB secondary index pointed to row locations on disk. To avoid the cost of rewriting, the secondary index’s leaf nodes primary key values are stored on the leaf node rather than pointers to rows. This means we first seek the secondary index, get the primary key, and then use that primary key to navigate the primary index to locate the row. In contrast, a MyISAM secondary index can get the row data as soon as it reaches the leaf node of its index.
Cost of Indexing
An index doesn’t come for free. For one, it takes up additional disk space, and two, it needs to be modified whenever an insert or an update is made to the table.