More on SQL and MySQL : PHP

Choosing keys and indexes for fast searching
Elementary database-tuning techniques
Adding and deleting users of a DBMS, and changing user permissions
Limitations of MySQL

Keys, Primary Keys, and Indexes

As discussed earlier in our introduction to SQL, each table should have a PRIMARY KEY definition as part of the CREATE TABLE statement. A primary key is an attribute-or set of attributes-that uniquely identifies a row in a table. Storing two rows with the same primary key isn't permitted and, indeed, an attempt to INSERT duplicate primary keys produces an error.

In MySQL, the attribute values of the primary key are stored in an index to allow fast access to a row. The default MySQL index type is fast for queries that find a specific row, a range of rows, for joins between tables, grouping data, ordering data, and finding minimum and maximum values. Indexes don't provide any speed improvement for retrieving all the rows in a table or for other query types.

Indexes are also useful for fast access to rows by values other than those that are associated with attributes in the primary key. For example, in the customer table, you might define an index by adding the clause:

KEY namecity (surname,firstname,city)

to the CREATE TABLE statement. After you define this index, some queries that select a particular customer through a WHERE clause can use it. Consider an example:

SELECT * FROM customer
  WHERE surname = 'Marzalla' AND
  firstname = 'Dimitria' AND
  city = 'St Albans';

This query can use the new index to locate-in at most a few disk accesses-the row that matches the search criteria. Without the index, the DBMS must scan all the rows in the customer table and compare each row to the WHERE clause. This might be quite slow and certainly requires significantly more disk accesses than the index-based approach (assuming the table has more than a few rows).

A particular feature of DBMSs is that they develop a query evaluation strategy and optimize it without any interaction from the user or programmer. If an index is available, and it makes sense to use it in the context of a query, the DBMS does this automatically. All you need to do is identify which queries are common, and make an index available for those common queries by adding the KEY clause to the CREATE TABLE statement or using ALTER TABLE on an existing table.

Careful index design is important. The namecity index we have defined can also speed queries other than those that supply a complete surname, firstname, and city. For example, consider a query:

SELECT * FROM customer
  WHERE surname = 'LaTrobe' AND
  firstname = 'Anthony';

This query can also use the index namecity, because the index permits access to rows in sorted order first by surname, then firstname, and then city. With this sorting, all "LaTrobe, Anthony" index entries are clustered together in the index. Indeed, the index can also be used for the query:

SELECT * FROM customer
  WHERE surname LIKE 'Mar%';

Again, all surnames beginning with "Mar" are clustered together in the index. However, the index can't be used for a query such as:

SELECT * FROM customer
  WHERE firstname = 'Dimitria' AND
  city = 'St Albans';

The index can't be used because the leftmost attribute named in the index, surname, isn't part of the WHERE clause. In this case, all rows in the customer table must be scanned and the query is much slower (again assuming there are more than a few rows in the customer table, and assuming there is no other index).

Careful choice of the order of attributes in a KEY clause is important. For an index to be usable in a query, the leftmost attribute must appear in a WHERE clause.

There are other cases in which an index can't be used, such as when a query contains an OR that isn't on an indexed attribute:

SELECT * FROM customer
  WHERE surname = 'Marzalla' OR
  email = 'dimitria@lucaston.com';

Again, the customer table must be completely scanned, because the second condition, email='dimitria@lucaston.com', requires all rows to be retrieved as there is no index available on the attribute email. Also, the case where the ORed attribute isn't the leftmost attribute in an index requires a complete scan of the customer table. The following example requires a complete scan:

SELECT * FROM customer
WHERE firstname = 'Dimitria' OR
surname = 'Marzalla';

If all the attributes in the index are used in all the queries, to optimize index size, the leftmost attribute in the KEY clause should be the attribute with the highest number of duplicate entries.

Because indexes speed up queries, why not create indexes on all the attributes you can possibly search on? The answer is that while indexes are fast for searching, they consume space and require updates each time rows are added or deleted, or key attributes are changed. So, if a database is largely static, additional indexes have low overheads, but if a database changes frequently, each additional index slows the update process significantly. In either case, indexes consume additional space, and unnecessary indexes should be avoided.

One way to reduce the size of an index and speed updates is to create an index on a prefix of an attribute. Our namecity index uses considerable space: for each row in the customer table, an index entry is up to 120 characters in length because it is created from the combined values of the surname, firstname, and city attributes.^[2] To reduce space, you can define the index as:

^[2] This isn't the space actually required by an index entry, because the data is compressed for storage. However, even with compression, the fewer characters indexed, the more compact the representation, the more space saved, and-depending on the usability of the index-the faster searching and updates are.

KEY namecity (surname(10),firstname(3),city(2));

This uses only the first 10 characters of surname, 3 of firstname, and the first 2 characters of city to distinguish index entries. This is quite reasonable, because 10 characters from a surname distinguishes between most surnames, and the addition of a few characters from a first name and the prefix of their city should be sufficient to uniquely identify almost all customers. Having a smaller index with less information can also mean that queries are actually faster, because more index information can be retrieved from disk per second, and disk retrieval speed is almost always the bottleneck in query performance.

The space saving is significant with a reduced index. A new index entry requires only 15 characters, a saving of up to 105 characters, so index insertions, deletions, and modifications are now likely to be much faster. Note that for TEXT and BLOB attribute types, a prefix must be taken when indexing, because indexing the entire attribute is impractical and isn't permitted by the MySQL DBMS.