Data Model Reference
Purpose
The purpose of this whitepaper is to discuss and describe the terms "Surrogate Key" and "Natural Key", and to explain when either would be appropriate to use in a data model. Furthermore, the concepts of "Intelligence" and "Generated" will be applied to both types of keys. The goal is for the reader to have a good understanding of the plusses and minuses of all forms of uniquely identifying records in a database system. The reader will also be challenged to think critically about the business situation before adopting any given identification methodology.
Terms Defined
- Surrogate Key
- A surrogate key is a column or columns that are not defined by business requirements, but are added to a table simply to uniquely identify records. For example, the business might define a set of related attributes they want to keep track of. In that list, there may be a number of items which could be considered unique (candidate keys), but none which appeared to be stable enough to be the primary identifier for the set of attributes. An additional attribute might be suggested by the data analyst, in the form of the Entity Name + the text "Identifier". For instance, for an entity called "Survey", this would be "Survey Identifier". There is no business purpose or meaning for this new attribute, and therefore the resulting table column would be more stable from business changes.
- Natural Key
- A natural key is a column or columns that are chosen by the business to uniquely identify records, and are allowed by the data analyst because they are considered to be stable enough to become the primary identifier. For instance, an existing business process might define a Product Code that uniquely identifies equipment within the company. By selecting this as the Equipment primary key, the business more readily accepts the resulting database structures.
- Intelligent Key
- An intelligent key is called this because there is intelligence behind the values assigned to the columns in the key. For instance, a Product Code may be a ten character field (defined as CHAR(10) in the DBMS). This may be mapped as follows: the first 2 characters are the business unit, the next 2 indicate the month of manufacture, the next 3 are a product type code, the last 3 are a sequence number. This specific case could also be called a concatenated key, since it is actually a combination of 4 separate fields into one single field.
- Generated Key
- A generated key has no intelligence to the contents, and is typically assigned a value by the computer through some internal mechanism. This could be a sequential or random number generator or some other method. The point is that the business does not care about the value assigned to any given key, just that a unique value is assigned and can be referred to by dependent tables. For example, an Invoice Number might be defined as a 10 digit number, and assigned values sequentially, beginning at some starting point and incrementing by one for each new record added to the Invoice table.
The Keys in Use
In most databases, relationships between tables are handled in two primary ways, and both can be found. First is the single column key. In this method, each table is assigned a single primary key column, all foreign keys are also a single column, and all referential integrity is on the pairs of columns. The second method is to have the child table inherit the primary key column(s) from the parent as part of its own primary key, as a dependent of the parent table. In this case, the key column of the child table is not sufficient to uniquely identify the records, so needs the parent primary key plus the child table primary key.
Single column primary keys are well suited to surrogate keys, since the surrogate key has no business meaning and is stable by itself. They are also typically system generated. Child tables also have a single unique primary key column. The reference from the child to the parent is on the single parent column carried as a mandatory foreign key column in the child. A child of the child would also have its own unique primary key column, and so would have a single column foreign key to the child table.
Dependent table configurations are suited for natural keys. The parent table has whatever columns are chosen to uniquely identify the record as the primary key. The child table's primary key is composed of the parent primary key plus a column from the child that makes each record unique. Since the natural keys are defined by the business, there is a good chance that at least one of them will be an intelligent key.
It is extremely rare to see just single column primary keys in a database, or for all tables to depend on their parents for parts of their primary keys. Rather, some thought is given to a judicious mix of single and multi column primary keys. For instance, associative tables typically have their primary key defined as the combination of the primary keys of the parent tables, which may themselves be single column primary key tables.
Strengths and Weaknesses
A design that favors single column surrogate primary keys will have very efficient joins between tables, as most of them will be single column joins. Tables also will be much narrower than their natural key cousins, meaning much less dead weight. However, getting from a deeply nested great, great, great grandchild table to the ultimate parent table may take some serious navigation. Read that as complex queries. Relational database engines are optimized for these kinds of queries, and the overall reduction in table dead weight tends to offset the complexity of the queries.
The natural key structure is, simply enough, more natural to deal with, for the developer and the superuser accessing the tables through their favorite query tool. Child tables make sense intuitively, since the primary key columns can be used to discover the lineage. Unfortunately, this comes at a fairly steep price. First, if any key is intelligent, it is going to be almost everywhere, and a business change will be devastating. Take the case of a data type change because the Product Code described above has maxed out the 3 digit sequence code. Long ago they started using letters after they passed 999, and now even that is not enough. They want to make the last part of the Product Code 5 digits, increasing the Product Code from CHAR(10) to CHAR(12). Every table that uses that Product Code as part of the primary key will have to be structurally changed. Many databases will not let that change be made casually, but typically require the data to be unloaded, the table dropped and rebuilt, then the data reloaded. In large database systems, this could be a monumental undertaking.
Best Practices
So, what is the best thing to do? Well, that depends, of course. There is no absolute mandate that says only one way to manage keys is the accepted way. However, to minimize the impact of future business changes, there are a few rules of thumb that have a good track record.
- Make every attempt in the initial design to find an absolutely stable primary identifier. Press the business to be honest and answer if there is any possibility that the natural key they recognize could ever change in value or data type. Be ready with arguments to support using surrogate keys. They are inherently stable, perform very well in joins and referential integrity, and provide simple joins from table to table.
- Establish a maximum limit for the number of nested dependent tables. One published guide states that 5 levels of dependent tables is a red flag. Reevaluate the tables and consider introducing surrogate keys somewhere along the chain.
- If using single column primary keys, remember that deeply nested tables may need a direct foreign key link to a great grandparent table. This can provide a shortcut up the chain if the application and data access tendencies require it.
NOTE: Yes, this does introduce problems, and is a violation of at least 2nd normal form. However some denormalization may be tollerated because database engines aren't perfect, and the complex joins might perform poorly.- Don't automatically assume that multiple natural keys are bad. Very many situations, such as Invoice and Invoice Detail, Order and Order Detail tables show an obvious place to use the business understood Line Number as the Detail table primary key, combined with the parent table primary key. Trying to force a surrogate key on the Detail tables would be cumbersome and a relatively poor performer.
- As stated already, associative tables typically use the combined keys of the parent tables as the primary key. However, if the associative table is itself a parent of many other tables, be ready to come up with a single column primary key for it.
Disclaimer and Notes
Your mileage may vary. In other words, if you disagree, great. Please provide your feedback to the webmaster regarding the information I have provided. I am always wanting to learn, and heck, I might very well have left out some very important details, considering I wrote this whole thing in about an hour off the top of my head. My hat is off to every DBA who has to maintain a database they inherited with the Product Code from hell delima I described. I actually went through a similar ordeal in one of the companies I worked for. My sympathies to anyone who has had to make massive changes throughout an entire DB2 subsystem because of an intelligent key that was anything but.
I would like to add illustrations to the above paper. If you have ideas towards that end, please let me know.
Visitors on this page: [an error occurred while processing this directive]
Last Updated: [an error occurred while processing this directive]