Database "key/id" Design Ideas, Surrogate Key, Primary Key, Etc

September 16, 2024

So I've seen several mentions of a surrogate key lately, and I'm not really sure what it is and how it differs from a primary key. I always assumed that ID was my primary key in a

Solution 1:

No, your ID can be both: it is a surrogate key (which just means it's not "derived from application data", i.e. it's an artificial key), and it should be your primary key, too.

The primary key is used to uniquely and safely identify any row in your table. It has to be stable, unique, and NOT NULL - an "artificial" ID usually has those properties.

I would normally recommend against using "natural" or real data for primary keys - are you REALLY 150% sure it's NEVER going to change? The Swiss equivalent of the SSN, for instance, changes each time a woman marries (or gets divorced) - hardly an ideal candidate. And it's not guaranteed to be unique, either.

To spare yourself all that grief, just use a surrogate (artificial) ID that is system-defined, unique, and never changes and never has any application meaning (other than being your unique ID).
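To illustrate the point about stability, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (person, address, national_id) are made up for the example and are not from the question. The "natural" value changes, but the surrogate id, and every child row pointing at it, stays put.

```python
import sqlite3

def demo_stable_surrogate():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript("""
        CREATE TABLE person (
            id          INTEGER PRIMARY KEY,  -- surrogate key, no application meaning
            national_id TEXT NOT NULL         -- hypothetical "natural" value that may change
        );
        CREATE TABLE address (
            id        INTEGER PRIMARY KEY,
            person_id INTEGER NOT NULL REFERENCES person(id),
            city      TEXT NOT NULL
        );
    """)
    conn.execute("INSERT INTO person (id, national_id) VALUES (1, 'OLD-123')")
    conn.execute("INSERT INTO address (person_id, city) VALUES (1, 'Bern')")

    # The real-world identifier changes; only one column in one row is touched.
    conn.execute("UPDATE person SET national_id = 'NEW-456' WHERE id = 1")

    # The child row still joins to its parent through the unchanged surrogate id.
    row = conn.execute("""
        SELECT p.id, p.national_id, a.city
        FROM person p JOIN address a ON a.person_id = p.id
    """).fetchone()
    conn.close()
    return row
```

Had national_id been the primary key, the same change would have had to be propagated to every table referencing it.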

Scott Ambler has a pretty good article here which has a "glossary" of all the various keys and what they mean - you'll find natural, surrogate, primary key and a few more.

Solution 2:

First, a Surrogate key is a key that is artificially generated within the database, as a unique value for each row in a table, and which has no dependency whatsoever on any other attribute in the table.

Now, the phrase Primary Key is a red herring. Whether a key is the primary key or an alternate key doesn't mean anything. What matters is what the key is used for. Keys can serve two functions which are fundamentally inconsistent with one another.

They are first and foremost there to ensure the integrity and consistency of your data! Each row in a table represents an instance of whatever entity that table is defined to hold data for. No Surrogate Key, by definition, can ever perform this function. Only a properly designed natural Key can do this. (If all you have is a surrogate key, you can always add another row with every other attribute exactly identical to an existing row, as long as you give it a different surrogate key value.)
Secondly, they are there to serve as references (pointers) for the foreign Keys in other tables which are children entities of an entity in the table with the Primary Key. A Natural Key (especially if it is a composite of multiple attributes) is not a good choice for this function because it would mean that A) the foreign keys in all the child tables would also have to be composite keys, making them very wide, and thereby decreasing performance of all constraint operations and of SQL Joins, and B) if the value of the key changed in the main table, you would be required to do cascading updates on every table where the value was represented as a FK.

So the answer is simple... Always (wherever you care about data integrity/consistency) use a natural key and, where necessary, use both! When the natural key is a composite, or long, or not stable enough, add an alternate Surrogate key (an auto-incrementing integer, for example) for use as the target of FKs in child tables. But at the risk of losing data consistency of your table, DO NOT remove the natural key from the main table.

To make this crystal clear let's make an example. Say you have a table with Bank accounts in it... A natural Key might be the Bank Routing Number and the Account Number at the bank. To avoid using this twin composite key in every transaction record in the transactions table you might decide to put an artificially generated surrogate key on the BankAccount table which is just an integer. But you had better keep the natural Key! If you did not also have the composite natural key, you could quite easily end up with two rows in the table as follows:

id  BankRoutingNumber BankAccountNumber   BankBalance
 1     12345678932154   9876543210123       $123.12
 2     12345678932154   9876543210123    ($3,291.62)

Now, which one is right?
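A minimal sketch of the "use both" approach for this bank account case, written with Python's sqlite3 module. The table and column names are adapted from the example above and are otherwise assumptions; the balance is stored in cents just to keep the example simple. The integer id remains the primary key (and FK target), while a UNIQUE constraint on the natural key stops the second row from ever getting in.

```python
import sqlite3

def insert_accounts():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE bank_account (
            id             INTEGER PRIMARY KEY,   -- surrogate key, target for FKs
            routing_number TEXT NOT NULL,
            account_number TEXT NOT NULL,
            balance_cents  INTEGER NOT NULL,
            UNIQUE (routing_number, account_number)  -- natural key kept as a constraint
        )
    """)
    insert = """
        INSERT INTO bank_account (routing_number, account_number, balance_cents)
        VALUES (?, ?, ?)
    """
    conn.execute(insert, ("12345678932154", "9876543210123", 12312))

    # Second row for the same account: a new surrogate id would be generated,
    # but the UNIQUE constraint on the natural key rejects the insert.
    try:
        conn.execute(insert, ("12345678932154", "9876543210123", -329162))
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    count = conn.execute("SELECT COUNT(*) FROM bank_account").fetchone()[0]
    conn.close()
    return rejected, count
```

Without the UNIQUE line, both inserts would succeed and you would be back to the two-row table above.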

To marc, from the comments below: what good does it do you to be able to "identify the row"? No good at all, it seems to me, because what we need to be able to identify is which bank account the row represents! Identifying the row is only important for internal database technical functions, like joins in queries, or for FK constraint operations, which, if/when they are necessary, should be using a surrogate key anyway, not the natural key.

You are right in that a poor choice of a natural key, or sometimes even the best available choice of a natural key, may not be truly unique, or guaranteed to prevent duplicates. But any choice is better than no choice, as it will at least prevent duplicate rows for the same values in the attributes chosen as the natural key. These issues can be kept to a minimum by the appropriate choice of key attributes, but sometimes they are unavoidable and must be dealt with. But it is still better to do so than to allow incorrect, inaccurate, or redundant data into the database.

As to "ease of use": if all you are using the natural key for is to constrain the insertion of duplicate rows, and you are using another, surrogate, key as the target for FK constraints, I do not see any ease of use issues of concern.

Solution 3:

The reason that database purists get all up in arms about surrogate keys is that, if used improperly, they can allow data duplication, which is one of the evils that good database design is meant to banish.

For instance, suppose that I had a table of email addresses for a mailing list. I would want them to be unique, right? There's no point in having 2, 3, or n entries of the same email address. If I use email_address as my primary key ( which is a natural key -- it exists as data independently of the database structure you've created ), this will guarantee that I will never have a duplicate email address in my mailing list.

However, if I have a field called id as a surrogate key, then I can have any number of duplicate email addresses. This becomes bad if there are then 10 rows of the same email address, all with conflicting subscription information in other columns. Which one is correct, if any? There's no way to tell! After that point, your data integrity is borked. There's no way to fix the data but to go through the records one by one, asking people what subscription information is really correct, etc.
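Here is a rough sketch of that failure mode using Python's sqlite3 module. The subscriber table and its columns are hypothetical names chosen for the example. With only a surrogate id and no uniqueness rule on the email, conflicting rows are accepted silently, and the best you can do afterwards is query for them.

```python
import sqlite3

def find_duplicate_emails():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE subscriber (
            id               INTEGER PRIMARY KEY,  -- surrogate key only
            email_address    TEXT NOT NULL,        -- note: no UNIQUE constraint
            wants_newsletter INTEGER NOT NULL      -- 1 = yes, 0 = no
        )
    """)
    conn.executemany(
        "INSERT INTO subscriber (email_address, wants_newsletter) VALUES (?, ?)",
        [
            ("a@example.com", 1),
            ("a@example.com", 0),  # same address, conflicting preference
            ("b@example.com", 1),
        ],
    )
    # Every insert succeeded; list the addresses that now appear more than once.
    dupes = conn.execute("""
        SELECT email_address, COUNT(*)
        FROM subscriber
        GROUP BY email_address
        HAVING COUNT(*) > 1
    """).fetchall()
    conn.close()
    return dupes
```

The query tells you which addresses are duplicated, but not which of the conflicting rows is the correct one. Declaring email_address as UNIQUE (or as the primary key) would have rejected the second insert up front.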

The reason why non-purists want it is that it makes it easy to use standardized code, because you can rely on referring to a single database row with an integer value. If you had a natural key of, say, the set ( client_id, email, category_id ), the programmer is going to hate coding around this instance! It kind of breaks the encapsulation of class-based coding, because it requires the programmer to have deep knowledge of table structure, and a delete method may have different code for each table. Yuck!
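To make the "standardized code" argument concrete, here is a small sketch in Python with sqlite3. The subscription table, the ALLOWED_TABLES whitelist, and both helper functions are invented for illustration. One delete helper works for any table that has an integer id, while the composite-key version has to know the exact columns of this one table.

```python
import sqlite3

# Hypothetical list of tables the generic helper may touch. Table names
# cannot be bound as SQL parameters, so they are checked against a whitelist.
ALLOWED_TABLES = {"subscription"}

def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE subscription (
            id          INTEGER PRIMARY KEY,   -- surrogate key
            client_id   INTEGER NOT NULL,
            email       TEXT NOT NULL,
            category_id INTEGER NOT NULL,
            UNIQUE (client_id, email, category_id)  -- composite natural key
        )
    """)
    conn.executemany(
        "INSERT INTO subscription (client_id, email, category_id) VALUES (?, ?, ?)",
        [(7, "a@example.com", 3), (7, "b@example.com", 3)],
    )
    return conn

def delete_by_id(conn, table, row_id):
    """Generic delete: the same code serves every table with an integer id."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    cur = conn.execute(f"DELETE FROM {table} WHERE id = ?", (row_id,))
    return cur.rowcount

def delete_subscription(conn, client_id, email, category_id):
    """Table-specific delete: the caller must know the composite key columns."""
    cur = conn.execute(
        "DELETE FROM subscription "
        "WHERE client_id = ? AND email = ? AND category_id = ?",
        (client_id, email, category_id),
    )
    return cur.rowcount
```

Note that the table in this sketch keeps the composite key as a UNIQUE constraint, so the convenience of the id column does not have to cost you the duplicate protection.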

So obviously this example is over-simplified, but it illustrates the point.

Solution 4:

Wow, you opened a can of worms with this question. Database purists will tell you never to use surrogate keys (like you have above). On the other hand, surrogate keys can have some tremendous benefits. I use them all the time.

In SQL Server, a surrogate key is typically an auto-increment Identity value that SQL Server generates for you. It has NO relationship to the actual data stored in the table. The opposite of this is a Natural key. An example might be Social Security number. This does have a relationship to the data stored in the table. There are benefits to natural keys, but, IMO, the benefits of using surrogate keys outweigh those of natural keys.

I noticed in your example, you have a GUID for a primary key. You generally want to stay away from GUIDs as primary keys. They are big, bulky and can often be inserted into your database in a random way, causing major fragmentation.
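As a rough illustration of "big" and "random", here is a short Python sketch using the standard uuid module. It only looks at the key values themselves, not at SQL Server's storage engine (which also sorts GUIDs differently than Python does), and the helper name is made up for the example.

```python
import uuid

# A GUID/UUID is 16 bytes; a 32-bit integer key is 4.
GUID_BYTES = len(uuid.uuid4().bytes)

def arrives_in_key_order(keys):
    """True if each new key is >= the previous one, i.e. rows would append at the end."""
    return all(a <= b for a, b in zip(keys, keys[1:]))

# Keys in the order they would be generated and inserted.
sequential = list(range(1, 1001))                    # auto-increment style
random_guids = [uuid.uuid4() for _ in range(1000)]   # random GUID style

in_order_ints = arrives_in_key_order(sequential)
in_order_guids = arrives_in_key_order(random_guids)
```

Sequential integers always arrive in key order, so new rows land at the end of an index ordered by the key. Random GUIDs land all over it, which is where the fragmentation comes from.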

Randy

Solution 5:

Users Table

Using a Guid as a primary key for your Users table is perfect.

LogEntry table

Unless you plan to expose your LogEntry data to an external system or merge it with another database, I would simply use an incrementing int rather than a Guid as the primary key. It's easier to work with and will use slightly less space, which could be significant in a huge log stretching several years.
