Over the past decade, there have been a number of DBMSs introduced (typically labeled as NoSQL) which utilize a network or hierarchical data model. MongoDB and Cassandra come immediately to mind as examples. Some such systems support networks through the concepts of “links” and some support hierarchical data using a nested data model often utilizing JSON. In my opinion, these systems have not internalized lessons from history.
At the SIGFIDET (now SIGMOD) annual conference in 1974, there was a “Great Debate” over the merits of the relational model versus the network and hierarchical models. Ted Codd argued the case for the relational model, while Charlie Bachmann represented the network model. I was at the debate and there is a published report of the event posted that can be read by all.
Two Essential Questions
Basically, the argument was about which model was a better fit for structured data (as opposed to documents, e-mails, etc.) and boiled down to two questions:
- Question 1: Are high-level data sublanguages a good idea?
- Question 2: Are tables the best data structure or should one use a network or hierarchy?
The last 45 years have definitely affirmed Codd’s position on both issues. On question 1, the discussion at the debate centered around the appropriateness of high-level declarative languages (think SQL) versus low-level record-at-a-time notations (think the data management language for the IBM hierarchical DBMS, IMS). Obviously, a programmer writes less code in a high-level language and is insulated from many changes to the storage representation of data, i.e., high-level languages are good. Hence, the counter argument at the time was that “high-level languages could not possibly be made as efficient as lower level notations.”
The adage I would use today is “never bet against the compiler” because a SQL optimizer compiles a high-level language into a low-level notation. Can a SQL optimizer choose as good a query plan as a programmer? The last 45 years have demonstrated that “rocket scientists” can indeed beat the optimizer but “mere mortals” cannot. Since most of us are in the latter category, the compiler wins. Moreover, the “hint” system present in most SQL engines allows a rocket scientist to steer the compiler toward his/her favorite plan. Seemingly, the mainstream has internalized this discussion, and SQL has become the near-universal “lingua franca” for database access.
The argument about the second question revolved around “data independence.” Codd mentioned to me many years ago that one of his inspirations for the relational model was his IBM programmer friends. They complained that every time the schema changed, they had to throw away their code and start anew. As such, data independence is the ability of application programs to survive changes in data storage or representation.
I will illustrate the issue with a very simple example and start with an entity-relationship representation of some data. Consider an entity “Employee” with a key of name and other attributes such as salary and birthdate. Consider a second entity “Department” with a key of dname, and other attributes such as floor_number and budget. There is a relationship between the two entities, “Works_in”, indicating the department in which an employee works. If the company requires employees to be in a single department, then an obvious representation in tables is:
Employee (name, salary, birthdate, dname)
Department (dname, floor_number, budget)
In a hierarchical or network data model, one might construct a schema as:
Department (dname, floor_number, budget)
Employee (name, salary birthdate)
This notation indicates that employees are stored “underneath” the department they work in using a hierarchical representation. Since we have already ascertained that high-level languages are good, here is the SQL to find the names of employees in a department on the first floor:
From Employee E, Department D
Where D.floor = 1 and E.dname = D.dname
In querying a hierarchical database, one must program from “outside in” to mimic the hierarchical representation. In a made-up high-level hierarchical language, here is the same query:
For all Departments where floor = 1
The above schemas can be optimized for the query we used. In a hierarchical system, one could use an index on floor, while a relational system would use indexes on D.floor and E.dname.
In the Real World
In the real-world, workloads change on a regular basis as business conditions change. Typically, such changes occur once per quarter or more. Also, database applications often last a quarter century or more. Hence, one must plan for 100 or more such changes over the lifetime of an application, and database/application maintenance is a really big deal.
One very simple change is to the structure of the dominant query. Suppose it becomes “find the names of employees between 30 and 40 years old”.
In a relational world, one would cluster (or index) employees on age to support this query. Note that the previous query continues to run, albeit with perhaps degraded performance.
In a hierarchical system, one might (or might not) have “dependent” indexes that would allow an age index on the dependent records. Even with this capability, the employee records are not clustered on age, so query performance may be inadequate. To get superior performance, employee records cannot remain clustered with departments, but must be made a top-level object. This requires changing the schema. A side effect of this decision is that our first query fails. The discussion about data models in the 1970s revolved around data independence, i.e., would queries continue to execute when changes were made to:
- The layout of objects
- The indexing of objects
- The objects that were physically stored, as opposed to represented as views
The conclusion from the 1970s was that the relational model provides superior data independence, compared to the network and hierarchical models.
Forty-five years later, this conclusion is still true. If you want to insulate yourself from the changes that business conditions dictate, use a relational DBMS. If you want the successor to the successor to your job to thank you for your wise decision, use a relational model.
Dr. Stonebraker has been a pioneer of database research and technology for more than forty years. He was the main architect of the INGRES relational DBMS, and the object-relational DBMS, POSTGRES. These prototypes were developed at the University of California at Berkeley where Stonebraker was a Professor of Computer Science for twenty-five years. More recently at M.I.T. he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, the H-Store transaction processing engine, the SciDB array DBMS, and the Data Tamer data curation system. Presently he serves as Chief Technology Officer of Paradigm4 and Tamr, Inc.
Professor Stonebraker was awarded the ACM System Software Award in 1992 for his work on INGRES. Additionally, he was awarded the first annual SIGMOD Innovation award in 1994, and was elected to the National Academy of Engineering in 1997. He was awarded the IEEE John Von Neumann award in 2005 and the 2014 Turing Award, and is presently an Adjunct Professor of Computer Science at M.I.T, where he is co-director of the Intel Science and Technology Center focused on big data.