Database, DBMS Principles, and the Relational Model

Compiled By: Aurelie A. Peralta

MAIN TOPICS: 

Principles of Database Management Systems

1. Database, DBMS Principles, and the Relational Model

2. Logical Database Design and Normalization

3. Physical Database Design, Hardware, and Related Issues

Discussions:

Database, DBMS Principles, and the Relational Model

Understanding Databases

A database is an organized collection of data serving a central purpose. It is organized in the sense that it contains data that is stored, formatted, accessed, and represented in a consistent manner. It serves a central purpose in that it does not contain extraneous or superfluous data. It stores only what is relevant to its purpose. Most often, a database’s purpose is business, but it may store scientific, military, or other data not normally thought of as business data. Hence, there are business databases, scientific databases, military databases, and the list goes on and on. Modern databases contain many types of data other than text and numbers.

When discussing databases, and database design in particular, it is commonplace to refer to the central purpose a database serves as its business, regardless of its more specific field, such as aerospace, biomedical, or whatever. Furthermore, in real life a database is often found to be very, very specific to its business.

In earlier days, programmers who wrote code to serve Automatic Data Processing (ADP) requirements found they frequently needed to store data from run to run. This became known as the need for persistent storage; that is, the need for data to persist, or be saved, from one run of a program to the next. This fundamental need began the evolution of databases as we know them. A secondary need, simple data storage, also helped give rise to databases.

Modern databases typically serve the processing and storage needs of departments or smaller organizational units of their parent organization or enterprise. Hence, we use the terms enterprise-wide database, referring to the scope of the whole organization’s business; department-wide database, referring to the level of a department; and workgroup database, usually referring to some programming or business unit within a department. Most often, databases are found at the department-wide and workgroup levels.

Occasionally one finds databases that serve enterprise-wide needs, such as payroll and personnel databases, but these are far outnumbered by their smaller brethren. In fact, when several departmental databases are brought together, or integrated, into one large database, this is the essence of building a data warehouse (DW). The smaller databases, which act as the data sources for the larger database, are known as operational databases. An operational database is simply one that produces data; it is what we have known for years as a production database. Only in the context of building a DW do you find production databases also referred to as operational databases, or sometimes operational data stores. With the advent of Internet technology, databases and data warehouses now frequently serve as back ends for Web browser front ends.

When workgroup databases are integrated to serve a larger, departmental need, the result is typically referred to as a data mart (DM). A DM is nothing more than a departmental-scale DW. When you are integrating several smaller databases into one larger database serving a broader organizational need, the resulting database can generally be considered a DW if it stores data historically, provides decision support, offers summarized data, serves data read-only, and acts essentially as a data sink for all the relevant production databases that feed it.

Otherwise, a database may simply grow large because it has been storing historical data for a long period of time, because of the type of data it must store, or because of the frequency with which it must store data; such a database is often referred to as a very large database (VLDB). Currently, a general guideline is that any database of 100GB or larger can be considered a VLDB.

Understanding a DBMS

A Database Management System (DBMS) is the software that manages a database. It acts as a repository for all the data and is responsible for its storage, security, integrity, concurrency, recovery, and access. The DBMS has a data dictionary, sometimes referred to as the system catalog, which stores data about everything it holds, such as names, structures, locations, and types. This data is also referred to as metadata, meaning data about data. The lifespan of a piece of data, from its creation to its deletion, is recorded in the data dictionary, as is all logical and physical information about that piece of data. A Database Administrator (DBA) should become intimately familiar with the data dictionary of the DBMS, because it will serve the DBA over the life of the database.
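
Because the data dictionary is itself stored as tables, it can be queried with ordinary SQL. The sketch below assumes Oracle-style dictionary views (USER_TABLES and USER_TAB_COLUMNS); other products expose similar metadata, for example through INFORMATION_SCHEMA. The table name EMPLOYEE is a hypothetical example.

    -- List the tables owned by the current user.
    SELECT table_name
      FROM user_tables;

    -- Describe the columns of one (hypothetical) table.
    SELECT column_name, data_type, nullable
      FROM user_tab_columns
     WHERE table_name = 'EMPLOYEE';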

Securing Data

Security is always a concern in a production database, and often in a development or test database too. It is usually not a question of whether or not to have any security, but rather how much to have. A DBMS typically offers several layers of security, in addition to the operating system (OS) and network security facilities. Most often, a DBMS holds user accounts with passwords, requiring the user to log in, or be authenticated, in order to access the database.

DBMSs also offer other mechanisms, such as groups, roles, privileges, and profiles, which all offer further refinement of security. These security mechanisms provide not only for enforcement, but also for the establishment of business security policies.
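
The statements below are a minimal sketch of these mechanisms in Oracle-flavored SQL; the account name clerk01, the role payroll_reader, and the employee table are all hypothetical, and exact syntax varies by vendor.

    -- Create a user account that must authenticate with a password.
    CREATE USER clerk01 IDENTIFIED BY s3cret;

    -- Group privileges into a role so the business security policy
    -- is defined once and granted to many users.
    CREATE ROLE payroll_reader;
    GRANT SELECT ON employee TO payroll_reader;

    -- Let the user connect, and give the user the role.
    GRANT CREATE SESSION TO clerk01;
    GRANT payroll_reader TO clerk01;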

Maintaining and Enforcing Integrity

The integrity of data refers to its consistency and correctness. For data to be consistent, it must be modeled and implemented the same way in all of its occurrences. For data to be correct, it must be right, accurate, and meaningful. Integrity not only enhances data, but also gives data its value.

One way a DBMS maintains integrity is by locking a data item in the process of being changed. Another way a DBMS enforces integrity is to replicate a change to a piece of data if it is stored in more than one place. A third way a DBMS enforces integrity is by checking the data values being entered or changed so that they fall within required specifications (for example, a range check).
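
The value-checking kind of integrity can be declared directly in the schema. The sketch below adds a range check to a hypothetical employee table; the column name and limits are illustrative only.

    -- Declarative integrity: the DBMS rejects any insert or update that
    -- violates the rule, so the range check is enforced in one place.
    ALTER TABLE employee
      ADD CONSTRAINT chk_salary_range
      CHECK (salary BETWEEN 0 AND 500000);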

A DBMS must manage concurrency when it offers multiuser access. That is, when more than one person at a time must access the same database, specifically the same pieces of data, the DBMS must ensure that this concurrent access is somehow possible.

This is how concurrency and integrity are linked. When a person wants to look at or change a piece of data, that person is performing a transaction with the database.

Understanding Transactions

A DBMS has, as part of its code, a transaction manager whose sole purpose is to manage concurrency and ensure the integrity of transactions. The transaction manager has a tough job because it must allow many people to access the same data at the same time, and yet it must put the data back as though it had been accessed by one person at a time, one after the other, which ensures its correctness. Therein lies the fundamental answer as to how a DBMS must resolve all those multiple copies of data. Transactions occurring during the same time period can preserve the accuracy of the data if (and only if) they are serializable. Simply put, the DBMS must rearrange them so that the net result of all the changes is as if they all occurred single file.

The transaction is a unit of concurrency, or a unit of work. Nothing smaller or lesser than a transaction can occur; that is, no one can halfway change a piece of data. All transactions must be atomic in that each individual transaction either completes entirely or not at all. A transaction that completes is said to be committed, and one that does not is rolled back.

A transaction log is kept by the DBMS for the purpose of rolling back (undo), and also for rolling forward (redo). A rollback is an undo operation. A rollforward is a redo operation. Hence, the key to transaction recovery in a DBMS is that a transaction must be atomic and can be done, undone, or redone when necessary.
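
The classic illustration is a transfer between two accounts: both updates must succeed or fail together. The account table, column names, and amounts below are hypothetical; the COMMIT and ROLLBACK statements are standard SQL.

    -- Move 100.00 from account 12 to account 47 as one atomic unit of work.
    UPDATE account SET balance = balance - 100 WHERE account_id = 12;
    UPDATE account SET balance = balance + 100 WHERE account_id = 47;

    COMMIT;      -- make both changes permanent together...
    -- ROLLBACK; -- ...or, had something gone wrong, undo both of them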

Communicating with the Database

A DBMS is no good if you can’t talk to it. How does one talk to a DBMS? Through an access or query language. The Structured Query Language (SQL) is the predominant query language today. DBAs use query languages to build and maintain a database, and users use query languages to access the database and to look at or change the data.
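
A minimal example of that conversation, using a hypothetical employee table: a user asks a question in SQL and the DBMS answers with a set of rows.

    -- Ask for the names and salaries of everyone in department 20,
    -- highest paid first.
    SELECT last_name, salary
      FROM employee
     WHERE dept_no = 20
     ORDER BY salary DESC;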

Understanding an RDBMS

In 1970, E.F. Codd fathered the concept of the relational model. Before RDBMSs like DB2 were born, hierarchical (IMS) and network (IDMS) models were commonplace. Before these models, databases were built using flat files and third-generation language (3GL) access routines. Many of these legacy databases still exist on mainframes and minicomputers. CODASYL (Conference on Data Systems Languages) was a database standard created by the Database Task Group (DBTG). This was a COBOL-based network database standard, and IDMS was one vendor implementation. Since the seventies, however, RDBMSs have come to dominate the marketplace, with products such as Oracle, Sybase, Informix, and Ingres.

Using the Relational Model

A relational model is one in which:

1. The fundamental pieces of data are relations.

2. The operations upon those relations yield only relations (relational closure).

What is a relation? It’s a mathematical concept describing how the elements of two sets relate, or correspond to each other. Hence, the relational model is founded in mathematics. A relational model organizes data into tables and only tables.

A relational table has a set of named attributes, or columns, and a set of tuples, or rows. Sometimes a column is referred to as a field. Sometimes a row is referred to as a record. A row-and-column intersection is usually referred to as a cell. The columns are placeholders, having domains, or data types, such as character or integer. The rows themselves are the data.

A relational table must meet some special properties to be part of the relational model:

1. Data stored in cells must be atomic.

2. Data stored under columns must be of the same data type.

3. Each row is unique (no duplicate rows).

4. Columns have no order to them.

5. Rows have no order to them.

6. Each column has a unique name.

In addition to tables and their properties, the relational model has its own special operations. The relational model operations allow subsets of columns, subsets of rows, joins of tables, and other mathematical set operations such as union. What really matters is that these operations take tables as input and produce tables as output. SQL is the current ANSI standard language for RDBMSs, and it embodies these relational operations.
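
Sketched below in generic ANSI-style SQL against two hypothetical tables, employee and department, each operation takes tables in and hands a table back, which is the relational closure the model requires.

    -- Subset of columns (projection): the result is still a table.
    SELECT last_name, dept_no FROM employee;

    -- Subset of rows (restriction).
    SELECT * FROM employee WHERE salary > 50000;

    -- Join of two tables on a common column.
    SELECT e.last_name, d.dept_name
      FROM employee e JOIN department d ON e.dept_no = d.dept_no;

    -- Set operation: union of two compatible results.
    SELECT dept_no FROM employee
    UNION
    SELECT dept_no FROM department;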

SQL is both a Data Definition Language (DDL) and a Data Manipulation Language (DML). A unified DDL and DML is inherently more productive and useful than two different languages and interfaces. The DBAs and the users access the database through the same overall language.
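
A brief sketch of the two faces of the language, using a hypothetical department table: the same SQL both defines the structure (DDL) and manipulates the data held in it (DML).

    -- DDL: define the structure.
    CREATE TABLE department (
        dept_no    INTEGER     PRIMARY KEY,
        dept_name  VARCHAR(30) NOT NULL
    );

    -- DML: manipulate the data held in that structure.
    INSERT INTO department (dept_no, dept_name) VALUES (20, 'Research');
    UPDATE department SET dept_name = 'Research Labs' WHERE dept_no = 20;
    DELETE FROM department WHERE dept_no = 20;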

The last thing the relational model requires is two fundamental integrity rules: the entity integrity rule and the referential integrity rule.

A primary key is a column or set of columns that uniquely identifies rows. Sometimes, more than one column or set of columns could act as the primary key. A primary key that is made up of multiple columns is called a concatenated key, a compound key, or more often, a composite key.

The remaining possible primary keys are referred to as candidate keys, or alternate keys. A foreign key is a column or set of columns in one table that exists as the primary key in another table. A foreign key in one table is said to reference the primary key of another table. The entity integrity rule simply states that the primary key cannot be totally or partially empty, or null. The referential integrity rule simply states that a foreign key must either be null or match a currently existing value of the primary key that it references.
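
Both rules fall out naturally when the keys are declared. In the hypothetical pair of tables below, the PRIMARY KEY clause enforces entity integrity (the key can never be null or duplicated) and the FOREIGN KEY clause enforces referential integrity.

    CREATE TABLE customer (
        cust_no    INTEGER     PRIMARY KEY,   -- entity integrity: never null, never duplicated
        cust_name  VARCHAR(40) NOT NULL
    );

    CREATE TABLE orders (
        order_no   INTEGER     PRIMARY KEY,
        cust_no    INTEGER,                   -- the foreign key: may be null...
        FOREIGN KEY (cust_no) REFERENCES customer (cust_no)
            -- ...but when present it must match an existing customer
    );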

An RDBMS, then, is a DBMS that is built upon the preceding foundations of the relational model and generally satisfies all of the requirements mentioned.

Using Codd’s Twelve Rules

Codd proposed twelve rules that a DBMS should follow to be classified as fully relational:

1. The information rule.

2. The guaranteed access rule.

3. Nulls must be used in a consistent manner.

4. An active, online data dictionary should be stored as relational tables and accessible through the regular data access language.

5. The data access language must provide all means of access and be the only means of access, except possibly for low-level access routines.

6. All views that may be updatable should be updatable.

7. There must be set-level inserts, updates, and deletes.

8. Physical data independence.

9. Logical data independence.

10. Integrity independence.

11. Distribution independence.

12. The nonsubversion rule.

Logical Database Design and Normalization

Entity-Relationship Modeling

The best thing a DBA can do for a database is to start out with a proper, logical design. A well-informed and wise DBA knows that a good design improves performance rather than detracts from it, contrary to popular wisdom. Indeed, jumping directly into the physical design, or beyond, simply invites trouble, not just in terms of performance but also in data integrity.

As the relational model came to dominate over other data models during the mid-seventies, relational modeling techniques sprang up that permitted formal design capabilities. The most popular of these is the Entity-Relationship Diagram (ERD), developed by P.P. Chen in 1976. This is known as a semantic data model because it attempts to capture the semantics, or proper meaning, of business elements, the essence of the business. Because the relational model itself is mostly a syntactic model, one dealing mostly with structure, the ERD typically supplements it. In fact, ERD modeling naturally precedes relational modeling. Once an ERD is complete, it is mapped into the relational model more or less directly, and later the relational model is mapped to its physical model.

An entity is a business element, such as an employee or a project. A relationship is an association between entities, such as employees working on several projects. Attributes are the characteristics that make up an entity. Attributes are said to take on values from domains, or value sets. The values they take will be the data used later on in the relational model. ERDs can be drawn many ways. It doesn’t really matter as long as you choose one and remain consistent in your meaning throughout.

There are three kinds of relationships: one-to-one, one-to-many, and many-to-many. The dominant relationship used in the relational model is the one-to-many. A many-to-many relationship is resolved by creating a new entity, which becomes an intersection table in the relational model, also called a join table.

Mapping ERDs to the Relational Model

An ERD folds nicely into the relational model because it was created for that purpose. Essentially, entities become tables and attributes become columns. Identifier attributes become primary keys. Relationships don’t really materialize except through intersection tables. Foreign keys are created by placing the primary key of the table on the one side into the table on the many side.
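
A sketch of the mapping for a hypothetical "employees work on projects" model: the two entities become tables, the many-to-many relationship becomes an intersection (join) table, and the primary keys of the one-side tables appear in it as foreign keys. All names below are illustrative.

    CREATE TABLE employee (
        emp_no     INTEGER     PRIMARY KEY,
        last_name  VARCHAR(30) NOT NULL
    );

    CREATE TABLE project (
        proj_no    INTEGER     PRIMARY KEY,
        proj_name  VARCHAR(40) NOT NULL
    );

    -- Intersection (join) table resolving the many-to-many relationship
    -- "an employee works on many projects; a project has many employees".
    CREATE TABLE assignment (
        emp_no   INTEGER REFERENCES employee (emp_no),
        proj_no  INTEGER REFERENCES project (proj_no),
        PRIMARY KEY (emp_no, proj_no)   -- a composite key built from the two foreign keys
    );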

Understanding Normalization

Normalization is a refinement, or extension, of the relational model. Normalization is also a process that acts upon the first draft relational model and improves upon it in certain concrete ways. The foundation of normalization is mathematical, like the relational model. It is based on a concept known as functional dependency (FD).

Three major reasons for normalization:

1. To maintain data integrity.

2. To build a model that is as application independent as possible.

3. To reduce storage needs (and frequently lay the foundation for improved search performance, too).

Levels of normalization (a sketched example follows the list):

1. First Normal Form (1NF) – No repeating groups. This is the same as saying that the data stored in a cell must be a single, simple value and cannot hold more than one piece of information.

2. Second Normal Form (2NF) – No partial dependencies. Every non-key column depends on the full primary key, including all of its columns if it is composite.

3. Third Normal Form (3NF) – No transitive dependencies. No non-key column depends on another non-key column.
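
As a sketch with hypothetical tables, consider a single order_line table keyed on (order_no, product_no) that also carries customer_no and customer_city. The customer columns depend on order_no alone (a partial dependency, violating 2NF), and customer_city depends on customer_no, a non-key column (a transitive dependency, violating 3NF). Decomposing removes both problems:

    -- Starting point (1NF, but not 2NF or 3NF), key = (order_no, product_no):
    --   order_line(order_no, product_no, qty, customer_no, customer_city)

    -- After normalizing to 3NF, each fact is stored exactly once:
    CREATE TABLE customer (
        customer_no   INTEGER PRIMARY KEY,
        customer_city VARCHAR(30)
    );

    CREATE TABLE order_header (
        order_no     INTEGER PRIMARY KEY,
        customer_no  INTEGER REFERENCES customer (customer_no)
    );

    CREATE TABLE order_line (
        order_no    INTEGER REFERENCES order_header (order_no),
        product_no  INTEGER,
        qty         INTEGER,
        PRIMARY KEY (order_no, product_no)   -- composite key; no partial dependencies remain
    );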

Continuing the Normal Form

The Boyce-Codd Normal Form (BCNF) – No inverse partial dependencies. This is also sometimes referred to, semi-seriously, as 3.5NF. Neither the primary key, nor any part of it, depends on a non-key attribute.

Fourth Normal Form and higher. Normalization theory in academia has taken us many levels beyond the BCNF. Database analysis and design texts typically go as high as 5NF. 4NF deals with multivalued dependencies (MVDs), while 5NF deals with join dependencies (JDs).

Normal forms as high as 7 and 8 have been introduced in theses and dissertations. In addition, alternative normal forms such as Domain Key Normal Form (DKNF) have been developed that parallel or otherwise subsume current normalization theory.

Recommendation: strive for at least BCNF, then compensate with physical database design as necessary.

Physical Database Design, Hardware, and Related Issues

(to be discussed next…)

Reference: Using Oracle 8 by William G. Page Jr. and Nathan Hughes, Copyright 1998 by Que Corporation