A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions. The software used to manage and query a database is known as a database management system (DBMS). The properties of database systems are studied in information science.
At the core of the concept of a database is the idea of a collection of facts, or pieces of knowledge. Facts may be structured in a number of ways, known as data models. For instance, one model is to associate each fact with a record representing an entity (such as a person), and to arrange these entities into trees or hierarchies -- the hierarchical data model. Another model is to arrange facts into sets of values which satisfy logical predicates -- the relational model.
(The terms database and database management system are sometimes interchanged by students. In the professional argot, a database is always the collection of facts, not the software program.)
Database management systems range from the extremely simple to the highly complex. Differences among DBMSes include whether they are capable of ensuring the integrity of the data; whether they may be used by many users at once; and what sorts of conclusions they can be programmed to compute from a set of data.
The first database management systems were developed in the 1960s. A pioneer in the field was Charles Bachman. Two key data models arose at this time: the network model (developed by CODASYL) followed by the hierarchical model (as implemented in IMS). These were later usurped by the relational model, which was contemporary with the so-called flat model designed for very small tasks. Another contemporary of the relational model is the object-oriented database (OODB).
While the relational model is based on set theory, one proposed modification suggests fuzzy set theory (based on fuzzy logic) as an alternative.
Various techniques are used to model data structure. Certain models are more easily implemented by some types of database management systems than others. For any one logical model various physical implementations may be possible. An example of this is the relational model: In larger systems the physical implementation often has indexes which point to the data - this is similar to some aspects of common implementations of the network model. But in small relational databases the data is often stored in a set of files, one per table, in a flat, unindexed structure. There is some confusion below and elsewhere in this article as to logical data model vs its physical implementation.
The flat (or table) model consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to be similar values, and all members of a row are assumed to be related to one another. For instance, columns for name and password might be used as a part of a system security database. Each row would have the specific password associated with a specific user. Columns of the table often have a type associated with them, defining them as character data, date or time information, integers, or floating point numbers. This model is the basis of the spreadsheet.
The network model allows multiple datasets to be used together through the use of pointers (or references). Some columns contain pointers to different tables instead of data. Thus, the tables are related by references, which can be viewed as a network structure. A particular subset of the network model, the hierarchical model, limits the relationships to a tree structure, instead of the more general directed graph structure implied by the full network model.
The relational model was introduced in an academic paper by E. F. Codd in 1970 as a way to make database management systems more independent of any particular application. It is a mathematical model defined in terms of predicate logic and set theory.
Although the basic idea of relational database management systems has been very popular, relatively few people understand the mathematical definition and only a few, obscure DBMSs implement it completely and without extension. Oracle, for example, can be used in a purely relational way, but it also allows tables to be defined which allow duplicate rows -- an extension upon (or violation of) the relational model. In common English usage, a DBMS is called relational if it supports relational operations, regardless of whether it enforces strict adherence to the relational model. The following is an informal, non-technical explanation of how "relational" database management systems commonly work.
A relational database contains multiple tables, each similar to the one in the "flat" database model. However, unlike network databases, the tables are not linked by pointers. Instead, "keys" are used to match up rows of data in different tables. A key is just one or more columns in a table, which correspond to columns in other tables. Any of the columns in a table can be a key, or multiple columns can be grouped together into a single key. Unlike pointers, it's not necessary to define all the keys in advance; a column can be used as a key even if it wasn't originally intended to be one.
When a key consists of data that has an external, real-world meaning (such as a person's name, a book's ISBN, or a car's serial number), it's called a "natural" key. If no natural key is suitable, an arbitrary key can be assigned (such as by giving employees ID numbers). In practice, most databases have both generated and natural keys, because generated keys can be used internally to create links between rows that can't break, while natural keys can be used, less reliably, for searches and for integration with other databases. (For example, records in two independently developed databases could be matched up by social security number, except when the social security numbers are incorrect, missing, or have changed.)
You request data from a relational database by sending it a query that's written in a special language, usually a dialect of SQL. Although SQL was originally intended for end-users, it's much more common for SQL queries to be embedded into software that provides an easier user interface. (Many web sites perform SQL queries when generating pages.)
In response to a query, the database returns a result set, which is just a list of rows containing the answers. The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted. Often, data from multiple tables gets combined into one, by doing a "join". Conceptually, this is done by taking all possible combinations of rows (the "cross-product"), and then filtering out everything except the answer. In practice, relational database management systems rewrite ("optimize") queries to perform faster, using a variety of techniques: In the "join" the primary optimisation is obtained through the use of indexes to prevent the building of the complete cross-product which would otherwise be necessary.
The flexibility of relational databases allows programmers to write queries that were not anticipated by the database designers. As a result, relational databases can be used by multiple applications in ways the original designers did not foresee, which is especially important for databases that might be used for decades. This has made the idea and implementation of relational databases very popular with businesses.
Database designing is the first part of any good project.
Implementations and indexing
All of these kinds of database can take advantage of indexing to increase their speed. The most common kind of index is a sorted list of the contents of some particular table column, with pointers to the row associated with the value. An index allows a set of table rows matching some criterion to be located quickly. Various methods of indexing are commonly used; b-trees, hashes, and linked lists are all common indexing techniques.
Relational DBMSs have the advantage that indices can be created or dropped without changing existing applications, because applications don't use the indices directly. Instead, the database software decides on behalf of the application which indices to use. The database chooses between many different strategies based on which one it estimates will run the fastest.
Relational DBMSs utilise many different algorithms to compute the result of an SQL statement. The RDBMs will produce a plan of how to execute the query, which is generated by analysing the run times of the different algorithms and selecting the quickest. Some of the key algorithms that deal with joins are Nested Loops Join , Sort-Merge Join and Hash Join .
Mapping objects into databases
In recent years, the object-oriented paradigm has been applied to databases as well, creating a new programming model known as object databases. These databases attempt to overcome some of the difficulties of using objects with the SQL DBMSs. An object-oriented program allows objects of the same type to have different implementations and behave differently, so long as they have the same interface (polymorphism). This doesn't fit well with a SQL database where user-defined types are difficult to define and use, and where the Two Great Blunders prevail: the identification of classes with tables (the correct identification is of classes with types, and of objects with values), and the usage of pointers.
A variety of ways have been tried for storing objects in a database, but there is little consensus on how this should be done. Implementing object databases undo the benefits of relational model by introducing pointers and making ad-hoc queries more difficult. This is because they are essentially adaptations of obsolete network and hierarchical databases to object-oriented programming. As a result, object databases tend to be used for specialized applications and general-purpose object databases have not been very popular. Instead, objects are often stored in SQL databases using complicated mapping software. At the same time, SQL DBMS vendors have added features to allow objects to be stored more conveniently, drifting even further away from the relational model.
Applications of databases
Databases are used in many applications, spanning virtually the entire range of computer software. Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed. Even individual users find them convenient, though, and many electronic mail programs and personal organizers are based on standard database technology.
Transactions and concurrency
In addition to their data model, most practical databases ("transactional databases") attempt to enforce a database transaction model that has desirable data integrity properties. Ideally, the database software should enforce the ACID rules, summarised here:
Atomicity - Either all the tasks in a transaction must be done, or none of them. The transaction must be completed, or else it must be undone (rolled back).
Consistency - Every transaction must preserve the integrity constraints -- the declared consistency rules -- of the database. It cannot place the data in a contradictory state.
Isolation - Two simultaneous transactions cannot interfere with one another. Intermediate results within a transaction are not visible to other transactions.
Durability - Completed transactions cannot be aborted later or their results discarded. They must persist through (for instance) restarts of the DBMS after crashes.
In practice, many DBMS's allow most of these rules to be selectively relaxed for better performance.
Concurrency control is a method used to ensure transactions are executed in a safe manner and follows the ACID rules. The DBMS must be able to ensure only serializable, recoverable schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions.
The Database State
While databases are a vital tool of modern life, concerns are growing over the privacy and civil liberties issues that result, particularly with governments using increasingly larger interlinked databases with their potential for mass surveillance in what has been termed by some the 'Database State'.
Last updated: 05-13-2005 07:56:04