Introduction
Most human activity is either driven by data or generates it. This data comes in two main types: structured (tabulated, easy to process) and unstructured (hard to tabulate, like images, letters, web pages). There is also semi-structured data, such as XML/JSON.
Traditional databases are structured and alphanumeric. Newer technologies can handle unstructured data, but often in a tabular "wrapping".
Data modelling
We need to consider
Principal functions of database systems
DB Management Systems are software which manages data (as name implies). It lets us model the data; query, update, and store data; and secure and ensure the integrity and consistency of stored data.
Integrity must be maintained in the face of unauthorised access, updates to the database (which may violate constraints), concurrent transactions, failures, and so on. In cases of natural disaster, integrity is harder to guarantee.
The software optimises data access, by creating and using indexes, finding the best way to execute a set of operations, and the best order of relational data operations (SELECT, PROJECT, JOIN, etc).
The interface to a DBMS is often through a declarative language (like SQL), which lets one easily use database functions.
3-schema architecture
External views
Different user groups want or need access to different parts of the database, so external views differ per user group. Think of the difference between a customer's interface and an employee's.
Other definitions
The Schema is the collection of tables and the relations between them.
Normalisation is a process of removing anomalies from tables.
Informal Definition
A relation is a table of values. It contains a set of rows, called tuples, and a set of columns.
Each row represents certain facts about an entity, and the column headers denote the attributes of interest we store.
A relational database then is a database built out of relations, with known table names and attribute (name, type) pairs.
Formulating queries involves specifying the table and attribute name(s) of interest, and setting constraints (predicates) that rows must satisfy to be of interest.
Formal definition
The Schema (description) of a relation, for example:
Customer(id, name, address, phoneNo)
A domain example: BritishPhoneNumbers are all 11 digit phone numbers in the UK, in the format (ddd)dddd-ddd (d for digit).
* Technically attributes are unordered, but for human readability's sake we treat them as ordered.
Given the schema \(R(A_1, A_2, ..., A_n)\),
The purpose of all this is to present a (relatively) human-understandable format, with the goal to simplify how to interrogate (query) and maintain the database.
On schemas
The relation schema is the relation name and attribute list: Tea(name: string, manuf: string).
The Database schema is the set of relation schemas.
The relation schema is defined during table creation, for example with creating a (tea) drinkers' table:
CREATE TABLE Drinkers (
name CHAR(30) Primary Key,
addr CHAR(50) DEFAULT '0 Null Island',
phoneNo CHAR(16) NOT NULL
);
Primary Key and NOT NULL are constraints.
About the Relational Model
Why do we use relational databases
It's a fairly simple abstract model, but also comes with declarative languages like SQL.
Note that the abstract model is set based, but SQL itself is bag-based (bags are multisets, bags can have duplicate elements whilst sets cannot).
Constraints exist to determine permissible states.
We have domain constraints (values in attributes must be \(\in\) domain, or null if permitted); and 3 explicit schema-based constraints:
Key constraints
A Superkey of a relation R is a subset \(SK \subseteq R :\) in any valid \(r(R)\), for any distinct tuples \(t_1, t_2;\;\) \(t_1[SK] \neq t_2[SK] \). Superkey values uniquely identify tuples.
Ex. Take the relation R(name, country, phone); the superkey SK = {country, phone} means every row must have a unique (country, phone) combination.
Minimal Key
A Candidate Key of R is a minimal superkey. That is, removing any attribute from the candidate key makes it no longer a superkey.
A relation may have several candidate keys; one can arbitrarily designate one as the primary key.
Entity integrity
Entity integrity basically deals with nulls where they shouldn't be.
Primary keys must never be null (they are identifiers), and other entries can be set to not null.
Referential Integrity
Referential Integrity deals with inter-table references.
For example, we have a students table, which has student ID as a primary key. We also have a table of courses which students take, with student ID as a foreign key. If we delete a student, the foreign key for that student will be referencing nothing, and this is an integrity violation.
A set of attributes \(FK\) in relation \(R_1\) is a Foreign Key that references \(R_2\) if the attributes of \(FK\) have the same domains as the Primary Key \(PK\) of \(R_2\). Each foreign key value must either match a \(PK\) value in \(R_2\), or be NULL.
Foreign keys are also defined during table creation.
Semantic attribution integrity
We also have Semantic attribute integrity, which in short comprises validation checks that attribute values make sense, such as Age > 0.
Constraints and updates
On an update operation, integrity constraints may be violated. The DB management system is there to ensure that these are caught, and corrected, rejected, or reported.
Some possible violations that may occur:
Functional dependencies
Functional dependencies (FDs) are useful, since they are formal statements which allow derivation of good database design.
Let X, Y, Z be sets of attributes, and A, B, C be single attributes, where \(ABC\) abbreviates \(\{A, B, C\}\).
\(X \longrightarrow Y \) is an assertion about relation R and two sets of attributes from R, which constrains all possible tuples in \(r(R)\). If \(X \longrightarrow Y\), the values of Y depend on the values of X. For any two tuples \(t_1, t_2 \in R\), \[t_1[X] = t_2[X] \implies t_1[Y] = t_2[Y]\] (i.e. Y is a function of X).
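As a minimal sketch (not from the notes), we can check this definition directly against a relation instance represented as a list of dicts; the table and attribute names below are made up for illustration.

```python
# Check whether the FD X -> Y holds in a relation instance:
# t1[X] == t2[X] must imply t1[Y] == t2[Y] for every pair of tuples.

def fd_holds(rows, X, Y):
    seen = {}
    for t in rows:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False  # two tuples agree on X but disagree on Y
        seen[x_val] = y_val
    return True

staff = [
    {"id": 1, "name": "Ann", "dept": "CS"},
    {"id": 2, "name": "Bob", "dept": "CS"},
    {"id": 2, "name": "Bob", "dept": "Maths"},  # same id, different dept
]

print(fd_holds(staff, ["id"], ["name"]))  # True: id determines name here
print(fd_holds(staff, ["id"], ["dept"]))  # False: id 2 maps to two depts
```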
FDs and Keys
Claim. If K is a superkey, then \(K \longrightarrow A\) for every other subset \(A \subseteq R\).
Proof. By the key constraint, no two distinct tuples of \(r(R)\) agree on K, so the condition \(t_1[K] = t_2[K] \implies t_1[A] = t_2[A]\) holds vacuously. \(\Box\)
The converse is also true (if \(K \longrightarrow R\), then K is a superkey), making this an if and only if.
Determining keys
(1) We can assert a key. Say "this attribute will be the key." By definition this gives us one functional dependency: key \(\longrightarrow\) all other attributes.
(2) Assert functional dependencies. From a set of functional dependencies, we can derive which attribute sets are keys.
FDs can come from keys or physical limits, e.g. no two lectures can happen in the same room at the same time \(\implies\) {room, time} is a viable key.
SQL and declarative languages
SQL is a declarative language - a very high level language that tells the system merely what to do instead of how to do it.
Things like data manipulation that one would have to implement in a procedural language are implemented in the back-end by the DB management system, and it is the one responsible for finding the optimal way of carrying out an SQL command.
SQL is both a data definition language (DDL) and a data manipulation language (DML).
SQL Terminology
In SQL, relations are tables, tuples are rows, and attributes are columns.
Standards
Whilst SQL has an official standard - SQL 3 currently - different companies implement SQL differently and none have implemented the standard exactly, thus different platforms have slightly different versions of the language.
Databases, Schemas, and Catalogues
A database contains one or more schemas, which contain one or more related tables. A schema can be created with the create schema command, and schemas can (supposedly) be given an authorisation: create schema Company authorisation 'JSmith'.
If one does not create a schema before creating tables, those tables go in the default public schema.
Virtual relations can be created with create view; these aren't actual tables but "virtual" tables spanning actual tables.
Note that SQL is generally case insensitive.
now, syntax time
Tables
Tables are created with
create table <tableName> (
<name1> <type1> <constraint1>,
<name2> <type2> <constraint2>,
[otherConstraint], [otherConstraint]
);
There are a few essential data types, their PostgreSQL (pn. post-gres) equivalents are given. The ANSI names are omitted.
PostgreSQL type | Description
---|---
char(n) | Fixed length character string - the string is padded with spaces
varchar(n) | Variable limited length - like above but with no padding
numeric(p, s) | Arbitrary precision numbers; \(p\) is precision and \(s\) is scale
int, int4, int2 | The first two are 4 byte ints, int2 is a 2 byte int
float4, float8 | 4 byte and 8 byte floating point numbers

Plus blob, date, boolean, time, etc.
Thus we can create an example table like so:
create table students (
studentID int primary key, -- this is a comment
studentName varchar(30), -- char varying
courseID int
);
Attributes can be set to have a default value, not null, and semantic checks
accountBalance real not null default 0.0,
age int not null check (age > 0 and age < 125)
One may also add a unique clause - this can specify alternate candidate (secondary) keys.
dname varchar(15) unique
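The constraints above can be seen rejecting bad data. A minimal sketch using Python's sqlite3 in-memory database (SQLite's syntax differs slightly from PostgreSQL; the accounts table and its attributes are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    create table accounts (
        accountID int primary key,
        accountBalance real not null default 0.0,
        age int check (age > 0 and age < 125),
        dname varchar(15) unique
    )
""")
# accountBalance is omitted, so the default kicks in
con.execute("insert into accounts (accountID, age, dname) values (1, 30, 'dev')")
balance = con.execute("select accountBalance from accounts").fetchone()[0]
print(balance)  # 0.0

# the check constraint rejects an impossible age
rejected = False
try:
    con.execute("insert into accounts (accountID, age, dname) values (2, -5, 'ops')")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```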
Foreign keys
Constraints can be declared inline or after the attribute list. Below is an example with foreign keys, where the constraints are written at the end.
create table project (
Pname varchar(15) not null,
Pnumber int not null,
Plocation int,
Dnum int not null,
primary key (Pnumber),
unique (Pname),
foreign key (Dnum) references department (Dnumber)
-- department is another table
);
Constraints
One can set actions for foreign keys, which trigger when specific updates cause integrity violations. These actions are set null, cascade (propagate), and set default.
Different actions can happen for different operations, specified with on update and on delete. Thus,
constraint superSSN_SSN
foreign key (superSSN) references employee (SSN)
on delete set null
on update cascade
where superSSN_SSN is the name of the constraint.
Constraints can also be set on fields inline, with the syntax constraint name details
studentID int constraint nonzero check (studentID > 0)
rm -rf /
SQL also has delete (DROP) commands, which do not double check and will delete your data. Use with caution. These are drop table and drop schema.
Selecting
SQL allows duplicate entries in the database - these are dealt with at point of query instead of point of insertion.
Generally, the terminology is that rows are selected whilst columns are projected.
The basic select command looks like select attr1, attr2 from someTable where someCondition;
The output is also a table, and we can manipulate said table.
select distinct removes duplicates.
where sets conditions on which rows are selected; conditions can use the boolean operations and, or, not.
* is the wildcard "all" operator.
We can qualify names using a dot, if there are multiple tables that share a column name (whether through foreign keys or otherwise), by doing select students.studentID from students, courses;
If table names are too long or unwieldy, they can be aliased, for example S.studentID from students S.
Please note that strings in SQL are single quoted only.
We can order by a field with order by field asc (or desc).
Columns can be renamed using as, such as select name as teaName, manufacturer from ...
Operational semantics
The general order of operations goes:
(1) the from clause
(2) where
(3) select
Comparisons
One can replace a column value with a constant value. For example, if we search a database of tea drinkers for those with Earl Grey as their favourite tea, we can do something like the following:
select drinker, 'likes earl grey' as likesEarlGrey
from likes
where tea = 'Earl Grey';
There are several more string functions, but often these are implementation-specific.
Apostrophes inside strings are escaped by doubling the single quote, so 'Joe''s' would mean Joe's.
String attributes can also be pattern matched, using the keywords like and not like. A pattern is a quoted string with special characters, like a simplified form of regex: _ means any one character, % means any run of any characters.
where phone like '%1926_______'
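A minimal sketch of like patterns in action, again via sqlite3 (the drinkers table and phone numbers are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table drinkers (name text, phone text)")
con.executemany("insert into drinkers values (?, ?)",
                [("Ann", "+441926496000"), ("Bob", "+442079460000")])

# % matches any run of characters, _ matches exactly one character,
# so this asks for "1926" followed by exactly six more characters
rows = con.execute(
    "select name from drinkers where phone like '%1926______'").fetchall()
print(rows)  # [('Ann',)]
```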
Nulls
Nulls are special in that they are not real values. They are useful for unknown/not applicable values.
Thus, nulls are incomparable: if A's phone number is NULL and B's is also NULL, comparing A's phone number with B's yields neither true nor false.
SQL comparisons actually have three boolean values: True, False, and Unknown. Comparing with NULL yields Unknown. Under three-valued logic, an expression can still resolve if the unknown does not matter, e.g. \(F \land U = F\) and \(T \lor U = T\); otherwise the result stays Unknown, and where only keeps rows whose condition is True.
To check nulls, use is: attr is null or attr is not null.
A modification does not return a result; it changes the state of the DB.
Insert
insert into <tableName> values (<val1>, <val2>, ...);
Inserted values must be given in the attribute order listed in the schema. To insert partial values, we can specify the column names:
insert into <tableName> (A1, A2, ..., Aj) values (V1, V2, ..., Vj);
Unspecified attributes will fall to a default or null value. This is also useful if we do not know the order of attributes.
Delete
To delete tuples from a table matching the expression,
delete from <tableName> where <expression>;
And the next one, depending on database, will delete either nothing or everything -- in postgres, this deletes every tuple from the table.
delete from <tableName>;
Update
update <tableName>
set att1=val1, att2=val2, ...
where <condition>;
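The three modification commands above can be sketched end to end with sqlite3 (the students table is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table students (studentID int primary key, name text, courseID int)")

# full-row insert, then a partial insert (courseID falls to null)
con.execute("insert into students values (1, 'Ada', 10)")
con.execute("insert into students (studentID, name) values (2, 'Ben')")

# update matching rows, then delete one tuple
con.execute("update students set courseID = 20 where studentID = 2")
con.execute("delete from students where studentID = 1")

final = con.execute("select * from students").fetchall()
print(final)  # [(2, 'Ben', 20)]
```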
Selecting from multiple tables
In a from clause selecting from more than one table, ... from students, courses ..., the comma by default gives the cartesian product, or cross join.
Cross join
Cross join can also be explicit: students cross join courses.
Natural Join
Written R natural join S. A natural join can be computed by first cross joining \(R \times S\), then for each attribute in R with the same name as one in S, merging the two columns into one and removing any rows where the values don't match up:
\[
\begin{array} {|r|r|}\hline \textrm{StudentID} & \textrm{Name} \\ \hline 1 & Bob \\ \hline 2 & Alice \\ \hline 3 & Brian \\ \hline \end{array}
\textrm{ natural join }
\begin{array} {|r|r|}\hline \textrm{StudentID} & \textrm{CourseID} \\ \hline 1 & 1 \\ \hline 1 & 2 \\ \hline 2 & 2 \\ \hline 3 & 1 \\ \hline 3 & 2 \\ \hline \end{array}
\]
\[= \begin{array} {|r|r|}\hline \textrm{StudentID} & \textrm{Name} & \textrm{CourseID} \\ \hline 1 & Bob & 1 \\ \hline 1 & Bob & 2 \\ \hline 2 & Alice & 2 \\ \hline 3 & Brian & 1 \\ \hline 3 & Brian & 2 \\ \hline \end{array}\]
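The worked example above can be reproduced with sqlite3 (natural join merges the shared StudentID column; row order is not guaranteed, hence the sort):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table students (StudentID int, Name text)")
con.execute("create table enrolments (StudentID int, CourseID int)")
con.executemany("insert into students values (?, ?)",
                [(1, "Bob"), (2, "Alice"), (3, "Brian")])
con.executemany("insert into enrolments values (?, ?)",
                [(1, 1), (1, 2), (2, 2), (3, 1), (3, 2)])

rows = sorted(con.execute(
    "select * from students natural join enrolments").fetchall())
print(rows)
# [(1, 'Bob', 1), (1, 'Bob', 2), (2, 'Alice', 2), (3, 'Brian', 1), (3, 'Brian', 2)]
```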
(Regular) Join
Extension of natural join, with syntax R join S on <cond> for some condition.
If the operator in the cond (because you'll need one) is an =, this is an equi-join. The default join method is an inner join.
Inner join
An inner join between R and S is the same as
select * from R, S where R.myID = S.theirID;
i.e. it's like a natural join, except the two columns are not merged.
Theta join
A theta join is simply a join that is not an equi-join: the condition uses a comparator other than =.
Nesting joins
select *
from (
(students S inner join studentCourses C on S.studentID = C.studentID)
inner join courses D on C.courseID = D.courseID
);
Outer Joins
Inner joins lose info -- an inner join between students and studentCourses will discard all students without courses. An outer join will keep one side's tuples and fill in the other side's with null.
left outer join keeps the left side, right outer join keeps the right side, and full outer join keeps both.
Outer joins must come with a natural or on clause.
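The inner-vs-outer difference can be sketched with sqlite3 (tables made up for illustration; only left outer join is shown, since right and full outer joins need a recent SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table students (studentID int, name text)")
con.execute("create table studentCourses (studentID int, courseID int)")
con.executemany("insert into students values (?, ?)", [(1, "Bob"), (2, "Alice")])
con.execute("insert into studentCourses values (1, 7)")

# inner join discards Alice, who has no course
inner = con.execute(
    "select s.name, c.courseID from students s "
    "join studentCourses c on s.studentID = c.studentID").fetchall()
print(inner)  # [('Bob', 7)]

# left outer join keeps her, filling the right side with null
outer = sorted(con.execute(
    "select s.name, c.courseID from students s "
    "left outer join studentCourses c on s.studentID = c.studentID").fetchall())
print(outer)  # [('Alice', None), ('Bob', 7)]
```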
Subqueries
Subqueries are queries within queries - you can select from the result of a select, basically. Internal queries can also be aliased for convenience.
select tea
from likes, (
select drinker from frequents where tearoom = 'Joe''s Tea Room'
) JD
where likes.drinker = JD.drinker
Scalar Subqueries
If a query produces a 1x1 table -- i.e. a single value -- it is scalar. Scalar subqueries can be used in calculations and where conditions.
In
The keyword in asks if <tuple> in <relation>; not in is the opposite keyword. Both are used in where clauses.
select A1, A2 from testTable where A1 in (select A2 from testTable);
Set comparisons
in is essentially \(\in\) in set algebra. What if we want something like \(x < y \;\forall y \in Y\)? Well, we have the all, any, and some operators:
select ... from ... where <attr> <operation> all/any/some <subquery>
where the operation is a standard comparator operation. ALL matches \(\forall\), and ANY, SOME both match \(\exists\).
SELECT ProductName FROM Products
WHERE ProductID = ANY
(SELECT ProductID FROM OrderDetails
WHERE Quantity = 10);
Subqueries may need to refer to values of the outer query; such a subquery is run once per tuple of the main query.
Exists
We can test for emptiness with select attr from tableName where EXISTS ( <subq> );. Inner queries can use table aliases from outer queries, such as
select courseID, courseName from courses C
where exists (
select * from studentCourses D where C.courseID = D.courseID
);
Inner queries which use outer query tuples are correlated. A nested query is evaluated once per row of the outer query, and thus can use the scope of the outer query.
not exists obviously exists too (ironically enough).
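A correlated exists sketched with sqlite3 (courses and enrolment data made up for illustration); the subquery refers to C from the outer query, so it is conceptually evaluated once per course:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table courses (courseID int, courseName text)")
con.execute("create table studentCourses (studentID int, courseID int)")
con.executemany("insert into courses values (?, ?)",
                [(1, "Databases"), (2, "Logic"), (3, "Graphics")])
con.executemany("insert into studentCourses values (?, ?)",
                [(10, 1), (11, 1), (10, 3)])

# keep only courses that have at least one enrolled student
rows = sorted(con.execute("""
    select courseID, courseName from courses C
    where exists (select * from studentCourses D where C.courseID = D.courseID)
""").fetchall())
print(rows)  # [(1, 'Databases'), (3, 'Graphics')]
```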
Correlated versus not
If a subquery is independent from the outer one, the subquery is evaluated first. Otherwise it does the row thing.
Inserting with queries
Guess what, you can insert nested queries too.
insert into backup (copy1, copy2)
select E.source1, E.source2
from example E
where source1 > 2;
Set operations
Relations are sets (bags), and of course we can do set operations, with the keywords union, intersect, except being \(\cup, \cap, \setminus\) respectively.
The sets must be compatible, however, for it to work. You can't combine tables with different columns.
(select * from likes)
intersect
(select drinker, tea from sells join frequents using (tearoom))
Bags vs sets
Yeah so you saw the grey "Bags" earlier? Well SQL "sets" aren't actually sets, they're "bags", or "multisets", or whatever word mathematicians make up to avoid saying sets but with duplicate items.
select distinct rids us of the duplicates, and the set operations will too; but say you do have two tables with duplicates, and don't want to remove them when unioning. Well, that's where all comes in: union all will put all the rows together without removing duplicates.
You also have intersect all and except all, but things start to act a bit funny here.
Say we have the code (select person from A) except all (select person from B), where the first returns joe, bob, joe, tim, fred and the second fred, fred, tim, tim, joe; as a result you'd get joe, bob. Since there's two Joes in the first one, and only one Joe (Who's Joe?) in the second one, \(2-1=1\) Joe is left over.
It's a similar case with intersect all, which matches up duplicates and discards the leftovers.
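SQLite (used in the other sketches) doesn't implement except all or intersect all, so here is the bag arithmetic the notes describe, using collections.Counter as a multiset:

```python
from collections import Counter

a = Counter(["joe", "bob", "joe", "tim", "fred"])
b = Counter(["fred", "fred", "tim", "tim", "joe"])

# bag difference: per-element counts subtract, floored at zero
except_all = sorted((a - b).elements())
print(except_all)  # ['bob', 'joe']

# bag intersection: per-element minimum of counts
intersect_all = sorted((a & b).elements())
print(intersect_all)  # ['fred', 'joe', 'tim']
```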
Aggregation
Many queries involve computing aggregates, using sum(), avg(), count(), min(), max(), percentile(). These keywords apply to columns.
count() is special since it can take count(*), which counts all rows, and is also extended to count(distinct ...), which should be self-explanatory. The rest of these need an attribute.
select avg(price) from sells where tea = 'Earl Grey';
select studentID from studentCourseMarks
where courseID = 2 and mark > (
select avg(mark) from studentCourseMarks where courseID = 2
);
Behaviour with nulls
Nulls never contribute to a sum, average, or count, and can never be the max or min.
If the whole column is null, count() is 0, and min(), max(), sum() etc become null.
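The null rules above can be checked with sqlite3 (the sells data is made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table sells (tea text, price real)")
con.executemany("insert into sells values (?, ?)",
                [("Earl Grey", 2.0), ("Earl Grey", None), ("Assam", None)])

# count(*) counts rows; count(price), sum, avg all skip the null price
row = con.execute(
    "select count(*), count(price), sum(price), avg(price) from sells "
    "where tea = 'Earl Grey'").fetchone()
print(row)  # (2, 1, 2.0, 2.0)

# an all-null column: count is 0, sum becomes null
nulls = con.execute(
    "select count(price), sum(price) from sells where tea = 'Assam'").fetchone()
print(nulls)  # (0, None)
```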
Group by
group by follows a select-from-where; results are grouped according to the values of the specified attribute(s) (which must appear in the select list), and aggregates are applied per group.
select courseID, count(*) as students
from studentCourses group by courseID;
select tea, avg(price) from sells group by tea;
Limitations
If an aggregate is used, each selected attribute must be either aggregated or grouped by.
-- selecting earl grey at cheapest price
select teaRoom from sells
where (tea, price) = (
select tea, min(price) from sells
where tea = 'Earl Grey'
group by tea
);
Having
A having condition is a condition on the groups; only groups passing the condition are shown.
-- selecting courses with > 2 students
select sc.courseID, courseName, count(*) as students
from studentCourses sc, courses
where sc.courseID = courses.courseID
group by sc.courseID, courseName
having count(*) > 2;
having may refer to any tuple variable in from, and may refer to an attribute, as long as that attribute makes sense (i.e. it is being grouped by).
SQL is based on relational algebra (RA): a formal mathematical model operating over relations, tuples, and attributes.
Note that whilst SQL is declarative, RA is procedural: one operation at a time.
Since operators are formally defined, their properties are key to optimising execution on the back end.
The operators are (the /1 and /2 are to show unary and binary operators)
Example. For every project in Stafford, list the project number, controlling department number, department manager's last name, address, and birthdate:
select Pnumber, Dnum, Lname, Address, Bdate
from (
(
(
select * from PROJECT
where Plocation = 'Stafford'
) join DEPARTMENT on Dnum = Dnumber
) join EMPLOYEE on Mgr_ssn = Ssn
);
In relational algebra, this is \[\pi((\sigma(P) \bowtie D) \bowtie E)\] Or unabbreviated, \[\pi_{\textrm{P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate}}((\sigma_{\textrm{P.Location = 'Stafford'}}(P) \bowtie_{\textrm{P.Dnum=D.Dnumber}} D) \bowtie_{\textrm{D.Mgr_ssn = E.Ssn}} E)\]
Selection
\[\sigma_{\textrm{condition}}(R)\] The condition consists of <attrname> <comparator> <constant/other attr name>.
This is applied independently to each tuple.
Select is commutative: \(\sigma_a(\sigma_b(R)) = \sigma_b(\sigma_a(R))\), and chaining them is just the same as using logical and: \(\sigma_a(\sigma_b(R)) = \sigma_{a \land b}(R)\).
Projection
\[\pi_{\textrm{attr list}}(R)\] The attribute list contains the desired attributes (columns) of the relation.
If there are no keys in this list, in SQL you may get duplicates, but in RA duplicates are illegal since we're using set maths.
If \(l_1 \supset l_2\) then \(\pi_{l_1}(\pi_{l_2}(R))\) is illegal since you're projecting nonexistent columns.
Renaming
Expression order
We've seen single expressions: \(\pi_{\textrm{Fname, Lname, Salary}}(\sigma_{Dno = 5}(\textrm{EMPLOYEE}))\)
We can also chain expressions: \begin{align} & \textrm{DEPS\_EMPS} \longleftarrow \sigma_{Dno = 5}(\textrm{EMPLOYEE}) \\ & \textrm{RESULT} \longleftarrow \pi_{\textrm{Fname, Lname, Salary}}(\textrm{DEPS\_EMPS}) \end{align}
Set operations
Two relations \(R(a_1 \dots a_n)\) and \(S(b_1 \dots b_n)\) are type compatible if they have the same number of attributes \(n\) and \(dom(a_i) = dom(b_i) \; \forall i \in [1..n] \).
Cross join
Cartesian product of sets.
Join
Instead of cross join and select, we can join on a condition \(R \bowtie_{\textrm{cond}} S\).
An equi-join is a join where the condition uses an =.
Natural Join
An extension of join is natural join, represented by the star \(R \ast S\).
Completeness
Not all of these operators are needed to do everything, in fact the minimal complete set of operators is \(\{\sigma, \pi, \rho, \cup, -, \times\} \) (minus is set minus).
Division
If we have \(R \longleftarrow S \times T\), then we must have \(R \div S = T\)... right?
Yeah, but it's a bit weird. Suppose S is students, T is topics, and R is something like "researches". Then \(S \times T\) will pair each student with all topics, or each topic with all students. \(R \div S\) "extracts" those topics that all students research. Even if we add more rows onto R to make \(R'\), then \(R' \div S\) would still only extract topics that all students research.
We define this as universal quantification, and T is the universal quantifier of all values of S.
Let \(R(Z)\) be the fat relation and \(S(X)\) be the skinny relation, such that \(X \subseteq Z\), and let \(Y = Z - X\).
\(R(Z) \div S(X)\) is a relation \(T(Y):\) tuple \(t \in T(Y)\) iff \(\forall t_S \in S\; \exists t_R \in R :\) \(t_R[Y] = t\) and \(t_R[X] = t_S\).
Alternatively, we can define \(T \longleftarrow R \div S\) as \begin{align} T_1 &\longleftarrow \pi_Y(R) \\ T_2 &\longleftarrow \pi_Y((S \times T_1) - R) \\ T & \longleftarrow T_1 - T_2. \end{align}
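A minimal sketch of division over plain Python sets, following the \(T_1/T_2\) construction above; the researches/students example data is made up for illustration:

```python
# R: set of (y, x) pairs (topic, student); S: set of x values (students).
# T1 = pi_Y(R); T2 = pi_Y((S x T1) - R); result T = T1 - T2.
def divide(R, S):
    T1 = {y for (y, x) in R}
    T2 = {y for y in T1 for x in S if (y, x) not in R}
    return T1 - T2

researches = {("logic", "ann"), ("logic", "bob"), ("graphics", "ann")}
students = {"ann", "bob"}

# only "logic" is researched by every student; bob doesn't do graphics
print(divide(researches, students))  # {'logic'}
```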
Generalised Projection
\(\pi_{F_1 \dots F_n} (R)\) where \(F_1 \dots F_n\) are functions on R: can do maths, constants, etc.
Aggregate Functions
\(_{\textrm{grouping attrs}} \mathfrak{I} _\textrm{function list} (R)\): the notation is a fraktur I.
The grouping attributes are attributes in R to group by, and the function list is a list of (function, attribute) pairs; the functions are usually sum(), avg(), max(), min(), count(), etc.
\[ _\textrm{Dno} \mathfrak{I} _\textrm{COUNT Ssn, AVERAGE Salary} (\textrm{EMPLOYEE}) \]
Outer joins
Join bowties with what looks like equals coming out of one (or both) sides: left ⟕ right ⟖. (These are unavailable in base mathjax)
As if one formal system wasn't enough, here's another.
yay...
Predicates and Propositions
Relational Calculus is based on predicates and propositions.
A predicate is a declarative sentence: a statement. Predicates must be verifiably true or false. They specify retrieval requests, but only what to retrieve, not how.
True or false statements depend on presence within the database.
Variables in Predicates
We can have a predicate with variables: "Person \(p\) plays instrument \(i\).", and these variables can be replaced: "Person Megan plays instrument \(i\)."
A predicate is instantiated when all its parameters are filled; it then becomes a proposition.
"Person Megan plays instrument cello." is now a proposition.
Intension and Extension
The intension of a predicate is the predicate's "meaning": the properties and quantities associated with a statement.
The extension of a predicate is the set of all instantiations where the predicate is true.
Operations
Propositions can be composed with others with logical and first order operators \(\land, \lor, \lnot, \forall, \exists\).
Sets
Let \(P(x)\) be a predicate. If \(a\) is an instance of \(x\), and \(P(a) = \top\), \(a\) satisfies \(P\).
\(P(x)\) is the membership predicate for the set of all satisfactory objects \(a\).
Suppose \(P(y) = y \in \mathbb{Z} \land 5 < y < 10\). The related set is \(\{6,7,8,9\} \).
These sets can also be composed with set operators.
Relational Calculus -> Relational Model
A predicate can be thought of as a schema. The predicate defines what's true, whilst the schema defines attributes and domains for valid data.
A proposition can be thought of as a tuple. Propositions are values substituted into the parameters of a predicate; if they satisfy it, they are valid in the schema.
The extension is the set of all valid states of the schema (the set of all predicate instantiations)
We work with a closed world assumption: a tuple is true if and only if it is in the database.
Queries
RC can be used to express queries declaratively, and there are two types:
Tuple Relational Calculus (TRC) and Domain Relational Calculus (DRC).
TRC
Statements in RC are declarative.
To define queries, we need to define 3 more things:
DB Queries
A standard TRC query looks like \[\{t_1\cdot A_1, t_2 \cdot A_2, \dots, t_n \cdot A_n | \textrm{cond}(t_1 \dots t_n)\}\] Where cond is a true/false formula, \(t_1 \dots t_n\) are free variables.
The tuple variables range over all possible tuples, so must specify explicitly which relation they come from.
An atomic formula is of form
Compound Formulas are of form \begin{matrix} &\lnot F, &F_1 \lor F_2, &F_1 \land F_2, &(\forall t)F, &(\exists t)F \end{matrix}
Examples
Because the formal definition is kinda confusing.
Example. Names of employees who earn more than $50k annually. \[\{t.Fname, t.Lname | EMPLOYEE(t) \land t.salary > 50000\} \]
Example. Now we want the employees' full names and their supervisors. \[\{t.Fname, t.Lname, s.Fname, s.Lname | EMPLOYEE(t) \land EMPLOYEE(s) \land t.Superssn = s.Ssn\} \]
Quantification binds variables
We can bind variables with first order logic quantifiers: \[(\forall t)((\exists v)(PROJECT(v) \lor EMPLOYEE(t) \lor v.Ssn = t.Ssn))\]
Example. List names and addresses of all employees who work for research. \begin{align} \{t.Fname, &t.Lname, t.Address | \\ &EMPLOYEE(t) \land \\ &(\exists d)(DEPARTMENT(d) \land d.Dname = \textrm{'Research'} \land d.Dnumber = t.Dno) \\ \} \end{align}
Universal Quantification
Note that implies doesn't exist, so \(a \implies b\) must become \(\lnot a \lor b\).
Example. We want names of all employees who work on a project controlled by department 5.
Start with \(\{e.Fname, e.Lname | EMPLOYEE(e) \land F(e)\} \) for some F.
Where F \(\implies\) "\(e\) works on all projects controlled by dept 5"
\(\implies\) "all projects in dept 5, and involves \(e\)"
\(\implies (\forall p)(PROJECT(p) \implies \dots)\), which we can't do since imply doesn't exist
\(\therefore (\forall p)(\lnot PROJECT(p) \lor (\lnot (p.Dnum = 5) \lor \dots))\)
The second round of dots wants "e works on p", \[(\exists w)(WORKS\_ON(w) \land w.Essn = e.Ssn \land w.Pno = p.Pnumber)\] Thus finally getting \begin{align} \{e.Fname, e.Lname |\; &EMPLOYEE(e) \land \\ &(\forall p)(\lnot PROJECT(p) \lor \lnot (p.Dnum = 5) \\ &\quad \lor (\exists w)(WORKS\_ON(w) \land w.Essn = e.Ssn \land w.Pno = p.Pnumber))\} \end{align}
Safety first
An RC expression is safe if it is guaranteed to return a finite number of tuples.
Achieved when everything is explicitly mentioned. An unsafe expression would be like \(\{t | \lnot EMPLOYEE(t) \} \). Safe expressions are reducible to RA.
Introduction
Whereas in TRC variables range over tuples, in DRC variables range over attribute domains. Again the notation is similar:
\[\{x_1 \dots x_n | \textrm{cond}(x_1 \dots x_n)\}\]
Attributes
Atomic formulas are like
Examples
Example. List birthdate and address of the employee whose name is John Smith. \begin{align} \{u, v &| (\exists q)(\exists r)(\exists s)(\exists t)(\exists w)(\exists x)(\exists y)(\exists z)( \\ &EMPLOYEE(qrstuvwxyz) \land q = \textrm{'John'} \land r = \textrm{'B'} \land s = \textrm{'Smith'} \\ )\} \end{align}
Now look at those ugly unnecessary quantifiers. For ease, we say variables not in the result are implicitly existentially quantified.
Codd's Theorem
Theorem. Relational algebra and relational calculus are equally expressive languages.
Languages that are equal in expressive power to relational algebra are relationally complete.
DB Design
We want to develop a good database model so we don't end up storing unmaintainable garbage.
(1) being easy to understand
Each row should describe one entity, and refer to other entities only via foreign keys.
(2) Not redundant
Anomalies
Anomalies are when the things in the database don't line up.
Update anomalies happen on updates. If we change "Computing" to "Computer Science", but "Computing" is redundantly stored in multiple places, then you might end up with some rows saying "Computing" and some "Computer Science".
Insert anomaly: suppose students and courses are in the same table. We can't insert a course without a student being on it.
Delete anomaly: if a course is deleted, we could lose all students on said course. Again, bad.
i.e. we want to design DBs such that no anomalies are present (whether through triggers, good tables, etc.)
(3) No Nulls
Nulls are unhelpful. They waste space and make operations complicated. Could occasionally be useful though.
If there are a lot of nulls, why not move it into a different table?
Decomposition
Decomposing a DB into smaller tables is generally helpful (to remove redundancy), but we must be able to rejoin them without information loss. Keys must be used.
Functional Dependencies
See the definition in Constraints. FDs are key to making a database good.
Normalisation theory
We want to structure a database so it is understandable, logical to retrieve data from, and free of anomalies. Thus we have levels of normalisation ("normal forms"); the higher the better.
Universal relation
Start with a universal relation: one with all attributes. Progressively reduce redundancy to normalise.
When we split the tables, we need to be able to rejoin them with lossless/non additive joins, and preserving the dependencies.
Normal Forms
In terms of normal forms, we have 4 "layers": 1st normal form (1NF), 2NF, 3NF, and "Boyce-Codd Normal form" BCNF.
Of course 4NF, 5NF, etc exist but usually 3NF or BCNF suffices. If a schema is 3NF, it is also 2NF and 1NF: cascade.
Decomposing
Not all decompositions are good: seek those whose rejoin creates no spurious tuples, and, if possible, those that preserve all dependencies (this is not always achievable, nor always strictly necessary).
Lossless Join Decomposition LJD
Given a schema \(R\), and a set \(F\) of FDs over \(R\), \(\{R_1 \dots R_k\}\) is an LJD of \(R\) if for every legal instance of \(R\), \[\pi_{R_1} (R) \bowtie \pi_{R_2} (R) \bowtie \cdots \bowtie \pi_{R_k} (R) = R.\]
Example. This is an invalid decomposition: \[ \begin{array}{ccc} A & B & C \\ \hline 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 2 & 8 \end{array} \longrightarrow R_1: \begin{array}{cc} A & B \\ \hline 1 & 2 \\ 4 & 5 \\ 7 & 2 \end{array} ,\;\; R_2: \begin{array}{cc} B & C \\ \hline 2 & 3 \\ 5 & 6 \\ 2 & 8 \end{array} \] because \[ R_1 \bowtie R_2 = \begin{array}{ccc} A & B & C \\ \hline 1 & 2 & 3 \\ 1 & 2 & 8 \\ 4 & 5 & 6 \\ 7 & 2 & 3 \\ 7 & 2 & 8 \end{array} \] of which rows 2 and 4 are spurious.
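The rejoin in this example can be checked mechanically; a small pure-Python sketch that projects the relation onto \((A,B)\) and \((B,C)\), joins them back on the shared attribute, and exposes the two spurious tuples:

```python
# Recompute the natural join of the decomposed tables from the example
# (joining on the shared attribute B) to show that two spurious tuples
# appear that were not in the original relation.
R = {(1, 2, 3), (4, 5, 6), (7, 2, 8)}
R1 = {(a, b) for (a, b, c) in R}          # projection onto (A, B)
R2 = {(b, c) for (a, b, c) in R}          # projection onto (B, C)

joined = {(a, b, c) for (a, b) in R1 for (b2, c) in R2 if b == b2}
print(sorted(joined))        # 5 tuples: the 3 originals plus 2 spurious
print(sorted(joined - R))    # the spurious tuples: (1, 2, 8) and (7, 2, 3)
```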
Implied FDs
Given R, the closure (cover) \(F^+\) of a set of FDs \(F\) is the set of all implied FDs from \(F\).
Implied FDs can be derived using the following properties (Armstrong Inference Rules):
Reflexivity: if \(Y \subseteq X\), then \(X \implies Y\).
Augmentation: if \(X \implies Y\), then \(XZ \implies YZ\).
Transitivity: if \(X \implies Y\) and \(Y \implies Z\), then \(X \implies Z\).
Additional Rules
Union: if \(X \implies Y\) and \(X \implies Z\), then \(X \implies YZ\).
Decomposition: if \(X \implies YZ\), then \(X \implies Y\) and \(X \implies Z\).
Pseudotransitivity: if \(X \implies Y\) and \(WY \implies Z\), then \(WX \implies Z\).
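Applying these inference rules by hand is tedious; the standard way to mechanise them is the attribute-closure fixed-point computation. A minimal sketch (function and variable names are my own):

```python
def attribute_closure(attrs, fds):
    """Closure of a set of attributes under a set of FDs.

    fds is a list of (lhs, rhs) pairs of attribute sets; the result is
    every attribute derivable from attrs via Armstrong's rules.
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left-hand side, we also
            # have everything it determines.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# Example: A -> B and B -> C together imply A -> C (transitivity).
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(attribute_closure({"A"}, fds)))  # ['A', 'B', 'C']
```

An FD \(X \implies Y\) is in \(F^+\) exactly when \(Y\) is a subset of the closure of \(X\).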
Test for non-additive joins
Given a relation R and a set of FDs F, \(\{R_1, R_2\}\) is a lossless decomposition of R under F if and only if \begin{align} &R_1 \cap R_2 \implies (R_1 - R_2) \\ \textrm{or }&R_1 \cap R_2 \implies (R_2 - R_1) \end{align} is in \(F^+\).
Example. Take the tables from the previous example: \[ \begin{array}{ccc} A & B & C \\ \hline 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 2 & 8 \end{array} \longrightarrow R_1: \begin{array}{cc} A & B \\ \hline 1 & 2 \\ 4 & 5 \\ 7 & 2 \end{array} ,\;\; R_2: \begin{array}{cc} B & C \\ \hline 2 & 3 \\ 5 & 6 \\ 2 & 8 \end{array} \] We know that \begin{align} R_1 \cap R_2 &= B \\ R_1 - R_2 &= A \\ R_2 - R_1 &= C \end{align} Does B -> A or B -> C? The answer is no: B = 2 appears with two different A values (and two different C values), so neither holds, and this is an invalid decomposition.
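Whether an FD holds in a concrete table instance can be checked by looking for two rows that agree on the left-hand side but disagree on the right. A small sketch applied to the example data (helper name is my own):

```python
def fd_holds(rows, header, lhs, rhs):
    """True iff the FD lhs -> rhs holds in this table instance."""
    seen = {}
    for row in rows:
        key = tuple(row[header.index(a)] for a in lhs)
        val = tuple(row[header.index(a)] for a in rhs)
        if key in seen and seen[key] != val:
            return False  # two rows agree on lhs but differ on rhs
        seen[key] = val
    return True

header = ("A", "B", "C")
rows = [(1, 2, 3), (4, 5, 6), (7, 2, 8)]
print(fd_holds(rows, header, ["B"], ["A"]))        # False: B=2 maps to A=1 and A=7
print(fd_holds(rows, header, ["B"], ["C"]))        # False: B=2 maps to C=3 and C=8
print(fd_holds(rows, header, ["A"], ["B", "C"]))   # True: A is a key here
```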
Dependency Preservation
Take a relation R with a set of FDs, if it is decomposed into \(R_1, R_2, \dots\), we need to make sure the FDs hold for all tables (they can be observed independently without crossing tables).
Otherwise, if tables need to be crossed, then insertion becomes very expensive (as a join would be needed per insertion)
Example. Take R(city, streetNo, postcode) where {city, streetNo} -> postcode, and postcode -> city.
We can decompose this into \(R_1\)(streetNo, postcode) and \(R_2\)(city, postcode).
To check losslessness: \(R_1 \cap R_2 =\) postcode, \(R_1 - R_2 = \) streetNo, and \(R_2 - R_1 = \) city. \(R_1 \cap R_2 \not \implies R_1 - R_2\), but \(R_1 \cap R_2 \implies R_2 - R_1\) (postcode -> city), so the join is lossless.
However, we cannot check the FD {city, streetNo} -> postcode within a single table, thus dependencies are not entirely preserved.
Decomposition Quality
The better the two criteria are met, the better the decomposition. If we have R(A, B, C) with A -> B and B -> C, then decomposing into \(R_1\)(A, B) and \(R_2\)(B, C) is lossless and preserves both FDs, whereas \(R_1\)(A, B) and \(R_2\)(A, C) is lossless but fails to preserve B -> C.
Finally, Normalisation
Finally, normalisation is the process of decomposing to reduce redundancy, such that the decomposed tables join losslessly. If a decomposition meets the criteria of a normal form, it is said to be in that normal form.
Level 0
Data is converted into a table in the dumbest way possible.
Level 1 (1NF)
Every attribute value must be atomic: no multi-valued or nested fields (a single field holding multiple tags would not count). We still have one wide table, with storage anomalies. Still can get key troubles.
Level 2
Where every non-key attribute is fully dependent on every key -- no attribute depends on a proper subset of a key, so
\(\forall\)key \(k\) in R, and \(\forall\)non-key attribute \(A\) in R, \(k \implies \{A\} \) must be irreducible.
This improves integrity, prevents most anomalies.
When we say reducible: say we have a relation where \(\{A,B\} \implies C,\; A \implies D\). Since \(A \in \{A,B\}\), the dependency \(\{A, B\} \implies D\) is reducible (D depends on only part of the key), and this is not 2NF.
Level 3
In 2NF, and with no transitive functional dependency from a key to a non-key attribute.
That is, none of \(R(\underline{A}, B, C) : A \implies B,\; A \implies C,\; B \implies C\), because \(B \implies C\) means \(A \implies B \implies C\) which is a transitive FD.
Level 3+ (BCNF)
A relation is in BCNF if, whenever a non-trivial FD \(X \implies A\) holds, \(X\) is a superkey.
Most 3NF relations are automatically BCNF, but not always: a 3NF relation fails BCNF when some non-trivial functional dependency has a determinant that is not a superkey.
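The BCNF condition can be checked mechanically with attribute closures: a determinant is a superkey exactly when its closure is the whole relation. A sketch using the postcode relation from earlier (function names are my own):

```python
def closure(attrs, fds):
    """Attribute closure via the usual fixed-point loop."""
    out, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= out and not rhs <= out:
                out |= rhs
                changed = True
    return out

def is_bcnf(relation, fds):
    # Every FD must be trivial, or its determinant must be a superkey.
    return all(rhs <= lhs or closure(lhs, fds) >= relation
               for lhs, rhs in fds)

R = {"city", "streetNo", "postcode"}
fds = [({"city", "streetNo"}, {"postcode"}), ({"postcode"}, {"city"})]
print(is_bcnf(R, fds))  # False: postcode is a determinant but not a superkey
```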
"Each attribute must represent a fact about the key, the whole key, and nothing but the key, so help me Codd"
The ideal goal is BCNF with lossless join and dependency preservation. If this is impossible, go for BCNF with lossless join, or just 3NF.
CIA (sus)
In computer security \(\exists\) CIA triangle: confidentiality, integrity, availability, none of which we want to lose.
DB software usually comes with access and inference control to help manage this. The power rests with the "governor", the DB administrator (DBA), who gets access to everything: creating accounts, granting and revoking privileges, and assigning security levels.
There is often an audit log in a DB to log changes made.
Discretionary Access Control DAC
DAC grants and revokes privileges on parts of the database, and on certain SQL functions. Access can be as fine as relation-level or even tuple-level.
The "Access Matrix Model" stores data on which accounts can access which relations or tuples. Each relation has an owner, who holds all privileges on it. We can run CREATE SCHEMA name AUTHORIZATION your_mum to make your_mum the owner of schema name.
Privileges can also be given with GRANT (at fine detail too, such as granting SELECT but not INSERT), removed with REVOKE, and the grantee can pass their privileges on to others if they were granted WITH GRANT OPTION.
Some DBs limit this propagation, either horizontally (A can grant only to X many people) or vertically (nesting grants can only go Y "generations" down).
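The access-matrix bookkeeping behind GRANT propagation can be sketched in a few lines; this is a toy model (all names and the API are hypothetical), not a real DBMS's implementation:

```python
from collections import defaultdict

# privileges[account][relation] is a set of (right, has_grant_option) pairs.
privileges = defaultdict(lambda: defaultdict(set))

def grant(granter, grantee, relation, right, grant_option=False):
    # 'dba' is treated as the owner of everything here; anyone else
    # must hold the right WITH GRANT OPTION to pass it on.
    if granter != "dba" and (right, True) not in privileges[granter][relation]:
        raise PermissionError(f"{granter} cannot grant {right} on {relation}")
    privileges[grantee][relation].add((right, grant_option))

grant("dba", "alice", "staff", "SELECT", grant_option=True)
grant("alice", "bob", "staff", "SELECT")      # allowed: alice has the option
try:
    grant("bob", "carol", "staff", "SELECT")  # bob lacks the grant option
except PermissionError as e:
    print(e)
```

Horizontal or vertical propagation limits would add counters to this chain of grants.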
Mandatory Access Control MAC
More fine-grained than the "all or nothing" of DAC.
Most DBMSs have DAC only, but special corporate/military versions will have MAC, which has things like a security hierarchy:
(t) Top Secret > (s) Secret > (c) Classified > (u) Unclassified
And each user has a "clearance" level, that is, their security classification. The main principle of this is no read up, no write down: a user cannot read (and thus cannot rewrite) data classified above their clearance, and cannot write, but can read, data classified below their clearance.
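The "no read up, no write down" rule amounts to two comparisons on an ordered set of levels; a minimal sketch (level codes follow the hierarchy above, function names are my own):

```python
# Clearance hierarchy: unclassified < classified < secret < top secret.
LEVELS = {"u": 0, "c": 1, "s": 2, "t": 3}

def can_read(clearance, classification):
    # No read up: may only read at or below your own level.
    return LEVELS[clearance] >= LEVELS[classification]

def can_write(clearance, classification):
    # No write down: may only write at or above your own level.
    return LEVELS[clearance] <= LEVELS[classification]

print(can_read("c", "s"))   # False: classified user cannot read secret data
print(can_read("c", "u"))   # True
print(can_write("c", "u"))  # False: would leak classified info downwards
print(can_write("c", "s"))  # True
```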
Even different attributes inside a row can have different securities. For example we could have John (u), Smith (c), 140 000 (c), Fair (s), (u)
(columns first name, surname, salary, standing, overall classification) where the letters in brackets denote the classification of each single attribute.
Then, a user with insufficient clearance will just see NULLs.
This may cause anomalies, however. If a classified user wants to update John Smith's standing (secret) to "Excellent", the database must not reject the change: to that user, John Smith's standing reads as NULL, so a rejection would reveal that a hidden value exists. Instead the DB stores a second version of the row (polyinstantiation), and so we get duplicate rows.
Role Based RBAC
Another system; this one works like Discord roles, with a hierarchy of roles.
Statistical DBs
Certain DBs exist for people to query aggregate data (i.e. population of males in Upper Slaughter), but not to retrieve personal information from. This is obviously done through permissions.
Problem is, increasing specificity can lead to a bad actor extracting personal information anyway, by querying data so specific that it pinpoints to very few people.
We can mitigate this in a few ways, one of which is by limiting the count, e.g. rejecting any query whose result would cover fewer than 30 people.
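A query-size restriction of this kind is just a gate in front of the aggregate; a toy sketch (data and names invented for illustration):

```python
# Aggregate queries over a result set smaller than a threshold are
# refused, so a query cannot be narrowed down to pinpoint individuals.
MIN_COUNT = 30

def safe_count(rows, predicate):
    n = sum(1 for r in rows if predicate(r))
    if n < MIN_COUNT:
        raise ValueError("query refused: result set too small")
    return n

people = [{"sex": "M", "village": "Upper Slaughter"}] * 3 \
       + [{"sex": "M", "village": "Lower Slaughter"}] * 40

print(safe_count(people, lambda r: r["village"] == "Lower Slaughter"))  # 40
try:
    safe_count(people, lambda r: r["village"] == "Upper Slaughter")     # only 3
except ValueError as e:
    print(e)
```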
Another is by adding noise, which introduces plausible deniability. Say you have a DB of men who've had STIs, and want to hide personal data whilst still being able to track population metrics. Every time you record someone, you could flip a coin, and if tails, record "had STI" as true regardless of the truth - false data as noise.
This way, the reported rate is 50% plus half the true rate, so you can still track the population percentage (true rate = 2 × (reported rate − 50%)), whilst leaving each person plausible deniability that they were just entered as a false result for noise.
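This coin-flip scheme (a form of randomised response) is easy to simulate, which also checks the recovery formula; a sketch with an invented true rate of 10%:

```python
import random

# With probability 1/2 record "True" regardless of the truth, otherwise
# record the truth. The reported rate is then 1/2 + p/2, so the true
# rate p is recovered as 2 * (reported - 1/2).
random.seed(0)
true_rate = 0.10
n = 200_000

reports = []
for _ in range(n):
    truth = random.random() < true_rate
    coin_tails = random.random() < 0.5
    reports.append(True if coin_tails else truth)

reported = sum(reports) / n
estimate = 2 * (reported - 0.5)
print(round(reported, 3), round(estimate, 3))  # reported ~0.55, estimate ~0.10
```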
Flow control
A flow from X to Y is when a program reads from X, and writes to Y.
Flow control essentially prevents higher classification reads being written to lower classification tables.
SQL Injection
The "never use unchecked string formatting" rule of writing an SQL app: attackers can specifically craft their inputs to manipulate statements and extract data they should never have been able to see.
Always sanitise inputs / use prepared statements. Don't assume data is safe.
Give error messages just informative enough to tell what's wrong, but just vague enough to give no useful information: "The username or password is incorrect".
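The difference between string formatting and a prepared statement can be shown with sqlite3 and the classic `' OR '1'='1` payload (table contents invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, secret TEXT)")
con.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

evil = "nobody' OR '1'='1"

# Vulnerable: attacker input is pasted straight into the SQL text,
# turning the WHERE clause into a tautology.
unsafe = con.execute(
    f"SELECT * FROM users WHERE name = '{evil}'").fetchall()
print(unsafe)   # leaks ('alice', 'hunter2')

# Safe: the driver binds the value, so it is compared as a plain
# string and matches nothing.
safe = con.execute(
    "SELECT * FROM users WHERE name = ?", (evil,)).fetchall()
print(safe)     # []
```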