Home

Blog

About

Donate

CS258

Database Systems

Introduction

Relational Databases and Constraints
SQL
Relational Algebra
Relational Calculus
Database Design, Normalisation and Security

Introduction

Most of human activity is data driven or generating. This data comes in two main types: structured (tabulated, nice to process) and unstructured (hard to tabulate, like images, letters, web pages). There is also semi-structured, such as XML/JSON based data.

Traditional databases are structured and alphanumeric. Newer techologies can handle unstructured data, but often in a tabular "wrapping".

Data modelling

We need to consider

What data items do we care about (e.g. students)
What attributes of said data items (e.g. ID, Name, Year)
Relationships between data items and attributes of said relations

Relational Databases are table-centric, and contain "relationships" (linked attributes) between different tables to remove duplication difficulties one gets with a single table.

Principal functions of database systems

DB Management Systems are software which manages data (as name implies). It lets us model the data; query, update, and store data; and secure and ensure the integrity and consistency of stored data.

Integrity is maintained in cases of unauthorised access, as well as updates to the database (which may violate constraints), concurrent transactions, failure or natural disaster, etc, etc.

In cases of natural disaster, this is not as much guaranteed.

The software optimises data access, by creating and using indexes, finding the best way to execute a set of operations, and the best order of relational data operations (SELECT, PROJECT, JOIN, etc).

The interface to a DBMS is often through a declarative language (like SQL), which lets one easily use database functions.

3-schema architecture

External views

Different users may not want or need access to different parts of the database, so external views are different per user group. You could think of the difference between a customer's interface vs an employee's.

Other definitions

The Schema is the collection and relations between tables.

Normalisation is a process of removing anomalies from tables.

Relational Database Model and Constraints

The Relational Database
Constraints

The Relational Database

Informal Definition

A relation is a table of values. It contains a set of rows, called tuples, and a set of columns.
Elements in rows represent certain facts of an entity, and column headers denote the atrributes of interest we store.

A relational database then is a database built out of relations, with known table names and attribute (name, type) pairs.

Formulating queries involves specifying the table and attribute name(s) of interest, and setting constraints (predicates) that need to be satisfied for interest.

Formal definition

The Schema (description) of a relation:

Is denoted $R(A_1, A_2, ..., A_n)$ where R is the name and $A_i$ are attributes.
Ex. Customer(id, name, address, phoneNo)
Every attribute has a domain, a set of valid values.
A tuple is an ordered* set of values, enclosed in <angle brackets>.
A relation however is an unordered set of tuples.
A domain is a logical definition in the real world
Ex. BritishPhoneNumbers are all 11 digit phone numbers in the UK.
A domain also has a data format
Ex. US phone numbers have (ddd)dddd-ddd format (d for digit)
Attribute name designates the "role played" - the same domain can apply to multiple attributes. The names of said attributes determine what they are.
The relation state is the set of all tuples in a relation (i.e. all the actual data in the relation at that time)
A "subset of the cartesian product of all domains",

* Technically attributes are unordered, but for human readability sake we make it ordered.

Given the schema $R(A_1, A_2, ..., A_n)$,

$r(R)$ is a specific state (instance) of R.
$r(R) = \{t_1, t_2, ..., t_m\} : t_j$ is an n-tuple (n attributes).
$\therefore t_j = < v_1, v_2, ..., v_n> : v_j \in \textrm{domain}(A_j)$
And $r(R) \subset \textrm{domain}(A_1) \times \textrm{domain}(A_2) \times ... \times \textrm{domain}(A_n)$

The purpose of all this is to present a (relatively) human-understandable format, with the goal to simplify how to interrogate (query) and maintain the database.

On schemas

The relation schema is the relation name and attribute list (Tea(name: string, manuf: string) )
The Database schema is the set of relationship schemas.

The relation schema is defined during table creation, for example with creating a (tea) drinkers' table:

CREATE TABLE Drinkers (
	name CHAR(30) Primary Key,
	addr CHAR(50) DEFAULT "0 Null Island",
	phoneNo CHAR(16) NOT NULL
);

Primary Key and NOT NULL are constraints.

About the Relational Model

The order of tuples is not important
The order of attributes is not important, but we do assume order
A collection of relationship schemas is the database schema - intention
Corresponding relation states is a relational database - extension
The metadata includes the schema and other key information about tables.

Why do we use relational databases

It's a fairly simple abstract model, but also comes with declarative languages like SQL.

Note that the abstract model is set based, but SQL itself is bag-based (bags are multisets, bags can have duplicate elements whilst sets cannot).

Constraints

Constraints exist to determine permissible states.

We have domain constraints (values in attributes must be $\in$ domain, or null if permitted); and 3 explicit schema-based constraints:

Key constraints
Entity integrity
Referential integrity

Key constraints

The Superkey of a relation R is the subset $SK \subseteq R :$ in any valid $r(R)$, for any different tuples $t_1, t_2;\;$ $t_1[SK] \neq t_2[SK] $. Superkeys are unique.

Ex. Take the relation R(name, country, phone), the superkey SK{country, phone} means every row must have a unique country, phone combination.

Minimal Key

The Candidate Superkey of R is a minimal superkey. That is, if removing any attribute from the cand. key makes it not a key.

A relation may have several candidate keys, one can arbitrarily determine one to be the primary key.

Entity integrity

Entity integrity basically deals with nulls where they shouldn't be.

Primary keys must never be null (they are identifiers), and other entries can be set to not null.

Referential Integrity

Referential Integrity deals with inter-table references.

For example, we have a students table, which has student ID as a primary key. We also have a table of courses which students do, with student ID as foreign keys. If we delete a student, the foreign key for that student will be referening nothing, and this is an integrity violation.

A set of attributes $FK$ from relation $R_1$ is a Foreign Key that references $R_2$ if the attributes of $FK$ in $R_1$ is the same as the Primary Key $PK$ in $R_2$. Foreign keys must either refer to the $PK$ of $R_2$, or be NULL.

Foreign keys are also defined during table creation.

Semantic attribution integrity

We also have Semantic attribution integrity, which in short comprise validation checks that the attributes make sense, such as Age > 0.

Constraints and updates

On an update operation, integrity constraints may be violated. The DB management system is there to ensure that these are caught, and corrected, rejected, or reported.
Some possible violations that may occur:

On INSERT:
- Domain: if attribute values are not in domain
- Key: if key already exists
- Foreign Key: if FK value references nonexistent PK
- Entity Integ.: NULL when not null specified
On DELETE:
- Ref. Integ. (only): PK being deleted causes a FK to lose reference
on UPDATE:
- Domain and Entity integ. violations
- Ref. Integ.: updatng PK or FK

Functional dependencies

Functional dependencies (FDs) are useful, since they are formal statements which allow derivation of good database design.

Let X, Y, Z be sets of attributes, and A, B, C be single attributes, where $ABC \implies \{A, B, C\}$.

$X \longrightarrow Y $ is an assertion about relation R and 2 sets of attributes from R, which constrains all possible tuples in $r(R)$. If $X \longrightarrow Y$, values of Y depend on values of X. For any two tuples $t_1, t_2 \in R$, \[t_1[X] = t_2[X] \implies t_1[Y] = t_2[Y]\] (Injunctive function)

FDs and Keys

Claim. if K is a superkey, then $K \longrightarrow A\; \forall$ other subsets $A \in R$

Proof. Assume $K \longrightarrow A, K \longrightarrow B$ for some $A, B \in R$. This violates key constraints, as keys must be unique for all tuples on R. $\Box$

The reverse is also true, making if and only if.

Determining keys

(1) We can assert a key. Say "this attribute will be the key." This makes it our only functional dependency by definition.

(2) Assert functional dependencies. If we have a set of functional dependencies, then we have our key.

FDs can come from keys or physical limits, e.g. no two lectures can happen in the same room at the same time $\implies$ {room, time} as a viable key.

Declaring

SQL and declarative languages

SQL is a declarative language - a very high level language that tells the system merely what to do instead of how to do it.

Things like data manipulation that one would have to implement in a procedural language are implemented in the back-end by the DB management system, and it is the one responsible for finding the optimal way of carrying out an SQL command.

SQL is both a data definition language (DDL) and a data manipulation language (DML).

SQL Terminology

In SQL, relations are tables, tuples are rows, and attributes are columns.

Standards

Whilst SQL has an official standard - SQL 3 currently - different comapnies implement SQL differently and none have implemented the standard directly, thus different platforms have slightly different versions of the language.

Databases, Schemas, and Catalogues

A database contains 1 or more schemas, which contain one or more related tables. A schema can be created with the create schema command, and schemas can (supposedly) be given an authorisation, create schema Company authorisation 'JSmith'

If one does not create a schema before creating tables, those tables go in the default public schema.

Virtual relations can be created with create view, which aren't actual tables but "virtual" tables spanning actual tables.

Note that SQL is generally case insensitive.

now, syntax time

Tables

Tables are created with

create table <tableName> (
	<name1> <type1> <constraint1>,
	<name2> <type2> <constraint2>,
	[otherConstraint], [otherConstraint]
);

There are a few essential data types, their PostgreSQL (pn. post-gres) equivalents are given. The ANSI names are omitted.

PostgreSQL type	Description
`char(n)`	Fixed length character string - string is padded with spaces
`varchar(n)`	Variable limited length - like above but with no padding
`numberic(p, s)`	Arbitrary precision numbers. $p$ is precision and $s$ is scale
`int, int4, int2`	First two are 4 byte ints, `int2` is a 2 byte int
`float4, float8`	4 byte and 8 byte floating point numbers

Amongst others like blob, date, boolean, time, etc.

Thus we can create an example table like so:

create table students (
	studentID int primary key,  -- this is a comment
	studentName varchar(30),  -- char varying
	courseID int
);

Attributes can be set to have a default value, not null, and semantic checks

accountBalance real not null default 0.0,
age int not null check (age > 0 and age < 125)

One may also add a unique clause - this can specify alternate secondary keys, candidate keys.

dname varchar(15) unique

Foreign keys

Constraints can be declared inline or at the after attributes. Below is an exmaple with foreign keys, where constraints are written at the end.

create table project (
	Pname varchar(15) not null,
	Pnumber int not null,
	Plocation int,
	Dnum int not null, 
	primary key (Pnumber),
	unique (Pname),
	foreign key (Dnum) references department (Dnumber) 
	-- department is another table
);

Constraints

One can set actions for foreign keys, which would trigger when specific updates cause integrity violations. These actions are set null, cascade (propagate), set default.

Different actions can happen for different queries, specified with on update and on delete. Thus,

constraint superSSN_SSN
    foreign key (superSSN) references employee (SSN)
	on delete set null
	on update cascade

Where superSSN_SSN is the name of the constraint.

Constraints can also be set on fields inline, with the syntax constraint name details

studentID int constraint nonzero check (studentID > 0)

rm -rf /

SQL also has delete (DROP) commands, which do not double check and will delete your data. Use with caution.

These are drop table and drop schema.

Querying

Selecting

SQL allows duplicate entries in the database - these are dealt with at point of query instead of point of insertion.

Generally, the terminology is that rows are selected whilst columns are projected.

The basic select command is like select attr1, attr2 from someTable where someCondition;.

The output is also a table, and we can manipulate said table.

select distinct removes duplicates
where sets conditions on which rows are selected, and use the boolean operations and or not.
* is the wildcard "all" operator.

We can qualify names using a dot, if there are multiple tables that share a column name (whether through foreign keys or otherwise), by doing select students.studentID from students, courses;.

If table names are too long or unweildy, they can be aliased, for example S.studentID from students S.

Please note that strings in SQL are single quoted only.

We can order by a field by order by field asc (or desc)

Columns can be renamed using as, such as select name as teaName, manufacturer from...

Operational semantics

The general order of operations goes

Get relation in from clause
Apply the row selection in where
Apply the column projection in select

When performing comparisons, linear search is used.

Comparisons

Once can replace a column value with a constant value. For example, if we search from a database of tea drinkers for those with Earl Grey as favourite tea, we can do something like the following:

select drinker, 'likes earl grey' as likesEarlGrey 
	from likes
	where tea = 'Earl Grey';

There are several more string functions, but often these are implementation-specific.

Apostrophes have to be written using double quotes, so 'Joe"s' would mean Joe's.

String attributes can also be pattern matched, using the keywords like and not like. A pattern is a quoted string with special characters, like a simplified form of regex. _ means any one character, % means any number of any characters.

where phone like '%1926_______'

Nulls

Nulls are special in that they are not real values. They are useful for unknown/not applicable values.

Thus, nulls are incomparable: if A's phone number is NULL, B's is also NULL, A's phone number does not equal B's.

SQL comparisons actually have three boolean values, those being True, False, Unknown. Comparing with NULL can yield unknown values. The process of three value logic is that if the value of unknown does not matter in an expression (e.g. $F \land U = F$ and $T \lor U = T$).

To check nulls, use is: attr is null or attr is not null.

Updating

A modification does not return, but changes the state of the DB

Insert

insert into <tableName> values (<val1>, <val2>, ...);

Inserted values must be in attribute order of list in schema. To insert partial values, we can specify the column names:

insert into <tableName> (A1, A2, ..., Aj) values (V1, V2, ..., Vj);

Unspecified attributes will fall to a default or null value. This is also useful if we do not know the order of attributes.

Delete

To delete tuples from a table matching the expression,

delete from <tableName> where <expression>;

And the next one, depending on database, will delete either nothing or everything -- in postgres, this deletes every tuple from the table.

delete from <tableName>;

Update

update <tableName>
set (att1=val1, att2=val2, ...)
where <condition>

Joins

Selecting from multiple tables

In a from when selecting more than one table: ... from students, courses ... by default with the comma we will get the cartesian product, or cross join.

Cross join

Cross join can also be explicit: students cross join courses.

Natural Join

Write R natural join S, natural join can be done by first cross joining $R \times S$, take the attribute in R with the same name as that in S, merge the two columns into one and remove any where they don't match up: \[ \begin{array} {|r|r|}\hline \textrm{StudentID} & \textrm{Name} \\ \hline 1 & Bob \\ \hline 2 & Alice \\ \hline 3 & Brian \\ \hline \end{array} \textrm{ natural join } \begin{array} {|r|r|}\hline \textrm{StudentID} & \textrm{CourseID} \\ \hline 1 & 1 \\ \hline 1 & 2 \\ \hline 2 & 2 \\ \hline 3 & 1 \\ \hline 3 & 2 \\ \hline \end{array} \] \[= \begin{array} {|r|r|}\hline \textrm{StudentID} & \textrm{Name} & \textrm{CourseID} \\ \hline 1 & Bob & 1 \\ \hline 1 & Bob & 2 \\ \hline 2 & Alice & 2 \\ \hline 3 & Brian & 1 \\ \hline 3 & Brian & 2 \\ \hline \end{array}\]

(Regular) Join

Extension of natural join, has syntax R join S on <cond> for some condition.

If the operator in the cond (because you'll need one) is an =, this is an equi-join. The default join method is an inner join.

Inner join

An inner join between R and S is the same as

select * from R, S where R.myID = S.theirID;

i.e. it's like a natural join, except the two columns are not merged.

Theta join

A theta join is simply not an equi-join

Nesting joins

select *
from (
	(students S inner join studentCourses C on S.studentID = C.studentID) 
	inner join courses D on C.courseID = D.courseID 
);

Outer Joins

Inner joins lose info -- an inner join between students and studentCourses will discard all students without courses. An outer join will keep one side's tuples and fill in the other side's with null.

left outer join keeps the left side, right outer join keeps the right side, and full outer join keeps both.

Outer joins must come with a natural or on clause.

Query Nesting

Subqueries

Subqueries are queries within queries - you can select from the result of a select, basically. Internal queries can also be aliased for convenience.

select tea
from likes, (
	select drinker from frequents where tearoom = 'Joe"s Tea Room'
) JD 
where likes.drinker = JD.drinker

Scalar Subqueries

If a query produces a 1x1 table -- i.e a single value, it is scalar. Scalar subqueries can be used in calculations and where conditions.

The keyword in asks if <tuple> in <relation>. not in is the opposite keyword. Used in where clauses.

select A1, A2 from testTable where A1 in (select A2 from testTable);

Set comparisons

in is essentially $\in$ in set algebra. What if we want something like $x < (\forall y \in Y)$? Well we have the all, any, and some operators:

select ... from ... where <attr> <operation> all/any/some <subquery>

Where the operation is a standard comparator operation. ALL matches $\forall$ and ANY, SOME both match $\exists$.

SELECT ProductName FROM Products
WHERE ProductID = ANY
	(SELECT ProductID FROM OrderDetails
	WHERE Quantity = 10);

Correlated Subqueries

Subqueries which need to refer to values of the outer query - often subquery is run per tuple of main query.

Exists

We can emptiness test with select attr from tableName where EXISTS ( <subq> );. Inner queries can use table aliases from outer queries, such as

select courseID, courseName from courses C
where exists (
	select * from studentCourses D where C.courseID = D.courseID
);

Inner queries which use outer query tuples are correlated. Nested query evaluated once per row of outer, and thus can use the scope of outer.

not exists obviously exists (ironically enough).

Correlated versus not

If a subquery is independent from the outer one, the subquery is evaluated first. Otherwise it does the row thing.

Inserting with queries

Guess what, you can insert nested queries too.

insert into backup (copy1, copy2)
select E.source1, E.source2 
from example E
where source1 > 2;

Set operations

Relations are sets (bags) and of course we can do set operations, with keywords union, intersect, except being $\cup, \cap, \setminus$ respectively.

The sets must be comparable, however, for it to work. You cant have tables with different columns.

(select * from likes)
	intersect 
(select drinker, tea from sells join frequents on (tearoom))

Bags vs sets

Yeah so you saw the grey "Bags" earlier? Well SQL "sets" aren't actually sets, they're "bags", or "multisets", or whatever word mathematicians make up to avoid saying sets but with duplicate items.

select distinct rids us of the duplicates, and the set operations will too, but say you do have two tables with duplicates, and don't want to remove them when unioning. Well that's where all comes in: union all will put all the columns together without removing duplicates.

You also have intersect all and except all, but it starts to act a bit funny here.

Say we have the code (select person from A) except all (select person from B), the first one has joe, bob, joe, tim, fred and the second has fred fred tim tim joe, as a result you'd get joe, bob. Since there's two Joes in the first one, and only one Joe (Who's Joe?) in the second one, $2-1=1$ one Joe left over.

It's a similar case with intersect all where they match and discard left overs.

Aggregation and Grouping

Aggregation

Many queries involve computing aggregates, using sum(), avg(), count(), min(), max(), percentile(). These keywords apply to columns.

count() is special since it can take count(*) which means count all rows, and is also extended to count(distinct ...) which should be self explanatory.

The rest of these need an attribute.

select avg(price) from sells where tea = 'Earl Grey';

select studentID from studentCourseMarks
where courseID = 2 and mark > (
	select avg(mark) from studentCourseMarks where courseID = 2
);

Behaviour with nulls

Nulls never contribute to a sum, average, and count, and can never be a max or a min.

If the whole column is null, count() is 0, and then min(), max(), sum() etc become null.

Group by

Follows a select-from-where, results are grouped according to the values for the specified attribute, and (it must be present) aggregates are applied per group.

select courseID, count(*) as students
from studentCourses group by courseID;

select tea, avg(price) from sells group by tea;

Limitations

If an aggregate is used, each elem must be aggregated or grouped by.

-- selecting earl grey at cheapest price
select teaRoom from sells
where (tea, price) = (
	select tea, min(price) from sells 
	where tea = 'Earl Grey'
	group by tea
);

Having

A having condition is a condition on the grouped by rows, all rows must pass said condition to be shown.

-- selecting courses with > 2 students
select sc.courseID, courseName, count(*) as students
from studentCourses sc, courses 
where sc.courseID = courses.courseID
group by sc.courseID, courseName
having count(*) > 2;

having may refer to any tuple variable in from, and may refer to an attribute, as long as that attribute makes sense (i.e. the attribute being grouped)

Relational Algebra

Introduction

Overview
Operators (Detailed)
Additional Operators

SQL is based on relational algebra (RA): a formal mathematical model operating over relations, tuples, and attributes.

Note that whilst SQL is declarative, RA is procedural, one line at a time.

Since operators are formally defined, their properties are key to optimising execution on the back end.

Overview

The operators are (the /1 and /2 are to show unary and binary operators)

Projection/1: $\pi_a(X)$ defines the attributes $a$ of interest of X (like select from)
Selection/1: $\sigma_c(X)$ defines tuples of interest with w/r/t constraint $c$ (like where)
Product/2: $X \times Y$ cartesian product
Join/2: $X \bowtie Y$ joins on attributes or criteria ($\bowtie_c$)
Set operators: $X \cup Y、X \cap Y、\overline{X} $.
Rename/1: $\rho(X)$ can rename the table ($\rho_S(X)$), rename attributes ($\rho_{a_1 \dots a_n}(X)$), or both ($\rho_{S(a_1 \dots a_n)}(X)$).

Example. For every project in stafford, list the project number, controlling department number, department manager's last name, address, and birthdate:

select Pnumber, Dnum, Lname, Address, Bdate
from (
	(
		(
			select * from PROJECT
			where Plocation = 'Stafford'
		) join DEPARTMENT on Dnum = Dnumber
	) join EMPLOYEE on Mgr_ssn = Ssn
);

In relational algebra, this is \[\pi((\sigma(P) \bowtie D) \bowtie E)\] Or unabbreviated, \[\pi_{\textrm{P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate}}((\sigma_{\textrm{P.Location = 'Stafford'}}(P) \bowtie_{\textrm{P.Dnum=D.Dnumber}} D) \bowtie_{\textrm{D.Mgr_ssn = E.Ssn}} E)\]

Operators (Detailed)

Selection

\[\sigma_{\textrm{condition}}(R)\] The condition consists of <attrname> <comparator> <constant/other attr name>.

This is applied independently to each tuple.

Select is commutative: $\sigma_a(\sigma_b(R)) = \sigma_b(\sigma_a(R))$, and chaining them is just the same as using logical and: $\sigma_a(\sigma_b(R) = \sigma_{a \land b}(R)$.

Projection

\[\pi_{\textrm{attr list}(R)\] Attribute list are the desired attributes (columns) of relation.

If there are no keys in this list, in SQL you may get duplicates, but in RA duplicates are illegal since we're using set maths.

If $l_1 \supset l_2$ then $\pi_{l_1}(\pi{l_2}(R))$ is illegal since you're projecting nonexistent columns.

Renaming

$\rho_S(R)$ renames relation R to S
$\rho_{(b_1 \dots b_n)} (R)$ renames attributes of R ($b_i$ renames attribute $a_i$)
$\rho_{S(b_1 \dots b_n)} (R)$ does both!

Expression order

We've seen single expressions: $\pi_{\textrm{Fname, Lname, Salary}}(\sigma_{Dno = 5}(\textrm{EMPLOYEE}))$

We can also chain expressions: \begin{align} & \textrm{DEPS_EMPS} \longleftarrow \sigma_{Dno = 5}(\textrm{EMPLOYEE}) \\ & \textrm{RESULT} \longleftarrow \pi_{\textrm{Fname, Lname, Salary}}(\textrm{DEPS_EMPS}) \end{align}

Set operations

Two relations $R(a_1 \dots a_m)$ and $S(b_1 \dots b_m)$ are type compatible if $m = n$ and $dom(a_i) = dom(b_i) \; \forall i \in [1..n] $.

Cross join

Cartesian product of sets.

Join

Instead of cross join and select, we can join on a condition $R \bowtie_{\textrm{cond}} S$.

An equijoin is a join where the condition has an =

Natural Join

An extension of join is natural join, represented by the star $R \ast S$.

Completeness

Not all of these operators are needed to do everything, in fact the minimal complete set of operators is $\{\sigma, \pi, \rho, \cup, -, \times\} $ (minus is set minus).

Division

If we have $R \longleftarrow S \times T$, then we must have $R \div S = T$... right?

Yeah, but it's a bit weird. Suppose S is students, T is topics, and R is something like "researches". Then $S \times T$ will pair each student with all topics, or each topic with all students. $R \div S$ "extracts" those topics out that all students research.

$R \div S$ "extracts" those topics out that all students research. Even if we add more rows onto R to make $R'$, then $R' \div S$ would still only extract topics that all students research.

We define this as universal quantification, and T is the universal quantifier of all values of S.

Let $R(Z)$ be the fat relation and $S(X)$ be the skinny relation, such that $X \subseteq Z$.

$R(Z) \div S(X)$ is a relation $T(Y):$ tuple $t \in T(Y)$ iff $\forall t_S \in S \exists t_R \in R :$

$t_R[Y] = t$ (t comes from R but only contains attributes Y)
$t_R[X] = t_S$ (t must be present with all combinations of tuples from S as a tuple in R

Alternatively, we can define $T \longleftarrow R \div S$ as \begin{align} T_1 &\longleftarrow \pi_Y(R) \\ T_2 &\longleftarrow \pi_Y((S \times T_1) - R) \\ T & \longleftarrow T_1 - T_2. \end{align}

Additional Operators

Generalised Projection

$\pi_{F_1 \dots F_n} (R)$ where $F_1 \dots F_n$ are functions on R: can do maths, constants, etc.

Aggregate Functions

$_{\textrm{grouping attrs}} \mathfrak{I} _\textrm{function list} (R)$: The notation is fraktur I.

Where the grouping attributes are attributes in R to group by, and the function list is a list of (function, attribute) pairs, the functions are usually sum(), avg(), max(), min(), count() etc.

\[ _\textrm{Dno} \mathfrak{I} _\textrm{COUNT Ssn, AVERGAGE Salary} (\textrm{EMPLOYEE}) \]

Outer joins

Join bowties with what looks like equals coming out of one (or both) sides: left ⟕ right ⟖. (These are unavailable in base mathjax)

Relational Calculus

Introduction

Relational Calculus
Domain Relational Calculus

As if one formal system wasn't enough, here's another.

yay...

Relational Calculus

Predicates and Propositions

Relational Calculus is based on predicates and propositions.

A predicate is a declarative sentence: a statement. Predicates must be verifiably true or false. They specifiy retrieval requests, but only what to retrieve, not how.

True or false statements depend on presence within the database.

Variables in Predicates

We can have a predicate with variables: "Person $p$ plays intrument $i$.", and these variables can be replaced: "Person Megan plays intrument $i$."

A predicate is instantiated when all parameters are filled, this becomes a proposition

"Person Megan plays instrument cello." is now a proposition.

Intension and Extension

The intension of a predicate is the predicate's "meaning": the properties and quantities associated with a statement.

The extension of a predicate is the set of all instantiations where the predicate is true.

Operations

Propositions can be composed with others with logical and first order operators $\land, \lor, \lnot, \forall, \exists$.

Sets

Let $P(x)$ be a predicate. If $a$ is an instance of $x$, and $P(a) = \top$, $a$ satisfies $P$.

$P(x)$ is the membership predicate for the set of all satisfactory objects $a$.

Suppose $P(y) = y \in \mathbb{Z} \land 5 < y < 10$. The Set related is $\{6,7,8,9\} $.

These sets can also be composed with set operators.

Relational Calculus -> Relational Model

A predicate can be thought of as a schema. The predicate defines what's true, whilst the schema defines attributes and domains for valid data.

A proposition can be thought of as a tuple. Propositions are set values substituted into parameters of a predicate; if they are satisfactory, they are valid in the schema.

The extension is the set of all valid states of the schema (the set of all predicate instantiations)

We work with a closed world assumption: a tuple is true if and only if it is in the database.

Queries

RC can be used to express queries declaratively, and there are two types:

Tuple Relational Calculus (TRC) and Domain Relational Calculus (DRC).

Tuple Relational Calculus

TRC

Statements in RC are declarative.

To define queries, we need to define 3 more things:

The range relation, over which all tuples operate
The condition used to select tuples (aka formulas)
The desired attributes to be retrieved

DB Queries

A standard TRC query looks like \[\{t_1\cdot A_1, t_2 \cdot A_2, \dots, t_n \cdot A_n | \textrm{cond}(t_1 \dots t_n)\}\] Where cond is a true/false formula, $t_1 \dots t_n$ are free variables.

The tuple variables range over all possible tuples, so must specify explicitly which relation they come from.

An atomic formula is of form

$R(t_i)$ where R is a relation and $t_i$ is a tuple variable.
$t_i \cdot A \circ t_j \cdot B$ where $\circ$ is $=, <, \leq, \geq, >, \neq$
$t_i \cdot A \circ c$ (or inverse) where $c$ is a constant

Compound Formulas are of form \begin{matrix} &\lnot F, &F_1 \lor F_2, &F_1 \land F_2, &(\forall t)F, &(\exists t)F \end{matrix}

Examples

Because the formal definition is kinda confusing.

Example. Names of employees who earn more than $50k annually. \[\{t.Fname, t.Lname | EMPLOYEE(t) \land t.salary > 50000\} \]

Example. Now we want the employees' full names and their supervisors. \[\{t.Fname, t.Lname, s.Fname, s.Lname | EMPLOYEE(t) \land EMPLOYEE(s) \land t.Superssn = s.Ssn\} \]

Quantification binds variables

We can bind variables with first order logic quantifiers: \[(\forall t)((\exists u)(PROJECT(v) \lor EMPLOYEE(t) \lor v.Ssn = u.Ssn))\]

Example. List names and addresses of all employees who work for research. \begin{align} \{t.Fname, &t.Lname, t.Address | \\ &EMPLOYEE(t) \land \\ &(\exists d)(DEPARTMENT(d) \land d.Dname = 「Research」 \land d.Dnumber = t.Dno) \\ \} \end{align}

Universal Quantification

Note that implies doesn't exist, so $a \implies b$ must become $\lnot a \lor b$.

Example. We want names of all employees who work on a project controlled by department 5.

Start with $\{e.Fname, e.Lname | EMPLOYEE(e) \land F(e)\} $ for some F.

Where F $\implies$ "$e$ works on all projects controlled by dept 5"
$\implies$ "all projects in dept 5, and involves $e$"
$\implies (\forall p)(PROJECT(p) \implies \dots)$, whcih we can't do since imply doesn't exist
$\therefore (\forall p)(\lnot PROJECT(p) \lor (\lnot (p.Dnum = 5) \lor \dots))$

The second round of dots wants "e works on p", \[(\exists w)(WORKS\_ON(w) \land w.Essn = e.Ssn \land w.Pno = p.Pnumber)\] Thus finally getting

{$e.Lname, e.Fname | EMPLOYEE(e) \land$( $(\forall p) (\lnot PROJECT(p) \lor \lnot (p.Pnum = 5) \lor$( $(\exists w) (WORKS\_ON(w) \land w.Essn = e.Ssn \land w.Pno = p.Pnumber)$ ) )}

Safety first

An RC expression is safe if it is guaranteed to return a finite number of tuples.

Achieved when everything is explicitly mentioned. An unsafe expression would be like $\{t | \lnot EMPLOYEE(t) \} $. Safe expressions are reducible to RA.

Domain Relational Calculus

Introduction

Whereas in TRC vars range over tuples, in DRC ~~we have political instability~~ vars range over attribute domains. Again it's similar notation, like \[\{x_1 \dots x_n | \textrm{cond}(x_1 \dots x_n)\}\]

Attributes

Atomic formulas are like

$R(x_1, x_2 \dots x_k)$ is a relation over $k$ attributes
$x_i \circ x_j$ where $\circ$ is an operator
$x_i \circ c$ or vice-versa

Examples

Example. List birthdate and address of employee whose name is John Smith. \begin{align} \{u, v &| (\exists q)(\exists r)(\exists t)(\exists w)(\exists x)(\exists y)(\exists z)( \\ &EMPLOYEE(qrstuvwxyz) \land q = 「John」 \land r = 「B」 \land s = 「smith」 \\ )\} \end{align}

Now look at those ugly unnecessary quantifiers. For ease we say they're implicitly mentioned

Codel's Theorem

Theorem. Relational algebra and relational calculus are equally expressive languages.

Languages that are equal in expressive power to relational algebra are relationally complete.

Database Design, Normalisation and Security

Relational DB Design
Normalisation
Security

Relational DB Design

DB Design

We want to develop good database model so we don't end up storing unmaintainable garbage.

(1) being easy to understand

Each row for one entitiy, and to only refer to different entities via foreign keys.

(2) Not redundant

Storage cost: storing the same thing twice is unnecessary,
Consistency Cost: dealing with anomalies (from deleting or updating data),
Updating redundant data makes more effort.

Anomalies

Anomalies are when the things in the database don't line up.

Update Anomalies happen on updates. If we change "Computing" to "Computer science", but "Computing" is multipully redundantly stored, then you might get some as "Computing" and some as "Computer Science".

Insert anomaly suppose students and courses are on the same table. We can't insert a course without a student being on it.

Delete Anomaly if a course is deleted, we could lose all students on said course. Again, bad.

i.e. we want to design DBs such that no anomalies present (whether through triggers, good tables, etc)

(3) No Nulls

Nulls are unhelpful. They waste space and make operations complicated. Could occasionally be useful though.

If there are a lot of nulls, why not move it into a different table?

Decomposition

Decomposing a DB into smaller tables is generally helpful (to remove spurious tuples)- but must rejoin without information loss. Must use keys.

Normalisation

Functional Dependencies

See the definition in Constraints. FDs are key to make database good.

Normalisation theory

Want to structure a database so it is understandable and logical to retrieve data, and avoid anomalies. Thus we have levels of normalisation ("normal forms"), higher the better.

Universal relation

Start with a universal relation: one with all attributes. Progressively reduce redundancy to normalise.

When we split the tables, we need to be able to rejoin them with lossless/non additive joins, and preserving the dependencies.

Normal Forms

In terms of normal forms, we have 4 "layers": 1st normal form (1NF), 2NF, 3NF, and "Boyce-Codd Normal form" BCNF.

Of course 4NF, 5NF, etc exist but usually 3NF or BCNF suffices. If a schema is 3NF, it is also 2NF and 1NF: cascade.

Decomposing

Not all decompositions are good: seek those that remove spurious tuples.

Also make sure that no extra spurious tuples are created when rejoining, and if possible, preserve all dependencies (but this may not be the case, and is not always perfectly necessary).

Lossless Join Decomposition LJD

Given a schema $R$, and a set $F$ of FDs of R, $\{R_1 \dots R_k\}$ is the LJD of R if for every legal instance of $R$, \[\pi_{R_1} (R) \bowtie \pi_{R_2} (R) \bowtie \cdots \pi_{R_k} (R) = R.\]

Example. This is an invalid decomposition: \[ \begin{array} {} A & B & C \\ \hline 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 2 & 8 \end{array} \longrightarrow R_1: \begin{array}{} A & B \\ \hline 1 & 2 \\ 4 & 5 \\ 7 & 2 \end{array} ,\;\; R_2: \begin{array}{} B & C \\ \hline 2 & 3 \\ 5 & 6 \\ 2 & 8 \end{array} \] because \[ R_1 \bowtie R_2 = \begin{array}{} A & B & C \\ \hline 1 & 2 & 3 \\ 1 & 2 & 8 \\ 4 & 5 & 6 \\ 7 & 2 & 3 \\ 7 & 2 & 8 \end{array} \] of which lines 2 and 4 are spurious.

Implied FDs

Given R, the closure (cover) $F^+$ of a set of FDs $F$ is the set of all implied FDs from $F$.

Implief FDs can be derived using the following properties: (Armstrong Inference Rules)

Reflexive: If $Y \subseteq X$, then $X \implies Y$ (trivial FD)
Augmentation: if $X \implies Y$, then $XZ \implies YZ$ (where $XZ = X \cup Z$)
Transitive: IF $X \implies Y$ and $Y \implies Z$, then $X \implies Z$

Additional Rules

Decomposition: If $X \implies YZ$ then $X \implies Y$ and $X \implies Z$
union: If $X \implies Y$ and $X \implies Z$ then $X \implies YZ$
Pseudotransitivity: If $X \implies Y$ and $WY \implies Z$, then $WX \implies Z$

Test for non-additive joins

Given a relation R and a set of FDs F, $R_1, R_2$ is a lossless decomposition of R in F if and only if \begin{align} &R_1 \cap R_2 \implies (R_1 - R_2) \\ \textrm{or }&R_1 \cap R_2 \implies (R_2 - R_1) \end{align} exists in $F^+$

Example. Take the tables from the previous example: \[ \begin{array} {} A & B & C \\ \hline 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 2 & 8 \end{array} \longrightarrow R_1: \begin{array}{} A & B \\ \hline 1 & 2 \\ 4 & 5 \\ 7 & 2 \end{array} ,\;\; R_2: \begin{array}{} B & C \\ \hline 2 & 3 \\ 5 & 6 \\ 2 & 8 \end{array} \] We know that \begin{align} R_1 \cap R_2 &= B \\ R_1 - R_2 &= A \\ R_2 - R_1 &= C \end{align} Does B -> A or C? The answer is no, since there's two 2s in B, thus this is an invalid decomposition.

Dependency Preservation

Take a relation R with a set of FDs, if it is decomposed into $R_1, R_2, \dots$, we need to make sure the FDs hold for all tables (they can be observed independently without crossing tables).

Otherwise, if tables need to be crossed, then insertion becomes very expensive (as a join would be needed per insertion)

Example. Take R(city, streetNo, postcode) where {city, streetNo} -> postcode, and postcode -> city.

We can decompse this into $R_1$(streetNo, postcode) and $R_2$(city, postcode).

To check ocnsistency, $R_1 \cup R_2 =$ postcode, $R_1 - R_2 = $ streetNo, and $R_2 - R_1 = $ city. $R_1 \cup R_2 \not \implies R_1 - R_2$, but $R_1 \cup R_2 \implies R_2 - R_1$, so lossless join is preserved.

However, we cannot check the FD {city, streetNo} -> postcode, thus dependencies are not entirely preserved.

Decomposition Quality

The better that the two criteria are held, the better the decomposition is. If we have R(A, B, C) with A -> B and B -> C, then

$R_1(A,B),\;R_2(B,C)$ is better, because it is both lossless and preserves dependency, whereas
$R_1(A,B),\;R_2(A,C)$ is worse, since it does not preserve dependency.

Finally, Normalisation

Finally, normalisation is the process of decomposition to reduce redundancy, such that the decomposed tables are losslessly joined. If a decomposed form fits a criteria, it is considered that normal form.

1NF "the key": All attributes are atomic, and there exists a key
2NF "the whole key": None key attributes must depend on a full key / any candidate key
3NF "nothing but the key": Non key attributes must only depend on the key
BCNF: every determinant of a non trivial FD is a superkey whatever that means

The process

Level 0

Data is converted into a table in the dumbest way possible.

Level 1 (NF)

Ensure: no multi-valued attributes, no composite attributes. (ex. a tags field with multiple tags would not count),
and we have a key

We still have one wide table, with storage anomalies. Still can get key troubles.

Level 2

Where every non-key attribute is fully dependent on any key -- no attr depends on a proper subset of the keys, so

$\forall$key $k$ in R, and $\forall$non-key attribute $A$ in R, $k \implies \{A\} $ must be irreducible.

This improves integrity, prevents most anomalies.

When we say reducible, say we have a relation where $\{A,B\} \implies C,\; A \implies D$, since $A \in \{A,B\}$, $\{A, B\} \implies D$ is reducible and this is not 2NF.

Level 3

In 2NF and no transitive functional dependency from key to nonkey.

That is, none of $R(\underline{A}, B, C) : A \implies B,\; A \implies C,\; B \implies C$, because $B \implies C$ means $A \implies B \implies C$ which is a transitive FD.

Level 3+ (BCNF)

A relation is in BCNF if, whenever $\exists X \implies A$, X is a superkey.

Most 3NF are automatically BCNF, but that's not always the case: if there is a determinant that does not belong to the superkey in a non-trivial functional dependency.

"Each attribute must represent a fact about the key, the whole key, and nothing but the key, so help me Codd"

The ideal goal is BCNF with lossless join and dependency preservation. If this is impossible, go for BCNF with lossless join, or just 3NF.

Security

CIA (sus)

In computer security $\exists$ CIA triangle: confidentiality, integrity, availability, none of which we want to lose.

DB software usually comes with access and inference control to help manage this. The power rests with the "governor", the DB administrator (DBA), who get access to

Two subsystems: security and authorisation
Two security mechanisms: discretionary (granting priveleges) and mandatory (classification levels)

There is often an audit log in a DB to log changes made.

Discretionary Access Control DAC

To grant and revoke priveliges to part of the database, and to certain SQL functions. Can be as fine as relation-level or even tuple-level access.

The "Access Matrix Model" stores data on which accounts can access which relations or tuples. Each relation has an owner with all priveliges. We can create schema name authorization your_mum to let your mum access the relation name.

Priveliges can also be given with grant (can also be fine detail, such as granting select but not insert), removed by revoke, and the grantee can grant others their priveliges if they have the grant option permission.

Some DBs limit this propagation, either horizontally (A can grant only to X many people) or vertically (nesting grants can only go Y "generations" down).

Mandatory Access Control MAC

More fine-grained than the "all or nothing" of DAC.

Most DBMSs have DAC only, but special corporate/military versions will have MAC, which has things like a security heirarchy:

(t) Top Secret > (s) Secret > (c) Classified > (u) Unclassified

And each user has a "clearance" level, that is their security classification. The main principle of this is no read up, no write down: a user cannot read (thus write) higher than their clearance, and cannot write but can read lower than their clearance.

Even different attributes inside a row can have different securities. For example we could have John (u), Smith (c), 140 000 (c), Fair (s), (u) (columns first name, surname, salary, standing, overall classification) where the letters in brackets denote the classification of each single attribute.

Then, a user with insufficient clearance will just see NULLs.

This may cause anomalies however. If a classified user wants to update John Smith's standing to "Excellent", the database must not reject that change, since to him, John Smith's standing doesn't exist, and so we would get duplicate rows.

Role Based RBAC

Another system, this one works like discord roles with a heirarchy of roles.

Statistical DBs

Certain DBs exist for people to query aggregate data (i.e. population of males in Upper Slaughter), but not to retrieve personal information from. This is obviously done through permissions.

Problem is, increasing specificity can lead to a bad actor extracting personal information anyway, by querying data so specific that it pinpoints to very few people.

We can mitigate this in a few ways, one of which is by limiting the count, e.g. restricting such that if a query gives a result less than 30, reject the query.

Another is by adding noise, which introduces plausible deniability. Say you have a DB of men who've had STIs, and want to hide personal data whilst still being able to track population metrics. Every time you record data, you could flip a coin, and if tails, record "had STI" as true regardless - false data as noise.

This way, you could track percentages in the population (half of what the database reports) whilst still leaving people with plausible deniability that they've just been entered as a false result for noise.

Flow control

A flow from X to Y is when a program reads from X, and writes to Y.

Flow control essentially prevents higher classification reads being written to lower classification tables.

SQL Injection

The "never use unchecked string formatting" in writing an SQL app, because attacks can specifically word their queries to manipulate statements for data that they should have never gotten.

Always sanitise inputs / use prepared statements. Don't assume data is safe.

Give error messages just informative enough to tell what's wrong, but just vague enough to give no useful information: "The username or password is incorrect".

CS258

Introduction

Contents

Introduction

Relational Database Model and Constraints

Contents

The Relational Database

Constraints

Structured Query Language

Contents

Declaring

Querying

Updating

Joins

Query Nesting

Correlated Subqueries

Aggregation and Grouping

Relational Algebra

Introduction

Overview

Operators (Detailed)

Additional Operators

Relational Calculus

Introduction

Relational Calculus

Tuple Relational Calculus

Domain Relational Calculus

Database Design, Normalisation and Security

Contents

Relational DB Design

Normalisation

The process

Security