Session 10 – notes

Here are some notes for the Thursday lecture in week 5. I spent most of the time going through the slides and also wrote some code in a notebook – the slides are here, and the code I wrote in the notebook is interspersed below (you can download the initial notebook here).

Triggers and rollbacks, and how to handle them from Python

We started out with a slight variation of the college application database we've seen many times by now:

PRAGMA foreign_keys = ON;

DROP TABLE IF EXISTS applications;
DROP TABLE IF EXISTS colleges;
DROP TABLE IF EXISTS students;

CREATE TABLE students(
  s_id        INTEGER,
  s_name      TEXT,
  gpa         REAL,
  PRIMARY KEY (s_id)
);

CREATE TABLE colleges(
  c_name      TEXT,
  state       TEXT,
  enrollment  INT,
  PRIMARY KEY (c_name)
);

CREATE TABLE applications(
  s_id        INTEGER,
  c_name      TEXT,
  major       TEXT,
  decision    CHAR(1) DEFAULT 'N',
  PRIMARY KEY (s_id, c_name, major),
  FOREIGN KEY (s_id) REFERENCES students(s_id),
  FOREIGN KEY (c_name) REFERENCES colleges(c_name)
);

INSERT
INTO   students(s_id, s_name, gpa)
VALUES (123, 'Amy', 3.9),
       (234, 'Bob', 3.6),
       (345, 'Craig', 3.5),
       (456, 'Doris', 3.9),
       (567, 'Edward', 2.9),
       (678, 'Fay', 3.8),
       (789, 'Gary', 3.4),
       (987, 'Helen', 3.7),
       (876, 'Irene', 3.9),
       (765, 'Jay', 2.9),
       (654, 'Amy', 3.9),
       (543, 'Craig', 3.4);

INSERT
INTO   colleges(c_name, state, enrollment)
VALUES ('Stanford', 'CA', 15000),
       ('Berkeley', 'CA', 36000),
       ('MIT', 'MA', 10000),
       ('Cornell', 'NY', 21000);

INSERT
INTO   applications(s_id, c_name, major, decision)
VALUES (123, 'Stanford', 'CS', 'Y'),
       (123, 'Stanford', 'EE', 'N'),
       (123, 'Berkeley', 'CS', 'Y'),
       (123, 'Cornell', 'EE', 'Y'),
       (234, 'Berkeley', 'biology', 'N'),
       (345, 'MIT', 'bioengineering', 'Y'),
       (345, 'Cornell', 'bioengineering', 'N'),
       (345, 'Cornell', 'CS', 'Y'),
       (345, 'Cornell', 'EE', 'N'),
       (678, 'Stanford', 'history', 'Y'),
       (987, 'Stanford', 'CS', 'Y'),
       (987, 'Berkeley', 'CS', 'Y'),
       (876, 'Stanford', 'CS', 'N'),
       (876, 'MIT', 'biology', 'Y'),
       (876, 'MIT', 'marine biology', 'N'),
       (765, 'Stanford', 'history', 'Y'),
       (765, 'Cornell', 'history', 'N'),
       (765, 'Cornell', 'psychology', 'Y'),
       (543, 'MIT', 'CS', 'N');

We then had a Python program which lets us add applications to the database:

import sqlite3


db = sqlite3.connect("colleges.sqlite")
db.execute("PRAGMA foreign_keys = ON")


def show_applications():
    c = db.cursor()
    c.execute(
        """
        SELECT    s_id, s_name, c_name, group_concat(major)
        FROM      students
                  JOIN applications USING (s_id)
        GROUP BY  s_id, c_name
        ORDER BY  s_name, s_id
        """
    )
    for s_id, s_name, c_name, majors in c:
        print(f"{s_id:3d}: {s_name:<16}  {c_name:<10}: {majors}")


def add_application(s_id, c_name, major):
    c = db.cursor()
    try:
        c.execute(
            """
            INSERT
            INTO       applications(s_id, c_name, major)
            VALUES     (?, ?, ?)
            """,
            [s_id, c_name, major]
        )
        db.commit()
    except sqlite3.Error as e:
        print(f"Error when adding application: {str(e)}")


def main():
    show_applications()
    s_id = input("student id: ")
    if s_id:
        c_name = input("college: ")
        major = input("major: ")
        add_application(s_id, c_name, major)
        main()


main()

There are a few things which are noteworthy here:

  • To turn on checking of foreign keys, I need to set the PRAGMA foreign_keys to ON, and I need to do that on the connection to our database (when SQLite was released, a long time ago, it didn't have foreign key checking, and by the time it was introduced, there were many billions of SQLite databases in use, so it was introduced as an opt-in). So, in SQLite3 we have to turn foreign key checking, and we need to do it once for every connection.
  • In order to catch any errors from the foreign key checking, such as someone adding an application to a college not in our database, we need to have a try - except statement, as seen above.

We tried to add a new application which conflicted with an earlier application (same student, college, and major), and we got:

$ uv run applications.py
123: Amy               Berkeley  : CS
123: Amy               Cornell   : EE
123: Amy               Stanford  : CS,EE
234: Bob               Berkeley  : biology
345: Craig             Cornell   : CS,EE,bioengineering
345: Craig             MIT       : bioengineering
543: Craig             MIT       : CS
678: Fay               Stanford  : history
987: Helen             Berkeley  : CS
987: Helen             Stanford  : CS
876: Irene             MIT       : biology,marine biology
876: Irene             Stanford  : CS
765: Jay               Cornell   : history,psychology
765: Jay               Stanford  : history
student id: 123
college: Berkeley
major: CS
Error when adding application: UNIQUE constraint failed: applications.s_id, applications.c_name, applications.major

So, the database itself returns the text 'UNIQUE constraint failed: applications.s_id, applications.c_name, applications.major'.

Last time we added a trigger to make sure no student applies for more than two majors at any college:

DROP TRIGGER IF EXISTS limit_applications;
CREATE TRIGGER limit_applications
BEFORE INSERT ON applications
WHEN
  (SELECT  count(c_name)
   FROM    applications
   WHERE   s_id = NEW.s_id AND c_name = NEW.c_name) >= 2
BEGIN
  SELECT RAISE (ROLLBACK, 'Too many applications to one college');
END;

Having this trigger allows us to try to insert each new application, without first checking the number of earlier applications – the database takes care of that check itself now (and we end up in the except statement if someone applies to many times). The error message in this case would be 'Too many applications to one college'

We also talked briefly about the problem left as en exercise at the end of the notes to lecture 9.

Using 'foreign key clauses'

One of the main points of having attributes marked as foreign keys is that we can make sure we have no 'dangling pointers' (i.e., rows referring to missing values).

If we want to remove a student who hasn't applied for any colleges, just removing the row in the students table should be OK (since there is no foreign key pointing to it), but what should we do if we really wanted to remove a student who actually has applied to one or more majors? It turns out that we can ask the database to remove any applications for our student, using a foreign-key-clause:

CREATE TABLE applications(
  a_id        INTEGER,
  s_id        INTEGER,
  c_name      TEXT,
  major       TEXT,
  decision    CHAR(1) DEFAULT 'N',
  PRIMARY KEY (a_id),
  UNIQUE      (s_id, c_name, major),
  FOREIGN KEY (s_id) REFERENCES students(s_id) ON DELETE CASCADE,
  FOREIGN KEY (c_name) REFERENCES colleges(c_name)
)

We tried this, removing the Amy who has s_id 123:

DELETE
FROM     students
WHERE    s_id = 123

… and it turned out that all her applications were deleted when she was deleted from the students table. To make this work it's important that we've turned the foreign_key checking on, I showed one way to make sure it's turned on by default: by putting

PRAGMA foreign_key = ON;

in the file ~/.sqliterc.

Indices and the query planner

When we define a primary key, the database creates an index for the attributes of the key, which makes searches involving the index much faster (typically \(O(\log n)\) instead of \(O(n)\)). But indices come with a cost – each time we change our tables (with an insertion, an update, or a deletion), we need to rebuild our indices.

We saw how we can create an index over any set of attributes using:

CREATE INDEX <index_name> ON <table>(<attributes>)

and this index will be around until we drop it with:

DROP INDEX <index_name>

We also talked a bit about the query planner, which is the part of the database engine which decides how to actually perform our queries – for a complex query, the planning stage may take up a substantial part of the total time of the query.

We tried out a few queries, first we opened our college application database, and tried:

  • Search for a given s_id in our students table:

    EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_id = 123;
    QUERY PLAN
    `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
    

    In this case we're looking for the primary key, and our database always indexes it, so it uses a search in the index (SEARCH students USING INTEGER PRIMARY KEY (rowid?)=) – this will be a very fast query (typically \(O(\log n)\)).

  • Search for a given s_name in our students table:

    sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy';
    QUERY PLAN
    `--SCAN students
    

    In this case there is no index, and the planner have no idea of where to find our 'Amy's, so the it needs to scan from the first to the last student (SCAN students) – this will be a linear traversal, which is \(O(n)\).

    To alleviate this, we added an index, and tried again:

    sqlite> CREATE INDEX students_by_name ON students(s_name);
    
    sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy';
    QUERY PLAN
    `--SEARCH students USING INDEX students_by_name (s_name=?)
    

    This time it could do a search, which means that our SELECT will run much faster (\(O(\log n)\)), but it would make insertions and deletions to our table slower, since the index must be updated each time.

  • We checked what happens when we do a very simple JOIN:

    sqlite> EXPLAIN QUERY PLAN SELECT * FROM students JOIN applications USING (s_id);
    
    QUERY PLAN
    |--SCAN applications
    `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
    

    We asked it to join students with applications, i.e., to combine each student with each of its applications, but the query planner wisely turned it around, so it scans all applications, and then seach for each student, since it has an index for the students (we're joining on its primary key).

These were very simple queries, but in cases when we need to increase the performance of our database, looking at the query plans for more complex queries can give important insights (like what tables we should index, and on what).

SQL injections

We then saw a very simple program to keep track of someones friends – first a database:

DROP TABLE IF EXISTS friends;
CREATE TABLE friends (
  name         TEXT
);

and then some Python code:

import sqlite3


conn = sqlite3.connect('lect10.sqlite')


def get_friends():
    c = conn.cursor()
    c.execute(
        "SELECT name "
        "FROM   friends "
    )
    return [name for name, in c]


def show_all():
    print('==== All friends ====')
    for name in get_friends():
        print(name)
    print('---------------------')


def add_friend(name):
    statement = "INSERT INTO friends(name) VALUES ('" + name + "')"
    c = conn.cursor()
    c.executescript(statement)
    conn.commit()


def add_friends():
    name = input('Name: ')
    if len(name) > 0:
        add_friend(name)
        show_all()
        add_friends()


def main():
    add_friends()


main()

When I ran this code, it added more and more friends to my database, until I entered the name:

'); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops

What happens is that statement in add_friend is concatenated into the string

INSERT INTO friends(name) VALUES (''); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops')

and this would obviously remove all our previous friends from the friends table – this is a simple example of what's called SQL injection, and we could have avoided it above by writing our add_friend function as:

def add_friend(name):
    statement = "INSERT INTO friends(name) VALUES (?);"
    c = conn.cursor()
    c.execute(statement, [name])
    conn.commit()

The take-away from this is to never, never, ever concatenate query strings with input from the outside, and thereby allow users to run queries they dictate themselves.

Self joins, and one limitation of SQL

We then looked at a simple employee database …:

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
  employee_id    INT,
  name           TEXT,
  supervisor_id  INT,
  PRIMARY KEY    (employee_id)
);

INSERT
INTO   employees(employee_id, name, supervisor_id)
VALUES (1, 'Liv', 0),
       (2, 'Adam', 1),
       (3, 'Ida', 1),
       (4, 'Hans', 3),
       (5, 'Emma', 1),
       (6, 'Filip', 5),
       (7, 'Gabriel', 4),
       (8, 'Hanna', 2),
       (9, 'Greta', 7);

… and solved a few simple problems:

  • Find all immediate supervisors. Here we need to join the table with itself, and find the row where the employee_id is the same as 'our' employee_id:

    SELECT    emp.name, sup.name
    FROM      employees AS emp
              JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
    ORDER BY  emp.name
    

    Since Liv doesn't have any supervisor, she's not in this listing, we could have used a LEFT JOIN to bring her in:

    SELECT    emp.name as employee, coalesce(sup.name, '-') AS supervisor
    FROM      employees AS emp
              LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
    ORDER BY  emp.name
    
  • Find all supervisors two steps up. Now we need to do the same as above, but in two steps:

    SELECT    emp.name, supsup.name
    FROM      employees AS emp
              JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
              JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id
    ORDER BY  emp.name
    

    It's clumsy, but it works.

    And as above, we could have used outer joins to include employees whose supervisors have no supervisor:

    SELECT    emp.name AS employee, coalesce(supsup.name, '-') AS super2visor
    FROM      employees AS emp
              LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
              LEFT JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id
    ORDER BY  emp.name
    
  • Find all supervisors of Greta. This was not even possible before recursive WITH statements were introduced in SQL – we can solve it with the following CTE (the initial number 7 is Gabriel's id, since he's Gretas nearest supervisor):

    WITH RECURSIVE supervisors(id) AS (
      VALUES (7)
      UNION
      SELECT supervisor_id
      FROM   employees
             JOIN supervisors ON id = employee_id
    )
    SELECT id
    FROM   supervisors
    

It works, but it's clearly not ideal. And the problem is that the problem isn't really suited for a relational database – on Monday next week, Neo4j will describe Graph Databases, which are much, much, much better tools for problems such as this.

A note on self joins and cascading foreign keys

In the examples above, we didn't declare supervisor_id as a foreign key – we could have done so, but it requires us to be a bit careful (it's called a self referencing foreign key, since it refers to an attribute in the same table).

The topmost supervisor have no supervisor, so she can't have her supervisor's id in the table – that means we need to come up with a way to put a row without a proper foreign key into the table, and there are at least three alternatives:

  • We can turn off foreign key checking (or just wait to turn it on until after we've inserted the first row).
  • We could allow the top supervisor to be her own supervisor (it's a bit of a hack).
  • We could set her supervisor_id to NULL, in which case the foreign key would not be checked.

Below, we use the third alternative:

PRAGMA foreign_keys = ON;

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
  employee_id    INT,
  name           TEXT,
  supervisor_id  INT,
  PRIMARY KEY    (employee_id),
  FOREIGN KEY    (supervisor_id) REFERENCES employees(employee_id) ON DELETE CASCADE
);

INSERT
INTO   employees(employee_id, name, supervisor_id)
VALUES (1, 'Liv', NULL),
       (2, 'Adam', 1),
       (3, 'Ida', 1),
       (4, 'Hans', 3),
       (6, 'Filip', 5),
       (5, 'Emma', 1),
       (7, 'Gabriel', 4),
       (8, 'Hanna', 2),
       (9, 'Greta', 7);

Here we've also used the ON DELETE CASCADE, and it will in this case work recursively. If we first try:

sqlite> SELECT * FROM employees;
employee_id  name     supervisor_id
-----------  -------  -------------
1            Liv      0
2            Adam     1
3            Ida      1
4            Hans     3
5            Emma     1
6            Filip    5
7            Gabriel  4
8            Hanna    2
9            Greta    7

and then remove Ida, who is the immediate supervisor of Hans, the super-supervisor of Gabriel, and the super-super-supervisor of Greta, all four of them will be removed:

sqlite> DELETE FROM employees WHERE employee_id = 3;
sqlite> SELECT * FROM employees;
employee_id  name   supervisor_id
-----------  -----  -------------
1            Liv    0
2            Adam   1
5            Emma   1
6            Filip  5
8            Hanna  2

Other kinds of databases

We finished by talking about different kinds of databases, see the slides. On Monday next week you'll get a first hand view of a really exciting graph database (Neo4J) – if you bring your laptop you can play along with some example databases they'll provide.