Session 10 – notes

Here are some notes for the Thursday lecture in week 5. I spent most of the time going through the slides and also wrote some code in a notebook – the slides are here, and the code I wrote in the notebook is interspersed below (you can download the initial notebook here).

Triggers and rollbacks, and how to handle them from Python

We began with an example from the previous lecture, to discuss triggers and transactions – you can find notes for it towards the end of the notes to lecture 9.

Indices and the query planner

When we define a primary key, the database creates an index for the attributes of the key, which makes searches involving the index much faster (typically \(O(\log n)\) instead of \(O(n)\)). But indices come with a cost – each time we change our tables (with an insertion, an update, or a deletion), we need to rebuild our indices.

We saw how we can create an index over any set of attributes using:

CREATE INDEX <index_name> ON <table>(<attributes>)

and this index will be around until we drop it with:

DROP INDEX <index_name>

We also talked a bit about the query planner, which is the part of the database engine which decides how to actually perform our queries – for a complex query, the planning stage may take up a substantial part of the total time of the query.

We tried out a few queries, first we opened our college application database, and tried:

  • Search for a given s_id in our students table:

    EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_id = 123;
    QUERY PLAN
    `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
    

    In this case we're looking for the primary key, and our database always indexes it, so it uses a search in the index (SEARCH students USING INTEGER PRIMARY KEY (rowid?)=) – this will be a very fast query (typically \(O(\log n)\)).

  • Search for a given s_name in our students table:

    sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy';
    QUERY PLAN
    `--SCAN students
    

    In this case there is no index, and the planner have no idea of where to find our 'Amy's, so the it needs to scan from the first to the last student (SCAN students) – this will be a linear traversal, which is \(O(n)\).

    To alleviate this, we added an index, and tried again:

    sqlite> CREATE INDEX students_by_name ON students(s_name);
    
    sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy';
    QUERY PLAN
    `--SEARCH students USING INDEX students_by_name (s_name=?)
    

    This time it could do a search, which means that our SELECT will run much faster (\(O(\log n)\)), but it would make insertions and deletions to our table slower, since the index must be updated each time.

  • We checked what happens when we do a very simple JOIN:

    sqlite> EXPLAIN QUERY PLAN SELECT * FROM students JOIN applications USING (s_id);
    
    QUERY PLAN
    |--SCAN applications
    `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
    

    We asked it to join students with applications, i.e., to combine each student with each of its applications, but the query planner wisely turned it around, so it scans all applications, and then seach for each student, since it has an index for the students (we're joining on its primary key).

These were very simple queries, but in cases when we need to increase the performance of our database, looking at the query plans for more complex queries can give important insights (like what tables we should index, and on what).

SQL injections

We then saw a very simple program to keep track of someones friends – first a database:

DROP TABLE IF EXISTS friends;
CREATE TABLE friends (
  name         TEXT
);

and then some Python code:

import sqlite3


conn = sqlite3.connect('lect10.sqlite')


def get_friends():
    c = conn.cursor()
    c.execute(
        "SELECT name "
        "FROM   friends "
    )
    return [name for name, in c]


def show_all():
    print('==== All friends ====')
    for name in get_friends():
        print(name)
    print('---------------------')


def add_friend(name):
    statement = "INSERT INTO friends(name) VALUES ('" + name + "')"
    c = conn.cursor()
    c.executescript(statement)
    conn.commit()


def add_friends():
    name = input('Name: ')
    if len(name) > 0:
        add_friend(name)
        show_all()
        add_friends()


def main():
    add_friends()


main()

When I ran this code, it added more and more friends to my database, until I entered the name:

'); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops

What happens is that statement in add_friend is concatenated into the string

INSERT INTO friends(name) VALUES (''); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops')

and this would obviously remove all our previous friends from the friends table if it was run as a command (and that's what executescript does) – this is a simple example of what's called SQL injection, and we could have avoided it above by writing our add_friend function as:

def add_friend(name):
    statement = "INSERT INTO friends(name) VALUES (?);"
    c = conn.cursor()
    c.execute(statement, [name])
    conn.commit()

The take-away from this is to never, never, ever concatenate query strings with input from the outside, and thereby allow users to run queries they dictate themselves.

Self joins, and one limitation of SQL

We then looked at a simple employee database …:

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
  employee_id    INT,
  name           TEXT,
  supervisor_id  INT,
  PRIMARY KEY    (employee_id)
);

INSERT
INTO   employees(employee_id, name, supervisor_id)
VALUES (1, 'Liv', 0),
       (2, 'Adam', 1),
       (3, 'Ida', 1),
       (4, 'Hans', 3),
       (5, 'Emma', 1),
       (6, 'Filip', 5),
       (7, 'Gabriel', 4),
       (8, 'Hanna', 2),
       (9, 'Greta', 7);

… and solved a few simple problems:

  • Find all immediate supervisors. Here we need to join the table with itself, and find the row where the employee_id is the same as 'our' employee_id:

    SELECT    emp.name, sup.name
    FROM      employees AS emp
              JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
    ORDER BY  emp.name
    

    Normally I prefer JOIN ... USING over JOIN ... ON, but that only works for equijoins where the attribut has the same name in both tables. Here we're looking att different attributes, so alas we can't use JOIN ... USING.

    Since Liv doesn't have any supervisor, she's not in this listing, we could have used a LEFT JOIN to bring her in:

    SELECT    emp.name as employee, coalesce(sup.name, '-') AS supervisor
    FROM      employees AS emp
              LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
    ORDER BY  emp.name
    
  • Find all supervisors two steps up. Now we need to do the same as above, but in two steps:

    SELECT    emp.name, supsup.name
    FROM      employees AS emp
              JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
              JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id
    ORDER BY  emp.name
    

    It's clumsy, but it works.

    And as above, we could have used outer joins to include employees whose supervisors have no supervisor:

    SELECT    emp.name AS employee, coalesce(supsup.name, '-') AS super2visor
    FROM      employees AS emp
              LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id
              LEFT JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id
    ORDER BY  emp.name
    
  • Find all supervisors of Greta. This was not even possible before recursive WITH statements were introduced in SQL a couple of years ago – we can solve it with the following CTE (the initial number 7 is Gabriel's id, since he's Gretas nearest supervisor):

    WITH RECURSIVE supervisors(id) AS (
      VALUES (7)
      UNION
      SELECT supervisor_id
      FROM   employees
             JOIN supervisors ON id = employee_id
    )
    SELECT id
    FROM   supervisors
    

It works, but it's clearly not ideal. And the problem is that the problem isn't really suited for a relational database – on Monday next week, Neo4j will describe Graph Databases, which are much, much better tools for problems such as this.

A note on self joins and cascading foreign keys

In the examples above, we didn't declare supervisor_id as a foreign key – we could have done so, but it requires us to be a bit careful (it's called a self referencing foreign key, since it refers to an attribute in the same table).

The topmost supervisor have no supervisor, so she can't have her supervisor's id in the table – that means we need to come up with a way to put a row without a proper foreign key into the table, and there are at least three alternatives:

  • We can turn off foreign key checking (or just wait to turn it on until after we've inserted the first row).
  • We could allow the top supervisor to be her own supervisor (it's a bit of a hack).
  • We could set her supervisor_id to NULL, in which case the foreign key would not be checked.

Below, we use the third alternative:

PRAGMA foreign_keys = ON;

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
  employee_id    INT,
  name           TEXT,
  supervisor_id  INT,
  PRIMARY KEY    (employee_id),
  FOREIGN KEY    (supervisor_id) REFERENCES employees(employee_id) ON DELETE CASCADE
);

INSERT
INTO   employees(employee_id, name, supervisor_id)
VALUES (1, 'Liv', NULL),
       (2, 'Adam', 1),
       (3, 'Ida', 1),
       (4, 'Hans', 3),
       (6, 'Filip', 5),
       (5, 'Emma', 1),
       (7, 'Gabriel', 4),
       (8, 'Hanna', 2),
       (9, 'Greta', 7);

Here we've also used the ON DELETE CASCADE, and it will in this case work recursively. If we first try:

sqlite> SELECT * FROM employees;
employee_id  name     supervisor_id
-----------  -------  -------------
1            Liv      0
2            Adam     1
3            Ida      1
4            Hans     3
5            Emma     1
6            Filip    5
7            Gabriel  4
8            Hanna    2
9            Greta    7

and then remove Ida, who is the immediate supervisor of Hans, the super-supervisor of Gabriel, and the super-super-supervisor of Greta, all four of them will be removed:

sqlite> DELETE FROM employees WHERE employee_id = 3;
sqlite> SELECT * FROM employees;
employee_id  name   supervisor_id
-----------  -----  -------------
1            Liv    0
2            Adam   1
5            Emma   1
6            Filip  5
8            Hanna  2

Other kinds of databases

We finished by talking about different kinds of databases, see the slides. On Monday next week you'll get a first hand view of a really exciting graph database (Neo4J) – if you bring your laptop you can play along with some example databases they'll provide.