Session 10 – notes
Here are some notes for the Thursday lecture in week 5. I spent most of the time going through the slides and also wrote some code in a notebook – the slides are here, and the code I wrote in the notebook is interspersed below (you can download the initial notebook here).
Triggers and rollbacks, and how to handle them from Python
We began with an example from the previous lecture, to discuss triggers and transactions – you can find notes for it towards the end of the notes to lecture 9.
Indices and the query planner
When we define a primary key, the database creates an index for the attributes of the key, which makes searches involving the index much faster (typically \(O(\log n)\) instead of \(O(n)\)). But indices come with a cost – each time we change our tables (with an insertion, an update, or a deletion), we need to rebuild our indices.
We saw how we can create an index over any set of attributes using:
CREATE INDEX <index_name> ON <table>(<attributes>)
and this index will be around until we drop it with:
DROP INDEX <index_name>
We also talked a bit about the query planner, which is the part of the database engine which decides how to actually perform our queries – for a complex query, the planning stage may take up a substantial part of the total time of the query.
We tried out a few queries, first we opened our college application database, and tried:
Search for a given
s_idin ourstudentstable:EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_id = 123; QUERY PLAN `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
In this case we're looking for the primary key, and our database always indexes it, so it uses a search in the index (
SEARCH students USING INTEGER PRIMARY KEY (rowid?)=) – this will be a very fast query (typically \(O(\log n)\)).Search for a given
s_namein ourstudentstable:sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy'; QUERY PLAN `--SCAN students
In this case there is no index, and the planner have no idea of where to find our 'Amy's, so the it needs to scan from the first to the last student (
SCAN students) – this will be a linear traversal, which is \(O(n)\).To alleviate this, we added an index, and tried again:
sqlite> CREATE INDEX students_by_name ON students(s_name); sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy'; QUERY PLAN `--SEARCH students USING INDEX students_by_name (s_name=?)
This time it could do a search, which means that our
SELECTwill run much faster (\(O(\log n)\)), but it would make insertions and deletions to our table slower, since the index must be updated each time.We checked what happens when we do a very simple
JOIN:sqlite> EXPLAIN QUERY PLAN SELECT * FROM students JOIN applications USING (s_id); QUERY PLAN |--SCAN applications `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
We asked it to join
studentswithapplications, i.e., to combine each student with each of its applications, but the query planner wisely turned it around, so it scans all applications, and then seach for each student, since it has an index for the students (we're joining on its primary key).
These were very simple queries, but in cases when we need to increase the performance of our database, looking at the query plans for more complex queries can give important insights (like what tables we should index, and on what).
SQL injections
We then saw a very simple program to keep track of someones friends – first a database:
DROP TABLE IF EXISTS friends; CREATE TABLE friends ( name TEXT );
and then some Python code:
import sqlite3 conn = sqlite3.connect('lect10.sqlite') def get_friends(): c = conn.cursor() c.execute( "SELECT name " "FROM friends " ) return [name for name, in c] def show_all(): print('==== All friends ====') for name in get_friends(): print(name) print('---------------------') def add_friend(name): statement = "INSERT INTO friends(name) VALUES ('" + name + "')" c = conn.cursor() c.executescript(statement) conn.commit() def add_friends(): name = input('Name: ') if len(name) > 0: add_friend(name) show_all() add_friends() def main(): add_friends() main()
When I ran this code, it added more and more friends to my database, until I entered the name:
'); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops
What happens is that statement in add_friend is concatenated into the string
INSERT INTO friends(name) VALUES (''); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops')
and this would obviously remove all our previous friends from the friends table if it was run as a command (and that's what executescript does) – this is a simple example of what's called SQL injection, and we could have avoided it above by writing our add_friend function as:
def add_friend(name): statement = "INSERT INTO friends(name) VALUES (?);" c = conn.cursor() c.execute(statement, [name]) conn.commit()
The take-away from this is to never, never, ever concatenate query strings with input from the outside, and thereby allow users to run queries they dictate themselves.
Self joins, and one limitation of SQL
We then looked at a simple employee database …:
DROP TABLE IF EXISTS employees; CREATE TABLE employees ( employee_id INT, name TEXT, supervisor_id INT, PRIMARY KEY (employee_id) ); INSERT INTO employees(employee_id, name, supervisor_id) VALUES (1, 'Liv', 0), (2, 'Adam', 1), (3, 'Ida', 1), (4, 'Hans', 3), (5, 'Emma', 1), (6, 'Filip', 5), (7, 'Gabriel', 4), (8, 'Hanna', 2), (9, 'Greta', 7);
… and solved a few simple problems:
Find all immediate supervisors. Here we need to join the table with itself, and find the row where the
employee_idis the same as 'our'employee_id:SELECT emp.name, sup.name FROM employees AS emp JOIN employees AS sup ON emp.supervisor_id = sup.employee_id ORDER BY emp.name
Normally I prefer
JOIN ... USINGoverJOIN ... ON, but that only works for equijoins where the attribut has the same name in both tables. Here we're looking att different attributes, so alas we can't useJOIN ... USING.Since Liv doesn't have any supervisor, she's not in this listing, we could have used a
LEFT JOINto bring her in:SELECT emp.name as employee, coalesce(sup.name, '-') AS supervisor FROM employees AS emp LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id ORDER BY emp.name
Find all supervisors two steps up. Now we need to do the same as above, but in two steps:
SELECT emp.name, supsup.name FROM employees AS emp JOIN employees AS sup ON emp.supervisor_id = sup.employee_id JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id ORDER BY emp.name
It's clumsy, but it works.
And as above, we could have used outer joins to include employees whose supervisors have no supervisor:
SELECT emp.name AS employee, coalesce(supsup.name, '-') AS super2visor FROM employees AS emp LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id LEFT JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id ORDER BY emp.name
Find all supervisors of Greta. This was not even possible before recursive
WITHstatements were introduced in SQL a couple of years ago – we can solve it with the following CTE (the initial number 7 is Gabriel's id, since he's Gretas nearest supervisor):WITH RECURSIVE supervisors(id) AS ( VALUES (7) UNION SELECT supervisor_id FROM employees JOIN supervisors ON id = employee_id ) SELECT id FROM supervisors
It works, but it's clearly not ideal. And the problem is that the problem isn't really suited for a relational database – on Monday next week, Neo4j will describe Graph Databases, which are much, much better tools for problems such as this.
A note on self joins and cascading foreign keys
In the examples above, we didn't declare supervisor_id as a foreign key – we could have done so, but it requires us to be a bit careful (it's called a self referencing foreign key, since it refers to an attribute in the same table).
The topmost supervisor have no supervisor, so she can't have her supervisor's id in the table – that means we need to come up with a way to put a row without a proper foreign key into the table, and there are at least three alternatives:
- We can turn off foreign key checking (or just wait to turn it on until after we've inserted the first row).
- We could allow the top supervisor to be her own supervisor (it's a bit of a hack).
- We could set her
supervisor_idtoNULL, in which case the foreign key would not be checked.
Below, we use the third alternative:
PRAGMA foreign_keys = ON; DROP TABLE IF EXISTS employees; CREATE TABLE employees ( employee_id INT, name TEXT, supervisor_id INT, PRIMARY KEY (employee_id), FOREIGN KEY (supervisor_id) REFERENCES employees(employee_id) ON DELETE CASCADE ); INSERT INTO employees(employee_id, name, supervisor_id) VALUES (1, 'Liv', NULL), (2, 'Adam', 1), (3, 'Ida', 1), (4, 'Hans', 3), (6, 'Filip', 5), (5, 'Emma', 1), (7, 'Gabriel', 4), (8, 'Hanna', 2), (9, 'Greta', 7);
Here we've also used the ON DELETE CASCADE, and it will in this case work recursively.
If we first try:
sqlite> SELECT * FROM employees; employee_id name supervisor_id ----------- ------- ------------- 1 Liv 0 2 Adam 1 3 Ida 1 4 Hans 3 5 Emma 1 6 Filip 5 7 Gabriel 4 8 Hanna 2 9 Greta 7
and then remove Ida, who is the immediate supervisor of Hans, the super-supervisor of Gabriel, and the super-super-supervisor of Greta, all four of them will be removed:
sqlite> DELETE FROM employees WHERE employee_id = 3; sqlite> SELECT * FROM employees; employee_id name supervisor_id ----------- ----- ------------- 1 Liv 0 2 Adam 1 5 Emma 1 6 Filip 5 8 Hanna 2
Other kinds of databases
We finished by talking about different kinds of databases, see the slides. On Monday next week you'll get a first hand view of a really exciting graph database (Neo4J) – if you bring your laptop you can play along with some example databases they'll provide.