Session 10 – notes
Here are some notes for the Thursday lecture in week 5. I spent most of the time going through the slides and also wrote some code in a notebook – the slides are here, and the code I wrote in the notebook is interspersed below (you can download the initial notebook here).
Triggers and rollbacks, and how to handle them from Python
We started out with a slight variation of the college application database we've seen many times by now:
PRAGMA foreign_keys = ON; DROP TABLE IF EXISTS applications; DROP TABLE IF EXISTS colleges; DROP TABLE IF EXISTS students; CREATE TABLE students( s_id INTEGER, s_name TEXT, gpa REAL, PRIMARY KEY (s_id) ); CREATE TABLE colleges( c_name TEXT, state TEXT, enrollment INT, PRIMARY KEY (c_name) ); CREATE TABLE applications( s_id INTEGER, c_name TEXT, major TEXT, decision CHAR(1) DEFAULT 'N', PRIMARY KEY (s_id, c_name, major), FOREIGN KEY (s_id) REFERENCES students(s_id), FOREIGN KEY (c_name) REFERENCES colleges(c_name) ); INSERT INTO students(s_id, s_name, gpa) VALUES (123, 'Amy', 3.9), (234, 'Bob', 3.6), (345, 'Craig', 3.5), (456, 'Doris', 3.9), (567, 'Edward', 2.9), (678, 'Fay', 3.8), (789, 'Gary', 3.4), (987, 'Helen', 3.7), (876, 'Irene', 3.9), (765, 'Jay', 2.9), (654, 'Amy', 3.9), (543, 'Craig', 3.4); INSERT INTO colleges(c_name, state, enrollment) VALUES ('Stanford', 'CA', 15000), ('Berkeley', 'CA', 36000), ('MIT', 'MA', 10000), ('Cornell', 'NY', 21000); INSERT INTO applications(s_id, c_name, major, decision) VALUES (123, 'Stanford', 'CS', 'Y'), (123, 'Stanford', 'EE', 'N'), (123, 'Berkeley', 'CS', 'Y'), (123, 'Cornell', 'EE', 'Y'), (234, 'Berkeley', 'biology', 'N'), (345, 'MIT', 'bioengineering', 'Y'), (345, 'Cornell', 'bioengineering', 'N'), (345, 'Cornell', 'CS', 'Y'), (345, 'Cornell', 'EE', 'N'), (678, 'Stanford', 'history', 'Y'), (987, 'Stanford', 'CS', 'Y'), (987, 'Berkeley', 'CS', 'Y'), (876, 'Stanford', 'CS', 'N'), (876, 'MIT', 'biology', 'Y'), (876, 'MIT', 'marine biology', 'N'), (765, 'Stanford', 'history', 'Y'), (765, 'Cornell', 'history', 'N'), (765, 'Cornell', 'psychology', 'Y'), (543, 'MIT', 'CS', 'N');
We then had a Python program which lets us add applications to the database:
import sqlite3 db = sqlite3.connect("colleges.sqlite") db.execute("PRAGMA foreign_keys = ON") def show_applications(): c = db.cursor() c.execute( """ SELECT s_id, s_name, c_name, group_concat(major) FROM students JOIN applications USING (s_id) GROUP BY s_id, c_name ORDER BY s_name, s_id """ ) for s_id, s_name, c_name, majors in c: print(f"{s_id:3d}: {s_name:<16} {c_name:<10}: {majors}") def add_application(s_id, c_name, major): c = db.cursor() try: c.execute( """ INSERT INTO applications(s_id, c_name, major) VALUES (?, ?, ?) """, [s_id, c_name, major] ) db.commit() except sqlite3.Error as e: print(f"Error when adding application: {str(e)}") def main(): show_applications() s_id = input("student id: ") if s_id: c_name = input("college: ") major = input("major: ") add_application(s_id, c_name, major) main() main()
There are a few things which are noteworthy here:
- To turn on checking of foreign keys, I need to set the
PRAGMA
foreign_keys
toON
, and I need to do that on the connection to our database (when SQLite was released, a long time ago, it didn't have foreign key checking, and by the time it was introduced, there were many billions of SQLite databases in use, so it was introduced as an opt-in). So, in SQLite3 we have to turn foreign key checking, and we need to do it once for every connection. - In order to catch any errors from the foreign key checking, such as someone adding an application to a college not in our database, we need to have a
try
-except
statement, as seen above.
We tried to add a new application which conflicted with an earlier application (same student, college, and major), and we got:
$ uv run applications.py 123: Amy Berkeley : CS 123: Amy Cornell : EE 123: Amy Stanford : CS,EE 234: Bob Berkeley : biology 345: Craig Cornell : CS,EE,bioengineering 345: Craig MIT : bioengineering 543: Craig MIT : CS 678: Fay Stanford : history 987: Helen Berkeley : CS 987: Helen Stanford : CS 876: Irene MIT : biology,marine biology 876: Irene Stanford : CS 765: Jay Cornell : history,psychology 765: Jay Stanford : history student id: 123 college: Berkeley major: CS Error when adding application: UNIQUE constraint failed: applications.s_id, applications.c_name, applications.major
So, the database itself returns the text 'UNIQUE constraint failed: applications.s_id, applications.c_name, applications.major'
.
Last time we added a trigger to make sure no student applies for more than two majors at any college:
DROP TRIGGER IF EXISTS limit_applications; CREATE TRIGGER limit_applications BEFORE INSERT ON applications WHEN (SELECT count(c_name) FROM applications WHERE s_id = NEW.s_id AND c_name = NEW.c_name) >= 2 BEGIN SELECT RAISE (ROLLBACK, 'Too many applications to one college'); END;
Having this trigger allows us to try to insert each new application, without first checking the number of earlier applications – the database takes care of that check itself now (and we end up in the except
statement if someone applies to many times). The error message in this case would be 'Too many applications to one college'
We also talked briefly about the problem left as en exercise at the end of the notes to lecture 9.
Using 'foreign key clauses'
One of the main points of having attributes marked as foreign keys is that we can make sure we have no 'dangling pointers' (i.e., rows referring to missing values).
If we want to remove a student who hasn't applied for any colleges, just removing the row in the students
table should be OK (since there is no foreign key pointing to it), but what should we do if we really wanted to remove a student who actually has applied to one or more majors?
It turns out that we can ask the database to remove any applications for our student, using a foreign-key-clause
:
CREATE TABLE applications( a_id INTEGER, s_id INTEGER, c_name TEXT, major TEXT, decision CHAR(1) DEFAULT 'N', PRIMARY KEY (a_id), UNIQUE (s_id, c_name, major), FOREIGN KEY (s_id) REFERENCES students(s_id) ON DELETE CASCADE, FOREIGN KEY (c_name) REFERENCES colleges(c_name) )
We tried this, removing the Amy who has s_id
123:
DELETE FROM students WHERE s_id = 123
… and it turned out that all her applications were deleted when she was deleted from the students
table. To make this work it's important that we've turned the foreign_key
checking on, I showed one way to make sure it's turned on by default: by putting
PRAGMA foreign_key = ON;
in the file ~/.sqliterc
.
Indices and the query planner
When we define a primary key, the database creates an index for the attributes of the key, which makes searches involving the index much faster (typically \(O(\log n)\) instead of \(O(n)\)). But indices come with a cost – each time we change our tables (with an insertion, an update, or a deletion), we need to rebuild our indices.
We saw how we can create an index over any set of attributes using:
CREATE INDEX <index_name> ON <table>(<attributes>)
and this index will be around until we drop it with:
DROP INDEX <index_name>
We also talked a bit about the query planner, which is the part of the database engine which decides how to actually perform our queries – for a complex query, the planning stage may take up a substantial part of the total time of the query.
We tried out a few queries, first we opened our college application database, and tried:
Search for a given
s_id
in ourstudents
table:EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_id = 123; QUERY PLAN `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
In this case we're looking for the primary key, and our database always indexes it, so it uses a search in the index (
SEARCH students USING INTEGER PRIMARY KEY (rowid
?)=) – this will be a very fast query (typically \(O(\log n)\)).Search for a given
s_name
in ourstudents
table:sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy'; QUERY PLAN `--SCAN students
In this case there is no index, and the planner have no idea of where to find our 'Amy's, so the it needs to scan from the first to the last student (
SCAN students
) – this will be a linear traversal, which is \(O(n)\).To alleviate this, we added an index, and tried again:
sqlite> CREATE INDEX students_by_name ON students(s_name); sqlite> EXPLAIN QUERY PLAN SELECT * FROM students WHERE s_name = 'Amy'; QUERY PLAN `--SEARCH students USING INDEX students_by_name (s_name=?)
This time it could do a search, which means that our
SELECT
will run much faster (\(O(\log n)\)), but it would make insertions and deletions to our table slower, since the index must be updated each time.We checked what happens when we do a very simple
JOIN
:sqlite> EXPLAIN QUERY PLAN SELECT * FROM students JOIN applications USING (s_id); QUERY PLAN |--SCAN applications `--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
We asked it to join
students
withapplications
, i.e., to combine each student with each of its applications, but the query planner wisely turned it around, so it scans all applications, and then seach for each student, since it has an index for the students (we're joining on its primary key).
These were very simple queries, but in cases when we need to increase the performance of our database, looking at the query plans for more complex queries can give important insights (like what tables we should index, and on what).
SQL injections
We then saw a very simple program to keep track of someones friends – first a database:
DROP TABLE IF EXISTS friends; CREATE TABLE friends ( name TEXT );
and then some Python code:
import sqlite3 conn = sqlite3.connect('lect10.sqlite') def get_friends(): c = conn.cursor() c.execute( "SELECT name " "FROM friends " ) return [name for name, in c] def show_all(): print('==== All friends ====') for name in get_friends(): print(name) print('---------------------') def add_friend(name): statement = "INSERT INTO friends(name) VALUES ('" + name + "')" c = conn.cursor() c.executescript(statement) conn.commit() def add_friends(): name = input('Name: ') if len(name) > 0: add_friend(name) show_all() add_friends() def main(): add_friends() main()
When I ran this code, it added more and more friends to my database, until I entered the name:
'); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops
What happens is that statement
in add_friend
is concatenated into the string
INSERT INTO friends(name) VALUES (''); DELETE FROM friends; INSERT INTO friends(name) VALUES ('woops')
and this would obviously remove all our previous friends from the friends
table – this is a simple example of what's called SQL injection, and we could have avoided it above by writing our add_friend
function as:
def add_friend(name): statement = "INSERT INTO friends(name) VALUES (?);" c = conn.cursor() c.execute(statement, [name]) conn.commit()
The take-away from this is to never, never, ever concatenate query strings with input from the outside, and thereby allow users to run queries they dictate themselves.
Self joins, and one limitation of SQL
We then looked at a simple employee database …:
DROP TABLE IF EXISTS employees; CREATE TABLE employees ( employee_id INT, name TEXT, supervisor_id INT, PRIMARY KEY (employee_id) ); INSERT INTO employees(employee_id, name, supervisor_id) VALUES (1, 'Liv', 0), (2, 'Adam', 1), (3, 'Ida', 1), (4, 'Hans', 3), (5, 'Emma', 1), (6, 'Filip', 5), (7, 'Gabriel', 4), (8, 'Hanna', 2), (9, 'Greta', 7);
… and solved a few simple problems:
Find all immediate supervisors. Here we need to join the table with itself, and find the row where the
employee_id
is the same as 'our'employee_id
:SELECT emp.name, sup.name FROM employees AS emp JOIN employees AS sup ON emp.supervisor_id = sup.employee_id ORDER BY emp.name
Since Liv doesn't have any supervisor, she's not in this listing, we could have used a
LEFT JOIN
to bring her in:SELECT emp.name as employee, coalesce(sup.name, '-') AS supervisor FROM employees AS emp LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id ORDER BY emp.name
Find all supervisors two steps up. Now we need to do the same as above, but in two steps:
SELECT emp.name, supsup.name FROM employees AS emp JOIN employees AS sup ON emp.supervisor_id = sup.employee_id JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id ORDER BY emp.name
It's clumsy, but it works.
And as above, we could have used outer joins to include employees whose supervisors have no supervisor:
SELECT emp.name AS employee, coalesce(supsup.name, '-') AS super2visor FROM employees AS emp LEFT JOIN employees AS sup ON emp.supervisor_id = sup.employee_id LEFT JOIN employees AS supsup ON sup.supervisor_id = supsup.employee_id ORDER BY emp.name
Find all supervisors of Greta. This was not even possible before recursive
WITH
statements were introduced in SQL – we can solve it with the following CTE (the initial number 7 is Gabriel's id, since he's Gretas nearest supervisor):WITH RECURSIVE supervisors(id) AS ( VALUES (7) UNION SELECT supervisor_id FROM employees JOIN supervisors ON id = employee_id ) SELECT id FROM supervisors
It works, but it's clearly not ideal. And the problem is that the problem isn't really suited for a relational database – on Monday next week, Neo4j will describe Graph Databases, which are much, much, much better tools for problems such as this.
A note on self joins and cascading foreign keys
In the examples above, we didn't declare supervisor_id
as a foreign key – we could have done so, but it requires us to be a bit careful (it's called a self referencing foreign key, since it refers to an attribute in the same table).
The topmost supervisor have no supervisor, so she can't have her supervisor's id in the table – that means we need to come up with a way to put a row without a proper foreign key into the table, and there are at least three alternatives:
- We can turn off foreign key checking (or just wait to turn it on until after we've inserted the first row).
- We could allow the top supervisor to be her own supervisor (it's a bit of a hack).
- We could set her
supervisor_id
toNULL
, in which case the foreign key would not be checked.
Below, we use the third alternative:
PRAGMA foreign_keys = ON; DROP TABLE IF EXISTS employees; CREATE TABLE employees ( employee_id INT, name TEXT, supervisor_id INT, PRIMARY KEY (employee_id), FOREIGN KEY (supervisor_id) REFERENCES employees(employee_id) ON DELETE CASCADE ); INSERT INTO employees(employee_id, name, supervisor_id) VALUES (1, 'Liv', NULL), (2, 'Adam', 1), (3, 'Ida', 1), (4, 'Hans', 3), (6, 'Filip', 5), (5, 'Emma', 1), (7, 'Gabriel', 4), (8, 'Hanna', 2), (9, 'Greta', 7);
Here we've also used the ON DELETE CASCADE
, and it will in this case work recursively.
If we first try:
sqlite> SELECT * FROM employees; employee_id name supervisor_id ----------- ------- ------------- 1 Liv 0 2 Adam 1 3 Ida 1 4 Hans 3 5 Emma 1 6 Filip 5 7 Gabriel 4 8 Hanna 2 9 Greta 7
and then remove Ida, who is the immediate supervisor of Hans, the super-supervisor of Gabriel, and the super-super-supervisor of Greta, all four of them will be removed:
sqlite> DELETE FROM employees WHERE employee_id = 3; sqlite> SELECT * FROM employees; employee_id name supervisor_id ----------- ----- ------------- 1 Liv 0 2 Adam 1 5 Emma 1 6 Filip 5 8 Hanna 2
Other kinds of databases
We finished by talking about different kinds of databases, see the slides. On Monday next week you'll get a first hand view of a really exciting graph database (Neo4J) – if you bring your laptop you can play along with some example databases they'll provide.