EDAF75 - notebook for lecture 2¶
Author: Christian Söderberg
Week 1: Introduction, relational databases, and SQL, continued…¶
In order to run SQL in this notebook, we need to load the SQL extension:
%load_ext sql
If we use jupysql (instead of ipython-sql) we get a default limit of 10 rows for any query – to turn this limit off we can write:
%config SqlMagic.displaylimit = None
I've put all the tables used for this lecture in a file called lect02.sqlite, and we can load them using:
%sql sqlite:///lect02.sqlite
Now we're good to go, we just have to prefix our SQL queries with %sql (one line of SQL) or %%sql (several lines of SQL, this is the form we will use in most cases).
When one table is not enough¶
In the preparation notes for this week's lectures, you were asked to think through the problem below.
Problem: We've been asked to write code to handle students and their college applications. Each student has a name, a student id (stil), and an average grade. The students apply for various majors at colleges, where each college has a name, is situated in a given state, and has a given enrollment. Each application is for a specific major at a specific college, and it can be accepted, rejected, or pending.
- What classes and/or data structures would you use to keep track of this data in Python or Java?
- How do you list all applications for a given student?
- How do you list all applications for a given major at a given college?
- Approximately how many lines of Python/Java code do you need to list the three programs (major@college) with the highest grade average for the 'last' accepted student (assuming the students are accepted in order of decreasing grade averages)?
- How would you persist your data, so it doesn't get lost if we have to restart the program?
- Can you come up with a way to let several programs access the data simultaneously?
- If you solve the previous problem, and want to allow several programs update the data simultaneously, how do you make sure the data isn't corrupted?
If we wanted to keep all the information above in one table, we'd need the following columns (below I use some column names which are in a style I don't normally use, but I think it could be good for you to be exposed to different styles):
- student id (
s_id) - student name (
s_name) - average grade (
gpa) - college (
c_name) - state (
state) - enrollment (
enrollment) - major (
major) - decision (
decision)
Here is a (very small) example:
s_id |
s_name |
gpa |
c_name |
state |
enrollment |
major |
decision |
|---|---|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Stanford | CA | 15000 | CS | Y |
| 123 | Amy | 3.9 | Stanford | CA | 15000 | EE | N |
| 123 | Amy | 3.9 | Berkeley | CA | 36000 | CS | Y |
| 123 | Amy | 3.9 | Cornell | NY | 21000 | EE | Y |
| 234 | Bob | 3.6 | Berkeley | CA | 36000 | biology | N |
| 345 | Craig | 3.5 | MIT | MA | 10000 | bioengineering | Y |
| 345 | Craig | 3.5 | Cornell | NY | 21000 | bioengineering | N |
| 345 | Craig | 3.5 | Cornell | NY | 21000 | CS | Y |
| 345 | Craig | 3.5 | Cornell | NY | 21000 | EE | N |
| 678 | Fay | 3.8 | Stanford | CA | 15000 | history | Y |
| 987 | Helen | 3.7 | Stanford | CA | 15000 | CS | Y |
| 987 | Helen | 3.7 | Berkeley | CA | 36000 | CS | Y |
| 876 | Irene | 3.9 | Stanford | CA | 15000 | CS | N |
| 876 | Irene | 3.9 | MIT | MA | 10000 | biology | Y |
| 876 | Irene | 3.9 | MIT | MA | 10000 | marine biology | N |
| 765 | Jay | 2.9 | Stanford | CA | 15000 | history | Y |
| 765 | Jay | 2.9 | Cornell | NY | 21000 | history | N |
| 765 | Jay | 2.9 | Cornell | NY | 21000 | psychology | Y |
| 543 | Craig | 3.4 | MIT | MA | 10000 | CS | N |
This table is wonderful for finding information, but it comes with some big problems:
Some of the information is repeated in many places, which is wasteful and makes updates error prone. For example: if we want to change Amy's gpa, we'd have to change it once per application she's made, and if we're not diligent enough, we'll end up with inconsistent data.
If a student withdraws her last application, we'd loose all information about that student (this has actually happened here already, as you'll see below).
So, much of the information in the table is redundant, and it may seem that we could remove many rows without losing information, but in fact we can't – any row we remove will mean that we lose that specific combination of values.
In week 4 we'll discuss the theory behind these problems (functional dependencies), and see a technique we can use to find well behaved tables (normalization), but I hope that you intuitively can see that everything would be easier if we had the following three tables instead of the big table above:
- one table,
students, with information about the students (id, name and gpa)
s_id |
s_name |
gpa |
|---|---|---|
| 123 | Amy | 3.9 |
| 234 | Bob | 3.6 |
| 345 | Craig | 3.5 |
| 456 | Doris | 3.9 |
| 543 | Craig | 3.4 |
| 567 | Edward | 2.9 |
| 654 | Amy | 3.9 |
| 678 | Fay | 3.8 |
| 765 | Jay | 2.9 |
| 789 | Gary | 3.4 |
| 876 | Irene | 3.9 |
| 987 | Helen | 3.7 |
Here we have some students which weren't seen in the original table above, since they did not have any applications (there is no data to put in the columns relating to applications for them – we'll soon see that there is a primitive way around this, but it's clumpsy).
%%sql
- one table,
colleges, with college information (name and state)
c_name |
state |
enrollment |
|---|---|---|
| Stanford | CA | 15000 |
| Berkeley | CA | 36000 |
| MIT | MA | 10000 |
| Cornell | NY | 21000 |
%%sql
- one table,
applications, with applications (student id, college name, major, and decision)
s_id |
c_name |
major |
decision |
|---|---|---|---|
| 123 | Stanford | CS | Y |
| 123 | Stanford | EE | N |
| 123 | Berkeley | CS | Y |
| 123 | Cornell | EE | Y |
| 234 | Berkeley | biology | N |
| 345 | MIT | bioengineering | Y |
| 345 | Cornell | bioengineering | N |
| 345 | Cornell | CS | Y |
| 345 | Cornell | EE | N |
| 678 | Stanford | history | Y |
| 987 | Stanford | CS | Y |
| 987 | Berkeley | CS | Y |
| 876 | Stanford | CS | N |
| 876 | MIT | biology | Y |
| 876 | MIT | marine biology | N |
| 765 | Stanford | history | Y |
| 765 | Cornell | history | N |
| 765 | Cornell | psychology | Y |
| 543 | MIT | CS | N |
%%sql
Nothing in these tables is redundant, so if we use these tables, the issues we saw above goes away:
if we want to update Amy's
gpa, we just update her single row in thestudentstable,we now can see some students who haven't applied for anything yet (Amy with
s_id654, Doris, Edward and Gary).
We'll talk more about this in week 4.
But the benefits come with a cost, we now need a way to combine information from several tables at once.
Let's say we want to show the names and grade averages of all students who have applied for CS at Stanford – none of our three tables above contain all information we need, but if we start by looking in the applications table, we can get …
%%sql
SELECT s_name, gpa
FROM applications
WHERE c_name = 'Stanford' AND major = 'CS'
RuntimeError: (sqlite3.OperationalError) no such column: s_name [SQL: SELECT s_name, gpa FROM applications WHERE c_name = 'Stanford' AND major = 'CS'] (Background on this error at: https://sqlalche.me/e/20/e3q8)
%%sql
SELECT s_name, gpa
FROM applications
INNER JOIN students ON applications.s_id = students.s_id
WHERE c_name = 'Stanford' AND major = 'CS'
| s_name | gpa |
|---|---|
| Amy | 3.9 |
| Irene | 3.9 |
| Helen | 3.7 |
The best we can get here is the s_id of all relevant students, but given the s_id we can look in the students table for the name and gpa.
In regular programming languages, such as Python or Java, we could do that with functions such as find_name(s_id) and find_gpa(s_id):
%%sql
SELECT find_name(s_id), find_gpa(s_id) -- not in SQL!
FROM applications
WHERE c_name = 'Stanford' AND major = 'CS'
But SQL is no regular language, it's based on relational algebra, and there we have another, really cool way of solving this problem: joins).
Inner joins¶
We solve the problem above by combining (joining) the applications table with the students table. There are several kinds of joins in SQL (and relational algebra), but the most straightforward kind is what's called an inner join#Inner_join).
The inner join of two tables is a new table, where rows from each table are joined together if the attributes of the rows matches a join predicate – in our case we want to add student data to the application data where the student id in the applications table matches the student id in the students table.
This can be expressed as (see the docs):
%%sql
SELECT *
FROM applications
INNER JOIN students ON applications.s_id = students.s_id
| s_id | c_name | major | decision | s_id_1 | s_name | gpa |
|---|---|---|---|---|---|---|
| 123 | Stanford | CS | Y | 123 | Amy | 3.9 |
| 123 | Stanford | EE | N | 123 | Amy | 3.9 |
| 123 | Berkeley | CS | Y | 123 | Amy | 3.9 |
| 123 | Cornell | EE | Y | 123 | Amy | 3.9 |
| 234 | Berkeley | biology | N | 234 | Bob | 3.6 |
| 345 | MIT | bioengineering | Y | 345 | Craig | 3.5 |
| 345 | Cornell | bioengineering | N | 345 | Craig | 3.5 |
| 345 | Cornell | CS | Y | 345 | Craig | 3.5 |
| 345 | Cornell | EE | N | 345 | Craig | 3.5 |
| 678 | Stanford | history | Y | 678 | Fay | 3.8 |
| 987 | Stanford | CS | Y | 987 | Helen | 3.7 |
| 987 | Berkeley | CS | Y | 987 | Helen | 3.7 |
| 876 | Stanford | CS | N | 876 | Irene | 3.9 |
| 876 | MIT | biology | Y | 876 | Irene | 3.9 |
| 876 | MIT | marine biology | N | 876 | Irene | 3.9 |
| 765 | Stanford | history | Y | 765 | Jay | 2.9 |
| 765 | Cornell | history | N | 765 | Jay | 2.9 |
| 765 | Cornell | psychology | Y | 765 | Jay | 2.9 |
| 543 | MIT | CS | N | 543 | Craig | 3.4 |
and we would get the joined table (note that the first columns are from the applications table, whereas the last columns are from the students table – also note that s_id and s_id' always are the same in each row, since that was what we joined on):
s_id |
c_name |
major |
decision |
s_id' |
s_name |
gpa |
|---|---|---|---|---|---|---|
| 123 | Stanford | CS | Y | 123 | Amy | 3.9 |
| 123 | Stanford | EE | N | 123 | Amy | 3.9 |
| 123 | Berkeley | CS | Y | 123 | Amy | 3.9 |
| 123 | Cornell | EE | Y | 123 | Amy | 3.9 |
| 234 | Berkeley | biology | N | 234 | Bob | 3.6 |
| 345 | MIT | bioengineering | Y | 345 | Craig | 3.5 |
| 345 | Cornell | bioengineering | N | 345 | Craig | 3.5 |
| 345 | Cornell | CS | Y | 345 | Craig | 3.5 |
| 345 | Cornell | EE | N | 345 | Craig | 3.5 |
| 678 | Stanford | history | Y | 678 | Fay | 3.8 |
| 987 | Stanford | CS | Y | 987 | Helen | 3.7 |
| 987 | Berkeley | CS | Y | 987 | Helen | 3.7 |
| 876 | Stanford | CS | N | 876 | Irene | 3.9 |
| 876 | MIT | biology | Y | 876 | Irene | 3.9 |
| 876 | MIT | marine biology | N | 876 | Irene | 3.9 |
| 765 | Stanford | history | Y | 765 | Jay | 2.9 |
| 765 | Cornell | history | N | 765 | Jay | 2.9 |
| 765 | Cornell | psychology | Y | 765 | Jay | 2.9 |
| 543 | MIT | CS | N | 543 | Craig | 3.4 |
This contains all information we need for our query.
We'll soon see other kinds of joins, but inner joins are the 'default', and so we can skip the reserved word INNER, and just write:
%%sql
SELECT *
FROM applications
JOIN students ON applications.s_id = students.s_id
Since the join predicate is just a test of equality of one attribute from each table (it's called an equijoin), and the attribute has the same name in both tables, we can use the simpler form:
%%sql
SELECT *
FROM applications
JOIN students USING (s_id)
which gives (observe that we now only have one s_id column):
s_id |
c_name |
major |
decision |
s_name |
gpa |
|---|---|---|---|---|---|
| 123 | Stanford | CS | Y | Amy | 3.9 |
| 123 | Stanford | EE | N | Amy | 3.9 |
| 123 | Berkeley | CS | Y | Amy | 3.9 |
| 123 | Cornell | EE | Y | Amy | 3.9 |
| 234 | Berkeley | biology | N | Bob | 3.6 |
| 345 | MIT | bioengineering | Y | Craig | 3.5 |
| 345 | Cornell | bioengineering | N | Craig | 3.5 |
| 345 | Cornell | CS | Y | Craig | 3.5 |
| 345 | Cornell | EE | N | Craig | 3.5 |
| 678 | Stanford | history | Y | Fay | 3.8 |
| 987 | Stanford | CS | Y | Helen | 3.7 |
| 987 | Berkeley | CS | Y | Helen | 3.7 |
| 876 | Stanford | CS | N | Irene | 3.9 |
| 876 | MIT | biology | Y | Irene | 3.9 |
| 876 | MIT | marine biology | N | Irene | 3.9 |
| 765 | Stanford | history | Y | Jay | 2.9 |
| 765 | Cornell | history | N | Jay | 2.9 |
| 765 | Cornell | psychology | Y | Jay | 2.9 |
| 543 | MIT | CS | N | Craig | 3.4 |
In the first join above we got two s_id columns (SQL can't know for certain that s_id describes the same thing in both tables), but with the JOIN USING construct, we tell SQL that they describe the same thing, so the join returns only one column for s_id.
In general, we can compare sets of several attributes in a JOIN USING query using a tuple (as we'll soon see).
The number of rows in a joined table¶
At this point, we should consider how many rows we get when we join two tables.
In the table above, each valid application is matched with one student, so the number of rows will be the total number of applications for all students in the students table (students with no applications will not contribute in the resulting table, and applications which does not have a corresponding student will not show up either).
Assume we want to keep track of the students' phone numbers – the first 'rule' of relational databases is that all values must be primitive, so if a student has more than one phone number, she can't have them saved in a list in a table.
Instead we introduce a new table, phone_numbers, like this:
%%sql
SELECT *
FROM phone_numbers
| s_id | phone_no |
|---|---|
| 123 | 650-550-8671 |
| 345 | 347-467-6210 |
| 345 | 516-881-8303 |
| 456 | 650-255-3090 |
| 456 | 650-376-9029 |
| 456 | 628-265-5898 |
| 543 | 617-155-4170 |
| 567 | 650-412-4837 |
| 654 | 619-339-6883 |
| 678 | 650-228-4538 |
| 765 | 508-765-8878 |
| 876 | 774-877-3733 |
| 876 | 617-522-3278 |
| 987 | 650-964-4042 |
s_id |
phone_no |
|---|---|
| 123 | 650-550-8671 |
| 345 | 347-467-6210 |
| 345 | 516-881-8303 |
| 456 | 650-255-3090 |
| 456 | 650-376-9029 |
| 456 | 628-265-5898 |
| 543 | 617-155-4170 |
| 567 | 650-412-4837 |
| 654 | 619-339-6883 |
| 678 | 650-228-4538 |
| 765 | 508-765-8878 |
| 876 | 774-877-3733 |
| 876 | 617-522-3278 |
| 987 | 650-964-4042 |
Here we use the students' s_id numbers, since their names might not be unique (we have two Amys and two Craigs), and then provide a phone number – we'll talk more about uniqueness in the next few weeks.
Each student can have either no phone number, one phone number, or several phone numbers, in the table above, Amy (s_id 101) has 1, Bob (s_id 234) has none, Craig (s_id 345) has two, and Doris (s_id 456) has three.
If we join our students table with this table, using an equijoin on s_id, we get:
%%sql
SELECT *
FROM students
JOIN phone_numbers USING (s_id)
| s_id | s_name | gpa | phone_no |
|---|---|---|---|
| 123 | Amy | 3.9 | 650-550-8671 |
| 345 | Craig | 3.5 | 347-467-6210 |
| 345 | Craig | 3.5 | 516-881-8303 |
| 456 | Doris | 3.9 | 650-255-3090 |
| 456 | Doris | 3.9 | 650-376-9029 |
| 456 | Doris | 3.9 | 628-265-5898 |
| 543 | Craig | 3.4 | 617-155-4170 |
| 567 | Edward | 2.9 | 650-412-4837 |
| 654 | Amy | 3.9 | 619-339-6883 |
| 678 | Fay | 3.8 | 650-228-4538 |
| 765 | Jay | 2.9 | 508-765-8878 |
| 876 | Irene | 3.9 | 774-877-3733 |
| 876 | Irene | 3.9 | 617-522-3278 |
| 987 | Helen | 3.7 | 650-964-4042 |
s_id |
s_name |
phone_no |
|---|---|---|
| 123 | Amy | 650-550-8671 |
| 345 | Craig | 347-467-6210 |
| 345 | Craig | 516-881-8303 |
| 456 | Doris | 628-265-5898 |
| 456 | Doris | 650-255-3090 |
| 456 | Doris | 650-376-9029 |
| … | … | … |
So, since each student has its own row in the students table, they get one row per phone number in the joined table – this means that Bob, who didn't have any phone number, will not survive the join, whereas Craig and Doris get several rows (one row per phone number).
In the section on outer joins below, we'll see how we can keep Bob alive, even if he doesn't have any phone number.
If each student already had several rows, such as if we had joined students and applications, and then joined this combined table with phone_numbers, we would get:
%%sql
SELECT *
FROM students
JOIN applications USING (s_id)
JOIN phone_numbers USING (s_id)
WHERE s_id < 500
| s_id | s_name | gpa | c_name | major | decision | phone_no |
|---|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Berkeley | CS | Y | 650-550-8671 |
| 123 | Amy | 3.9 | Cornell | EE | Y | 650-550-8671 |
| 123 | Amy | 3.9 | Stanford | CS | Y | 650-550-8671 |
| 123 | Amy | 3.9 | Stanford | EE | N | 650-550-8671 |
| 345 | Craig | 3.5 | Cornell | CS | Y | 347-467-6210 |
| 345 | Craig | 3.5 | Cornell | CS | Y | 516-881-8303 |
| 345 | Craig | 3.5 | Cornell | EE | N | 347-467-6210 |
| 345 | Craig | 3.5 | Cornell | EE | N | 516-881-8303 |
| 345 | Craig | 3.5 | Cornell | bioengineering | N | 347-467-6210 |
| 345 | Craig | 3.5 | Cornell | bioengineering | N | 516-881-8303 |
| 345 | Craig | 3.5 | MIT | bioengineering | Y | 347-467-6210 |
| 345 | Craig | 3.5 | MIT | bioengineering | Y | 516-881-8303 |
s_id |
s_name |
gpa |
c_name |
major |
decision |
phone_no |
|---|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Berkeley | CS | Y | 650-550-8671 |
| 123 | Amy | 3.9 | Cornell | EE | Y | 650-550-8671 |
| 123 | Amy | 3.9 | Stanford | CS | Y | 650-550-8671 |
| 123 | Amy | 3.9 | Stanford | EE | N | 650-550-8671 |
| 345 | Craig | 3.5 | Cornell | CS | Y | 347-467-6210 |
| 345 | Craig | 3.5 | Cornell | CS | Y | 516-881-8303 |
| 345 | Craig | 3.5 | Cornell | EE | N | 347-467-6210 |
| 345 | Craig | 3.5 | Cornell | EE | N | 516-881-8303 |
| 345 | Craig | 3.5 | Cornell | bioengineering | N | 347-467-6210 |
| 345 | Craig | 3.5 | Cornell | bioengineering | N | 516-881-8303 |
| 345 | Craig | 3.5 | MIT | bioengineering | Y | 347-467-6210 |
| 345 | Craig | 3.5 | MIT | bioengineering | Y | 516-881-8303 |
So, the number of rows per student will be the product of the number of rows they had in each of the joined tables.
Also, if we had a phone number with an unknown s_id, the number would not end up in our joined table.
The results of an inner join of several tables does not depend on in which order we join our tables (but joining them in the right order can potentially save an enormous amount of work for a big join, just as multiplying matrices in the right order depending on their shape can save lots of work – fortunately the 'query planner' is very clever when it decides how to join tables, we'll return to that later).
%%sql
SELECT *
FROM students
JOIN applications USING (s_id)
JOIN colleges USING (c_name)
| s_id | s_name | gpa | c_name | major | decision | state | enrollment |
|---|---|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Stanford | CS | Y | CA | 15000 |
| 123 | Amy | 3.9 | Stanford | EE | N | CA | 15000 |
| 123 | Amy | 3.9 | Berkeley | CS | Y | CA | 36000 |
| 123 | Amy | 3.9 | Cornell | EE | Y | NY | 21000 |
| 234 | Bob | 3.6 | Berkeley | biology | N | CA | 36000 |
| 345 | Craig | 3.5 | MIT | bioengineering | Y | MA | 10000 |
| 345 | Craig | 3.5 | Cornell | bioengineering | N | NY | 21000 |
| 345 | Craig | 3.5 | Cornell | CS | Y | NY | 21000 |
| 345 | Craig | 3.5 | Cornell | EE | N | NY | 21000 |
| 678 | Fay | 3.8 | Stanford | history | Y | CA | 15000 |
| 987 | Helen | 3.7 | Stanford | CS | Y | CA | 15000 |
| 987 | Helen | 3.7 | Berkeley | CS | Y | CA | 36000 |
| 876 | Irene | 3.9 | Stanford | CS | N | CA | 15000 |
| 876 | Irene | 3.9 | MIT | biology | Y | MA | 10000 |
| 876 | Irene | 3.9 | MIT | marine biology | N | MA | 10000 |
| 765 | Jay | 2.9 | Stanford | history | Y | CA | 15000 |
| 765 | Jay | 2.9 | Cornell | history | N | NY | 21000 |
| 765 | Jay | 2.9 | Cornell | psychology | Y | NY | 21000 |
| 543 | Craig | 3.4 | MIT | CS | N | MA | 10000 |
After we've joined…¶
Once we've joined our tables, we can proceed as we did before, with a WHERE clause:
%%sql
SELECT s_name, gpa
FROM applications
JOIN students USING (s_id)
WHERE c_name = 'Stanford' AND major = 'CS'
If we look at the syntax diagram for the SELECT statement, we see that we join our tables in the FROM clause (see join-clause), and that we can use everything we've learned so far to work on the joined tables (so we can group and use window functions and subqueries and CTE:s just as we've done before).
The combined, rather clumsy table will never be created, database engines are extremely clever when they decide on what data they need to keep around to execute their queries – so we normally do not need to worry much about efficiency (we'll talk a little more about this later in the course).
It could be useful to stop here for a while, and consider how you'd solve the problem above using Python or Java (or any other regular programming language). In case we let the students keep track of their applications, we'd need to loop through all students, and then all their applications looking for applications to CS@Stanford.
Deciding on who should keep track of what is a typical example of premature optimization, and using a relational database relieves us of this problem.
Problem: Show the id and name for all students who have applied for CS in California (state code CA).
%%sql
SELECT s_name, gpa, c_name
FROM applications
JOIN students USING (s_id)
JOIN colleges USING (c_name)
WHERE state = 'CA' AND major = 'CS'
| s_name | gpa | c_name |
|---|---|---|
| Amy | 3.9 | Berkeley |
| Amy | 3.9 | Stanford |
| Irene | 3.9 | Stanford |
| Helen | 3.7 | Berkeley |
| Helen | 3.7 | Stanford |
Cross joins¶
As we saw above, an inner join combines rows in two tables when the rows matches on some join predicate.
The predicate is typically dependent on column values in both rows (e.g., that the student ids in each table should match), but we can also use a seemingly nonsensical predicate such as true (which is always true):
%%sql
SELECT *
FROM applications
JOIN students ON true
LIMIT 10
| s_id | c_name | major | decision | s_id_1 | s_name | gpa |
|---|---|---|---|---|---|---|
| 123 | Stanford | CS | Y | 123 | Amy | 3.9 |
| 123 | Stanford | CS | Y | 234 | Bob | 3.6 |
| 123 | Stanford | CS | Y | 345 | Craig | 3.5 |
| 123 | Stanford | CS | Y | 456 | Doris | 3.9 |
| 123 | Stanford | CS | Y | 543 | Craig | 3.4 |
| 123 | Stanford | CS | Y | 567 | Edward | 2.9 |
| 123 | Stanford | CS | Y | 654 | Amy | 3.9 |
| 123 | Stanford | CS | Y | 678 | Fay | 3.8 |
| 123 | Stanford | CS | Y | 765 | Jay | 2.9 |
| 123 | Stanford | CS | Y | 789 | Gary | 3.4 |
This would mean that all rows in the applications table would match all rows in the students table, and we would get what's called the cartesian product of the two tables (or sets of rows), i.e., all combinations of rows in the tables.
This is exactly what a cross join is – we can write it as:
%%sql
SELECT *
FROM applications
CROSS JOIN students
and we typically use a WHERE clause to narrow the generated table into a table where only matching rows remains afterwards.
So, the query
%%sql
SELECT s_name, gpa
FROM applications
JOIN students USING (s_id)
WHERE c_name = 'Stanford' AND major = 'CS'
could have been written using a cross join instead:
%%sql
SELECT s_name, gpa
FROM applications
CROSS JOIN students
WHERE applications.s_id = students.s_id
AND c_name = 'Stanford'
AND major = 'CS'
| s_name | gpa |
|---|---|
| Amy | 3.9 |
| Irene | 3.9 |
| Helen | 3.7 |
This may look terrible at first sight, creating a huge table and then throwing away most of the combined rows, but, as said above, database engines are very, very clever, and they might find a way to make even this query fast. But I'd still recommend that you use inner queries, since it might help the database to find a faster query plan, and it's conceptually easier to understand for a person looking at your code.
By the way, there is a shortcut for cross joins, …
%%sql
SELECT s_name, gpa
FROM applications, students
WHERE applications.s_id = students.s_id
AND c_name = 'Stanford'
AND major = 'CS'
but, once again, opt for inner joins instead!
Outer joins¶
We saw above that a row in one table which matches several rows in the other table would give rise to several rows in the combined table. But it also goes the other way – when we do an inner join on two tables, rows for which the join predicate doesn't match any row in the other table will 'disappear' in the combined table.
So if we wanted to list all students and their applications, and wrote:
%%sql
SELECT *
FROM students
JOIN applications USING (s_id)
ORDER BY s_name
| s_id | s_name | gpa | c_name | major | decision |
|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Stanford | CS | Y |
| 123 | Amy | 3.9 | Stanford | EE | N |
| 123 | Amy | 3.9 | Berkeley | CS | Y |
| 123 | Amy | 3.9 | Cornell | EE | Y |
| 234 | Bob | 3.6 | Berkeley | biology | N |
| 345 | Craig | 3.5 | MIT | bioengineering | Y |
| 345 | Craig | 3.5 | Cornell | bioengineering | N |
| 345 | Craig | 3.5 | Cornell | CS | Y |
| 345 | Craig | 3.5 | Cornell | EE | N |
| 543 | Craig | 3.4 | MIT | CS | N |
| 678 | Fay | 3.8 | Stanford | history | Y |
| 987 | Helen | 3.7 | Stanford | CS | Y |
| 987 | Helen | 3.7 | Berkeley | CS | Y |
| 876 | Irene | 3.9 | Stanford | CS | N |
| 876 | Irene | 3.9 | MIT | biology | Y |
| 876 | Irene | 3.9 | MIT | marine biology | N |
| 765 | Jay | 2.9 | Stanford | history | Y |
| 765 | Jay | 2.9 | Cornell | history | N |
| 765 | Jay | 2.9 | Cornell | psychology | Y |
we would get all applications, but those students who hadn't applied for any major would be left out.
If we want to make sure that all rows in the first ('left') table survives the join, we can use a left outer join.
An outer join is a join in which we guarantee that all rows in one or both our tables will be present in the result.
But that leads to an important question: how do we combine an 'uncombinable' row?
The answer is that we combine it with a row full of NULL values in the other table – if we try a left outer join, and put the students table first (on the left), we get:
%%sql
SELECT *
FROM students
LEFT OUTER JOIN applications USING (s_id)
ORDER BY s_name
| s_id | s_name | gpa | c_name | major | decision |
|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Berkeley | CS | Y |
| 123 | Amy | 3.9 | Cornell | EE | Y |
| 123 | Amy | 3.9 | Stanford | CS | Y |
| 123 | Amy | 3.9 | Stanford | EE | N |
| 654 | Amy | 3.9 | None | None | None |
| 234 | Bob | 3.6 | Berkeley | biology | N |
| 345 | Craig | 3.5 | Cornell | CS | Y |
| 345 | Craig | 3.5 | Cornell | EE | N |
| 345 | Craig | 3.5 | Cornell | bioengineering | N |
| 345 | Craig | 3.5 | MIT | bioengineering | Y |
| 543 | Craig | 3.4 | MIT | CS | N |
| 456 | Doris | 3.9 | None | None | None |
| 567 | Edward | 2.9 | None | None | None |
| 678 | Fay | 3.8 | Stanford | history | Y |
| 789 | Gary | 3.4 | None | None | None |
| 987 | Helen | 3.7 | Berkeley | CS | Y |
| 987 | Helen | 3.7 | Stanford | CS | Y |
| 876 | Irene | 3.9 | MIT | biology | Y |
| 876 | Irene | 3.9 | MIT | marine biology | N |
| 876 | Irene | 3.9 | Stanford | CS | N |
| 765 | Jay | 2.9 | Cornell | history | N |
| 765 | Jay | 2.9 | Cornell | psychology | Y |
| 765 | Jay | 2.9 | Stanford | history | Y |
s_id |
c_name |
major |
decision |
s_name |
gpa |
|---|---|---|---|---|---|
| 123 | Amy | 3.9 | Berkeley | CS | Y |
| 123 | Amy | 3.9 | Cornell | EE | Y |
| 123 | Amy | 3.9 | Stanford | CS | Y |
| 123 | Amy | 3.9 | Stanford | EE | N |
| 654 | Amy | 3.9 | NULL |
NULL |
NULL |
| 234 | Bob | 3.6 | Berkeley | biology | N |
| 345 | Craig | 3.5 | Cornell | CS | Y |
| 345 | Craig | 3.5 | Cornell | EE | N |
| 345 | Craig | 3.5 | Cornell | bioengineering | N |
| 345 | Craig | 3.5 | MIT | bioengineering | Y |
| 543 | Craig | 3.4 | MIT | CS | N |
| 456 | Doris | 3.9 | NULL |
NULL |
NULL |
| 567 | Edward | 2.9 | NULL |
NULL |
NULL |
| 678 | Fay | 3.8 | Stanford | history | Y |
| 789 | Gary | 3.4 | NULL |
NULL |
NULL |
| 987 | Helen | 3.7 | Berkeley | CS | Y |
| 987 | Helen | 3.7 | Stanford | CS | Y |
| 876 | Irene | 3.9 | MIT | biology | Y |
| 876 | Irene | 3.9 | MIT | marine biology | N |
| 876 | Irene | 3.9 | Stanford | CS | N |
| 765 | Jay | 2.9 | Cornell | history | N |
| 765 | Jay | 2.9 | Cornell | psychology | Y |
| 765 | Jay | 2.9 | Stanford | history | Y |
Here we see 'the other Amy' (i.e., the one with s_id 654), Doris, Edward, and Gary, who apparently haven't applied for anything – the columns which would have been taken from the applications table in case of a match are now set to NULL.
Using outer joins to find 'missing' values¶
The fact that an outer join pads rows which can't be matched in the 'other' table with NULL values gives us a nice way to find out which values might be missing in a table.
Problem: We want to find all students who haven't applied for any major – do it in two ways:
- Using a subquery.
%%sql
SELECT s_id, s_name
FROM students
WHERE NOT s_id IN (
SELECT s_id
FROM applications)
| s_id | s_name |
|---|---|
| 456 | Doris |
| 567 | Edward |
| 654 | Amy |
| 789 | Gary |
- Using an outer join (checking where we've padded with
NULLcolumns).
%%sql
SELECT s_id, s_name
FROM students
LEFT JOIN applications USING (s_id)
WHERE major IS NULL
| s_id | s_name |
|---|---|
| 456 | Doris |
| 567 | Edward |
| 654 | Amy |
| 789 | Gary |
%%sql
SELECT NULL = NULL
| NULL = NULL |
|---|
| None |
This way of finding missing values may look like a hack, but it's in fact idiomatic.
Problem: In 1940-1942 no Nobel prizes were awarded (because of the Second World War), but the prize has been withheld in one or more categories in other years as well. Write a query which shows all such withheld awards, use suitable CTEs and use a subquery – observe that we only want 'missing' awards in years in which there was an award in at least one other category (such as literature in 1914 or medicine in 1915).
%%sql
WITH
award_years(year) AS (
SELECT DISTINCT year
FROM nobel_prizes
),
categories(category) AS (
SELECT DISTINCT category
FROM nobel_prizes
),
potential_awards(year, category) AS (
SELECT year, category
FROM award_years CROSS JOIN categories
),
actual_awards(year, category) AS (
SELECT DISTINCT year, category
FROM nobel_prizes
)
SELECT year, category
FROM potential_awards
WHERE (year, category) NOT IN actual_awards
| year | category |
|---|---|
| 1914 | literature |
| 1915 | medicine |
| 1916 | chemistry |
| 1916 | medicine |
| 1916 | physics |
| 1917 | chemistry |
| 1917 | medicine |
| 1918 | literature |
| 1918 | medicine |
| 1919 | chemistry |
| 1921 | medicine |
| 1924 | chemistry |
| 1925 | medicine |
| 1931 | physics |
| 1933 | chemistry |
| 1934 | physics |
| 1935 | literature |
| 1943 | literature |
Problem: The problem above can be solved with an outer join instead of a subquery, but we need to make sure we have attributes to check after the outer join. Solve it this way.
%%sql
WITH
award_years(year) AS (
SELECT DISTINCT year
FROM nobel_prizes
),
categories(category) AS (
SELECT DISTINCT category
FROM nobel_prizes
),
potential_awards(year, category) AS (
SELECT year, category
FROM award_years CROSS JOIN categories
),
actual_awards(year, category, flag) AS (
SELECT DISTINCT year, category, 1
FROM nobel_prizes
)
SELECT *
FROM potential_awards
LEFT JOIN actual_awards USING (year, category)
WHERE flag IS NULL
| year | category | flag |
|---|---|---|
| 1914 | literature | None |
| 1915 | medicine | None |
| 1916 | chemistry | None |
| 1916 | medicine | None |
| 1916 | physics | None |
| 1917 | chemistry | None |
| 1917 | medicine | None |
| 1918 | literature | None |
| 1918 | medicine | None |
| 1919 | chemistry | None |
| 1921 | medicine | None |
| 1924 | chemistry | None |
| 1925 | medicine | None |
| 1931 | physics | None |
| 1933 | chemistry | None |
| 1934 | physics | None |
| 1935 | literature | None |
| 1943 | literature | None |
%%sql
WITH
award_years(year) AS (
SELECT DISTINCT year
FROM nobel_prizes
),
categories(category) AS (
SELECT DISTINCT category
FROM nobel_prizes
),
potential_awards(year, category) AS (
SELECT year, category
FROM award_years CROSS JOIN categories
),
actual_awards(year, category, count) AS (
SELECT year, category, count()
FROM nobel_prizes
GROUP BY year, category
)
SELECT *
FROM potential_awards
LEFT JOIN actual_awards USING (year, category)
WHERE count IS NULL
| year | category | count |
|---|---|---|
| 1914 | literature | None |
| 1915 | medicine | None |
| 1916 | chemistry | None |
| 1916 | medicine | None |
| 1916 | physics | None |
| 1917 | chemistry | None |
| 1917 | medicine | None |
| 1918 | literature | None |
| 1918 | medicine | None |
| 1919 | chemistry | None |
| 1921 | medicine | None |
| 1924 | chemistry | None |
| 1925 | medicine | None |
| 1931 | physics | None |
| 1933 | chemistry | None |
| 1934 | physics | None |
| 1935 | literature | None |
| 1943 | literature | None |
Exercise: Add the number of other categories in which prizes were awarded for each of the missing prizes above.
%%sql
The first lines of the output should be something like:
1914: literature (3)
1915: medicine (3)
1916: chemistry (1)
1916: medicine (1)
1916: physics (1)
...
There are some more exercises on joins below.
Other outer joins¶
There are three kinds of outer joins:
LEFT OUTER JOIN, which can be abbreviatedLEFT JOIN– this is when we keep all rows in the first (leftmost) table, padding withNULLvalues when they don't match any row in the other (right) table.RIGHT OUTER JOIN, which can be abbreviatedRIGHT JOIN– this is when we keep all rows in the second (rightmost) table, padding withNULLvalues when they don't match any row in the other (left) table.FULL OUTER JOIN, which can be abbreviatedFULL JOIN– this is when we keep all rows in both tables, padding withNULLvalues when they don't match any row in the other table.
Exercises on JOIN¶
Exercise: Which students with a gpa above 3.5 have applied for a CS major at a college in California (CA) – show id, name, gpa and the names of the colleges they've applied to the CS program on?
%%sql
Hint: It could be a good idea to first define a CTE with the relevant students.
Your query should return 3 rows:
id name gpa colleges
--- ----- --- -----------------
123 Amy 3.9 Berkeley,Stanford
876 Irene 3.9 Stanford
987 Helen 3.7 Berkeley,Stanford
Exercise: List the id, name, and gpa for all students who have applied for more than one major in California (CA).
%%sql
Your query should return:
s_id s_name gpa
---- ------ ---
123 Amy 3.9
987 Helen 3.7
Exercise: Which are the three programs (major@college) with the lowest gpa for any accepted student.
%%sql
The output should be:
name major lowest gpa
-------- ---------- ----------
Cornell psychology 2.9
Stanford history 2.9
Cornell CS 3.5
Exercise: And which are the three programs with the highest gpa for a student not accepted?
%%sql
The output should be:
name major highest gpa
-------- -------------- -----------
MIT marine biology 3.9
Stanford CS 3.9
Stanford EE 3.9
Comment: Apparently the gpa isn't the only criterion for acceptance, since Irene (s_id 876) wasn't accepted at CS@Stanford, but Helen (s_id 987) was, although Irene has a higher gpa…
Exercise: In the section about correlated subqueries, we solved the problem of finding the laureates with the longest name for each category – solve this problem using a CTE instead of a correlated subquery (you can save the longest name length of each category in a CTE).
%%sql
Set operations¶
Relational databases and SQL are based on relational algebra, and this means that we can use simple set operations on our data.
If we look at the compound-operator in the SELECT statement, we see that we can combine the results of queries using UNION, INTERSECT, and EXCEPT.
This only works if the queries return exactly the same columns.
Problem: List the names and gpa for students who have applied for a major at a college in California, but haven't applied in any other state – use a set operation (it could also be solved in a number of other ways).
%%sql
WITH
ca_but_not_elsewhere(s_id) AS (
SELECT s_id
FROM applications
JOIN colleges USING (c_name)
WHERE state = 'CA'
EXCEPT
SELECT s_id
FROM applications
JOIN colleges USING (c_name)
WHERE state != 'CA'
)
SELECT s_name, gpa
FROM ca_but_not_elsewhere
JOIN students USING (s_id)
| s_name | gpa |
|---|---|
| Bob | 3.6 |
| Fay | 3.8 |
| Helen | 3.7 |
INSERT, UPDATE and DELETE¶
To maintain our databases, we need ways to change their contents, and for that we have tree important functions:
INSERT: inserts a row into a tableUPDATE: updates rows which matches some predicateDELETE: deletes rows which matches some predicate
We'll talk about these functions next week, when we talk about modeling (then we'll also discuss the concept of keys, which is very important when we modify our tables).