EDAF75 - notebook for lecture 2¶
Author: Christian Söderberg
Week 1: Introduction, relational databases, and SQL, continued…¶
In order to run SQL in this notebook, we need to load the SQL extension:
%load_ext sql
/home/cso/lects/db/lect-01/notebooks/.venv/lib/python3.12/site-packages/sql/parse.py:338: SyntaxWarning: invalid escape sequence '\:' """ /home/cso/lects/db/lect-01/notebooks/.venv/lib/python3.12/site-packages/sql/parse.py:368: SyntaxWarning: invalid escape sequence '\:' """
If we use jupysql
(instead of ipython-sql
) we get a default limit of 10 rows for any query – to turn this limit off we can write:
%config SqlMagic.displaylimit = None
I've put all the tables used for this lecture in a file called lect01.sqlite
, and we can load them using:
%sql sqlite:///lect02.sqlite
Now we're good to go, we just have to prefix our SQL queries with %sql
(one line of SQL) or %%sql
(several lines of SQL, this is the form we will use in most cases).
Subqueries, Views and Common Table Expressions (CTEs)¶
As we noted above, the result of a SELECT
statement is itself a (kind of) table, and we can use such a table inside other SELECT
statements.
One useful pattern is:
SELECT ...
FROM ...
WHERE ... IN
(SELECT ...
FROM ...
WHERE ...)
The second, nested query is called a subquery.
We'll use a subquery to find all literature laureates who split their prizes – to do it we begin with a regular query, to find which years the Nobel prize for literature were split?
%%sql
SELECT year
FROM nobel_prizes
WHERE category = 'literature'
GROUP BY year
HAVING count() > 1
year |
---|
1904 |
1917 |
1966 |
1974 |
… and now we use the result of that query to find out what we're really looking for:
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN (
SELECT year
FROM nobel_prizes
WHERE category = 'literature'
GROUP BY year
HAVING count() > 1
)
year | name |
---|---|
1904 | Frédéric Mistral |
1904 | José Echegaray y Eizaguirre |
1917 | Henrik Pontoppidan |
1917 | Karl Adolph Gjellerup |
1966 | Nelly Sachs |
1966 | Shmuel Yosef Agnon |
1974 | Eyvind Johnson |
1974 | Harry Martinson |
This can be simplified by using either of two ways to define 'temporary tables':
- views
- Common Table Expressions (CTE:s).
A view can be seen as a new table, containing the result of a query – to create a view with the years where the literature prize was shared, we can write:
%%sql
CREATE VIEW shared_literature_prize(year) AS
SELECT year
FROM nobel_prizes
WHERE category = 'literature'
GROUP BY year
HAVING count() > 1
We can now use our view in a query:
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN (
SELECT year
FROM shared_literature_prize
)
year | name |
---|---|
1904 | Frédéric Mistral |
1904 | José Echegaray y Eizaguirre |
1917 | Henrik Pontoppidan |
1917 | Karl Adolph Gjellerup |
1966 | Nelly Sachs |
1966 | Shmuel Yosef Agnon |
1974 | Eyvind Johnson |
1974 | Harry Martinson |
We can simplify this expression, by writing:
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN shared_literature_prize
year | name |
---|---|
1904 | Frédéric Mistral |
1904 | José Echegaray y Eizaguirre |
1917 | Henrik Pontoppidan |
1917 | Karl Adolph Gjellerup |
1966 | Nelly Sachs |
1966 | Shmuel Yosef Agnon |
1974 | Eyvind Johnson |
1974 | Harry Martinson |
If we have more than one column in the output of the subquery, we must compare it to a tuple within parentheses (this is a quite new addition to SQL, so it might not work everywhere).
The view will be around until we decide to remove it with:
%%sql
DROP VIEW shared_literature_prize
A view isn't actually a table – each time the view is used it runs its query, which will take some time, but it means we get all updates in the underlying table(s).
A Common Table Expression (or CTE) is like a view, but only defined as part of a query, so we'd write it like:
%%sql
WITH
shared_literature_prize(year) AS (
SELECT ... -- as above...
)
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN shared_literature_prize
There are some things which makes CTEs very nice:
- they're defined as part of a
SELECT
statement (so there is nothing to drop afterwards), - since they're part of a
SELECT
statement, we only need one statement (which will become useful when we call our database remotely, we'll return to that later in the course), and - they can be defined recursively (we'll return to that later in the course).
Problem: Show the years and categories for recurring laureates (i.e., laureates who has won more than once) – use a CTE to do it.
%%sql
-- exercise
Exercise: Who has won the literature prize in a year when at least one chemistry laureate had a name beginning with 'L'? First try to solve this with a regular subquery:
%%sql
and then rewrite it using a CTE:
%%sql
We saw above that we can't have another WHERE
after the HAVING
clause:
%%sql
SELECT category, count() AS count
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3
WHERE count > 1 -- <-- not allowed!
but we can make our 'grouping query' into a subquery, and have another WHERE
in the outer query:
%%sql
SELECT category, count
FROM (
SELECT category, count() AS count
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3)
WHERE count > 1
A tidier way of expressing this is to use a WITH-statement (CTE):
%%sql
WITH category_count(category, count) AS (
SELECT category, count()
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3
)
SELECT category, count
FROM category_count
WHERE count > 1
Exercise: How many different laureates do we have in each category? Observe that there are a couple of laureates with two awards in the same category, they should only be counted once – Marie Curie, on the other hand, has won in two categories, so she should be counted in two different categories. Use a CTE to keep track of all laureates in each category.
%%sql
Problem: There is a 'funny' function which returns series of values – it's called generate_series
, and you can try to use it to see which year no nobel price at all was awarded (the function generate_series
is quite new, so this might not work in your notebook if it uses an older version of sqlite3
).
%%sql
Exercise: Some of the years when Nobel prizes actually were awarded, it was withheld for one or more categories – in which years did the Swedish Academy refuse to award anyone, despite prizes being awarded in other categories?
%%sql
Correlated subqueries¶
Another form of subquery allows us to correlate two SELECT
statements, so that we use values from an outer SELECT
statement inside the SELECT
statement in a subquery. As an example, solve the following problem:
Problem: List the names of all laureates who have the longest name of all laureates in their categories (in case of ties, all should be listed) – order by category.
Here we can use a subquery which is 'run' for each row in our main query:
%%sql
SELECT category, year, name
FROM nobel_prizes AS outer_nobel
WHERE length(name) = (
SELECT max(length(name))
FROM nobel_prizes
WHERE category = outer_nobel.category)
ORDER BY category
This is called a correlated subquery (since we refer to the enclosing query inside it). We use an alias to distinguish between the nobel table in the outer query and the nobel table in the subquery (it's the same table, but we 'iterate' through it separately).
BTW, we could have skipped the AS
in
...
FROM nobel_prizes AS outer_nobel
...
and just written:
...
FROM nobel_prizes outer_nobel
...
The general opinion is that we should use AS
, as it makes it more obvious that we're defining an alias.
When one table is not enough¶
In the preparation notes for this week's lectures, you were asked to think through the problem below.
Problem: We've been asked to write code to handle students and their college applications. Each student has a name, a student id (stil), and an average grade. The students apply for various majors at colleges, where each college has a name, is situated in a given state, and has a given enrollment. Each application is for a specific major at a specific college, and it can be accepted, rejected, or pending.
- What classes and/or data structures would you use to keep track of this data in Python or Java?
- How do you list all applications for a given student?
- How do you list all applications for a given major at a given college?
- Approximately how many lines of Python/Java code do you need to list the three programs (major@college) with the highest grade average for the 'last' accepted student (assuming the students are accepted in order of decreasing grade averages)?
- How would you persist your data, so it doesn't get lost if we have to restart the program?
- Can you come up with a way to let several programs access the data simultaneously?
- If you solve the previous problem, and want to allow several programs update the data simultaneously, how do you make sure the data isn't corrupted?
If we wanted to keep all the information above in one table, we'd need the following columns (below I use some column names which are in a style I don't normally use, but I think it could be good for you to be exposed to different styles):
- student id (
s_id
) - student name (
s_name
) - average grade (
gpa
) - college (
c_name
) - state (
state
) - enrollment (
enrollment
) - major (
major
) - decision (
decision
)
Here is a (very small) example:
s_id |
s_name |
gpa |
c_name |
state |
enrollment |
major |
decision |
---|---|---|---|---|---|---|---|
123 | Amy | 3.9 | Stanford | CA | 15000 | CS | Y |
123 | Amy | 3.9 | Stanford | CA | 15000 | EE | N |
123 | Amy | 3.9 | Berkeley | CA | 36000 | CS | Y |
123 | Amy | 3.9 | Cornell | NY | 21000 | EE | Y |
234 | Bob | 3.6 | Berkeley | CA | 36000 | biology | N |
345 | Craig | 3.5 | MIT | MA | 10000 | bioengineering | Y |
345 | Craig | 3.5 | Cornell | NY | 21000 | bioengineering | N |
345 | Craig | 3.5 | Cornell | NY | 21000 | CS | Y |
345 | Craig | 3.5 | Cornell | NY | 21000 | EE | N |
678 | Fay | 3.8 | Stanford | CA | 15000 | history | Y |
987 | Helen | 3.7 | Stanford | CA | 15000 | CS | Y |
987 | Helen | 3.7 | Berkeley | CA | 36000 | CS | Y |
876 | Irene | 3.9 | Stanford | CA | 15000 | CS | N |
876 | Irene | 3.9 | MIT | MA | 10000 | biology | Y |
876 | Irene | 3.9 | MIT | MA | 10000 | marine biology | N |
765 | Jay | 2.9 | Stanford | CA | 15000 | history | Y |
765 | Jay | 2.9 | Cornell | NY | 21000 | history | N |
765 | Jay | 2.9 | Cornell | NY | 21000 | psychology | Y |
543 | Craig | 3.4 | MIT | MA | 10000 | CS | N |
This table is wonderful for finding information, but it comes with some big problems:
some of the information is repeated in many places (which is wasteful and error prone)
if we want to change Amy's gpa, we'd have to change it in several places (once per application she's made)
if a student withdraws her last application, we'd loose all information about that student (this has actually happened here already, as you'll see below)
So, although much of the information in the table is redundant, and it may seem that we easily could remove many rows without losing any information, in fact we can't (as we'll see in week 4) – any row we remove will mean that we lose some potentially important information (that specific combination of values will get lost).
In week 4 we'll discuss the theory behind these problems (functional dependencies), and see a technique we can use to find well behaved tables (normalization), but I hope that you intuitively can see that everything would be easier if we had the following three tables instead of the big table above:
- one table,
students
, with information about the students (id, name and gpa)
s_id |
s_name |
gpa |
---|---|---|
123 | Amy | 3.9 |
234 | Bob | 3.6 |
345 | Craig | 3.5 |
456 | Doris | 3.9 |
543 | Craig | 3.4 |
567 | Edward | 2.9 |
654 | Amy | 3.9 |
678 | Fay | 3.8 |
765 | Jay | 2.9 |
789 | Gary | 3.4 |
876 | Irene | 3.9 |
987 | Helen | 3.7 |
Here we have some students which weren't seen in the original table above, since they did not have any applications (there is no data to put in the columns relating to applications for them – we'll soon see that there is a way around this).
%%sql
- one table,
colleges
, with college information (name and state)
c_name |
state |
enrollment |
---|---|---|
Stanford | CA | 15000 |
Berkeley | CA | 36000 |
MIT | MA | 10000 |
Cornell | NY | 21000 |
%%sql
- one table,
applications
, with applications (student id, college name, major, and decision)
s_id |
c_name |
major |
decision |
---|---|---|---|
123 | Stanford | CS | Y |
123 | Stanford | EE | N |
123 | Berkeley | CS | Y |
123 | Cornell | EE | Y |
234 | Berkeley | biology | N |
345 | MIT | bioengineering | Y |
345 | Cornell | bioengineering | N |
345 | Cornell | CS | Y |
345 | Cornell | EE | N |
678 | Stanford | history | Y |
987 | Stanford | CS | Y |
987 | Berkeley | CS | Y |
876 | Stanford | CS | N |
876 | MIT | biology | Y |
876 | MIT | marine biology | N |
765 | Stanford | history | Y |
765 | Cornell | history | N |
765 | Cornell | psychology | Y |
543 | MIT | CS | N |
%%sql
Nothing in these tables is redundant, so if we use these tables, the issues we saw above goes away (we'll talk more about it in week 4). But the benefits come with a cost, we now need a way to combine information from several tables at once.
Let's say we want to show the names and grade averages of all students who have applied for CS at Stanford – none of our three tables above contain all information we need, but if we start by looking in the applications table, we can get …
%%sql
SELECT s_id
FROM applications
WHERE c_name = 'Stanford' AND major = 'CS'
s_id |
---|
123 |
876 |
987 |
The best we can get here is the id of all relevant students, but given the id we can look in the students
table for the name and gpa.
In regular programming languages, such as Python or Java, we could do that with functions such as find_name(s_id)
and find_gpa(s_id)
:
%%sql
SELECT find_name(s_id), find_gpa(s_id) -- not in SQL!
FROM applications
WHERE c_name = 'Stanford' AND major = 'CS'
But SQL is no regular language, it's based on relational algebra, and there we have another, really cool way of solving this problem: joins).
Inner joins¶
We solve the problem above by combining (joining) the applications
table with the students
table. There are several kinds of joins in SQL (and relational algebra), but the most straightforward kind is what's called an inner join#Inner_join).
The inner join of two tables is a new table, where rows from each table are joined together if the attributes of the rows matches a join predicate – in our case we want to add student data to the application data where the student id in the applications table matches the student id in the students table.
This can be expressed as (see the docs):
%%sql
SELECT *
FROM applications
INNER JOIN students ON applications.s_id = students.s_id
s_id | c_name | major | decision | s_id_1 | s_name | gpa |
---|---|---|---|---|---|---|
123 | Stanford | CS | Y | 123 | Amy | 3.9 |
123 | Stanford | EE | N | 123 | Amy | 3.9 |
123 | Berkeley | CS | Y | 123 | Amy | 3.9 |
123 | Cornell | EE | Y | 123 | Amy | 3.9 |
234 | Berkeley | biology | N | 234 | Bob | 3.6 |
345 | MIT | bioengineering | Y | 345 | Craig | 3.5 |
345 | Cornell | bioengineering | N | 345 | Craig | 3.5 |
345 | Cornell | CS | Y | 345 | Craig | 3.5 |
345 | Cornell | EE | N | 345 | Craig | 3.5 |
678 | Stanford | history | Y | 678 | Fay | 3.8 |
987 | Stanford | CS | Y | 987 | Helen | 3.7 |
987 | Berkeley | CS | Y | 987 | Helen | 3.7 |
876 | Stanford | CS | N | 876 | Irene | 3.9 |
876 | MIT | biology | Y | 876 | Irene | 3.9 |
876 | MIT | marine biology | N | 876 | Irene | 3.9 |
765 | Stanford | history | Y | 765 | Jay | 2.9 |
765 | Cornell | history | N | 765 | Jay | 2.9 |
765 | Cornell | psychology | Y | 765 | Jay | 2.9 |
543 | MIT | CS | N | 543 | Craig | 3.4 |
and we would get the joined table (note that the first columns are from the applications
table, whereas the last columns are from the students
table – also note that s_id
and s_id'
always are the same in each row):
s_id |
c_name |
major |
decision |
s_id' |
s_name |
gpa |
---|---|---|---|---|---|---|
123 | Stanford | CS | Y | 123 | Amy | 3.9 |
123 | Stanford | EE | N | 123 | Amy | 3.9 |
123 | Berkeley | CS | Y | 123 | Amy | 3.9 |
123 | Cornell | EE | Y | 123 | Amy | 3.9 |
234 | Berkeley | biology | N | 234 | Bob | 3.6 |
345 | MIT | bioengineering | Y | 345 | Craig | 3.5 |
345 | Cornell | bioengineering | N | 345 | Craig | 3.5 |
345 | Cornell | CS | Y | 345 | Craig | 3.5 |
345 | Cornell | EE | N | 345 | Craig | 3.5 |
678 | Stanford | history | Y | 678 | Fay | 3.8 |
987 | Stanford | CS | Y | 987 | Helen | 3.7 |
987 | Berkeley | CS | Y | 987 | Helen | 3.7 |
876 | Stanford | CS | N | 876 | Irene | 3.9 |
876 | MIT | biology | Y | 876 | Irene | 3.9 |
876 | MIT | marine biology | N | 876 | Irene | 3.9 |
765 | Stanford | history | Y | 765 | Jay | 2.9 |
765 | Cornell | history | N | 765 | Jay | 2.9 |
765 | Cornell | psychology | Y | 765 | Jay | 2.9 |
543 | MIT | CS | N | 543 | Craig | 3.4 |
which contains all information we need.
We'll soon see other kinds of joins, but inner joins are the 'default', and so we can skip the reserved word INNER
, and just write:
%%sql
SELECT *
FROM applications
JOIN students ON applications.s_id = students.s_id
Since the join predicate is just a test of equality of one attribute from each table (it's called an equijoin), and the attribute has the same name in both tables, we can use the simpler form:
%%sql
SELECT *
FROM applications
JOIN students USING (s_id)
s_id | c_name | major | decision | s_name | gpa |
---|---|---|---|---|---|
123 | Stanford | CS | Y | Amy | 3.9 |
123 | Stanford | EE | N | Amy | 3.9 |
123 | Berkeley | CS | Y | Amy | 3.9 |
123 | Cornell | EE | Y | Amy | 3.9 |
234 | Berkeley | biology | N | Bob | 3.6 |
345 | MIT | bioengineering | Y | Craig | 3.5 |
345 | Cornell | bioengineering | N | Craig | 3.5 |
345 | Cornell | CS | Y | Craig | 3.5 |
345 | Cornell | EE | N | Craig | 3.5 |
678 | Stanford | history | Y | Fay | 3.8 |
987 | Stanford | CS | Y | Helen | 3.7 |
987 | Berkeley | CS | Y | Helen | 3.7 |
876 | Stanford | CS | N | Irene | 3.9 |
876 | MIT | biology | Y | Irene | 3.9 |
876 | MIT | marine biology | N | Irene | 3.9 |
765 | Stanford | history | Y | Jay | 2.9 |
765 | Cornell | history | N | Jay | 2.9 |
765 | Cornell | psychology | Y | Jay | 2.9 |
543 | MIT | CS | N | Craig | 3.4 |
which gives (observe that we now only have one s_id
column):
s_id |
c_name |
major |
decision |
s_name |
gpa |
---|---|---|---|---|---|
123 | Stanford | CS | Y | Amy | 3.9 |
123 | Stanford | EE | N | Amy | 3.9 |
123 | Berkeley | CS | Y | Amy | 3.9 |
123 | Cornell | EE | Y | Amy | 3.9 |
234 | Berkeley | biology | N | Bob | 3.6 |
345 | MIT | bioengineering | Y | Craig | 3.5 |
345 | Cornell | bioengineering | N | Craig | 3.5 |
345 | Cornell | CS | Y | Craig | 3.5 |
345 | Cornell | EE | N | Craig | 3.5 |
678 | Stanford | history | Y | Fay | 3.8 |
987 | Stanford | CS | Y | Helen | 3.7 |
987 | Berkeley | CS | Y | Helen | 3.7 |
876 | Stanford | CS | N | Irene | 3.9 |
876 | MIT | biology | Y | Irene | 3.9 |
876 | MIT | marine biology | N | Irene | 3.9 |
765 | Stanford | history | Y | Jay | 2.9 |
765 | Cornell | history | N | Jay | 2.9 |
765 | Cornell | psychology | Y | Jay | 2.9 |
543 | MIT | CS | N | Craig | 3.4 |
In the first join above we got two s_id
columns (SQL can't possibly know that s_id
describes the same thing in both tables), but with the JOIN
USING
construct, we tell SQL that they describe the same thing, so the join returns only one column for s_id
.
In general, we can compare sets of several attributes in a JOIN
USING
query.
The number of rows in a joined table¶
At this point, we should consider how many rows we get when we join two tables.
Assume we want to keep track of the students' phone numbers – the first 'rule' of relational databases is that all values must be primitive, so if a student has more than one phone number, she can't have them saved in a list in a table.
Instead we introduce a new table, phone_numbers
, like this:
%%sql
SELECT *
FROM phone_numbers
s_id |
phone_no |
---|---|
123 | 650-550-8671 |
345 | 347-467-6210 |
345 | 516-881-8303 |
456 | 650-255-3090 |
456 | 650-376-9029 |
456 | 628-265-5898 |
543 | 617-155-4170 |
567 | 650-412-4837 |
654 | 619-339-6883 |
678 | 650-228-4538 |
765 | 508-765-8878 |
876 | 774-877-3733 |
876 | 617-522-3278 |
987 | 650-964-4042 |
Here we use the students' s_id
numbers, since their names might not be unique (we have two Amys and two Craigs), and then provide a phone number.
Each student can have either no phone number, one phone number, or several phone numbers, in the table above, Amy (s_id
101) has 1, Bob (s_id
234) has none, Craig (s_id
345) has two, and Doris (s_id
456) has three.
If we join our students
table with this table, using an equijoin on s_id
, we get:
%%sql
SELECT *
FROM students
JOIN phone_numbers USING (s_id)
s_id | s_name | gpa | phone_no |
---|---|---|---|
123 | Amy | 3.9 | 650-550-8671 |
345 | Craig | 3.5 | 347-467-6210 |
345 | Craig | 3.5 | 516-881-8303 |
456 | Doris | 3.9 | 650-255-3090 |
456 | Doris | 3.9 | 650-376-9029 |
456 | Doris | 3.9 | 628-265-5898 |
543 | Craig | 3.4 | 617-155-4170 |
567 | Edward | 2.9 | 650-412-4837 |
654 | Amy | 3.9 | 619-339-6883 |
678 | Fay | 3.8 | 650-228-4538 |
765 | Jay | 2.9 | 508-765-8878 |
876 | Irene | 3.9 | 774-877-3733 |
876 | Irene | 3.9 | 617-522-3278 |
987 | Helen | 3.7 | 650-964-4042 |
s_id |
s_name |
phone_no |
---|---|---|
123 | Amy | 650-550-8671 |
345 | Craig | 347-467-6210 |
345 | Craig | 516-881-8303 |
456 | Doris | 628-265-5898 |
456 | Doris | 650-255-3090 |
456 | Doris | 650-376-9029 |
… | … | … |
So, since each student has its own row in the students
table, they get one row per phone number in the joined table – this means that Bob, who didn't have any phone number, will not survive the join, whereas Craig and Doris get several rows (one row per phone number).
In the section on outer joins below, we'll see how we can keep Bob alive, even if he doesn't have any phone number.
If each student already had several rows, such as if we had joined students
and applications
, and then joined this combined table with phone_numbers
, we would get:
%%sql
SELECT *
FROM students
JOIN applications USING (s_id)
JOIN phone_numbers USING (s_id)
WHERE s_id < 500
s_id |
s_name |
gpa |
c_name |
major |
decision |
phone_no |
---|---|---|---|---|---|---|
123 | Amy | 3.9 | Berkeley | CS | Y | 650-550-8671 |
123 | Amy | 3.9 | Cornell | EE | Y | 650-550-8671 |
123 | Amy | 3.9 | Stanford | CS | Y | 650-550-8671 |
123 | Amy | 3.9 | Stanford | EE | N | 650-550-8671 |
345 | Craig | 3.5 | Cornell | CS | Y | 347-467-6210 |
345 | Craig | 3.5 | Cornell | CS | Y | 516-881-8303 |
345 | Craig | 3.5 | Cornell | EE | N | 347-467-6210 |
345 | Craig | 3.5 | Cornell | EE | N | 516-881-8303 |
345 | Craig | 3.5 | Cornell | bioengineering | N | 347-467-6210 |
345 | Craig | 3.5 | Cornell | bioengineering | N | 516-881-8303 |
345 | Craig | 3.5 | MIT | bioengineering | Y | 347-467-6210 |
345 | Craig | 3.5 | MIT | bioengineering | Y | 516-881-8303 |
So, the number of rows per student will be the product of the number of rows they had in each of the joined tables.
Also, if we had a phone number with an unknown s_id
, the number would not end up in our joined table.
The results of an inner join of several tables does not depend on in which order we join our tables (but joining them in the right order can potentially save an enormous amount of work for a big join, just as multiplying matrices in the right order can save lots of work – fortunately the 'query planner' is very clever when it decides how to join tables, we'll return to that later).
After we've joined…¶
Once we've joined our tables, we can proceed as we did before, with a WHERE
clause:
%%sql
SELECT s_name, gpa
FROM applications
JOIN students USING (s_id)
WHERE c_name = 'Stanford' AND major = 'CS'
If we look at the syntax diagram for the SELECT
statement, we see that we join our tables in the FROM
clause (see join-clause
), and that we can use everything we've learned so far to work on the joined tables (so we can group and use window functions and subqueries just as we've done before).
The combined, rather clumsy table will never be created, database engines are extremely clever when they decide on what data they need to keep around to execute their queries – so we normally do not need to worry much about efficiency (we'll talk a little more about this later in the course).
It could be useful to stop here for a while, and consider how you'd solve the problem above using Python or Java (or any other regular programming language). In case we let the students keep track of their applications, we'd need to loop through all students, and then all their applications looking for applications to CS@Stanford.
Deciding on who should keep track of what is a typical example of premature optimization, and using a relational database relieves us of this problem.
Problem: Show the id and name for all students who has applied for CS in California (state code CA
).
%%sql
SELECT *
FROM students
JOIN applications USING (s_id)
JOIN colleges USING (c_name)
WHERE state = 'CA'
s_id | s_name | gpa | c_name | major | decision | state | enrollment |
---|---|---|---|---|---|---|---|
123 | Amy | 3.9 | Stanford | CS | Y | CA | 15000 |
123 | Amy | 3.9 | Stanford | EE | N | CA | 15000 |
123 | Amy | 3.9 | Berkeley | CS | Y | CA | 36000 |
234 | Bob | 3.6 | Berkeley | biology | N | CA | 36000 |
678 | Fay | 3.8 | Stanford | history | Y | CA | 15000 |
987 | Helen | 3.7 | Stanford | CS | Y | CA | 15000 |
987 | Helen | 3.7 | Berkeley | CS | Y | CA | 36000 |
876 | Irene | 3.9 | Stanford | CS | N | CA | 15000 |
765 | Jay | 2.9 | Stanford | history | Y | CA | 15000 |
Cross joins¶
As we saw above, an inner join combines rows in two tables when the rows matches on some join predicate.
The predicate is typically dependent on column values in both rows (e.g., that the student ids in each table should match), but we can also use a seemingly nonsensical predicate such as true
(which is always true):
%%sql
SELECT *
FROM applications
JOIN students ON false
s_id | c_name | major | decision | s_id_1 | s_name | gpa |
---|
This would mean that all rows in the applications table would match all rows in the students table, and we would get what's called the cartesian product of the two tables (or sets of rows), i.e., all combinations of rows in the tables.
This is exactly what a cross join is – we can write it as:
%%sql
SELECT *
FROM applications
CROSS JOIN students
s_id | c_name | major | decision | s_id_1 | s_name | gpa |
---|
and we typically use a WHERE
clause to narrow the generated table into a table where only matching rows remains afterwards.
So, the query
%%sql
SELECT s_name, gpa
FROM applications
JOIN students USING (s_id)
WHERE c_name = 'Stanford' AND major = 'CS'
could have been written using a cross join instead:
%%sql
SELECT s_name, gpa
FROM applications
CROSS JOIN students
WHERE applications.s_id = students.s_id
AND c_name = 'Stanford'
AND major = 'CS'
This may look terrible at first sight, creating a huge table and then throwing away most of the combined rows, but, as said above, database engines are very, very clever, and they might find a way to make even this query fast. But I'd still recommend that you use inner queries, since it might help the database to find a faster query plan, and it's conceptually easier to understand for a person looking at your code.
By the way, there is a shortcut for cross joins, …
%%sql
SELECT s_name, gpa
FROM applications, students
WHERE applications.s_id = students.s_id
AND c_name = 'Stanford'
AND major = 'CS'
but, once again, opt for inner joins instead!
Outer joins¶
We saw above that a row in one table which matches several rows in the other table would give rise to several rows in the combined table. But it also goes the other way – when we do an inner join on two tables, rows for which the join predicate doesn't match any row in the other table will 'disappear' in the combined table.
So if we wanted to list all students and their applications, and wrote:
%%sql
SELECT *
FROM students
JOIN applications USING (s_id)
ORDER BY s_name
s_id | s_name | gpa | c_name | major | decision |
---|---|---|---|---|---|
123 | Amy | 3.9 | Stanford | CS | Y |
123 | Amy | 3.9 | Stanford | EE | N |
123 | Amy | 3.9 | Berkeley | CS | Y |
123 | Amy | 3.9 | Cornell | EE | Y |
234 | Bob | 3.6 | Berkeley | biology | N |
345 | Craig | 3.5 | MIT | bioengineering | Y |
345 | Craig | 3.5 | Cornell | bioengineering | N |
345 | Craig | 3.5 | Cornell | CS | Y |
345 | Craig | 3.5 | Cornell | EE | N |
543 | Craig | 3.4 | MIT | CS | N |
678 | Fay | 3.8 | Stanford | history | Y |
987 | Helen | 3.7 | Stanford | CS | Y |
987 | Helen | 3.7 | Berkeley | CS | Y |
876 | Irene | 3.9 | Stanford | CS | N |
876 | Irene | 3.9 | MIT | biology | Y |
876 | Irene | 3.9 | MIT | marine biology | N |
765 | Jay | 2.9 | Stanford | history | Y |
765 | Jay | 2.9 | Cornell | history | N |
765 | Jay | 2.9 | Cornell | psychology | Y |
we would get all applications, but those students who hadn't applied for any major would be left out.
If we want to make sure that all rows in the first ('left') table survives the join, we can use a left outer join.
An outer join is a join in which we guarantee that all rows in one or both our tables will be present in the result.
But that leads to an important question: how do we combine an 'uncombinable' row?
The answer is that we combine it with a row full of NULL
values in the other table – if we try a left outer join, and put the students
table first (on the left), we get:
%%sql
SELECT *
FROM students
LEFT OUTER JOIN applications USING (s_id)
ORDER BY s_name
s_id | s_name | gpa | c_name | major | decision |
---|---|---|---|---|---|
123 | Amy | 3.9 | Berkeley | CS | Y |
123 | Amy | 3.9 | Cornell | EE | Y |
123 | Amy | 3.9 | Stanford | CS | Y |
123 | Amy | 3.9 | Stanford | EE | N |
654 | Amy | 3.9 | None | None | None |
234 | Bob | 3.6 | Berkeley | biology | N |
345 | Craig | 3.5 | Cornell | CS | Y |
345 | Craig | 3.5 | Cornell | EE | N |
345 | Craig | 3.5 | Cornell | bioengineering | N |
345 | Craig | 3.5 | MIT | bioengineering | Y |
543 | Craig | 3.4 | MIT | CS | N |
456 | Doris | 3.9 | None | None | None |
567 | Edward | 2.9 | None | None | None |
678 | Fay | 3.8 | Stanford | history | Y |
789 | Gary | 3.4 | None | None | None |
987 | Helen | 3.7 | Berkeley | CS | Y |
987 | Helen | 3.7 | Stanford | CS | Y |
876 | Irene | 3.9 | MIT | biology | Y |
876 | Irene | 3.9 | MIT | marine biology | N |
876 | Irene | 3.9 | Stanford | CS | N |
765 | Jay | 2.9 | Cornell | history | N |
765 | Jay | 2.9 | Cornell | psychology | Y |
765 | Jay | 2.9 | Stanford | history | Y |
s_id |
c_name |
major |
decision |
s_name |
gpa |
---|---|---|---|---|---|
123 | Amy | 3.9 | Berkeley | CS | Y |
123 | Amy | 3.9 | Cornell | EE | Y |
123 | Amy | 3.9 | Stanford | CS | Y |
123 | Amy | 3.9 | Stanford | EE | N |
654 | Amy | 3.9 | NULL |
NULL |
NULL |
234 | Bob | 3.6 | Berkeley | biology | N |
345 | Craig | 3.5 | Cornell | CS | Y |
345 | Craig | 3.5 | Cornell | EE | N |
345 | Craig | 3.5 | Cornell | bioengineering | N |
345 | Craig | 3.5 | MIT | bioengineering | Y |
543 | Craig | 3.4 | MIT | CS | N |
456 | Doris | 3.9 | NULL |
NULL |
NULL |
567 | Edward | 2.9 | NULL |
NULL |
NULL |
678 | Fay | 3.8 | Stanford | history | Y |
789 | Gary | 3.4 | NULL |
NULL |
NULL |
987 | Helen | 3.7 | Berkeley | CS | Y |
987 | Helen | 3.7 | Stanford | CS | Y |
876 | Irene | 3.9 | MIT | biology | Y |
876 | Irene | 3.9 | MIT | marine biology | N |
876 | Irene | 3.9 | Stanford | CS | N |
765 | Jay | 2.9 | Cornell | history | N |
765 | Jay | 2.9 | Cornell | psychology | Y |
765 | Jay | 2.9 | Stanford | history | Y |
Here we see 'the other Amy' (i.e., the one with s_id
654), Doris, Edward, and Gary, who apparently haven't applied for anything – the columns which would have been taken from the applications
table in case of a match are now set to NULL
.
The fact that an outer join pads rows which can't be matched in the 'other' table with NULL
values gives us a nice way to find out which values might be missing in a table – se the following problem.
Problem: We want to find all students who hasn't applied for any major – do it in two ways:
- Using a subquery.
%%sql
SELECT s_id, s_name
FROM students
LEFT JOIN applications USING (s_id)
WHERE major IS NULL
s_id | s_name |
---|---|
456 | Doris |
567 | Edward |
654 | Amy |
789 | Gary |
- Using an outer join (checking where we've padded with
NULL
columns).
%%sql
This way of finding missing values may look like a hack, but it's in fact idiomatic.
Other outer joins¶
There are three kinds of outer joins:
LEFT OUTER JOIN
, which can be abbreviatedLEFT JOIN
– this is when we keep all rows in the first (leftmost) table, padding withNULL
values when they don't match any row in the other table.RIGHT OUTER JOIN
, which can be abbreviatedRIGHT JOIN
– this is when we keep all rows in the second (rightmost) table, padding withNULL
values when they don't match any row in the other table.FULL OUTER JOIN
, which can be abbreviatedFULL JOIN
– this is when we keep all rows in both tables, padding withNULL
values when they don't match any row in the other table.
Exercises¶
Exercise: How many students with a gpa above 3.5 have applied for a CS major at a college in California (CA)?
%%sql
Hint: It could be a good idea to first define a CTE with the relevant students.
Your query should return 3.
Exercise: List the id, name, and gpa for all students who have applied for more than one major in California (CA).
%%sql
Your query should return:
s_id s_name gpa
---- ------ ---
123 Amy 3.9
987 Helen 3.7
Exercise: Which are the three programs (major@college) with the lowest gpa for any accepted student.
%%sql
The output should be:
c_name major min(gpa)
-------- ---------- --------
Cornell psychology 2.9
Stanford history 2.9
Cornell CS 3.5
Exercise: And which are the three programs with the highest gpa for a student not accepted?
%%sql
The output should be:
c_name major max(gpa)
-------- -------------- --------
MIT marine biology 3.9
Stanford CS 3.9
Stanford EE 3.9
Comment: Apparently the gpa isn't the only criterion for acceptance, since Irene (s_id
876) wasn't accepted at CS@Stanford, but Helen (s_id
987) was, although Irene has a higher gpa…
Exercise: In the section about correlated subqueries, we solved the problem of finding the laureates with the longest name for each category – solve this problem using a CTE instead of a correlated subquery (you can save the longest length of each in a CTE).
%%sql
Set operations¶
Relational databases and SQL are based on relational algebra, and this means that we can use simple set operations on our data.
If we look at the compound-operator
in the SELECT
statement, we see that we can combine the results of queries using UNION
, INTERSECT
, and EXCEPT
.
This only works if the queries return exactly the same columns.
Problem: List the names and gpa for students who have applied for a major at a college in California, but haven't applied in any other state – use a set operation (it could also be solved in a number of other ways).
%%sql
WITH only_california(s_id) AS (
SELECT s_id
FROM applications
JOIN colleges USING (c_name)
WHERE state = 'CA'
EXCEPT
SELECT s_id
FROM applications
JOIN colleges USING (c_name)
WHERE state <> 'CA'
)
SELECT s_name, gpa
FROM only_california
JOIN students USING (s_id)
s_name | gpa |
---|---|
Bob | 3.6 |
Fay | 3.8 |
Helen | 3.7 |
INSERT, UPDATE and DELETE¶
To maintain our databases, we need ways to change their contents, and for that we have tree important functions:
INSERT
: inserts a row into a tableUPDATE
: updates rows which matches some predicateDELETE
: deletes rows which matches some predicate
We'll talk more about these functions next week, when we talk about modeling (then we'll also discuss the concept of keys, which is quite important when we modify our tables).