EDAF75 - notebook for lecture 2¶

Author: Christian Söderberg

Week 1: Introduction, relational databases, and SQL, continued…¶

In order to run SQL in this notebook, we need to load the SQL extension:

In [1]:
%load_ext sql

If we use jupysql (instead of ipython-sql) we get a default limit of 10 rows for any query – to turn this limit off we can write:

In [2]:
%config SqlMagic.displaylimit = None
displaylimit: Value None will be treated as 0 (no limit)

I've put all the tables used for this lecture in a file called lect02.sqlite, and we can load them using:

In [3]:
%sql sqlite:///lect02.sqlite
Connecting to 'sqlite:///lect02.sqlite'

Now we're good to go, we just have to prefix our SQL queries with %sql (one line of SQL) or %%sql (several lines of SQL, this is the form we will use in most cases).

When one table is not enough¶

In the preparation notes for this week's lectures, you were asked to think through the problem below.


Problem: We've been asked to write code to handle students and their college applications. Each student has a name, a student id (stil), and an average grade. The students apply for various majors at colleges, where each college has a name, is situated in a given state, and has a given enrollment. Each application is for a specific major at a specific college, and it can be accepted, rejected, or pending.

  1. What classes and/or data structures would you use to keep track of this data in Python or Java?
  2. How do you list all applications for a given student?
  3. How do you list all applications for a given major at a given college?
  4. Approximately how many lines of Python/Java code do you need to list the three programs (major@college) with the highest grade average for the 'last' accepted student (assuming the students are accepted in order of decreasing grade averages)?
  5. How would you persist your data, so it doesn't get lost if we have to restart the program?
  6. Can you come up with a way to let several programs access the data simultaneously?
  7. If you solve the previous problem, and want to allow several programs update the data simultaneously, how do you make sure the data isn't corrupted?

If we wanted to keep all the information above in one table, we'd need the following columns (below I use some column names which are in a style I don't normally use, but I think it could be good for you to be exposed to different styles):

  • student id (s_id)
  • student name (s_name)
  • average grade (gpa)
  • college (c_name)
  • state (state)
  • enrollment (enrollment)
  • major (major)
  • decision (decision)

Here is a (very small) example:

s_id s_name gpa c_name state enrollment major decision
123Amy3.9 StanfordCA15000 CSY
123Amy3.9 StanfordCA15000 EEN
123Amy3.9 BerkeleyCA36000 CSY
123Amy3.9 CornellNY21000 EEY
234Bob3.6 BerkeleyCA36000 biologyN
345Craig3.5 MITMA10000 bioengineeringY
345Craig3.5 CornellNY21000 bioengineeringN
345Craig3.5 CornellNY21000 CSY
345Craig3.5 CornellNY21000 EEN
678Fay3.8 StanfordCA15000 historyY
987Helen3.7 StanfordCA15000 CSY
987Helen3.7 BerkeleyCA36000 CSY
876Irene3.9 StanfordCA15000 CSN
876Irene3.9 MITMA10000 biologyY
876Irene3.9 MITMA10000 marine biologyN
765Jay2.9 StanfordCA15000 historyY
765Jay2.9 CornellNY21000 historyN
765Jay2.9 CornellNY21000 psychologyY
543Craig3.4 MITMA10000 CSN

This table is wonderful for finding information, but it comes with some big problems:

  • Some of the information is repeated in many places, which is wasteful and makes updates error prone. For example: if we want to change Amy's gpa, we'd have to change it once per application she's made, and if we're not diligent enough, we'll end up with inconsistent data.

  • If a student withdraws her last application, we'd loose all information about that student (this has actually happened here already, as you'll see below).

So, much of the information in the table is redundant, and it may seem that we could remove many rows without losing information, but in fact we can't – any row we remove will mean that we lose that specific combination of values.

In week 4 we'll discuss the theory behind these problems (functional dependencies), and see a technique we can use to find well behaved tables (normalization), but I hope that you intuitively can see that everything would be easier if we had the following three tables instead of the big table above:

  • one table, students, with information about the students (id, name and gpa)
s_id s_name gpa
123 Amy 3.9
234 Bob 3.6
345 Craig 3.5
456 Doris 3.9
543 Craig 3.4
567 Edward 2.9
654 Amy 3.9
678 Fay 3.8
765 Jay 2.9
789 Gary 3.4
876 Irene 3.9
987 Helen 3.7

Here we have some students which weren't seen in the original table above, since they did not have any applications (there is no data to put in the columns relating to applications for them – we'll soon see that there is a primitive way around this, but it's clumpsy).

In [1]:
%%sql
  • one table, colleges, with college information (name and state)
c_name state enrollment
Stanford CA 15000
Berkeley CA 36000
MIT MA 10000
Cornell NY 21000
In [1]:
%%sql
  • one table, applications, with applications (student id, college name, major, and decision)
s_id c_name major decision
123 Stanford CS Y
123 Stanford EE N
123 Berkeley CS Y
123 Cornell EE Y
234 Berkeley biology N
345 MIT bioengineering Y
345 Cornell bioengineering N
345 Cornell CS Y
345 Cornell EE N
678 Stanford history Y
987 Stanford CS Y
987 Berkeley CS Y
876 Stanford CS N
876 MIT biology Y
876 MIT marine biology N
765 Stanford history Y
765 Cornell history N
765 Cornell psychology Y
543 MIT CS N
In [1]:
%%sql

Nothing in these tables is redundant, so if we use these tables, the issues we saw above goes away:

  • if we want to update Amy's gpa, we just update her single row in the students table,

  • we now can see some students who haven't applied for anything yet (Amy with s_id 654, Doris, Edward and Gary).

We'll talk more about this in week 4.

But the benefits come with a cost, we now need a way to combine information from several tables at once.

Let's say we want to show the names and grade averages of all students who have applied for CS at Stanford – none of our three tables above contain all information we need, but if we start by looking in the applications table, we can get …

In [5]:
%%sql
SELECT    s_name, gpa
FROM      applications
WHERE     c_name = 'Stanford' AND major = 'CS'
Running query in 'sqlite:///lect02.sqlite'
RuntimeError: (sqlite3.OperationalError) no such column: s_name
[SQL: SELECT    s_name, gpa
FROM      applications
WHERE     c_name = 'Stanford' AND major = 'CS']
(Background on this error at: https://sqlalche.me/e/20/e3q8)
In [11]:
%%sql
SELECT   s_name, gpa
FROM     applications
         INNER JOIN students ON applications.s_id = students.s_id
WHERE    c_name = 'Stanford' AND major = 'CS'
Running query in 'sqlite:///lect02.sqlite'
Out[11]:
s_name gpa
Amy 3.9
Irene 3.9
Helen 3.7

The best we can get here is the s_id of all relevant students, but given the s_id we can look in the students table for the name and gpa.

In regular programming languages, such as Python or Java, we could do that with functions such as find_name(s_id) and find_gpa(s_id):

In [1]:
%%sql
SELECT    find_name(s_id), find_gpa(s_id)     -- not in SQL!
FROM      applications
WHERE     c_name = 'Stanford' AND major = 'CS'

But SQL is no regular language, it's based on relational algebra, and there we have another, really cool way of solving this problem: joins).

Inner joins¶

We solve the problem above by combining (joining) the applications table with the students table. There are several kinds of joins in SQL (and relational algebra), but the most straightforward kind is what's called an inner join#Inner_join).

The inner join of two tables is a new table, where rows from each table are joined together if the attributes of the rows matches a join predicate – in our case we want to add student data to the application data where the student id in the applications table matches the student id in the students table.

This can be expressed as (see the docs):

In [10]:
%%sql
SELECT *
FROM   applications
       INNER JOIN students ON applications.s_id = students.s_id
Running query in 'sqlite:///lect02.sqlite'
Out[10]:
s_id c_name major decision s_id_1 s_name gpa
123 Stanford CS Y 123 Amy 3.9
123 Stanford EE N 123 Amy 3.9
123 Berkeley CS Y 123 Amy 3.9
123 Cornell EE Y 123 Amy 3.9
234 Berkeley biology N 234 Bob 3.6
345 MIT bioengineering Y 345 Craig 3.5
345 Cornell bioengineering N 345 Craig 3.5
345 Cornell CS Y 345 Craig 3.5
345 Cornell EE N 345 Craig 3.5
678 Stanford history Y 678 Fay 3.8
987 Stanford CS Y 987 Helen 3.7
987 Berkeley CS Y 987 Helen 3.7
876 Stanford CS N 876 Irene 3.9
876 MIT biology Y 876 Irene 3.9
876 MIT marine biology N 876 Irene 3.9
765 Stanford history Y 765 Jay 2.9
765 Cornell history N 765 Jay 2.9
765 Cornell psychology Y 765 Jay 2.9
543 MIT CS N 543 Craig 3.4

and we would get the joined table (note that the first columns are from the applications table, whereas the last columns are from the students table – also note that s_id and s_id' always are the same in each row, since that was what we joined on):

s_id c_name major decision s_id' s_name gpa
123StanfordCS Y123Amy3.9
123StanfordEE N123Amy3.9
123BerkeleyCS Y123Amy3.9
123CornellEE Y123Amy3.9
234Berkeleybiology N234Bob3.6
345MITbioengineering Y345Craig3.5
345Cornellbioengineering N345Craig3.5
345CornellCS Y345Craig3.5
345CornellEE N345Craig3.5
678Stanfordhistory Y678Fay3.8
987StanfordCS Y987Helen3.7
987BerkeleyCS Y987Helen3.7
876StanfordCS N876Irene3.9
876MITbiology Y876Irene3.9
876MITmarine biology N876Irene3.9
765Stanfordhistory Y765Jay2.9
765Cornellhistory N765Jay2.9
765Cornellpsychology Y765Jay2.9
543MITCS N543Craig3.4

This contains all information we need for our query.

We'll soon see other kinds of joins, but inner joins are the 'default', and so we can skip the reserved word INNER, and just write:

In [1]:
%%sql
SELECT *
FROM   applications
       JOIN students ON applications.s_id = students.s_id

Since the join predicate is just a test of equality of one attribute from each table (it's called an equijoin), and the attribute has the same name in both tables, we can use the simpler form:

In [1]:
%%sql
SELECT *
FROM   applications
       JOIN students USING (s_id)

which gives (observe that we now only have one s_id column):

s_id c_name major decision s_name gpa
123StanfordCS YAmy3.9
123StanfordEE NAmy3.9
123BerkeleyCS YAmy3.9
123CornellEE YAmy3.9
234Berkeleybiology NBob3.6
345MITbioengineering YCraig3.5
345Cornellbioengineering NCraig3.5
345CornellCS YCraig3.5
345CornellEE NCraig3.5
678Stanfordhistory YFay3.8
987StanfordCS YHelen3.7
987BerkeleyCS YHelen3.7
876StanfordCS NIrene3.9
876MITbiology YIrene3.9
876MITmarine biology NIrene3.9
765Stanfordhistory YJay2.9
765Cornellhistory NJay2.9
765Cornellpsychology YJay2.9
543MITCS NCraig3.4

In the first join above we got two s_id columns (SQL can't know for certain that s_id describes the same thing in both tables), but with the JOIN USING construct, we tell SQL that they describe the same thing, so the join returns only one column for s_id.

In general, we can compare sets of several attributes in a JOIN USING query using a tuple (as we'll soon see).

The number of rows in a joined table¶

At this point, we should consider how many rows we get when we join two tables.

In the table above, each valid application is matched with one student, so the number of rows will be the total number of applications for all students in the students table (students with no applications will not contribute in the resulting table, and applications which does not have a corresponding student will not show up either).

Assume we want to keep track of the students' phone numbers – the first 'rule' of relational databases is that all values must be primitive, so if a student has more than one phone number, she can't have them saved in a list in a table. Instead we introduce a new table, phone_numbers, like this:

In [12]:
%%sql
SELECT   *
FROM     phone_numbers
Running query in 'sqlite:///lect02.sqlite'
Out[12]:
s_id phone_no
123 650-550-8671
345 347-467-6210
345 516-881-8303
456 650-255-3090
456 650-376-9029
456 628-265-5898
543 617-155-4170
567 650-412-4837
654 619-339-6883
678 650-228-4538
765 508-765-8878
876 774-877-3733
876 617-522-3278
987 650-964-4042
s_id phone_no
123 650-550-8671
345 347-467-6210
345 516-881-8303
456 650-255-3090
456 650-376-9029
456 628-265-5898
543 617-155-4170
567 650-412-4837
654 619-339-6883
678 650-228-4538
765 508-765-8878
876 774-877-3733
876 617-522-3278
987 650-964-4042

Here we use the students' s_id numbers, since their names might not be unique (we have two Amys and two Craigs), and then provide a phone number – we'll talk more about uniqueness in the next few weeks.

Each student can have either no phone number, one phone number, or several phone numbers, in the table above, Amy (s_id 101) has 1, Bob (s_id 234) has none, Craig (s_id 345) has two, and Doris (s_id 456) has three.

If we join our students table with this table, using an equijoin on s_id, we get:

In [13]:
%%sql
SELECT     *
FROM       students
           JOIN phone_numbers USING (s_id)
Running query in 'sqlite:///lect02.sqlite'
Out[13]:
s_id s_name gpa phone_no
123 Amy 3.9 650-550-8671
345 Craig 3.5 347-467-6210
345 Craig 3.5 516-881-8303
456 Doris 3.9 650-255-3090
456 Doris 3.9 650-376-9029
456 Doris 3.9 628-265-5898
543 Craig 3.4 617-155-4170
567 Edward 2.9 650-412-4837
654 Amy 3.9 619-339-6883
678 Fay 3.8 650-228-4538
765 Jay 2.9 508-765-8878
876 Irene 3.9 774-877-3733
876 Irene 3.9 617-522-3278
987 Helen 3.7 650-964-4042
s_id s_name phone_no
123 Amy 650-550-8671
345 Craig 347-467-6210
345 Craig 516-881-8303
456 Doris 628-265-5898
456 Doris 650-255-3090
456 Doris 650-376-9029
… … …

So, since each student has its own row in the students table, they get one row per phone number in the joined table – this means that Bob, who didn't have any phone number, will not survive the join, whereas Craig and Doris get several rows (one row per phone number). In the section on outer joins below, we'll see how we can keep Bob alive, even if he doesn't have any phone number.

If each student already had several rows, such as if we had joined students and applications, and then joined this combined table with phone_numbers, we would get:

In [14]:
%%sql
SELECT    *
FROM      students
          JOIN applications USING (s_id)
          JOIN phone_numbers USING (s_id)
WHERE     s_id < 500
Running query in 'sqlite:///lect02.sqlite'
Out[14]:
s_id s_name gpa c_name major decision phone_no
123 Amy 3.9 Berkeley CS Y 650-550-8671
123 Amy 3.9 Cornell EE Y 650-550-8671
123 Amy 3.9 Stanford CS Y 650-550-8671
123 Amy 3.9 Stanford EE N 650-550-8671
345 Craig 3.5 Cornell CS Y 347-467-6210
345 Craig 3.5 Cornell CS Y 516-881-8303
345 Craig 3.5 Cornell EE N 347-467-6210
345 Craig 3.5 Cornell EE N 516-881-8303
345 Craig 3.5 Cornell bioengineering N 347-467-6210
345 Craig 3.5 Cornell bioengineering N 516-881-8303
345 Craig 3.5 MIT bioengineering Y 347-467-6210
345 Craig 3.5 MIT bioengineering Y 516-881-8303
s_id s_name gpa c_name major decision phone_no
123 Amy 3.9 Berkeley CS Y 650-550-8671
123 Amy 3.9 Cornell EE Y 650-550-8671
123 Amy 3.9 Stanford CS Y 650-550-8671
123 Amy 3.9 Stanford EE N 650-550-8671
345 Craig 3.5 Cornell CS Y 347-467-6210
345 Craig 3.5 Cornell CS Y 516-881-8303
345 Craig 3.5 Cornell EE N 347-467-6210
345 Craig 3.5 Cornell EE N 516-881-8303
345 Craig 3.5 Cornell bioengineering N 347-467-6210
345 Craig 3.5 Cornell bioengineering N 516-881-8303
345 Craig 3.5 MIT bioengineering Y 347-467-6210
345 Craig 3.5 MIT bioengineering Y 516-881-8303

So, the number of rows per student will be the product of the number of rows they had in each of the joined tables.

Also, if we had a phone number with an unknown s_id, the number would not end up in our joined table.

The results of an inner join of several tables does not depend on in which order we join our tables (but joining them in the right order can potentially save an enormous amount of work for a big join, just as multiplying matrices in the right order depending on their shape can save lots of work – fortunately the 'query planner' is very clever when it decides how to join tables, we'll return to that later).

In [20]:
%%sql
SELECT    *
FROM      students
          JOIN applications USING (s_id)
          JOIN colleges USING (c_name)
Running query in 'sqlite:///lect02.sqlite'
Out[20]:
s_id s_name gpa c_name major decision state enrollment
123 Amy 3.9 Stanford CS Y CA 15000
123 Amy 3.9 Stanford EE N CA 15000
123 Amy 3.9 Berkeley CS Y CA 36000
123 Amy 3.9 Cornell EE Y NY 21000
234 Bob 3.6 Berkeley biology N CA 36000
345 Craig 3.5 MIT bioengineering Y MA 10000
345 Craig 3.5 Cornell bioengineering N NY 21000
345 Craig 3.5 Cornell CS Y NY 21000
345 Craig 3.5 Cornell EE N NY 21000
678 Fay 3.8 Stanford history Y CA 15000
987 Helen 3.7 Stanford CS Y CA 15000
987 Helen 3.7 Berkeley CS Y CA 36000
876 Irene 3.9 Stanford CS N CA 15000
876 Irene 3.9 MIT biology Y MA 10000
876 Irene 3.9 MIT marine biology N MA 10000
765 Jay 2.9 Stanford history Y CA 15000
765 Jay 2.9 Cornell history N NY 21000
765 Jay 2.9 Cornell psychology Y NY 21000
543 Craig 3.4 MIT CS N MA 10000
After we've joined…¶

Once we've joined our tables, we can proceed as we did before, with a WHERE clause:

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications
          JOIN students USING (s_id)
WHERE     c_name = 'Stanford' AND major = 'CS'

If we look at the syntax diagram for the SELECT statement, we see that we join our tables in the FROM clause (see join-clause), and that we can use everything we've learned so far to work on the joined tables (so we can group and use window functions and subqueries and CTE:s just as we've done before).

The combined, rather clumsy table will never be created, database engines are extremely clever when they decide on what data they need to keep around to execute their queries – so we normally do not need to worry much about efficiency (we'll talk a little more about this later in the course).


It could be useful to stop here for a while, and consider how you'd solve the problem above using Python or Java (or any other regular programming language). In case we let the students keep track of their applications, we'd need to loop through all students, and then all their applications looking for applications to CS@Stanford.

Deciding on who should keep track of what is a typical example of premature optimization, and using a relational database relieves us of this problem.


Problem: Show the id and name for all students who have applied for CS in California (state code CA).

In [23]:
%%sql
SELECT    s_name, gpa, c_name
FROM      applications
          JOIN students USING (s_id)
          JOIN colleges USING (c_name)
WHERE     state = 'CA' AND major = 'CS'
Running query in 'sqlite:///lect02.sqlite'
Out[23]:
s_name gpa c_name
Amy 3.9 Berkeley
Amy 3.9 Stanford
Irene 3.9 Stanford
Helen 3.7 Berkeley
Helen 3.7 Stanford

Cross joins¶

As we saw above, an inner join combines rows in two tables when the rows matches on some join predicate. The predicate is typically dependent on column values in both rows (e.g., that the student ids in each table should match), but we can also use a seemingly nonsensical predicate such as true (which is always true):

In [25]:
%%sql
SELECT   *
FROM     applications
         JOIN students ON true
LIMIT    10
Running query in 'sqlite:///lect02.sqlite'
Out[25]:
s_id c_name major decision s_id_1 s_name gpa
123 Stanford CS Y 123 Amy 3.9
123 Stanford CS Y 234 Bob 3.6
123 Stanford CS Y 345 Craig 3.5
123 Stanford CS Y 456 Doris 3.9
123 Stanford CS Y 543 Craig 3.4
123 Stanford CS Y 567 Edward 2.9
123 Stanford CS Y 654 Amy 3.9
123 Stanford CS Y 678 Fay 3.8
123 Stanford CS Y 765 Jay 2.9
123 Stanford CS Y 789 Gary 3.4

This would mean that all rows in the applications table would match all rows in the students table, and we would get what's called the cartesian product of the two tables (or sets of rows), i.e., all combinations of rows in the tables.

This is exactly what a cross join is – we can write it as:

In [1]:
%%sql
SELECT   *
FROM     applications
         CROSS JOIN students

and we typically use a WHERE clause to narrow the generated table into a table where only matching rows remains afterwards.

So, the query

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications
          JOIN students USING (s_id)
WHERE     c_name = 'Stanford' AND major = 'CS'

could have been written using a cross join instead:

In [26]:
%%sql
SELECT    s_name, gpa
FROM      applications
          CROSS JOIN students
WHERE     applications.s_id = students.s_id
          AND c_name = 'Stanford'
          AND major = 'CS'
Running query in 'sqlite:///lect02.sqlite'
Out[26]:
s_name gpa
Amy 3.9
Irene 3.9
Helen 3.7

This may look terrible at first sight, creating a huge table and then throwing away most of the combined rows, but, as said above, database engines are very, very clever, and they might find a way to make even this query fast. But I'd still recommend that you use inner queries, since it might help the database to find a faster query plan, and it's conceptually easier to understand for a person looking at your code.

By the way, there is a shortcut for cross joins, …

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications, students
WHERE     applications.s_id = students.s_id
          AND c_name = 'Stanford'
          AND major = 'CS'

but, once again, opt for inner joins instead!

Outer joins¶

We saw above that a row in one table which matches several rows in the other table would give rise to several rows in the combined table. But it also goes the other way – when we do an inner join on two tables, rows for which the join predicate doesn't match any row in the other table will 'disappear' in the combined table.

So if we wanted to list all students and their applications, and wrote:

In [27]:
%%sql
SELECT    *
FROM      students
          JOIN applications USING (s_id)
ORDER BY  s_name
Running query in 'sqlite:///lect02.sqlite'
Out[27]:
s_id s_name gpa c_name major decision
123 Amy 3.9 Stanford CS Y
123 Amy 3.9 Stanford EE N
123 Amy 3.9 Berkeley CS Y
123 Amy 3.9 Cornell EE Y
234 Bob 3.6 Berkeley biology N
345 Craig 3.5 MIT bioengineering Y
345 Craig 3.5 Cornell bioengineering N
345 Craig 3.5 Cornell CS Y
345 Craig 3.5 Cornell EE N
543 Craig 3.4 MIT CS N
678 Fay 3.8 Stanford history Y
987 Helen 3.7 Stanford CS Y
987 Helen 3.7 Berkeley CS Y
876 Irene 3.9 Stanford CS N
876 Irene 3.9 MIT biology Y
876 Irene 3.9 MIT marine biology N
765 Jay 2.9 Stanford history Y
765 Jay 2.9 Cornell history N
765 Jay 2.9 Cornell psychology Y

we would get all applications, but those students who hadn't applied for any major would be left out.

If we want to make sure that all rows in the first ('left') table survives the join, we can use a left outer join. An outer join is a join in which we guarantee that all rows in one or both our tables will be present in the result. But that leads to an important question: how do we combine an 'uncombinable' row? The answer is that we combine it with a row full of NULL values in the other table – if we try a left outer join, and put the students table first (on the left), we get:

In [28]:
%%sql
SELECT    *
FROM      students
          LEFT OUTER JOIN applications USING (s_id)
ORDER BY  s_name
Running query in 'sqlite:///lect02.sqlite'
Out[28]:
s_id s_name gpa c_name major decision
123 Amy 3.9 Berkeley CS Y
123 Amy 3.9 Cornell EE Y
123 Amy 3.9 Stanford CS Y
123 Amy 3.9 Stanford EE N
654 Amy 3.9 None None None
234 Bob 3.6 Berkeley biology N
345 Craig 3.5 Cornell CS Y
345 Craig 3.5 Cornell EE N
345 Craig 3.5 Cornell bioengineering N
345 Craig 3.5 MIT bioengineering Y
543 Craig 3.4 MIT CS N
456 Doris 3.9 None None None
567 Edward 2.9 None None None
678 Fay 3.8 Stanford history Y
789 Gary 3.4 None None None
987 Helen 3.7 Berkeley CS Y
987 Helen 3.7 Stanford CS Y
876 Irene 3.9 MIT biology Y
876 Irene 3.9 MIT marine biology N
876 Irene 3.9 Stanford CS N
765 Jay 2.9 Cornell history N
765 Jay 2.9 Cornell psychology Y
765 Jay 2.9 Stanford history Y
s_id c_name major decision s_name gpa
123Amy3.9 BerkeleyCSY
123Amy3.9 CornellEEY
123Amy3.9 StanfordCSY
123Amy3.9 StanfordEEN
654Amy3.9 NULL NULL NULL
234Bob3.6 BerkeleybiologyN
345Craig3.5 CornellCSY
345Craig3.5 CornellEEN
345Craig3.5 CornellbioengineeringN
345Craig3.5 MITbioengineeringY
543Craig3.4 MITCSN
456Doris3.9 NULL NULL NULL
567Edward2.9 NULL NULL NULL
678Fay3.8 StanfordhistoryY
789Gary3.4 NULL NULL NULL
987Helen3.7 BerkeleyCSY
987Helen3.7 StanfordCSY
876Irene3.9 MITbiologyY
876Irene3.9 MITmarine biologyN
876Irene3.9 StanfordCSN
765Jay2.9 CornellhistoryN
765Jay2.9 CornellpsychologyY
765Jay2.9 StanfordhistoryY

Here we see 'the other Amy' (i.e., the one with s_id 654), Doris, Edward, and Gary, who apparently haven't applied for anything – the columns which would have been taken from the applications table in case of a match are now set to NULL.

Using outer joins to find 'missing' values¶

The fact that an outer join pads rows which can't be matched in the 'other' table with NULL values gives us a nice way to find out which values might be missing in a table.

Problem: We want to find all students who haven't applied for any major – do it in two ways:

  • Using a subquery.
In [30]:
%%sql
SELECT    s_id, s_name
FROM      students
WHERE     NOT s_id IN (
    SELECT    s_id
    FROM      applications)
Running query in 'sqlite:///lect02.sqlite'
Out[30]:
s_id s_name
456 Doris
567 Edward
654 Amy
789 Gary
  • Using an outer join (checking where we've padded with NULL columns).
In [32]:
%%sql
SELECT    s_id, s_name
FROM      students
          LEFT JOIN applications  USING (s_id)
WHERE     major IS NULL
Running query in 'sqlite:///lect02.sqlite'
Out[32]:
s_id s_name
456 Doris
567 Edward
654 Amy
789 Gary
In [31]:
%%sql
SELECT    NULL = NULL
Running query in 'sqlite:///lect02.sqlite'
Out[31]:
NULL = NULL
None

This way of finding missing values may look like a hack, but it's in fact idiomatic.

Problem: In 1940-1942 no Nobel prizes were awarded (because of the Second World War), but the prize has been withheld in one or more categories in other years as well. Write a query which shows all such withheld awards, use suitable CTEs and use a subquery – observe that we only want 'missing' awards in years in which there was an award in at least one other category (such as literature in 1914 or medicine in 1915).

In [33]:
%%sql
WITH
  award_years(year) AS (
    SELECT DISTINCT  year
    FROM     nobel_prizes
  ),
  categories(category) AS (
    SELECT DISTINCT category
    FROM   nobel_prizes
  ),
  potential_awards(year, category) AS (
    SELECT  year, category
    FROM    award_years CROSS JOIN categories
  ),
  actual_awards(year, category) AS (
     SELECT DISTINCT year, category
     FROM nobel_prizes
  )
SELECT  year, category
FROM    potential_awards
WHERE   (year, category) NOT IN actual_awards
Running query in 'sqlite:///lect02.sqlite'
Out[33]:
year category
1914 literature
1915 medicine
1916 chemistry
1916 medicine
1916 physics
1917 chemistry
1917 medicine
1918 literature
1918 medicine
1919 chemistry
1921 medicine
1924 chemistry
1925 medicine
1931 physics
1933 chemistry
1934 physics
1935 literature
1943 literature

Problem: The problem above can be solved with an outer join instead of a subquery, but we need to make sure we have attributes to check after the outer join. Solve it this way.

In [39]:
%%sql
WITH
  award_years(year) AS (
    SELECT DISTINCT  year
    FROM     nobel_prizes
  ),
  categories(category) AS (
    SELECT DISTINCT category
    FROM   nobel_prizes
  ),
  potential_awards(year, category) AS (
    SELECT  year, category
    FROM    award_years CROSS JOIN categories
  ),
  actual_awards(year, category, flag) AS (
     SELECT DISTINCT year, category, 1
     FROM nobel_prizes
  )
SELECT  *
FROM    potential_awards
        LEFT JOIN actual_awards USING (year, category)
WHERE   flag IS NULL
Running query in 'sqlite:///lect02.sqlite'
Out[39]:
year category flag
1914 literature None
1915 medicine None
1916 chemistry None
1916 medicine None
1916 physics None
1917 chemistry None
1917 medicine None
1918 literature None
1918 medicine None
1919 chemistry None
1921 medicine None
1924 chemistry None
1925 medicine None
1931 physics None
1933 chemistry None
1934 physics None
1935 literature None
1943 literature None
In [38]:
%%sql
WITH
  award_years(year) AS (
    SELECT DISTINCT  year
    FROM     nobel_prizes
  ),
  categories(category) AS (
    SELECT DISTINCT category
    FROM   nobel_prizes
  ),
  potential_awards(year, category) AS (
    SELECT  year, category
    FROM    award_years CROSS JOIN categories
  ),
  actual_awards(year, category, count) AS (
     SELECT  year, category, count()
     FROM nobel_prizes
     GROUP BY year, category
    )
SELECT  *
FROM    potential_awards
        LEFT JOIN actual_awards USING (year, category)
WHERE   count IS NULL
Running query in 'sqlite:///lect02.sqlite'
Out[38]:
year category count
1914 literature None
1915 medicine None
1916 chemistry None
1916 medicine None
1916 physics None
1917 chemistry None
1917 medicine None
1918 literature None
1918 medicine None
1919 chemistry None
1921 medicine None
1924 chemistry None
1925 medicine None
1931 physics None
1933 chemistry None
1934 physics None
1935 literature None
1943 literature None

Exercise: Add the number of other categories in which prizes were awarded for each of the missing prizes above.

In [1]:
%%sql

The first lines of the output should be something like:

1914: literature (3)
1915: medicine   (3)
1916: chemistry  (1)
1916: medicine   (1)
1916: physics    (1)
...

There are some more exercises on joins below.

Other outer joins¶

There are three kinds of outer joins:

  • LEFT OUTER JOIN, which can be abbreviated LEFT JOIN – this is when we keep all rows in the first (leftmost) table, padding with NULL values when they don't match any row in the other (right) table.

  • RIGHT OUTER JOIN, which can be abbreviated RIGHT JOIN – this is when we keep all rows in the second (rightmost) table, padding with NULL values when they don't match any row in the other (left) table.

  • FULL OUTER JOIN, which can be abbreviated FULL JOIN – this is when we keep all rows in both tables, padding with NULL values when they don't match any row in the other table.

Exercises on JOIN¶

Exercise: Which students with a gpa above 3.5 have applied for a CS major at a college in California (CA) – show id, name, gpa and the names of the colleges they've applied to the CS program on?

In [1]:
%%sql

Hint: It could be a good idea to first define a CTE with the relevant students.

Your query should return 3 rows:

id   name   gpa  colleges
---  -----  ---  -----------------
123  Amy    3.9  Berkeley,Stanford
876  Irene  3.9  Stanford
987  Helen  3.7  Berkeley,Stanford

Exercise: List the id, name, and gpa for all students who have applied for more than one major in California (CA).

In [1]:
%%sql

Your query should return:

s_id  s_name  gpa
----  ------  ---
123   Amy     3.9
987   Helen   3.7

Exercise: Which are the three programs (major@college) with the lowest gpa for any accepted student.

In [1]:
%%sql

The output should be:

name      major       lowest gpa
--------  ----------  ----------
Cornell   psychology  2.9
Stanford  history     2.9
Cornell   CS          3.5

Exercise: And which are the three programs with the highest gpa for a student not accepted?

In [1]:
%%sql

The output should be:

name      major           highest gpa
--------  --------------  -----------
MIT       marine biology  3.9
Stanford  CS              3.9
Stanford  EE              3.9

Comment: Apparently the gpa isn't the only criterion for acceptance, since Irene (s_id 876) wasn't accepted at CS@Stanford, but Helen (s_id 987) was, although Irene has a higher gpa…

Exercise: In the section about correlated subqueries, we solved the problem of finding the laureates with the longest name for each category – solve this problem using a CTE instead of a correlated subquery (you can save the longest name length of each category in a CTE).

In [1]:
%%sql

Set operations¶

Relational databases and SQL are based on relational algebra, and this means that we can use simple set operations on our data. If we look at the compound-operator in the SELECT statement, we see that we can combine the results of queries using UNION, INTERSECT, and EXCEPT.

This only works if the queries return exactly the same columns.

No description has been provided for this image

Problem: List the names and gpa for students who have applied for a major at a college in California, but haven't applied in any other state – use a set operation (it could also be solved in a number of other ways).

In [37]:
%%sql
WITH
  ca_but_not_elsewhere(s_id) AS (
    SELECT    s_id
    FROM      applications
              JOIN colleges USING (c_name)
    WHERE     state = 'CA'

    EXCEPT

    SELECT    s_id
    FROM      applications
              JOIN colleges USING (c_name)
    WHERE     state != 'CA'
  )
SELECT    s_name, gpa
FROM      ca_but_not_elsewhere
          JOIN students USING (s_id)
Running query in 'sqlite:///lect02.sqlite'
Out[37]:
s_name gpa
Bob 3.6
Fay 3.8
Helen 3.7

INSERT, UPDATE and DELETE¶

To maintain our databases, we need ways to change their contents, and for that we have tree important functions:

  • INSERT: inserts a row into a table

    No description has been provided for this image
  • UPDATE: updates rows which matches some predicate

    No description has been provided for this image
  • DELETE: deletes rows which matches some predicate

    No description has been provided for this image

We'll talk about these functions next week, when we talk about modeling (then we'll also discuss the concept of keys, which is very important when we modify our tables).