EDAF75 - notebook for lecture 2¶

Author: Christian Söderberg

Week 1: Introduction, relational databases, and SQL, continued…¶

In order to run SQL in this notebook, we need to load the SQL extension:

In [1]:
%load_ext sql
/home/cso/lects/db/lect-01/notebooks/.venv/lib/python3.12/site-packages/sql/parse.py:338: SyntaxWarning: invalid escape sequence '\:'
  """
/home/cso/lects/db/lect-01/notebooks/.venv/lib/python3.12/site-packages/sql/parse.py:368: SyntaxWarning: invalid escape sequence '\:'
  """

If we use jupysql (instead of ipython-sql) we get a default limit of 10 rows for any query – to turn this limit off we can write:

In [2]:
%config SqlMagic.displaylimit = None
displaylimit: Value None will be treated as 0 (no limit)

I've put all the tables used for this lecture in a file called lect01.sqlite, and we can load them using:

In [3]:
%sql sqlite:///lect02.sqlite
Connecting to 'sqlite:///lect02.sqlite'

Now we're good to go, we just have to prefix our SQL queries with %sql (one line of SQL) or %%sql (several lines of SQL, this is the form we will use in most cases).

Subqueries, Views and Common Table Expressions (CTEs)¶

As we noted above, the result of a SELECT statement is itself a (kind of) table, and we can use such a table inside other SELECT statements.

One useful pattern is:

SELECT ...
FROM   ...
WHERE  ... IN
       (SELECT ...
        FROM ...
        WHERE ...)

The second, nested query is called a subquery.

We'll use a subquery to find all literature laureates who split their prizes – to do it we begin with a regular query, to find which years the Nobel prize for literature were split?

In [4]:
%%sql
SELECT     year
FROM       nobel_prizes
WHERE      category = 'literature'
GROUP BY   year
HAVING     count() > 1
Running query in 'sqlite:///lect02.sqlite'
Out[4]:
year
1904
1917
1966
1974

… and now we use the result of that query to find out what we're really looking for:

In [8]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN (
    SELECT     year
    FROM       nobel_prizes
    WHERE      category = 'literature'
GROUP BY   year
HAVING     count() > 1
          )
Running query in 'sqlite:///lect02.sqlite'
Out[8]:
year name
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1917 Henrik Pontoppidan
1917 Karl Adolph Gjellerup
1966 Nelly Sachs
1966 Shmuel Yosef Agnon
1974 Eyvind Johnson
1974 Harry Martinson

This can be simplified by using either of two ways to define 'temporary tables':

  • views
  • Common Table Expressions (CTE:s).

A view can be seen as a new table, containing the result of a query – to create a view with the years where the literature prize was shared, we can write:

In [9]:
%%sql
CREATE VIEW shared_literature_prize(year) AS
  SELECT     year
FROM       nobel_prizes
WHERE      category = 'literature'
GROUP BY   year
HAVING     count() > 1
Running query in 'sqlite:///lect02.sqlite'
Out[9]:

We can now use our view in a query:

In [10]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN (
            SELECT    year
            FROM      shared_literature_prize
          )
Running query in 'sqlite:///lect02.sqlite'
Out[10]:
year name
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1917 Henrik Pontoppidan
1917 Karl Adolph Gjellerup
1966 Nelly Sachs
1966 Shmuel Yosef Agnon
1974 Eyvind Johnson
1974 Harry Martinson

We can simplify this expression, by writing:

In [11]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize
Running query in 'sqlite:///lect02.sqlite'
Out[11]:
year name
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1917 Henrik Pontoppidan
1917 Karl Adolph Gjellerup
1966 Nelly Sachs
1966 Shmuel Yosef Agnon
1974 Eyvind Johnson
1974 Harry Martinson

If we have more than one column in the output of the subquery, we must compare it to a tuple within parentheses (this is a quite new addition to SQL, so it might not work everywhere).

The view will be around until we decide to remove it with:

In [12]:
%%sql
DROP VIEW shared_literature_prize
Running query in 'sqlite:///lect02.sqlite'
Out[12]:

A view isn't actually a table – each time the view is used it runs its query, which will take some time, but it means we get all updates in the underlying table(s).

A Common Table Expression (or CTE) is like a view, but only defined as part of a query, so we'd write it like:

In [1]:
%%sql
WITH
  shared_literature_prize(year) AS (
    SELECT ... -- as above...
  )
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize

There are some things which makes CTEs very nice:

  • they're defined as part of a SELECT statement (so there is nothing to drop afterwards),
  • since they're part of a SELECT statement, we only need one statement (which will become useful when we call our database remotely, we'll return to that later in the course), and
  • they can be defined recursively (we'll return to that later in the course).

Problem: Show the years and categories for recurring laureates (i.e., laureates who has won more than once) – use a CTE to do it.

In [1]:
%%sql
-- exercise

Exercise: Who has won the literature prize in a year when at least one chemistry laureate had a name beginning with 'L'? First try to solve this with a regular subquery:

In [1]:
%%sql

and then rewrite it using a CTE:

In [1]:
%%sql

We saw above that we can't have another WHERE after the HAVING clause:

In [1]:
%%sql
SELECT    category, count() AS count
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3
WHERE     count > 1        --   <-- not allowed!

but we can make our 'grouping query' into a subquery, and have another WHERE in the outer query:

In [1]:
%%sql
SELECT category, count
FROM (
    SELECT    category, count() AS count
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3)
WHERE  count > 1

A tidier way of expressing this is to use a WITH-statement (CTE):

In [1]:
%%sql
WITH category_count(category, count) AS (
    SELECT    category, count()
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3
)
SELECT category, count
FROM   category_count
WHERE  count > 1

Exercise: How many different laureates do we have in each category? Observe that there are a couple of laureates with two awards in the same category, they should only be counted once – Marie Curie, on the other hand, has won in two categories, so she should be counted in two different categories. Use a CTE to keep track of all laureates in each category.

In [1]:
%%sql

Problem: There is a 'funny' function which returns series of values – it's called generate_series, and you can try to use it to see which year no nobel price at all was awarded (the function generate_series is quite new, so this might not work in your notebook if it uses an older version of sqlite3).

In [1]:
%%sql

Exercise: Some of the years when Nobel prizes actually were awarded, it was withheld for one or more categories – in which years did the Swedish Academy refuse to award anyone, despite prizes being awarded in other categories?

In [1]:
%%sql

Correlated subqueries¶

Another form of subquery allows us to correlate two SELECT statements, so that we use values from an outer SELECT statement inside the SELECT statement in a subquery. As an example, solve the following problem:

Problem: List the names of all laureates who have the longest name of all laureates in their categories (in case of ties, all should be listed) – order by category.

Here we can use a subquery which is 'run' for each row in our main query:

In [1]:
%%sql
SELECT  category, year, name
FROM    nobel_prizes AS outer_nobel
WHERE   length(name) = (
            SELECT max(length(name))
            FROM   nobel_prizes
            WHERE  category = outer_nobel.category)
ORDER BY category

This is called a correlated subquery (since we refer to the enclosing query inside it). We use an alias to distinguish between the nobel table in the outer query and the nobel table in the subquery (it's the same table, but we 'iterate' through it separately).

BTW, we could have skipped the AS in

...
FROM    nobel_prizes AS outer_nobel
...

and just written:

...
FROM    nobel_prizes outer_nobel
...

The general opinion is that we should use AS, as it makes it more obvious that we're defining an alias.

When one table is not enough¶

In the preparation notes for this week's lectures, you were asked to think through the problem below.


Problem: We've been asked to write code to handle students and their college applications. Each student has a name, a student id (stil), and an average grade. The students apply for various majors at colleges, where each college has a name, is situated in a given state, and has a given enrollment. Each application is for a specific major at a specific college, and it can be accepted, rejected, or pending.

  1. What classes and/or data structures would you use to keep track of this data in Python or Java?
  2. How do you list all applications for a given student?
  3. How do you list all applications for a given major at a given college?
  4. Approximately how many lines of Python/Java code do you need to list the three programs (major@college) with the highest grade average for the 'last' accepted student (assuming the students are accepted in order of decreasing grade averages)?
  5. How would you persist your data, so it doesn't get lost if we have to restart the program?
  6. Can you come up with a way to let several programs access the data simultaneously?
  7. If you solve the previous problem, and want to allow several programs update the data simultaneously, how do you make sure the data isn't corrupted?

If we wanted to keep all the information above in one table, we'd need the following columns (below I use some column names which are in a style I don't normally use, but I think it could be good for you to be exposed to different styles):

  • student id (s_id)
  • student name (s_name)
  • average grade (gpa)
  • college (c_name)
  • state (state)
  • enrollment (enrollment)
  • major (major)
  • decision (decision)

Here is a (very small) example:

s_id s_name gpa c_name state enrollment major decision
123Amy3.9 StanfordCA15000 CSY
123Amy3.9 StanfordCA15000 EEN
123Amy3.9 BerkeleyCA36000 CSY
123Amy3.9 CornellNY21000 EEY
234Bob3.6 BerkeleyCA36000 biologyN
345Craig3.5 MITMA10000 bioengineeringY
345Craig3.5 CornellNY21000 bioengineeringN
345Craig3.5 CornellNY21000 CSY
345Craig3.5 CornellNY21000 EEN
678Fay3.8 StanfordCA15000 historyY
987Helen3.7 StanfordCA15000 CSY
987Helen3.7 BerkeleyCA36000 CSY
876Irene3.9 StanfordCA15000 CSN
876Irene3.9 MITMA10000 biologyY
876Irene3.9 MITMA10000 marine biologyN
765Jay2.9 StanfordCA15000 historyY
765Jay2.9 CornellNY21000 historyN
765Jay2.9 CornellNY21000 psychologyY
543Craig3.4 MITMA10000 CSN

This table is wonderful for finding information, but it comes with some big problems:

  • some of the information is repeated in many places (which is wasteful and error prone)

  • if we want to change Amy's gpa, we'd have to change it in several places (once per application she's made)

  • if a student withdraws her last application, we'd loose all information about that student (this has actually happened here already, as you'll see below)

So, although much of the information in the table is redundant, and it may seem that we easily could remove many rows without losing any information, in fact we can't (as we'll see in week 4) – any row we remove will mean that we lose some potentially important information (that specific combination of values will get lost).

In week 4 we'll discuss the theory behind these problems (functional dependencies), and see a technique we can use to find well behaved tables (normalization), but I hope that you intuitively can see that everything would be easier if we had the following three tables instead of the big table above:

  • one table, students, with information about the students (id, name and gpa)
s_id s_name gpa
123 Amy 3.9
234 Bob 3.6
345 Craig 3.5
456 Doris 3.9
543 Craig 3.4
567 Edward 2.9
654 Amy 3.9
678 Fay 3.8
765 Jay 2.9
789 Gary 3.4
876 Irene 3.9
987 Helen 3.7

Here we have some students which weren't seen in the original table above, since they did not have any applications (there is no data to put in the columns relating to applications for them – we'll soon see that there is a way around this).

In [1]:
%%sql
  • one table, colleges, with college information (name and state)
c_name state enrollment
Stanford CA 15000
Berkeley CA 36000
MIT MA 10000
Cornell NY 21000
In [1]:
%%sql
  • one table, applications, with applications (student id, college name, major, and decision)
s_id c_name major decision
123 Stanford CS Y
123 Stanford EE N
123 Berkeley CS Y
123 Cornell EE Y
234 Berkeley biology N
345 MIT bioengineering Y
345 Cornell bioengineering N
345 Cornell CS Y
345 Cornell EE N
678 Stanford history Y
987 Stanford CS Y
987 Berkeley CS Y
876 Stanford CS N
876 MIT biology Y
876 MIT marine biology N
765 Stanford history Y
765 Cornell history N
765 Cornell psychology Y
543 MIT CS N
In [1]:
%%sql

Nothing in these tables is redundant, so if we use these tables, the issues we saw above goes away (we'll talk more about it in week 4). But the benefits come with a cost, we now need a way to combine information from several tables at once.

Let's say we want to show the names and grade averages of all students who have applied for CS at Stanford – none of our three tables above contain all information we need, but if we start by looking in the applications table, we can get …

In [13]:
%%sql
SELECT    s_id
FROM      applications
WHERE     c_name = 'Stanford' AND major = 'CS'
Running query in 'sqlite:///lect02.sqlite'
Out[13]:
s_id
123
876
987

The best we can get here is the id of all relevant students, but given the id we can look in the students table for the name and gpa.

In regular programming languages, such as Python or Java, we could do that with functions such as find_name(s_id) and find_gpa(s_id):

In [1]:
%%sql
SELECT    find_name(s_id), find_gpa(s_id)     -- not in SQL!
FROM      applications
WHERE     c_name = 'Stanford' AND major = 'CS'

But SQL is no regular language, it's based on relational algebra, and there we have another, really cool way of solving this problem: joins).

Inner joins¶

We solve the problem above by combining (joining) the applications table with the students table. There are several kinds of joins in SQL (and relational algebra), but the most straightforward kind is what's called an inner join#Inner_join).

The inner join of two tables is a new table, where rows from each table are joined together if the attributes of the rows matches a join predicate – in our case we want to add student data to the application data where the student id in the applications table matches the student id in the students table.

This can be expressed as (see the docs):

In [14]:
%%sql
SELECT *
FROM   applications
       INNER JOIN students ON applications.s_id = students.s_id
Running query in 'sqlite:///lect02.sqlite'
Out[14]:
s_id c_name major decision s_id_1 s_name gpa
123 Stanford CS Y 123 Amy 3.9
123 Stanford EE N 123 Amy 3.9
123 Berkeley CS Y 123 Amy 3.9
123 Cornell EE Y 123 Amy 3.9
234 Berkeley biology N 234 Bob 3.6
345 MIT bioengineering Y 345 Craig 3.5
345 Cornell bioengineering N 345 Craig 3.5
345 Cornell CS Y 345 Craig 3.5
345 Cornell EE N 345 Craig 3.5
678 Stanford history Y 678 Fay 3.8
987 Stanford CS Y 987 Helen 3.7
987 Berkeley CS Y 987 Helen 3.7
876 Stanford CS N 876 Irene 3.9
876 MIT biology Y 876 Irene 3.9
876 MIT marine biology N 876 Irene 3.9
765 Stanford history Y 765 Jay 2.9
765 Cornell history N 765 Jay 2.9
765 Cornell psychology Y 765 Jay 2.9
543 MIT CS N 543 Craig 3.4

and we would get the joined table (note that the first columns are from the applications table, whereas the last columns are from the students table – also note that s_id and s_id' always are the same in each row):

s_id c_name major decision s_id' s_name gpa
123StanfordCS Y123Amy3.9
123StanfordEE N123Amy3.9
123BerkeleyCS Y123Amy3.9
123CornellEE Y123Amy3.9
234Berkeleybiology N234Bob3.6
345MITbioengineering Y345Craig3.5
345Cornellbioengineering N345Craig3.5
345CornellCS Y345Craig3.5
345CornellEE N345Craig3.5
678Stanfordhistory Y678Fay3.8
987StanfordCS Y987Helen3.7
987BerkeleyCS Y987Helen3.7
876StanfordCS N876Irene3.9
876MITbiology Y876Irene3.9
876MITmarine biology N876Irene3.9
765Stanfordhistory Y765Jay2.9
765Cornellhistory N765Jay2.9
765Cornellpsychology Y765Jay2.9
543MITCS N543Craig3.4

which contains all information we need.

We'll soon see other kinds of joins, but inner joins are the 'default', and so we can skip the reserved word INNER, and just write:

In [1]:
%%sql
SELECT *
FROM   applications
       JOIN students ON applications.s_id = students.s_id

Since the join predicate is just a test of equality of one attribute from each table (it's called an equijoin), and the attribute has the same name in both tables, we can use the simpler form:

In [15]:
%%sql
SELECT *
FROM   applications
       JOIN students USING (s_id)
Running query in 'sqlite:///lect02.sqlite'
Out[15]:
s_id c_name major decision s_name gpa
123 Stanford CS Y Amy 3.9
123 Stanford EE N Amy 3.9
123 Berkeley CS Y Amy 3.9
123 Cornell EE Y Amy 3.9
234 Berkeley biology N Bob 3.6
345 MIT bioengineering Y Craig 3.5
345 Cornell bioengineering N Craig 3.5
345 Cornell CS Y Craig 3.5
345 Cornell EE N Craig 3.5
678 Stanford history Y Fay 3.8
987 Stanford CS Y Helen 3.7
987 Berkeley CS Y Helen 3.7
876 Stanford CS N Irene 3.9
876 MIT biology Y Irene 3.9
876 MIT marine biology N Irene 3.9
765 Stanford history Y Jay 2.9
765 Cornell history N Jay 2.9
765 Cornell psychology Y Jay 2.9
543 MIT CS N Craig 3.4

which gives (observe that we now only have one s_id column):

s_id c_name major decision s_name gpa
123StanfordCS YAmy3.9
123StanfordEE NAmy3.9
123BerkeleyCS YAmy3.9
123CornellEE YAmy3.9
234Berkeleybiology NBob3.6
345MITbioengineering YCraig3.5
345Cornellbioengineering NCraig3.5
345CornellCS YCraig3.5
345CornellEE NCraig3.5
678Stanfordhistory YFay3.8
987StanfordCS YHelen3.7
987BerkeleyCS YHelen3.7
876StanfordCS NIrene3.9
876MITbiology YIrene3.9
876MITmarine biology NIrene3.9
765Stanfordhistory YJay2.9
765Cornellhistory NJay2.9
765Cornellpsychology YJay2.9
543MITCS NCraig3.4

In the first join above we got two s_id columns (SQL can't possibly know that s_id describes the same thing in both tables), but with the JOIN USING construct, we tell SQL that they describe the same thing, so the join returns only one column for s_id.

In general, we can compare sets of several attributes in a JOIN USING query.

The number of rows in a joined table¶

At this point, we should consider how many rows we get when we join two tables. Assume we want to keep track of the students' phone numbers – the first 'rule' of relational databases is that all values must be primitive, so if a student has more than one phone number, she can't have them saved in a list in a table. Instead we introduce a new table, phone_numbers, like this:

In [1]:
%%sql
SELECT   *
FROM     phone_numbers
s_id phone_no
123 650-550-8671
345 347-467-6210
345 516-881-8303
456 650-255-3090
456 650-376-9029
456 628-265-5898
543 617-155-4170
567 650-412-4837
654 619-339-6883
678 650-228-4538
765 508-765-8878
876 774-877-3733
876 617-522-3278
987 650-964-4042

Here we use the students' s_id numbers, since their names might not be unique (we have two Amys and two Craigs), and then provide a phone number.

Each student can have either no phone number, one phone number, or several phone numbers, in the table above, Amy (s_id 101) has 1, Bob (s_id 234) has none, Craig (s_id 345) has two, and Doris (s_id 456) has three.

If we join our students table with this table, using an equijoin on s_id, we get:

In [16]:
%%sql
SELECT     *
FROM       students
           JOIN phone_numbers USING (s_id)
Running query in 'sqlite:///lect02.sqlite'
Out[16]:
s_id s_name gpa phone_no
123 Amy 3.9 650-550-8671
345 Craig 3.5 347-467-6210
345 Craig 3.5 516-881-8303
456 Doris 3.9 650-255-3090
456 Doris 3.9 650-376-9029
456 Doris 3.9 628-265-5898
543 Craig 3.4 617-155-4170
567 Edward 2.9 650-412-4837
654 Amy 3.9 619-339-6883
678 Fay 3.8 650-228-4538
765 Jay 2.9 508-765-8878
876 Irene 3.9 774-877-3733
876 Irene 3.9 617-522-3278
987 Helen 3.7 650-964-4042
s_id s_name phone_no
123 Amy 650-550-8671
345 Craig 347-467-6210
345 Craig 516-881-8303
456 Doris 628-265-5898
456 Doris 650-255-3090
456 Doris 650-376-9029
… … …

So, since each student has its own row in the students table, they get one row per phone number in the joined table – this means that Bob, who didn't have any phone number, will not survive the join, whereas Craig and Doris get several rows (one row per phone number). In the section on outer joins below, we'll see how we can keep Bob alive, even if he doesn't have any phone number.

If each student already had several rows, such as if we had joined students and applications, and then joined this combined table with phone_numbers, we would get:

In [1]:
%%sql
SELECT    *
FROM      students
          JOIN applications USING (s_id)
          JOIN phone_numbers USING (s_id)
WHERE     s_id < 500
s_id s_name gpa c_name major decision phone_no
123 Amy 3.9 Berkeley CS Y 650-550-8671
123 Amy 3.9 Cornell EE Y 650-550-8671
123 Amy 3.9 Stanford CS Y 650-550-8671
123 Amy 3.9 Stanford EE N 650-550-8671
345 Craig 3.5 Cornell CS Y 347-467-6210
345 Craig 3.5 Cornell CS Y 516-881-8303
345 Craig 3.5 Cornell EE N 347-467-6210
345 Craig 3.5 Cornell EE N 516-881-8303
345 Craig 3.5 Cornell bioengineering N 347-467-6210
345 Craig 3.5 Cornell bioengineering N 516-881-8303
345 Craig 3.5 MIT bioengineering Y 347-467-6210
345 Craig 3.5 MIT bioengineering Y 516-881-8303

So, the number of rows per student will be the product of the number of rows they had in each of the joined tables.

Also, if we had a phone number with an unknown s_id, the number would not end up in our joined table.

The results of an inner join of several tables does not depend on in which order we join our tables (but joining them in the right order can potentially save an enormous amount of work for a big join, just as multiplying matrices in the right order can save lots of work – fortunately the 'query planner' is very clever when it decides how to join tables, we'll return to that later).

After we've joined…¶

Once we've joined our tables, we can proceed as we did before, with a WHERE clause:

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications
          JOIN students USING (s_id)
WHERE     c_name = 'Stanford' AND major = 'CS'

If we look at the syntax diagram for the SELECT statement, we see that we join our tables in the FROM clause (see join-clause), and that we can use everything we've learned so far to work on the joined tables (so we can group and use window functions and subqueries just as we've done before).

The combined, rather clumsy table will never be created, database engines are extremely clever when they decide on what data they need to keep around to execute their queries – so we normally do not need to worry much about efficiency (we'll talk a little more about this later in the course).


It could be useful to stop here for a while, and consider how you'd solve the problem above using Python or Java (or any other regular programming language). In case we let the students keep track of their applications, we'd need to loop through all students, and then all their applications looking for applications to CS@Stanford.

Deciding on who should keep track of what is a typical example of premature optimization, and using a relational database relieves us of this problem.


Problem: Show the id and name for all students who has applied for CS in California (state code CA).

In [18]:
%%sql
SELECT     *
FROM       students
           JOIN applications USING (s_id)
           JOIN colleges USING (c_name)
WHERE      state = 'CA'
Running query in 'sqlite:///lect02.sqlite'
Out[18]:
s_id s_name gpa c_name major decision state enrollment
123 Amy 3.9 Stanford CS Y CA 15000
123 Amy 3.9 Stanford EE N CA 15000
123 Amy 3.9 Berkeley CS Y CA 36000
234 Bob 3.6 Berkeley biology N CA 36000
678 Fay 3.8 Stanford history Y CA 15000
987 Helen 3.7 Stanford CS Y CA 15000
987 Helen 3.7 Berkeley CS Y CA 36000
876 Irene 3.9 Stanford CS N CA 15000
765 Jay 2.9 Stanford history Y CA 15000

Cross joins¶

As we saw above, an inner join combines rows in two tables when the rows matches on some join predicate. The predicate is typically dependent on column values in both rows (e.g., that the student ids in each table should match), but we can also use a seemingly nonsensical predicate such as true (which is always true):

In [20]:
%%sql
SELECT   *
FROM     applications
         JOIN students ON false
Running query in 'sqlite:///lect02.sqlite'
Out[20]:
s_id c_name major decision s_id_1 s_name gpa

This would mean that all rows in the applications table would match all rows in the students table, and we would get what's called the cartesian product of the two tables (or sets of rows), i.e., all combinations of rows in the tables.

This is exactly what a cross join is – we can write it as:

In [28]:
%%sql
SELECT   *
FROM     applications
         CROSS JOIN students
Running query in 'sqlite:///lect02.sqlite'
Out[28]:
s_id c_name major decision s_id_1 s_name gpa

and we typically use a WHERE clause to narrow the generated table into a table where only matching rows remains afterwards.

So, the query

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications
          JOIN students USING (s_id)
WHERE     c_name = 'Stanford' AND major = 'CS'

could have been written using a cross join instead:

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications
          CROSS JOIN students
WHERE     applications.s_id = students.s_id
          AND c_name = 'Stanford'
          AND major = 'CS'

This may look terrible at first sight, creating a huge table and then throwing away most of the combined rows, but, as said above, database engines are very, very clever, and they might find a way to make even this query fast. But I'd still recommend that you use inner queries, since it might help the database to find a faster query plan, and it's conceptually easier to understand for a person looking at your code.

By the way, there is a shortcut for cross joins, …

In [1]:
%%sql
SELECT    s_name, gpa
FROM      applications, students
WHERE     applications.s_id = students.s_id
          AND c_name = 'Stanford'
          AND major = 'CS'

but, once again, opt for inner joins instead!

Outer joins¶

We saw above that a row in one table which matches several rows in the other table would give rise to several rows in the combined table. But it also goes the other way – when we do an inner join on two tables, rows for which the join predicate doesn't match any row in the other table will 'disappear' in the combined table.

So if we wanted to list all students and their applications, and wrote:

In [22]:
%%sql
SELECT    *
FROM      students
          JOIN applications USING (s_id)
ORDER BY  s_name
Running query in 'sqlite:///lect02.sqlite'
Out[22]:
s_id s_name gpa c_name major decision
123 Amy 3.9 Stanford CS Y
123 Amy 3.9 Stanford EE N
123 Amy 3.9 Berkeley CS Y
123 Amy 3.9 Cornell EE Y
234 Bob 3.6 Berkeley biology N
345 Craig 3.5 MIT bioengineering Y
345 Craig 3.5 Cornell bioengineering N
345 Craig 3.5 Cornell CS Y
345 Craig 3.5 Cornell EE N
543 Craig 3.4 MIT CS N
678 Fay 3.8 Stanford history Y
987 Helen 3.7 Stanford CS Y
987 Helen 3.7 Berkeley CS Y
876 Irene 3.9 Stanford CS N
876 Irene 3.9 MIT biology Y
876 Irene 3.9 MIT marine biology N
765 Jay 2.9 Stanford history Y
765 Jay 2.9 Cornell history N
765 Jay 2.9 Cornell psychology Y

we would get all applications, but those students who hadn't applied for any major would be left out.

If we want to make sure that all rows in the first ('left') table survives the join, we can use a left outer join. An outer join is a join in which we guarantee that all rows in one or both our tables will be present in the result. But that leads to an important question: how do we combine an 'uncombinable' row? The answer is that we combine it with a row full of NULL values in the other table – if we try a left outer join, and put the students table first (on the left), we get:

In [23]:
%%sql
SELECT    *
FROM      students
          LEFT OUTER JOIN applications USING (s_id)
ORDER BY  s_name
Running query in 'sqlite:///lect02.sqlite'
Out[23]:
s_id s_name gpa c_name major decision
123 Amy 3.9 Berkeley CS Y
123 Amy 3.9 Cornell EE Y
123 Amy 3.9 Stanford CS Y
123 Amy 3.9 Stanford EE N
654 Amy 3.9 None None None
234 Bob 3.6 Berkeley biology N
345 Craig 3.5 Cornell CS Y
345 Craig 3.5 Cornell EE N
345 Craig 3.5 Cornell bioengineering N
345 Craig 3.5 MIT bioengineering Y
543 Craig 3.4 MIT CS N
456 Doris 3.9 None None None
567 Edward 2.9 None None None
678 Fay 3.8 Stanford history Y
789 Gary 3.4 None None None
987 Helen 3.7 Berkeley CS Y
987 Helen 3.7 Stanford CS Y
876 Irene 3.9 MIT biology Y
876 Irene 3.9 MIT marine biology N
876 Irene 3.9 Stanford CS N
765 Jay 2.9 Cornell history N
765 Jay 2.9 Cornell psychology Y
765 Jay 2.9 Stanford history Y
s_id c_name major decision s_name gpa
123Amy3.9 BerkeleyCSY
123Amy3.9 CornellEEY
123Amy3.9 StanfordCSY
123Amy3.9 StanfordEEN
654Amy3.9 NULL NULL NULL
234Bob3.6 BerkeleybiologyN
345Craig3.5 CornellCSY
345Craig3.5 CornellEEN
345Craig3.5 CornellbioengineeringN
345Craig3.5 MITbioengineeringY
543Craig3.4 MITCSN
456Doris3.9 NULL NULL NULL
567Edward2.9 NULL NULL NULL
678Fay3.8 StanfordhistoryY
789Gary3.4 NULL NULL NULL
987Helen3.7 BerkeleyCSY
987Helen3.7 StanfordCSY
876Irene3.9 MITbiologyY
876Irene3.9 MITmarine biologyN
876Irene3.9 StanfordCSN
765Jay2.9 CornellhistoryN
765Jay2.9 CornellpsychologyY
765Jay2.9 StanfordhistoryY

Here we see 'the other Amy' (i.e., the one with s_id 654), Doris, Edward, and Gary, who apparently haven't applied for anything – the columns which would have been taken from the applications table in case of a match are now set to NULL.

The fact that an outer join pads rows which can't be matched in the 'other' table with NULL values gives us a nice way to find out which values might be missing in a table – se the following problem.

Problem: We want to find all students who hasn't applied for any major – do it in two ways:

  • Using a subquery.
In [26]:
%%sql
SELECT   s_id, s_name
FROM     students
         LEFT JOIN applications USING (s_id)
WHERE    major IS NULL
Running query in 'sqlite:///lect02.sqlite'
Out[26]:
s_id s_name
456 Doris
567 Edward
654 Amy
789 Gary
  • Using an outer join (checking where we've padded with NULL columns).
In [1]:
%%sql

This way of finding missing values may look like a hack, but it's in fact idiomatic.

Other outer joins¶

There are three kinds of outer joins:

  • LEFT OUTER JOIN, which can be abbreviated LEFT JOIN – this is when we keep all rows in the first (leftmost) table, padding with NULL values when they don't match any row in the other table.

  • RIGHT OUTER JOIN, which can be abbreviated RIGHT JOIN – this is when we keep all rows in the second (rightmost) table, padding with NULL values when they don't match any row in the other table.

  • FULL OUTER JOIN, which can be abbreviated FULL JOIN – this is when we keep all rows in both tables, padding with NULL values when they don't match any row in the other table.

Exercises¶

Exercise: How many students with a gpa above 3.5 have applied for a CS major at a college in California (CA)?

In [1]:
%%sql

Hint: It could be a good idea to first define a CTE with the relevant students.

Your query should return 3.

Exercise: List the id, name, and gpa for all students who have applied for more than one major in California (CA).

In [1]:
%%sql

Your query should return:

s_id  s_name  gpa
----  ------  ---
123   Amy     3.9
987   Helen   3.7

Exercise: Which are the three programs (major@college) with the lowest gpa for any accepted student.

In [1]:
%%sql

The output should be:

c_name    major       min(gpa)
--------  ----------  --------
Cornell   psychology  2.9
Stanford  history     2.9
Cornell   CS          3.5

Exercise: And which are the three programs with the highest gpa for a student not accepted?

In [1]:
%%sql

The output should be:

c_name    major           max(gpa)
--------  --------------  --------
MIT       marine biology  3.9
Stanford  CS              3.9
Stanford  EE              3.9

Comment: Apparently the gpa isn't the only criterion for acceptance, since Irene (s_id 876) wasn't accepted at CS@Stanford, but Helen (s_id 987) was, although Irene has a higher gpa…

Exercise: In the section about correlated subqueries, we solved the problem of finding the laureates with the longest name for each category – solve this problem using a CTE instead of a correlated subquery (you can save the longest length of each in a CTE).

In [1]:
%%sql

Set operations¶

Relational databases and SQL are based on relational algebra, and this means that we can use simple set operations on our data. If we look at the compound-operator in the SELECT statement, we see that we can combine the results of queries using UNION, INTERSECT, and EXCEPT.

This only works if the queries return exactly the same columns.

No description has been provided for this image

Problem: List the names and gpa for students who have applied for a major at a college in California, but haven't applied in any other state – use a set operation (it could also be solved in a number of other ways).

In [27]:
%%sql
WITH only_california(s_id) AS (
    SELECT   s_id
    FROM     applications
             JOIN colleges USING (c_name)
    WHERE state = 'CA'

    EXCEPT

    SELECT   s_id
    FROM     applications
             JOIN colleges USING (c_name)
    WHERE state <> 'CA'
)
SELECT     s_name, gpa
FROM       only_california
           JOIN students USING (s_id)
Running query in 'sqlite:///lect02.sqlite'
Out[27]:
s_name gpa
Bob 3.6
Fay 3.8
Helen 3.7

INSERT, UPDATE and DELETE¶

To maintain our databases, we need ways to change their contents, and for that we have tree important functions:

  • INSERT: inserts a row into a table

  • UPDATE: updates rows which matches some predicate

  • DELETE: deletes rows which matches some predicate

We'll talk more about these functions next week, when we talk about modeling (then we'll also discuss the concept of keys, which is quite important when we modify our tables).