EDAF75 - notebook for lecture 1

Week 1: Introduction, relational databases, and SQL

In the text below there are two kinds of problems:

Problems marked Problem, which I intend to solve during the lecture. Depending on how fast we progress, I may or may not have time to solve them all – those of the problems we have to skip during the lectures are left as exercises (see below), but we can discuss them during QA sessions.

Problems marked Exercise, which I suggest you solve yourselves (we can also work on them during the QA sessions).

This is a Jupyter notebook. Jupyter has built-in support for Julia, Python, and R (hence Ju-Pyt-R); here's some Python code:

def hello(name):
    print(f"hello, {name}!")

def main():
    name = input("What's your name: ")
    hello(name)

main()

You can run the code snippet above by clicking somewhere in the box and pressing Shift-Enter.

We're primarily going to run SQL code (see below) in our notebooks, but I'll also show you some Python code later on in the course (you don't have to learn Python to take the course, though).

Using tables to store data

If we were to keep track of all Nobel laureates in a Python or Java program, and didn't know about relational databases, we would probably define classes for the laureates, and put them in lists. We could also define classes for the categories, and have one list for each category, or have lists with one element per year, and somehow track all laureates in that year, or use dictionaries/maps. However we choose to keep track of the data, some searches, insertions and deletions would be easy to implement, and some would be cumbersome. We'd also have to be careful to keep our data consistent.

In this course, we'll use a technique which may at first seem too simple to be useful, but which turns out to be incredibly powerful. We're going to use relational databases, and we'll store the data in 'simple' tables. Each table looks like a simple spreadsheet – here is a table with some Nobel laureates:

<div> <table rules="all"> <tr> <th>Year</th> <th>Category</th> <th>Name</th> <th>Motivation</th> </tr> <TR><TD>1901</TD> <TD>chemistry</TD> <TD>Jacobus Henricus van &#39;t Hoff</TD> <TD>in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions</TD> </TR> <TR><TD>1901</TD> <TD>literature</TD> <TD>Sully Prudhomme</TD> <TD>in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect</TD> </TR> <TR><TD>1901</TD> <TD>medicine</TD> <TD>Emil Adolf von Behring</TD> <TD>for his work on serum therapy, especially its application against diphtheria, by which he has opened a new road in the domain of medical science and thereby placed in the hands of the physician a victorious weapon against illness and deaths</TD> </TR> <TR><TD>1901</TD> <TD>physics</TD> <TD>Wilhelm Conrad Röntgen</TD> <TD>in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him</TD> </TR> <TR><TD>1902</TD> <TD>chemistry</TD> <TD>Hermann Emil Fischer</TD> <TD>in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses</TD> </TR> <TR><TD>1902</TD> <TD>literature</TD> <TD>Christian Matthias Theodor Mommsen</TD> <TD>the greatest living master of the art of historical writing, with special reference to his monumental work, A history of Rome</TD> </TR> <TR><TD>1902</TD> <TD>medicine</TD> <TD>Ronald Ross</TD> <TD>for his work on malaria, by which he has shown how it enters the organism and thereby has laid the foundation for successful research on this disease and methods of combating it</TD> </TR> <TR><TD>1902</TD> <TD>physics</TD> <TD>Hendrik Antoon Lorentz</TD> <TD>in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation 
phenomena</TD> </TR> <TR><TD>1902</TD> <TD>physics</TD> <TD>Pieter Zeeman</TD> <TD>in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena</TD> </TR> <TR><TD>…</TD> <TD>…</TD> <TD>…</TD> <TD>…</TD> </TR> </table> </div>

A row represents an item, and a column represents a property of the items – in the example above, each row describes a Nobel prize someone has been awarded, and for each such prize, we have columns showing what year the prize was awarded, in what category, the name of the laureate, and the motivation.

One basic idea of relational databases is that all 'cells' in the table should be simple values (no lists or objects), and that we can use simple operations from relational algebra to get information from it (cells are sometimes allowed to have the value NULL when a value is missing; we'll return to that below). To work with our tables we use a programming language which is highly specialized for manipulating and extracting information: it is called SQL, which is short for Structured Query Language, and it's a Domain Specific Language (DSL) for finding information in tables. SQL can be pronounced as either "S-Q-L" or "sequel".

Setting up this notebook

There are many different Relational Database Management Systems (RDBMSs) which implement SQL; some of the most prominent are:

Most of these systems are client-server systems, i.e., they have one program, a SQL server, which handles the data, and clients which communicate with the server in various ways. There are several different kinds of clients:

  • We can run an IDE, which allows us to see our tables in a GUI.
  • We can run command line clients (CLI) – they are text-based programs which work like typical REPLs, so the output will just be text in a terminal window.
  • We can write scripts which we send to the server, often through a CLI.
  • We can run a notebook (such as this one), and have it communicate with our database.
  • We can write code in a general purpose language, and have it communicate with our database.

In the course, we'll try all of these methods to access our databases.

The RDBMS we'll use in the course is SQLite, which is a lightweight but still very powerful system – it is by far the most used RDBMS, and it's probably already running on all of your phones and computers (just as an example, if you use Chrome for browsing, your browsing history is typically saved in a SQLite database file, .config/google-chrome/Default/History, and Mozilla uses it for storing metadata in Firefox and Thunderbird). It's actually not a client/server system (instead it is a library which keeps our databases in files on our computer) – but in the course, we'll think of SQLite as if it were a traditional client/server system, because in many ways, it behaves as one.
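To make the "library which keeps our databases in files" point concrete, here is a minimal Python sketch using the standard sqlite3 module. The file name demo.sqlite and the one-row table are assumptions made up for this illustration; they are not the lecture's lect01.sqlite.

```python
import os
import sqlite3
import tempfile

# Hypothetical file name, used only for this sketch (not lect01.sqlite).
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

# Connecting creates the file if needed; no server process is involved.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.execute("INSERT INTO nobel_prizes VALUES (1902, 'medicine', 'Ronald Ross')")
conn.commit()
conn.close()

# A new connection reads the same data back from the file on disk.
conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT year, category, name FROM nobel_prizes").fetchall()
conn.close()

print(rows)
```

The whole database lives in that single file, which is why we can hand out a database for a lecture as one .sqlite file.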

In order to run SQL in this notebook, we need to load the SQL extension:

%load_ext sql

… and then we want some database to work on (I've put all the databases used for this lecture in a file called lect01.sqlite):

%sql sqlite:///lect01.sqlite

Now we're good to go, we just have to prefix our SQL queries with %sql (one line of SQL) or %%sql (several lines of SQL, this is the form we will use in most cases).

Using SQL to find information in a table (SELECT)

If we were to find out who were the Nobel laureates in 1910, we could just look through the lines of our table above, and if the rows in the table were ordered by year, it would be easy (that's essentially how mankind has used encyclopedias since they were invented about 2000 years ago).

To search for values in SQL, we use the SELECT statement:

<center> <img src="./select-core-01-SFW.svg"> </center>

(the diagram above is taken from SQLite's fabulous web site, you can find out more about the SELECT statement here).

If we follow the lines in the diagram above, we could puzzle together the query:

%%sql
SELECT   category, name
FROM     nobel_prizes
WHERE    year = 1910

This will return a new table, looking like:

<div> <table rules="all"> <tr> <th>Category</th> <th>Name</th> </tr> <TR><TD>chemistry</TD> <TD>Otto Wallach</TD> </TR> <TR><TD>literature</TD> <TD>Paul Johann Ludwig Heyse</TD> </TR> <TR><TD>medicine</TD> <TD>Albrecht Kossel</TD> </TR> <TR><TD>physics</TD> <TD>Johannes Diderik van der Waals</TD> </TR> </table> </div>

The Nobel prize was first awarded in 1901, so if we look for winners of the 1900 Nobel prizes, we would get an empty table back:

%%sql
SELECT   category, name
FROM     nobel_prizes
WHERE    year = 1900

SQL has a NULL value which is sometimes useful (we can use it to signal that one column in a row lacks a value), but observe that there is a huge difference between the empty table returned above, and NULL (if we had a list of strings in Python or Java, and wrote a function which returned all of the strings containing a specific substring, the function would return an empty list, not None or NULL).

NULL values are only used for individual columns, and we could potentially have used it in our Nobel prize database if we had an award for which there was no motivation. Let's say the fictional poet Oddput Clementin was awarded the Nobel prize in literature in 2024 – since there is no way to explain why he got it, we could set motivation to NULL in that case (we'll return to how we insert data into our tables later).

%%sql

The value NULL itself is weird: in SQL the test NULL = NULL doesn't return TRUE or FALSE, it returns NULL (but Jupyter writes that as None).

So the query:

%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     motivation = NULL

would always return an empty result.

To see if a value is NULL, we have to use the IS comparison:

%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     motivation IS NULL

Despite its weirdness, NULL can actually be useful in SQL, we'll return to this later.
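The behaviour of NULL described above can be checked outside the notebook with a small Python sketch. It assumes a toy in-memory table with the same column names as nobel_prizes, holding one real row from the table shown earlier and the fictional Oddput Clementin; it is not the lecture database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?, ?)",
    [
        (1902, "chemistry", "Hermann Emil Fischer",
         "in recognition of the extraordinary services he has rendered "
         "by his work on sugar and purine syntheses"),
        # The fictional laureate from the text; None is stored as NULL.
        (2024, "literature", "Oddput Clementin", None),
    ],
)

# NULL = NULL evaluates to NULL, which Python receives as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# '= NULL' is never true, so no rows match.
with_equals = conn.execute(
    "SELECT name FROM nobel_prizes WHERE motivation = NULL"
).fetchall()

# IS NULL is the test that finds the missing motivation.
with_is = conn.execute(
    "SELECT name FROM nobel_prizes WHERE motivation IS NULL"
).fetchall()

print(null_equals_null, with_equals, with_is)
```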

The fact that we get tables back from our queries is very useful – we'll return to that too in a little while.

If we wanted to see only laureates in physics in 1910, we'd just refine our condition:

%%sql
SELECT   name
FROM     nobel_prizes
WHERE    year = 1910 AND category = 'physics'

Problem: How do we get all literature laureates in the 1920s (see the docs)?

%%sql

Problem: How do we order the literature laureates in the 1920s by name?

%%sql

How do we get the search result above in reverse order?

Problem: Show the first 10 Nobel prizes in chemistry.

%%sql

Problem: Show Nobel prize number 32 to 42 in medicine (in chronological order).

%%sql

Problem: What year did Albert Einstein get his award?

%%sql

Problem: What year, and in what category did Churchill get his award?

%%sql

Most databases have the LIKE predicate, quite a few also have some kind of regular expressions.
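As a rough illustration of how LIKE patterns behave in SQLite, here is a Python sketch on a toy in-memory table (a few names copied from the table shown earlier, not the lecture database). In a pattern, % matches any sequence of characters and _ matches exactly one character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "literature", "Sully Prudhomme"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
    ],
)

def names_like(pattern):
    """Return the names matching a LIKE pattern, in alphabetical order."""
    query = "SELECT name FROM nobel_prizes WHERE name LIKE ? ORDER BY name"
    return [row[0] for row in conn.execute(query, (pattern,))]

starts_with_h = names_like("H%")     # names beginning with 'H'
ends_with_ross = names_like("%Ross") # names ending with 'Ross'
one_wildcard = names_like("_ully%")  # any first character, then 'ully'

print(starts_with_h, ends_with_ross, one_wildcard)
```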

Selecting only distinct values

If we wanted to see which years the Nobel prize was awarded, we could write:

%%sql
SELECT    year
FROM      nobel_prizes

but we would get many repetitions. To see only the distinct values, we can use a SELECT DISTINCT query:

%%sql
SELECT DISTINCT  year
FROM             nobel_prizes
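The difference between the two queries can be seen on a small scale with this Python sketch. It assumes a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier, and adds an ORDER BY so that the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

# One value per row in the table, so each year is repeated.
all_years = [row[0] for row in conn.execute(
    "SELECT year FROM nobel_prizes ORDER BY year")]

# DISTINCT removes the duplicates from the result.
distinct_years = [row[0] for row in conn.execute(
    "SELECT DISTINCT year FROM nobel_prizes ORDER BY year")]

print(all_years)
print(distinct_years)
```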

Problem: What categories are in our database?

%%sql

SQL scalar functions

In the docs, we see that the general form of the SELECT statement allows a result-column, and that a result-column can contain an expr. This opens up many, many possibilities (some of which we will get back to below), one of them is the use of simple scalar functions. We can also use these functions in other places, such as in our WHERE or ORDER BY clauses.

We can see the standard core functions here, try to use them to do the following:

Problem: Show the initial letter of each of the laureates in the year 2023.

%%sql

Problem: Show the 10 first winners in the following format (see the format-function – unfortunately it will not look great in the notebook…):

1901: chemistry  Jacobus Henricus van 't Hoff
1901: literature Sully Prudhomme
1901: medicine   Emil Adolf von Behring
1901: physics    Wilhelm Conrad Röntgen
1902: chemistry  Hermann Emil Fischer
1902: literature Christian Matthias Theodor Mommsen
1902: medicine   Ronald Ross
1902: physics    Hendrik Antoon Lorentz
1902: physics    Pieter Zeeman
1903: chemistry  Svante August Arrhenius

%%sql

Problem: List the five laureates with the shortest names.

%%sql

SQL aggregate functions

The functions above were applied to one or more columns in a row, and they returned results for each row. There is another kind of SQL function, called an aggregate function. An aggregate function collapses a whole table (or part of it, see below) into one row – the SQLite docs list some aggregate functions here.

One simple such function is count(X) – it counts the number of times a value is not NULL in a column, so to see how many Nobel laureates we have in our database, we can write:

%%sql
SELECT    count()
FROM      nobel_prizes;

This query returns just one row, with the count (what else could it have returned?).
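To see the difference between counting rows and counting non-NULL values, here is a Python sketch on a toy in-memory table (two real rows from the table shown earlier plus the fictional Oddput Clementin, whose motivation is NULL). In SQLite, count() without an argument counts rows, just as count(*) does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT, motivation TEXT)"
)
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?, ?)",
    [
        (1901, "physics", "Wilhelm Conrad Röntgen",
         "in recognition of the extraordinary services he has rendered by "
         "the discovery of the remarkable rays subsequently named after him"),
        (1902, "chemistry", "Hermann Emil Fischer",
         "in recognition of the extraordinary services he has rendered "
         "by his work on sugar and purine syntheses"),
        # Fictional laureate from the text, with a NULL motivation.
        (2024, "literature", "Oddput Clementin", None),
    ],
)

# count() and count(*) count rows; count(motivation) skips NULL values.
row_count, star_count, motivation_count = conn.execute(
    "SELECT count(), count(*), count(motivation) FROM nobel_prizes"
).fetchone()

print(row_count, star_count, motivation_count)
```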

Problem: How many Nobel prizes were awarded in 2023?

%%sql

Problem: When was the first Nobel prize awarded?

%%sql

Exercise: How many Nobel prizes in chemistry have been awarded?

%%sql

GROUP BY and HAVING

Using GROUP BY we can handle rows in groups – to understand how it works, let's first look at the following query:

%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category

Here the rows of each category will end up adjacent to each other, and using GROUP BY we insert an invisible divider between the groups, and perform any aggregate function on the whole 'group':

%%sql
SELECT    category, count()
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category

So, if we apply an aggregate function, such as count(), in a table which we have grouped, the function will be applied to each group, not to the whole table. Instead of getting one count() for the whole table (it would be a single value), we get one count() for each group (as above).
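The same idea on a smaller scale, as a Python sketch: it assumes a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier (not the lecture database), and compares one count() for the whole selection with one count() per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

# Without GROUP BY: the whole selection collapses into a single row.
total_1902 = conn.execute(
    "SELECT count() FROM nobel_prizes WHERE year = 1902"
).fetchall()

# With GROUP BY: one row, and one count, per category.
per_category_1902 = conn.execute("""
    SELECT    category, count()
    FROM      nobel_prizes
    WHERE     year = 1902
    GROUP BY  category
    ORDER BY  category
""").fetchall()

print(total_1902)
print(per_category_1902)
```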

If we add name in the first line, we get a somewhat arbitrary result:

%%sql
SELECT    category, count(), name
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category

The category and count are correct, but only one name is shown for each category.

The 'problem' is that we only get one row per group in the output, and that there may be several laureates in each group – our query will return one of them in a seemingly haphazard manner. To alleviate this problem, SQLite defines an aggregate function group_concat, which concatenates all values in the group (it also has the alias string_agg, to match the corresponding function in PostgreSQL):

%%sql
SELECT    category, count(), group_concat(name, ', ') AS "names"
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category

In recent versions of SQLite it's also possible to order the values in an aggregate function such as group_concat by using ORDER BY inside the function call (see the docs).

Observe that there is no problem displaying category in the SELECT-statement above, we get a value which we know will be the same for each row in the group (this is by definition, since that's what we grouped by).

If we're only interested in those categories with less than three laureates, we use HAVING to select only groups with a given property:

%%sql
SELECT    category, count(), group_concat(name, ', ') AS "names"
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3

This corresponds to a WHERE clause, but it applies to groups, not to individual rows (as WHERE does) – so WHERE and HAVING have similar effects (they somehow narrow a search), but they're absolutely not interchangeable!

Important (and often misunderstood): In the query above we first have a WHERE statement to select some rows from the whole table, and then group the resulting selection – this can be seen in this diagram:

<center> <img src="./select-00-overview.svg"> </center>

Every time we have both a WHERE and a HAVING in the same query, we must first use WHERE to select rows we can group, and then use HAVING to select groups. We can use WITH statements or subqueries (see below) if we want to have it the other way around.
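The order of evaluation can be checked with a small Python sketch. It assumes a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier, and runs the same grouping with and without a WHERE clause to show that the groups are built only from the rows WHERE lets through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

# WHERE picks the 1902 rows first; only those rows are grouped and counted.
only_1902 = conn.execute("""
    SELECT    category, count()
    FROM      nobel_prizes
    WHERE     year = 1902
    GROUP BY  category
    HAVING    count() > 1
    ORDER BY  category
""").fetchall()

# Without WHERE the groups span both years, so every group passes HAVING.
both_years = conn.execute("""
    SELECT    category, count()
    FROM      nobel_prizes
    GROUP BY  category
    HAVING    count() > 1
    ORDER BY  category
""").fetchall()

print(only_1902)
print(both_years)
```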

Problem: Has anyone won more than one Nobel prize? We can assume the names of the laureates are unique (so far they are!).

%%sql

Problem: How many laureates are there in each category?

%%sql

Problem: Which categories have had more than 200 laureates?

%%sql

Exercise: How many laureates were there each year between 1920 and 1930?

%%sql

Exercise: Which years saw more than 9 laureates?

%%sql

Exercise: Which have been the 20 years with most laureates? (We don't need to be precise in case of ties.)

%%sql

Using CASE WHEN

When we group values, we can make use of the CASE WHEN construct (see the penultimate track in the diagram below):

<center> <img src="./select-expr.svg"> </center>

For instance, we can use it to categorize the era for each of the physics laureates having a name beginning with 'A':

%%sql
SELECT   year, name,
         CASE
             WHEN year < 1970 THEN 'ancient era'
             WHEN year <= 2000 THEN 'a long time ago'
             ELSE 'quite recently'
         END AS era
FROM     nobel_prizes
WHERE    category = 'physics' AND name LIKE 'A%'

Problem: Use the CASE WHEN construct and GROUP BY to count the number of physics laureates beginning with 'A' in each era.

%%sql

A short intro to window functions

As we saw above, we can treat partitions of our rows as 'groups', by using GROUP BY.

We also saw that when we use an aggregate function on a group, we collapse whole groups into one row in the output (and when we apply it to a table without groups, the whole table collapses) – but sometimes we want to apply aggregate functions within a group of values, and keep each row in the output.

Above we listed all laureates in 2013, now we want to add one column to the output: for each laureate, we want to show how many laureates shared the prize in her or his category.

If we use GROUP BY and the aggregate function count() on the categories, we would only get one row per category, and using count() without grouping would collapse the whole result into just one row:

%%sql
SELECT    year, category, name, count() AS count      -- oh no!
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category

Fortunately, SQL has quite recently introduced a way to apply functions only over 'partitions' of our tables, and keep all rows in the output – using the reserved word OVER we can introduce a window of our rows, and apply the function only over this 'window':

%%sql
SELECT    year,
          category,
          name,
          count() OVER (PARTITION BY category) AS count
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category

We can also give our windows names, using an alias:

%%sql
SELECT    year,
          category,
          name,
          count() OVER categories AS count
FROM      nobel_prizes
WHERE     year = 2013
WINDOW    categories AS (PARTITION BY category)
ORDER BY  category

The aliased window definitions must come after any WHERE and HAVING, and before any ORDER BY.

So, if we use the reserved word OVER after a function, the function will be applied according to a 'window' (there is more to it than this, but this will suffice for now). In the window we can:

  • define a partition, using PARTITION BY,
  • define an order, using ORDER BY, and
  • define a range, which we can use to define groups of rows relative to each other (but we won't look at ranges in the course).

The function will be applied to each partition, in the same way we applied aggregate functions on groups above, but now we won't collapse the partitions. Observe that the partitioning and ordering are based only on the selection we make (i.e., only those rows which are chosen in our WHERE clause).
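Here is a Python sketch of a partitioned window on a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier (not the lecture database). It assumes the SQLite library bundled with Python is version 3.25 or later, since window functions were introduced in that release.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

# Every selected row is kept; the count is computed per category partition,
# and only over the rows that passed the WHERE clause (the 1902 rows).
rows = conn.execute("""
    SELECT    category,
              name,
              count() OVER (PARTITION BY category) AS shared_by
    FROM      nobel_prizes
    WHERE     year = 1902
    ORDER BY  category, name
""").fetchall()

for row in rows:
    print(row)
```

Note that the 1901 physics laureate does not contribute to the physics count, since the window only sees the rows selected by WHERE.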

We can use our regular aggregate functions as window functions, but there are also a couple of dedicated window functions, such as (there are more, but we won't use them in the course):

  • rank(): ranks rows by the order specified in the window, ties can occur,
  • row_number(): as rank(), but now we avoid ties, and rank by row number in the output, and
  • percent_rank(): gives a value between 0.0 and 1.0 (so it's a bit of a misnomer), giving the row's relative rank within its partition.

You can find more here.
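To compare the three functions, here is a Python sketch on three physics rows copied from the table shown earlier (a toy in-memory table, assuming SQLite 3.25 or later). The rows are ranked by year, so the two laureates of 1902 tie; the window for row_number() also orders by name, an assumption made here only to get a predictable tie-break.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

rows = conn.execute("""
    SELECT    name,
              year,
              rank()         OVER (ORDER BY year)       AS rank,
              row_number()   OVER (ORDER BY year, name) AS row_number,
              percent_rank() OVER (ORDER BY year)       AS percent_rank
    FROM      nobel_prizes
    ORDER BY  year, name
""").fetchall()

# rank() gives both 1902 laureates rank 2, row_number() never repeats a value,
# and percent_rank() is (rank - 1) / (rows in partition - 1).
for row in rows:
    print(row)
```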

Window functions can be very powerful, but we'll not delve too deeply into them in the course – I want you to be aware of them, though!

Problem: Add one column which 'ranks' the laureates of 2013 in the table above according to the lengths of their names, within the categories – shorter names should come before longer names.

%%sql

Exercise: For each laureate with the initial A, list their category, year, name, and 'freshness' within that category, i.e., the most recent laureate in a category is ranked 1, the second most recent laureate is ranked 2, etc. The ranks should be confined to laureates with the initial A.

%%sql

Some exercises

To spice things up a bit, I've included a table with all olympic games since 1896 – the table olympics contains the columns:

  • year
  • city
  • country
  • continent
  • season
  • ordinal_number

If we look carefully at this table, we can find some unnecessary repetition (try to find it!) – we will address this problem during lecture 2, but for now, we'll let it pass.

Exercise: How many olympic games has each continent hosted?

%%sql

Exercise: When were the first olympic games on each continent?

%%sql

Exercise: Which countries have hosted the summer olympics more than once?

%%sql

Exercise: List the continents in descending order by the number of times they've hosted the summer olympics

%%sql

Exercise: Show a 'histogram' (no actual diagram, just the counts) over the initial letter of the names of all Nobel laureates (see how big your chance is…).

%%sql

Exercise: Show a 'histogram' over the initial letter of the names of all Nobel laureates, for each category.

%%sql

Exercise: Has anyone won more than one Nobel prize in the same category?

%%sql

Exercise: Has anyone won Nobel prizes in different categories?

%%sql

Exercise: For each olympic game, show how many olympic games had come before it in its continent. (Requires window functions)

%%sql

Subqueries, Views and Common Table Expressions (CTEs)

As we noted above, the result of a SELECT statement is itself a (kind of) table, and we can use such a table inside other SELECT statements.

One useful pattern is:

SELECT ...
FROM   ...
WHERE  ... IN
       (SELECT ...
        FROM ...
        WHERE ...)

The second, nested query is called a subquery.
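As a small illustration of the pattern, here is a Python sketch on a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier. The question it answers (who received a prize in a year when Pieter Zeeman did?) is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

# The inner query produces a one-column table of years;
# the outer query keeps the rows whose year is in that table.
names = [row[0] for row in conn.execute("""
    SELECT    name
    FROM      nobel_prizes
    WHERE     year IN (
                SELECT  year
                FROM    nobel_prizes
                WHERE   name = 'Pieter Zeeman')
    ORDER BY  name
""")]

print(names)
```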

We'll use a subquery to find all literature laureates who split their prizes – to do it we begin with a regular query, to find which years the Nobel prize for literature was split.

%%sql

… and now we use the result of that query to find out what we're really looking for:

%%sql

This can be simplified by using either of two ways to define 'temporary tables':

  • views
  • Common Table Expressions.

A view can be seen as a new table, containing the result of a query – to create a view with the years where the literature prize was shared, we can write:

%%sql
CREATE VIEW shared_literature_prize(year) AS
  SELECT ... -- just as above...

We can now use our view in a query:

%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN (
            SELECT    year
            FROM      shared_literature_prize
          )

Since there is only one attribute in our shared_literature_prize view, we can simplify this expression:

%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize

The view will be around until we decide to remove it with:

%%sql
DROP VIEW shared_literature_prize

A Common Table Expression (or CTE) is like a view, but only defined in one query, so we'd write it like:

%%sql
WITH
  shared_literature_prize(year) AS (
    SELECT ... -- as above...
  )
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize

There are some things which make CTEs very nice:

  • they're defined as part of a SELECT statement (so there is nothing to drop afterwards),
  • since they're part of a SELECT statement, we only need one statement (which will become useful when we call our database remotely, we'll return to that later in the course), and
  • they can be defined recursively (we'll return to that later in the course).
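The first point in the list can be seen in a Python sketch which runs the same query once with a view and once with a CTE. It uses a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier, and busy_years is a name made up for this example (years with more than four laureates).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

def stored_views():
    """Return the names of the views currently stored in the database."""
    query = "SELECT name FROM sqlite_master WHERE type = 'view'"
    return [row[0] for row in conn.execute(query)]

# A view is stored in the database until we drop it.
conn.execute("""
    CREATE VIEW busy_years(year) AS
      SELECT year FROM nobel_prizes GROUP BY year HAVING count() > 4
""")
via_view = conn.execute("""
    SELECT name FROM nobel_prizes
    WHERE  year IN (SELECT year FROM busy_years)
    ORDER BY name
""").fetchall()
views_while_defined = stored_views()
conn.execute("DROP VIEW busy_years")

# A CTE exists only inside its own statement, so nothing is left behind.
via_cte = conn.execute("""
    WITH busy_years(year) AS (
      SELECT year FROM nobel_prizes GROUP BY year HAVING count() > 4
    )
    SELECT name FROM nobel_prizes
    WHERE  year IN (SELECT year FROM busy_years)
    ORDER BY name
""").fetchall()
views_after_cte = stored_views()

print(via_view == via_cte, views_while_defined, views_after_cte)
```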

Problem: Show the years and categories for recurring laureates (i.e., laureates who have won more than once) – use a CTE to do it.

%%sql

Exercise: Who has won the literature prize in a year when at least one chemistry laureate had a name beginning with 'L'? First try to solve this with a regular subquery, and then rewrite it using a CTE.

%%sql

We saw above that we can't have another WHERE after the HAVING clause:

%%sql
SELECT    category, count() AS count
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3
WHERE     count > 1        --   <-- not allowed!

but we can make our 'grouping query' into a subquery, and have another WHERE in the outer query:

%%sql
SELECT category, count
FROM (
    SELECT    category, count() AS count
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3)
WHERE  count > 1

A somewhat tidier way of expressing this is to use a WITH-statement (CTE):

%%sql
WITH category_count(category, count) AS (
    SELECT    category, count()
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3
)
SELECT category, count
FROM   category_count
WHERE  count > 1

Problem: Show which years there were prizes in some category, but not in medicine (you can do it either with or without CTEs, but use a subquery).

%%sql

Problem: There is a 'funny' function which returns series of values – it's called generate_series, and you can try to use it to see which years no Nobel prize at all was awarded.

%%sql

Correlated subqueries

Another form of subquery is:

SELECT ...,
       (SELECT ...
        FROM ...
        WHERE ...)
FROM   ...

This works if the subquery produces one result, such as when we use an aggregate function. As an example, solve the following problem:

Problem: List the names of all laureates who have the longest name of all laureates in their category (in case of ties, all should be listed) – order by category.

Here we can use a subquery which is 'run' for each row in our main query:

%%sql
SELECT  category, year, name
FROM    nobel_prizes AS outer_nobel
WHERE   length(name) = (
            SELECT max(length(name))
            FROM   nobel_prizes
            WHERE  category = outer_nobel.category)
ORDER BY category

This is called a correlated subquery (since we refer to the enclosing query inside it). We use an alias to distinguish between the nobel_prizes table in the outer query and the nobel_prizes table in the subquery (it's the same table, but we 'iterate' through it separately).

BTW, we could have skipped the AS in

...
FROM    nobel_prizes AS outer_nobel
...

and just written:

...
FROM    nobel_prizes outer_nobel
...

The general opinion is that we should use AS, as it makes it more obvious that we're defining an alias.
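The pattern at the start of this section puts the subquery in the SELECT list rather than in the WHERE clause. Here is a Python sketch of that form on a toy in-memory copy of the 1901 and 1902 rows from the table shown earlier; the alias outer_prizes and the column name in_category are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel_prizes (year INT, category TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nobel_prizes VALUES (?, ?, ?)",
    [
        (1901, "chemistry", "Jacobus Henricus van 't Hoff"),
        (1901, "literature", "Sully Prudhomme"),
        (1901, "medicine", "Emil Adolf von Behring"),
        (1901, "physics", "Wilhelm Conrad Röntgen"),
        (1902, "chemistry", "Hermann Emil Fischer"),
        (1902, "literature", "Christian Matthias Theodor Mommsen"),
        (1902, "medicine", "Ronald Ross"),
        (1902, "physics", "Hendrik Antoon Lorentz"),
        (1902, "physics", "Pieter Zeeman"),
    ],
)

# For each 1901 laureate, the subquery counts all rows (from any year)
# in that laureate's category; it refers to the outer row through the alias.
rows = conn.execute("""
    SELECT    name,
              category,
              (SELECT count()
               FROM   nobel_prizes
               WHERE  category = outer_prizes.category) AS in_category
    FROM      nobel_prizes AS outer_prizes
    WHERE     year = 1901
    ORDER BY  category
""").fetchall()

for row in rows:
    print(row)
```

Since the subquery uses an aggregate function without grouping, it produces exactly one value per outer row, which is what this form requires.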