EDAF75 - notebook for lecture 1¶

Author: Christian Söderberg

Week 1: Introduction, relational databases, and SQL¶

In the text below there are two kinds of problems:

  • Problems marked Problem, which I intend to solve during the lecture. Depending on how fast we progress, I may or may not have time to solve them all – those of the problems we have to skip during the lectures are left as exercises (see below), and we can discuss them during QA sessions.

  • Problems marked Exercise, which I suggest you solve yourselves (we can also work on them during the QA sessions).

This is a Jupyter notebook, and they have built in support for Julia, Python, and R (hence Ju-Pyt-R), here's some Python code:

In [1]:
def hello(name):
    print(f"hello, {name}!")

def main():
    name = input("What's your name: ")
    hello(name)

main()
hello, World!

You can run the code snippet above by clicking somewhere in the box, and press Shift-Enter.

We're primarily going to run SQL code (see below) in our notebooks, but I'll also show you some Python code later on in the course (the Python code we'll mostly run from the CLI, though).

Using tables to store data¶

If we were to keep track of all Nobel laureates in a Python or Java program, and didn't know about relational databases, we would probably define classes for the laureates, and put them in lists. We could also define classes for the categories, and have one list for each category, or have lists with one element per year, and somehow track all laureates in that year, or use dictionaries/maps. However we chose to keep track of the data, some searches, insertions and deletions would be easy to implement, and some would be cumbersome. We'd also have to be careful to keep our data consistent.

In this course, we'll use a technique which may at first seem too simple to be useful, but which turns out to be incredibly powerful. We're going to use relational databases, and we'll store the data in 'simple' tables. Each table looks like a spreadsheet – here is a table with some Nobel laureates:

year category name motivation
1901 chemistry Jacobus Henricus van 't Hoff in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions
1901 literature Sully Prudhomme in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect
1901 medicine Emil Adolf von Behring for his work on serum therapy, especially its application against diphtheria, by which he has opened a new road in the domain of medical science and thereby placed in the hands of the physician a victorious weapon against illness and deaths
1901 physics Wilhelm Conrad Röntgen in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him
1902 chemistry Hermann Emil Fischer in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses
1902 literature Christian Matthias Theodor Mommsen the greatest living master of the art of historical writing, with special reference to his monumental work, A history of Rome
1902 medicine Ronald Ross for his work on malaria, by which he has shown how it enters the organism and thereby has laid the foundation for successful research on this disease and methods of combating it
1902 physics Hendrik Antoon Lorentz in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena
1902 physics Pieter Zeeman in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena
… … … …

A row represents an item, and a column represents a property of the items – in the example above, each row describes a Nobel prize someone has been awarded, and for each such prize, we have columns showing what year the prize was awarded, in what category, the name of the laureate, and the motivation.

The first basic rule of relational databases is that all 'cells' in the table should be simple values (no lists or objects), and that we can use simple operations from relational algebra to get information from it (cells are sometimes allowed to have the value NULL, when a value is missing, we'll return to that below). To work with our tables we use a programming language which is highly specialized for manipulating and extracting information, it is called SQL, which is short hand for Structured Query Language, and it's a Domain Specific Language (DSL) for finding information in tables. SQL can be pronounced as either "S-Q-L", or "sequel".

Setting up this notebook¶

There are many different Relational Database Management Systems (RDMBS:es) which implements SQL, some of the most prominent are:

  • PostgreSQL
  • MariaDB
  • MySQL
  • Oracle
  • Microsoft SQL Server
  • IBM DD2
  • SQLite
  • DuckDB

Most of these systems are client-server-systems, i.e., they have one program, a SQL server, which handles the data, and clients who communicate with the server in various ways.

There are several different kinds of clients:

  • We can run an IDE, which allows us to see our tables in a GUI.

  • We can run command line clients (CLI) – they are text based programs who work like typical REPLs, output will just be text in a terminal window.

  • We can write scripts which we send to the server, often through a CLI.

  • We can run a notebook (such as this one), and have it communicate with our database.

  • We can write code in a general purpose language, and have it communicate with our database.

In the course, we'll try all of these methods to access our databases.

The RDMBS we'll use in the course is SQLite, which is a lightweight but still very powerful system – it is by far the most used RDBMS, and it's probably already running on all of your phones and computers (just as an example, if you use Chrome for browsing, your browsing history is typically saved in a SQL-database file .config/google-chrome/Default/History, and Mozilla use it for storing meta-data in Firefox and Thunderbird). It's actually not a client/server system (instead it is a library which keeps our databases in files on our computer) – but in the course, we'll think of SQLite as if it were a traditional client/server system, because in many ways, it behaves as one (and new tools like Litestream gives it some of the benefits of a client/server system).

In order to run SQL in this notebook, we need to load the SQL extension:

In [2]:
%load_ext sql

… and then we want some database to work on (I've put all the databases used for this lecture in a file called lect01.sqlite):

In [3]:
%sql sqlite:///lect01.sqlite
Connecting to 'sqlite:///lect01.sqlite'

Now we're good to go, we just have to prefix our SQL queries with %sql (one line of SQL) or %%sql (several lines of SQL, this is the form we will use in most cases).

Using SQL to find information in a table (SELECT)¶

If we were to find out who were the Nobel laureates in 1910, we could just look through the lines of our table above, and if the rows in the table were ordered by year, it would be easy (that's essentially how mankind has used encyclopedias since they were invented about 2000 years ago).

To search for values in SQL, we use the SELECT statement:

No description has been provided for this image

(the diagram above is taken from SQLite's fabulous web site, you can find out more about the SELECT statement here).

If we follow the lines in the diagram above, we could puzzle together the query:

In [4]:
%%sql
SELECT   category, name
FROM     nobel_prizes
WHERE    year = 1910
Running query in 'sqlite:///lect01.sqlite'
Out[4]:
category name
chemistry Otto Wallach
literature Paul Johann Ludwig Heyse
medicine Albrecht Kossel
physics Johannes Diderik van der Waals

This will return a new table, looking like:

Category Name
chemistry Otto Wallach
literature Paul Johann Ludwig Heyse
medicine Albrecht Kossel
physics Johannes Diderik van der Waals

Nobel prize was first awarded in 1901, so if we look for winners of the 1900 Nobel prizes, we would get an empty table back:

In [5]:
%%sql
SELECT   category, name
FROM     nobel_prizes
WHERE    year = 1900
Running query in 'sqlite:///lect01.sqlite'
Out[5]:
category name

SQL has a NULL value which is sometimes useful (we can use it to signal that one column in a row lacks a value), but observe that there is a huge difference between the empty table returned above, and NULL (if we had a Python or Java function which was given a list of strings and should return all of these strings which contained an uppercase letter, it should return an empty list, not None or NULL, if there were no such string).

NULL values are only used for individual columns, and we could potentially have used it in our Nobel prize database if we had an award for which there was no motivation. Let's say the fictional poet Oddput Clementin was awarded the Nobel prize in literature in 2025 – since there is no way to explain why he got it it, we could set motivation to NULL in that case (we'll return to how we insert data into our tables later).

In [70]:
%%sql
INSERT   
INTO      nobel_prizes(year, category, name)
VALUES    (2025, 'literature', 'Oddput Clementin');
Running query in 'sqlite:///lect01.sqlite'
1 rows affected.
Out[70]:

The value NULL itself is weird, in SQL the test NULL = NULL doesn't return TRUE or FALSE, it's NULL (but Jupyter writes that as None).

So the query:

In [7]:
%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     motivation = NULL
Running query in 'sqlite:///lect01.sqlite'
Out[7]:
year category name

would always return an empty result.

To see if a value is NULL, we have to use the IS comparison:

In [8]:
%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     motivation IS NULL
Running query in 'sqlite:///lect01.sqlite'
Out[8]:
year category name
2025 literature Oddput Clementin

Despite it's weirdness, NULL can actually be useful in SQL, we'll return to this later.

The fact that we get tables back from our queries is very useful – we'll return to that too in a little while.

If we wanted to see only laureates in physics in 1910, we'd just refine the condition in our query:

In [9]:
%%sql
SELECT   name
FROM     nobel_prizes
WHERE    year = 1910 and category = 'physics'
Running query in 'sqlite:///lect01.sqlite'
Out[9]:
name
Johannes Diderik van der Waals

Problem: How do we get all literature laureates in the 1920ies (see the docs)?

In [10]:
%%sql
SELECT   year, name
FROM     nobel_prizes
WHERE    year BETWEEN 1920 AND 1929
Running query in 'sqlite:///lect01.sqlite'
Out[10]:
year name
1920 Walther Hermann Nernst
1920 Knut Pedersen Hamsun
1920 Schack August Steenberg Krogh
1920 Charles Edouard Guillaume
1921 Frederick Soddy
1921 Anatole France
1921 Albert Einstein
1922 Francis William Aston
1922 Jacinto Benavente
1922 Archibald Vivian Hill
Truncated to displaylimit of 10.

To limit the number of rows in the output when using jupysql we can set the variable SqlMagic.displaylimit – here we'll turn it off:

In [11]:
%config SqlMagic.displaylimit = None
displaylimit: Value None will be treated as 0 (no limit)

Problem: How do we order the literature laureates in the 1920ies by name?

In [13]:
%%sql
SELECT   year, name
FROM     nobel_prizes
WHERE    year BETWEEN 1920 AND 1929 AND category = 'literature'
ORDER BY name
Running query in 'sqlite:///lect01.sqlite'
Out[13]:
year name
1921 Anatole France
1925 George Bernard Shaw
1926 Grazia Deledda
1927 Henri Bergson
1922 Jacinto Benavente
1920 Knut Pedersen Hamsun
1928 Sigrid Undset
1929 Thomas Mann
1923 William Butler Yeats
1924 Wladyslaw Stanislaw Reymont

How do we get the search result above in reverse order?

Problem: Show the first 10 Nobel prizes in chemistry.

In [71]:
%%sql
SELECT  year, name
FROM    nobel_prizes
WHERE   category = 'literature'
LIMIT  10
Running query in 'sqlite:///lect01.sqlite'
Out[71]:
year name
1901 Sully Prudhomme
1902 Christian Matthias Theodor Mommsen
1903 Bjørnstjerne Martinus Bjørnson
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1905 Henryk Sienkiewicz
1906 Giosuè Carducci
1907 Rudyard Kipling
1908 Rudolf Christoph Eucken
1909 Selma Ottilia Lovisa Lagerlöf

Problem: Show Nobel prize number 32 to 42 in medicine (in chronological order).

In [16]:
%%sql
SELECT   year, name
FROM     nobel_prizes
WHERE    category = 'medicine'
LIMIT    11
OFFSET   31
Running query in 'sqlite:///lect01.sqlite'
Out[16]:
year name
1932 Sir Charles Scott Sherrington
1933 Thomas Hunt Morgan
1934 George Hoyt Whipple
1934 George Richards Minot
1934 William Parry Murphy
1935 Hans Spemann
1936 Otto Loewi
1936 Sir Henry Hallett Dale
1937 Albert von Szent-Györgyi Nagyrápolt
1938 Corneille Jean François Heymans
1939 Gerhard Domagk

There is a subtlety here which we'll get back to below.

Problem: What year did Albert Einstein get his award?

In [17]:
%%sql
SELECT    year, motivation
FROM      nobel_prizes
WHERE     name = 'Albert Einstein'
Running query in 'sqlite:///lect01.sqlite'
Out[17]:
year motivation
1921 for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect

Problem: What year, and in what category did Churchill get his Nobel prize?

In [21]:
%%sql
SELECT   year, category, name
FROM     nobel_prizes
WHERE    name LIKE '%Churchill%'
Running query in 'sqlite:///lect01.sqlite'
Out[21]:
year category name
1953 literature Sir Winston Leonard Spencer Churchill

Most databases have the LIKE predicate, quite a few also have some kind of regular expressions.

BTW: representing a laureate as just a string is very primitive, next time we'll see how to represent compound data (the problem here is that the first rule of relational databases only allows primitive values in our cells).

Selecting only distinct values¶

If we wanted to see what categories are in our nobel_prizes table, we could write:

In [72]:
%%sql
SELECT DISTINCT category
FROM      nobel_prizes
Running query in 'sqlite:///lect01.sqlite'
Out[72]:
category
chemistry
literature
medicine
physics

but we would get many repetitions. To see only the distinct values, we can use a SELECT DISTINCT query:

In [ ]:
%%sql
SELECT DISTINCT  category
FROM             nobel_prizes

Problem: What categories of Nobel prizes were awarded in each of the years between 1910 and 1919?

In [25]:
%%sql
SELECT  DISTINCT  year, category
FROM      nobel_prizes
WHERE     year BETWEEN 1910 AND 1919
Running query in 'sqlite:///lect01.sqlite'
Out[25]:
year category
1910 chemistry
1910 literature
1910 medicine
1910 physics
1911 chemistry
1911 literature
1911 medicine
1911 physics
1912 chemistry
1912 literature
1912 medicine
1912 physics
1913 chemistry
1913 literature
1913 medicine
1913 physics
1914 chemistry
1914 medicine
1914 physics
1915 chemistry
1915 literature
1915 physics
1916 literature
1917 literature
1917 physics
1918 chemistry
1918 physics
1919 literature
1919 medicine
1919 physics

Can you spot which prizes (year, category) were withheld? Next time we'll write a query to find all 'missing' prizes.

SQL scalar functions¶

In the docs, we see that the general form of the SELECT statement allows a result-column, and that a result-column can contain an expr. This opens up many, many possibilities (some of which we will get back to below), one of them is the use of simple scalar functions. We can also use these functions in other places, such as in our WHERE or ORDER BY clauses.

We can se the standard core functions here, try to use them to do the following:

Problem: Show the initial letter of each of the laureates in the year 2023.

In [26]:
%%sql
SELECT   name, substring(name, 1, 1)
FROM     nobel_prizes
WHERE    year = 2023
Running query in 'sqlite:///lect01.sqlite'
Out[26]:
name substring(name, 1, 1)
Anne L'Huillier A
Ferenc Krausz F
Pierre Agostini P
Moungi G. Bawendi M
Louis E. Brus L
Alexey Ekimov A
Katalin Karikó K
Drew Weissman D
Jon Fosse J

Problem: Show the 10 first winners in the following format (see the format-function – unfortunately it will not look great in the notebook…):

1901: chemistry  Jacobus Henricus van 't Hoff
1901: literature Sully Prudhomme
1901: medicine   Emil Adolf von Behring
1901: physics    Wilhelm Conrad Röntgen
1902: chemistry  Hermann Emil Fischer
1902: literature Christian Matthias Theodor Mommsen
1902: medicine   Ronald Ross
1902: physics    Hendrik Antoon Lorentz
1902: physics    Pieter Zeeman
1903: chemistry  Svante August Arrhenius
In [ ]:
%%sql

Problem: List the five laureates with the shortest names.

In [28]:
%%sql
SELECT    year, category, name, length(name)
FROM      nobel_prizes
ORDER BY  length(name)
LIMIT     5
Running query in 'sqlite:///lect01.sqlite'
Out[28]:
year category name length(name)
2012 literature Mo Yan 6
1954 physics Max Born 8
1997 literature Dario Fo 8
2001 medicine Tim Hunt 8
2024 literature Han Kang 8

SQL aggregate functions¶

The functions above were applied to one or more columns in a row, and they returned results for each row. There is another kind of SQL function, called an aggregate function. An aggregate function collapses a whole table (or part of it, see below) into one row – the SQLite docs list some aggregate functions here.

One simple such function is count(X) – it counts the number of times an expression is not NULL in a column, so to see how many Nobel laureates we have in our database, we can write:

In [32]:
%%sql
SELECT    count()
FROM      nobel_prizes
WHERE     year > 2000
Running query in 'sqlite:///lect01.sqlite'
Out[32]:
count()
219

This query returns just one row, with the count (what else could it have returned?).

Each row in our nobel_prize table will have a defined name (there are no NULL values), and we can use an alternative version of the count function, which is just count() (without any argument), and it count each row, without checking possible NULL values.

There is also a nice count()-related trick which can be useful sometimes – if we only want to count how many different values we find in a column (or expression potentially containing several columns), we can use count(DISTINCT X).

Problem: How many different Nobel laurates are there? (We'll assume all laureates have different names, and up until now that's actually true.)

In [34]:
%%sql
SELECT   count(DISTINCT name)
FROM     nobel_prizes
Running query in 'sqlite:///lect01.sqlite'
Out[34]:
count(DISTINCT name)
781

Problem: How many Nobel prizes were awarded in 2023?

In [35]:
%%sql
SELECT    count()
FROM      nobel_prizes
WHERE     year = 2023
Running query in 'sqlite:///lect01.sqlite'
Out[35]:
count()
9

Problem: How many different categories are there in the nobel_prizes table?

In [36]:
%%sql
SELECT    count(DISTINCT category)
FROM      nobel_prizes
Running query in 'sqlite:///lect01.sqlite'
Out[36]:
count(DISTINCT category)
4

Problem: When was the first Nobel prize awarded?

In [37]:
%%sql
SELECT    min(year)
FROM      nobel_prizes
Running query in 'sqlite:///lect01.sqlite'
Out[37]:
min(year)
1901

Exercise: How many Nobel prizes in chemistry have been awarded?

In [ ]:
%%sql

Exercise: How many different Nobel laureates in chemistry are there?

In [ ]:
%%sql

GROUP BY and HAVING¶

Using GROUP BY we can handle rows in groups – to understand how it works, lets first look at the following query:

In [38]:
%%sql
SELECT    category, name
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[38]:
category name
chemistry Arieh Warshel
chemistry Martin Karplus
chemistry Michael Levitt
literature Alice Munro
medicine James E. Rothman
medicine Randy W. Schekman
medicine Thomas C. Südhof
physics François Englert
physics Peter W. Higgs

Here the rows of each category will end up adjacent to each other, but if we want to see how many laureates we have in each category, we can't just call count():

In [39]:
%%sql
SELECT    category, count()
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[39]:
category count()
chemistry 9

since the aggregate function count() collapses the resulting table into one row.

But if we use GROUP BY, we insert an invisible divider between groups of values, and perform any aggregate function only on each 'group':

In [40]:
%%sql
SELECT    category, count()
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[40]:
category count()
chemistry 3
literature 1
medicine 3
physics 2

So, if we apply an aggregate function, such as count(), in a table which we have grouped, the function will be applied to each group, not to the whole table, and instead of getting one count() for the whole table, we get one count() for each group (as above).

If we add name in the first line, we get a somewhat arbitrary result:

In [41]:
%%sql
SELECT    category, count(), name
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[41]:
category count() name
chemistry 3 Arieh Warshel
literature 1 Alice Munro
medicine 3 James E. Rothman
physics 2 François Englert

The category and count is correct, but only one name is shown for each category.

The 'problem' is that we only get one row per group in the output, and that there may be several laureates in each group – our query will return one of them in a seemingly haphazard manner. To alleviate this problem, SQLite defines an aggregate function group_concat, which concatenates all values in the group (it also has the alias string_agg, to match the corresponding function in PostgreSQL):

In [44]:
%%sql
SELECT    category, count(), group_concat(name, ", " ORDER BY name DESC) AS "names"
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[44]:
category count() names
chemistry 3 Michael Levitt, Martin Karplus, Arieh Warshel
literature 1 Alice Munro
medicine 3 Thomas C. Südhof, Randy W. Schekman, James E. Rothman
physics 2 Peter W. Higgs, François Englert

In recent versions of SQLite it's also possible to order the values in an aggregate function such as group_concat by using ORDER BY inside the function call (see the docs).

Observe that there is no problem displaying category in the SELECT-statement above, we get a value which we know will be the same for each row in the group (this is by definition, since that's what we grouped by).

If we're only interested in those categories with less than three laureates, we use HAVING to select only groups with a given property:

In [45]:
%%sql
SELECT    category, count(), group_concat(name, ", ") AS "names"
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3
Running query in 'sqlite:///lect01.sqlite'
Out[45]:
category count() names
literature 1 Alice Munro
physics 2 François Englert, Peter W. Higgs

This corresponds to a WHERE statement, but it applies to groups, not to individual rows (as WHERE does) – so WHERE and HAVING have similar effects (they somehow narrow a search), but they're absolutely not interchangable!

Important (and often misunderstood): In the query above we first have a WHERE statement to select some rows from the whole table, and then group the resulting selection – this can be seen in this diagram:

No description has been provided for this image

Every time we have both a WHERE and a HAVING in the same query, we must first use WHERE to select rows we can group, and then use HAVING to select groups. We can use WITH statements or subqueries (see below) if we want to have it the other way around.

Problem: Has anyone won more than one Nobel prize? We can assume the names of the laureates are unique (so far they are).

In [50]:
%%sql
SELECT    name, group_concat(DISTINCT category)
FROM      nobel_prizes
GROUP BY  name
HAVING    count() > 1
Running query in 'sqlite:///lect01.sqlite'
Out[50]:
name group_concat(DISTINCT category)
Frederick Sanger chemistry
John Bardeen physics
K. Barry Sharpless chemistry
Marie Curie, née Sklodowska physics,chemistry

Problem: How many laureates are there in each category?

In [51]:
%%sql
SELECT    category, count(DISTINCT name)
FROM      nobel_prizes
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[51]:
category count(DISTINCT name)
chemistry 198
literature 123
medicine 232
physics 229

Problem: Which categories have had more than 200 laureates?

In [ ]:
%%sql

Exercise: How many laureates were there each year between 1920 and 1930?

In [ ]:
%%sql

Exercise: Which years saw more than 9 laureates?

In [ ]:
%%sql

Can you also show who the laureates were in those years, and even better, for each of them also show in which category they were awarded.

Exercise: Which have been the 20 years with most laureates? (We don't need to be precise in case of ties.)

In [ ]:
%%sql

Using CASE WHEN¶

When we group values, we can make use of the CASE WHEN construct (see the penultimate track in the diagram below):

No description has been provided for this image

For instance, we can use it to categorize the era for each of the physics laureates having a name beginning with 'A':

In [52]:
%%sql
SELECT   year, name,
         CASE
             WHEN year < 1970 THEN 'ancient era'
             WHEN year <= 2000 THEN 'a long time ago'
             ELSE 'quite recently'
         END AS era
FROM     nobel_prizes
WHERE    category = 'physics' AND name LIKE 'A%'
Running query in 'sqlite:///lect01.sqlite'
Out[52]:
year name era
1903 Antoine Henri Becquerel ancient era
1907 Albert Abraham Michelson ancient era
1921 Albert Einstein ancient era
1927 Arthur Holly Compton ancient era
1964 Aleksandr Mikhailovich Prokhorov ancient era
1966 Alfred Kastler ancient era
1974 Antony Hewish a long time ago
1975 Aage Niels Bohr a long time ago
1978 Arno Allan Penzias a long time ago
1979 Abdus Salam a long time ago
1981 Arthur Leonard Schawlow a long time ago
2003 Alexei A. Abrikosov quite recently
2003 Anthony J. Leggett quite recently
2007 Albert Fert quite recently
2010 Andre Geim quite recently
2011 Adam G. Riess quite recently
2015 Arthur B. McDonald quite recently
2018 Arthur Ashkin quite recently
2020 Andrea Ghez quite recently
2022 Alain Aspect quite recently
2022 Anton Zeilinger quite recently
2023 Anne L'Huillier quite recently

Problem: Use the CASE WHEN construct and GROUP BY to count the number of physics laureates beginning with 'A' in each era.

In [73]:
%%sql
SELECT   CASE
             WHEN year < 1970 THEN 'ancient era'
             WHEN year <= 2000 THEN 'a long time ago'
             ELSE 'quite recently'
         END AS era,
         count()
FROM     nobel_prizes
WHERE    category = 'physics' AND name LIKE 'A%'
GROUP BY  era
Running query in 'sqlite:///lect01.sqlite'
Out[73]:
era count()
a long time ago 5
ancient era 6
quite recently 11

Window functions¶

As we saw above, we can treat partitions of our rows as 'groups', by using GROUP BY.

We also saw than when we use an aggregate function on a group, we collapse whole groups into one row in the output (and when we apply it to a table without groups, the whole table collapses into one row) – but sometimes we want to apply aggregate functions within a group of values, and keep each row in the output.

Above we listed all laureates in 2013, now we want to add one column to the output: for each laureate, we want to show how many laureates shared the prize in her or his category.

If we use GROUP BY and the aggregate function count() on the categories, we would only get one row per category, and using count() without grouping would collapse the whole result into just one row:

In [54]:
%%sql
SELECT    category, name, count() AS count      -- oh no!
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[54]:
category name count
chemistry Arieh Warshel 9

Fortunately, SQL has a way to apply functions only over 'partitions' of our tables, and keep all rows in the output – using the reserved word OVER we can introduce a 'window' of our rows, and apply the function only over this window:

In [59]:
%%sql
SELECT    category,
          name,
          rank() OVER (PARTITION by category ORDER BY length(name)) AS count
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[59]:
category name count
chemistry Arieh Warshel 1
chemistry Martin Karplus 2
chemistry Michael Levitt 2
literature Alice Munro 1
medicine James E. Rothman 1
medicine Thomas C. Südhof 1
medicine Randy W. Schekman 3
physics Peter W. Higgs 1
physics François Englert 2

We can also give our windows names, using an alias:

In [ ]:
%%sql
SELECT    category,
          name,
          count() OVER categories AS count
FROM      nobel_prizes
WHERE     year = 2013
WINDOW    categories AS (PARTITION by category)
ORDER BY  category

The aliased window definitions must come after any WHERE and HAVING, and before any ORDER BY.

So, if we use the reserved word OVER after a function, the function will be applied according to a 'window' (there is more to it than this, but this will suffice for now). In the window we can:

  • define a partition, using PARTITION BY,

  • define an order, using ORDER BY, and

  • define a range, which we can use to define groups of rows relative to each other (but we won't look at ranges in the course).

The function will be applied to each partition, in the same way we applied aggregate functions on groups above, but now we won't collapse the partitions. Observe that the partitioning and ordering are based only on the selection we make (i.e., only those rows which are chosen in our WHERE clause).

We can use our regular aggregate functions as window functions, but there are also a couple of dedicated window functions, such as (there are more, but we won't use them in the course):

  • rank(): ranks rows by the order specified in the window, ties can occur,

  • row_number(): as rank(), but now we avoid ties, and rank by row number in the output, and

  • percent_rank(): gives a value between 0.0 and 1.0 (so it's a bit of a misnomer), giving the row's relative rank within its partition.

You can find more here.

Problem: Add one column which 'ranks' the laureates of 2013 in the table above according to the lengths of their names, within the categories – shorter names should come before longer names.

In [ ]:
%%sql
SELECT    category,
          name,
          count() OVER (PARTITION by category) AS count
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category

Exercise: Above we listed Nobel prize number 32 to 42 in medicine, in chronological order, but our only way to check that our numbering is correct would be to list all the first laureates, and then count manually. Now use a window function to enhance the output from the query above with the chronological order of each laureate.

In [ ]:
%%sql

Exercise: For each laureate with the initial A, list their category, year, name, and 'freshness' within that category, i.e., the most recent laureate in a category is ranked 1, the second most recent laureate is ranked 2, etc. The ranks should be confined to laureates with the initial A (we need some more tools to rank the 'freshness' over the whole category, but will get to it next time).

In [ ]:
%%sql

Some exercises¶

To spice things up a bit, I've included a table with all olympic games since 1896 – the table olympics contains the columns:

  • year
  • city
  • country
  • continent
  • season
  • ordinal_number

If we look carefully at this table, we can find some unnecessary repetition (try to find it!) – we will address this problem during lecture 2, but for now, we'll let it pass.

Exercise: How many olympic games have each continent hosted?

In [62]:
%%sql
SELECT    continent, count()
FROM      olympics
GROUP BY  continent
Running query in 'sqlite:///lect01.sqlite'
Out[62]:
continent count()
Asia 8
Australia 2
Europe 33
North America 12
South America 1

Exercise: When was the first olympic games in each continent?

In [ ]:
%%sql

Exercise: Which countries have hosted the summer olympics more than once?

In [ ]:
%%sql

Exercise: List the continents in descending order by the number of times they've hosted the summer olympics

In [ ]:
%%sql

Exercise: Show a 'histogram' (no actual diagram, just the counts) over the the initial letter of the names of all Nobel laureates (see how big your chance is…).

In [ ]:
%%sql

Exercise: Show a 'histogram' over the the initial letter of the names of all Nobel laureates, for each category.

In [ ]:
%%sql

Exercise: Has anyone won more than one Nobel prize in the same category?

In [ ]:
%%sql

To find out if someone has won in more than one category, we need a subquery (see below).

Exercise: For each olympic game, show how many olympic games had come before it in its continent. (Requires window functions)

In [ ]:
%%sql

Subqueries, Views and Common Table Expressions (CTEs)¶

As we noted above, the result of a SELECT statement is itself a (kind of) table, and we can use such a table inside other SELECT statements.

One useful pattern is:

SELECT ...
FROM   ...
WHERE  ... IN
       (SELECT ...
        FROM ...
        WHERE ...)

The second, nested query is called a subquery.

We'll use a subquery to find all literature laureates who split their prizes – to do it we begin with a regular query, to find which years the Nobel prize for literature were split?

In [65]:
%%sql
SELECT   year, name
FROM     nobel_prizes
WHERE    category = 'literature' AND year IN (
    SELECT  year
    FROM    nobel_prizes
    WHERE   category = 'literature'
    GROUP BY year
    HAVING  count() > 1)
Running query in 'sqlite:///lect01.sqlite'
Out[65]:
year name
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1917 Henrik Pontoppidan
1917 Karl Adolph Gjellerup
1966 Nelly Sachs
1966 Shmuel Yosef Agnon
1974 Eyvind Johnson
1974 Harry Martinson
2025 László Krasznahorkai
2025 Oddput Clementin

… and now we use the result of that query to find out what we're really looking for:

In [ ]:
%%sql

This can be simplified by using either of two ways to define 'temporary tables':

  • views
  • Common Table Expressions (CTE:s).

A view can be seen as a new table, containing the result of a query – to create a view with the years where the literature prize was shared, we can write:

In [66]:
%%sql
CREATE VIEW shared_literature_prize(year) AS
    SELECT  year
    FROM    nobel_prizes
    WHERE   category = 'literature'
    GROUP BY year
    HAVING  count() > 1
Running query in 'sqlite:///lect01.sqlite'
Out[66]:

We can now use our view in a query:

In [68]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize
Running query in 'sqlite:///lect01.sqlite'
Out[68]:
year name
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1917 Henrik Pontoppidan
1917 Karl Adolph Gjellerup
1966 Nelly Sachs
1966 Shmuel Yosef Agnon
1974 Eyvind Johnson
1974 Harry Martinson
2025 László Krasznahorkai
2025 Oddput Clementin

There is only one attribute in our shared_literature_table, and we can simplify this expression:

In [ ]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize

If we have more than one column in the output of the subquery, we must compare it to a tuple within parentheses (this is a quite new addition to SQL, so it might not work everywhere).

The view will be around until we decide to remove it with:

In [ ]:
%%sql
DROP VIEW shared_literature_prize

A view isn't actually a table – each time the view is used it runs its query, which will take some time, but it means we get all updates in the underlying table(s).

A Common Table Expression (or CTE) is like a view, but only defined as part of a query, so we'd write it like:

In [69]:
%%sql
WITH
  shared_literature_prize(year) AS (
    SELECT  year
    FROM    nobel_prizes
    WHERE   category = 'literature'
    GROUP BY year
    HAVING  count() > 1
  )
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize
Running query in 'sqlite:///lect01.sqlite'
Out[69]:
year name
1904 Frédéric Mistral
1904 José Echegaray y Eizaguirre
1917 Henrik Pontoppidan
1917 Karl Adolph Gjellerup
1966 Nelly Sachs
1966 Shmuel Yosef Agnon
1974 Eyvind Johnson
1974 Harry Martinson
2025 László Krasznahorkai
2025 Oddput Clementin

There are some things which makes CTEs very nice:

  • they're defined as part of a SELECT statement (so there is nothing to drop afterwards),
  • since they're part of a SELECT statement, we only need one statement (which will become useful when we call our database remotely, we'll return to that later in the course), and
  • they can be defined recursively (we'll return to that too later in the course).

Problem: Show the years and categories for recurring laureates (i.e., laureates who has won more than once) – use a CTE to do it.

In [ ]:
%%sql

Exercise: Who has won the literature prize in a year when at least one chemistry laureate had a name beginning with 'L'? First try to solve this with a regular subquery:

In [ ]:
%%sql

… and then rewrite it using a CTE:

In [ ]:
%%sql

We saw above that we can't have another WHERE after the HAVING clause:

In [ ]:
%%sql
SELECT    category, count() AS count
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3
WHERE     count > 1        --   <-- not allowed!

but we can make our 'grouping query' into a subquery, and have another WHERE in the outer query:

In [ ]:
%%sql
SELECT category, count
FROM (
    SELECT    category, count() AS count
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3)
WHERE  count > 1

A somewhat tidier way of expressing this is to use a WITH-statement (CTE):

In [ ]:
%%sql
WITH category_count(category, count) AS (
    SELECT    category, count()
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3
)
SELECT category, count
FROM   category_count
WHERE  count > 1

Exercise: How many different laureates do we have in each category? Observe that there are a couple of laureates with two awards in the same category, they should only be counted once – Marie Curie, on the other hand, has won in two categories, so she should be counted in two different categories. Use a CTE to keep track of all laureates in each category.

In [ ]:
%%sql

Exercise: Above we wrote a query to find laureates who hade more than one award in a category, now write a query to find laureates who has won in more than one category.

In [ ]:
%%sql

Problem: There is a very handy function which returns a series of values – it's called generate_series, and you can try to use it to see which years from 1901 until now when no Nobel price at all ware awarded.

In [ ]:
%%sql

Correlated subqueries¶

Another form of subquery allows us to correlate two SELECT statements, so that we use values from an outer SELECT statement inside the SELECT statement in a subquery. As an example, solve the following problem:

Problem: List the names of all laureates who have the longest name of all laureates in their categories (in case of ties, all should be listed) – order by category.

Here we can use a subquery which is 'run' for each row in our main query:

In [ ]:
%%sql
SELECT  category, year, name
FROM    nobel_prizes AS outer_nobel
WHERE   length(name) = (
            SELECT max(length(name))
            FROM   nobel_prizes
            WHERE  category = outer_nobel.category)
ORDER BY category

This is called a correlated subquery (since we refer to the enclosing query inside it). We use an alias to distinguish between the nobel_prizes table in the outer query and the nobel_prizes table in the subquery (it's the same table, but we 'iterate' through it separately).

BTW, we could have skipped the AS in

...
FROM    nobel_prizes AS outer_nobel
...

and just written:

...
FROM    nobel_prizes outer_nobel
...

The general opinion is that we should use AS, as it makes it more obvious that we're defining an alias.