EDAF75 - notebook for lecture 1¶

Author: Christian Söderberg

Week 1: Introduction, relational databases, and SQL¶

In the text below there are two kinds of problems:

Problems marked Problem, which I intend to solve during the lecture. Depending on how fast we progress, I may or may not have time to solve them all – those of the problems we have to skip during the lectures are left as exercises (see below), but we can discuss them during QA sessions.

Problems marked Exercise, which I suggest you solve yourselves (we can also work on them during the QA sessions).

This is a Jupyter notebook, and they have built in support for Julia, Python, and R (hence Ju-Pyt-R), here's some Python code:

In [1]:
def hello(name):
    print(f"hello, {name}!")

def main():
    name = input("What's your name: ")
    hello(name)

main()
hello, world!

You can run the code snippet above by clicking somewhere in the box, and press Shift-Enter.

We're primarily going to run SQL code (see below) in our notebooks, but I'll also show you some Python code later on in the course (you don't have to learn Python to take the course, though).

Using tables to store data¶

If we were to keep track of all Nobel laureates in a Python or Java program, and didn't know about relational databases, we would probably define classes for the laureates, and put them in lists. We could also define classes for the categories, and have one list for each category, or have lists with one element per year, and somehow track all laureates in that year, or use dictionaries/maps. However we chose to keep track of the data, some searches, insertions and deletions would be easy to implement, and some would be cumbersome. We'd also have to be careful to keep our data consistent.

In this course, we'll use a technique which may at first seem too simple to be useful, but which turns out to be incredibly powerful. We're going to use relational databases, and we'll store the data in 'simple' tables. Each table looks like a simple spreadsheet – here is a table with some Nobel laureates:

Year Category Name Motivation
1901 chemistry Jacobus Henricus van 't Hoff in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions
1901 literature Sully Prudhomme in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect
1901 medicine Emil Adolf von Behring for his work on serum therapy, especially its application against diphtheria, by which he has opened a new road in the domain of medical science and thereby placed in the hands of the physician a victorious weapon against illness and deaths
1901 physics Wilhelm Conrad Röntgen in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him
1902 chemistry Hermann Emil Fischer in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses
1902 literature Christian Matthias Theodor Mommsen the greatest living master of the art of historical writing, with special reference to his monumental work, A history of Rome
1902 medicine Ronald Ross for his work on malaria, by which he has shown how it enters the organism and thereby has laid the foundation for successful research on this disease and methods of combating it
1902 physics Hendrik Antoon Lorentz in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena
1902 physics Pieter Zeeman in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena
… … … …

A row represents an item, and a column represents a property of the items – in the example above, each row describes a Nobel prize someone has been awarded, and for each such prize, we have columns showing what year the prize was awarded, in what category, the name of the laureate, and the motivation.

One basic idea of relational databases is that all 'cells' in the table should be simple values (no lists or objects), and that we can use simple operations from relational algebra to get information from it (cells are sometimes allowed to have the value NULL, when a value is missing, we'll return to that below). To work with our tables we use a programming language which is highly specialized for manipulating and extracting information, it is called SQL, which is short hand for Structured Query Language, and it's a Domain Specific Language (DSL) for finding information in tables. SQL can be pronounced as either "S-Q-L", or "sequel".

Setting up this notebook¶

There are many different Relational Database Management Systems (RDMBS:es) which implements SQL, some of the most prominent are:

  • PostgreSQL
  • MariaDB
  • MySQL
  • Oracle
  • Microsoft SQL Server
  • IBM DD2
  • SQLite

Most of these systems are client-server-systems, i.e., they have one program, a SQL server, which handles the data, and clients who communicate with the server in various ways. There are several different kinds of clients:

  • We can run an IDE, which allows us to see our tables in a GUI.

  • We can run command line clients (CLI) – they are text based programs who work like typical REPLs, output will just be text in a terminal window.

  • We can write scripts which we send to the server, often through a CLI.

  • We can run a notebook (such as this one), and have it communicate with our database.

  • We can write code in a general purpose language, and have it communicate with our database.

In the course, we'll try all of these methods to access our databases.

The RDMBS we'll use in the course is SQLite, which is a lightweight but still very powerful system – it is by far the most used RDBMS, and it's probably already running on all of your phones and computers (just as an example, if you use Chrome for browsing, your browsing history is typically saved in a SQL-database file .config/google-chrome/Default/History, and Mozilla use it for storing meta-data in Firefox and Thunderbird). It's actually not a client/server system (instead it is a library which keeps our databases in files on our computer) – but in the course, we'll think of SQLite as if it were a traditional client/server system, because in many ways, it behaves as one.

In order to run SQL in this notebook, we need to load the SQL extension:

In [2]:
%load_ext sql

… and then we want some database to work on (I've put all the databases used for this lecture in a file called lect01.sqlite):

In [3]:
%sql sqlite:///lect01.sqlite
Connecting to 'sqlite:///lect01.sqlite'

Now we're good to go, we just have to prefix our SQL queries with %sql (one line of SQL) or %%sql (several lines of SQL, this is the form we will use in most cases).

Using SQL to find information in a table (SELECT)¶

If we were to find out who were the Nobel laureates in 1910, we could just look through the lines of our table above, and if the rows in the table were ordered by year, it would be easy (that's essentially how mankind has used encyclopedias since they were invented about 2000 years ago).

To search for values in SQL, we use the SELECT statement:

No description has been provided for this image

(the diagram above is taken from SQLite's fabulous web site, you can find out more about the SELECT statement here).

If we follow the lines in the diagram above, we could puzzle together the query:

In [4]:
%%sql
SELECT   category, name
FROM     nobel_prizes
WHERE    year = 1910
Running query in 'sqlite:///lect01.sqlite'
Out[4]:
category name
chemistry Otto Wallach
literature Paul Johann Ludwig Heyse
medicine Albrecht Kossel
physics Johannes Diderik van der Waals

This will return a new table, looking like:

Category Name
chemistry Otto Wallach
literature Paul Johann Ludwig Heyse
medicine Albrecht Kossel
physics Johannes Diderik van der Waals

Nobel prize was first awarded in 1901, so if we look for winners of the 1900 Nobel prizes, we would get an empty table back:

In [5]:
%%sql
SELECT   category, name
FROM     nobel_prizes
WHERE    year = 1900
Running query in 'sqlite:///lect01.sqlite'
Out[5]:
category name

SQL has a NULL value which is sometimes useful (we can use it to signal that one column in a row lacks a value), but observe that there is a huge difference between the empty table returned above, and NULL (if we had a Python or Java function which was given a list of strings and should return all of these strings which contained an uppercase letter, it should return an empty list, not None or NULL, if there was no such string).

NULL values are only used for individual columns, and we could potentially have used it in our Nobel prize database if we had an award for which there was no motivation. Let's say the fictional poet Oddput Clementin was awarded the Nobel prize in literature in 2025 – since there is no way to explain why he got it it, we could set motivation to NULL in that case (we'll return to how we insert data into our tables later).

In [6]:
%%sql
INSERT INTO nobel_prizes(year, category, name, motivation)
VALUES (1900, 'literature', 'Oddput Clementin', NULL)
Running query in 'sqlite:///lect01.sqlite'
1 rows affected.
Out[6]:

The value NULL itself is weird, in SQL the test NULL = NULL doesn't return TRUE or FALSE, it's NULL (but Jupyter writes that as None).

So the query:

In [7]:
%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     motivation = NULL
Running query in 'sqlite:///lect01.sqlite'
Out[7]:
year category name

would always return an empty result.

To see if a value is NULL, we have to use the IS comparison:

In [8]:
%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     motivation IS NULL
Running query in 'sqlite:///lect01.sqlite'
Out[8]:
year category name
1900 literature Oddput Clementin

Despite it's weirdness, NULL can actually be useful in SQL, we'll return to this later.

The fact that we get tables back from our queries is very useful – we'll return to that too in a little while.

If we wanted to see only laureates in physics in 1910, we'd just refine our condition:

In [9]:
%%sql
SELECT   name
FROM     nobel_prizes
WHERE    year = 1910 and category = 'physics'
Running query in 'sqlite:///lect01.sqlite'
Out[9]:
name
Johannes Diderik van der Waals

Problem: How do we get all literature laureates in the 1920ies (see the docs)?

In [13]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature' AND year BETWEEN 1920 AND 1930
Running query in 'sqlite:///lect01.sqlite'
Out[13]:
year name
1920 Knut Pedersen Hamsun
1921 Anatole France
1922 Jacinto Benavente
1923 William Butler Yeats
1924 Wladyslaw Stanislaw Reymont
1925 George Bernard Shaw
1926 Grazia Deledda
1927 Henri Bergson
1928 Sigrid Undset
1929 Thomas Mann
1930 Sinclair Lewis

To limit the number of rows in the output when using jupysql we can set the variable SqlMagic.displaylimit – here we'll turn it off:

In [12]:
%config SqlMagic.displaylimit = None
displaylimit: Value None will be treated as 0 (no limit)

Problem: How do we order the literature laureates in the 1920ies by name?

In [15]:
%%sql
SELECT   year, name
FROM     nobel_prizes
WHERE    category = 'literature' AND year BETWEEN 1920 AND 1929
ORDER BY name
Running query in 'sqlite:///lect01.sqlite'
Out[15]:
year name
1921 Anatole France
1925 George Bernard Shaw
1926 Grazia Deledda
1927 Henri Bergson
1922 Jacinto Benavente
1920 Knut Pedersen Hamsun
1928 Sigrid Undset
1929 Thomas Mann
1923 William Butler Yeats
1924 Wladyslaw Stanislaw Reymont

How do we get the search result above in reverse order?

Problem: Show the first 10 Nobel prizes in chemistry.

In [16]:
%%sql
SELECT   *
FROM     nobel_prizes
WHERE    category = 'chemistry'
LIMIT    10
Running query in 'sqlite:///lect01.sqlite'
Out[16]:
year category name motivation
1901 chemistry Jacobus Henricus van 't Hoff in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions
1902 chemistry Hermann Emil Fischer in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses
1903 chemistry Svante August Arrhenius in recognition of the extraordinary services he has rendered to the advancement of chemistry by his electrolytic theory of dissociation
1904 chemistry Sir William Ramsay in recognition of his services in the discovery of the inert gaseous elements in air, and his determination of their place in the periodic system
1905 chemistry Johann Friedrich Wilhelm Adolf von Baeyer in recognition of his services in the advancement of organic chemistry and the chemical industry, through his work on organic dyes and hydroaromatic compounds
1906 chemistry Henri Moissan in recognition of the great services rendered by him in his investigation and isolation of the element fluorine, and for the adoption in the service of science of the electric furnace called after him
1907 chemistry Eduard Buchner for his biochemical researches and his discovery of cell-free fermentation
1908 chemistry Ernest Rutherford for his investigations into the disintegration of the elements, and the chemistry of radioactive substances
1909 chemistry Wilhelm Ostwald in recognition of his work on catalysis and for his investigations into the fundamental principles governing chemical equilibria and rates of reaction
1910 chemistry Otto Wallach in recognition of his services to organic chemistry and the chemical industry by his pioneer work in the field of alicyclic compounds

Problem: Show Nobel prize number 32 to 42 in medicine (in chronological order).

In [18]:
%%sql
SELECT   year, name
FROM     nobel_prizes
WHERE    category = 'medicine'
ORDER BY name
LIMIT    11
OFFSET   32
Running query in 'sqlite:///lect01.sqlite'
Out[18]:
year name
1928 Charles Jules Henri Nicolle
1907 Charles Louis Alphonse Laveran
2020 Charles M. Rice
1913 Charles Robert Richet
1929 Christiaan Eijkman
1974 Christian de Duve
1995 Christiane Nüsslein-Volhard
1938 Corneille Jean François Heymans
2006 Craig C. Mello
1984 César Milstein
1976 D. Carleton Gajdusek

Problem: What year did Albert Einstein get his award?

In [19]:
%%sql
SELECT    year, motivation
FROM      nobel_prizes
WHERE     name = 'Albert Einstein'
Running query in 'sqlite:///lect01.sqlite'
Out[19]:
year motivation
1921 for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect

Problem: What year, and in what category did Churchill get his award?

In [22]:
%%sql
SELECT   year, name,category, motivation
FROM     nobel_prizes
WHERE    name LIKE '%Churchill%'
Running query in 'sqlite:///lect01.sqlite'
Out[22]:
year name category motivation
1953 Sir Winston Leonard Spencer Churchill literature for his mastery of historical and biographical description as well as for brilliant oratory in defending exalted human values

Most databases have the LIKE predicate, quite a few also have some kind of regular expressions.

Selecting only distinct values¶

If we wanted to see which years the Nobel prize was awarded, we could write:

In [25]:
%%sql
SELECT DISTINCT year
FROM            nobel_prizes
LIMIT           10
Running query in 'sqlite:///lect01.sqlite'
Out[25]:
year
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910

but we would get many repetitions. To see only the distinct values, we can use a SELECT DISTINCT query:

In [ ]:
%%sql
SELECT DISTINCT  year
FROM             nobel_prizes

Problem: What categories are in our database?

In [26]:
%%sql
SELECT DISTINCT  category
FROM             nobel_prizes
Running query in 'sqlite:///lect01.sqlite'
Out[26]:
category
chemistry
literature
medicine
physics

SQL scalar functions¶

In the docs, we see that the general form of the SELECT statement allows a result-column, and that a result-column can contain an expr. This opens up many, many possibilities (some of which we will get back to below), one of them is the use of simple scalar functions. We can also use these functions in other places, such as in our WHERE or ORDER BY clauses.

We can se the standard core functions here, try to use them to do the following:

Problem: Show the initial letter of each of the laureates in the year 2023.

In [27]:
%%sql
SELECT   name, substring(name, 1, 1)
FROM     nobel_prizes
WHERE    year = 2023
Running query in 'sqlite:///lect01.sqlite'
Out[27]:
name substring(name, 1, 1)
Anne L'Huillier A
Ferenc Krausz F
Pierre Agostini P
Moungi G. Bawendi M
Louis E. Brus L
Alexey Ekimov A
Katalin Karikó K
Drew Weissman D
Jon Fosse J

Problem: Show the 10 first winners in the following format (see the format-function – unfortunately it will not look great in the notebook…):

1901: chemistry  Jacobus Henricus van 't Hoff
1901: literature Sully Prudhomme
1901: medicine   Emil Adolf von Behring
1901: physics    Wilhelm Conrad Röntgen
1902: chemistry  Hermann Emil Fischer
1902: literature Christian Matthias Theodor Mommsen
1902: medicine   Ronald Ross
1902: physics    Hendrik Antoon Lorentz
1902: physics    Pieter Zeeman
1903: chemistry  Svante August Arrhenius
In [ ]:
%%sql

Problem: List the five laureates with the shortest names.

In [30]:
%%sql
SELECT     name, year, category
FROM       nobel_prizes
ORDER BY   length(name) DESC
LIMIT      5
Running query in 'sqlite:///lect01.sqlite'
Out[30]:
name year category
Count Maurice (Mooris) Polidore Marie Bernhard Maeterlinck 1911 literature
Prince Louis-Victor Pierre Raymond de Broglie 1929 physics
Antonio Caetano de Abreu Freire Egas Moniz 1949 medicine
Johann Friedrich Wilhelm Adolf von Baeyer 1905 chemistry
Hans Karl August Simon von Euler-Chelpin 1929 chemistry

SQL aggregate functions¶

The functions above was applied to one or more columns in a row, and they returned results for each row. There is another kind of SQL function, called aggregate function. An aggregate function collapses a whole table (or part of it, see below) into one row – the SQLite docs list some aggregate functions here.

One simple such function is count(X) – it counts the number of times a values is not NULL in a column, so to see how many Nobel laureates we have in our database, we can write:

In [33]:
%%sql
SELECT    count(motivation)
FROM      nobel_prizes;
Running query in 'sqlite:///lect01.sqlite'
Out[33]:
count(motivation)
774

This query returns just one row, with the count (what else could it have returned?).

Problem: How many Nobel prizes were awarded in 2023?

In [35]:
%%sql
SELECT   COUNT(name)
FROM     nobel_prizes
WHERE    year = 2023
Running query in 'sqlite:///lect01.sqlite'
Out[35]:
COUNT(name)
9

Problem: When was the first Nobel prize awarded?

In [34]:
%%sql
SELECT   MIN(year)
FROM     nobel_prizes
Running query in 'sqlite:///lect01.sqlite'
Out[34]:
MIN(year)
1900

Exercise: How many Nobel prizes in chemistry have been awarded?

In [ ]:
%%sql

GROUP BY and HAVING¶

Using GROUP BY we can handle rows in groups – to understand how it works, lets first look at the following query:

In [36]:
%%sql
SELECT    year, category, name
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[36]:
year category name
2013 chemistry Arieh Warshel
2013 chemistry Martin Karplus
2013 chemistry Michael Levitt
2013 literature Alice Munro
2013 medicine James E. Rothman
2013 medicine Randy W. Schekman
2013 medicine Thomas C. Südhof
2013 physics François Englert
2013 physics Peter W. Higgs

Here the rows of each category will end up adjacent to each other, and using GROUP BY we insert an invisible divider between the groups, and perform any aggregate function on the whole 'group':

In [37]:
%%sql
SELECT    category, count()
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[37]:
category count()
chemistry 3
literature 1
medicine 3
physics 2

So, if we apply an aggregate function, such as count(), in a table which we have grouped, the function will be applied to each group, not to the whole table. Instead of getting one count() for the whole table (it would be a single value), we get one count() for each group (as above).

If we add name in the first line, we get a somewhat arbitrary result:

In [38]:
%%sql
SELECT    category, count(), name
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[38]:
category count() name
chemistry 3 Arieh Warshel
literature 1 Alice Munro
medicine 3 James E. Rothman
physics 2 François Englert

The category and count is correct, but only one name is shown for each category.

The 'problem' is that we only get one row per group in the output, and that there may be several laureates in each group – our query will return one of them in a seemingly haphazard manner. To alleviate this problem, SQLite defines an aggregate function group_concat, which concatenates all values in the group (it also has the alias string_agg, to match the corresponding function in PostgreSQL):

In [40]:
%%sql
SELECT    category, count(), group_concat(name, ", " ORDER BY name DESC) AS "names"
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[40]:
category count() names
chemistry 3 Michael Levitt, Martin Karplus, Arieh Warshel
literature 1 Alice Munro
medicine 3 Thomas C. Südhof, Randy W. Schekman, James E. Rothman
physics 2 Peter W. Higgs, François Englert

In recent versions of SQLite it's also possible to order the values in an aggregate function such as group_concat by using ORDER BY inside the function call (see the docs).

Observe that there is no problem displaying category in the SELECT-statement above, we get a value which we know will be the same for each row in the group (this is by definition, since that's what we grouped by).

If we're only interested in those categories with less than three laureates, we use HAVING to select only groups with a given property:

In [41]:
%%sql
SELECT    category, count(), group_concat(name, ", ") AS "names"
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3
Running query in 'sqlite:///lect01.sqlite'
Out[41]:
category count() names
literature 1 Alice Munro
physics 2 François Englert, Peter W. Higgs

This corresponds to a WHERE statement, but it applies to groups, not to individual rows (as WHERE does) – so WHERE and HAVING have similar effects (they somehow narrow a search), but they're absolutely not interchangable!

Important (and often misunderstood): In the query above we first have a WHERE statement to select some rows from the whole table, and then group the resulting selection – this can be seen in this diagram:

No description has been provided for this image

Every time we have both a WHERE and a HAVING in the same query, we must first use WHERE to select rows we can group, and then use HAVING to select groups. We can use WITH statements or subqueries (see below) if we want to have it the other way around.

Problem: Has anyone won more than one Nobel prize? We can assume the names of the laureates are unique (so far they are!).

In [45]:
%%sql
SELECT    name, group_concat(format('%s in %s', category, year) ORDER BY year)
FROM      nobel_prizes
GROUP BY  name
HAVING    COUNT() > 1
ORDER BY  name
Running query in 'sqlite:///lect01.sqlite'
Out[45]:
name group_concat(format('%s in %s', category, year) ORDER BY year)
Frederick Sanger chemistry in 1958,chemistry in 1980
John Bardeen physics in 1956,physics in 1972
K. Barry Sharpless chemistry in 2001,chemistry in 2022
Marie Curie, née Sklodowska physics in 1903,chemistry in 1911

Problem: How many laureates are there in each category?

In [47]:
%%sql
SELECT   category, COUNT()
FROM     nobel_prizes
GROUP BY category
Running query in 'sqlite:///lect01.sqlite'
Out[47]:
category COUNT()
chemistry 197
literature 122
medicine 229
physics 227

Problem: Which categories have had more than 200 laureates?

In [48]:
%%sql
SELECT    category, COUNT()
FROM      nobel_prizes
GROUP BY  category
HAVING    COUNT() > 200
Running query in 'sqlite:///lect01.sqlite'
Out[48]:
category COUNT()
medicine 229
physics 227

Exercise: How many laureates were there each year between 1920 and 1930?

In [ ]:
%%sql

Exercise: Which years saw more than 9 laureates?

In [ ]:
%%sql

Exercise: Which have been the 20 years with most laureates? (We don't need to be precise in case of ties.)

In [ ]:
%%sql

Using CASE WHEN¶

When we group values, we can make use of the CASE WHEN construct (see the penultimate track in the diagram below):

No description has been provided for this image

For instance, we can use it to categorize the era for each of the physics laureates having a name beginning with 'A':

In [49]:
%%sql
SELECT   year, name,
         CASE
             WHEN year < 1970 THEN 'ancient era'
             WHEN year <= 2000 THEN 'a long time ago'
             ELSE 'quite recently'
         END AS era
FROM     nobel_prizes
WHERE    category = 'physics' AND name LIKE 'A%'
Running query in 'sqlite:///lect01.sqlite'
Out[49]:
year name era
1903 Antoine Henri Becquerel ancient era
1907 Albert Abraham Michelson ancient era
1921 Albert Einstein ancient era
1927 Arthur Holly Compton ancient era
1964 Aleksandr Mikhailovich Prokhorov ancient era
1966 Alfred Kastler ancient era
1974 Antony Hewish a long time ago
1975 Aage Niels Bohr a long time ago
1978 Arno Allan Penzias a long time ago
1979 Abdus Salam a long time ago
1981 Arthur Leonard Schawlow a long time ago
2003 Alexei A. Abrikosov quite recently
2003 Anthony J. Leggett quite recently
2007 Albert Fert quite recently
2010 Andre Geim quite recently
2011 Adam G. Riess quite recently
2015 Arthur B. McDonald quite recently
2018 Arthur Ashkin quite recently
2020 Andrea Ghez quite recently
2022 Alain Aspect quite recently
2022 Anton Zeilinger quite recently
2023 Anne L'Huillier quite recently

Problem: Use the CASE WHEN construct and GROUP BY to count the number of physics laureates beginning with 'A' in each era.

In [ ]:
%%sql

A short intro to window functions¶

As we saw above, we can treat partitions of our rows as 'groups', by using GROUP BY.

We also saw than when we use an aggregate function on a group, we collapse whole groups into one row in the output (and when we apply it to a table without groups, the whole table collapses) – but sometimes we want to apply aggregate functions within a group of values, and keep each row in the output.

Above we listed all laureates in 2013, now we want to add one column to the output: for each laureate, we want to show how many laureates shared the prize in her or his category.

If we use GROUP BY and the aggregate function count() on the categories, we would only get one row per category, and using count() without grouping would collapse the whole result into just one row:

In [50]:
%%sql
SELECT    year, category, name, count() AS count      -- oh no!
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[50]:
year category name count
2013 chemistry Arieh Warshel 9

Fortunately, SQL has quite recently introduced a way to apply functions only over 'partitions' of our tables, and keep all rows in the output – using the reserved word OVER we can introduce a window of our rows, and apply the function only over this 'window':

In [51]:
%%sql
SELECT    year,
          category,
          name,
          count() OVER (PARTITION by category) AS count
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[51]:
year category name count
2013 chemistry Arieh Warshel 3
2013 chemistry Martin Karplus 3
2013 chemistry Michael Levitt 3
2013 literature Alice Munro 1
2013 medicine James E. Rothman 3
2013 medicine Randy W. Schekman 3
2013 medicine Thomas C. Südhof 3
2013 physics François Englert 2
2013 physics Peter W. Higgs 2

We can also give our windows names, using an alias:

In [ ]:
%%sql
SELECT    year,
          category,
          name,
          count() OVER categories AS count
FROM      nobel_prizes
WHERE     year = 2013
WINDOW    categories AS (PARTITION by category)
ORDER BY  category

The aliased window definitions must come after any WHERE and HAVING, and before any ORDER BY.

So, if we use the reserved word OVER after a function, the function will be applied according to a 'window' (there is more to it than this, but this will suffice for now). In the window we can:

  • define a partition, using PARTITION BY,

  • define an order, using ORDER BY, and

  • define a range, which we can use to define groups of rows relative to each other (but we won't look at ranges in the course).

The function will be applied to each partition, in the same way we applied aggregate functions on groups above, but now we won't collapse the partitions. Observe that the partitioning and ordering are based only on the selection we make (i.e., only those rows which are chosen in our WHERE clause).

We can use our regular aggregate functions as window functions, but there are also a couple of dedicated window functions, such as (there are more, but we won't use them in the course):

  • rank(): ranks rows by the order specified in the window, ties can occur,

  • row_number(): as rank(), but now we avoid ties, and rank by row number in the output, and

  • percent_rank(): gives a value between 0.0 and 1.0 (so it's a bit of a misnomer), giving the row's relative rank within its partition.

You can find more here.

Window functions can be very powerful, but we'll not delve too deeply into them in the course – I want you to be aware of them, though!

Problem: Add one column which 'ranks' the laureates of 2013 in the table above according to the lengths of their names, within the categories – shorter names should come before longer names.

In [57]:
%%sql
SELECT    year,
          category,
          name,
          RANK() OVER (PARTITION by category ORDER BY length(name)) AS rank
FROM      nobel_prizes
WHERE     year = 2013
ORDER BY  category
Running query in 'sqlite:///lect01.sqlite'
Out[57]:
year category name rank
2013 chemistry Arieh Warshel 1
2013 chemistry Martin Karplus 2
2013 chemistry Michael Levitt 2
2013 literature Alice Munro 1
2013 medicine James E. Rothman 1
2013 medicine Thomas C. Südhof 1
2013 medicine Randy W. Schekman 3
2013 physics Peter W. Higgs 1
2013 physics François Englert 2

Exercise: For each laureate with the initial A, list their category, year, name, and 'freshness' within that category, i.e., the most recent laureate in a category is ranked 1, the second most recent laureate is ranked 2, etc. The ranks should be confined to laureates with the initial A.

In [ ]:
%%sql

Some exercises¶

To spice things up a bit, I've included a table with all olympic games since 1896 – the table olympics contains the columns:

  • year
  • city
  • country
  • continent
  • season
  • ordinal_number

If we look carefully at this table, we can find some unnecessary repetition (try to find it!) – we will address this problem during lecture 2, but for now, we'll let it pass.

Exercise: How many olympic games have each continent hosted?

In [54]:
%%sql
SELECT   *
FROM     olympics
Running query in 'sqlite:///lect01.sqlite'
Out[54]:
year city country continent season ordinal_number
1924 Chamonix France Europe winter I
1928 St. Moritz Switzerland Europe winter II
1932 Lake Placid United States North America winter III
1936 Garmisch-Partenkirchen Germany Europe winter IV
1948 St. Moritz Switzerland Europe winter V
1952 Oslo Norway Europe winter VI
1956 Cortina d'Ampezzo Italy Europe winter VII
1960 Squaw Valley United States North America winter VIII
1964 Innsbruck Austria Europe winter IX
1968 Grenoble France Europe winter X
1972 Sapporo Japan Asia winter XI
1976 Innsbruck Austria Europe winter XII
1980 Lake Placid United States North America winter XIII
1984 Sarajevo Yugoslavia Europe winter XIV
1988 Calgary Canada North America winter XV
1992 Albertville France Europe winter XVI
1994 Lillehammer Norway Europe winter XVII
1998 Nagano Japan Asia winter XVIII
2002 Salt Lake City United States North America winter XIX
2006 Turin Italy Europe winter XX
2010 Vancouver Canada North America winter XXI
2014 Sochi Russia Europe winter XXII
2018 Pyeongchang South Korea Asia winter XXIII
2022 Beijing China Asia winter XXIV
1896 Athens Greece Europe summer I
1900 Paris France Europe summer II
1904 St. Louis United States North America summer III
1908 London United Kingdom Europe summer IV
1912 Stockholm Sweden Europe summer V
1916 Berlin Germany Europe summer VI
1920 Antwerp Belgium Europe summer VII
1924 Paris France Europe summer VIII
1928 Amsterdam Netherlands Europe summer IX
1932 Los Angeles United States North America summer X
1936 Berlin Germany Europe summer XI
1948 London United Kingdom Europe summer XIV
1952 Helsinki Finland Europe summer XV
1956 Melbourne Australia Australia summer XVII
1960 Rome Italy Europe summer XVII
1964 Tokyo Japan Asia summer XVIII
1968 Mexico City Mexico North America summer XIX
1972 Munich West Germany Europe summer XX
1976 Montreal Canada North America summer XXI
1980 Moscow Soviet Union Europe summer XXII
1984 Los Angeles United States North America summer XXIII
1988 Seoul South Korea Asia summer XXIV
1992 Barcelona Spain Europe summer XXV
1996 Atlanta United States North America summer XXVI
2000 Sydney Australia Australia summer XXVII
2004 Athens Greece Europe summer XXVIII
2008 Beijing China Asia summer XXIX
2012 London United Kingdom Europe summer XXX
2016 Rio de Janeiro Brazil South America summer XXXI
2020 Tokyo Japan Asia summer XXXII
2024 Paris France Europe summer XXXIII

Exercise: When was the first olympic games in each continent?

In [56]:
%%sql
SELECT    continent, MIN(year)
FROM      olympics
GROUP BY  continent
Running query in 'sqlite:///lect01.sqlite'
Out[56]:
continent MIN(year)
Asia 1964
Australia 1956
Europe 1896
North America 1904
South America 2016

Exercise: Which countries have hosted the summer olympics more than once?

In [ ]:
%%sql

Exercise: List the continents in descending order by the number of times they've hosted the summer olympics

In [ ]:
%%sql

Exercise: Show a 'histogram' (no actual diagram, just the counts) over the the initial letter of the names of all Nobel laureates (see how big your chance is…).

In [ ]:
%%sql

Exercise: Show a 'histogram' over the the initial letter of the names of all Nobel laureates, for each category.

In [ ]:
%%sql

Exercise: Has anyone won more than one Nobel prize in the same category?

In [ ]:
%%sql

Exercise: Has anyone won Nobel prizes in different categories?

In [ ]:
%%sql

Exercise: For each olympic game, show how many olympic games had come before it in its continent. (Requires window functions)

In [ ]:
%%sql

Subqueries, Views and Common Table Expressions (CTEs)¶

As we noted above, the result of a SELECT statement is itself a (kind of) table, and we can use such a table inside other SELECT statements.

One useful pattern is:

SELECT ...
FROM   ...
WHERE  ... IN
       (SELECT ...
        FROM ...
        WHERE ...)

The second, nested query is called a subquery.

We'll use a subquery to find all literature laureates who split their prizes – to do it we begin with a regular query, to find which years the Nobel prize for literature were split?

In [ ]:
%%sql

… and now we use the result of that query to find out what we're really looking for:

In [ ]:
%%sql

This can be simplified by using either of two ways to define 'temporary tables':

  • views
  • Common Tabale Expressions.

A view can be seen as a new table, containing the result of a query – to create a view with the years where the literature prize was shared, we can write:

In [ ]:
%%sql
CREATE VIEW shared_literature_prize(year) AS
  SELECT ... -- just as above...

We can now use our view in a query:

In [ ]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN (
            SELECT    year
            FROM      shared_literature_prize
          )

Since there is only one attribute in our shared_literature_table, we can simplify this expression:

In [ ]:
%%sql
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize

The view will be around until we decide to remove it with:

In [ ]:
%%sql
DROP VIEW shared_literature_prize

A Common Table Expression (or CTE) is like a view, but only defined in one query, so we'd write it like:

In [ ]:
%%sql
WITH
  shared_literature_prize(year) AS (
    SELECT ... -- as above...
  )
SELECT    year, name
FROM      nobel_prizes
WHERE     category = 'literature'
          AND year IN shared_literature_prize

There are some things which makes CTEs very nice:

  • they're defined as part of a SELECT statement (so there is nothing to drop afterwards),
  • since they're part of a SELECT statement, we only need one statement (which will become useful when we call our database remotely, we'll return to that later in the course), and
  • they can be defined recursively (we'll return to that later in the course).

Problem: Show the years and categories for recurring laureates (i.e., laureates who has won more than once) – use a CTE to do it.

In [ ]:
%%sql

Exercise: Who has won the literature prize in a year when at least one chemistry laureate had a name beginning with 'L'? First try to solve this with a regular subquery, and then rewrite it using a CTE.

In [ ]:
%%sql

We saw above that we can't have another WHERE after the HAVING clause:

In [ ]:
%%sql
SELECT    category, count() AS count
FROM      nobel_prizes
WHERE     year = 2013
GROUP BY  category
HAVING    count() < 3
WHERE     count > 1        --   <-- not allowed!

but we can make our 'grouping query' into a subquery, and have another WHERE in the outer query:

In [ ]:
%%sql
SELECT category, count
FROM (
    SELECT    category, count() AS count
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3)
WHERE  count > 1

A somewhat tidier way of expressing this is to use a WITH-statement (CTE):

In [ ]:
%%sql
WITH category_count(category, count) AS (
    SELECT    category, count()
    FROM      nobel_prizes
    WHERE     year = 2013
    GROUP BY  category
    HAVING    count() < 3
)
SELECT category, count
FROM   category_count
WHERE  count > 1

Problem: Show which years there were prices in some category, but not in medicine (you can do it either with or without CTEs, but use a subquery).

In [ ]:
%%sql

Problem: There is a 'funny' function which returns series of values – it's called generate_series, and you can try to use it to see which year no nobel price at all was awarded.

In [ ]:
%%sql

Correlated subqueries¶

Another form of subquery allows us to correlate two SELECT statements, so that we use values from an outer SELECT statement inside the SELECT statement in a subquery. As an example, solve the following problem:

Problem: List the names of all laureates who has the longest name of all laureates in their category (in case of ties, all should be listed) – order by category.

Here we can use a subquery which is 'run' for each row in our main query:

In [ ]:
%%sql
SELECT  category, year, name
FROM    nobel AS outer_nobel
WHERE   length(name) = (
            SELECT max(length(name))
            FROM   nobel
            WHERE  category = outer_nobel.category)
ORDER BY category

This is called a correlated subquery (since we refer to the enclosing query inside it). We use an alias to distinguish between the nobel table in the outer query and the nobel table in the subquery (it's the same table, but we 'iterate' through it separately).

BTW, we could have skipped the AS in

...
FROM    nobel AS outer_nobel
...

and just written:

...
FROM    nobel outer_nobel
...

The general opinion is that we should use AS, as it makes it more obvious that we're defining an alias.