EDAF75 - notebook for lecture 1¶
Author: Christian Söderberg
Week 1: Introduction, relational databases, and SQL¶
In the text below there are two kinds of problems:
Problems marked Problem, which I intend to solve during the lecture. Depending on how fast we progress, I may or may not have time to solve them all – those of the problems we have to skip during the lectures are left as exercises (see below), but we can discuss them during QA sessions.
Problems marked Exercise, which I suggest you solve yourselves (we can also work on them during the QA sessions).
This is a Jupyter notebook, and they have built in support for Julia, Python, and R (hence Ju-Pyt-R), here's some Python code:
def hello(name):
print(f"hello, {name}!")
def main():
name = input("What's your name: ")
hello(name)
main()
hello, world!
You can run the code snippet above by clicking somewhere in the box, and press Shift-Enter.
We're primarily going to run SQL code (see below) in our notebooks, but I'll also show you some Python code later on in the course (you don't have to learn Python to take the course, though).
Using tables to store data¶
If we were to keep track of all Nobel laureates in a Python or Java program, and didn't know about relational databases, we would probably define classes for the laureates, and put them in lists. We could also define classes for the categories, and have one list for each category, or have lists with one element per year, and somehow track all laureates in that year, or use dictionaries/maps. However we chose to keep track of the data, some searches, insertions and deletions would be easy to implement, and some would be cumbersome. We'd also have to be careful to keep our data consistent.
In this course, we'll use a technique which may at first seem too simple to be useful, but which turns out to be incredibly powerful. We're going to use relational databases, and we'll store the data in 'simple' tables. Each table looks like a simple spreadsheet – here is a table with some Nobel laureates:
Year | Category | Name | Motivation |
---|---|---|---|
1901 | chemistry | Jacobus Henricus van 't Hoff | in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions |
1901 | literature | Sully Prudhomme | in special recognition of his poetic composition, which gives evidence of lofty idealism, artistic perfection and a rare combination of the qualities of both heart and intellect |
1901 | medicine | Emil Adolf von Behring | for his work on serum therapy, especially its application against diphtheria, by which he has opened a new road in the domain of medical science and thereby placed in the hands of the physician a victorious weapon against illness and deaths |
1901 | physics | Wilhelm Conrad Röntgen | in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him |
1902 | chemistry | Hermann Emil Fischer | in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses |
1902 | literature | Christian Matthias Theodor Mommsen | the greatest living master of the art of historical writing, with special reference to his monumental work, A history of Rome |
1902 | medicine | Ronald Ross | for his work on malaria, by which he has shown how it enters the organism and thereby has laid the foundation for successful research on this disease and methods of combating it |
1902 | physics | Hendrik Antoon Lorentz | in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena |
1902 | physics | Pieter Zeeman | in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena |
… | … | … | … |
A row represents an item, and a column represents a property of the items – in the example above, each row describes a Nobel prize someone has been awarded, and for each such prize, we have columns showing what year the prize was awarded, in what category, the name of the laureate, and the motivation.
One basic idea of relational databases is that all 'cells' in the table should be simple values (no lists or objects), and that we can use simple operations from relational algebra to get information from it (cells are sometimes allowed to have the value NULL
, when a value is missing, we'll return to that below).
To work with our tables we use a programming language which is highly specialized for manipulating and extracting information, it is called SQL, which is short hand for Structured Query Language, and it's a Domain Specific Language (DSL) for finding information in tables.
SQL can be pronounced as either "S-Q-L", or "sequel".
Setting up this notebook¶
There are many different Relational Database Management Systems (RDMBS:es) which implements SQL, some of the most prominent are:
Most of these systems are client-server-systems, i.e., they have one program, a SQL server, which handles the data, and clients who communicate with the server in various ways. There are several different kinds of clients:
We can run an IDE, which allows us to see our tables in a GUI.
We can run command line clients (CLI) – they are text based programs who work like typical REPLs, output will just be text in a terminal window.
We can write scripts which we send to the server, often through a CLI.
We can run a notebook (such as this one), and have it communicate with our database.
We can write code in a general purpose language, and have it communicate with our database.
In the course, we'll try all of these methods to access our databases.
The RDMBS we'll use in the course is SQLite, which is a lightweight but still very powerful system – it is by far the most used RDBMS, and it's probably already running on all of your phones and computers (just as an example, if you use Chrome for browsing, your browsing history is typically saved in a SQL-database file .config/google-chrome/Default/History
, and Mozilla use it for storing meta-data in Firefox and Thunderbird).
It's actually not a client/server system (instead it is a library which keeps our databases in files on our computer) – but in the course, we'll think of SQLite as if it were a traditional client/server system, because in many ways, it behaves as one.
In order to run SQL in this notebook, we need to load the SQL extension:
%load_ext sql
… and then we want some database to work on (I've put all the databases used for this lecture in a file called lect01.sqlite
):
%sql sqlite:///lect01.sqlite
Now we're good to go, we just have to prefix our SQL queries with %sql
(one line of SQL) or %%sql
(several lines of SQL, this is the form we will use in most cases).
Using SQL to find information in a table (SELECT
)¶
If we were to find out who were the Nobel laureates in 1910, we could just look through the lines of our table above, and if the rows in the table were ordered by year, it would be easy (that's essentially how mankind has used encyclopedias since they were invented about 2000 years ago).
To search for values in SQL, we use the SELECT
statement:
(the diagram above is taken from SQLite's fabulous web site, you can find out more about the SELECT
statement here).
If we follow the lines in the diagram above, we could puzzle together the query:
%%sql
SELECT category, name
FROM nobel_prizes
WHERE year = 1910
category | name |
---|---|
chemistry | Otto Wallach |
literature | Paul Johann Ludwig Heyse |
medicine | Albrecht Kossel |
physics | Johannes Diderik van der Waals |
This will return a new table, looking like:
Category | Name |
---|---|
chemistry | Otto Wallach |
literature | Paul Johann Ludwig Heyse |
medicine | Albrecht Kossel |
physics | Johannes Diderik van der Waals |
Nobel prize was first awarded in 1901, so if we look for winners of the 1900 Nobel prizes, we would get an empty table back:
%%sql
SELECT category, name
FROM nobel_prizes
WHERE year = 1900
category | name |
---|
SQL has a NULL
value which is sometimes useful (we can use it to signal that one column in a row lacks a value), but observe that there is a huge difference between the empty table returned above, and NULL
(if we had a Python or Java function which was given a list of strings and should return all of these strings which contained an uppercase letter, it should return an empty list, not None
or NULL
, if there was no such string).
NULL
values are only used for individual columns, and we could potentially have used it in our Nobel prize database if we had an award for which there was no motivation.
Let's say the fictional poet Oddput Clementin was awarded the Nobel prize in literature in 2025 – since there is no way to explain why he got it it, we could set motivation
to NULL
in that case (we'll return to how we insert data into our tables later).
%%sql
INSERT INTO nobel_prizes(year, category, name, motivation)
VALUES (1900, 'literature', 'Oddput Clementin', NULL)
The value NULL
itself is weird, in SQL the test NULL = NULL
doesn't return TRUE
or FALSE
, it's NULL
(but Jupyter writes that as None
).
So the query:
%%sql
SELECT year, category, name
FROM nobel_prizes
WHERE motivation = NULL
year | category | name |
---|
would always return an empty result.
To see if a value is NULL
, we have to use the IS
comparison:
%%sql
SELECT year, category, name
FROM nobel_prizes
WHERE motivation IS NULL
year | category | name |
---|---|---|
1900 | literature | Oddput Clementin |
Despite it's weirdness, NULL
can actually be useful in SQL, we'll return to this later.
The fact that we get tables back from our queries is very useful – we'll return to that too in a little while.
If we wanted to see only laureates in physics in 1910, we'd just refine our condition:
%%sql
SELECT name
FROM nobel_prizes
WHERE year = 1910 and category = 'physics'
name |
---|
Johannes Diderik van der Waals |
Problem: How do we get all literature laureates in the 1920ies (see the docs)?
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature' AND year BETWEEN 1920 AND 1930
year | name |
---|---|
1920 | Knut Pedersen Hamsun |
1921 | Anatole France |
1922 | Jacinto Benavente |
1923 | William Butler Yeats |
1924 | Wladyslaw Stanislaw Reymont |
1925 | George Bernard Shaw |
1926 | Grazia Deledda |
1927 | Henri Bergson |
1928 | Sigrid Undset |
1929 | Thomas Mann |
1930 | Sinclair Lewis |
To limit the number of rows in the output when using jupysql
we can set the variable SqlMagic.displaylimit
– here we'll turn it off:
%config SqlMagic.displaylimit = None
Problem: How do we order the literature laureates in the 1920ies by name?
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature' AND year BETWEEN 1920 AND 1929
ORDER BY name
year | name |
---|---|
1921 | Anatole France |
1925 | George Bernard Shaw |
1926 | Grazia Deledda |
1927 | Henri Bergson |
1922 | Jacinto Benavente |
1920 | Knut Pedersen Hamsun |
1928 | Sigrid Undset |
1929 | Thomas Mann |
1923 | William Butler Yeats |
1924 | Wladyslaw Stanislaw Reymont |
How do we get the search result above in reverse order?
Problem: Show the first 10 Nobel prizes in chemistry.
%%sql
SELECT *
FROM nobel_prizes
WHERE category = 'chemistry'
LIMIT 10
year | category | name | motivation |
---|---|---|---|
1901 | chemistry | Jacobus Henricus van 't Hoff | in recognition of the extraordinary services he has rendered by the discovery of the laws of chemical dynamics and osmotic pressure in solutions |
1902 | chemistry | Hermann Emil Fischer | in recognition of the extraordinary services he has rendered by his work on sugar and purine syntheses |
1903 | chemistry | Svante August Arrhenius | in recognition of the extraordinary services he has rendered to the advancement of chemistry by his electrolytic theory of dissociation |
1904 | chemistry | Sir William Ramsay | in recognition of his services in the discovery of the inert gaseous elements in air, and his determination of their place in the periodic system |
1905 | chemistry | Johann Friedrich Wilhelm Adolf von Baeyer | in recognition of his services in the advancement of organic chemistry and the chemical industry, through his work on organic dyes and hydroaromatic compounds |
1906 | chemistry | Henri Moissan | in recognition of the great services rendered by him in his investigation and isolation of the element fluorine, and for the adoption in the service of science of the electric furnace called after him |
1907 | chemistry | Eduard Buchner | for his biochemical researches and his discovery of cell-free fermentation |
1908 | chemistry | Ernest Rutherford | for his investigations into the disintegration of the elements, and the chemistry of radioactive substances |
1909 | chemistry | Wilhelm Ostwald | in recognition of his work on catalysis and for his investigations into the fundamental principles governing chemical equilibria and rates of reaction |
1910 | chemistry | Otto Wallach | in recognition of his services to organic chemistry and the chemical industry by his pioneer work in the field of alicyclic compounds |
Problem: Show Nobel prize number 32 to 42 in medicine (in chronological order).
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'medicine'
ORDER BY name
LIMIT 11
OFFSET 32
year | name |
---|---|
1928 | Charles Jules Henri Nicolle |
1907 | Charles Louis Alphonse Laveran |
2020 | Charles M. Rice |
1913 | Charles Robert Richet |
1929 | Christiaan Eijkman |
1974 | Christian de Duve |
1995 | Christiane Nüsslein-Volhard |
1938 | Corneille Jean François Heymans |
2006 | Craig C. Mello |
1984 | César Milstein |
1976 | D. Carleton Gajdusek |
Problem: What year did Albert Einstein get his award?
%%sql
SELECT year, motivation
FROM nobel_prizes
WHERE name = 'Albert Einstein'
year | motivation |
---|---|
1921 | for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect |
Problem: What year, and in what category did Churchill get his award?
%%sql
SELECT year, name,category, motivation
FROM nobel_prizes
WHERE name LIKE '%Churchill%'
year | name | category | motivation |
---|---|---|---|
1953 | Sir Winston Leonard Spencer Churchill | literature | for his mastery of historical and biographical description as well as for brilliant oratory in defending exalted human values |
Most databases have the LIKE
predicate, quite a few also have some kind of regular expressions.
Selecting only distinct values¶
If we wanted to see which years the Nobel prize was awarded, we could write:
%%sql
SELECT DISTINCT year
FROM nobel_prizes
LIMIT 10
year |
---|
1901 |
1902 |
1903 |
1904 |
1905 |
1906 |
1907 |
1908 |
1909 |
1910 |
but we would get many repetitions. To see only the distinct values, we can use a SELECT DISTINCT
query:
%%sql
SELECT DISTINCT year
FROM nobel_prizes
Problem: What categories are in our database?
%%sql
SELECT DISTINCT category
FROM nobel_prizes
category |
---|
chemistry |
literature |
medicine |
physics |
SQL scalar functions¶
In the docs, we see that the general form of the SELECT
statement allows a result-column
, and that a result-column
can contain an expr
.
This opens up many, many possibilities (some of which we will get back to below), one of them is the use of simple scalar functions.
We can also use these functions in other places, such as in our WHERE
or ORDER BY
clauses.
We can se the standard core functions here, try to use them to do the following:
Problem: Show the initial letter of each of the laureates in the year 2023.
%%sql
SELECT name, substring(name, 1, 1)
FROM nobel_prizes
WHERE year = 2023
name | substring(name, 1, 1) |
---|---|
Anne L'Huillier | A |
Ferenc Krausz | F |
Pierre Agostini | P |
Moungi G. Bawendi | M |
Louis E. Brus | L |
Alexey Ekimov | A |
Katalin Karikó | K |
Drew Weissman | D |
Jon Fosse | J |
Problem: Show the 10 first winners in the following format (see the format
-function – unfortunately it will not look great in the notebook…):
1901: chemistry Jacobus Henricus van 't Hoff
1901: literature Sully Prudhomme
1901: medicine Emil Adolf von Behring
1901: physics Wilhelm Conrad Röntgen
1902: chemistry Hermann Emil Fischer
1902: literature Christian Matthias Theodor Mommsen
1902: medicine Ronald Ross
1902: physics Hendrik Antoon Lorentz
1902: physics Pieter Zeeman
1903: chemistry Svante August Arrhenius
%%sql
Problem: List the five laureates with the shortest names.
%%sql
SELECT name, year, category
FROM nobel_prizes
ORDER BY length(name) DESC
LIMIT 5
name | year | category |
---|---|---|
Count Maurice (Mooris) Polidore Marie Bernhard Maeterlinck | 1911 | literature |
Prince Louis-Victor Pierre Raymond de Broglie | 1929 | physics |
Antonio Caetano de Abreu Freire Egas Moniz | 1949 | medicine |
Johann Friedrich Wilhelm Adolf von Baeyer | 1905 | chemistry |
Hans Karl August Simon von Euler-Chelpin | 1929 | chemistry |
SQL aggregate functions¶
The functions above was applied to one or more columns in a row, and they returned results for each row. There is another kind of SQL function, called aggregate function. An aggregate function collapses a whole table (or part of it, see below) into one row – the SQLite docs list some aggregate functions here.
One simple such function is count(X)
– it counts the number of times a values is not NULL
in a column, so to see how many Nobel laureates we have in our database, we can write:
%%sql
SELECT count(motivation)
FROM nobel_prizes;
count(motivation) |
---|
774 |
This query returns just one row, with the count (what else could it have returned?).
Problem: How many Nobel prizes were awarded in 2023?
%%sql
SELECT COUNT(name)
FROM nobel_prizes
WHERE year = 2023
COUNT(name) |
---|
9 |
Problem: When was the first Nobel prize awarded?
%%sql
SELECT MIN(year)
FROM nobel_prizes
MIN(year) |
---|
1900 |
Exercise: How many Nobel prizes in chemistry have been awarded?
%%sql
GROUP BY
and HAVING
¶
Using GROUP BY
we can handle rows in groups – to understand how it works, lets first look at the following query:
%%sql
SELECT year, category, name
FROM nobel_prizes
WHERE year = 2013
ORDER BY category
year | category | name |
---|---|---|
2013 | chemistry | Arieh Warshel |
2013 | chemistry | Martin Karplus |
2013 | chemistry | Michael Levitt |
2013 | literature | Alice Munro |
2013 | medicine | James E. Rothman |
2013 | medicine | Randy W. Schekman |
2013 | medicine | Thomas C. Südhof |
2013 | physics | François Englert |
2013 | physics | Peter W. Higgs |
Here the rows of each category will end up adjacent to each other, and using GROUP BY
we insert an invisible divider between the groups, and perform any aggregate function on the whole 'group':
%%sql
SELECT category, count()
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
category | count() |
---|---|
chemistry | 3 |
literature | 1 |
medicine | 3 |
physics | 2 |
So, if we apply an aggregate function, such as count()
, in a table which we have grouped, the function will be applied to each group, not to the whole table.
Instead of getting one count()
for the whole table (it would be a single value), we get one count()
for each group (as above).
If we add name
in the first line, we get a somewhat arbitrary result:
%%sql
SELECT category, count(), name
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
category | count() | name |
---|---|---|
chemistry | 3 | Arieh Warshel |
literature | 1 | Alice Munro |
medicine | 3 | James E. Rothman |
physics | 2 | François Englert |
The category and count is correct, but only one name is shown for each category.
The 'problem' is that we only get one row per group in the output, and that there may be several laureates in each group – our query will return one of them in a seemingly haphazard manner.
To alleviate this problem, SQLite defines an aggregate function group_concat
, which concatenates all values in the group (it also has the alias string_agg
, to match the corresponding function in PostgreSQL):
%%sql
SELECT category, count(), group_concat(name, ", " ORDER BY name DESC) AS "names"
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
category | count() | names |
---|---|---|
chemistry | 3 | Michael Levitt, Martin Karplus, Arieh Warshel |
literature | 1 | Alice Munro |
medicine | 3 | Thomas C. Südhof, Randy W. Schekman, James E. Rothman |
physics | 2 | Peter W. Higgs, François Englert |
In recent versions of SQLite it's also possible to order the values in an aggregate function such as group_concat
by using ORDER BY
inside the function call (see the docs).
Observe that there is no problem displaying category
in the SELECT-statement above, we get a value which we know will be the same for each row in the group (this is by definition, since that's what we grouped by).
If we're only interested in those categories with less than three laureates, we use HAVING
to select only groups with a given property:
%%sql
SELECT category, count(), group_concat(name, ", ") AS "names"
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3
category | count() | names |
---|---|---|
literature | 1 | Alice Munro |
physics | 2 | François Englert, Peter W. Higgs |
This corresponds to a WHERE
statement, but it applies to groups, not to individual rows (as WHERE
does) – so WHERE
and HAVING
have similar effects (they somehow narrow a search), but they're absolutely not interchangable!
Important (and often misunderstood): In the query above we first have a WHERE
statement to select some rows from the whole table, and then group the resulting selection – this can be seen in this diagram:
Every time we have both a WHERE
and a HAVING
in the same query, we must first use WHERE
to select rows we can group, and then use HAVING
to select groups.
We can use WITH
statements or subqueries (see below) if we want to have it the other way around.
Problem: Has anyone won more than one Nobel prize? We can assume the names of the laureates are unique (so far they are!).
%%sql
SELECT name, group_concat(format('%s in %s', category, year) ORDER BY year)
FROM nobel_prizes
GROUP BY name
HAVING COUNT() > 1
ORDER BY name
name | group_concat(format('%s in %s', category, year) ORDER BY year) |
---|---|
Frederick Sanger | chemistry in 1958,chemistry in 1980 |
John Bardeen | physics in 1956,physics in 1972 |
K. Barry Sharpless | chemistry in 2001,chemistry in 2022 |
Marie Curie, née Sklodowska | physics in 1903,chemistry in 1911 |
Problem: How many laureates are there in each category?
%%sql
SELECT category, COUNT()
FROM nobel_prizes
GROUP BY category
category | COUNT() |
---|---|
chemistry | 197 |
literature | 122 |
medicine | 229 |
physics | 227 |
Problem: Which categories have had more than 200 laureates?
%%sql
SELECT category, COUNT()
FROM nobel_prizes
GROUP BY category
HAVING COUNT() > 200
category | COUNT() |
---|---|
medicine | 229 |
physics | 227 |
Exercise: How many laureates were there each year between 1920 and 1930?
%%sql
Exercise: Which years saw more than 9 laureates?
%%sql
Exercise: Which have been the 20 years with most laureates? (We don't need to be precise in case of ties.)
%%sql
Using CASE WHEN
¶
When we group values, we can make use of the CASE WHEN
construct (see the penultimate track in the diagram below):
For instance, we can use it to categorize the era for each of the physics laureates having a name beginning with 'A':
%%sql
SELECT year, name,
CASE
WHEN year < 1970 THEN 'ancient era'
WHEN year <= 2000 THEN 'a long time ago'
ELSE 'quite recently'
END AS era
FROM nobel_prizes
WHERE category = 'physics' AND name LIKE 'A%'
year | name | era |
---|---|---|
1903 | Antoine Henri Becquerel | ancient era |
1907 | Albert Abraham Michelson | ancient era |
1921 | Albert Einstein | ancient era |
1927 | Arthur Holly Compton | ancient era |
1964 | Aleksandr Mikhailovich Prokhorov | ancient era |
1966 | Alfred Kastler | ancient era |
1974 | Antony Hewish | a long time ago |
1975 | Aage Niels Bohr | a long time ago |
1978 | Arno Allan Penzias | a long time ago |
1979 | Abdus Salam | a long time ago |
1981 | Arthur Leonard Schawlow | a long time ago |
2003 | Alexei A. Abrikosov | quite recently |
2003 | Anthony J. Leggett | quite recently |
2007 | Albert Fert | quite recently |
2010 | Andre Geim | quite recently |
2011 | Adam G. Riess | quite recently |
2015 | Arthur B. McDonald | quite recently |
2018 | Arthur Ashkin | quite recently |
2020 | Andrea Ghez | quite recently |
2022 | Alain Aspect | quite recently |
2022 | Anton Zeilinger | quite recently |
2023 | Anne L'Huillier | quite recently |
Problem: Use the CASE WHEN
construct and GROUP BY
to count the number of physics laureates beginning with 'A' in each era.
%%sql
A short intro to window functions¶
As we saw above, we can treat partitions of our rows as 'groups', by using GROUP BY
.
We also saw than when we use an aggregate function on a group, we collapse whole groups into one row in the output (and when we apply it to a table without groups, the whole table collapses) – but sometimes we want to apply aggregate functions within a group of values, and keep each row in the output.
Above we listed all laureates in 2013, now we want to add one column to the output: for each laureate, we want to show how many laureates shared the prize in her or his category.
If we use GROUP BY
and the aggregate function count()
on the categories, we would only get one row per category, and using count()
without grouping would collapse the whole result into just one row:
%%sql
SELECT year, category, name, count() AS count -- oh no!
FROM nobel_prizes
WHERE year = 2013
ORDER BY category
year | category | name | count |
---|---|---|---|
2013 | chemistry | Arieh Warshel | 9 |
Fortunately, SQL has quite recently introduced a way to apply functions only over 'partitions' of our tables, and keep all rows in the output – using the reserved word OVER
we can introduce a window of our rows, and apply the function only over this 'window':
%%sql
SELECT year,
category,
name,
count() OVER (PARTITION by category) AS count
FROM nobel_prizes
WHERE year = 2013
ORDER BY category
year | category | name | count |
---|---|---|---|
2013 | chemistry | Arieh Warshel | 3 |
2013 | chemistry | Martin Karplus | 3 |
2013 | chemistry | Michael Levitt | 3 |
2013 | literature | Alice Munro | 1 |
2013 | medicine | James E. Rothman | 3 |
2013 | medicine | Randy W. Schekman | 3 |
2013 | medicine | Thomas C. Südhof | 3 |
2013 | physics | François Englert | 2 |
2013 | physics | Peter W. Higgs | 2 |
We can also give our windows names, using an alias:
%%sql
SELECT year,
category,
name,
count() OVER categories AS count
FROM nobel_prizes
WHERE year = 2013
WINDOW categories AS (PARTITION by category)
ORDER BY category
The aliased window definitions must come after any WHERE
and HAVING
, and before any ORDER BY
.
So, if we use the reserved word OVER
after a function, the function will be applied according to a 'window' (there is more to it than this, but this will suffice for now).
In the window we can:
define a partition, using
PARTITION BY
,define an order, using
ORDER BY
, anddefine a range, which we can use to define groups of rows relative to each other (but we won't look at ranges in the course).
The function will be applied to each partition, in the same way we applied aggregate functions on groups above, but now we won't collapse the partitions.
Observe that the partitioning and ordering are based only on the selection we make (i.e., only those rows which are chosen in our WHERE
clause).
We can use our regular aggregate functions as window functions, but there are also a couple of dedicated window functions, such as (there are more, but we won't use them in the course):
rank()
: ranks rows by the order specified in the window, ties can occur,row_number()
: asrank()
, but now we avoid ties, and rank by row number in the output, andpercent_rank()
: gives a value between 0.0 and 1.0 (so it's a bit of a misnomer), giving the row's relative rank within its partition.
You can find more here.
Window functions can be very powerful, but we'll not delve too deeply into them in the course – I want you to be aware of them, though!
Problem: Add one column which 'ranks' the laureates of 2013 in the table above according to the lengths of their names, within the categories – shorter names should come before longer names.
%%sql
SELECT year,
category,
name,
RANK() OVER (PARTITION by category ORDER BY length(name)) AS rank
FROM nobel_prizes
WHERE year = 2013
ORDER BY category
year | category | name | rank |
---|---|---|---|
2013 | chemistry | Arieh Warshel | 1 |
2013 | chemistry | Martin Karplus | 2 |
2013 | chemistry | Michael Levitt | 2 |
2013 | literature | Alice Munro | 1 |
2013 | medicine | James E. Rothman | 1 |
2013 | medicine | Thomas C. Südhof | 1 |
2013 | medicine | Randy W. Schekman | 3 |
2013 | physics | Peter W. Higgs | 1 |
2013 | physics | François Englert | 2 |
Exercise: For each laureate with the initial A, list their category, year, name, and 'freshness' within that category, i.e., the most recent laureate in a category is ranked 1, the second most recent laureate is ranked 2, etc. The ranks should be confined to laureates with the initial A.
%%sql
Some exercises¶
To spice things up a bit, I've included a table with all olympic games since 1896 – the table olympics
contains the columns:
year
city
country
continent
season
ordinal_number
If we look carefully at this table, we can find some unnecessary repetition (try to find it!) – we will address this problem during lecture 2, but for now, we'll let it pass.
Exercise: How many olympic games have each continent hosted?
%%sql
SELECT *
FROM olympics
year | city | country | continent | season | ordinal_number |
---|---|---|---|---|---|
1924 | Chamonix | France | Europe | winter | I |
1928 | St. Moritz | Switzerland | Europe | winter | II |
1932 | Lake Placid | United States | North America | winter | III |
1936 | Garmisch-Partenkirchen | Germany | Europe | winter | IV |
1948 | St. Moritz | Switzerland | Europe | winter | V |
1952 | Oslo | Norway | Europe | winter | VI |
1956 | Cortina d'Ampezzo | Italy | Europe | winter | VII |
1960 | Squaw Valley | United States | North America | winter | VIII |
1964 | Innsbruck | Austria | Europe | winter | IX |
1968 | Grenoble | France | Europe | winter | X |
1972 | Sapporo | Japan | Asia | winter | XI |
1976 | Innsbruck | Austria | Europe | winter | XII |
1980 | Lake Placid | United States | North America | winter | XIII |
1984 | Sarajevo | Yugoslavia | Europe | winter | XIV |
1988 | Calgary | Canada | North America | winter | XV |
1992 | Albertville | France | Europe | winter | XVI |
1994 | Lillehammer | Norway | Europe | winter | XVII |
1998 | Nagano | Japan | Asia | winter | XVIII |
2002 | Salt Lake City | United States | North America | winter | XIX |
2006 | Turin | Italy | Europe | winter | XX |
2010 | Vancouver | Canada | North America | winter | XXI |
2014 | Sochi | Russia | Europe | winter | XXII |
2018 | Pyeongchang | South Korea | Asia | winter | XXIII |
2022 | Beijing | China | Asia | winter | XXIV |
1896 | Athens | Greece | Europe | summer | I |
1900 | Paris | France | Europe | summer | II |
1904 | St. Louis | United States | North America | summer | III |
1908 | London | United Kingdom | Europe | summer | IV |
1912 | Stockholm | Sweden | Europe | summer | V |
1916 | Berlin | Germany | Europe | summer | VI |
1920 | Antwerp | Belgium | Europe | summer | VII |
1924 | Paris | France | Europe | summer | VIII |
1928 | Amsterdam | Netherlands | Europe | summer | IX |
1932 | Los Angeles | United States | North America | summer | X |
1936 | Berlin | Germany | Europe | summer | XI |
1948 | London | United Kingdom | Europe | summer | XIV |
1952 | Helsinki | Finland | Europe | summer | XV |
1956 | Melbourne | Australia | Australia | summer | XVII |
1960 | Rome | Italy | Europe | summer | XVII |
1964 | Tokyo | Japan | Asia | summer | XVIII |
1968 | Mexico City | Mexico | North America | summer | XIX |
1972 | Munich | West Germany | Europe | summer | XX |
1976 | Montreal | Canada | North America | summer | XXI |
1980 | Moscow | Soviet Union | Europe | summer | XXII |
1984 | Los Angeles | United States | North America | summer | XXIII |
1988 | Seoul | South Korea | Asia | summer | XXIV |
1992 | Barcelona | Spain | Europe | summer | XXV |
1996 | Atlanta | United States | North America | summer | XXVI |
2000 | Sydney | Australia | Australia | summer | XXVII |
2004 | Athens | Greece | Europe | summer | XXVIII |
2008 | Beijing | China | Asia | summer | XXIX |
2012 | London | United Kingdom | Europe | summer | XXX |
2016 | Rio de Janeiro | Brazil | South America | summer | XXXI |
2020 | Tokyo | Japan | Asia | summer | XXXII |
2024 | Paris | France | Europe | summer | XXXIII |
Exercise: When was the first olympic games in each continent?
%%sql
SELECT continent, MIN(year)
FROM olympics
GROUP BY continent
continent | MIN(year) |
---|---|
Asia | 1964 |
Australia | 1956 |
Europe | 1896 |
North America | 1904 |
South America | 2016 |
Exercise: Which countries have hosted the summer olympics more than once?
%%sql
Exercise: List the continents in descending order by the number of times they've hosted the summer olympics
%%sql
Exercise: Show a 'histogram' (no actual diagram, just the counts) over the the initial letter of the names of all Nobel laureates (see how big your chance is…).
%%sql
Exercise: Show a 'histogram' over the the initial letter of the names of all Nobel laureates, for each category.
%%sql
Exercise: Has anyone won more than one Nobel prize in the same category?
%%sql
Exercise: Has anyone won Nobel prizes in different categories?
%%sql
Exercise: For each olympic game, show how many olympic games had come before it in its continent. (Requires window functions)
%%sql
Subqueries, Views and Common Table Expressions (CTEs)¶
As we noted above, the result of a SELECT
statement is itself a (kind of) table, and we can use such a table inside other SELECT
statements.
One useful pattern is:
SELECT ...
FROM ...
WHERE ... IN
(SELECT ...
FROM ...
WHERE ...)
The second, nested query is called a subquery.
We'll use a subquery to find all literature laureates who split their prizes – to do it we begin with a regular query, to find which years the Nobel prize for literature were split?
%%sql
… and now we use the result of that query to find out what we're really looking for:
%%sql
This can be simplified by using either of two ways to define 'temporary tables':
- views
- Common Tabale Expressions.
A view can be seen as a new table, containing the result of a query – to create a view with the years where the literature prize was shared, we can write:
%%sql
CREATE VIEW shared_literature_prize(year) AS
SELECT ... -- just as above...
We can now use our view in a query:
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN (
SELECT year
FROM shared_literature_prize
)
Since there is only one attribute in our shared_literature_table
, we can simplify this expression:
%%sql
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN shared_literature_prize
The view will be around until we decide to remove it with:
%%sql
DROP VIEW shared_literature_prize
A Common Table Expression (or CTE) is like a view, but only defined in one query, so we'd write it like:
%%sql
WITH
shared_literature_prize(year) AS (
SELECT ... -- as above...
)
SELECT year, name
FROM nobel_prizes
WHERE category = 'literature'
AND year IN shared_literature_prize
There are some things which makes CTEs very nice:
- they're defined as part of a
SELECT
statement (so there is nothing to drop afterwards), - since they're part of a
SELECT
statement, we only need one statement (which will become useful when we call our database remotely, we'll return to that later in the course), and - they can be defined recursively (we'll return to that later in the course).
Problem: Show the years and categories for recurring laureates (i.e., laureates who has won more than once) – use a CTE to do it.
%%sql
Exercise: Who has won the literature prize in a year when at least one chemistry laureate had a name beginning with 'L'? First try to solve this with a regular subquery, and then rewrite it using a CTE.
%%sql
We saw above that we can't have another WHERE
after the HAVING
clause:
%%sql
SELECT category, count() AS count
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3
WHERE count > 1 -- <-- not allowed!
but we can make our 'grouping query' into a subquery, and have another WHERE
in the outer query:
%%sql
SELECT category, count
FROM (
SELECT category, count() AS count
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3)
WHERE count > 1
A somewhat tidier way of expressing this is to use a WITH-statement (CTE):
%%sql
WITH category_count(category, count) AS (
SELECT category, count()
FROM nobel_prizes
WHERE year = 2013
GROUP BY category
HAVING count() < 3
)
SELECT category, count
FROM category_count
WHERE count > 1
Problem: Show which years there were prices in some category, but not in medicine (you can do it either with or without CTEs, but use a subquery).
%%sql
Problem: There is a 'funny' function which returns series of values – it's called generate_series
, and you can try to use it to see which year no nobel price at all was awarded.
%%sql
Correlated subqueries¶
Another form of subquery allows us to correlate two SELECT
statements, so that we use values from an outer SELECT
statement inside the SELECT
statement in a subquery. As an example, solve the following problem:
Problem: List the names of all laureates who has the longest name of all laureates in their category (in case of ties, all should be listed) – order by category.
Here we can use a subquery which is 'run' for each row in our main query:
%%sql
SELECT category, year, name
FROM nobel AS outer_nobel
WHERE length(name) = (
SELECT max(length(name))
FROM nobel
WHERE category = outer_nobel.category)
ORDER BY category
This is called a correlated subquery (since we refer to the enclosing query inside it). We use an alias to distinguish between the nobel table in the outer query and the nobel table in the subquery (it's the same table, but we 'iterate' through it separately).
BTW, we could have skipped the AS
in
...
FROM nobel AS outer_nobel
...
and just written:
...
FROM nobel outer_nobel
...
The general opinion is that we should use AS
, as it makes it more obvious that we're defining an alias.