A functional approach to relational queries
I believe that many of the ideas from functional programming are very suitable for a query language. To make a case for it, I’ll briefly explain a few features, adapt their syntax, and show how they are used in PRQL.
Fancy functional features
Functional programming is an old-school programming paradigm that is gradually gaining traction within a few common languages that are designed with procedural paradigm in mind.
The many aspects of the paradigm have been discussed extensively, so I’ll try to be brief. If you’ve already read them all, I suggest you skip to the next section.
Pure functions
This is the core aspect that allows us to do most of the magic later. When we say “pure function”, we just mean the mathematical function and not “a procedure” or “a method”. To be more precise:
- Pure functions have no side effects. This means that no global state is altered. The function result is the only output that the function produces.
- Pure functions produce the same output given the same args. The output does not depend on any external information; for example the current time or a random number generator.
The most obvious effect is that for a pure function my_function
, this:
res1 = my_function(arg);
res2 = my_function(arg);
res3 = my_function(arg);
… is equivalent to:
res1 = my_function(arg);
res2 = res1;
res3 = res2;
Because my_function
is guaranteed to return the same result when given arg
,
we can skip calling my_function
for the second and the third time because we
already know what the output will be.
This may seem like a minor point, but we’re just getting started.
First-class citizens
Functions being “first-class citizens” means that they are treated the same as any other value; they can be stored in a variable, they can be passed to another function as an argument or even be used in binary operations.
function double(x) {
return 2 * x;
}
let preprocess = double;
my_array = [4, 5, 6];
my_array.map(preprocess);
// yields [8, 10, 12]
Here, we’ve stored function double
in variable preprocess
and then passed
that to method Array.map
, which applies a given function to each of the array
elements.
Currying
Named after Haskell Curry, currying is an implicit act of converting a function call with missing arguments into a new function.
function add(x, y) {
return x + y;
}
let add_one = add(1);
Because we didn’t specify y
parameter of add
function, the result is a new
function that is still waiting for the last argument. It is equivalent to
defining add_one like this:
function add_one(y) {
return add(1, y);
}
Implicit function call
Let’s suppose that we have a pure function that doesn’t take any arguments:
function my_function() {
return 42;
}
Because it is a pure function, it cannot depend on anything other than its non-existing arguments. Another way to phrase it, this is a constant expression.
And as such, it might as well be defined as one:
let my_var = 42;
In conventional programming languages1, there is a difference between using
my_function()
and my_var
. The first one evaluates the expression at the call
site, while the second one evaluates it at the declaration site. It is also
possible to express just my_function
, a reference to a function that can be
called without arguments.
But if all functions are pure, we generalize by saying that all three cases are equivalent.
To make this work, let’s say that my_function
is “implicitly invoked” just as
it would have been expressed as my_function()
.
Also, if we declare all functions to be pure it doesn’t matter exactly when
they are evaluated. So let’s give the compiler the authority to make an informed
guess about when and make my_var
and my_function()
semantically
equivalent.
Syntax, surly shinier
First of all, let’s change the function call to this:
(my_function arg1 arg2 arg3)
It may look strange because the function name is within the parenthesis and there are no commas between arguments. But there are benefits to this syntax.
Firstly, when the call has no arguments, the parenthesis can be omitted:
(my_function) == my_function
which feels very natural because of the similar behavior with expressions. As intended, a function call with no argument and a plain reference to the function are expressed with the same syntax.
Let’s extend this behavior and allow bare function calls in a few places where they won’t become ambiguous:
# a list:
[my_function arg1 arg2 arg3, my_function arg4 arg5 arg6]
# a declaration:
let res = my_function arg1 arg2 arg3
Secondly, I’d argue that currying looks natural.
let curry = my_function arg1 arg2
let res = curry arg3
# or inline:
let res = (my_function arg1 arg2) arg3
Now let’s go one step further and introduce a “pipe” operator. It applies its left operand as an argument to the right operand:
arg3 | my_function arg1 arg2
This is especially useful for chaining function calls. If we declare div
,
floor
and mul
as common arithmetic functions, the function call syntax
starts to make sense:
12 | div 5 | floor | mul 5
Running real relations
Up to this point, we’ve discussed language design in the abstract, without knowing what it will be used for. Designing for the sake of “language being cool” is fine, but we do have a use for it, which means that the language must be designed to be the right tool for the job.
The job, in the case of PRQL, is querying databases.
The core unit of data here is a “relation”: an ordered set of rows, each of which contains entries for each of the columns. (More people are likely to be familiar with a “table”, which is a relation stored in a DB.)
Queries operate in a static environment that contains references to database tables and a library of standard functions. The important keyword here is “static” by which I mean “global immutable variables” that can be referenced from pure functions 2.
PRQL currently supports most of the features described in previous sections, focused on the innovative function call syntax.
Basics
A basic operation on a relation would be:
(take 3 albums)
# or with a pipeline:
(albums | take 3)
… where take
is a function and albums
is a static relation. This does
exactly what it sounds like: it limits the result to the first n rows. One could
say it selects the top n rows.
In actual PRQL queries, we have some rules regarding references to tables, which
I’ll talk about it some other time. For now, let’s just say that for referencing
tables (i.e. static relations) use the function from
. It has the bonus that it
makes queries look a lot more like SQL:
(from albums | take 3)
To make querying easier, we have some neat name resolution rules that allow function arguments to refer to each other. In practice, it allows referring to columns of a relation in function calls:
(select [title, artist_id] default_db.albums)
# and with a pipeline:
(from albums | select {title, artist_id})
All these queries can be simplified to an expression of relations and scalars. In PRQL, we call such expressions “Relational Queries” or RQ for short. It is an intermediate representation of prqlc and can be translated to SQL and executed on basically any relational database.
This is the gist of how to express SQL queries with a functional language. At this stage a curious reader might ask “can PRQL express any SQL query?” to which I’d say: almost. Another might be wondering “but is it less cumbersome and more consistent?” and I’d reply “very much, yes”, but it depends on whom you ask. And someone might say “why functional?” to which I say “exactly the question I was waiting for!”.
There are a few quick benefits I want to get out of the way:
- Pipelines make the flow of the query “top-to-bottom”. An earlier part of a the pipeline will always be a valid query, regardless of what follows.
- Functions on relations (transforms) are designed to be orthogonal. Each of
them has as few effects as possible and can be used at any point in the
pipeline (except for
from
).
But there is more.
Aggregate
At this point, I have to introduce aggregate
. It takes a relation and produces
a single row using some aggregation function:
(from albums | aggregate {n_albums = count})
What happens if we don’t specify the leading relation?
(aggregate {n_albums = count})
Because aggregate
is missing an argument, currying kicks in, and the whole
expression evaluates to a function that is still waiting for a relation.
In SQL it is common to use GROUP BY when aggregating. If you think about it, it
essentially separates the relation into groups using some criteria and then
applies aggregate
to each of the groups. This is exactly how PRQL expressed
it:
from albums | group artist_id (aggregate {n_albums = count})
This is a lot for one line, so let’s unveil new syntactic conveniences: a new line is a pipe operator and the top-level pipeline does not need parenthesis. I’ll also add a new transform at the back, don’t worry about it.
from albums
group artist_id (
aggregate {n_albums = count}
)
filter n_albums > 3
Notice how group
operates on the whole relation it gets from from
and that
aggregate
is passed to group
as a function to be applied to each of the
groups. aggregate
then converts each of the groups into a single row and
returns that to group
that composes these rows back together. This can now be
passed on to filter
, which will remove any rows that don’t match its
condition.
In SQL, a similar expression would have these four parts:
- projection in SELECT,
- an aggregation function that implicitly triggers aggregation,
- GROUP BY clause,
- HAVING clause.
These parts are entangled syntactically and semantically into one feature we understand as GROUPING.
The beautiful realization is that these are 3 different operations that are
happening and that when separated they don’t need to be associated with each
other. For example filter
is more commonly expressed as WHERE and group
has
other uses than aggregate
.
But to separate these core relational operations, we need a way to express aggregate as a function. This is why the functional paradigm fits the relational data model.
Distinct
A common question when learning SQL is “how do I select the row where column x is smallest?”. It has many variations, but there are two ways of doing it:
-- option 1
SELECT x, y, z FROM tab ORDER BY x LIMIT 1
-- option 2
SELECT x, y, z FROM tab WHERE x = (SELECT min(x) FROM tab)
A follow-up question would be “how do I select the row where column x is smallest, for each group over y?”. This seems like a similar problem but the solution in SQL is surprisingly different:
-- option 1 (unsupported in some dialects)
SELECT DISTINCT ON (y) x, y, z FROM tab ORDER BY y, x, z;
-- option 2 (supported by most dialects)
WITH summary AS (
SELECT x, y, z,
ROW_NUMBER() OVER(PARTITION BY y ORDER BY x) AS rank
FROM tab)
SELECT * FROM summary WHERE rank = 1
Now break the query down into core operations. Essentially we want to do the
same thing we did before, but performed in groups by y
. Before we used SQL
that can be expressed as sort x | take 1
(which evaluates to a function), so
now surely this should work:
from tab
group y (sort x | take 1)
And it does. You can go and test this out in the PRQL playground.
Another variation of the question would be “how do I select a row, for each group over all columns?”. If you phrase it differently “group by all columns and then take one row from each group”. Or another way: “select distinct values of all columns”.
from tab
group tab.* (take 1)
As you can see, our language has many shortcuts for expressing operations such as DISTINCT. This is convenient for us humans, but it’s not a good base for a relational query language.
I hope my point is clear: relational query language benefits a lot by separating operations into orthogonal transforms3. These transforms are in most cases pure functions that are easiest to express in a functional language.
If you want to see more of what PRQL is capable of, come and check out the project. It may not have monads (yet), but it’s probably better than what you are forced to use now.
Appendix
PRQL support
PRQL is a work in progress. It does not yet support all the features presented here, namely:
let
syntax for variable declarations,- inline currying - function call must start with a function name,
- syntax for declaring tables.
We focused on the core features and left these out because they can be worked around.
Also, PRQL may not ever get all of these features, because the ideas in this article are only my own and not necessarily of the whole PRQL core team.
In math, function call syntax is ambiguous
If you think about it, the function call syntax from math is kind of ambiguous. For example, what does this mean:
a(b + 1)
It could either be a call of function a, or it could be just multiplication
where we omitted *
. This is not a problem in conventional programming
languages because they don’t allow omitting the *
.
Let’s say that conventional languages are the first 15 from this list: https://survey.stackoverflow.co/2022/#most-popular-technologies-language-prof ↩︎
Many functions in SQL are not pure, and we don’t have a plan on how to deal with that. ↩︎
Transforms in PRQL are not completely orthogonal.
select
,derive
,aggregate
andjoin
all manipulate relation columns. So in a sense, they are much closer to each other than they are totake
. ↩︎