Examples
A simple example
Here’s a fairly simple SQL query:
SELECT TOP 20
title,
country,
AVG(salary) AS average_salary,
SUM(salary) AS sum_salary,
AVG(salary + payroll_tax) AS average_gross_salary,
SUM(salary + payroll_tax) AS sum_gross_salary,
AVG(salary + payroll_tax + benefits_cost) AS average_gross_cost,
SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost,
COUNT(*) AS ct
FROM
employees
WHERE
start_date > DATE('2021-01-01')
AND salary + payroll_tax + benefits_cost > 0
GROUP BY
title,
country
HAVING
COUNT(*) > 2000
ORDER BY
sum_gross_cost,
country DESC
Even this simple query demonstrates some of the problems with SQL’s lack of abstractions:
- Unnecessary repetition — the calculations for each measure are repeated,
despite deriving from a previous measure. The repetition in the
WHERE
clause obfuscates the meaning of the expression. - Functions have multiple operators —
HAVING
&WHERE
are fundamentally similar operations applied at different stages of the pipeline, but SQL’s lack of pipeline-based precedence requires it to have two different operators. - Operators have multiple functions — the
SELECT
operator both creates new aggregations, and selects which columns to include. - Awkward syntax — when developing the query, commenting out the final line of
the
SELECT
list causes a syntax error because of how commas are handled, and we need to repeat the columns in theGROUP BY
clause in theSELECT
list.
Here’s the same query with PRQL:
from employees # Each line transforms the previous result.
filter start_date > @2021-01-01 # Clear date syntax.
derive { # `derive` adds columns / variables.
gross_salary = salary + payroll_tax,
gross_cost = gross_salary + benefits_cost # Variables can use other variables.
}
filter gross_cost > 0
group {title, country} ( # `group` runs a pipeline over each group.
aggregate { # `aggregate` reduces a column to a row.
average salary,
sum salary,
average gross_salary,
sum gross_salary,
average gross_cost,
sum_gross_cost = sum gross_cost, # `=` sets a column name.
ct = count this,
}
)
sort {sum_gross_cost, -country} # `-country` means descending order.
filter ct > 2_000
take 20
As well as using variables to reduce unnecessary repetition, the query is also
more readable — it flows from top to bottom, each line representing a
transformation of the previous line’s result. For example, TOP 20
/ take 20
modify the final result in both queries — but only PRQL represents it as the
final transformation. And context is localized — the aggregate
transform is
immediately wrapped in a group
transform containing the columns to group by.
While PRQL is designed for reading & writing by people, it’s also much simpler
for code to construct or edit PRQL queries. In SQL, adding a filter to a query
involves parsing the query to find and then modify the WHERE
statement, or
wrapping the existing query in a CTE. In PRQL, adding a filter just involves
appending a filter
transformation to the query.
For more examples, check out the PRQL Book.