A simple example
Here’s a fairly simple SQL query:
SELECT TOP 20 title, country, AVG(salary) AS average_salary, SUM(salary) AS sum_salary, AVG(salary + payroll_tax) AS average_gross_salary, SUM(salary + payroll_tax) AS sum_gross_salary, AVG(salary + payroll_tax + benefits_cost) AS average_gross_cost, SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost, COUNT(*) AS ct FROM employees WHERE start_date > DATE('2021-01-01') AND salary + payroll_tax + benefits_cost > 0 GROUP BY title, country HAVING COUNT(*) > 200 ORDER BY sum_gross_cost, country DESC
Even this simple query demonstrates some of the problems with SQL’s lack of abstractions:
- Unnecessary repetition — the calculations for each measure are repeated,
despite deriving from a previous measure. The repetition in the
WHEREclause obfuscates the meaning of the expression.
- Functions have multiple operators —
WHEREare fundamentally similar operations applied at different stages of the pipeline, but SQL’s lack of pipeline-based precedence requires it to have two different operators.
- Operators have multiple functions — the
SELECToperator both creates new aggregations, and selects which columns to include.
- Awkward syntax — when developing the query, commenting out the final line of
SELECTlist causes a syntax error because of how commas are handled, and we need to repeat the columns in the
GROUP BYclause in the
Here’s the same query with PRQL:
from employees # Each line transforms the previous result. filter start_date > @2021-01-01 # Clear date syntax. derive [ # `derive` adds columns / variables. gross_salary = salary + payroll_tax, gross_cost = gross_salary + benefits_cost # Variables can use other variables. ] filter gross_cost > 0 group [title, country] ( # `group` runs a pipeline over each group. aggregate [ # `aggregate` reduces a column to a row. average salary, sum salary, average gross_salary, sum gross_salary, average gross_cost, sum_gross_cost = sum gross_cost, # `=` sets a column name. ct = count, ] ) sort [sum_gross_cost, -country] # `-country` means descending order. filter ct > 200 take 20
As well as using variables to reduce unnecessary repetition, the query is also
more readable — it flows from top to bottom, each line representing a
transformation of the previous line’s result. For example,
TOP 20 /
modify the final result in both queries — but only PRQL represents it as the
final transformation. And context is localized — the
aggregate transform is
immediately wrapped in a
group transform containing the columns to group by.
While PRQL is designed for reading & writing by people, it’s also much simpler
for code to construct or edit PRQL queries. In SQL, adding a filter to a query
involves parsing the query to find and then modify the
WHERE statement, or
wrapping the existing query in a CTE. In PRQL, adding a filter just involves
filter transformation to the query.
For more examples, check out the PRQL Book .