A simple example

Here’s a fairly simple SQL query:

  AVG(salary) AS average_salary,
  SUM(salary) AS sum_salary,
  AVG(salary + payroll_tax) AS average_gross_salary,
  SUM(salary + payroll_tax) AS sum_gross_salary,
  AVG(salary + payroll_tax + benefits_cost) AS average_gross_cost,
  SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost,
  COUNT(*) AS ct
  start_date > DATE('2021-01-01')
  AND salary + payroll_tax + benefits_cost > 0
  COUNT(*) > 2000
  country DESC

Even this simple query demonstrates some of the problems with SQL’s lack of abstractions:

  • Unnecessary repetition — the calculations for each measure are repeated, despite deriving from a previous measure. The repetition in the WHERE clause obfuscates the meaning of the expression.
  • Functions have multiple operators — HAVING & WHERE are fundamentally similar operations applied at different stages of the pipeline, but SQL’s lack of pipeline-based precedence requires it to have two different operators.
  • Operators have multiple functions — the SELECT operator both creates new aggregations, and selects which columns to include.
  • Awkward syntax — when developing the query, commenting out the final line of the SELECT list causes a syntax error because of how commas are handled, and we need to repeat the columns in the GROUP BY clause in the SELECT list.

Here’s the same query with PRQL:

from employees                                # Each line transforms the previous result.
filter start_date > @2021-01-01               # Clear date syntax.
derive {                                      # `derive` adds columns / variables.
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost   # Variables can use other variables.
filter gross_cost > 0
group {title, country} (                      # `group` runs a pipeline over each group.
  aggregate {                                 # `aggregate` reduces a column to a row.
    average salary,
    sum     salary,
    average gross_salary,
    sum     gross_salary,
    average gross_cost,
    sum_gross_cost = sum gross_cost,          # `=` sets a column name.
    ct = count this,
sort {sum_gross_cost, -country}               # `-country` means descending order.
filter ct > 2_000
take 20

As well as using variables to reduce unnecessary repetition, the query is also more readable — it flows from top to bottom, each line representing a transformation of the previous line’s result. For example, TOP 20 / take 20 modify the final result in both queries — but only PRQL represents it as the final transformation. And context is localized — the aggregate transform is immediately wrapped in a group transform containing the columns to group by.

While PRQL is designed for reading & writing by people, it’s also much simpler for code to construct or edit PRQL queries. In SQL, adding a filter to a query involves parsing the query to find and then modify the WHERE statement, or wrapping the existing query in a CTE. In PRQL, adding a filter just involves appending a filter transformation to the query.

For more examples, check out the PRQL Book.