Introduction

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement. Like SQL, it’s readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.

Let’s get started with an example:

PRQL

from employees
filter start_date > @2021-01-01               # Clear date syntax
derive [                                      # `derive` adds columns / variables
  gross_salary = salary + (tax ?? 0),         # Terse coalesce
  gross_cost = gross_salary + benefits_cost,  # Variables can use other variables
]
filter gross_cost > 0
group [title, country] (                      # `group` runs a pipeline over each group
  aggregate [                                 # `aggregate` reduces each group to a value
    average gross_salary,
    sum_gross_cost = sum gross_cost,          # `=` sets a column name
  ]
)
filter sum_gross_cost > 100_000                # `filter` replaces both of SQL's `WHERE` & `HAVING`
derive id = f"{title}_{country}"              # F-strings like python
derive country_code = s"LEFT(country, 2)"     # S-strings allow using SQL as an escape hatch
sort [sum_gross_cost, -country]               # `-country` means descending order
take 1..20                                    # Range expressions (also valid here as `take 20`)

SQL

WITH table_1 AS (
  SELECT
    title,
    country,
    salary + COALESCE(tax, 0) + benefits_cost AS _expr_0,
    salary + COALESCE(tax, 0) AS _expr_1
  FROM
    employees
  WHERE
    start_date > DATE '2021-01-01'
)
SELECT
  title,
  country,
  AVG(_expr_1),
  SUM(_expr_0) AS sum_gross_cost,
  CONCAT(title, '_', country) AS id,
  LEFT(country, 2) AS country_code
FROM
  table_1
WHERE
  _expr_0 > 0
GROUP BY
  title,
  country
HAVING
  SUM(_expr_0) > 100000
ORDER BY
  sum_gross_cost,
  country DESC
LIMIT
  20

As you can see, PRQL is a linear pipeline of transformations — each line of the query is a transformation of the previous line’s result.

In the SQL output, by contrast, operations do not follow one another in a linear order, which makes it harder to compose larger queries.

Queries

PRQL queries are composed of a sequence of transforms that form a pipeline that modifies data as it is passed from one to the next.

The major items of a query are described in the sections that follow:

Pipelines

PRQL queries are a sequence of lines (or transforms) that form a pipeline. Each line transforms the data, and passes its result to the next.

The simplest pipeline is just:

PRQL

from employees

SQL

SELECT
  *
FROM
  employees

Adding transforms

As we add additional lines, each one transforms the result:

PRQL

from employees
derive gross_salary = (salary + payroll_tax)

SQL

SELECT
  *,
  salary + payroll_tax AS gross_salary
FROM
  employees

…and so on:

from employees
derive gross_salary = (salary + payroll_tax)
sort gross_salary

Compiling to SQL

PRQL compiles the query to SQL. The PRQL compiler tries to represent as many transforms as possible with a single SELECT statement. When necessary, the compiler “overflows” and creates CTEs (common table expressions):

PRQL

from e = employees
derive gross_salary = (salary + payroll_tax)
sort gross_salary
take 10
join d = department [==dept_no]
select [e.name, gross_salary, d.name]

SQL

WITH table_1 AS (
  SELECT
    name,
    salary + payroll_tax AS gross_salary,
    dept_no
  FROM
    employees AS e
  ORDER BY
    gross_salary
  LIMIT
    10
)
SELECT
  table_1.name,
  table_1.gross_salary,
  d.name
FROM
  table_1
  JOIN department AS d ON table_1.dept_no = d.dept_no

Functions

Functions are a fundamental abstraction in PRQL — they let us write code once and run it in many places. This reduces the number of errors in our code, makes our code more readable, and simplifies making changes.

Functions have two types of parameters:

  1. Positional parameters, which require an argument.
  2. Named parameters, which optionally take an argument, otherwise using their default value.

So this function is named fahrenheit_to_celsius and has one parameter temp:

PRQL

func fahrenheit_to_celsius temp -> (temp - 32) / 1.8

from cities
derive temp_c = (fahrenheit_to_celsius temp_f)

SQL

SELECT
  *,
  (temp_f - 32) / 1.8 AS temp_c
FROM
  cities

This function is named interp, and has two positional parameters named higher and x, and one named parameter named lower which takes a default argument of 0. It calculates the proportion of the distance that x is between lower and higher.

PRQL

func interp lower:0 higher x -> (x - lower) / (higher - lower)

from students
derive [
  sat_proportion_1 = (interp 1600 sat_score),
  sat_proportion_2 = (interp lower:0 1600 sat_score),
]

SQL

SELECT
  *,
  (sat_score - 0) / 1600 AS sat_proportion_1,
  (sat_score - 0) / 1600 AS sat_proportion_2
FROM
  students

Piping

Consistent with the principles of PRQL, it’s possible to pipe values into functions, which makes composing many functions more readable. When piping a value into a function, the value is passed as an argument to the final positional parameter of the function. Here’s the same result as the examples above with an alternative construction:

PRQL

func interp lower:0 higher x -> (x - lower) / (higher - lower)

from students
derive [
  sat_proportion_1 = (sat_score | interp 1600),
  sat_proportion_2 = (sat_score | interp lower:0 1600),
]

SQL

SELECT
  *,
  (sat_score - 0) / 1600 AS sat_proportion_1,
  (sat_score - 0) / 1600 AS sat_proportion_2
FROM
  students

and

PRQL

func fahrenheit_to_celsius temp -> (temp - 32) / 1.8

from cities
derive temp_c = (temp_f | fahrenheit_to_celsius)

SQL

SELECT
  *,
  (temp_f - 32) / 1.8 AS temp_c
FROM
  cities

We can combine a chain of functions, which makes logic more readable:

PRQL

func fahrenheit_to_celsius temp -> (temp - 32) / 1.8
func interp lower:0 higher x -> (x - lower) / (higher - lower)

from kettles
derive boiling_proportion = (temp_c | fahrenheit_to_celsius | interp 100)

SQL

SELECT
  *,
  ((temp_c - 32) / 1.8 - 0) / 100 AS boiling_proportion
FROM
  kettles

Roadmap

Late binding

Currently, functions require a binding to variables in scope; they can’t late-bind to column names. So, for example:

func return price -> (price - dividend) / price_yesterday

…isn’t yet a valid function, and instead would need to be:

func return price dividend price_yesterday -> (price - dividend) / price_yesterday

(which makes the function of little use in this case)

Tables

We can define a temporary table — similar to a CTE in SQL — with let:

PRQL

let top_50 = (
  from employees
  sort salary
  take 50
  aggregate [total_salary = sum salary]
)

from top_50      # Starts a new pipeline

SQL

WITH table_0 AS (
  SELECT
    salary
  FROM
    employees
  ORDER BY
    salary
  LIMIT
    50
), top_50 AS (
  SELECT
    SUM(salary) AS total_salary
  FROM
    table_0
)
SELECT
  total_salary
FROM
  top_50

We can even place a whole CTE in an s-string, enabling us to use features which PRQL doesn’t yet support.

PRQL

let grouping = s"""
  SELECT SUM(a)
  FROM tbl
  GROUP BY
    GROUPING SETS
    ((b, c, d), (d), (b, d))
"""

from grouping

SQL

WITH table_0 AS (
  SELECT
    SUM(a)
  FROM
    tbl
  GROUP BY
    GROUPING SETS ((b, c, d), (d), (b, d))
),
grouping AS (
  SELECT
    *
  FROM
    table_0 AS table_1
)
SELECT
  *
FROM
  grouping

Info

In PRQL, tables are far less common than CTEs are in SQL, since a linear series of CTEs can be represented with a single pipeline.
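For example, logic that in SQL would typically be written as a chain of dependent CTEs (one to filter, one to derive a column, one to filter on it) stays a single pipeline in PRQL. A sketch, reusing columns from earlier examples:

from employees
filter start_date > @2021-01-01
derive gross_salary = salary + payroll_tax
filter gross_salary > 100_000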

Note

After a select statement, a table alias such as the e in from e=employees is no longer available as a namespace. For example, this would raise an error:

from e=employees
select e.first_name
filter e.first_name == "Fred" # Can't find `e.first_name`

To refer to the e.first_name column in subsequent transforms, either refer to it using first_name, or if it requires a different name, assign one in the select statement:

PRQL

from e=employees
select fname = e.first_name
filter fname == "Fred"

SQL

WITH table_1 AS (
  SELECT
    first_name AS fname
  FROM
    employees AS e
)
SELECT
  fname
FROM
  table_1
WHERE
  fname = 'Fred'


Sort

Orders rows based on the values of one or more columns.

sort [{direction}{column}]

Parameters

  • One column or a list of columns to sort by
  • Each column can be prefixed with:
    • +, for ascending order, the default
    • -, for descending order
  • When using prefixes, even a single column needs to be in a list or parentheses. (Otherwise, sort -foo is parsed as a subtraction between sort and foo.)

Examples

PRQL

from employees
sort age

SQL

SELECT
  *
FROM
  employees
ORDER BY
  age

PRQL

from employees
sort [-age]

SQL

SELECT
  *
FROM
  employees
ORDER BY
  age DESC

PRQL

from employees
sort [age, -tenure, +salary]

SQL

SELECT
  *
FROM
  employees
ORDER BY
  age,
  tenure DESC,
  salary

We can also use expressions:

PRQL

from employees
sort [s"substr({first_name}, 2, 5)"]

SQL

WITH table_1 AS (
  SELECT
    *,
    substr(first_name, 2, 5) AS _expr_0
  FROM
    employees
  ORDER BY
    _expr_0
)
SELECT
  *
FROM
  table_1

Notes

Ordering guarantees

Most DBs will persist ordering through most transforms; for example, you can expect this result to be ordered by tenure.

PRQL

from employees
sort tenure
derive name = f"{first_name} {last_name}"

SQL

SELECT
  *,
  CONCAT(first_name, ' ', last_name) AS name
FROM
  employees
ORDER BY
  tenure

But:

  • This is an implementation detail of the DB. If there are instances where this doesn’t hold, please open an issue, and we’ll consider how to manage it.
  • Some transforms which change the existence of rows, such as join or group, won’t persist ordering; for example:

PRQL

from employees
sort tenure
join locations [==employee_id]

SQL

WITH table_1 AS (
  SELECT
    *
  FROM
    employees
  ORDER BY
    tenure
)
SELECT
  table_1.*,
  locations.*
FROM
  table_1
  JOIN locations ON table_1.employee_id = locations.employee_id

See Issue #1363 for more details.

Take

Picks rows based on their position.

take {n|range}

See Ranges for more details on how ranges work.

Examples

PRQL

from employees
take 10

SQL

SELECT
  *
FROM
  employees
LIMIT
  10

PRQL

from orders
sort [-value, date]
take 101..110

SQL

SELECT
  *
FROM
  orders
ORDER BY
  value DESC,
  date
LIMIT
  10 OFFSET 100

Window

Applies a pipeline to segments of rows, producing one output value for every input value.

window rows:{range} range:{range} expanding:false rolling:0 {pipeline}

For each row, the segment over which the pipeline is applied is determined by one of:

  • rows, which takes a range of rows relative to the current row position.
    • 0 references the current row.
  • range, which takes a range of values relative to current row value.

The bounds of the range are inclusive. If a bound is omitted, the segment will extend until the edge of the table or group.

For ease of use, there are two flags that override rows or range:

  • expanding:true is an alias for rows:..0. A sum using this window is also known as “cumulative sum”.
  • rolling:n is an alias for rows:(-n+1)..0, where n is an integer. This will include the last n values, including the current row. An average over this window is also known as a Simple Moving Average.

Some examples:

Expression      Meaning
rows:0..2       current row plus two following rows
rows:-2..0      two preceding rows plus current row
rolling:3       (same as previous)
rows:-2..4      two preceding rows plus current row plus four following rows
rows:..0        all rows from the start of the table up to & including the current row
expanding:true  (same as previous)
rows:0..        the current row and all following rows until the end of the table
rows:..         all rows, the same as not having a window at all

Example

PRQL

from employees
group employee_id (
  sort month
  window rolling:12 (
    derive [trail_12_m_comp = sum paycheck]
  )
)

SQL

SELECT
  *,
  SUM(paycheck) OVER (
    PARTITION BY employee_id
    ORDER BY
      month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS trail_12_m_comp
FROM
  employees

PRQL

from orders
sort day
window rows:-3..3 (
  derive [centered_weekly_average = average value]
)
group [order_month] (
  sort day
  window expanding:true (
    derive [monthly_running_total = sum value]
  )
)

SQL

SELECT
  *,
  AVG(value) OVER (
    ORDER BY
      day ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
  ) AS centered_weekly_average,
  SUM(value) OVER (
    PARTITION BY order_month
    ORDER BY
      day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS monthly_running_total
FROM
  orders

Windowing by default

If you use window functions without a window transform, they are applied to the whole table. Unlike in SQL, they remain window functions and do not trigger aggregation.

PRQL

from employees
sort age
derive rnk = rank

SQL

SELECT
  *,
  RANK() OVER (
    ORDER BY
      age ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS rnk
FROM
  employees
ORDER BY
  age

You can also apply just group, without window:

PRQL

from employees
group department (
  sort age
  derive rnk = rank
)

SQL

SELECT
  *,
  RANK() OVER (
    PARTITION BY department
    ORDER BY
      age ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS rnk
FROM
  employees

Window functions as first class citizens

There are no limitations on where windowed expressions can be used:

PRQL

from employees
filter salary < (average salary)

SQL

WITH table_1 AS (
  SELECT
    *,
    AVG(salary) OVER () AS _expr_0
  FROM
    employees
)
SELECT
  *
FROM
  table_1
WHERE
  salary < _expr_0

Syntax

Summary

A summary of PRQL syntax

Syntax    Usage                     Example
|         Pipelines                 from employees | select first_name
=         Assigns & Aliases         from e = employees
                                    derive total = (sum salary)
:         Named args & Parameters   interp lower:0 1600 sat_score
[]        Lists                     select [id, amount]
()        Precedence & Parentheses  derive celsius = (fahrenheit - 32) / 1.8
'' & ""   Strings                   derive name = 'Mary'
` `       Quoted identifiers        select `first name`
#         Comments                  # A comment
@         Dates & Times             @2021-01-01
==        Expressions               filter a == b and c != d and e > f
==        Self-equality in join     join s=salaries [==id]
->        Function definitions      func add a b -> a + b
+/-       Sort order                sort [-amount, +date]
??        Coalesce                  amount ?? 0

Pipes

Pipes — the connection between transforms that make up a pipeline — can be either line breaks or a pipe character (|).

In almost all situations, line-breaks pipe the result of a line’s transform into the transform on the following line. For example, the filter transform operates on the result of from employees (which is just the employees table), and the select transform operates on the result of the filter transform.

PRQL

from employees
filter department == "Product"
select [first_name, last_name]

SQL

SELECT
  first_name,
  last_name
FROM
  employees
WHERE
  department = 'Product'

In the place of a line-break, it’s also possible to use the | character to pipe results, such that this is equivalent:

PRQL

from employees | filter department == "Product" | select [first_name, last_name]

SQL

SELECT
  first_name,
  last_name
FROM
  employees
WHERE
  department = 'Product'

A line-break doesn’t create a pipeline in a couple of cases:

  • within a list (e.g. the derive examples below),
  • when the following line is a new statement, which starts with a keyword of func, table or from.

Lists

Lists are represented with [], and can span multiple lines. A final trailing comma is optional.

PRQL

from numbers
derive [x = 1, y = 2]
derive [
  a = x,
  b = y
]
derive [
  c = a,
  d = b,
]

SQL

SELECT
  *,
  1 AS x,
  2 AS y,
  1 AS a,
  2 AS b,
  1 AS c,
  2 AS d
FROM
  numbers

Most transforms can take either a list or a single item, so these are equivalent:

PRQL

from employees
select [first_name]

SQL

SELECT
  first_name
FROM
  employees

PRQL

from employees
select first_name

SQL

SELECT
  first_name
FROM
  employees

Expressions

PRQL is made up of expressions, like 2 + 3 or ((1 + x) * y). In the example below, note the use of expressions to calculate the alias circumference and in the filter transform.

PRQL

from foo
select [
  circumference = diameter * 3.14159,
  color,
]
filter circumference > 10 and color != "red"

SQL

WITH table_1 AS (
  SELECT
    diameter * 3.14159 AS circumference,
    color
  FROM
    foo
)
SELECT
  circumference,
  color
FROM
  table_1
WHERE
  circumference > 10
  AND color <> 'red'

Precedence and Parentheses

Parentheses — () — are used to give precedence to inner expressions.

Note

We realize some of the finer points here are not intuitive. We are considering approaches to make this more intuitive — even at the cost of requiring more syntax in some circumstances. And we’re planning to make the error messages much better, so the compiler is there to help out.

Parentheses are required around:

  • Any nested function call containing a pipe, either the | symbol or a new line. “Nested” means within a transform; i.e. not just the main pipeline.
  • Any function call that isn’t a single item in a list or a pipeline, like sum distance in round 0 (sum distance) [1].
  • A minus sign in a function argument, like in add (-1) (-3)
  • Inner transforms for group, window, and other transforms.

Parentheses are not required around expressions which use operators but no function call, like foo + bar.

[1] Or, technically, it’s on the right side of an assignment in a list…

Here’s a full rundown of the times this applies:

PRQL

from employees
# Requires parentheses, because it contains a pipe
derive is_proximate = (distance | in 0..20)
# Requires parentheses, because it's a function call
derive total_distance = (sum distance)
# `??` doesn't require parentheses, as it's not a function call
derive min_capped_distance = (min distance ?? 5)
# No parentheses needed, because no function call
derive travel_time = distance / 40
# No inner parentheses needed around `1+1` because no function call
derive distance_rounded_2_dp = (round 1+1 distance)
derive [
  # Requires parentheses, because it contains a pipe
  is_far = (distance | in 100..),
  # The left value of the range requires parentheses,
  # because of the minus sign
  is_negative = (distance | in (-100..0)),
  # ...this is equivalent
  is_negative = (distance | in (-100)..0),
  # Doesn't require parentheses, because it's in a list (confusing, see footnote)!
  average_distance = average distance,
]
# Requires parentheses because of the minus sign
sort (-distance)
# A list is fine too
sort [-distance]

SQL

SELECT
  *,
  distance BETWEEN 0 AND 20 AS is_proximate,
  SUM(distance) OVER () AS total_distance,
  MIN(COALESCE(distance, 5)) OVER () AS min_capped_distance,
  distance / 40 AS travel_time,
  ROUND(distance, 2) AS distance_rounded_2_dp,
  distance >= 100 AS is_far,
  distance BETWEEN -100 AND 0,
  distance BETWEEN -100 AND 0 AS is_negative,
  AVG(distance) OVER () AS average_distance
FROM
  employees
ORDER BY
  distance DESC

Inner Transforms

Parentheses are also used for transforms (such as group and window) that pass their result to an “inner transform”. The example below applies the aggregate pipeline to each group of unique title and country values:

PRQL

from employees
group [title, country] (
  aggregate [
    average salary,
    ct = count
  ]
)

SQL

SELECT
  title,
  country,
  AVG(salary),
  COUNT(*) AS ct
FROM
  employees
GROUP BY
  title,
  country

Comments

Comments are represented by #.

PRQL

from employees  # Comment 1
# Comment 2
aggregate [average salary]

SQL

SELECT
  AVG(salary)
FROM
  employees

There’s no distinct multiline comment syntax.
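To comment out several lines, prefix each line with #:

# Exclude contractors from
# the aggregation below.
from employees
aggregate [average salary]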

Quoted identifiers

To use identifiers that are otherwise invalid, surround them with backticks. Depending on the dialect, these will remain as backticks or be converted to double-quotes.

PRQL

prql target:sql.mysql
from employees
select `first name`

SQL

SELECT
  `first name`
FROM
  employees

PRQL

prql target:sql.postgres
from employees
select `first name`

SQL

SELECT
  "first name"
FROM
  employees

PRQL

from `dir/*.parquet`

SQL

SELECT
  *
FROM
  "dir/*.parquet"

BigQuery also uses backticks to surround project & dataset names (even if valid identifiers) in the SELECT statement:

PRQL

prql target:sql.bigquery
from `project-foo.dataset.table`
join `project-bar.dataset.table` [==col_bax]

SQL

SELECT
  `project-foo.dataset.table`.*,
  `project-bar.dataset.table`.*
FROM
  `project-foo.dataset.table`
  JOIN `project-bar.dataset.table` ON `project-foo.dataset.table`.col_bax = `project-bar.dataset.table`.col_bax

Quoting schemas

This is currently not great and we are working on improving it; see
https://github.com/PRQL/prql/issues/1535 for progress.

If supplying a schema without a column — for example in a from or join transform — that also needs to be a quoted identifier:

PRQL

from `music.albums`

SQL

SELECT
  *
FROM
  music.albums

Parameters

PRQL will retain parameters like $1 in SQL output, which can then be supplied to the SQL query:

PRQL

from employees
filter id == $1

SQL

SELECT
  *
FROM
  employees
WHERE
  id = $1

Numbers

Numbers can contain underscores between digits, which can make large numbers easier to read:

PRQL

from numbers
select [
    small = 1.000_000_1,
    big = 5_000_000,
]

SQL

SELECT
  1.0000001 AS small,
  5000000 AS big
FROM
  numbers

Language features

The pages of this section describe how PRQL handles various aspects of the language.

Coalesce

We can coalesce values with the ?? operator. Coalescing takes the first value, or, if that value is null, the second value.

PRQL

from orders
derive amount ?? 0

SQL

SELECT
  *,
  COALESCE(amount, 0)
FROM
  orders

Dates & times

PRQL uses @ followed by a string to represent dates & times. This is less verbose than SQL’s approach of TIMESTAMP '2004-10-19 10:23:54' and more explicit than SQL’s implicit option of just using a string '2004-10-19 10:23:54'.

Note

Currently PRQL passes these strings straight through to the database, and so many compatible format strings may work; we may refine this in the future to aid compatibility across databases. We’ll always support the canonical ISO8601 formats described below.

Dates

Dates are represented by @{yyyy-mm-dd} — a @ followed by the date format.

PRQL

from employees
derive age_at_year_end = (@2022-12-31 - dob)

SQL

SELECT
  *,
  DATE '2022-12-31' - dob AS age_at_year_end
FROM
  employees

Times

Times are represented by @{HH:mm:ss.SSS±Z} with any parts not supplied being rounded to zero, including the timezone, which is represented by +HH:mm, -HH:mm or Z. This is consistent with the ISO8601 time format.

PRQL

from orders
derive should_have_shipped_today = (order_time < @08:30)

SQL

SELECT
  *,
  order_time < TIME '08:30' AS should_have_shipped_today
FROM
  orders

Timestamps

Timestamps are represented by @{yyyy-mm-ddTHH:mm:ss.SSS±Z} / @{date}T{time}, with any time parts not supplied being rounded to zero, including the timezone, which is represented by +HH:mm, -HH:mm or Z. This is @ followed by the ISO8601 datetime format, which uses T to separate date & time.

PRQL

from commits
derive first_prql_commit = @2020-01-01T13:19:55-0800

SQL

SELECT
  *,
  TIMESTAMP '2020-01-01T13:19:55-0800' AS first_prql_commit
FROM
  commits

Intervals

Intervals are represented by {N}{periods}, such as 2years or 10minutes, without a space.

Note

These aren’t the same as ISO8601, because we evaluated P3Y6M4DT12H30M5S to be difficult to understand, but we could support a simplified form if there’s demand for it. We don’t currently support compound expressions, for example 2years10months, but most DBs will allow 2years + 10months. Please raise an issue if this is inconvenient.

PRQL

from projects
derive first_check_in = start + 10days

SQL

SELECT
  *,
  start + INTERVAL 10 DAY AS first_check_in
FROM
  projects
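As the note above mentions, a compound duration can be written as the sum of two intervals. A sketch; whether the database accepts the resulting pair of INTERVAL terms depends on the dialect:

from projects
derive review_date = start + 2years + 10months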

Examples

Here’s a fuller list of examples:

  • @20221231 is forbidden — it must contain full punctuation (- and :),
  • @2022-12-31 is a date
  • @2022-12 or @2022 are forbidden — SQL can’t express a month, only a date
  • @16:54:32.123456 is a time
  • @16:54:32, @16:54, @16 are all allowed, expressing @16:54:32.000000, @16:54:00.000000, @16:00:00.000000 respectively
  • @2022-12-31T16:54:32.123456 is a timestamp without timezone
  • @2022-12-31T16:54:32.123456Z is a timestamp in UTC
  • @2022-12-31T16:54+02:00 and @2022-12-31T16:54+02 are timestamps in UTC+2
  • @16:54+02 is forbidden — time is always local, so it cannot have a timezone
  • @2022-12-31+02 is forbidden — date is always local, so it cannot have a timezone

Roadmap

Datetimes

Datetimes are supported by some databases (e.g. MySQL, BigQuery) in addition to timestamps. When we have type annotations, these will be represented by a timestamp annotated as a datetime:

derive pi_day = @2017-03-14T15:09:26.535898<datetime>

These are some examples we can then add:

  • @2022-12-31T16:54<datetime> is datetime without timezone
  • @2022-12-31<datetime> is forbidden — datetime must specify time
  • @16:54<datetime> is forbidden — datetime must specify date

Distinct

PRQL doesn’t have a specific distinct keyword. Instead, use group and take 1:

PRQL

from employees
select department
group department (
  take 1
)

SQL

SELECT
  DISTINCT department
FROM
  employees

This also works without a linebreak:

PRQL

from employees
select department
group department (take 1)

SQL

SELECT
  DISTINCT department
FROM
  employees

Selecting from each group

We are able to select a single row from each group by combining group and sort:

PRQL

# youngest employee from each department
from employees
group department (
  sort age
  take 1
)

SQL

WITH table_1 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY department
      ORDER BY
        age
    ) AS _expr_0
  FROM
    employees
)
SELECT
  *
FROM
  table_1
WHERE
  _expr_0 <= 1

Roadmap

When using the Postgres dialect, we are planning to compile:

# youngest employee from each department
from employees
group department (
  sort age
  take 1
)

… to …

SELECT DISTINCT ON (department) *
FROM employees
ORDER BY department, age

Null handling

SQL has an unconventional way of handling NULL values, since it treats them as unknown values. As a result, in SQL:

  • NULL is not a value indicating a missing entry, but a placeholder for anything possible,
  • NULL = NULL evaluates to NULL, since one cannot know if one unknown is equal to another unknown,
  • NULL <> NULL evaluates to NULL, by the same logic,
  • to check if a value is NULL, SQL introduces IS NULL and IS NOT NULL operators,
  • DISTINCT column may return multiple NULL values.

For more information, check out the Postgres documentation.

PRQL, on the other hand, treats null as a value, which means that:

  • null == null evaluates to true,
  • null != null evaluates to false,
  • distinct column cannot contain multiple null values.

PRQL

from employees
filter first_name == null
filter null != last_name

SQL

SELECT
  *
FROM
  employees
WHERE
  first_name IS NULL
  AND last_name IS NOT NULL

Note that PRQL doesn’t change how NULL is compared between columns, for example in joins. (PRQL compiles to SQL and so can’t change the behavior of the database).

For more context or to provide feedback check out the discussion on issue #99.
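For example, joins keep SQL’s comparison semantics: rows where both sides of the join key are NULL won’t match. A sketch, with hypothetical users and pets tables and a nullable owner_id column:

from users
join pets [==owner_id]   # compiles to ON users.owner_id = pets.owner_id,
                         # so rows with NULL owner_id never match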

Ranges

PRQL has a concise range syntax start..end. If only one of start & end is supplied, the range is open on the empty side.

Ranges can be used in filters with the in function, with any type of literal, including dates:

PRQL

from events
filter (date | in @1776-07-04..@1787-09-17)
filter (magnitude | in 50..100)
derive is_northern = (latitude | in 0..)

SQL

SELECT
  *,
  latitude >= 0 AS is_northern
FROM
  events
WHERE
  date BETWEEN DATE '1776-07-04' AND DATE '1787-09-17'
  AND magnitude BETWEEN 50 AND 100

Like in SQL, ranges are inclusive.

As discussed in the take docs, ranges can also be used in take:

PRQL

from orders
sort [-value, date]
take 101..110

SQL

SELECT
  *
FROM
  orders
ORDER BY
  value DESC,
  date
LIMIT
  10 OFFSET 100

Note

Half-open ranges are generally less intuitive to read than a simple >= or <= operator.
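For example, these two queries express the same filter (following the is_northern example above, in 18.. compiles to >= 18), and the comparison form is often easier to read:

from employees
filter (age | in 18..)

from employees
filter age >= 18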

Roadmap

We’d like to use ranges for other types, such as whether an object is in an array or list literal.

Regex expressions

Note

At this time, PRQL does not have a facility for regular expression (regex) processing. The Milestone for 0.5 talks about general plans for regexes. Discussion #1123 provides a workaround function that emulates %LIKE%.
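Until then, a LIKE-style match can be expressed with an s-string function, along the lines of the workaround in that discussion (a sketch; the function in #1123 may differ in its details):

func like pattern column -> s"{column} LIKE {pattern}"

from tracks
filter (name | like "%Love%")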

Standard Library

The standard library currently contains commonly used functions from SQL. It’s not yet as broad as we’d like, and we’re very open to expanding it.

Currently s-strings are an escape-hatch for any function that isn’t in our standard library. If we find ourselves using them for something frequently, raise an issue and we’ll add it to the stdlib.

Note

Currently the stdlib implementation doesn’t support different DB implementations itself; those need to be built deeper into the compiler. We’ll resolve this at some point. Until then, we’ll only add functions here that are broadly supported by most DBs.

Here’s the source of the current PRQL std:

# Aggregate Functions
func min <scalar|column> column -> null
func max <scalar|column> column -> null
func sum <scalar|column> column -> null
func avg <scalar|column> column -> null
func stddev <scalar|column> column -> null
func average <scalar|column> column -> null
func count <scalar|column> non_null:s"*" -> null
# TODO: Possibly make this into `count distinct:true` (or like `distinct:` as an
# abbreviation of that?)
func count_distinct <scalar|column> column -> null

# Window functions
func lag<column> offset column -> null
func lead<column> offset column -> null
func first<column> offset column -> null
func last<column> offset column -> null
func rank<column> -> null
func rank_dense<column> -> null
func row_number<column> -> null

# Other functions
func round<scalar> n_digits column -> null
func as<scalar> `noresolve.type` column -> null
func in<bool> pattern value -> null

# Transform type definitions
func from<table> `default_db.source`<table> -> null
func select<table> columns<column> tbl<table> -> null
func filter<table> condition<bool> tbl<table> -> null
func derive<table> columns<column> tbl<table> -> null
func aggregate<table> a<column> tbl<table> -> null
func sort<table> by tbl<table> -> null
func take<table> expr tbl<table> -> null
func join<table> `default_db.with`<table> filter `noresolve.side`:inner tbl<table> -> null
func group<table> by pipeline tbl<table> -> null
func window<table> rows:0..0 range:0..0 expanding:false rolling:0 pipeline tbl<table> -> null

func append<table> `default_db.bottom`<table> top<table> -> null
func intersect<table> `default_db.bottom`<table> top<table> -> (
    from t = _param.top
    join b = _param.bottom (all (map _eq (zip t.* b.*)))
    select t.*
)
func remove<table> `default_db.bottom`<table> top<table> -> (
    from t = _param.top
    join side:left b = _param.bottom (all (map _eq (zip t.* b.*)))
    filter (all (map _is_null b.*))
    select t.*
)

# List functions
func all<bool> list<list> -> null
func map<list> fn list<list> -> null
func zip<list> a<list> b<list> -> null
func _eq<list> a<list> -> null
func _is_null a -> _param.a == null

# Misc
func from_text<table> text<string> `noresolve.format`:csv -> null

And a couple of examples:

PRQL

from employees
derive [
  gross_salary = (salary + payroll_tax | as int),
  gross_salary_rounded = (gross_salary | round 0),
  time = s"NOW()",  # an s-string, given no `now` function exists in PRQL
]

SQL

SELECT
  *,
  CAST(salary + payroll_tax AS int) AS gross_salary,
  ROUND(CAST(salary + payroll_tax AS int), 0) AS gross_salary_rounded,
  NOW() AS time
FROM
  employees

Strings

Strings in PRQL can use either single or double quotes:

PRQL

from my_table
select x = "hello world"

SQL

SELECT
  'hello world' AS x
FROM
  my_table

PRQL

from my_table
select x = 'hello world'

SQL

SELECT
  'hello world' AS x
FROM
  my_table

To quote a string containing quotes, either use the “other” type of quote, or use three-or-more quotes, and close with the same number.

PRQL

from my_table
select x = '"hello world"'

SQL

SELECT
  '"hello world"' AS x
FROM
  my_table

PRQL

from my_table
select x = """I said "hello world"!"""

SQL

SELECT
  'I said "hello world"!' AS x
FROM
  my_table

PRQL

from my_table
select x = """""I said """hello world"""!"""""

SQL

SELECT
  'I said """hello world"""!' AS x
FROM
  my_table

F-Strings and S-Strings

These special case strings can be used to:

F-Strings - Build up a new string from a set of columns or values

S-Strings - Insert SQL statements directly into the query. Use when PRQL doesn’t have an equivalent facility.

Note

Currently PRQL does not adjust escape characters.

Warning

Currently PRQL allows multiline strings with either a single character or multiple character quotes. This may change for strings using a single character quote in future versions.

S-Strings

An s-string inserts SQL directly, as an escape hatch when there’s something that PRQL doesn’t yet implement. For example, PRQL has no version() function, so if we want Postgres’s version(), we use an s-string:

PRQL

from my_table
select db_version = s"version()"

SQL

SELECT
  version() AS db_version
FROM
  my_table

Embed a column name in an s-string using braces. For example, PRQL’s standard library defines the average function as:

func average column -> s"AVG({column})"

So this compiles using the function:

PRQL

from employees
aggregate [average salary]

SQL

SELECT
  AVG(salary)
FROM
  employees

Here’s an example of a more involved use of an s-string:

PRQL

from de=dept_emp
join s=salaries side:left [
  (s.emp_no == de.emp_no),
  s"""({s.from_date}, {s.to_date})
  OVERLAPS
  ({de.from_date}, {de.to_date})"""
]

SQL

SELECT
  de.*,
  s.*
FROM
  dept_emp AS de
  LEFT JOIN salaries AS s ON s.emp_no = de.emp_no
  AND (s.from_date, s.to_date) OVERLAPS (de.from_date, de.to_date)

For those who have used python, s-strings are similar to python’s f-strings, but the result is SQL code, rather than a string literal. For example, a python f-string of f"average{col}" would produce "average(salary)", with quotes; while in PRQL, s"average{col}" produces average(salary), without quotes.

We can also use s-strings to produce a full table:

PRQL

from s"SELECT DISTINCT ON first_name, id, age FROM employees ORDER BY age ASC"
join s = s"SELECT * FROM salaries" [==id]

SQL

WITH table_2 AS (
  SELECT
    DISTINCT ON first_name,
    id,
    age
  FROM
    employees
  ORDER BY
    age ASC
),
table_3 AS (
  SELECT
    *
  FROM
    salaries
)
SELECT
  table_0.*,
  table_1.*
FROM
  table_2 AS table_0
  JOIN table_3 AS table_1 ON table_0.id = table_1.id

Note

S-strings in user code are intended as an escape-hatch for an unimplemented feature. If we often need s-strings to express something, that’s a sign we should implement it in PRQL or PRQL’s stdlib.

Braces

To output braces from an s-string, use double braces:

PRQL

from employees
derive [
  has_valid_title = s"regexp_contains(title, '([a-z0-9]*-){{2,}}')"
]

SQL

SELECT
  *,
  regexp_contains(title, '([a-z0-9]*-){2,}') AS has_valid_title
FROM
  employees

Precedence

The PRQL compiler simply places a literal copy of each variable into the resulting string, which means we may get surprising behavior when the variable has multiple terms and the s-string isn’t parenthesized.

In this toy example, salary + benefits / 365 gets the precedence wrong:

PRQL

from employees
derive [
  gross_salary = salary + benefits,
  daily_rate = s"{gross_salary} / 365"
]

SQL

SELECT
  *,
  salary + benefits AS gross_salary,
  salary + benefits / 365 AS daily_rate
FROM
  employees

Instead, put the numerator {gross_salary} in parentheses:

PRQL

from employees
derive [
  gross_salary = salary + benefits,
  daily_rate = s"({gross_salary}) / 365"
]

SQL

SELECT
  *,
  salary + benefits AS gross_salary,
  (salary + benefits) / 365 AS daily_rate
FROM
  employees

F-Strings

F-strings are a readable approach to building new strings from existing strings. Currently PRQL supports this for concatenating strings:

PRQL

from employees
select full_name = f"{first_name} {last_name}"

SQL

SELECT
  CONCAT(first_name, ' ', last_name) AS full_name
FROM
  employees

This can be much easier to read for longer strings, relative to the SQL approach:

PRQL

from web
select url = f"http{tls}://www.{domain}.{tld}/{page}"

SQL

SELECT
  CONCAT(
    'http',
    tls,
    '://www.',
    domain,
    '.',
    tld,
    '/',
    page
  ) AS url
FROM
  web

Roadmap

In the future, f-strings may incorporate string formatting such as datetimes, numbers, and padding. If there’s a feature that would be helpful, please post an issue.

Switch

Note

switch is currently experimental and may change behavior in the near future

PRQL uses switch for both SQL’s CASE and IF statements. Here’s an example:

PRQL

from employees
derive distance = switch [
  city == "Calgary" -> 0,
  city == "Edmonton" -> 300,
]

SQL

SELECT
  *,
  CASE
    WHEN city = 'Calgary' THEN 0
    WHEN city = 'Edmonton' THEN 300
    ELSE NULL
  END AS distance
FROM
  employees

If no condition is met, the result is null. To set a default, use a true condition:

PRQL

from employees
derive distance = switch [
  city == "Calgary" -> 0,
  city == "Edmonton" -> 300,
  true -> "Unknown",
]

SQL

SELECT
  *,
  CASE
    WHEN city = 'Calgary' THEN 0
    WHEN city = 'Edmonton' THEN 300
    ELSE 'Unknown'
  END AS distance
FROM
  employees

Target & Version

Target dialect

PRQL allows specifying a target dialect at the top of the query, which allows PRQL to compile to a database-specific SQL flavor.

Examples

PRQL

prql target:sql.postgres

from employees
sort age
take 10

SQL

SELECT
  *
FROM
  employees
ORDER BY
  age
LIMIT
  10

PRQL

prql target:sql.mssql

from employees
sort age
take 10

SQL

SELECT
  TOP (10) *
FROM
  employees
ORDER BY
  age

Supported dialects

Note

Note that dialect support is early — most differences are not implemented, and most dialects’ implementations are identical to generic’s. Contributions are very welcome.

  • sql.ansi
  • sql.bigquery
  • sql.clickhouse
  • sql.generic
  • sql.hive
  • sql.mssql
  • sql.mysql
  • sql.postgres
  • sql.sqlite
  • sql.snowflake
  • sql.duckdb

Version

PRQL allows specifying a version of the language in the PRQL header, like:

PRQL

prql version:"0.4"

from employees

SQL

SELECT
  *
FROM
  employees

This has two roles, one of which is implemented:

  • The compiler will raise an error if the compiler is older than the query version. This prevents confusing errors when queries use newer features of the language but the compiler hasn’t yet been upgraded.
  • The compiler will compile for the major version of the query. This allows the language to evolve without breaking existing queries, or forcing multiple installations of the compiler. This isn’t yet implemented, but is a gating feature for PRQL 1.0.

Bindings

PRQL has bindings for many languages. These include:

Elixir (prql)

PRQL bindings for Elixir.

Installation

def deps do
  [
    {:prql, "~> 0.1.0"}
  ]
end

Basic Usage

iex> PRQL.compile("from customers")
{:ok, "SELECT\n  *\nFROM\n  customers\n\n-- Generated by PRQL compiler version 0.3.1 (https://prql-lang.org)\n"}

iex> PRQL.compile("from customers\ntake 10", dialect: :mssql)
{:ok, "SELECT\n  TOP (10) *\nFROM\n  customers\n\n-- Generated by PRQL compiler version 0.3.1 (https://prql-lang.org)\n"}

Development

We are in the early stages of developing Elixir bindings.

We’re using Rustler to provide rust bindings for prql-compiler.

Currently using the bindings in an Elixir project requires compiling the rust crate from this repo:

  • Install dependencies with mix deps.get
  • Compile the project with mix compile
  • Run tests with mix test

Future work includes publishing pre-compiled artifacts, so Elixir projects can run PRQL without needing a rust toolchain.

Java (prql-java)

prql-java offers Java bindings to the prql-compiler rust library. It exposes a java native method public static native String toSql(String query).

Installation

<dependency>
    <groupId>org.prqllang</groupId>
    <artifactId>prql-java</artifactId>
    <version>${PRQL_VERSION}</version>
</dependency>

Usage

import org.prqllang.prql4j.PrqlCompiler;

class Main {
    public static void main(String[] args) {
        String sql = PrqlCompiler.toSql("from table");
        System.out.println(sql);
    }
}

prql-js

JavaScript bindings for prql-compiler.

Installation

npm install prql-js

Usage

Currently these functions are exposed:

function compile(prql_query: string, options?: CompileOptions): string;

function prql_to_pl(prql_query: string): string;

function pl_to_rq(pl_json: string): string;

function rq_to_sql(rq_json: string): string;

From NodeJS

Direct usage

const prqljs = require("prql-js");

const sql = prqljs.compile(`from employees | select first_name`);
console.log(sql.sql);

Template literal

const prqljs = require("prql-js");
const prql = (string) => prqljs.compile(string[0] || "");

const sql = prql`from employees | select first_name`;
console.log(sql.sql);

Template literal with newlines

const prqljs = require("prql-js");
const prql = (string) => prqljs.compile(string[0] || "");

const sql = prql`
    from employees
    select first_name
`;
console.log(sql.sql);

From a Browser

<html>
  <head>
    <script src="./node_modules/prql-js/dist/web/prql_js.js"></script>
    <script>
      const { compile } = wasm_bindgen;

      async function run() {
        await wasm_bindgen("./node_modules/prql-js/dist/web/prql_js_bg.wasm");
        const sql = compile("from employees | select first_name");

        console.log(sql);
      }

      run();
    </script>
  </head>

  <body></body>
</html>

From a Framework or a Bundler

import compile from "prql-js/dist/bundler";

const sql = compile(`from employees | select first_name`);
console.log(sql);

Errors

Errors are returned as the following object, serialized as a JSON array:

interface ErrorMessage {
  /// Plain text of the error
  reason: string;
  /// A list of suggestions of how to fix the error
  hint: string | null;
  /// Character offset of error origin within a source file
  span: [number, number] | null;

  /// Annotated code, containing cause and hints.
  display: string | null;
  /// Line and column number of error origin within a source file
  location: SourceLocation | null;
}

/// Location within the source file.
/// Tuples contain:
/// - line number (0-based),
/// - column number within that line (0-based),
interface SourceLocation {
  start: [number, number];

  end: [number, number];
}

These errors can be caught as such:

try {
  const sql = prqlJs.compile(`from employees | foo first_name`);
} catch (error) {
  const errorMessages = JSON.parse(error.message).inner;

  console.log(errorMessages[0].display);
  console.log(errorMessages[0].location);
}

Development

Build:

npm run build

This builds Node, bundler and web packages in the dist path.

Test:

npm test

Notes

  • This uses wasm-pack to generate bindings [1].
  • We’ve added an npm layer on top of the usual approach of just using wasm-pack, so we can distribute a single package with targets of node, bundler and no-modules — somewhat inverting the approach recommended by wasm-pack. The build instruction goes in a build script, rather than a pack script.

[1] Though we would be very open to other approaches, given wasm-pack does not seem maintained, and we’re eliding many of its features to build for three targets.

Python (prql-python)

Installation

pip install prql-python

Usage

import prql_python as prql

prql_query = """
    from employees
    join salaries [==emp_id]
    group [dept_id, gender] (
      aggregate [
        avg_salary = average salary
      ]
    )
"""

sql = prql.compile(prql_query)

R (prqlr)

R bindings for prql-compiler. Check out https://eitsupi.github.io/prqlr/ for more context.

Note

prqlr is generously maintained by @eitsupi in the eitsupi/prqlr repo.

Installation

install.packages("prqlr")

Usage

library(prqlr)

"
from employees
join salaries [emp_id]
group [dept_id, gender] (
  aggregate [
    avg_salary = average salary
  ]
)
" |>
  prql_compile()

Rust (prql-compiler)

Installation

cargo new myproject
cd myproject
cargo add prql-compiler

Usage

cargo run

src/main.rs

use prql_compiler::compile;
use prql_compiler::sql;

fn main() {
    let prql = "from employees | select [name,age]  ";
    let opt = sql::Options {
             format: true,
             dialect: Some(sql::Dialect::SQLite),
             signature_comment: true,
         };
    let sql = compile(&prql, Some(opt)).unwrap();
    println!("PRQL: {}\nSQLite: {}", prql, sql);
}

Cargo.toml

[package]
name = "myproject"
version = "0.1.0"
edition = "2021"

[dependencies]
prql-compiler = "0.4.0"

Integrations

PRQL is building integrations with lots of external tools, including:

dbt-prql

Original docs at https://github.com/prql/dbt-prql

dbt-prql allows writing PRQL in dbt models. This combines the benefits of PRQL’s power & simplicity within queries, with dbt’s version control, lineage & testing across queries.

Once dbt-prql is installed, dbt commands compile PRQL between {% prql %} & {% endprql %} jinja tags to SQL as part of dbt’s compilation. No additional config is required.

Examples

Simple example

{% prql %}
from employees
filter (age | in 20..30)
{% endprql %}

…would appear to dbt as:

SELECT
  employees.*
FROM
  employees
WHERE
  age BETWEEN 20
  AND 30

Less simple example

{% prql %}
from {{ source('salesforce', 'in_process') }}
derive expected_sales = probability * value
join {{ ref('team', 'team_sales') }} [==name]
group name (
  aggregate (expected_sales)
)
{% endprql %}

…would appear to dbt as:

SELECT
  name,
  {{ source('salesforce', 'in_process') }}.probability * {{ source('salesforce', 'in_process') }}.value AS expected_sales
FROM
  {{ source('salesforce', 'in_process') }}
  JOIN {{ ref('team', 'team_sales') }} USING(name)
GROUP BY
  name

…and then dbt will compile the source and refs to a full SQL query.

Replacing macros

dbt’s use of macros has saved many of us many lines of code, and even saved some people some time. But imperatively programming text generation with code like if not loop.last is not our highest calling. It’s the “necessary” part rather than beautiful part of dbt.

Here’s the canonical example of macros in the dbt documentation:

{%- set payment_methods = ["bank_transfer", "credit_card", "gift_card"] -%}

select
order_id,
{%- for payment_method in payment_methods %}
sum(case when payment_method = '{{payment_method}}' then amount end) as {{payment_method}}_amount
{%- if not loop.last %},{% endif -%}
{% endfor %}
from {{ ref('raw_payments') }}
group by 1

Here’s that model using PRQL, including the prql jinja tags.

{% prql %}
func filter_amount method -> s"sum(case when payment_method = '{method}' then amount end) as {method}_amount"

from {{ ref('raw_payments') }}
group order_id (
  aggregate [
    filter_amount bank_transfer,
    filter_amount credit_card,
    filter_amount gift_card,
  ]
)
{% endprql %}

As well as the query being simpler in its final form, writing in PRQL also gives us live feedback around any errors, on every keystroke. Though there’s much more to come, check out the current version on the PRQL Playground.

What it does

When dbt compiles models to SQL queries:

  • Any text in a dbt model between {% prql %} and {% endprql %} tags is compiled from PRQL to SQL before being passed to dbt.
  • The PRQL compiler passes text containing {{ & }} through to dbt without modification, which allows us to embed jinja expressions in PRQL. (This was added to PRQL specifically for this use-case.)
  • dbt will then compile the resulting model into its final form of raw SQL, and dispatch it to the database, as per usual.

There’s no config needed in the dbt project; this works automatically on any dbt command (e.g. dbt run) assuming dbt-prql is installed.

Installation

pip install dbt-prql

Current state

Currently this is new, but fairly feature-complete. It’s enthusiastically supported — if there are any problems, please open an issue.

Jupyter

Original docs at https://pyprql.readthedocs.io/en/latest/magic_readme.html

Work with pandas and PRQL in an IPython terminal or Jupyter notebook.

Implementation

This is a thin wrapper around the fantastic IPython-sql magic. Roughly speaking, all we do is compile PRQL to SQL and pass that through to ipython-sql. Full documentation of the supported features is available at their repository. Here, we document the places where we differ from them, plus the features we think you are most likely to find useful.

Usage

Installation

If you have already installed PyPRQL into your environment, then you should be good to go! We bundle in IPython and pandas, though you’ll need to install Jupyter separately. If you haven’t installed PyPRQL, that’s as simple as:

pip install pyprql

Set Up

Open up either an IPython terminal or Jupyter notebook. First, we need to load the extension and connect to a database.

In [1]: %load_ext pyprql.magic

Connecting a database

We have two options for connecting a database:

  1. Create an in-memory DB. This is the easiest way to get started.

    In [2]: %prql duckdb:///:memory:
    

    However, in-memory databases start off empty! So, we need to add some data. We have two options:

    • We can easily add a pandas dataframe to the DuckDB database like so:

      In [3]: %prql --persist df
      

      where df is a pandas dataframe. This adds a table named df to the in-memory DuckDB instance.

    • Or download a CSV and query it directly, with DuckDB:

      !wget https://github.com/graphql-compose/graphql-compose-examples/blob/master/examples/northwind/data/csv/products.csv
      

      …and then from products.csv will work.

  2. Connect to an existing database

    When connecting to a database, pass the connection string as an argument to the line magic %prql, as shown below. The connection string needs to be in SQLAlchemy format, so any connection supported by SQLAlchemy is supported by the magic. Additional connection parameters can be passed as a dictionary using the --connection_arguments flag to the %prql line magic. We ship with the necessary extensions to use DuckDB as the backend, as in the in-memory example above.
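For example, connecting to a hypothetical local Postgres database would look like:

%prql postgresql://user:password@localhost:5432/employees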

Querying

Now, let’s do a query! By default, PRQLMagic always returns the results as a dataframe, and always prints the results. The results of the previous query are accessible in the _ variable.

These examples are based on the products.csv example above.



In [4]: %%prql
   ...: from p = products.csv
   ...: filter supplierID == 1

Done.
Returning data to local variable _
   productID    productName  supplierID  categoryID      quantityPerUnit  unitPrice  unitsInStock  unitsOnOrder  reorderLevel  discontinued
0          1           Chai           1           1   10 boxes x 20 bags       18.0            39             0            10             0
1          2          Chang           1           1   24 - 12 oz bottles       19.0            17            40            25             0
2          3  Aniseed Syrup           1           2  12 - 550 ml bottles       10.0            13            70            25             0
In [5]: %%prql
   ...: from p = products.csv
   ...: group categoryID (
   ...:   aggregate [average unitPrice]
   ...: )

Done.
Returning data to local variable _
   categoryID  avg("unitPrice")
0           1         37.979167
1           2         23.062500
2           7         32.370000
3           6         54.006667
4           8         20.682500
5           4         28.730000
6           3         25.160000
7           5         20.250000

We can capture the results into a different variable like so:

In [6]: %%prql results <<
   ...: from p = products.csv
   ...: aggregate [min unitsInStock, max unitsInStock]

Done.
Returning data to local variable results
   min("unitsInStock")  max("unitsInStock")
0                    0                  125

Now, the output of the query is saved to results.

Prefect

Because Prefect is native Python, it’s extremely easy to integrate with PRQL.

With a Postgres Task, replace:

PostgresExecute.run(..., query=sql)

…with…

PostgresExecute.run(..., query=pyprql.to_sql(prql))

We’re big fans of Prefect, and if there is anything that would make the integration easier, please open an issue.

VSCode extension

PRQL has a VSCode Extension that compiles a PRQL query in a VSCode editor and displays the resulting SQL code in a second pane on the side. This is very handy for editing, saving, and reusing PRQL queries in VSCode.

To install the VSCode extension, open VSCode, type Ctrl-Shift-P (Cmd-Shift-P on a Mac), and type PRQL. Install the extension as usual.

Repo for the PRQL VSCode Extension

Rill

PRQL has had some work to integrate with Rill. See the Rill Issues for more details.

Examples

These examples are rewritten from other languages such as SQL. They try to express real-world problems in PRQL, covering most of the language features. We are looking for different use-cases of data transformation, be it database queries, semantic business modeling or data cleaning.

If you want to help, translate some of your queries to PRQL and open a PR to add them here!

PRQL

from employees
filter country == "USA"                       # Each line transforms the previous result.
derive [                                     # This adds columns / variables.
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost  # Variables can use other variables.
]
filter gross_cost > 0
group [title, country] (                     # For each group use a nested pipeline
  aggregate [                                # Aggregate each group to a single row
    average salary,
    average gross_salary,
    sum salary,
    sum gross_salary,
    average gross_cost,
    sum_gross_cost = sum gross_cost,
    ct = count,
  ]
)
sort sum_gross_cost
filter ct > 200
take 20

SQL

WITH table_1 AS (
  SELECT
    title,
    country,
    salary + payroll_tax + benefits_cost AS _expr_0,
    salary + payroll_tax AS _expr_1,
    salary
  FROM
    employees
  WHERE
    country = 'USA'
)
SELECT
  title,
  country,
  AVG(salary),
  AVG(_expr_1),
  SUM(salary),
  SUM(_expr_1),
  AVG(_expr_0),
  SUM(_expr_0) AS sum_gross_cost,
  COUNT(*) AS ct
FROM
  table_1
WHERE
  _expr_0 > 0
GROUP BY
  title,
  country
HAVING
  COUNT(*) > 200
ORDER BY
  sum_gross_cost
LIMIT
  20

PRQL

from employees
group [emp_no] (
  aggregate [
    emp_salary = average salary     # average salary resolves to "AVG(salary)" (from stdlib)
  ]
)
join titles [==emp_no]
group [title] (
  aggregate [
    avg_salary = average emp_salary
  ]
)
select salary_k = avg_salary / 1000 # avg_salary should resolve to "AVG(emp_salary)"
take 10                             # induces new SELECT
derive salary = salary_k * 1000     # salary_k should not resolve to "avg_salary / 1000"

SQL

WITH table_1 AS (
  SELECT
    AVG(salary) AS _expr_0,
    emp_no
  FROM
    employees
  GROUP BY
    emp_no
)
SELECT
  AVG(table_1._expr_0) / 1000 AS salary_k,
  AVG(table_1._expr_0) / 1000 * 1000 AS salary
FROM
  table_1
  JOIN titles ON table_1.emp_no = titles.emp_no
GROUP BY
  titles.title
LIMIT
  10

Single item is coerced into a list

PRQL

from employees
select salary

SQL

SELECT
  salary
FROM
  employees

Same as above but with salary in a list:

PRQL

from employees
select [salary]

SQL

SELECT
  salary
FROM
  employees

Multiple items

PRQL

from employees
derive [
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost
]

SQL

SELECT
  *,
  salary + payroll_tax AS gross_salary,
  salary + payroll_tax + benefits_cost AS gross_cost
FROM
  employees

Same as above but split into two lines:

PRQL

from employees
derive gross_salary = salary + payroll_tax
derive gross_cost = gross_salary + benefits_cost

SQL

SELECT
  *,
  salary + payroll_tax AS gross_salary,
  salary + payroll_tax + benefits_cost AS gross_cost
FROM
  employees

PRQL

let newest_employees = (
  from employees
  sort tenure
  take 50
)

let average_salaries = (
  from salaries
  group country (
    aggregate average_country_salary = (average salary)
  )
)

from newest_employees
join average_salaries [==country]
select [name, salary, average_country_salary]

SQL

WITH average_salaries AS (
  SELECT
    country,
    AVG(salary) AS average_country_salary
  FROM
    salaries
  GROUP BY
    country
),
newest_employees AS (
  SELECT
    *
  FROM
    employees
  ORDER BY
    tenure
  LIMIT
    50
)
SELECT
  newest_employees.name,
  newest_employees.salary,
  average_salaries.average_country_salary
FROM
  newest_employees
  JOIN average_salaries ON newest_employees.country = average_salaries.country

Employees

These are homework tasks on the employees database.

Clone and init the database (requires a local PostgreSQL instance):

psql -U postgres -c 'CREATE DATABASE employees;'
git clone https://github.com/vrajmohan/pgsql-sample-data.git
psql -U postgres -d employees -f pgsql-sample-data/employee/employees.dump

Execute a PRQL query:

cd prql-compiler
cargo run compile examples/employees/average-title-salary.prql | psql -U postgres -d employees

Task 1

Rank the employee titles according to the average salary for each department.

My solution:

  • for each employee, find their average salary
  • join employees with their departments and titles (duplicating employees for each of their titles and departments)
  • group by department and title, aggregating average salary
  • join with departments to get the department name

PRQL

from salaries
group [emp_no] (
  aggregate [emp_salary = average salary]
)
join t=titles [==emp_no]
join dept_emp side:left [==emp_no]
group [dept_emp.dept_no, t.title] (
  aggregate [avg_salary = average emp_salary]
)
join departments [==dept_no]
select [dept_name, title, avg_salary]

SQL

WITH table_1 AS (
  SELECT
    AVG(salary) AS _expr_0,
    emp_no
  FROM
    salaries
  GROUP BY
    emp_no
),
table_2 AS (
  SELECT
    t.title,
    AVG(table_1._expr_0) AS avg_salary,
    dept_emp.dept_no
  FROM
    table_1
    JOIN titles AS t ON table_1.emp_no = t.emp_no
    LEFT JOIN dept_emp ON table_1.emp_no = dept_emp.emp_no
  GROUP BY
    dept_emp.dept_no,
    t.title
)
SELECT
  departments.dept_name,
  table_2.title,
  table_2.avg_salary
FROM
  table_2
  JOIN departments ON table_2.dept_no = departments.dept_no

Task 2

Estimate the distribution of salaries and gender for each department.

PRQL

from e=employees
join salaries [==emp_no]
group [e.emp_no, e.gender] (
  aggregate [
    emp_salary = average salaries.salary
  ]
)
join de=dept_emp [==emp_no] side:left
group [de.dept_no, gender] (
  aggregate [
    salary_avg = average emp_salary,
    salary_sd = stddev emp_salary,
  ]
)
join departments [==dept_no]
select [dept_name, gender, salary_avg, salary_sd]

SQL

WITH table_1 AS (
  SELECT
    e.gender,
    AVG(salaries.salary) AS _expr_0,
    e.emp_no
  FROM
    employees AS e
    JOIN salaries ON e.emp_no = salaries.emp_no
  GROUP BY
    e.emp_no,
    e.gender
),
table_2 AS (
  SELECT
    table_1.gender,
    AVG(table_1._expr_0) AS salary_avg,
    STDDEV(table_1._expr_0) AS salary_sd,
    de.dept_no
  FROM
    table_1
    LEFT JOIN dept_emp AS de ON table_1.emp_no = de.emp_no
  GROUP BY
    de.dept_no,
    table_1.gender
)
SELECT
  departments.dept_name,
  table_2.gender,
  table_2.salary_avg,
  table_2.salary_sd
FROM
  table_2
  JOIN departments ON table_2.dept_no = departments.dept_no

Task 3

Estimate the distribution of salaries and gender for each manager.

PRQL

from e=employees
join salaries [==emp_no]
group [e.emp_no, e.gender] (
  aggregate [
    emp_salary = average salaries.salary
  ]
)
join de=dept_emp [==emp_no]
join dm=dept_manager [
  (dm.dept_no == de.dept_no) and s"(de.from_date, de.to_date) OVERLAPS (dm.from_date, dm.to_date)"
]
group [dm.emp_no, gender] (
  aggregate [
    salary_avg = average emp_salary,
    salary_sd = stddev emp_salary
  ]
)
derive mng_no = emp_no
join managers=employees [==emp_no]
derive mng_name = s"managers.first_name || ' ' || managers.last_name"
select [mng_name, managers.gender, salary_avg, salary_sd]

SQL

WITH table_1 AS (
  SELECT
    e.gender,
    AVG(salaries.salary) AS _expr_0,
    e.emp_no
  FROM
    employees AS e
    JOIN salaries ON e.emp_no = salaries.emp_no
  GROUP BY
    e.emp_no,
    e.gender
),
table_2 AS (
  SELECT
    AVG(table_1._expr_0) AS salary_avg,
    STDDEV(table_1._expr_0) AS salary_sd,
    dm.emp_no
  FROM
    table_1
    JOIN dept_emp AS de ON table_1.emp_no = de.emp_no
    JOIN dept_manager AS dm ON dm.dept_no = de.dept_no
    AND (de.from_date, de.to_date) OVERLAPS (dm.from_date, dm.to_date)
  GROUP BY
    dm.emp_no,
    table_1.gender
)
SELECT
  managers.first_name || ' ' || managers.last_name AS mng_name,
  managers.gender,
  table_2.salary_avg,
  table_2.salary_sd
FROM
  table_2
  JOIN employees AS managers ON table_2.emp_no = managers.emp_no

Task 4

Find the distributions of titles, salaries and genders for each department.

PRQL

from de=dept_emp
join s=salaries side:left [
  (s.emp_no == de.emp_no),
  s"({s.from_date}, {s.to_date}) OVERLAPS ({de.from_date}, {de.to_date})"
]
group [de.emp_no, de.dept_no] (
  aggregate salary = (average s.salary)
)
join employees [==emp_no]
join titles [==emp_no]
select [dept_no, salary, employees.gender, titles.title]

SQL

WITH table_1 AS (
  SELECT
    de.dept_no,
    AVG(s.salary) AS salary,
    de.emp_no
  FROM
    dept_emp AS de
    LEFT JOIN salaries AS s ON s.emp_no = de.emp_no
    AND (s.from_date, s.to_date) OVERLAPS (de.from_date, de.to_date)
  GROUP BY
    de.emp_no,
    de.dept_no
)
SELECT
  table_1.dept_no,
  table_1.salary,
  employees.gender,
  titles.title
FROM
  table_1
  JOIN employees ON table_1.emp_no = employees.emp_no
  JOIN titles ON table_1.emp_no = titles.emp_no

Contributing

If you’re interested in joining the community to build a better SQL, here are ways to start:

PRQL is evolving from a project with lots of excitement into a project that folks are using in their work and integrating into their tools. We’re actively looking for collaborators to lead that growth with us.

Areas for larger contributions

Compiler

The compiler is written in Rust, and there’s enough to do that any level of experience with Rust is sufficient.

We try to keep a few onboarding issues on hand under the “good first issue” label. These have been screened to have sufficient context to get started (and we very much welcome questions where there’s some context missing).

To get started, check out the docs on Development and the Compiler architecture.

And if you have questions, there are lots of friendly people on the Discord who will patiently help you.

Bindings & integrations

For PRQL to be successful, it needs to be available for the languages & tools that people already use.

  • We currently have bindings to the PRQL compiler in a few different languages; many of these can be improved, documented, and packaged in a better way.
  • If you have experience with packaging in an ecosystem that doesn’t currently have bindings, creating PRQL bindings for that language would be valuable to the project.
  • If there’s a tool that you use yourself to run data queries which you think would benefit from a PRQL integration, suggest it to us or to the tool’s maintainers. If it’s open-source, build & share a prototype.

Relevant issues are labeled Integrations.

Language design

We decide on new language features in GitHub issues, usually under “language design” label.

You can also contribute by:

  • Finding instances where the compiler produces incorrect results, and posting a bug report — feel free to use the playground.
  • Opening an issue, or appending to an existing one, with examples of queries that are difficult to express in PRQL — especially if they’re more difficult than in SQL.

With sufficient examples, suggest a change to the language! (Though suggestions without examples are difficult to engage with, so please do anchor suggestions in examples.)

Marketing

  • Improve our website. We have a few issues open on this front and are looking for anyone with at least some design skills.
  • Contribute towards the docs. Anything from shaping a whole section of the docs, to simply improving a confusing paragraph or fixing a typo.
  • Tell people about PRQL.
  • Find a group of users who would be interested in PRQL, help them get up to speed, help the project understand what they need.

Core team

If you have any questions or feedback and don’t receive a response on one of the general channels such as GitHub or Discord, feel free to reach out to:

Core team Emeritus

Thank you to those who have previously served on the core team:

Development

Setting up an initial dev environment

We can set up a local development environment sufficient for navigating, editing, and testing PRQL’s compiler code in two minutes:

  • Install rustup & cargo.

  • [Optional but highly recommended] Install cargo-insta, our testing framework:

    cargo install cargo-insta
    
  • That’s it! Running the unit tests for the prql-compiler crate after cloning the repo should complete successfully:

    cargo test -p prql-compiler --lib
    

    …or, to run tests and update the test snapshots:

    cargo insta test --accept -p prql-compiler --lib
    

    There’s more context on our tests in How we test below.

That’s sufficient for making an initial contribution to the compiler.


Setting up a full dev environment

Note: We really care about this process being easy, both because the project benefits from more contributors like you, and to reciprocate your future contribution. If something isn’t easy, please let us know in a GitHub Issue. We’ll enthusiastically help you, and use your feedback to improve the scripts & instructions.

For more advanced development (for example, compiling for wasm or previewing the website), we have two options:

Option 1: Use the project’s task

Note: This is tested on MacOS, should work on Linux, but won’t work on Windows.

  • Install Task; either brew install go-task/tap/go-task or as described on the Task website

  • Then run the setup-dev task. This runs commands from our Taskfile.yml, installing dependencies with cargo, brew, npm & pip, and suggests some VSCode extensions.

    task setup-dev
    

Option 2: Install tools individually

  • We’ll need cargo-insta, to update snapshot tests:

    cargo install cargo-insta
    
  • We’ll need a couple of additional components, which most systems will have already. The easiest way to check whether they’re installed is to try running the full tests:

    cargo test
    

    …and if that doesn’t complete successfully, check we have:

    • A clang compiler, to compile the DuckDB integration tests, since we use duckdb-rs. To install one:

      • On MacOS, install the Xcode command-line tools with xcode-select --install
      • On Debian Linux, apt-get update && apt-get install clang
      • On Windows, duckdb-rs isn’t supported, so these tests are excluded
    • Python >= 3.7, to compile prql-python.

  • For more involved contributions, such as building the website, playground, book, or some release artifacts, we’ll need some additional tools. But we won’t need those immediately, and the error messages on what’s missing should be clear when we attempt those things. When we hit them, the Taskfile.yml will be a good source to copy & paste instructions from.

Building & testing the full project

We have a couple of tasks which incorporate all building & testing. While they don’t need to be run as part of a standard dev loop — generally we’ll want to run a more specific test — they can be useful as a backstop to ensure everything works, and as a reference for how each part of the repo is built & tested. They should be consistent with the GitHub Actions workflows; please report any inconsistencies.

To build everything:

task build-all

To run all tests (which includes building everything):

task test-all

These require installing Task, either brew install go-task/tap/go-task or as described on the Task website.

Contribution workflow

We’re similar to most projects on GitHub — open a Pull Request with a suggested change!

Commits

  • If a change is user-facing, please add a line in CHANGELOG.md, with {message}, ({@contributor, #X}) where X is the PR number (see the example after this list).
    • If there’s a missing entry, a follow-up PR containing just the changelog entry is welcome.
  • We’re experimenting with using the Conventional Commits message format, enforced through action-semantic-pull-request. This would let us generate Changelogs automatically. The check is not required to pass at the moment.
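For example, a changelog line might look like the following (the message, contributor handle, and PR number here are all hypothetical):

- Add a tutorial to the book, (@contributor-handle, #1234)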

Merges

  • We merge any code that makes PRQL better
  • A PR doesn’t need to be perfect to be merged; it doesn’t need to solve a big problem. It needs to:
    • be in the right direction
    • make incremental progress
    • be explicit on its current state, so others can continue the progress
  • If you have merge permissions, and are reasonably confident that a PR is suitable to merge (whether or not you’re the author), feel free to merge.
    • If you don’t have merge permissions and have authored a few PRs, ask and ye shall receive.
  • The primary way we ratchet the code quality is through automated tests.
    • This means PRs almost always need a test to demonstrate incremental progress.
    • If a change breaks functionality without breaking tests, our tests were insufficient.
  • We use PR reviews to give general context, offer specific assistance, and collaborate on larger decisions.
    • Reviews around ‘nits’ like code formatting / idioms / etc are very welcome. But the norm is for them to be received as helpful advice, rather than as mandatory tasks to complete. Adding automated tests & lints to automate these suggestions is welcome.
    • If you have merge permissions and would like a PR to be reviewed before it merges, that’s great — ask or assign a reviewer.
    • If a PR hasn’t received attention after a day, please feel free to ping the pull request.
  • People may review a PR after it’s merged. As part of the understanding that we can merge quickly, contributors are expected to incorporate substantive feedback into a future PR.
  • We should revert quickly if the impact of a PR turns out not to be consistent with our expectations, or there isn’t as much consensus on a decision as we had hoped. It’s very easy to revert code and then re-revert when we’ve resolved the issue; it’s a sign of moving quickly.

Components of PRQL

The PRQL project has several components. Instructions for working with them are in the README.md file in their respective paths. Here’s an overview:

book: The PRQL language book, which documents the language.

playground: A web GUI for the PRQL compiler. It shows the PRQL source beside the resulting SQL output.

prql-compiler: Installation and usage instructions for building and running the prql-compiler.

prql-java: Java bindings to the prql-compiler rust library.

prql-js: Javascript bindings to the prql-compiler rust library.

prql-lib: Generates .a and .so libraries from the prql-compiler rust library for bindings to other languages.

prql-python: Python bindings to the prql-compiler rust library.

website: Our website, hosted at https://prql-lang.org, built with hugo.

How we test

We use a pyramid of tests — we have fast, focused tests at the bottom of the pyramid, which give us low latency feedback when developing, and then slower, broader tests which ensure that we don’t miss anything as PRQL develops.¹

¹ Our approach is very consistent with @matklad’s advice in his excellent blog post How to Test (https://matklad.github.io/2021/05/31/how-to-test.html).

Note

If you’re making your first contribution, you don’t need to engage with all this — it’s fine to just make a change and push the results; the tests that run in GitHub will point you towards any errors, which can then be run locally if needed. We’re always around to help out.

Our tests, from the bottom of the pyramid to the top:

  • Static checks — we run a few static checks to ensure the code stays healthy and consistent. They’re defined in .pre-commit-config.yaml, using pre-commit. They can be run locally with

    pre-commit run -a
    

    Most of these checks fix the issues they find automatically. Most of them also run on GitHub on every commit; any changes they make are added onto the branch automatically in an additional commit.

  • Unit tests & inline insta snapshots — like most projects, we rely on unit tests to rapidly check that our code basically works. We extensively use Insta, a snapshot testing tool which writes out the values generated by our code, making it fast & simple to write and modify tests.²

    These are the fastest tests which run our code; they’re designed to run on every save while you’re developing. We include a task which does this (the full command run on every save is cargo insta test --accept -p prql-compiler --lib):

    task -w test-rust-fast
    
² Here’s an example of an insta test (https://github.com/PRQL/prql/blob/0.2.2/prql-compiler/src/parser.rs#L580-L605) — note that only the initial line of each test is written by us; the remainder is filled in by insta.
  • Integration tests — these run tests against real databases, to ensure we’re producing correct SQL.

  • Examples — we compile all examples in the PRQL Book, to test that they produce the SQL we expect, and that changes to our code don’t cause any unexpected regressions.

  • GitHub Actions on every commit — we run tests on prql-compiler for standard & wasm targets, and the examples in the book, on every commit pushed to a pull request. These are designed to run in under two minutes, and we should reassess their scope if they grow beyond that. Once these pass, a pull request can be merged.

    All tests up to this point can be run with:

    task test-all
    
  • GitHub Actions on specific changes — we run additional tests on pull requests when we identify changes to some paths, such as bindings to other languages.

  • GitHub Actions on merge — we run many more tests on every merge to main. This includes testing across OSs, all our language bindings, our task tasks, a measure of test code coverage, and some performance benchmarks.³

    We can run these tests before a merge by adding the pr-test-all label to the PR.

    If these tests fail after merging, we revert the merged commit before fixing the test and then re-reverting.

³ We reference “actions”, such as build-prql-python (https://github.com/PRQL/prql/blob/main/.github/actions/build-prql-python/action.yaml), from workflows. We need to use these actions since workflow calls can only have a depth of 2 (i.e. a workflow can call other workflows, but those workflows can’t call further workflows). We occasionally copy & paste small amounts of yaml where we don’t want to abstract something tiny into another action.

An alternative approach would be to have all jobs in a single workflow which is called on every change, with each job containing all of its own filtering logic. So pull-request.yaml and test-all.yaml would be a single file, and test-python would be a job with an `if` containing a) path changes, b) a branch condition for main, and c) a PR label filter. That would be a “flatter” approach — each job contains all its own criteria. The downside would be less abstraction, more verbose jobs, and a long list of ~25-30 skipped jobs on every PR (since each job is skipped, rather than never started).

Ideally we wouldn’t have to make these tradeoffs — GHA would offer an arbitrary DAG of workflows, with filters at each level, and a UI that less prominently displays workflows which aren’t designed to run.
  • GitHub Actions nightly — we run tests that take a long time or are unrelated to code changes, such as security checks, or expensive timing benchmarks, every night.

    We can run these tests before a merge by adding the pr-cron label to the PR.

The goal of our tests is to allow us to make changes quickly. If you find they’re making it more difficult for you to make changes, or there are missing tests that would give you the confidence to make changes faster, then please raise an issue.


Releasing

Currently we release in a semi-automated way:

  1. PR & merge an updated Changelog. GitHub will produce a draft version at https://github.com/PRQL/prql/releases/new, including “New Contributors”.

  2. Run cargo release version patch -x && cargo release replace -x to bump the versions, then PR the resulting commit.

  3. After merging, go to Draft a new release⁴, copy the changelog entry into the release description⁵, enter the tag to be created, and hit “Publish”.

  4. From there, both the tag and release are created, and all packages are published automatically based on our release workflow.

  5. Add in the sections for a new Changelog:

    ## 0.4.X — [unreleased]
    
    **Features**:
    
    **Fixes**:
    
    **Documentation**:
    
    **Web**:
    
    **Integrations**:
    
    **Internal changes**:
    
    **New Contributors**:
    

We may make this more automated in the future; e.g. automatic changelog creation.

⁴ Only maintainers have access to this page.

⁵ Unfortunately GitHub’s markdown parser interprets linebreaks as newlines. I haven’t found a better way of editing the markdown to look reasonable than manually editing the text.

Using the Dockerfile

The Dockerfile in this repo builds a Docker image that has current versions of our rust development tools. This can be the lowest-effort way of setting up a rust environment for those who don’t have one already.

Development cycle

The developer loop when using Docker is substantially the same as if the tools had been installed directly.

All the source files live in the prql directory on the host. As the source changes, the tools (running in the Docker container) can watch those directories and re-run so results are instantly visible.

When the Docker container exits (say, at the end of the development session), the prql directory on the local machine contains the latest files. Use git to pull or to push the prql repo from the host as normal.

To do all this, build the Docker image and start a container as described in the Installation section.

Installation

Once Docker is installed, build the Docker image with the following commands.

Note: It will take some time while Docker pulls in all the necessary developer tools.

cd <top-level-PRQL-directory>
docker build -t prql .

Optional: Install pre-commit on the machine that hosts Docker. It runs several Static Checks to ensure code consistency. You can also configure git to run pre-commit automatically for each commit with the second (one-time) command below.

pre-commit run -a   # Run checks manually
pre-commit install  # (one time) install the git hooks

Finally, start up the Docker container with:

cd <top-level-PRQL-directory>
docker run -it -v $(pwd)/:/src -p 3000:3000 prql
  • There’ll be a root@xxxxxxxxx:/src/# prompt
  • Enter a command to run or test code; for example cargo test
  • Enter exit to stop the container

Running code with Docker

Currently our Docker image only supports running rust dependencies. (Adding hugo & nodejs so that the playground can run would be a welcome contribution.)

Use the docker run... command above, then enter the relevant commands; for example cargo insta test --accept or task run book — more details of the commands are in each component’s README.md file or our Development docs.

Note: The first time you run a component, it may take some time to install additional files. Once they’re built, start up is quick.

Developing the Dockerfile

When making updates to the Dockerfile, we have automated testing that the Dockerfile builds on each merge in test-all.yaml, and automated testing that confirms all rust tests pass in nightly.yaml.

Add the pr-test-all or pr-cron label to a PR to run these tests on it.

Internals

This chapter explains PRQL’s semantics: how expressions are interpreted and their meaning. It’s intended for advanced users and compiler contributors.

Compiler architecture

The compiler works in the following stages:

  1. Lexing & parsing - splits PRQL text into tokens, builds a parse tree and converts it into our AST (Abstract Syntax Tree; see the ast module). Parsing is done using the PEST parser (prql.pest); the AST is constructed in parser.rs.

  2. Semantic analysis - resolves names (identifiers), extracts declarations, and determines frames (the columns of the table at each step). It declares a Context that contains the root module (a mapping from accessible names to their declarations).

    Resolving includes the following operations:

    • Assign an id to each node (Expr and Stmt).
    • Extract function declarations and variable definitions into the appropriate Module, accessible from Context::root_mod.
    • Look up identifiers in the module and find the associated declaration. The ident is replaced with a fully qualified name that is guaranteed unique within root_mod. Sometimes, Expr::target is also set.
    • Convert function calls to transforms (from, derive, filter) from FuncCall into TransformCall, which is more convenient for later processing.
    • Determine the type of each expression. If an expression is a reference to a table, use the frame of the table as its type. If it is a TransformCall, apply the transform to the input frame to obtain the resulting type. For simple expressions, try to infer the type from ExprKind.
  3. Lowering - converts PL into RQ, which is more strictly typed and contains less information, but is convenient for translating into SQL or another backend.

  4. SQL backend - converts RQ into SQL, translating each of the relations into a SQL query. Pipelines are analyzed and split at appropriate positions into “AtomicPipelines”, each of which can be represented by a single SELECT statement.

    Splitting is done back-to-front. First, we start with a list of all the output columns we want. Then we traverse the pipeline backwards, splitting whenever we encounter a transform that is incompatible with the transforms already present in the pipeline. A split can also be triggered by encountering an expression that cannot be materialized where it is used (a window function in WHERE, for example).

    This process is also called anchoring, because it anchors a column definition to a specific location in the output query.

    During this process, sql::context keeps track of:

    • table instances in the query (to prevent mixing up two instances of the same table)
    • column definitions, whether computed or a reference to a table column,
    • column names, as defined in RQ or generated

Name resolving

Because PRQL primarily handles relational data, it has specialized scoping rules for referencing columns.

Scopes

In PRQL’s compiler, a scope is the collection of all names one can reference from a specific point in the program.

In PRQL, names in the scope are composed of a namespace and a variable name, separated by a dot, similar to SQL. Namespaces can contain many dots, but variable names cannot.

Example

The name my_table.some_column is a variable some_column from the namespace my_table.

The name foo.bar.baz is a variable baz from the namespace foo.bar.

When processing a query, a scope is maintained and updated for each point in the query.

It starts with only the namespace std, which is the standard library. It contains common functions like sum or count, along with all transform functions such as derive and group.

In pipelines (or rather in transform functions), the scope is also injected with the namespaces of any tables that have been referenced with from or join transforms. These namespaces simply contain all the columns of the table, and possibly a wildcard variable which matches any variable name (see the algorithm below). Within transforms, there is also a special namespace that does not have a name. It is called the “frame”, and it contains the columns of the current table that the transform is operating on.
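For example, here is a minimal illustration (using the employees table from earlier examples): a column created by derive enters the frame, so subsequent transforms can reference it without any namespace prefix:

from employees                        # scope gains the employees namespace
derive gross = salary + payroll_tax   # gross is added to the frame
filter gross > 0                      # gross resolves from the frame, no prefix needed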

Resolving

For each ident we want to resolve, we search the scope’s items in order. One of three things can happen:

  • Scope contains an exact match, i.e. a name that matches in both namespace and variable name.

  • Scope does not contain an exact match, but the ident did not specify a namespace, so we can match a namespace that contains a * wildcard. If there’s a single such namespace, the matched namespace is also updated to contain this new variable name.

  • Otherwise, nothing is matched and an error is raised.
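To make the wildcard case concrete, here is a small sketch (the column name is illustrative):

from employees     # scope now contains employees.* (a wildcard)
filter age > 50    # age has no exact match; it matches employees.*, so it
                   # resolves as employees.age, and the namespace is updated
                   # to contain that name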

Translating to SQL

When translating into a SQL statement which references only one table, there is no need to prefix column names with the table name.

PRQL

from employees
select first_name

SQL

SELECT
  first_name
FROM
  employees

But when there are multiple tables and we don’t have complete knowledge of all table columns, a column without a prefix (e.g. first_name) may actually reside in any of several tables. Because of this, we have to use table prefixes for all column names.

PRQL

from employees
derive [first_name, dept_id]
join d=departments [==dept_id]
select [first_name, d.title]

SQL

SELECT
  employees.first_name,
  d.title
FROM
  employees
  JOIN departments AS d ON employees.dept_id = d.dept_id

As you can see, employees.first_name now needs a table prefix, to prevent conflicts with a potential column of the same name in the departments table. Similarly, d.title needs its table prefix.

Functions

Function call

The major distinction between PRQL and today’s conventional programming languages such as C or Python is the function call syntax. It consists of the function name followed by arguments separated by whitespace.

function_name arg1 arg2 arg3

If one of the arguments is also a function call, it must be encased in parentheses, so we know where the arguments of the inner function end and the arguments of the outer function start.

outer_func arg_1 (inner_func arg_a arg_b) arg_2
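A concrete case appears in the average_salaries example earlier, where the inner call average salary is parenthesized so that it reads as a single argument:

group country (
  aggregate average_country_salary = (average salary)
)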

Pipeline

There is an alternative way of calling functions: using a pipeline. Whether the pipeline is delimited by the pipe symbol | or by a new line, it is equivalent to passing the result of each step as the last argument of the next function.

a | foo 3 | bar 'hello' 'world' | baz

… is equivalent to …

baz (bar 'hello' 'world' (foo 3 a))

As you may have noticed, transforms are regular functions too!

PRQL

from employees
filter age > 50
sort name

SQL

SELECT
  *
FROM
  employees
WHERE
  age > 50
ORDER BY
  name

… is equivalent to …

PRQL

from employees | filter age > 50 | sort name

SQL

SELECT
  *
FROM
  employees
WHERE
  age > 50
ORDER BY
  name

… is equivalent to …

PRQL

filter age > 50 (from employees) | sort name

SQL

SELECT
  *
FROM
  employees
WHERE
  age > 50
ORDER BY
  name

… is equivalent to …

PRQL

sort name (filter age > 50 (from employees))

SQL

SELECT
  *
FROM
  employees
WHERE
  age > 50
ORDER BY
  name

As you can see, the first example with pipeline notation is much easier to comprehend than the last one with regular function call notation. This is why we recommend using pipelines for nested function calls that are three or more levels deep.

Currying and late binding

In PRQL, functions are first-class citizens. As cool as that sounds, we need simpler terms to explain it. In essence, it means that we can operate with functions as with any other value.
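As a sketch of what this enables (interp, students and sat_score are illustrative names): a function applied to fewer arguments than it declares, such as interp 1600 below, denotes a new function awaiting the remaining arguments, which can be supplied later:

func interp low:0 high x -> (x - low) / (high - low)

from students
derive [
  sat_proportion_1 = (interp 1600 sat_score),        # low takes its default of 0
  sat_proportion_2 = (interp low:0 1600 sat_score),  # or can be named explicitly
]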

Syntax highlighting

PRQL contains multiple grammar definitions to enable tools to highlight PRQL code. These are all intended to provide as good an experience as the grammar supports. Please raise any shortcomings in a GitHub issue.

The definitions are somewhat scattered around the codebase; this page serves as an index.

While the pest grammar at prql-compiler/src/parser/prql.pest isn’t used for syntax highlighting, it’s the arbiter of truth, given that it currently powers the PRQL compiler.