Introduction
PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement. Like SQL, it’s readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.
Let’s get started with an example:
PRQL
from employees
filter start_date > @2021-01-01 # Clear date syntax
derive [ # `derive` adds columns / variables
gross_salary = salary + (tax ?? 0), # Terse coalesce
gross_cost = gross_salary + benefits_cost, # Variables can use other variables
]
filter gross_cost > 0
group [title, country] ( # `group` runs a pipeline over each group
aggregate [ # `aggregate` reduces each group to a value
average gross_salary,
sum_gross_cost = sum gross_cost, # `=` sets a column name
]
)
filter sum_gross_cost > 100_000 # `filter` replaces both of SQL's `WHERE` & `HAVING`
derive id = f"{title}_{country}" # F-strings like python
derive country_code = s"LEFT(country, 2)" # S-strings allow using SQL as an escape hatch
sort [sum_gross_cost, -country] # `-country` means descending order
take 1..20 # Range expressions (also valid here as `take 20`)
SQL
WITH table_1 AS (
SELECT
title,
country,
salary + COALESCE(tax, 0) + benefits_cost AS _expr_0,
salary + COALESCE(tax, 0) AS _expr_1
FROM
employees
WHERE
start_date > DATE '2021-01-01'
)
SELECT
title,
country,
AVG(_expr_1),
SUM(_expr_0) AS sum_gross_cost,
CONCAT(title, '_', country) AS id,
LEFT(country, 2) AS country_code
FROM
table_1
WHERE
_expr_0 > 0
GROUP BY
title,
country
HAVING
SUM(_expr_0) > 100000
ORDER BY
sum_gross_cost,
country DESC
LIMIT
20
As you can see, PRQL is a linear pipeline of transformations — each line of the query is a transformation of the previous line’s result.
In SQL, by contrast, operations do not follow one another, which makes it hard to compose larger queries.
Queries
PRQL queries are composed of a sequence of transforms that form a pipeline that modifies data as it is passed from one to the next.
The major items of a query are described in the sections below.
Pipelines
PRQL queries are a sequence of lines (or transforms) that form a pipeline. Each line transforms the data, and passes its result to the next.
The simplest pipeline is just:
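PRQL
from employees
SQL
SELECT
  *
FROM
  employees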
Adding transforms
As we add additional lines, each one transforms the result:
PRQL
from employees
derive gross_salary = (salary + payroll_tax)
SQL
SELECT
*,
salary + payroll_tax AS gross_salary
FROM
employees
…and so on:
from employees
derive gross_salary = (salary + payroll_tax)
sort gross_salary
Compiling to SQL
PRQL compiles the query to SQL. The PRQL compiler tries to represent as many
transforms as possible with a single SELECT
statement. When necessary, the
compiler “overflows” and creates CTEs (common table expressions):
PRQL
from e = employees
derive gross_salary = (salary + payroll_tax)
sort gross_salary
take 10
join d = department [==dept_no]
select [e.name, gross_salary, d.name]
SQL
WITH table_1 AS (
SELECT
name,
salary + payroll_tax AS gross_salary,
dept_no
FROM
employees AS e
ORDER BY
gross_salary
LIMIT
10
)
SELECT
table_1.name,
table_1.gross_salary,
d.name
FROM
table_1
JOIN department AS d ON table_1.dept_no = d.dept_no
See also
- Transforms - PRQL Transforms
- Syntax - Notation for PRQL queries
Functions
Functions are a fundamental abstraction in PRQL — they allow us to run code in many places that we’ve written once. This reduces the number of errors in our code, makes our code more readable, and simplifies making changes.
Functions have two types of parameters:
- Positional parameters, which require an argument.
- Named parameters, which optionally take an argument, otherwise using their default value.
So this function is named fahrenheit_to_celsius and has one parameter temp:
PRQL
func fahrenheit_to_celsius temp -> (temp - 32) / 1.8
from cities
derive temp_c = (fahrenheit_to_celsius temp_f)
SQL
SELECT
*,
(temp_f - 32) / 1.8 AS temp_c
FROM
cities
This function is named interp, and has two positional parameters named higher and x, and one named parameter named lower which takes a default argument of 0. It calculates the proportion of the distance that x is between lower and higher.
PRQL
func interp lower:0 higher x -> (x - lower) / (higher - lower)
from students
derive [
sat_proportion_1 = (interp 1600 sat_score),
sat_proportion_2 = (interp lower:0 1600 sat_score),
]
SQL
SELECT
*,
(sat_score - 0) / 1600 AS sat_proportion_1,
(sat_score - 0) / 1600 AS sat_proportion_2
FROM
students
Piping
Consistent with the principles of PRQL, it’s possible to pipe values into functions, which makes composing many functions more readable. When piping a value into a function, the value is passed as an argument to the final positional parameter of the function. Here’s the same result as the examples above with an alternative construction:
PRQL
func interp lower:0 higher x -> (x - lower) / (higher - lower)
from students
derive [
sat_proportion_1 = (sat_score | interp 1600),
sat_proportion_2 = (sat_score | interp lower:0 1600),
]
SQL
SELECT
*,
(sat_score - 0) / 1600 AS sat_proportion_1,
(sat_score - 0) / 1600 AS sat_proportion_2
FROM
students
and
PRQL
func fahrenheit_to_celsius temp -> (temp - 32) / 1.8
from cities
derive temp_c = (temp_f | fahrenheit_to_celsius)
SQL
SELECT
*,
(temp_f - 32) / 1.8 AS temp_c
FROM
cities
We can combine a chain of functions, which makes logic more readable:
PRQL
func fahrenheit_to_celsius temp -> (temp - 32) / 1.8
func interp lower:0 higher x -> (x - lower) / (higher - lower)
from kettles
derive boiling_proportion = (temp_c | fahrenheit_to_celsius | interp 100)
SQL
SELECT
*,
((temp_c - 32) / 1.8 - 0) / 100 AS boiling_proportion
FROM
kettles
Roadmap
Late binding
Currently, functions require a binding to variables in scope; they can’t late-bind to column names. So, for example:
func return price -> (price - dividend) / price_yesterday
…isn’t yet a valid function, and instead would need to be:
func return price dividend price_yesterday -> (price - dividend) / (price_yesterday)
(which makes the function much less useful in this case)
Tables
We can define a temporary table — similar to a CTE in SQL — with let:
PRQL
let top_50 = (
from employees
sort salary
take 50
aggregate [total_salary = sum salary]
)
from top_50 # Starts a new pipeline
SQL
WITH table_0 AS (
SELECT
salary
FROM
employees
ORDER BY
salary
LIMIT
50
), top_50 AS (
SELECT
SUM(salary) AS total_salary
FROM
table_0
)
SELECT
total_salary
FROM
top_50
We can even place a whole CTE in an s-string, enabling us to use features which PRQL doesn’t yet support.
PRQL
let grouping = s"""
SELECT SUM(a)
FROM tbl
GROUP BY
GROUPING SETS
((b, c, d), (d), (b, d))
"""
from grouping
SQL
WITH table_0 AS (
SELECT
SUM(a)
FROM
tbl
GROUP BY
GROUPING SETS ((b, c, d), (d), (b, d))
),
grouping AS (
SELECT
*
FROM
table_0 AS table_1
)
SELECT
*
FROM
grouping
In PRQL, tables are far less common than CTEs are in SQL, since a linear series of CTEs can be represented with a single pipeline.
Note that a name given to a table / namespace — like the e below — is no longer available after the select statement. For example, this would raise an error:
from e=employees
select e.first_name
filter e.first_name == "Fred" # Can't find `e.first_name`
To refer to the e.first_name column in subsequent transforms, either refer to it using first_name, or if it requires a different name, assign one in the select statement:
PRQL
from e=employees
select fname = e.first_name
filter fname == "Fred"
SQL
WITH table_1 AS (
SELECT
first_name AS fname
FROM
employees AS e
)
SELECT
fname
FROM
table_1
WHERE
fname = 'Fred'
Sort
Orders rows based on the values of one or more columns.
sort [{direction}{column}]
Parameters
- One column or a list of columns to sort by
- Each column can be prefixed with:
  - +, for ascending order (the default)
  - -, for descending order
- When using prefixes, even a single column needs to be in a list or parentheses. (Otherwise, sort -foo is parsed as a subtraction between sort and foo.)
Examples
PRQL
from employees
sort [age, -tenure, +salary]
SQL
SELECT
*
FROM
employees
ORDER BY
age,
tenure DESC,
salary
We can also use expressions:
PRQL
from employees
sort [s"substr({first_name}, 2, 5)"]
SQL
WITH table_1 AS (
SELECT
*,
substr(first_name, 2, 5) AS _expr_0
FROM
employees
ORDER BY
_expr_0
)
SELECT
*
FROM
table_1
Notes
Ordering guarantees
Most DBs will persist ordering through most transforms; for example, you can expect this result to be ordered by tenure.
PRQL
from employees
sort tenure
derive name = f"{first_name} {last_name}"
SQL
SELECT
*,
CONCAT(first_name, ' ', last_name) AS name
FROM
employees
ORDER BY
tenure
But:
- This is an implementation detail of the DB. If there are instances where this doesn’t hold, please open an issue, and we’ll consider how to manage it.
- Some transforms which change the existence of rows, such as join or group, won’t persist ordering; for example:
PRQL
from employees
sort tenure
join locations [==employee_id]
SQL
WITH table_1 AS (
SELECT
*
FROM
employees
ORDER BY
tenure
)
SELECT
table_1.*,
locations.*
FROM
table_1
JOIN locations ON table_1.employee_id = locations.employee_id
See Issue #1363 for more details.
Take
Picks rows based on their position.
take {n|range}
See Ranges for more details on how ranges work.
Examples
PRQL
from orders
sort [-value, date]
take 101..110
SQL
SELECT
*
FROM
orders
ORDER BY
value DESC,
date
LIMIT
10 OFFSET 100
Window
Applies a pipeline to segments of rows, producing one output value for every input value.
window rows:{range} range:{range} expanding:false rolling:0 {pipeline}
For each row, the segment over which the pipeline is applied is determined by one of:
- rows, which takes a range of rows relative to the current row position. 0 references the current row.
- range, which takes a range of values relative to the current row value.

The bounds of the range are inclusive. If a bound is omitted, the segment will extend until the edge of the table or group.
For ease of use, there are two flags that override rows or range:
- expanding:true is an alias for rows:..0. A sum using this window is also known as a “cumulative sum”.
- rolling:n is an alias for rows:(-n+1)..0, where n is an integer. This will include the last n values, including the current row. An average using this window is also known as a Simple Moving Average.
Some examples:
Expression | Meaning |
---|---|
rows:0..2 | current row plus two following |
rows:-2..0 | two preceding rows plus current row |
rolling:3 | (same as previous) |
rows:-2..4 | two preceding rows plus current row plus four following rows |
rows:..0 | all rows from the start of the table up to & including current row |
expanding:true | (same as previous) |
rows:0.. | current row and all following rows until the end of the table |
rows:.. | all rows, which is the same as not having a window at all |
Example
PRQL
from employees
group employee_id (
sort month
window rolling:12 (
derive [trail_12_m_comp = sum paycheck]
)
)
SQL
SELECT
*,
SUM(paycheck) OVER (
PARTITION BY employee_id
ORDER BY
month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
) AS trail_12_m_comp
FROM
employees
PRQL
from orders
sort day
window rows:-3..3 (
derive [centered_weekly_average = average value]
)
group [order_month] (
sort day
window expanding:true (
derive [monthly_running_total = sum value]
)
)
SQL
SELECT
*,
AVG(value) OVER (
ORDER BY
day ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
) AS centered_weekly_average,
SUM(value) OVER (
PARTITION BY order_month
ORDER BY
day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS monthly_running_total
FROM
orders
Windowing by default
If you use window functions without a window transform, they will be applied to the whole table. Unlike in SQL, they will remain window functions and will not trigger aggregation.
PRQL
from employees
sort age
derive rnk = rank
SQL
SELECT
*,
RANK() OVER (
ORDER BY
age ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS rnk
FROM
employees
ORDER BY
age
You can also apply just group:
PRQL
from employees
group department (
sort age
derive rnk = rank
)
SQL
SELECT
*,
RANK() OVER (
PARTITION BY department
ORDER BY
age ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS rnk
FROM
employees
Window functions as first-class citizens
There are no limitations on where windowed expressions can be used:
PRQL
from employees
filter salary < (average salary)
SQL
WITH table_1 AS (
SELECT
*,
AVG(salary) OVER () AS _expr_0
FROM
employees
)
SELECT
*
FROM
table_1
WHERE
salary < _expr_0
Syntax
Summary
A summary of PRQL syntax
Syntax | Usage | Example |
---|---|---|
| | Pipelines | from employees | select first_name |
= | Assigns & Aliases | from e = employees derive total = (sum salary) |
: | Named args & Parameters | interp lower:0 1600 sat_score |
[] | Lists | select [id, amount] |
() | Precedence & Parentheses | derive celsius = (fahrenheit - 32) / 1.8 |
'' & "" | Strings | derive name = 'Mary' |
` ` | Quoted identifiers | select `first name` |
# | Comments | # A comment |
@ | Dates & Times | @2021-01-01 |
== | Equality comparison | filter a == b and c != d and e > f |
== | Self-equality in join | join s=salaries [==id] |
-> | Function definitions | func add a b -> a + b |
+ / - | Sort order | sort [-amount, +date] |
?? | Coalesce | amount ?? 0 |
Pipes
Pipes — the connection between transforms that make up a pipeline — can be either line breaks or a pipe character (|).
In almost all situations, line-breaks pipe the result of a line’s transform into the transform on the following line. For example, the filter transform operates on the result of from employees (which is just the employees table), and the select transform operates on the result of the filter transform.
PRQL
from employees
filter department == "Product"
select [first_name, last_name]
SQL
SELECT
first_name,
last_name
FROM
employees
WHERE
department = 'Product'
In the place of a line-break, it’s also possible to use the | character to pipe results, such that this is equivalent:
PRQL
from employees | filter department == "Product" | select [first_name, last_name]
SQL
SELECT
first_name,
last_name
FROM
employees
WHERE
department = 'Product'
A line-break doesn’t create a pipeline in a couple of cases:
- within a list (e.g. the derive examples below),
- when the following line is a new statement, which starts with a keyword of func, table or from.
Lists
Lists are represented with [], and can span multiple lines. A final trailing comma is optional.
PRQL
from numbers
derive [x = 1, y = 2]
derive [
a = x,
b = y
]
derive [
c = a,
d = b,
]
SQL
SELECT
*,
1 AS x,
2 AS y,
1 AS a,
2 AS b,
1 AS c,
2 AS d
FROM
numbers
Most transforms can take either a list or a single item, so these are equivalent:
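PRQL
from employees
select [first_name]
SQL
SELECT
  first_name
FROM
  employees
…and the same without a list:
PRQL
from employees
select first_name
SQL
SELECT
  first_name
FROM
  employees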
Expressions
PRQL is made up of expressions, like 2 + 3 or ((1 + x) * y). In the example below, note the use of expressions to calculate the alias circumference and in the filter transform.
PRQL
from foo
select [
circumference = diameter * 3.14159,
color,
]
filter circumference > 10 and color != "red"
SQL
WITH table_1 AS (
SELECT
diameter * 3.14159 AS circumference,
color
FROM
foo
)
SELECT
circumference,
color
FROM
table_1
WHERE
circumference > 10
AND color <> 'red'
Precedence and Parentheses
Parentheses — () — are used to give precedence to inner expressions.
We realize some of the finer points here are not intuitive. We are considering approaches to make this more intuitive — even at the cost of requiring more syntax in some circumstances. And we’re planning to make the error messages much better, so the compiler is there to help out.
Parentheses are required around:
- Any nested function call containing a pipe, either the | symbol or a new line. “Nested” means within a transform; i.e. not just the main pipeline.
- Any function call that isn’t a single item in a list or a pipeline, like sum distance in round 0 (sum distance)¹.
- A minus sign in a function argument, like in add (-1) (-3)
- Inner transforms for group, window, and other transforms.

Parentheses are not required around expressions which use operators but no function call, like foo + bar.

¹ or, technically, it’s on the right side of an assignment in a list…
Here’s a full rundown of the times this applies:
PRQL
from employees
# Requires parentheses, because it contains a pipe
derive is_proximate = (distance | in 0..20)
# Requires parentheses, because it's a function call
derive total_distance = (sum distance)
# `??` doesn't require parentheses, as it's not a function call
derive min_capped_distance = (min distance ?? 5)
# No parentheses needed, because no function call
derive travel_time = distance / 40
# No inner parentheses needed around `1+1` because no function call
derive distance_rounded_2_dp = (round 1+1 distance)
derive [
# Requires parentheses, because it contains a pipe
is_far = (distance | in 100..),
# The left value of the range requires parentheses,
# because of the minus sign
is_negative = (distance | in (-100..0)),
# ...this is equivalent
is_negative = (distance | in (-100)..0),
# Doesn't require parentheses, because it's in a list (confusing, see footnote)!
average_distance = average distance,
]
# Requires parentheses because of the minus sign
sort (-distance)
# A list is fine too
sort [-distance]
SQL
SELECT
*,
distance BETWEEN 0 AND 20 AS is_proximate,
SUM(distance) OVER () AS total_distance,
MIN(COALESCE(distance, 5)) OVER () AS min_capped_distance,
distance / 40 AS travel_time,
ROUND(distance, 2) AS distance_rounded_2_dp,
distance >= 100 AS is_far,
distance BETWEEN -100 AND 0,
distance BETWEEN -100 AND 0 AS is_negative,
AVG(distance) OVER () AS average_distance
FROM
employees
ORDER BY
distance DESC
Inner Transforms
Parentheses are also used for transforms (such as group and window) that pass their result to an “inner transform”. The example below applies the aggregate pipeline to each group of unique title and country values:
PRQL
from employees
group [title, country] (
aggregate [
average salary,
ct = count
]
)
SQL
SELECT
title,
country,
AVG(salary),
COUNT(*) AS ct
FROM
employees
GROUP BY
title,
country
Comments
Comments are represented by #.
PRQL
from employees # Comment 1
# Comment 2
aggregate [average salary]
SQL
SELECT
AVG(salary)
FROM
employees
There’s no distinct multiline comment syntax.
Quoted identifiers
To use identifiers that are otherwise invalid, surround them with backticks. Depending on the dialect, these will remain as backticks or be converted to double-quotes.
PRQL
prql target:sql.mysql
from employees
select `first name`
SQL
SELECT
`first name`
FROM
employees
PRQL
prql target:sql.postgres
from employees
select `first name`
SQL
SELECT
"first name"
FROM
employees
BigQuery also uses backticks to surround project & dataset names (even if valid identifiers) in the SELECT statement:
PRQL
prql target:sql.bigquery
from `project-foo.dataset.table`
join `project-bar.dataset.table` [==col_bax]
SQL
SELECT
`project-foo.dataset.table`.*,
`project-bar.dataset.table`.*
FROM
`project-foo.dataset.table`
JOIN `project-bar.dataset.table` ON `project-foo.dataset.table`.col_bax = `project-bar.dataset.table`.col_bax
Quoting schemas
This is currently not great and we are working on improving it; see
https://github.com/PRQL/prql/issues/1535 for progress.
If supplying a schema without a column — for example in a from or join transform — that also needs to be a quoted identifier:
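For example, a sketch with a hypothetical schema-qualified dbo.planets table (the compiled SQL quotes the name per the target dialect):
PRQL
from `dbo.planets`
select [name, mass]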
Parameters
PRQL will retain parameters like $1 in SQL output, which can then be supplied to the SQL query:
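For example, a sketch assuming an id column on employees:
PRQL
from employees
filter id == $1
SQL
SELECT
  *
FROM
  employees
WHERE
  id = $1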
Numbers
Numbers can contain underscores between digits, which can make reading large numbers easier:
PRQL
from numbers
select [
small = 1.000_000_1,
big = 5_000_000,
]
SQL
SELECT
1.0000001 AS small,
5000000 AS big
FROM
numbers
Language features
The pages of this section describe how PRQL handles various aspects of the language.
Feature | Purpose |
---|---|
Coalesce | Handle potentially NULL values |
Dates & times | Handle dates and times |
Distinct | Equivalent of SQL DISTINCT |
Null handling | Handle NULL values |
Ranges | Syntax for all forms of ranges |
Regex | Handle regular expressions |
Stdlib | PRQL’s “builtin” set of functions |
Strings | Rules for creating strings |
S-Strings | Insert SQL directly into a query with an S-string |
F-Strings | Combine several columns’ data with F-strings |
Switch | Create a new column based on the contents of other columns |
Target & Version | Specify a target SQL engine and PRQL version |
Coalesce
We can coalesce values with the ?? operator. Coalescing takes either the first value or, if that value is null, the second value.
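For example, a sketch assuming an orders table with a nullable amount column:
PRQL
from orders
derive amount_filled = amount ?? 0
SQL
SELECT
  *,
  COALESCE(amount, 0) AS amount_filled
FROM
  orders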
Dates & times
PRQL uses @ followed by a string to represent dates & times. This is less verbose than SQL’s approach of TIMESTAMP '2004-10-19 10:23:54' and more explicit than SQL’s implicit option of just using a string '2004-10-19 10:23:54'.
Currently PRQL passes these strings, which can be compiled straight through, to the database, so many compatible format strings may work; but we may refine this in the future to aid compatibility across databases. We’ll always support the canonical ISO8601 format described below.
Dates
Dates are represented by @{yyyy-mm-dd} — a @ followed by the date format.
PRQL
from employees
derive age_at_year_end = (@2022-12-31 - dob)
SQL
SELECT
*,
DATE '2022-12-31' - dob AS age_at_year_end
FROM
employees
Times
Times are represented by @{HH:mm:ss.SSS±Z} with any parts not supplied being rounded to zero, including the timezone, which is represented by +HH:mm, -HH:mm or Z. This is consistent with the ISO8601 time format.
PRQL
from orders
derive should_have_shipped_today = (order_time < @08:30)
SQL
SELECT
*,
order_time < TIME '08:30' AS should_have_shipped_today
FROM
orders
Timestamps
Timestamps are represented by @{yyyy-mm-ddTHH:mm:ss.SSS±Z} / @{date}T{time}, with any time parts not supplied being rounded to zero, including the timezone, which is represented by +HH:mm, -HH:mm or Z. This is @ followed by the ISO8601 datetime format, which uses T to separate date & time.
PRQL
from commits
derive first_prql_commit = @2020-01-01T13:19:55-0800
SQL
SELECT
*,
TIMESTAMP '2020-01-01T13:19:55-0800' AS first_prql_commit
FROM
commits
Intervals
Intervals are represented by {N}{periods}, such as 2years or 10minutes, without a space.
These aren’t the same as ISO8601, because we evaluated P3Y6M4DT12H30M5S to be difficult to understand, but we could support a simplified form if there’s demand for it. We don’t currently support compound expressions, for example 2years10months, but most DBs will allow 2years + 10months. Please raise an issue if this is inconvenient.
PRQL
from projects
derive first_check_in = start + 10days
SQL
SELECT
*,
start + INTERVAL 10 DAY AS first_check_in
FROM
projects
Examples
Here’s a fuller list of examples:
- @20221231 is forbidden — it must contain full punctuation (- and :)
- @2022-12-31 is a date
- @2022-12 or @2022 are forbidden — SQL can’t express a month, only a date
- @16:54:32.123456 is a time
- @16:54:32, @16:54, @16 are all allowed, expressing @16:54:32.000000, @16:54:00.000000, @16:00:00.000000 respectively
- @2022-12-31T16:54:32.123456 is a timestamp without timezone
- @2022-12-31T16:54:32.123456Z is a timestamp in UTC
- @2022-12-31T16:54+02 is a timestamp in UTC+2
- @2022-12-31T16:54+02:00 and @2022-12-31T16:54+02 are datetimes in UTC+2
- @16:54+02 is forbidden — time is always local, so it cannot have a timezone
- @2022-12-31+02 is forbidden — date is always local, so it cannot have a timezone
Roadmap
Datetimes
Datetimes are supported by some databases (e.g. MySQL, BigQuery) in addition to timestamps. When we have type annotations, these will be represented by a timestamp annotated as a datetime:
derive pi_day = @2017-03-14T15:09:26.535898<datetime>
These are some examples we can then add:
- @2022-12-31T16:54<datetime> is a datetime without timezone
- @2022-12-31<datetime> is forbidden — datetime must specify time
- @16:54<datetime> is forbidden — datetime must specify date
Distinct
PRQL doesn’t have a specific distinct keyword. Instead, use group and take 1:
PRQL
from employees
select department
group department (
take 1
)
SQL
SELECT
DISTINCT department
FROM
employees
This also works without a linebreak:
PRQL
from employees
select department
group department (take 1)
SQL
SELECT
DISTINCT department
FROM
employees
Selecting from each group
We can select a single row from each group by combining group and sort:
PRQL
# youngest employee from each department
from employees
group department (
sort age
take 1
)
SQL
WITH table_1 AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY department
ORDER BY
age
) AS _expr_0
FROM
employees
)
SELECT
*
FROM
table_1
WHERE
_expr_0 <= 1
Roadmap
When using Postgres dialect, we are planning to compile:
# youngest employee from each department
from employees
group department (
sort age
take 1
)
… to …
SELECT DISTINCT ON (department) *
FROM employees
ORDER BY department, age
Null handling
SQL has an unconventional way of handling NULL values, since it treats them as unknown values. As a result, in SQL:
- NULL is not a value indicating a missing entry, but a placeholder for anything possible,
- NULL = NULL evaluates to NULL, since one cannot know if one unknown is equal to another unknown,
- NULL <> NULL evaluates to NULL, using the same logic,
- to check if a value is NULL, SQL introduces IS NULL and IS NOT NULL operators,
- DISTINCT column may return multiple NULL values.
For more information, check out the Postgres documentation.
PRQL, on the other hand, treats null as a value, which means that:
- null == null evaluates to true,
- null != null evaluates to false,
- distinct column cannot contain multiple null values.
PRQL
from employees
filter first_name == null
filter null != last_name
SQL
SELECT
*
FROM
employees
WHERE
first_name IS NULL
AND last_name IS NOT NULL
Note that PRQL doesn’t change how NULL is compared between columns, for example in joins. (PRQL compiles to SQL and so can’t change the behavior of the database.)
For more context or to provide feedback check out the discussion on issue #99.
Ranges
PRQL has a concise range syntax start..end. If only one of start & end is supplied, the range is open on the empty side.
Ranges can be used in filters with the in function, with any type of literal, including dates:
PRQL
from events
filter (date | in @1776-07-04..@1787-09-17)
filter (magnitude | in 50..100)
derive is_northern = (latitude | in 0..)
SQL
SELECT
*,
latitude >= 0 AS is_northern
FROM
events
WHERE
date BETWEEN DATE '1776-07-04' AND DATE '1787-09-17'
AND magnitude BETWEEN 50 AND 100
Like in SQL, ranges are inclusive.
As discussed in the take docs, ranges can also be used in take:
PRQL
from orders
sort [-value, date]
take 101..110
SQL
SELECT
*
FROM
orders
ORDER BY
value DESC,
date
LIMIT
10 OFFSET 100
Roadmap
We’d like to use ranges for other types, such as whether an object is in an array or list literal.
Regex expressions
At this time, PRQL does not have a facility for regular expression (regex) processing.
The Milestone for 0.5 talks about general plans for regexes.
Discussion #1123 provides a workaround function that emulates %LIKE%.
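Until then, an s-string can serve as a stopgap for a simple pattern match. A minimal sketch, assuming an employees table (the helper in Discussion #1123 is more general):
PRQL
from employees
filter s"first_name LIKE 'Fred%'"
SQL
SELECT
  *
FROM
  employees
WHERE
  first_name LIKE 'Fred%'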
Standard Library
The standard library currently contains commonly used functions that are used in SQL. It’s not yet as broad as we’d like, and we’re very open to expanding it.
Currently s-strings are an escape-hatch for any function that isn’t in our standard library. If we find ourselves using them for something frequently, raise an issue and we’ll add it to the stdlib.
Currently the stdlib implementation doesn’t support different DB implementations itself; those need to be built deeper into the compiler. We’ll resolve this at some point. Until then, we’ll only add functions here that are broadly supported by most DBs.
Here’s the source of the current PRQL std:
# Aggregate Functions
func min <scalar|column> column -> null
func max <scalar|column> column -> null
func sum <scalar|column> column -> null
func avg <scalar|column> column -> null
func stddev <scalar|column> column -> null
func average <scalar|column> column -> null
func count <scalar|column> non_null:s"*" -> null
# TODO: Possibly make this into `count distinct:true` (or like `distinct:` as an
# abbreviation of that?)
func count_distinct <scalar|column> column -> null
# Window functions
func lag<column> offset column -> null
func lead<column> offset column -> null
func first<column> offset column -> null
func last<column> offset column -> null
func rank<column> -> null
func rank_dense<column> -> null
func row_number<column> -> null
# Other functions
func round<scalar> n_digits column -> null
func as<scalar> `noresolve.type` column -> null
func in<bool> pattern value -> null
# Transform type definitions
func from<table> `default_db.source`<table> -> null
func select<table> columns<column> tbl<table> -> null
func filter<table> condition<bool> tbl<table> -> null
func derive<table> columns<column> tbl<table> -> null
func aggregate<table> a<column> tbl<table> -> null
func sort<table> by tbl<table> -> null
func take<table> expr tbl<table> -> null
func join<table> `default_db.with`<table> filter `noresolve.side`:inner tbl<table> -> null
func group<table> by pipeline tbl<table> -> null
func window<table> rows:0..0 range:0..0 expanding:false rolling:0 pipeline tbl<table> -> null
func append<table> `default_db.bottom`<table> top<table> -> null
func intersect<table> `default_db.bottom`<table> top<table> -> (
from t = _param.top
join b = _param.bottom (all (map _eq (zip t.* b.*)))
select t.*
)
func remove<table> `default_db.bottom`<table> top<table> -> (
from t = _param.top
join side:left b = _param.bottom (all (map _eq (zip t.* b.*)))
filter (all (map _is_null b.*))
select t.*
)
# List functions
func all<bool> list<list> -> null
func map<list> fn list<list> -> null
func zip<list> a<list> b<list> -> null
func _eq<list> a<list> -> null
func _is_null a -> _param.a == null
# Misc
func from_text<table> text<string> `noresolve.format`:csv -> null
And a couple of examples:
PRQL
from employees
derive [
gross_salary = (salary + payroll_tax | as int),
gross_salary_rounded = (gross_salary | round 0),
time = s"NOW()", # an s-string, given no `now` function exists in PRQL
]
SQL
SELECT
*,
CAST(salary + payroll_tax AS int) AS gross_salary,
ROUND(CAST(salary + payroll_tax AS int), 0) AS gross_salary_rounded,
NOW() AS time
FROM
employees
Strings
Strings in PRQL can use either single or double quotes:
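PRQL
from my_table
select x = "hello world"
SQL
SELECT
  'hello world' AS x
FROM
  my_table
A single-quoted 'hello world' compiles to the same SQL.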
To quote a string containing quotes, either use the “other” type of quote, or use three-or-more quotes, and close with the same number.
PRQL
from my_table
select x = """I said "hello world"!"""
SQL
SELECT
'I said "hello world"!' AS x
FROM
my_table
PRQL
from my_table
select x = """""I said """hello world"""!"""""
SQL
SELECT
'I said """hello world"""!' AS x
FROM
my_table
F-Strings and S-Strings
These special case strings can be used to:
F-Strings - Build up a new string from a set of columns or values
S-Strings - Insert SQL statements directly into the query. Use when PRQL doesn’t have an equivalent facility.
Currently PRQL allows multiline strings with either a single character or multiple character quotes. This may change for strings using a single character quote in future versions.
S-Strings
An s-string inserts SQL directly, as an escape hatch when there’s something that
PRQL doesn’t yet implement. For example, there’s no version()
function in SQL
that returns the Postgres version, so if we want to use that, we use an
s-string:
PRQL
from my_table
select db_version = s"version()"
SQL
SELECT
version() AS db_version
FROM
my_table
Embed a column name in an s-string using braces. For example, PRQL’s standard
library defines the average
function as:
func average column -> s"AVG({column})"
So this compiles using the function:
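PRQL
from employees
aggregate [average salary]
SQL
SELECT
  AVG(salary)
FROM
  employees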
Here’s an example of a more involved use of an s-string:
PRQL
from de=dept_emp
join s=salaries side:left [
(s.emp_no == de.emp_no),
s"""({s.from_date}, {s.to_date})
OVERLAPS
({de.from_date}, {de.to_date})"""
]
SQL
SELECT
de.*,
s.*
FROM
dept_emp AS de
LEFT JOIN salaries AS s ON s.emp_no = de.emp_no
AND (s.from_date, s.to_date) OVERLAPS (de.from_date, de.to_date)
For those who have used Python, s-strings are similar to Python’s f-strings, but the result is SQL code, rather than a string literal. For example, a Python f-string of f"average({col})" would produce "average(salary)", with quotes; while in PRQL, s"average({col})" produces average(salary), without quotes.
We can also use s-strings to produce a full table:
PRQL
from s"SELECT DISTINCT ON first_name, id, age FROM employees ORDER BY age ASC"
join s = s"SELECT * FROM salaries" [==id]
SQL
WITH table_2 AS (
SELECT
DISTINCT ON first_name,
id,
age
FROM
employees
ORDER BY
age ASC
),
table_3 AS (
SELECT
*
FROM
salaries
)
SELECT
table_0.*,
table_1.*
FROM
table_2 AS table_0
JOIN table_3 AS table_1 ON table_0.id = table_1.id
S-strings in user code are intended as an escape-hatch for an unimplemented feature. If we often need s-strings to express something, that’s a sign we should implement it in PRQL or PRQL’s stdlib.
Braces
To output braces from an s-string, use double braces:
PRQL
from employees
derive [
has_valid_title = s"regexp_contains(title, '([a-z0-9]*-){{2,}}')"
]
SQL
SELECT
*,
regexp_contains(title, '([a-z0-9]*-){2,}') AS has_valid_title
FROM
employees
Precedence
The PRQL compiler simply places a literal copy of each variable into the resulting string, which means we may get surprising behavior when the variable has multiple terms and the s-string isn’t parenthesized.
In this toy example, salary + benefits / 365 gets the precedence wrong:
PRQL
from employees
derive [
gross_salary = salary + benefits,
daily_rate = s"{gross_salary} / 365"
]
SQL
SELECT
*,
salary + benefits AS gross_salary,
salary + benefits / 365 AS daily_rate
FROM
employees
Instead, put {gross_salary} in parentheses:
PRQL
from employees
derive [
gross_salary = salary + benefits,
daily_rate = s"({gross_salary}) / 365"
]
SQL
SELECT
*,
salary + benefits AS gross_salary,
(salary + benefits) / 365 AS daily_rate
FROM
employees
F-Strings
F-strings are a readable approach to building new strings from existing strings. Currently PRQL supports this for concatenating strings:
PRQL
from employees
select full_name = f"{first_name} {last_name}"
SQL
SELECT
CONCAT(first_name, ' ', last_name) AS full_name
FROM
employees
This can be much easier to read for longer strings, relative to the SQL approach:
PRQL
from web
select url = f"http{tls}://www.{domain}.{tld}/{page}"
SQL
SELECT
CONCAT(
'http',
tls,
'://www.',
domain,
'.',
tld,
'/',
page
) AS url
FROM
web
Roadmap
In the future, f-strings may incorporate string formatting such as datetimes, numbers, and padding. If there’s a feature that would be helpful, please post an issue.
Switch
PRQL uses switch for both SQL’s CASE and IF statements. Here’s an example:
PRQL
from employees
derive distance = switch [
city == "Calgary" -> 0,
city == "Edmonton" -> 300,
]
SQL
SELECT
*,
CASE
WHEN city = 'Calgary' THEN 0
WHEN city = 'Edmonton' THEN 300
ELSE NULL
END AS distance
FROM
employees
If no condition is met, the value takes a null value. To set a default, use a true condition:
PRQL
from employees
derive distance = switch [
city == "Calgary" -> 0,
city == "Edmonton" -> 300,
true -> "Unknown",
]
SQL
SELECT
*,
CASE
WHEN city = 'Calgary' THEN 0
WHEN city = 'Edmonton' THEN 300
ELSE 'Unknown'
END AS distance
FROM
employees
Target & Version
Target dialect
PRQL allows specifying a target dialect at the top of the query, which allows PRQL to compile to a database-specific SQL flavor.
Examples
PRQL
prql target:sql.postgres
from employees
sort age
take 10
SQL
SELECT
*
FROM
employees
ORDER BY
age
LIMIT
10
PRQL
prql target:sql.mssql
from employees
sort age
take 10
SQL
SELECT
TOP (10) *
FROM
employees
ORDER BY
age
Supported dialects
Note that dialect support is early — most differences are not implemented, and most dialects’ implementations are identical to generic’s. Contributions are very welcome.
sql.ansi
sql.bigquery
sql.clickhouse
sql.generic
sql.hive
sql.mssql
sql.mysql
sql.postgres
sql.sqlite
sql.snowflake
sql.duckdb
Version
PRQL allows specifying a version of the language in the PRQL header, like:
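For example, a sketch of such a header (the exact version value shown is illustrative):
prql version:"0.4"
from employees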
This has two roles, one of which is implemented:
- The compiler will raise an error if the compiler is older than the query version. This prevents confusing errors when queries use newer features of the language but the compiler hasn’t yet been upgraded.
- The compiler will compile for the major version of the query. This allows the language to evolve without breaking existing queries, or forcing multiple installations of the compiler. This isn’t yet implemented, but is a gating feature for PRQL 1.0.
Bindings
PRQL has bindings for many languages. These include:
Elixir (prql)
PRQL bindings for Elixir.
Installation
def deps do
[
{:prql, "~> 0.1.0"}
]
end
Basic Usage
iex> PRQL.compile("from customers")
{:ok, "SELECT\n *\nFROM\n customers\n\n-- Generated by PRQL compiler version 0.3.1 (https://prql-lang.org)\n"}
iex> PRQL.compile("from customers\ntake 10", dialect: :mssql)
{:ok, "SELECT\n TOP (10) *\nFROM\n customers\n\n-- Generated by PRQL compiler version 0.3.1 (https://prql-lang.org)\n"}
Development
We are in the early stages of developing Elixir bindings.
We’re using Rustler to provide rust bindings for prql-compiler.
Currently, using the bindings in an Elixir project requires compiling the rust crate from this repo:
- Install dependencies with mix deps.get
- Compile the project with mix compile
- Run tests with mix test
Future work includes publishing pre-compiled artifacts, so Elixir projects can run PRQL without needing a rust toolchain.
Java (prql-java)
prql-java offers rust bindings to the prql-compiler rust library. It exposes a java native method public static native String toSql(String query).
Installation
<dependency>
<groupId>org.prqllang</groupId>
<artifactId>prql-java</artifactId>
<version>${PRQL_VERSION}</version>
</dependency>
Usage
import org.prqllang.prql4j.PrqlCompiler;
class Main {
public static void main(String[] args) {
String sql = PrqlCompiler.toSql("from table");
System.out.println(sql);
}
}
prql-js
JavaScript bindings for prql-compiler
.
Installation
npm install prql-js
Usage
Currently these functions are exposed:
function compile(prql_query: string, options?: CompileOptions): string;
function prql_to_pl(prql_query: string): string;
function pl_to_rq(pl_json: string): string;
function rq_to_sql(rq_json: string): string;
From NodeJS
Direct usage
const prqljs = require("prql-js");
const sql = prqljs.compile(`from employees | select first_name`);
console.log(sql.sql);
Template literal
const prqljs = require("prql-js");
const prql = (string) => prqljs.compile(string[0] || "");
const sql = prql`from employees | select first_name`;
console.log(sql.sql);
Template literal with newlines
const prqljs = require("prql-js");
const prql = (string) => prqljs.compile(string[0] || "");
const sql = prql`
from employees
select first_name
`;
console.log(sql.sql);
From a Browser
<html>
<head>
<script src="./node_modules/prql-js/dist/web/prql_js.js"></script>
<script>
const { compile } = wasm_bindgen;
async function run() {
await wasm_bindgen("./node_modules/prql-js/dist/web/prql_js_bg.wasm");
const sql = compile("from employees | select first_name");
console.log(sql);
}
run();
</script>
</head>
<body></body>
</html>
From a Framework or a Bundler
import compile from "prql-js/dist/bundler";
const sql = compile(`from employees | select first_name`);
console.log(sql);
Errors
Errors are returned as the following object, serialized as a JSON array:
interface ErrorMessage {
/// Plain text of the error
reason: string;
/// A list of suggestions of how to fix the error
hint: string | null;
/// Character offset of error origin within a source file
span: [number, number] | null;
/// Annotated code, containing cause and hints.
display: string | null;
/// Line and column number of error origin within a source file
location: SourceLocation | null;
}
/// Location within the source file.
/// Tuples contain:
/// - line number (0-based),
/// - column number within that line (0-based),
interface SourceLocation {
start: [number, number];
end: [number, number];
}
These errors can be caught as such:
try {
const sql = prqlJs.compile(`from employees | foo first_name`);
} catch (error) {
const errorMessages = JSON.parse(error.message).inner;
console.log(errorMessages[0].display);
console.log(errorMessages[0].location);
}
Development
Build:
npm run build
This builds Node, bundler and web packages in the dist
path.
Test:
npm test
Notes
- This uses wasm-pack to generate bindings¹.
- We’ve added an npm layer on top of the usual approach of just using wasm-pack, so we can distribute a single package with targets of node, bundler and no-modules — somewhat inverting the approach recommended by wasm-pack. The build instruction goes in a build script, rather than a pack script.

¹ Though we would be very open to other approaches, given wasm-pack does not seem maintained, and we're eliding many of its features to build for three targets.
Python (prql-python)
Installation
pip install prql-python
Usage
import prql_python as prql
prql_query = """
from employees
join salaries [==emp_id]
group [dept_id, gender] (
aggregate [
avg_salary = average salary
]
)
"""
sql = prql.compile(prql_query)
R (prqlr)
R bindings for prql-compiler. Check out https://eitsupi.github.io/prqlr/ for more context.
prqlr is generously maintained by @eitsupi in the eitsupi/prqlr repo.
Installation
install.packages("prqlr")
Usage
library(prqlr)
"
from employees
join salaries [emp_id]
group [dept_id, gender] (
aggregate [
avg_salary = average salary
]
)
" |>
prql_compile()
Rust (prql-compiler)
Installation
cargo new myproject
cd myproject
cargo add prql-compiler
Usage
cargo run
src/main.rs
use prql_compiler::compile;
use prql_compiler::sql;

fn main() {
    let prql = "from employees | select [name,age]";
    let opt = sql::Options {
        format: true,
        dialect: Some(sql::Dialect::SQLite),
        signature_comment: true,
    };
    let sql = compile(&prql, Some(opt)).unwrap();
    println!("PRQL: {}\nSQLite: {}", prql, sql);
}
Cargo.toml
[package]
name = "myproject"
version = "0.1.0"
edition = "2021"
[dependencies]
prql-compiler = "0.4.0"
Integrations
PRQL is building integrations with lots of external tools, including:
dbt-prql
Original docs at https://github.com/prql/dbt-prql
dbt-prql allows writing PRQL in dbt models. This combines the benefits of PRQL’s power & simplicity within queries, with dbt’s version control, lineage & testing across queries.
Once dbt-prql is installed, dbt commands compile PRQL between {% prql %} & {% endprql %} jinja tags to SQL as part of dbt’s compilation. No additional config is required.
Examples
Simple example
{% prql %}
from employees
filter (age | in 20..30)
{% endprql %}
…would appear to dbt as:
SELECT
employees.*
FROM
employees
WHERE
age BETWEEN 20
AND 30
Less simple example
{% prql %}
from {{ source('salesforce', 'in_process') }}
derive expected_sales = probability * value
join {{ ref('team', 'team_sales') }} [==name]
group name (
aggregate (expected_sales)
)
{% endprql %}
…would appear to dbt as:
SELECT
name,
{{ source('salesforce', 'in_process') }}.probability * {{ source('salesforce', 'in_process') }}.value AS expected_sales
FROM
{{ source('salesforce', 'in_process') }}
JOIN {{ ref('team', 'team_sales') }} USING(name)
GROUP BY
name
…and then dbt will compile the source
and ref
s to a full SQL query.
Replacing macros
dbt’s use of macros has saved many of us many lines of code, and even saved some people some time. But imperatively programming text generation with code like if not loop.last is not our highest calling. It’s the “necessary” part rather than the beautiful part of dbt.
Here’s the canonical example of macros in the dbt documentation:
{%- set payment_methods = ["bank_transfer", "credit_card", "gift_card"] -%}
select
order_id,
{%- for payment_method in payment_methods %}
sum(case when payment_method = '{{payment_method}}' then amount end) as {{payment_method}}_amount
{%- if not loop.last %},{% endif -%}
{% endfor %}
from {{ ref('raw_payments') }}
group by 1
Here’s that model using PRQL, including the prql jinja tags.
{% prql %}
func filter_amount method -> s"sum(case when payment_method = '{method}' then amount end) as {method}_amount"
from {{ ref('raw_payments') }}
group order_id (
aggregate [
filter_amount bank_transfer,
filter_amount credit_card,
filter_amount gift_card,
]
)
{% endprql %}
As well as the query being simpler in its final form, writing in PRQL also gives us live feedback around any errors, on every keystroke. Though there’s much more to come, check out the current version on the PRQL Playground.
What it does
When dbt compiles models to SQL queries:
- Any text in a dbt model between {% prql %} and {% endprql %} tags is compiled from PRQL to SQL before being passed to dbt.
- The PRQL compiler passes text containing {{ & }} through to dbt without modification, which allows us to embed jinja expressions in PRQL. (This was added to PRQL specifically for this use-case.)
- dbt will then compile the resulting model into its final form of raw SQL, and dispatch it to the database, as per usual.
There’s no config needed in the dbt project; this works automatically on any dbt command (e.g. dbt run) assuming dbt-prql is installed.
Installation
pip install dbt-prql
Current state
Currently this is new, but fairly feature-complete. It’s enthusiastically supported — if there are any problems, please open an issue.
Jupyter
Original docs at https://pyprql.readthedocs.io/en/latest/magic_readme.html
Work with pandas and PRQL in an IPython terminal or Jupyter notebook.
Implementation
This is a thin wrapper around the fantastic IPython-sql magic. Roughly speaking, all we do is parse PRQL to SQL and pass that through to ipython-sql.
Full documentation of the supported features is available at their repository. Here, we document the places where we differ from them, plus the features we think you are most likely to find useful.
Usage
Installation
If you have already installed PyPRQL into your environment, then you should be good to go! We bundle in IPython and pandas, though you’ll need to install Jupyter separately. If you haven’t installed PyPRQL, that’s as simple as:
pip install pyprql
Set Up
Open up either an IPython terminal or Jupyter notebook. First, we need to load the extension and connect to a database.
In [1]: %load_ext pyprql.magic
Connecting a database
We have two options for connecting a database:

1. Create an in-memory DB. This is the easiest way to get started.

In [2]: %prql duckdb:///:memory:

However, in-memory databases start off empty! So, we need to add some data. We have two options:

- We can easily add a pandas dataframe to the DuckDB database like so:

In [3]: %prql --persist df

where df is a pandas dataframe. This adds a table named df to the in-memory DuckDB instance.

- Or download a CSV and query it directly, with DuckDB:

!wget https://github.com/graphql-compose/graphql-compose-examples/blob/master/examples/northwind/data/csv/products.csv

…and then from products.csv will work.

2. Connect to an existing database.

When connecting to a database, pass the connection string as an argument to the line magic %prql. The connection string needs to be in SQLAlchemy format, so any connection supported by SQLAlchemy is supported by the magic. Additional connection parameters can be passed as a dictionary using the --connection_arguments flag to the %prql line magic. We ship with the necessary extensions to use DuckDB as the backend, and here connect to an in-memory database.
Querying
Now, let’s do a query! By default, PRQLMagic always returns the results as a dataframe, and always prints the results. The results of the previous query are accessible in the _ variable.
These examples are based on the products.csv example above.
In [4]: %%prql
...: from p = products.csv
...: filter supplierID == 1
Done.
Returning data to local variable _
productID productName supplierID categoryID quantityPerUnit unitPrice unitsInStock unitsOnOrder reorderLevel discontinued
0 1 Chai 1 1 10 boxes x 20 bags 18.0 39 0 10 0
1 2 Chang 1 1 24 - 12 oz bottles 19.0 17 40 25 0
2 3 Aniseed Syrup 1 2 12 - 550 ml bottles 10.0 13 70 25 0
In [5]: %%prql
...: from p = products.csv
...: group categoryID (
...: aggregate [average unitPrice]
...: )
Done.
Returning data to local variable _
categoryID avg("unitPrice")
0 1 37.979167
1 2 23.062500
2 7 32.370000
3 6 54.006667
4 8 20.682500
5 4 28.730000
6 3 25.160000
7 5 20.250000
We can capture the results into a different variable like so:
In [6]: %%prql results <<
...: from p = products.csv
...: aggregate [min unitsInStock, max unitsInStock]
Done.
Returning data to local variable results
min("unitsInStock") max("unitsInStock")
0 0 125
Now, the output of the query is saved to results.
Prefect
Because Prefect is native Python, it’s extremely easy to integrate with PRQL.
With a Postgres Task, replace:
PostgresExecute.run(..., query=sql)
…with…
PostgresExecute.run(..., query=pyprql.to_sql(prql))
We’re big fans of Prefect, and if there is anything that would make the integration easier, please open an issue.
VSCode extension
PRQL has a VSCode extension that compiles a PRQL query in a VSCode editor and displays the resulting SQL code in a second pane on the side. This is very handy for editing, saving, and reusing PRQL queries in VSCode.
To install the VSCode extension, open VSCode, type Ctrl-Shift-P (Cmd-Shift-P on a Mac) and type PRQL. Install the extension as usual.
Repo for the PRQL VSCode Extension
Rill
PRQL has had some work to integrate with Rill. See the Rill Issues for more details.
Examples
These examples are rewritten from other languages such as SQL. They try to express real-world problems in PRQL, covering most of the language features. We are looking for different use-cases of data transformation, be it database queries, semantic business modeling or data cleaning.
If you want to help, translate some of your queries to PRQL and open a PR to add them here!
PRQL
from employees
filter country == "USA" # Each line transforms the previous result.
derive [ # This adds columns / variables.
gross_salary = salary + payroll_tax,
gross_cost = gross_salary + benefits_cost # Variables can use other variables.
]
filter gross_cost > 0
group [title, country] ( # For each group use a nested pipeline
aggregate [ # Aggregate each group to a single row
average salary,
average gross_salary,
sum salary,
sum gross_salary,
average gross_cost,
sum_gross_cost = sum gross_cost,
ct = count,
]
)
sort sum_gross_cost
filter ct > 200
take 20
SQL
WITH table_1 AS (
SELECT
title,
country,
salary + payroll_tax + benefits_cost AS _expr_0,
salary + payroll_tax AS _expr_1,
salary
FROM
employees
WHERE
country = 'USA'
)
SELECT
title,
country,
AVG(salary),
AVG(_expr_1),
SUM(salary),
SUM(_expr_1),
AVG(_expr_0),
SUM(_expr_0) AS sum_gross_cost,
COUNT(*) AS ct
FROM
table_1
WHERE
_expr_0 > 0
GROUP BY
title,
country
HAVING
COUNT(*) > 200
ORDER BY
sum_gross_cost
LIMIT
20
PRQL
from employees
group [emp_no] (
aggregate [
emp_salary = average salary # average salary resolves to "AVG(salary)" (from stdlib)
]
)
join titles [==emp_no]
group [title] (
aggregate [
avg_salary = average emp_salary
]
)
select salary_k = avg_salary / 1000 # avg_salary should resolve to "AVG(emp_salary)"
take 10 # induces new SELECT
derive salary = salary_k * 1000 # salary_k should not resolve to "avg_salary / 1000"
SQL
WITH table_1 AS (
SELECT
AVG(salary) AS _expr_0,
emp_no
FROM
employees
GROUP BY
emp_no
)
SELECT
AVG(table_1._expr_0) / 1000 AS salary_k,
AVG(table_1._expr_0) / 1000 * 1000 AS salary
FROM
table_1
JOIN titles ON table_1.emp_no = titles.emp_no
GROUP BY
titles.title
LIMIT
10
Single item is coerced into a list
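PRQL
from employees
select salary
SQL
SELECT
  salary
FROM
  employees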
Same as above but with salary in a list:
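PRQL
from employees
select [salary]
SQL
SELECT
  salary
FROM
  employees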
Multiple items
PRQL
from employees
derive [
gross_salary = salary + payroll_tax,
gross_cost = gross_salary + benefits_cost
]
SQL
SELECT
*,
salary + payroll_tax AS gross_salary,
salary + payroll_tax + benefits_cost AS gross_cost
FROM
employees
Same as above but split into two lines:
PRQL
from employees
derive gross_salary = salary + payroll_tax
derive gross_cost = gross_salary + benefits_cost
SQL
SELECT
*,
salary + payroll_tax AS gross_salary,
salary + payroll_tax + benefits_cost AS gross_cost
FROM
employees
PRQL
let newest_employees = (
from employees
sort tenure
take 50
)
let average_salaries = (
from salaries
group country (
aggregate average_country_salary = (average salary)
)
)
from newest_employees
join average_salaries [==country]
select [name, salary, average_country_salary]
SQL
WITH average_salaries AS (
SELECT
country,
AVG(salary) AS average_country_salary
FROM
salaries
GROUP BY
country
),
newest_employees AS (
SELECT
*
FROM
employees
ORDER BY
tenure
LIMIT
50
)
SELECT
newest_employees.name,
newest_employees.salary,
average_salaries.average_country_salary
FROM
newest_employees
JOIN average_salaries ON newest_employees.country = average_salaries.country
Employees
These are homework tasks on the employees database.
Clone and init the database (requires a local PostgreSQL instance):
psql -U postgres -c 'CREATE DATABASE employees;'
git clone https://github.com/vrajmohan/pgsql-sample-data.git
psql -U postgres -d employees -f pgsql-sample-data/employee/employees.dump
Execute a PRQL query:
cd prql-compiler
cargo run compile examples/employees/average-title-salary.prql | psql -U postgres -d employees
Task 1
Rank the employee titles according to the average salary for each department.
My solution:
- for each employee, find their average salary,
- join employees with their departments and titles (duplicating employees for each of their titles and departments)
- group by department and title, aggregating average salary
- join with department to get department name
PRQL
from salaries
group [emp_no] (
aggregate [emp_salary = average salary]
)
join t=titles [==emp_no]
join dept_emp side:left [==emp_no]
group [dept_emp.dept_no, t.title] (
aggregate [avg_salary = average emp_salary]
)
join departments [==dept_no]
select [dept_name, title, avg_salary]
SQL
WITH table_1 AS (
SELECT
AVG(salary) AS _expr_0,
emp_no
FROM
salaries
GROUP BY
emp_no
),
table_2 AS (
SELECT
t.title,
AVG(table_1._expr_0) AS avg_salary,
dept_emp.dept_no
FROM
table_1
JOIN titles AS t ON table_1.emp_no = t.emp_no
LEFT JOIN dept_emp ON table_1.emp_no = dept_emp.emp_no
GROUP BY
dept_emp.dept_no,
t.title
)
SELECT
departments.dept_name,
table_2.title,
table_2.avg_salary
FROM
table_2
JOIN departments ON table_2.dept_no = departments.dept_no
Task 2
Estimate distribution of salaries and gender for each department.
PRQL
from e=employees
join salaries [==emp_no]
group [e.emp_no, e.gender] (
aggregate [
emp_salary = average salaries.salary
]
)
join de=dept_emp [==emp_no] side:left
group [de.dept_no, gender] (
aggregate [
salary_avg = average emp_salary,
salary_sd = stddev emp_salary,
]
)
join departments [==dept_no]
select [dept_name, gender, salary_avg, salary_sd]
SQL
WITH table_1 AS (
SELECT
e.gender,
AVG(salaries.salary) AS _expr_0,
e.emp_no
FROM
employees AS e
JOIN salaries ON e.emp_no = salaries.emp_no
GROUP BY
e.emp_no,
e.gender
),
table_2 AS (
SELECT
table_1.gender,
AVG(table_1._expr_0) AS salary_avg,
STDDEV(table_1._expr_0) AS salary_sd,
de.dept_no
FROM
table_1
LEFT JOIN dept_emp AS de ON table_1.emp_no = de.emp_no
GROUP BY
de.dept_no,
table_1.gender
)
SELECT
departments.dept_name,
table_2.gender,
table_2.salary_avg,
table_2.salary_sd
FROM
table_2
JOIN departments ON table_2.dept_no = departments.dept_no
Task 3
Estimate distribution of salaries and gender for each manager.
PRQL
from e=employees
join salaries [==emp_no]
group [e.emp_no, e.gender] (
aggregate [
emp_salary = average salaries.salary
]
)
join de=dept_emp [==emp_no]
join dm=dept_manager [
(dm.dept_no == de.dept_no) and s"(de.from_date, de.to_date) OVERLAPS (dm.from_date, dm.to_date)"
]
group [dm.emp_no, gender] (
aggregate [
salary_avg = average emp_salary,
salary_sd = stddev emp_salary
]
)
derive mng_no = emp_no
join managers=employees [==emp_no]
derive mng_name = s"managers.first_name || ' ' || managers.last_name"
select [mng_name, managers.gender, salary_avg, salary_sd]
SQL
WITH table_1 AS (
SELECT
e.gender,
AVG(salaries.salary) AS _expr_0,
e.emp_no
FROM
employees AS e
JOIN salaries ON e.emp_no = salaries.emp_no
GROUP BY
e.emp_no,
e.gender
),
table_2 AS (
SELECT
AVG(table_1._expr_0) AS salary_avg,
STDDEV(table_1._expr_0) AS salary_sd,
dm.emp_no
FROM
table_1
JOIN dept_emp AS de ON table_1.emp_no = de.emp_no
JOIN dept_manager AS dm ON dm.dept_no = de.dept_no
AND (de.from_date, de.to_date) OVERLAPS (dm.from_date, dm.to_date)
GROUP BY
dm.emp_no,
table_1.gender
)
SELECT
managers.first_name || ' ' || managers.last_name AS mng_name,
managers.gender,
table_2.salary_avg,
table_2.salary_sd
FROM
table_2
JOIN employees AS managers ON table_2.emp_no = managers.emp_no
Task 4
Find distributions of titles, salaries and genders for each department.
PRQL
from de=dept_emp
join s=salaries side:left [
(s.emp_no == de.emp_no),
s"({s.from_date}, {s.to_date}) OVERLAPS ({de.from_date}, {de.to_date})"
]
group [de.emp_no, de.dept_no] (
aggregate salary = (average s.salary)
)
join employees [==emp_no]
join titles [==emp_no]
select [dept_no, salary, employees.gender, titles.title]
SQL
WITH table_1 AS (
SELECT
de.dept_no,
AVG(s.salary) AS salary,
de.emp_no
FROM
dept_emp AS de
LEFT JOIN salaries AS s ON s.emp_no = de.emp_no
AND (s.from_date, s.to_date) OVERLAPS (de.from_date, de.to_date)
GROUP BY
de.emp_no,
de.dept_no
)
SELECT
table_1.dept_no,
table_1.salary,
employees.gender,
titles.title
FROM
table_1
JOIN employees ON table_1.emp_no = employees.emp_no
JOIN titles ON table_1.emp_no = titles.emp_no
Contributing
If you’re interested in joining the community to build a better SQL, here are ways to start:
- Star this repo.
- Send a link to PRQL to a couple of people whose opinion you respect.
- Subscribe to Issue #1 for updates.
- Join our Discord.
- Follow us on Twitter.
- Find an issue labeled Good First Issue and start contributing to the code.
- Join our fortnightly Developer Call; (iCal file).
PRQL is evolving from a project with lots of excitement into a project that folks are using in their work and integrating into their tools. We’re actively looking for collaborators to lead that growth with us.
Areas for larger contributions
Compiler
The compiler is written in Rust, and there's enough to do that any level of experience with Rust is sufficient.
We try to keep a few onboarding issues on hand under the “good first issue” label. These have been screened to have sufficient context to get started (and we very much welcome questions where there’s some context missing).
To get started, check out the docs on Development and the Compiler architecture.
And if you have questions, there are lots of friendly people on the Discord who will patiently help you.
Bindings & integrations
For PRQL to be successful, it needs to be available for the languages & tools that people already use.
- We currently have bindings to the PRQL compiler in a few different languages; many of these can be improved, documented, and packaged in a better way.
- If you have experience with packaging in an ecosystem that doesn’t currently have bindings, creating PRQL bindings for that language would be valuable to the project.
- If there’s a tool that you use yourself to run data queries which you think would benefit from a PRQL integration, suggest one to us or the tool. If it’s open-source, build & share a prototype.
Relevant issues are labeled Integrations.
Language design
We decide on new language features in GitHub issues, usually under the “language design” label.
You can also contribute by:
- Finding instances where the compiler produces incorrect results, and posting a bug report — feel free to use the playground.
- Opening an issue / appending to an existing issue with examples of queries that are difficult to express in PRQL — especially if they are more difficult than in SQL.
With sufficient examples, suggest a change to the language! (Though suggestions without examples are difficult to engage with, so please do anchor suggestions in examples.)
Marketing
- Improve our website. We have a few issues open on this front and are looking for anyone with at least some design skills.
- Contribute towards the docs. Anything from shaping a whole section of the docs, to simply improving a confusing paragraph or fixing a typo.
- Tell people about PRQL.
- Find a group of users who would be interested in PRQL, help them get up to speed, help the project understand what they need.
Core team
If you have any questions or feedback and don’t receive a response on one of the general channels such as GitHub or Discord, feel free to reach out to:
- @aljazerzen — Aljaž Mur Eržen
- @max-sixty — Maximilian Roos
- @snth — Tobias Brandt
Core team Emeritus
Thank you to those who have previously served on the core team:
- @charlie-sanders — Charlie Sanders
Development
Setting up an initial dev environment
We can set up a local development environment sufficient for navigating, editing, and testing PRQL’s compiler code in two minutes:
- Install rustup & cargo.
- [Optional but highly recommended] Install cargo-insta, our testing framework:
  cargo install cargo-insta
- That’s it! Running the unit tests for the prql-compiler crate after cloning the repo should complete successfully:
  cargo test -p prql-compiler --lib
  …or, to run tests and update the test snapshots:
  cargo insta test --accept -p prql-compiler --lib
There’s more context on our tests in How we test below.
That’s sufficient for making an initial contribution to the compiler.
Setting up a full dev environment
Note: We really care about this process being easy, both because the project benefits from more contributors like you, and to reciprocate your future contribution. If something isn’t easy, please let us know in a GitHub Issue. We’ll enthusiastically help you, and use your feedback to improve the scripts & instructions.
For more advanced development (for example, compiling for wasm or previewing the website), we have two options:
Option 1: Use the project’s task
Note: This is tested on MacOS, should work on Linux, but won’t work on Windows.
- Install Task; either brew install go-task/tap/go-task or as described on Task.
- Then run the setup-dev task. This runs commands from our Taskfile.yml, installing dependencies with cargo, brew, npm & pip, and suggests some VSCode extensions:
  task setup-dev
Option 2: Install tools individually
- We’ll need cargo-insta, to update snapshot tests:
  cargo install cargo-insta
- We’ll need a couple of additional components, which most systems will have already. The easiest way to check whether they’re installed is to try running the full tests:
  cargo test
  …and if that doesn’t complete successfully, check we have:
  - A clang compiler, to compile the DuckDB integration tests, since we use duckdb-rs. To install one:
    - On MacOS, install xcode with xcode-select --install
    - On Debian Linux, apt-get update && apt-get install clang
    - On Windows, duckdb-rs isn’t supported, so these tests are excluded
  - Python >= 3.7, to compile prql-python.
- For more involved contributions, such as building the website, playground, book, or some release artifacts, we’ll need some additional tools. But we won’t need those immediately, and the error messages on what’s missing should be clear when we attempt those things. When we hit them, the Taskfile.yml will be a good source to copy & paste instructions from.
Building & testing the full project
We have a couple of tasks which incorporate all building & testing. While they don’t need to be run as part of a standard dev loop — generally we’ll want to run a more specific test — they can be useful as a backstop to ensure everything works, and as a reference for how each part of the repo is built & tested. They should be consistent with the GitHub Actions workflows; please report any inconsistencies.
To build everything:
task build-all
To run all tests (which includes building everything):
task test-all
These require installing Task, either brew install go-task/tap/go-task or as described on Task.
Contribution workflow
We’re similar to most projects on GitHub — open a Pull Request with a suggested change!
Commits
- If a change is user-facing, please add a line in CHANGELOG.md, with {message}, ({@contributor, #X}) where X is the PR number.
  - If there’s a missing entry, a follow-up PR containing just the changelog entry is welcome.
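  For example, a complete entry might look like the following (the message, contributor handle, and PR number are all invented for illustration):
  Fix name resolution of columns in joined tables, (@your-handle, #9999)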
- We’re experimenting with using the Conventional Commits message format, enforced through action-semantic-pull-request. This would let us generate Changelogs automatically. The check is not required to pass at the moment.
Merges
- We merge any code that makes PRQL better
- A PR doesn’t need to be perfect to be merged; it doesn’t need to solve a big
problem. It needs to:
- be in the right direction
- make incremental progress
- be explicit on its current state, so others can continue the progress
- If you have merge permissions, and are reasonably confident that a PR is
suitable to merge (whether or not you’re the author), feel free to merge.
- If you don’t have merge permissions and have authored a few PRs, ask and ye shall receive.
- The primary way we ratchet the code quality is through automated tests.
- This means PRs almost always need a test to demonstrate incremental progress.
- If a change breaks functionality without breaking tests, our tests were insufficient.
- We use PR reviews to give general context, offer specific assistance, and
collaborate on larger decisions.
- Reviews around ‘nits’ like code formatting / idioms / etc are very welcome. But the norm is for them to be received as helpful advice, rather than as mandatory tasks to complete. Adding automated tests & lints to automate these suggestions is welcome.
- If you have merge permissions and would like a PR to be reviewed before it merges, that’s great — ask or assign a reviewer.
- If a PR hasn’t received attention after a day, please feel free to ping the pull request.
- People may review a PR after it’s merged. As part of the understanding that we can merge quickly, contributors are expected to incorporate substantive feedback into a future PR.
- We should revert quickly if the impact of a PR turns out not to be consistent with our expectations, or there isn’t as much consensus on a decision as we had hoped. It’s very easy to revert code and then re-revert when we’ve resolved the issue; it’s a sign of moving quickly.
Components of PRQL
The PRQL project has several components. Instructions for working with them are in the README.md file in their respective paths. Here’s an overview:
book: The PRQL language book, which documents the language.
playground: A web GUI for the PRQL compiler. It shows the PRQL source beside the resulting SQL output.
prql-compiler: Installation and usage instructions for building and running the prql-compiler.
prql-java: Java bindings to the prql-compiler Rust library.
prql-js: JavaScript bindings to the prql-compiler Rust library.
prql-lib: Generates .a and .so libraries from the prql-compiler Rust library, for bindings to other languages.
prql-python: Python bindings to the prql-compiler Rust library.
website: Our website, hosted at https://prql-lang.org, built with hugo.
How we test
We use a pyramid of tests — we have fast, focused tests at the bottom of the pyramid, which give us low-latency feedback when developing, and then slower, broader tests which ensure that we don’t miss anything as PRQL develops.
Our approach is very consistent with @matklad’s advice, in his excellent blog post How to Test (https://matklad.github.io//2021/05/31/how-to-test.html).
Note
If you’re making your first contribution, you don’t need to engage with all this — it’s fine to just make a change and push the results; the tests that run in GitHub will point you towards any errors, which can then be run locally if needed. We’re always around to help out.
Our tests, from the bottom of the pyramid to the top:
- Static checks — we run a few static checks to ensure the code stays healthy and consistent. They’re defined in .pre-commit-config.yaml, using pre-commit. They can be run locally with:
  pre-commit run -a
  The checks fix most of the issues they find themselves. Most of them also run on GitHub on every commit; any changes they make are added onto the branch automatically in an additional commit.
- Unit tests & inline insta snapshots — like most projects, we rely on unit tests to rapidly check that our code basically works. We extensively use Insta, a snapshot testing tool which writes out the values generated by our code, making it fast & simple to write and modify tests.
  These are the fastest tests which run our code; they’re designed to run on every save while you’re developing. We include a task which does this (the full command run on every save is cargo insta test --accept -p prql-compiler --lib):
  task -w test-rust-fast
  Here’s an example of an insta test: https://github.com/PRQL/prql/blob/0.2.2/prql-compiler/src/parser.rs#L580-L605 — note that only the initial line of each test is written by us; the remainder is filled in by insta.
- Integration tests — these run tests against real databases, to ensure we’re producing correct SQL.
- Examples — we compile all examples in the PRQL Book, to test that they produce the SQL we expect, and that changes to our code don’t cause any unexpected regressions.
- GitHub Actions on every commit — we run tests on prql-compiler for standard & wasm targets, and the examples in the book, on every commit pushed to a pull request. These are designed to run in under two minutes, and we should be reassessing their scope if they grow beyond that. Once these pass, a pull request can be merged. All tests up to this point can be run with:
  task test-all
- GitHub Actions on specific changes — we run additional tests on pull requests when we identify changes to some paths, such as bindings to other languages.
- GitHub Actions on merge — we run many more tests on every merge to main. This includes testing across OSs, all our language bindings, our task tasks, a measure of test code coverage, and some performance benchmarks. We can run these tests before a merge by adding the pr-test-all label to the PR. If these tests fail after merging, we revert the merged commit before fixing the test and then re-reverting.
  Note: we reference “actions”, such as build-prql-python (https://github.com/PRQL/prql/blob/main/.github/actions/build-prql-python/action.yaml), from workflows. We need to use these actions since workflow calls can only have a depth of 2 (i.e. a workflow can call workflows, but those workflows can’t call other workflows). We occasionally copy & paste small amounts of yaml where we don’t want to abstract something tiny into another action.
  An alternative approach would be to have all jobs in a single workflow which is called on every change, with each job containing all its own filtering logic. So pull-request.yaml and test-all.yaml would be a single file, and test-python would be a job that has an if containing a) path changes, b) a branch condition for main, and c) a PR label filter. That would be a “flatter” approach — each job contains all its own criteria. The downside would be less abstraction, more verbose jobs, and a long list of ~25–30 skipped jobs on every PR (since each job is skipped, rather than never started). Ideally we wouldn’t have to make these tradeoffs — GHA would offer an arbitrary DAG of workflows, with filters at each level, and a UI that less prominently displays workflows which aren’t designed to run.
- GitHub Actions nightly — we run tests that take a long time or are unrelated to code changes, such as security checks, or expensive timing benchmarks, every night. We can run these tests before a merge by adding the pr-cron label to the PR.
The goal of our tests is to allow us to make changes quickly. If you find they’re making it more difficult for you to make changes, or there are missing tests that would give you the confidence to make changes faster, then please raise an issue.
Releasing
Currently we release in a semi-automated way:
- PR & merge an updated Changelog. GitHub will produce a draft version at https://github.com/PRQL/prql/releases/new, including “New Contributors”.
- Run cargo release version patch -x && cargo release replace -x to bump the versions, then PR the resulting commit.
- After merging, go to Draft a new release (only maintainers have access to this page), copy the changelog entry into the release description, enter the tag to be created, and hit “Publish”. Unfortunately, GitHub’s markdown parser interprets linebreaks as newlines; we haven’t found a better way of making the markdown look reasonable than manually editing the text.
- From there, both the tag and the release are created, and all packages are published automatically based on our release workflow.
- Add in the sections for a new Changelog:
  ## 0.4.X — [unreleased]
  **Features**:
  **Fixes**:
  **Documentation**:
  **Web**:
  **Integrations**:
  **Internal changes**:
  **New Contributors**:
We may make this more automated in future; e.g. automatic changelog creation.
Using the Dockerfile
The Dockerfile in this repo builds a Docker image that has current versions of our Rust development tools. This can be the lowest-effort way of setting up a Rust environment for those who don’t have one already.
Development cycle
The developer loop when using Docker is substantially the same as if the tools had been installed directly.
All the source files live in the prql directory on the host. As the source changes, the tools (running in the Docker container) can watch those directories and re-run, so results are instantly visible.
When the Docker container exits (say, at the end of the development session), the prql directory on the local machine contains the latest files. Use git to pull or to push the prql repo from the host as normal.
To do all this, build the Docker image and start a container as described in the Installation section.
Installation
Once Docker is installed, build the Docker image with the following commands.
Note: It will take some time while Docker pulls in all the necessary developer tools.
cd <top-level-PRQL-directory>
docker build -t prql .
Optional: Install pre-commit on the machine that hosts Docker. It runs several Static Checks to ensure code consistency. You can also configure git to run pre-commit automatically for each commit with the second (one-time) command below.
pre-commit run -a # Run checks manually
pre-commit install # (one time) install the git hooks
Finally, start up the Docker container with:
cd <top-level-PRQL-directory>
docker run -it -v $(pwd)/:/src -p 3000:3000 prql
- There’ll be a root@xxxxxxxxx:/src/# prompt
- Enter a command to run or test code; for example cargo test
- Enter exit to stop the container
Running code with Docker
Currently our Docker image only supports running Rust dependencies. (Adding hugo & nodejs so that the playground can run would be a welcome contribution.)
Use the docker run... command above, then enter the relevant commands; for example cargo insta test --accept or task run book — more details of the commands are in each component’s README.md file or our Development docs.
Note: The first time you run a component, it may take some time to install additional files. Once they’re built, start up is quick.
Developing the Dockerfile
When making updates to the Dockerfile, we have automated testing that the Dockerfile builds on each merge, in test-all.yaml, and automated testing that confirms all Rust tests pass, in nightly.yaml. Add the pr-test-all or pr-cron label to a PR to run these tests before merging.
Internals
This chapter explains PRQL’s semantics: how expressions are interpreted and their meaning. It’s intended for advanced users and compiler contributors.
Compiler architecture
The compiler works in the following stages:
- Lexing & parsing - split the PRQL text into tokens, build a parse tree, and convert it into our AST (Abstract Syntax Tree, see the ast module). Parsing is done using the PEST parser (prql.pest); the AST is constructed in parser.rs.
- Semantic analysis - resolves names (identifiers), extracts declarations, and determines frames (the columns of the table in each step). It declares a Context that contains the root module (a mapping from accessible names to their declarations).
  Resolving includes the following operations:
  - Assign an id to each node (Expr and Stmt).
  - Extract function declarations and variable defs into the appropriate Module, accessible from Context::root_mod.
  - Look up identifiers in the module and find the associated declaration. The ident is replaced with a fully qualified name that guarantees a unique name in root_mod. Sometimes, Expr::target is also set.
  - Convert function calls to transforms (from, derive, filter) from FuncCall into TransformCall, which is more convenient for later processing.
  - Determine the type of each expression. If an expr is a reference to a table, use the frame of the table as its type. If it is a TransformCall, apply the transform to the input frame to obtain the resulting type. For simple expressions, try to infer the type from ExprKind.
- Lowering - converts PL into RQ, which is more strictly typed and contains less information, but is convenient for translating into SQL or some other backend.
- SQL backend - converts RQ into SQL. It converts each of the relations into a SQL query. Pipelines are analyzed and split at appropriate positions into “AtomicPipelines”, each of which can be represented by a single SELECT statement.
  Splitting is done back-to-front. First, we start with a list of all the output columns we want. Then we traverse the pipeline backwards, and split when we encounter a transform that is incompatible with the transforms already present in the pipeline. A split can also be triggered by encountering an expression that cannot be materialized where it is used (a window function in WHERE, for example); see the sketch after this list.
  This process is also called anchoring, because it anchors a column definition to a specific location in the output query.
  During this process, sql::context keeps track of:
  - table instances in the query (to prevent mixing up two instances of the same table),
  - column definitions, whether computed or a reference to a table column,
  - column names, as defined in RQ or generated.
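As a sketch of why splitting is needed (the table and column names here are hypothetical): a filter placed after take cannot share a SELECT statement with the LIMIT, so the pipeline is anchored into two atomic pipelines. The SQL below is approximate; the compiler’s exact output formatting differs.
PRQL
from employees
sort salary
take 100
filter country == 'USA'
SQL
WITH table_1 AS (
  SELECT * FROM employees ORDER BY salary LIMIT 100
)
SELECT * FROM table_1 WHERE country = 'USA'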
Name resolving
Because PRQL primarily handles relational data, it has specialized scoping rules for referencing columns.
Scopes
In PRQL’s compiler, a scope is the collection of all names one can reference from a specific point in the program.
In PRQL, names in the scope are composed of a namespace and a variable name, separated by a dot, similar to SQL. Namespaces can contain many dots, but variable names cannot.
The name my_table.some_column is a variable some_column from namespace my_table.
The name foo.bar.baz is a variable baz from namespace foo.bar.
When processing a query, a scope is maintained and updated for each point in the query.
It starts with only the namespace std, which is the standard library. It contains common functions like sum or count, along with all transform functions such as derive and group.
In pipelines (or rather in transform functions), the scope is also injected with namespaces of tables which may have been referenced with from or join transforms. These namespaces simply contain all the columns of the table, and possibly a wildcard variable, which matches any variable (see the algorithm below). Within transforms, there is also a special namespace that does not have a name. It is called a “frame”, and it contains the columns of the current table the transform is operating on.
Resolving
For each ident we want to resolve, we search the scope’s items in order. One of three things can happen:
- The scope contains an exact match, i.e. a name that matches in both namespace and variable name.
- The scope does not contain an exact match, but the ident did not specify a namespace, so we can match a namespace that contains a * wildcard. If there’s a single such namespace, the matched namespace is also updated to contain this new variable name.
- Otherwise, nothing is matched and an error is raised.
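For example (a sketch, with hypothetical table and column names): after from employees, the ident first_name carries no namespace, so it matches the wildcard of the employees namespace and resolves exactly as if it had been written as employees.first_name:
PRQL
from employees
select [first_name, employees.last_name]
SQL
SELECT first_name, last_name FROM employees
(The SQL is approximate; with a single table, no prefixes are needed, as the next section explains.)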
Translating to SQL
When translating into a SQL statement which references only one table, there is no need to reference column names with a table prefix.
But when there are multiple tables and we don’t have complete knowledge of all table columns, a column without a prefix (i.e. first_name) may actually reside in multiple tables. Because of this, we have to use table prefixes for all column names.
PRQL
from employees
derive [first_name, dept_id]
join d=departments [==dept_id]
select [first_name, d.title]
SQL
SELECT
employees.first_name,
d.title
FROM
employees
JOIN departments AS d ON employees.dept_id = d.dept_id
As you can see, employees.first_name now needs a table prefix, to prevent conflicts with a potential column of the same name in the departments table. Similarly, d.title needs the table prefix.
Functions
Function call
The major distinction between PRQL and today’s conventional programming languages such as C or Python is the function call syntax. It consists of the function name followed by arguments separated by whitespace.
function_name arg1 arg2 arg3
If one of the arguments is also a function call, it must be encased in parentheses, so we know where the arguments of the inner function end and the arguments of the outer function start.
outer_func arg_1 (inner_func arg_a, arg_b) arg_2
Pipeline
There is an alternative way of calling functions: using a pipeline. Regardless of whether the pipeline is delimited by the pipe symbol | or a new line, the pipeline is equivalent to passing the result of each expression as the last argument of the next function.
a | foo 3 | bar 'hello' 'world' | baz
… is equivalent to …
baz (bar 'hello' 'world' (foo 3 a))
As you may have noticed, transforms are regular functions too!
PRQL
from employees
filter age > 50
sort name
SQL
SELECT
*
FROM
employees
WHERE
age > 50
ORDER BY
name
… is equivalent to …
PRQL
from employees | filter age > 50 | sort name
SQL
SELECT
*
FROM
employees
WHERE
age > 50
ORDER BY
name
… is equivalent to …
PRQL
filter age > 50 (from employees) | sort name
SQL
SELECT
*
FROM
employees
WHERE
age > 50
ORDER BY
name
… is equivalent to …
PRQL
sort name (filter age > 50 (from employees))
SQL
SELECT
*
FROM
employees
WHERE
age > 50
ORDER BY
name
As you can see, the first example, written with pipeline notation, is much easier to comprehend than the last one, written with regular function call notation. This is why we recommend using pipelines for nested function calls that are 3 or more levels deep.
Currying and late binding
In PRQL, functions are first-class citizens. As cool as that sounds, we need simpler terms to explain it. In essence, it means that we can operate on functions as we do with any other value.
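For instance (a minimal sketch; the function, table, and column names here are all hypothetical), we can define a function once and then call it in a pipeline like any built-in:
PRQL
func celsius_of_fahrenheit temp -> (temp - 32) / 1.8

from weather
derive temp_c = (celsius_of_fahrenheit temp_f)
Because functions are values, celsius_of_fahrenheit here is simply a name bound to a function, resolved through the same scoping rules as any other name.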
Syntax highlighting
PRQL contains multiple grammar definitions to enable tools to highlight PRQL code. These are all intended to provide as good an experience as the grammar supports. Please raise any shortcomings in a GitHub issue.
The definitions are somewhat scattered around the codebase; this page serves as an index.
- Lezer — used by CodeMirror editors. The PRQL file is at prql-lezer/README.me.
- Handlebars — currently duplicated:
  - The book: book/highlight-prql.js
  - The website (outside of the book & playground): website/themes/prql-theme/static/plugins/highlight/prql.js
- Textmate — used by the VSCode extension. It’s in the prql-vscode repo, in prql-vscode/syntaxes/prql.tmLanguage.json.
- Monarch — used by the Monaco editor, which we use for the Playground. The grammar is at playground/src/workbench/prql-syntax.js.
While the pest grammar at prql-compiler/src/parser/prql.pest isn’t used for syntax highlighting, it’s the arbiter of truth, given it currently powers the PRQL compiler.