Introduction

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement. Like SQL, it’s readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.

Let’s get started with an example:

PRQL

from employees
filter start_date > @2021-01-01               # Clear date syntax.
derive [                                      # `derive` adds columns / variables.
  gross_salary = salary + (tax ?? 0),         # Terse coalesce
  gross_cost = gross_salary + benefits_cost,  # Variables can use other variables.
]
filter gross_cost > 0
group [title, country] (                      # `group` runs a pipeline over each group.
  aggregate [                                 # `aggregate` reduces each group to a row.
    average gross_salary,
    sum_gross_cost = sum gross_cost,          # `=` sets a column name.
  ]
)
filter sum_gross_cost > 100000                # Identical syntax for SQL's `WHERE` & `HAVING`.
derive [
  id = f"{title}_{country}",                   # F-strings like python.
  db_version = s"version()",                  # An S-string, which transpiles directly into SQL
]
sort [sum_gross_cost, -country]               # `-country` means descending order.
take 1..20                                    # Range expressions (also valid here as `take 20`).

SQL

SELECT
  title,
  country,
  AVG(salary + COALESCE(tax, 0)),
  SUM(salary + COALESCE(tax, 0) + benefits_cost) AS sum_gross_cost,
  CONCAT(title, '_', country) AS id,
  version() AS db_version
FROM
  employees
WHERE
  start_date > DATE '2021-01-01'
  AND salary + COALESCE(tax, 0) + benefits_cost > 0
GROUP BY
  title,
  country
HAVING
  SUM(salary + COALESCE(tax, 0) + benefits_cost) > 100000
ORDER BY
  sum_gross_cost,
  country DESC
LIMIT
  20

As you can see, PRQL is a linear pipeline of transformations — each line of the query is a transformation of the previous line’s result.

In SQL, by contrast, operations don't follow one another, which makes it harder to compose larger queries.

Queries

Pipelines

The simplest pipeline

The simplest pipeline is just:

PRQL

from employees

SQL

SELECT
  employees.*
FROM
  employees

Adding transformations

We can add additional lines; each one transforms the previous result:

PRQL

from employees
derive gross_salary = (salary + payroll_tax)

SQL

SELECT
  employees.*,
  salary + payroll_tax AS gross_salary
FROM
  employees

…and so on:

PRQL

from employees
derive gross_salary = (salary + payroll_tax)
sort gross_salary

SQL

SELECT
  employees.*,
  salary + payroll_tax AS gross_salary
FROM
  employees
ORDER BY
  gross_salary

Compiling to SQL

When compiling to SQL, the PRQL compiler will try to represent as many transforms as possible with a single SELECT statement. When necessary it will “overflow” using CTEs (common table expressions):

PRQL

from e = employees
derive gross_salary = (salary + payroll_tax)
sort gross_salary
take 10
join d = department [dept_no]
select [e.name, gross_salary, d.name]

SQL

WITH table_0 AS (
  SELECT
    e.*,
    salary + payroll_tax AS gross_salary
  FROM
    employees AS e
  ORDER BY
    gross_salary
  LIMIT
    10
)
SELECT
  table_0.name,
  table_0.gross_salary,
  d.name
FROM
  table_0
  JOIN department AS d USING(dept_no)

See also

Syntax

Functions

Functions are a fundamental abstraction in PRQL — they allow us to run code in many places that we’ve written once. This reduces the number of errors in our code, makes our code more readable, and simplifies making changes.

Functions have two types of parameters:

  1. Positional parameters, which require an argument.
  2. Named parameters, which optionally take an argument, otherwise using their default value.

So this function is named celsius_of_fahrenheit and has one parameter temp:

PRQL

func celsius_of_fahrenheit temp -> (temp - 32) / 1.8

from cities
derive temp_c = (celsius_of_fahrenheit temp_f)

SQL

SELECT
  cities.*,
  (temp_f - 32) / 1.8 AS temp_c
FROM
  cities

This function is named interp; it has two positional parameters, higher and x, and one named parameter, lower, which takes a default argument of 0. It calculates the proportion of the distance that x is between lower and higher.

PRQL

func interp lower:0 higher x -> (x - lower) / (higher - lower)

from students
derive [
  sat_proportion_1 = (interp 1600 sat_score),
  sat_proportion_2 = (interp lower:0 1600 sat_score),
]

SQL

SELECT
  students.*,
  (sat_score - 0) / (1600 - 0) AS sat_proportion_1,
  (sat_score - 0) / (1600 - 0) AS sat_proportion_2
FROM
  students

Piping

Consistent with the principles of PRQL, it’s possible to pipe values into functions, which makes composing many functions more readable. When piping a value into a function, the value is passed as an argument to the final positional parameter of the function. Here’s the same result as the examples above with an alternative construction:

PRQL

func interp lower:0 higher x -> (x - lower) / (higher - lower)

from students
derive [
  sat_proportion_1 = (sat_score | interp 1600),
  sat_proportion_2 = (sat_score | interp lower:0 1600),
]

SQL

SELECT
  students.*,
  (sat_score - 0) / (1600 - 0) AS sat_proportion_1,
  (sat_score - 0) / (1600 - 0) AS sat_proportion_2
FROM
  students

and

PRQL

func celsius_of_fahrenheit temp -> (temp - 32) / 1.8

from cities
derive temp_c = (temp_f | celsius_of_fahrenheit)

SQL

SELECT
  cities.*,
  (temp_f - 32) / 1.8 AS temp_c
FROM
  cities

We can combine a chain of functions, which makes logic more readable:

PRQL

func celsius_of_fahrenheit temp -> (temp - 32) / 1.8
func interp lower:0 higher x -> (x - lower) / (higher - lower)

from kettles
derive boiling_proportion = (temp_c | celsius_of_fahrenheit | interp 100)

SQL

SELECT
  kettles.*,
  ((temp_c - 32) / 1.8 - 0) / (100 - 0) AS boiling_proportion
FROM
  kettles

Roadmap

Late binding

Currently, functions require all their variables to be bound in scope; they can't late-bind to column names. So, for example:

func return price -> (price - dividend) / price_yesterday

…isn’t yet a valid function, and instead would need to be:

func return price dividend price_yesterday -> (price - dividend) / price_yesterday

(which makes the function much less useful in this case)

Tables

We can create a table — similar to a CTE in SQL — with table:

PRQL

table top_50 = (
  from employees
  sort salary
  take 50
  aggregate (sum salary)
)

from another_table      # Starts a new pipeline

SQL

WITH table_0 AS (
  SELECT
    employees.*
  FROM
    employees
  ORDER BY
    salary
  LIMIT
    50
), top_50 AS (
  SELECT
    SUM(salary)
  FROM
    table_0
)
SELECT
  another_table.*
FROM
  another_table

In PRQL, tables are far less common than CTEs are in SQL, since a linear series of CTEs can be represented with a single pipeline.

Roadmap

Currently it’s not yet possible to have an s-string as a whole table. See #376 for more details.

table a = s"""
  SELECT *
  FROM employees
"""

from a

Syntax

Summary

A summary of PRQL syntax

| Syntax   | Usage                   | Example                                             |
| -------- | ----------------------- | --------------------------------------------------- |
| `\|`     | Pipe                    | `from employees \| select first_name`               |
| `=`      | Assigns & Aliases       | `from e = employees`, `derive total = (sum salary)` |
| `:`      | Named args & Parameters | `interp lower:0 1600 sat_score`                     |
| `[]`     | List                    | `select [id, amount]`                               |
| `()`     | Precedence              | `derive celsius = (fahrenheit - 32) / 1.8`          |
| `#`      | Comment                 | `# A comment`                                       |
| `@`      | Dates & times           | `@2021-01-01`                                       |
| `==`     | Equality comparison     | `join s=salaries [s.emp_id == e.id]`                |
| `->`     | Function definitions    | `func add a b -> a + b`                             |
| `+`/`-`  | Sort order              | `sort [-amount, +date]`                             |
| `??`     | Coalesce                | `amount ?? 0`                                       |
| `<type>` | Annotations             | `@2021-01-01<datetime>`                             |

Pipes

Pipes — the connection between transforms that make up a pipeline — can be either line breaks or a pipe character (|).

In almost all situations, line-breaks pipe the result of a line’s transform into the transform on the following line. For example, the filter transform operates on the result of from employees (which is just the employees table), and the select transform operates on the result of the filter transform.

PRQL

from employees
filter department == "Product"
select [first_name, last_name]

SQL

SELECT
  first_name,
  last_name
FROM
  employees
WHERE
  department = 'Product'

In place of a line-break, it’s also possible to use the | character to pipe results, such that this is equivalent:

PRQL

from employees | filter department == "Product" | select [first_name, last_name]

SQL

SELECT
  first_name,
  last_name
FROM
  employees
WHERE
  department = 'Product'

A line-break doesn’t create a pipeline in a couple of cases:

  • within a list (e.g. the derive examples below),
  • when the following line is a new statement, which starts with a keyword of func, table or from.
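As a minimal sketch of the second case (the table name product_employees is illustrative):

```prql
table product_employees = (
  from employees
  filter department == "Product"   # line-breaks within the pipeline are pipes
)

from product_employees             # `from` begins a new statement, not a pipe
```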

Lists

Lists are represented with [], and can span multiple lines. A final trailing comma is optional.

PRQL

derive [x = 1, y = 2]
derive [
  a = x,
  b = y
]
derive [
  c = a,
  d = b,
]

SQL

SELECT
  1 AS x,
  2 AS y,
  1 AS a,
  2 AS b,
  1 AS c,
  2 AS d

Most transforms can take either a list or a single item, so these are equivalent:

PRQL

from employees
select [first_name]

SQL

SELECT
  first_name
FROM
  employees

PRQL

from employees
select first_name

SQL

SELECT
  first_name
FROM
  employees

Parentheses

Parentheses — () — are used to give precedence to inner expressions, as in almost all languages and in mathematics.

In particular, parentheses are used to nest pipelines for transforms such as group and window, which take a pipeline. Here, the aggregate pipeline is applied to each group of unique title and country values.

PRQL

from employees
group [title, country] (
  aggregate [
    average salary,
    ct = count
  ]
)

SQL

SELECT
  title,
  country,
  AVG(salary),
  COUNT(*) AS ct
FROM
  employees
GROUP BY
  title,
  country

Comments

Comments are represented by #. Currently only single-line comments exist.

PRQL

from employees  # Comment 1
# Comment 2
aggregate [average salary]

SQL

SELECT
  AVG(salary)
FROM
  employees

Backticks

To use identifiers that would otherwise be invalid, surround them with backticks. Depending on the dialect, these will remain as backticks or be converted to double-quotes.

PRQL

prql dialect:mysql
from employees
select `first name`

SQL

SELECT
  `first name`
FROM
  employees

PRQL

prql dialect:postgres
from employees
select `first name`

SQL

SELECT
  "first name"
FROM
  employees

BigQuery also uses backticks to surround project & dataset names (even if valid identifiers) in the SELECT statement:

PRQL

prql dialect:bigquery
from `project-foo.dataset.table`
join `project-bar.dataset.table` [col_bax]

SQL

SELECT
  `project-foo.dataset.table`.*,
  `project-bar.dataset.table`.*,
  col_bax
FROM
  `project-foo.dataset.table`
  JOIN `project-bar.dataset.table` USING(col_bax)

Query header: Dialect & Version

Dialect

PRQL allows specifying a dialect at the top of the query, which allows PRQL to compile to a database-specific SQL flavor.

Examples

PRQL

prql dialect:postgres

from employees
sort age
take 10

SQL

SELECT
  employees.*
FROM
  employees
ORDER BY
  age
LIMIT
  10

PRQL

prql dialect:mssql

from employees
sort age
take 10

SQL

SELECT
  TOP (10) employees.*
FROM
  employees
ORDER BY
  age

Supported dialects

Note that dialect support is very early — most differences are not implemented, and most dialects’ implementations are identical to generic’s. Contributions are very welcome.

  • ansi
  • bigquery
  • clickhouse
  • generic
  • hive
  • mssql
  • mysql
  • postgres
  • sqlite
  • snowflake

Version

PRQL allows specifying a version of the language in the PRQL header, like:

PRQL

prql version:1

from employees

SQL

SELECT
  employees.*
FROM
  employees

When the functionality is implemented, it will allow the language to evolve without breaking existing queries.

Transforms

PRQL queries are a pipeline of transformations (“transforms”), where each transform takes the previous result and adjusts it in some way, before passing it on to the next transform.

Because PRQL focuses on modularity, we have far fewer transforms than SQL, each one fulfilling a specific purpose. That’s often referred to as “orthogonality”.

These are the currently available transforms:

| Transform   | Purpose                                                             | SQL Equivalent               |
| ----------- | ------------------------------------------------------------------- | ---------------------------- |
| `from`      | Starts from a table                                                 | `FROM`                       |
| `derive`    | Computes new columns                                                | `SELECT *, ... AS ...`       |
| `select`    | Picks & computes columns                                            | `SELECT ... AS ...`          |
| `filter`    | Picks rows based on their values                                    | `WHERE`, `HAVING`, `QUALIFY` |
| `sort`      | Orders rows based on the values of columns                          | `ORDER BY`                   |
| `join`      | Adds columns from another table, matching rows based on a condition | `JOIN`                       |
| `take`      | Picks rows based on their position                                  | `TOP`, `LIMIT`, `OFFSET`     |
| `group`     | Partitions rows into groups and applies a pipeline to each of them  | `GROUP BY`, `PARTITION BY`   |
| `aggregate` | Summarizes many rows into one row                                   | `SELECT foo(...)`            |
| `window`    | Applies a pipeline to overlapping segments of rows                  | `OVER`, `ROWS`, `RANGE`      |

Aggregate

Summarizes many rows into one row.

When applied:

  • without group, it produces one row from the whole table,
  • within a group pipeline, it produces one row from each group.

aggregate [{expression or assign operations}]

Examples

PRQL

from employees
aggregate [
  average salary,
  ct = count
]

SQL

SELECT
  AVG(salary),
  COUNT(*) AS ct
FROM
  employees

PRQL

from employees
group [title, country] (
  aggregate [
    average salary,
    ct = count
  ]
)

SQL

SELECT
  title,
  country,
  AVG(salary),
  COUNT(*) AS ct
FROM
  employees
GROUP BY
  title,
  country

Derive

Computes one or more new columns.

derive [{new_name} = {expression}]

Examples

PRQL

from employees
derive gross_salary = salary + payroll_tax

SQL

SELECT
  employees.*,
  salary + payroll_tax AS gross_salary
FROM
  employees

PRQL

from employees
derive [
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost
]

SQL

SELECT
  employees.*,
  salary + payroll_tax AS gross_salary,
  salary + payroll_tax + benefits_cost AS gross_cost
FROM
  employees

derive generally computes a column from existing columns, but it can also take a literal, in which case it can begin a pipeline without a from:

PRQL

derive x = 5

SQL

SELECT
  5 AS x

Filter

Picks rows based on their values.

filter {boolean_expression}

Examples

PRQL

from employees
filter age > 25

SQL

SELECT
  employees.*
FROM
  employees
WHERE
  age > 25

PRQL

from employees
filter (age | in 25..40)

SQL

SELECT
  employees.*
FROM
  employees
WHERE
  age BETWEEN 25
  AND 40

From

Specifies a data source.

from {table_reference}

Examples

PRQL

from employees

SQL

SELECT
  employees.*
FROM
  employees

To introduce an alias, use an assign expression:

PRQL

from e = employees
select e.first_name

SQL

SELECT
  first_name
FROM
  employees AS e

Group

Partitions the rows into groups and applies a pipeline to each of the groups.

group [{key_columns}] {pipeline}

The partitioning into groups is determined by the key_columns (the first argument).

The most conventional use of group is with aggregate:

PRQL

from employees
group [title, country] (
  aggregate [
    average salary,
    ct = count
  ]
)

SQL

SELECT
  title,
  country,
  AVG(salary),
  COUNT(*) AS ct
FROM
  employees
GROUP BY
  title,
  country

Conceptually, a transform in the context of a group applies the same transformation to each group as it would to the whole table — for example, finding the employee who joined first across the whole table:

PRQL

from employees
sort join_date
take 1

SQL

SELECT
  employees.*
FROM
  employees
ORDER BY
  join_date
LIMIT
  1

To find the employee who joined first in each role, it’s exactly the same pipeline, but within a group expression:

PRQL

from employees
group role (
  sort join_date  # taken from above
  take 1
)

SQL

WITH table_0 AS (
  SELECT
    employees.*,
    ROW_NUMBER() OVER (
      PARTITION BY role
      ORDER BY
        join_date
    ) AS _rn_82
  FROM
    employees
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  _rn_82 <= 1

Join

Adds columns from another table, matching rows based on a condition.

join side:{inner|left|right|full} {table} {[conditions]}

Parameters

  • side decides which rows to include. Defaults to inner.
  • Table reference
  • List of conditions
    • If all terms are column identifiers, this will compile to USING(...). In this case, both tables must contain specified columns. The result will only contain one column for each specified column.

Examples

PRQL

from employees
join side:left positions [id==employee_id]

SQL

SELECT
  employees.*,
  positions.*
FROM
  employees
  LEFT JOIN positions ON id = employee_id

PRQL

from employees
join side:full positions [emp_no]

SQL

SELECT
  employees.*,
  positions.*,
  emp_no
FROM
  employees FULL
  JOIN positions USING(emp_no)

PRQL

from employees
join side:left p=positions [id==employee_id]

SQL

SELECT
  employees.*,
  p.*
FROM
  employees
  LEFT JOIN positions AS p ON id = employee_id

Select

Picks and computes columns.

select [{assign_expression}]

Examples

PRQL

from employees
select first_name

SQL

SELECT
  first_name
FROM
  employees

PRQL

from employees
select [first_name, last_name]

SQL

SELECT
  first_name,
  last_name
FROM
  employees

PRQL

from employees
select name = f"{first_name} {last_name}"

SQL

SELECT
  CONCAT(first_name, ' ', last_name) AS name
FROM
  employees

PRQL

from employees
select [
  name = f"{first_name} {last_name}",
  age_eoy = dob - @2022-12-31,
]

SQL

SELECT
  CONCAT(first_name, ' ', last_name) AS name,
  dob - DATE '2022-12-31' AS age_eoy
FROM
  employees

Sort

Orders rows based on the values of one or more columns.

sort [{direction}{column}]

Parameters

  • One column or a list of columns to sort by
  • Each column can be prefixed with:
    • +, for ascending order, the default
    • -, for descending order
  • When using prefixes, make sure to wrap columns in a list. Otherwise, sort -column is interpreted as subtraction between sort and column.
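To illustrate that last point, a brief sketch (the column name is illustrative):

```prql
from invoices
sort [-amount]    # descending sort on amount
```

Writing sort -amount without the brackets would be parsed as the subtraction (sort - amount) rather than a sort.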

Examples

PRQL

from employees
sort age

SQL

SELECT
  employees.*
FROM
  employees
ORDER BY
  age

PRQL

from employees
sort [-age]

SQL

SELECT
  employees.*
FROM
  employees
ORDER BY
  age DESC

PRQL

from employees
sort [age, -tenure, +salary]

SQL

SELECT
  employees.*
FROM
  employees
ORDER BY
  age,
  tenure DESC,
  salary

Roadmap

Currently sort does not accept expressions:

from employees
sort [s"substr({first_name}, 2, 5)"]  # Currently will fail

Take

Picks rows based on their position.

take {n|range}

See Ranges for more details on how ranges work.

Examples

PRQL

from employees
take 10

SQL

SELECT
  employees.*
FROM
  employees
LIMIT
  10

PRQL

from orders
sort [-value, date]
take 101..110

SQL

SELECT
  orders.*
FROM
  orders
ORDER BY
  value DESC,
  date
LIMIT
  10 OFFSET 100

Window

Applies a pipeline to segments of rows, producing one output value for every input value.

window rows:{range} range:{range} expanding:false rolling:0 {pipeline}

For each row, the segment over which the pipeline is applied is determined by one of:

  • rows, which takes a range of rows relative to the current row's position.
    • 0 references the current row.
  • range, which takes a range of values relative to the current row's value.

The bounds of the range are inclusive. If a bound is omitted, the segment will extend until the edge of the table or group.

For ease of use, there are two flags that override rows or range:

  • expanding:true is an alias for rows:..0. A sum using this window is also known as a “cumulative sum”.
  • rolling:n is an alias for rows:(-n+1)..0, where n is an integer. This will include the last n values, including the current row. An average using this window is also known as a Simple Moving Average.

Some examples:

| Expression       | Meaning                                                            |
| ---------------- | ------------------------------------------------------------------ |
| `rows:0..2`      | current row plus two following rows                                |
| `rows:-2..0`     | two preceding rows plus current row                                |
| `rolling:3`      | (same as previous)                                                 |
| `rows:-2..4`     | two preceding rows plus current row plus four following rows       |
| `rows:..0`       | all rows from the start of the table up to & including current row |
| `expanding:true` | (same as previous)                                                 |
| `rows:0..`       | current row and all following rows until the end of the table      |
| `rows:..`        | all rows, which is the same as not having a window at all          |

Example

PRQL

from employees
group employee_id (
  sort month
  window rolling:12 (
    derive [trail_12_m_comp = sum paycheck]
  )
)

SQL

SELECT
  employees.*,
  SUM(paycheck) OVER (
    PARTITION BY employee_id
    ORDER BY
      month ROWS BETWEEN 11 PRECEDING
      AND CURRENT ROW
  ) AS trail_12_m_comp
FROM
  employees

PRQL

from orders
sort day
window rows:-3..3 (
  derive [centered_weekly_average = average value]
)
group [order_month] (
  sort day
  window expanding:true (
    derive [monthly_running_total = sum value]
  )
)

SQL

SELECT
  orders.*,
  AVG(value) OVER (
    ORDER BY
      day ROWS BETWEEN 3 PRECEDING
      AND 3 FOLLOWING
  ) AS centered_weekly_average,
  SUM(value) OVER (
    PARTITION BY order_month
    ORDER BY
      day ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW
  ) AS monthly_running_total
FROM
  orders

Language Features

Coalesce

We can coalesce values with the ?? operator. Coalescing takes either the first value or, if that value is null, the second value.

PRQL

from orders
derive amount ?? 0

SQL

SELECT
  orders.*,
  COALESCE(amount, 0)
FROM
  orders

Dates & Times

PRQL uses @ followed by a string to represent dates & times. This is less verbose than SQL’s approach of TIMESTAMP '2004-10-19 10:23:54' and more explicit than SQL’s implicit option of just using a string '2004-10-19 10:23:54'.

Note

Currently PRQL passes these strings through to the database as-is, and so many compatible format strings may work; but we may refine this in the future to aid compatibility across databases. We’ll always support the canonical ISO8601 formats described below.

Dates

Dates are represented by @{yyyy-mm-dd} — a @ followed by the date format.

PRQL

from employees
derive age_at_year_end = (@2022-12-31 - dob)

SQL

SELECT
  employees.*,
  DATE '2022-12-31' - dob AS age_at_year_end
FROM
  employees

Times

Times are represented by @{HH:mm:ss.SSS±Z} with any parts not supplied being rounded to zero, including the timezone, which is represented by +HH:mm, -HH:mm or Z. This is consistent with the ISO8601 time format.

PRQL

from orders
derive should_have_shipped_today = (order_time < @08:30)

SQL

SELECT
  orders.*,
  order_time < TIME '08:30' AS should_have_shipped_today
FROM
  orders

Timestamps

Timestamps are represented by @{yyyy-mm-ddTHH:mm:ss.SSS±Z} / @{date}T{time}, with any time parts not supplied being rounded to zero, including the timezone, which is represented by +HH:mm, -HH:mm or Z. This is @ followed by the ISO8601 datetime format, which uses T to separate date & time.

PRQL

derive first_prql_commit = @2020-01-01T13:19:55-0800

SQL

SELECT
  TIMESTAMP '2020-01-01T13:19:55-0800' AS first_prql_commit

Intervals

Intervals are represented by {N}{periods}, such as 2years or 10minutes, without a space.

Note

These aren’t the same as ISO8601, because we evaluated P3Y6M4DT12H30M5S to be difficult to understand, but we could support a simplified form if there’s demand for it. We don’t currently support compound expressions, for example 2years10months, but most DBs will allow 2years + 10months. Please raise an issue if this is inconvenient.
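As a sketch of the workaround mentioned in the note — adding intervals rather than writing a compound literal (whether the database accepts the addition depends on the dialect):

```prql
from projects
derive review_date = start + 2years + 10months   # compound span via addition
```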

PRQL

from projects
derive first_check_in = start + 10days

SQL

SELECT
  projects.*,
  start + INTERVAL 10 DAY AS first_check_in
FROM
  projects

Examples

Here’s a fuller list of examples:

  • @20221231 is forbidden — it must contain full punctuation (- and :),
  • @2022-12-31 is a date
  • @2022-12 or @2022 are forbidden — SQL can’t express a month, only a date
  • @16:54:32.123456 is a time
  • @16:54:32, @16:54, @16 are all allowed, expressing @16:54:32.000000, @16:54:00.000000, @16:00:00.000000 respectively
  • @2022-12-31T16:54:32.123456 is a timestamp without timezone
  • @2022-12-31T16:54:32.123456Z is a timestamp in UTC
  • @2022-12-31T16:54+02:00 and @2022-12-31T16:54+02 are timestamps in UTC+2
  • @16:54+02 is forbidden — time is always local, so it cannot have a timezone
  • @2022-12-31+02 is forbidden — date is always local, so it cannot have a timezone

Roadmap

Datetimes

Datetimes are supported by some databases (e.g. MySQL, BigQuery) in addition to timestamps. When we have type annotations, these will be represented by a timestamp annotated as a datetime:

derive pi_day = @2017-03-14T15:09:26.535898<datetime>

These are some examples we can then add:

  • @2022-12-31T16:54<datetime> is a datetime without a timezone
  • @2022-12-31<datetime> is forbidden — a datetime must specify a time
  • @16:54<datetime> is forbidden — a datetime must specify a date

Distinct

PRQL doesn’t have a specific distinct keyword. Instead, use group and take 1:

PRQL

from employees
select department
group department (
  take 1
)

SQL

SELECT
  DISTINCT department
FROM
  employees

This also works without a linebreak:

PRQL

from employees
select department
group department (take 1)

SQL

SELECT
  DISTINCT department
FROM
  employees

Selecting from each group

We can select a single row from each group by combining group and sort:

PRQL

# youngest employee from each department
from employees
group department (
  sort age
  take 1
)

SQL

WITH table_0 AS (
  SELECT
    employees.*,
    ROW_NUMBER() OVER (
      PARTITION BY department
      ORDER BY
        age
    ) AS _rn_82
  FROM
    employees
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  _rn_82 <= 1

Roadmap

When using Postgres dialect, we are planning to compile:

# youngest employee from each department
from employees
group department (
  sort age
  take 1
)

… to …

SELECT DISTINCT ON (department) *
FROM employees
ORDER BY department, age

F-Strings

f-strings are a readable approach to building new strings from existing strings. Currently PRQL supports this for concatenating strings:

PRQL

from employees
select full_name = f"{first_name} {last_name}"

SQL

SELECT
  CONCAT(first_name, ' ', last_name) AS full_name
FROM
  employees

This can be much easier to read for longer strings, relative to the SQL approach:

PRQL

from web
select url = f"http{tls}://www.{domain}.{tld}/{page}"

SQL

SELECT
  CONCAT(
    'http',
    tls,
    '://www.',
    domain,
    '.',
    tld,
    '/',
    page
  ) AS url
FROM
  web

Roadmap

In the future, f-strings may incorporate string formatting such as datetimes, numbers, and padding. If there’s a feature that would be helpful, please post an issue.

Null handling

SQL has an unconventional way of handling NULL values, since it treats them as unknown values. As a result, in SQL:

  • NULL is not a value indicating a missing entry, but a placeholder for anything possible,
  • NULL = NULL evaluates to NULL, since one cannot know if one unknown is equal to another unknown,
  • NULL <> NULL evaluates to NULL, by the same logic,
  • to check whether a value is NULL, SQL introduces the IS NULL and IS NOT NULL operators,
  • DISTINCT column may return multiple NULL values.

For more information, check out the Postgres documentation.

PRQL, on the other hand, treats null as a value, which means that:

  • null == null evaluates to true,
  • null != null evaluates to false,
  • distinct column cannot contain multiple null values.

PRQL

from employees
filter first_name == null
filter null != last_name

SQL

SELECT
  employees.*
FROM
  employees
WHERE
  first_name IS NULL
  AND last_name IS NOT NULL

Note that PRQL doesn’t change how NULL is compared between columns, for example in joins. (PRQL compiles to SQL and so can’t change the behavior of the database).

For more context or to provide feedback check out the discussion on issue #99.

Ranges

PRQL has a concise range syntax start..end.

This can be used in filters with the in function, with any type of literal, including dates:

PRQL

from events
filter (date | in @1776-07-04..@1787-09-17)
filter (magnitude | in 50..100)

SQL

SELECT
  events.*
FROM
  events
WHERE
  date BETWEEN DATE '1776-07-04'
  AND DATE '1787-09-17'
  AND magnitude BETWEEN 50
  AND 100

Like in SQL, ranges are inclusive.

As discussed in the take docs, ranges can also be used in take:

PRQL

from orders
sort [-value, date]
take 101..110

SQL

SELECT
  orders.*
FROM
  orders
ORDER BY
  value DESC,
  date
LIMIT
  10 OFFSET 100

Roadmap

We’d like to extend ranges to other uses, such as testing whether an object is in an array or list literal.

S-Strings

An s-string inserts SQL directly, as an escape hatch when there’s something that PRQL doesn’t yet implement. For example, there’s no version() function in PRQL that returns the Postgres version, so if we want to use that, we use an s-string:

PRQL

derive db_version = s"version()"

SQL

SELECT
  version() AS db_version

We can embed columns in an s-string using braces. For example, PRQL’s standard library defines the average function as:

func average column -> s"AVG({column})"

So this compiles using the function:

PRQL

from employees
aggregate [average salary]

SQL

SELECT
  AVG(salary)
FROM
  employees

Here’s an example of a more involved use of an s-string:

PRQL

from de=dept_emp
join s=salaries side:left [
  (s.emp_no == de.emp_no),
  s"""({s.from_date}, {s.to_date})
  OVERLAPS
  ({de.from_date}, {de.to_date})"""
]

SQL

SELECT
  de.*,
  s.*
FROM
  dept_emp AS de
  LEFT JOIN salaries AS s ON s.emp_no = de.emp_no
  AND (s.from_date, s.to_date) OVERLAPS (de.from_date, de.to_date)

To use a literal brace in an s-string, use double braces:

PRQL

from employees
derive [
  has_valid_title = s"regexp_contains(title, '([a-z0-9]*-){{2,}}')"
]

SQL

SELECT
  employees.*,
  regexp_contains(title, '([a-z0-9]*-){2,}') AS has_valid_title
FROM
  employees

For those who have used Python, s-strings are similar to Python’s f-strings, but the result is SQL rather than a string literal — a Python f-string of f"average({col})" where col="salary" would produce "average(salary)", with the quotes; s"average({col})" produces average(salary), without quotes.

s-strings in user code are intended as an escape hatch for unimplemented features. If we often need s-strings to express something, that’s a sign we should implement it in PRQL or PRQL’s stdlib.

Strings

Strings in PRQL can use either single or double quotes:

PRQL

derive x = "hello world"

SQL

SELECT
  'hello world' AS x

PRQL

derive x = 'hello world'

SQL

SELECT
  'hello world' AS x

To quote a string containing quotes, either use the “other” type of quote, or use three or more quotes, closing with the same number.

PRQL

derive x = '"hello world"'

SQL

SELECT
  '"hello world"' AS x

PRQL

derive x = """I said "hello world"!"""

SQL

SELECT
  'I said "hello world"!' AS x

PRQL

derive x = """""I said """hello world"""!"""""

SQL

SELECT
  'I said """hello world"""!' AS x

Note

Currently PRQL does not adjust escape characters.

Warning

Currently PRQL allows multiline strings with either single-character or multi-character quotes. This may change for single-character-quoted strings in future versions.

Stdlib

The standard library is currently fairly limited, and we’re very open to expanding it. If we find ourselves using s-strings for something frequently, raise an issue and we’ll add it to the stdlib.

Note

Currently the stdlib implementation doesn’t support different DB implementations itself; those need to be built deeper into the compiler. We’ll resolve this at some point. Until then, we’ll only add functions here that are broadly supported by most DBs.

Here’s the source of the current PRQL stdlib:

# Aggregate Functions

func min <scalar|column> column ->  s"MIN({column})"
func max <scalar|column> column ->  s"MAX({column})"
func sum <scalar|column> column ->  s"SUM({column})"
func avg <scalar|column> column ->  s"AVG({column})"
func stddev <scalar|column> column ->  s"STDDEV({column})"
func average <scalar|column> column ->  s"AVG({column})"
func count <scalar|column> non_null:s"*" ->  s"COUNT({non_null})"
# TODO: Possibly make this into `count distinct:true` (or like `distinct:` as an
# abbreviation of that?)
func count_distinct <scalar|column> column ->  s"COUNT(DISTINCT `{column}`)"

# Window functions
func lag<column> offset column ->  s"LAG({column}, {offset})"
func lead<column> offset column ->  s"LEAD({column}, {offset})"
func first<column> offset column ->  s"FIRST_VALUE({column}, {offset})"
func last<column> offset column ->  s"LAST_VALUE({column}, {offset})"
func rank<column> ->  s"RANK()"
func rank_dense<column> ->  s"DENSE_RANK()"
func row_number<column> ->  s"ROW_NUMBER()"

# Other functions

func round<scalar> n_digits column ->  s"ROUND({column}, {n_digits})"
func as<scalar> type column ->  s"CAST({column} AS {type})"
# TODO: Introduce a notation for getting start and end out of a ranges
# could be range.0? or range.start? But to make this happen, we need to make
# changes to how variables are resolved.
func in<bool> range value ->  s"{value} BETWEEN {range}"

# Logical functions
# TODO: should we remove in favor of `??` to reduce ambiguity?
func coalesce value default -> s"COALESCE({value}, {default})"

# Transform type definitions

# (make sure to update cast_transfrom if you change any of the declarations
# here)

func from<table> source -> 0
func select<table> assigns tbl -> 0
func filter<table> condition tbl -> 0
func derive<table> assigns tbl -> 0
func aggregate<table> assigns tbl -> 0
func sort<table> by tbl -> 0
func take<table> expr tbl -> 0
func join<table> with filter side:inner tbl -> 0
func group<table> by pipeline tbl -> 0
func window<table> rows:0..0 range:0..0 expanding:false rolling:0 pipeline tbl -> 0
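
Conceptually, each function body above is an s-string template: at compile time the braces are filled with the SQL compiled from each argument. A rough illustration of that substitution in Python (`expand_s_string` is a hypothetical name for this sketch, not part of any PRQL API):

```python
def expand_s_string(template: str, **args: str) -> str:
    """Fill an s-string template's {placeholders} with compiled SQL fragments."""
    return template.format(**args)

# `average salary` uses the stdlib body s"AVG({column})":
print(expand_s_string("AVG({column})", column="salary"))  # AVG(salary)

# `round 0 gross_salary` uses s"ROUND({column}, {n_digits})":
print(expand_s_string("ROUND({column}, {n_digits})",
                      column="gross_salary", n_digits="0"))  # ROUND(gross_salary, 0)
```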

And a couple of examples:

PRQL

from employees
derive [
  gross_salary = (salary + payroll_tax | as int),
  gross_salary_rounded = (gross_salary | round 0),
]

SQL

SELECT
  employees.*,
  CAST(salary + payroll_tax AS int) AS gross_salary,
  ROUND(CAST(salary + payroll_tax AS int), 0) AS gross_salary_rounded
FROM
  employees

Bindings

Java (prql-java)

prql-java offers Java bindings to the prql-compiler Rust library. It exposes a Java native method: public static native String toSql(String query).

Installation

<dependency>
    <groupId>org.prqllang</groupId>
    <artifactId>prql-java</artifactId>
    <version>${PRQL_VERSION}</version>
</dependency>

Usage

import org.prqllang.prql4j.PrqlCompiler;

class Main {
    public static void main(String[] args) {
        String sql = PrqlCompiler.toSql("from table");
        System.out.println(sql);
    }
}

Javascript (prql-js)

JavaScript bindings for prql-compiler. Check out https://prql-lang.org for more context.

Installation

npm install prql-js

Usage

Currently these functions are exposed:

function compile(prql_string) # returns CompileResult
function to_sql(prql_string) # returns SQL string
function to_json(prql_string) # returns JSON string (use JSON.parse() to get the JSON)

From NodeJS

const prql = require("prql-js");

const { sql, error } = compile(`from employees | select first_name`);
console.log(sql);
// handle error as well...

From a Browser

<html>
  <head>
    <script src="./node_modules/prql-js/dist/web/prql_js.js"></script>
    <script>
      const { compile } = wasm_bindgen;

      async function run() {
        await wasm_bindgen("./node_modules/prql-js/dist/web/prql_js_bg.wasm");
        const { sql, error } = compile("from employees | select first_name");

        console.log(sql);
        // handle error as well...
      }

      run();
    </script>
  </head>

  <body></body>
</html>

From a Framework or a Bundler

import compile from "prql-js/dist/bundler";

const { sql, error } = compile(`from employees | select first_name`);
console.log(sql);
// handle error as well...

Notes

This uses wasm-pack to generate bindings¹.

¹ Though we would be very open to other approaches; we used trunk successfully in a rust-driven approach to this (RIP prql-web).

Development

Build:

npm run build-all

This builds Node, bundler and web packages in the dist path.

Test:

wasm-pack test --firefox

Python (prql-python)

Installation

pip install prql-python

Usage

import prql_python as prql

prql_query = """
    from employees
    join salaries [emp_id]
    group [dept_id, gender] (
      aggregate [
        avg_salary = average salary
      ]
    )
"""

sql = prql.to_sql(prql_query)

Rust (prql-compiler)

Installation

cargo new myproject
cd myproject
cargo add prql-compiler

Usage

cargo run

src/main.rs:

use prql_compiler::compile;

fn main() {
    let prql = "from employees | select [name,age]  ";
    let sql = compile(prql).unwrap();
    println!("{:?}", sql.replace("\n", " "));
}

Cargo.toml:

[package]
name = "myproject"
version = "0.1.0"
edition = "2021"

[dependencies]
prql-compiler = "0.2.2"

Integrations

dbt-prql

Original docs at https://github.com/prql/dbt-prql

dbt-prql allows writing PRQL in dbt models. This combines the benefits of PRQL’s power & simplicity within queries, with dbt’s version control, lineage & testing across queries.

Once dbt-prql is installed, dbt commands compile PRQL between {% prql %} & {% endprql %} jinja tags to SQL as part of dbt’s compilation. No additional config is required.

Examples

Simple example

{% prql %}
from employees
filter (age | in 20..30)
{% endprql %}

…would appear to dbt as:

SELECT
  employees.*
FROM
  employees
WHERE
  age BETWEEN 20
  AND 30

Less simple example

{% prql %}
from {{ source('salesforce', 'in_process') }}
derive expected_sales = probability * value
join {{ ref('team', 'team_sales') }} [name]
group name (
  aggregate (expected_sales)
)
{% endprql %}

…would appear to dbt as:

SELECT
  name,
  {{ source('salesforce', 'in_process') }}.probability * {{ source('salesforce', 'in_process') }}.value AS expected_sales
FROM
  {{ source('salesforce', 'in_process') }}
  JOIN {{ ref('team', 'team_sales') }} USING(name)
GROUP BY
  name

…and then dbt will compile the source and refs to a full SQL query.

Replacing macros

dbt’s use of macros has saved many of us many lines of code, and even saved some people some time. But imperatively programming text generation with code like if not loop.last is not our highest calling. It’s the “necessary” part rather than the beautiful part of dbt.

Here’s the canonical example of macros in the dbt documentation:

{%- set payment_methods = ["bank_transfer", "credit_card", "gift_card"] -%}

select
order_id,
{%- for payment_method in payment_methods %}
sum(case when payment_method = '{{payment_method}}' then amount end) as {{payment_method}}_amount
{%- if not loop.last %},{% endif -%}
{% endfor %}
from {{ ref('raw_payments') }}
group by 1
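
That macro is ordinary string generation: loop over the payment methods, emit one aggregate per method, and join them with commas. The same logic in plain Python (with the dbt ref replaced by a literal table name, purely for illustration) makes the imperative nature explicit:

```python
payment_methods = ["bank_transfer", "credit_card", "gift_card"]

# Build one aggregate expression per payment method, joined with commas.
columns = ",\n".join(
    f"sum(case when payment_method = '{m}' then amount end) as {m}_amount"
    for m in payment_methods
)
query = f"select\norder_id,\n{columns}\nfrom raw_payments\ngroup by 1"
print(query)
```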

Here’s that model using PRQL, including the prql jinja tags.

{% prql %}
func filter_amount method -> s"sum(case when payment_method = '{method}' then amount end) as {method}_amount"

from {{ ref('raw_payments') }}
group order_id (
  aggregate [
    filter_amount bank_transfer,
    filter_amount credit_card,
    filter_amount gift_card,
  ]
)
{% endprql %}

As well as the query being simpler in its final form, writing in PRQL also gives us live feedback around any errors, on every keystroke. Though there’s much more to come, check out the current version on the PRQL Playground.

What it does

When dbt compiles models to SQL queries:

  • Any text in a dbt model between {% prql %} and {% endprql %} tags is compiled from PRQL to SQL before being passed to dbt.
  • The PRQL compiler passes any text containing {{ & }} through to dbt without modification, which allows us to embed jinja expressions in PRQL. (This was added to PRQL specifically for this use-case.)
  • dbt will then compile the resulting model into its final form of raw SQL, and dispatch it to the database, as per usual.

There’s no config needed in the dbt project; this works automatically on any dbt command (e.g. dbt run) assuming dbt-prql is installed.
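
The first two steps can be sketched in a few lines of Python: find each {% prql %}…{% endprql %} block, compile it, and leave jinja {{ … }} expressions untouched. This is only an illustration of the tag handling, not dbt-prql’s actual implementation; `compile_prql_blocks` is a hypothetical name, and the real compilation is done by prql-compiler:

```python
import re

# Match everything between the {% prql %} and {% endprql %} tags.
PRQL_BLOCK = re.compile(r"{%\s*prql\s*%}(.*?){%\s*endprql\s*%}", re.DOTALL)

def compile_prql_blocks(model_text: str, compile_prql) -> str:
    """Replace each PRQL block with its compiled SQL; {{ ... }} passes through."""
    return PRQL_BLOCK.sub(lambda m: compile_prql(m.group(1)), model_text)

# Stub compiler for illustration; the real one is prql-compiler.
model = "{% prql %}from employees{% endprql %}"
print(compile_prql_blocks(model, lambda prql: f"-- compiled: {prql.strip()}"))
# -- compiled: from employees
```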

Installation

pip install dbt-prql

Current state

Currently this is new, but fairly feature-complete. It’s enthusiastically supported — if there are any problems, please open an issue.

Jupyter

Original docs at https://pyprql.readthedocs.io/en/latest/magic_readme.html

Work with pandas and PRQL in an IPython terminal or Jupyter notebook.

Implementation

This is a thin wrapper around the fantastic IPython-sql magic. Roughly speaking, all we do is parse PRQL to SQL and pass that through to ipython-sql. Full documentation of the supported features is available at their repository. Here, we document the places where we differ from them, plus the features we think you are most likely to find useful.

Usage

Installation

If you have already installed PyPRQL into your environment, then you should be good to go! We bundle in IPython and pandas, though you’ll need to install Jupyter separately. If you haven’t installed PyPRQL, that’s as simple as:

pip install pyprql

Set Up

Open up either an IPython terminal or Jupyter notebook. First, we need to load the extension and connect to a database.

In [1]: %load_ext pyprql.magic

Connecting a database

We have two options for connecting to a database:

  1. Create an in-memory DB. This is the easiest way to get started.

    In [2]: %prql duckdb:///:memory:
    

    However, in-memory databases start off empty! So, we need to add some data. We have two options:

    • We can easily add a pandas dataframe to the DuckDB database like so:

      In [3]: %prql --persist df
      

      where df is a pandas dataframe. This adds a table named df to the in-memory DuckDB instance.

    • Or download a CSV and query it directly, with DuckDB:

      !wget https://github.com/graphql-compose/graphql-compose-examples/blob/master/examples/northwind/data/csv/products.csv
      

      …and then from products.csv will work.

  2. Connect to an existing database

    When connecting to a database, pass the connection string as an argument to the line magic %prql. The connection string needs to be in SQLAlchemy format, so any connection supported by SQLAlchemy is supported by the magic. Additional connection parameters can be passed as a dictionary using the --connection_arguments flag to the %prql line magic. We ship with the necessary extensions to use DuckDB as the backend, and here connect to an in-memory database.
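
    For example, connecting to a hypothetical local PostgreSQL database (all connection details here are placeholders):

    ```
    %prql postgresql://user:password@localhost:5432/mydb
    ```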

Querying

Now, let’s do a query! By default, PRQLMagic always returns the results as a dataframe, and always prints the results. The results of the previous query are accessible in the _ variable.

These examples are based on the products.csv example above.



In [4]: %%prql
   ...: from p = products.csv
   ...: filter supplierID == 1

Done.
Returning data to local variable _
   productID    productName  supplierID  categoryID      quantityPerUnit  unitPrice  unitsInStock  unitsOnOrder  reorderLevel  discontinued
0          1           Chai           1           1   10 boxes x 20 bags       18.0            39             0            10             0
1          2          Chang           1           1   24 - 12 oz bottles       19.0            17            40            25             0
2          3  Aniseed Syrup           1           2  12 - 550 ml bottles       10.0            13            70            25             0
In [5]: %%prql
   ...: from p = products.csv
   ...: group categoryID (
   ...:   aggregate [average unitPrice]
   ...: )

Done.
Returning data to local variable _
   categoryID  avg("unitPrice")
0           1         37.979167
1           2         23.062500
2           7         32.370000
3           6         54.006667
4           8         20.682500
5           4         28.730000
6           3         25.160000
7           5         20.250000

We can capture the results into a different variable like so:

In [6]: %%prql results <<
   ...: from p = products.csv
   ...: aggregate [min unitsInStock, max unitsInStock]

Done.
Returning data to local variable results
   min("unitsInStock")  max("unitsInStock")
0                    0                  125

Now, the output of the query is saved to results.

Prefect

Because Prefect is native Python, it’s extremely easy to integrate with PRQL.

With a Postgres Task, replace:

PostgresExecute.run(..., query=sql)

…with…

PostgresExecute.run(..., query=pyprql.to_sql(prql))

We’re big fans of Prefect, and if there is anything that would make the integration easier, please open an issue.

Examples

These examples are rewritten from other languages such as SQL. They try to express real-world problems in PRQL, covering most of the language features. We are looking for different use-cases of data transformation, be it database queries, semantic business modeling or data cleaning.

If you want to help, translate some of your queries to PRQL and open a PR to add them here!

PRQL

from employees
filter country == "USA"                       # Each line transforms the previous result.
derive [                                     # This adds columns / variables.
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost  # Variables can use other variables.
]
filter gross_cost > 0
group [title, country] (                     # For each group use a nested pipeline
  aggregate [                                # Aggregate each group to a single row
    average salary,
    average gross_salary,
    sum salary,
    sum gross_salary,
    average gross_cost,
    sum_gross_cost = sum gross_cost,
    ct = count,
  ]
)
sort sum_gross_cost
filter ct > 200
take 20

SQL

SELECT
  title,
  country,
  AVG(salary),
  AVG(salary + payroll_tax),
  SUM(salary),
  SUM(salary + payroll_tax),
  AVG(salary + payroll_tax + benefits_cost),
  SUM(salary + payroll_tax + benefits_cost) AS sum_gross_cost,
  COUNT(*) AS ct
FROM
  employees
WHERE
  country = 'USA'
  AND salary + payroll_tax + benefits_cost > 0
GROUP BY
  title,
  country
HAVING
  COUNT(*) > 200
ORDER BY
  sum_gross_cost
LIMIT
  20

PRQL

from employees
group [emp_no] (
  aggregate [
    emp_salary = average salary     # average salary resolves to "AVG(salary)" (from stdlib)
  ]
)
join titles [emp_no]
group [title] (
  aggregate [
    avg_salary = average emp_salary
  ]
)
select salary_k = avg_salary / 1000 # avg_salary should resolve to "AVG(emp_salary)"
take 10                             # induces new SELECT
derive salary = salary_k * 1000     # salary_k should not resolve to "avg_salary / 1000"

SQL

WITH table_0 AS (
  SELECT
    emp_no,
    AVG(salary) AS emp_salary
  FROM
    employees
  GROUP BY
    emp_no
)
SELECT
  AVG(table_0.emp_salary) / 1000 AS salary_k,
  AVG(table_0.emp_salary) / 1000 * 1000 AS salary
FROM
  table_0
  JOIN titles USING(emp_no)
GROUP BY
  titles.title
LIMIT
  10

Single item is coerced into a list

PRQL

from employees
select salary

SQL

SELECT
  salary
FROM
  employees

Same as above but with salary in a list:

PRQL

from employees
select [salary]

SQL

SELECT
  salary
FROM
  employees

Multiple items

PRQL

from employees
derive [
  gross_salary = salary + payroll_tax,
  gross_cost = gross_salary + benefits_cost
]

SQL

SELECT
  employees.*,
  salary + payroll_tax AS gross_salary,
  salary + payroll_tax + benefits_cost AS gross_cost
FROM
  employees

Same as above but split into two lines:

PRQL

from employees
derive gross_salary = salary + payroll_tax
derive gross_cost = gross_salary + benefits_cost

SQL

SELECT
  employees.*,
  salary + payroll_tax AS gross_salary,
  salary + payroll_tax + benefits_cost AS gross_cost
FROM
  employees

PRQL

table newest_employees = (
  from employees
  sort tenure
  take 50
)

table average_salaries = (
  from salaries
  group country (
    aggregate average_country_salary = (average salary)
  )
)

from newest_employees
join average_salaries [country]
select [name, salary, average_country_salary]

SQL

WITH newest_employees AS (
  SELECT
    employees.*
  FROM
    employees
  ORDER BY
    tenure
  LIMIT
    50
), average_salaries AS (
  SELECT
    country,
    AVG(salary) AS average_country_salary
  FROM
    salaries
  GROUP BY
    country
)
SELECT
  name,
  average_salaries.salary,
  average_salaries.average_country_salary
FROM
  newest_employees
  JOIN average_salaries USING(country)

PRQL

from mytable
filter id == $1

SQL

SELECT
  mytable.*
FROM
  mytable
WHERE
  id = $1

Employees

These are homework tasks on the employees database.

Clone and init the database (requires a local PostgreSQL instance):

$ psql -U postgres -c 'CREATE DATABASE employees;'
$ git clone https://github.com/vrajmohan/pgsql-sample-data.git
$ psql -U postgres -d employees -f pgsql-sample-data/employee/employees.dump

Execute a PRQL query:

$ cd prql-compiler
$ cargo run compile examples/employees/average-title-salary.prql | psql -U postgres -d employees

Task 1

Rank the employee titles according to the average salary for each department.

My solution:

  • For each employee, find their average salary.
  • Join employees with their departments and titles (duplicating employees for each of their titles and departments).
  • Group by department and title, aggregating the average salary.
  • Join with departments to get the department name.

PRQL

from salaries
group [emp_no] (
  aggregate [emp_salary = average salary]
)
join t=titles [emp_no]
join dept_emp side:left [emp_no]
group [dept_emp.dept_no, t.title] (
  aggregate [avg_salary = average emp_salary]
)
join departments [dept_no]
select [dept_name, title, avg_salary]

SQL

WITH table_0 AS (
  SELECT
    emp_no,
    AVG(salary) AS emp_salary
  FROM
    salaries
  GROUP BY
    emp_no
),
table_1 AS (
  SELECT
    dept_emp.dept_no,
    t.title,
    AVG(table_0.emp_salary) AS avg_salary
  FROM
    table_0
    JOIN titles AS t USING(emp_no)
    LEFT JOIN dept_emp USING(emp_no)
  GROUP BY
    dept_emp.dept_no,
    t.title
)
SELECT
  departments.dept_name,
  table_1.title,
  table_1.avg_salary
FROM
  table_1
  JOIN departments USING(dept_no)

Task 2

Estimate the distribution of salaries and gender for each department.

PRQL

from employees
join salaries [emp_no]
group [emp_no, gender] (
  aggregate [
    emp_salary = average salary
  ]
)
join de=dept_emp [emp_no] side:left
group [de.dept_no, gender] (
  aggregate [
    salary_avg = average emp_salary,
    salary_sd = stddev emp_salary,
  ]
)
join departments [dept_no]
select [dept_name, gender, salary_avg, salary_sd]

SQL

WITH table_0 AS (
  SELECT
    emp_no,
    gender,
    AVG(salary) AS emp_salary
  FROM
    employees
    JOIN salaries USING(emp_no)
  GROUP BY
    emp_no,
    gender
),
table_1 AS (
  SELECT
    de.dept_no,
    de.gender,
    AVG(table_0.emp_salary) AS salary_avg,
    STDDEV(table_0.emp_salary) AS salary_sd
  FROM
    table_0
    LEFT JOIN dept_emp AS de USING(emp_no)
  GROUP BY
    de.dept_no,
    de.gender
)
SELECT
  departments.dept_name,
  table_1.gender,
  table_1.salary_avg,
  table_1.salary_sd
FROM
  table_1
  JOIN departments USING(dept_no)

Task 3

Estimate the distribution of salaries and gender for each manager.

PRQL

from employees
join salaries [emp_no]
group [emp_no, gender] (
  aggregate [
    emp_salary = average salary
  ]
)
join de=dept_emp [emp_no]
join dm=dept_manager [
  (dm.dept_no == de.dept_no) and s"(de.from_date, de.to_date) OVERLAPS (dm.from_date, dm.to_date)"
]
group [dm.emp_no, gender] (
  aggregate [
    salary_avg = average emp_salary,
    salary_sd = stddev emp_salary
  ]
)
derive mng_no = dm.emp_no
join managers=employees [emp_no]
derive mng_name = s"managers.first_name || ' ' || managers.last_name"
select [mng_name, managers.gender, salary_avg, salary_sd]

SQL

WITH table_0 AS (
  SELECT
    emp_no,
    gender,
    AVG(salary) AS emp_salary
  FROM
    employees
    JOIN salaries USING(emp_no)
  GROUP BY
    emp_no,
    gender
),
table_1 AS (
  SELECT
    dm.emp_no,
    gender,
    AVG(table_0.emp_salary) AS salary_avg,
    STDDEV(table_0.emp_salary) AS salary_sd,
    dm.emp_no AS mng_no
  FROM
    table_0
    JOIN dept_emp AS de USING(emp_no)
    JOIN dept_manager AS dm ON dm.dept_no = de.dept_no
    AND (de.from_date, de.to_date) OVERLAPS (dm.from_date, dm.to_date)
  GROUP BY
    dm.emp_no,
    gender
)
SELECT
  managers.first_name || ' ' || managers.last_name AS mng_name,
  managers.gender,
  table_1.salary_avg,
  table_1.salary_sd
FROM
  table_1
  JOIN employees AS managers USING(emp_no)

Task 4

Find distributions of titles, salaries and genders for each department.

PRQL

from de=dept_emp
join s=salaries side:left [
  (s.emp_no == de.emp_no),
  s"({s.from_date}, {s.to_date}) OVERLAPS ({de.from_date}, {de.to_date})"
]
group [de.emp_no, de.dept_no] (
  aggregate salary = (average s.salary)
)
join employees [emp_no]
join titles [emp_no]
select [dept_no, salary, employees.gender, titles.title]

SQL

WITH table_0 AS (
  SELECT
    de.emp_no,
    de.dept_no,
    AVG(s.salary) AS salary
  FROM
    dept_emp AS de
    LEFT JOIN salaries AS s ON s.emp_no = de.emp_no
    AND (s.from_date, s.to_date) OVERLAPS (de.from_date, de.to_date)
  GROUP BY
    de.emp_no,
    de.dept_no
)
SELECT
  table_0.dept_no,
  table_0.salary,
  employees.gender,
  titles.title
FROM
  table_0
  JOIN employees USING(emp_no)
  JOIN titles USING(emp_no)