Calculate the digits of Pi with DuckDB and PRQL
TL;DR: PRQL recently added a loop
construct which makes it Turing Complete
and allows doing cool things like calculating Pi right in your database.
Background
Last week saw the 0.6 release of PRQL which brought with it the capability to express Recursive CTEs in PRQL. “Recursive” CTEs aren’t actually truly recursive in the sense that that term is usually used, rather they use a “recursive” (i.e. self-referential) syntax to provide a looping construct in SQL.
PRQL is a modern, functional query language for transforming data. One of its goals is to simplify working with data wherever you can currently use SQL. As such it compiles to SQL while making available modern ergonomics such as f-strings, as well as not so modern features such as functions.
Given that the underlying semantics of Recursive CTEs are really about iteration
or “looping”, we have called this feature loop
in PRQL.
Introducing loop
Recursive CTEs in SQL consist of two parts, an initial_query
and an
update_query
. First the initial_query
is executed and then the rows produced
are fed to update_query
which is applied to the result set. The update_query
is then iteratively applied to the rows produced in the last iteration until no
more rows are produced, at which point iteration stops. (For a great review of
this as well as some interesting proposals to extend the semantics of Recursive
CTEs in SQL, see the paper
“A Fix for the Fixation on Fixpoints” by Denis Hirn and Torsten Grust.)
This behavior can be expressed with following pseudo-code:
def loop(step, initial_query):
result = []
current = initial_query()
while current is not empty:
result = append(result, current)
current = update_query(current)
return result
The minimal loop
example from the documentation in the
PRQL book
is:
from_text format:json '[{"n": 1 }]'
loop (
select n = n+1
filter n<=3
)
Here we use a PRQL utility function from_text
to conveniently turn a JSON
representation of some example data into a SQL table (from_text
currently also
accepts CSV input).
Initially a row with {"n":1}
is fed in. Then the update query is applied to
this which in this case just increments n
by one and filters the result set to
only the rows where n
is less than or equal to three. This produces {"n":2}
in the next step and {"n":3}
after that. On the next iteration {"n":4}
is
produced but that is eliminated by the filter condition. Since that leaves no
new rows, iteration stops.
If you try the query above in the PRQL Playground, the result set you get (in the “output.arrow” tab) is:
n |
---|
1 |
2 |
3 |
Fibonacci Numbers
Inspired by this, let’s try to calculate Fibonacci numbers which are often one of the first examples when recursion is introduced:
from_text format:json '[{"a":1, "b":1}]'
loop (
derive b_new = a + b
select [a=b, b=b_new]
)
take 7
which produces the first 7 Fibonacci numbers.
a |
---|
1 |
1 |
2 |
3 |
5 |
8 |
13 |
You might have noticed that we didn’t actually include a filter
in our loop
this time. Instead we relied on the fact that DuckDB produces results lazily and
since we only took 7 numbers it only produced what we needed.
Now, let’s set ourselves a bigger challenge.
Calculating the digits of pi
March 14th is written on American calendars as 3/14 which reminds us of the
first three digits of Pi=3.1415926535… . Therefore this day is commonly known
as Pi day. In order to celebrate Pi Day 2023 and the recent release of loop
in
PRQL 0.6, why don’t we try to calculate the digits of Pi in PRQL!
For our query engine we will use DuckDB because it is really fast and has many modern features that make it ideally suited for the kinds of analytical queries that PRQL is targeting. There is even a duckdb-prql extension which allows you to write PRQL queries right inside DuckDB!
We follow the algorithms in the paper “Unbounded Spigot Algorithms for the Digits of Pi” by Jeremy Gibbons, in particular Rabinowitz and Wagon’s Spigot Algorithm. While it is not the most efficient algorithm presented in the paper, it has the advantage that it uses only standard data types while the more efficient algorithms rely on unbounded integer types which are not available in many database systems.
For my implementation, I adapted the Python implementation by John Burkardt
found
here. A
bit of care had to be taken in porting the implementation to PRQL, as PRQL like
SQL, is stateless so I introduced another state variable k
to track the
position inside the body of the loop. In order to ensure that all values of k
are handled in the query we define a PRQL function called loop_steps
. The
ability to define functions in PRQL is one of its key features and I feel the
query below really shows the power that the composability of functions brings to
PRQL. PRQL is a functional language with features such as currying. For more
details see our previous post
A functional approach to relational queries.
The resulting PRQL query is the following:
prql target:sql.duckdb
let config = (
from_text format:json '[{"num_digits":50}]'
derive {
array_len = (10*num_digits)/3,
calc_len = 1+4,
loop_len = array_len + calc_len,
}
)
let loop_steps = step_0 step_i step_1 step_2 step_3 other -> case [
k==0 => step_0,
1 <= k and k <= array_len => step_i,
k==array_len+1 => step_1,
k==array_len+2 => step_2,
k==array_len+3 => step_3,
true => other,
]
let q_steps = step_q9 step_q10 step_j2 step_jg2 other -> case [
q==9 => step_q9,
q==10 => step_q10,
j==2 => step_j2,
j>2 => step_jg2,
true => other,
]
from config
select [
num_digits,
array_len,
loop_len,
j = 0,
k = 0,
q = 0,
a = s"[2 for i in range({array_len})]",
nines = 0,
predigit = 0,
output = '',
]
loop (
filter j < num_digits + 1
derive [
j_new = case [k==0 => j+1, true => j],
k_new = (k+1) % loop_len,
q_step_i = (10*s"{a}[{k}]"+q*(array_len-k+1))/(2*(array_len-k)+1),
q_new = loop_steps 0 q_step_i (q/10) q q q,
a_step_i = s"[CASE WHEN i=={k} THEN (10*{a}[i]+{q}*({array_len}-i+1))%(2*({array_len}-i)+1) ELSE {a}[i] END for i in generate_series(1,{array_len})]",
a_step_1 = s"[CASE WHEN i=={array_len} THEN {q}%10 ELSE {a}[i] END for i in generate_series(1,{array_len})]",
a_new = loop_steps a a_step_i a_step_1 a a a,
nines_new = loop_steps nines nines nines (q_steps (nines+1) 0 nines nines nines) (case [q!=9 and q!=10 and nines!=0 => 0, true => nines]) nines,
predigit_new = loop_steps predigit predigit predigit (q_steps predigit 0 q q q) predigit predigit,
output_step_2 = (q_steps '' s"({predigit}+1)::string || repeat('0', {nines})" s"{predigit}::string || '.'" s"{predigit}::string" ''),
output_step_3 = (case [q!=9 and q!=10 and nines!=0 => s"repeat('9', {nines})", true => '']),
output_new = loop_steps '' '' '' output_step_2 output_step_3 '',
]
select {
num_digits,
array_len,
loop_len,
j = j_new,
k = k_new,
q = q_new,
a = a_new,
nines = nines_new,
predigit = predigit_new,
output = output_new,
}
)
aggregate [pi=s"string_agg({output}, '')"]
Go ahead and run this query right now in your browser with our
Online PRQL Playground! There you can
also see the SQL that is produced in the output.sql
tab (for you convenience
reproduced in the Appendix below) as well as the result of the calculation in
the output.arrow
tab.
The output you see will be something like the following:
┌────────────────────────────────────────────────────┐
│ pi │
│ varchar │
├────────────────────────────────────────────────────┤
│ 3.141592653589793238462643383279502884197169399375 │
└────────────────────────────────────────────────────┘
Why don’t you play around with the num_digits
parameter to try and see how
many digits you can get on your laptop? (Unfortunately this algorithm is quite a
bit more inefficient than the equivalent Python implementation as Recursive CTEs
currently don’t allow for state to be kept outside of the result set.)
Something you might have noticed is that there are some expressions surrounded
by s""
. This is called an
s-string (s for
SQL) and allows us to include raw SQL inside our PRQL queries. We use this to
include features from DuckDB that haven’t made it into PRQL yet, such as the
Pythonesque
list comprehensions.
The loop
functionality in PRQL is brand new and is marked as experimental
and we will be working to stabilise the feature and iterate on the design to
make it as easy and useful as possible. We felt that it was exciting enough to
share it at this stage (and in time for Pi Day) and make it available for you to
try out and play around with. PRQL is developed completely in the open so let us
know your use cases so that we can make PRQL the best tool for the data
challenges that you face.
Conclusion
Recursive CTEs in SQL make SQL Turing Complete and the same should hold for PRQL
now that loop
is included. As we saw above, not all algorithms are easily
expressed in this paradigm, so while theoretically they could be, whether they
should be is another question. However I hope this example demonstrates that
loop
brings great power to PRQL and in follow up posts I will demonstrate how
we can use this to do tree and graph traversals which do come up in practice in
the kind of analytical data work that PRQL is made for. After that I will also
look at online algorithms such as moving averages an online gradient descent, so
be sure to come back for those!
Appendix
The following is the SQL query that was produced and run in DuckDB:
WITH table_0 AS (
SELECT
50 AS num_digits
),
config AS (
SELECT
num_digits,
10 * num_digits / 3 AS array_len,
5 AS calc_len,
10 * num_digits / 3 + 5 AS loop_len
FROM
table_0 AS table_1
),
table_6 AS (
WITH RECURSIVE loop AS (
SELECT
num_digits,
array_len,
loop_len,
0 AS _expr_0,
0 AS _expr_1,
0 AS _expr_2,
[2 for i in range(array_len)] AS _expr_3,
0 AS _expr_4,
0 AS _expr_5,
'' AS _expr_6
FROM
config
UNION
ALL
SELECT
num_digits,
array_len,
loop_len,
_expr_12 AS _expr_15,
_expr_11 AS _expr_16,
_expr_10 AS _expr_17,
_expr_9 AS _expr_18,
_expr_8 AS _expr_19,
_expr_7 AS _expr_20,
CASE
WHEN _expr_1 = 0 THEN ''
WHEN 1 <= _expr_1
AND _expr_1 <= array_len THEN ''
WHEN _expr_1 = array_len + 1 THEN ''
WHEN _expr_1 = array_len + 2 THEN _expr_13
WHEN _expr_1 = array_len + 3 THEN _expr_14
ELSE ''
END
FROM
(
SELECT
num_digits,
array_len,
loop_len,
CASE
WHEN _expr_1 = 0 THEN _expr_5
WHEN 1 <= _expr_1
AND _expr_1 <= array_len THEN _expr_5
WHEN _expr_1 = array_len + 1 THEN _expr_5
WHEN _expr_1 = array_len + 2 THEN CASE
WHEN _expr_2 = 9 THEN _expr_5
WHEN _expr_2 = 10 THEN 0
WHEN _expr_0 = 2 THEN _expr_2
WHEN _expr_0 > 2 THEN _expr_2
ELSE _expr_2
END
WHEN _expr_1 = array_len + 3 THEN _expr_5
ELSE _expr_5
END AS _expr_7,
CASE
WHEN _expr_1 = 0 THEN _expr_4
WHEN 1 <= _expr_1
AND _expr_1 <= array_len THEN _expr_4
WHEN _expr_1 = array_len + 1 THEN _expr_4
WHEN _expr_1 = array_len + 2 THEN CASE
WHEN _expr_2 = 9 THEN _expr_4 + 1
WHEN _expr_2 = 10 THEN 0
WHEN _expr_0 = 2 THEN _expr_4
WHEN _expr_0 > 2 THEN _expr_4
ELSE _expr_4
END
WHEN _expr_1 = array_len + 3 THEN CASE
WHEN _expr_2 <> 9
AND _expr_2 <> 10
AND _expr_4 <> 0 THEN 0
ELSE _expr_4
END
ELSE _expr_4
END AS _expr_8,
CASE
WHEN _expr_1 = 0 THEN _expr_3
WHEN 1 <= _expr_1
AND _expr_1 <= array_len THEN [CASE WHEN i==_expr_1 THEN (10*_expr_3[i] + _expr_2 *(array_len - i + 1)
) %(2 *(array_len - i) + 1)
ELSE _expr_3 [i]
END for i in generate_series(1, array_len) ]
WHEN _expr_1 = array_len + 1 THEN [CASE WHEN i==array_len THEN _expr_2%10 ELSE _expr_3[i]
END for i in generate_series(1, array_len) ]
WHEN _expr_1 = array_len + 2 THEN _expr_3
WHEN _expr_1 = array_len + 3 THEN _expr_3
ELSE _expr_3
END AS _expr_9,
CASE
WHEN _expr_1 = 0 THEN 0
WHEN 1 <= _expr_1
AND _expr_1 <= array_len THEN (
10 * _expr_3 [_expr_1] + _expr_2 * (array_len - _expr_1 + 1)
) / (2 * (array_len - _expr_1) + 1)
WHEN _expr_1 = array_len + 1 THEN _expr_2 / 10
WHEN _expr_1 = array_len + 2 THEN _expr_2
WHEN _expr_1 = array_len + 3 THEN _expr_2
ELSE _expr_2
END AS _expr_10,
(_expr_1 + 1) % loop_len AS _expr_11,
CASE
WHEN _expr_1 = 0 THEN _expr_0 + 1
ELSE _expr_0
END AS _expr_12,
_expr_1,
CASE
WHEN _expr_2 = 9 THEN ''
WHEN _expr_2 = 10 THEN (_expr_5 + 1) :: string || repeat('0', _expr_4)
WHEN _expr_0 = 2 THEN _expr_5 :: string || '.'
WHEN _expr_0 > 2 THEN _expr_5 :: string
ELSE ''
END AS _expr_13,
CASE
WHEN _expr_2 <> 9
AND _expr_2 <> 10
AND _expr_4 <> 0 THEN repeat('9', _expr_4)
ELSE ''
END AS _expr_14,
_expr_2,
_expr_4
FROM
loop AS table_2
WHERE
_expr_0 < num_digits + 1
) AS table_3
)
SELECT
*
FROM
loop
)
SELECT
string_agg(_expr_6, '') AS pi
FROM
table_6 AS table_5
-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)