Calculate the digits of Pi with DuckDB and PRQL

Tobias Brandt· ·
Updated

TL;DR: PRQL recently added a loop construct which makes it Turing Complete and allows doing cool things like calculating Pi right in your database.

Background

Last week saw the 0.6 release of PRQL which brought with it the capability to express Recursive CTEs in PRQL. “Recursive” CTEs aren’t actually truly recursive in the sense that that term is usually used, rather they use a “recursive” (i.e. self-referential) syntax to provide a looping construct in SQL.

PRQL is a modern, functional query language for transforming data. One of its goals is to simplify working with data wherever you can currently use SQL. As such it compiles to SQL while making available modern ergonomics such as f-strings, as well as not so modern features such as functions.

Given that the underlying semantics of Recursive CTEs are really about iteration or “looping”, we have called this feature loop in PRQL.

Introducing loop

Recursive CTEs in SQL consist of two parts, an initial_query and an update_query. First the initial_query is executed and then the rows produced are fed to update_query which is applied to the result set. The update_query is then iteratively applied to the rows produced in the last iteration until no more rows are produced, at which point iteration stops. (For a great review of this as well as some interesting proposals to extend the semantics of Recursive CTEs in SQL, see the paper “A Fix for the Fixation on Fixpoints” by Denis Hirn and Torsten Grust.)

This behavior can be expressed with following pseudo-code:

def loop(step, initial_query):
    result = []
    current = initial_query()
    while current is not empty:
        result = append(result, current)
        current = update_query(current)

    return result

The minimal loop example from the documentation in the PRQL book is:

from_text format:json '[{"n": 1 }]'
loop (
    select n = n+1
    filter n<=3
)

Here we use a PRQL utility function from_text to conveniently turn a JSON representation of some example data into a SQL table (from_text currently also accepts CSV input).

Initially a row with {"n":1} is fed in. Then the update query is applied to this which in this case just increments n by one and filters the result set to only the rows where n is less than or equal to three. This produces {"n":2} in the next step and {"n":3} after that. On the next iteration {"n":4} is produced but that is eliminated by the filter condition. Since that leaves no new rows, iteration stops.

If you try the query above in the PRQL Playground, the result set you get (in the “output.arrow” tab) is:

n
1
2
3

Fibonacci Numbers

Inspired by this, let’s try to calculate Fibonacci numbers which are often one of the first examples when recursion is introduced:

from_text format:json '[{"a":1, "b":1}]'
loop (
    derive b_new = a + b
    select [a=b, b=b_new]
)
take 7

which produces the first 7 Fibonacci numbers.

a
1
1
2
3
5
8
13

You might have noticed that we didn’t actually include a filter in our loop this time. Instead we relied on the fact that DuckDB produces results lazily and since we only took 7 numbers it only produced what we needed.

Now, let’s set ourselves a bigger challenge.

Calculating the digits of pi

March 14th is written on American calendars as 3/14 which reminds us of the first three digits of Pi=3.1415926535… . Therefore this day is commonly known as Pi day. In order to celebrate Pi Day 2023 and the recent release of loop in PRQL 0.6, why don’t we try to calculate the digits of Pi in PRQL!

For our query engine we will use DuckDB because it is really fast and has many modern features that make it ideally suited for the kinds of analytical queries that PRQL is targeting. There is even a duckdb-prql extension which allows you to write PRQL queries right inside DuckDB!

We follow the algorithms in the paper “Unbounded Spigot Algorithms for the Digits of Pi” by Jeremy Gibbons, in particular Rabinowitz and Wagon’s Spigot Algorithm. While it is not the most efficient algorithm presented in the paper, it has the advantage that it uses only standard data types while the more efficient algorithms rely on unbounded integer types which are not available in many database systems.

For my implementation, I adapted the Python implementation by John Burkardt found here. A bit of care had to be taken in porting the implementation to PRQL, as PRQL like SQL, is stateless so I introduced another state variable k to track the position inside the body of the loop. In order to ensure that all values of k are handled in the query we define a PRQL function called loop_steps. The ability to define functions in PRQL is one of its key features and I feel the query below really shows the power that the composability of functions brings to PRQL. PRQL is a functional language with features such as currying. For more details see our previous post A functional approach to relational queries.

The resulting PRQL query is the following:

prql target:sql.duckdb

let config = (
    from_text format:json '[{"num_digits":50}]'
    derive {
        array_len = (10*num_digits)/3,
        calc_len = 1+4,
        loop_len = array_len + calc_len,
    }
)

let loop_steps = step_0 step_i step_1 step_2 step_3 other -> case [
    k==0 => step_0,
    1 <= k and k <= array_len => step_i,
    k==array_len+1 => step_1,
    k==array_len+2 => step_2,
    k==array_len+3 => step_3,
    true => other,
]

let q_steps = step_q9 step_q10 step_j2 step_jg2 other -> case [
    q==9 => step_q9,
    q==10 => step_q10,
    j==2 => step_j2,
    j>2 => step_jg2,
    true => other,
]


from config
select [
    num_digits,
    array_len,
    loop_len,
    j = 0,
    k = 0,
    q = 0,
    a = s"[2 for i in range({array_len})]",
    nines = 0,
    predigit = 0,
    output = '',
]
loop (
    filter j < num_digits + 1
    derive [
      j_new = case [k==0 => j+1, true => j],
      k_new = (k+1) % loop_len,
      q_step_i = (10*s"{a}[{k}]"+q*(array_len-k+1))/(2*(array_len-k)+1),
      q_new = loop_steps 0 q_step_i (q/10) q q q,

      a_step_i = s"[CASE WHEN i=={k} THEN (10*{a}[i]+{q}*({array_len}-i+1))%(2*({array_len}-i)+1) ELSE {a}[i] END for i in generate_series(1,{array_len})]",
      a_step_1 = s"[CASE WHEN i=={array_len} THEN {q}%10 ELSE {a}[i] END for i in generate_series(1,{array_len})]",
      a_new = loop_steps a a_step_i a_step_1 a a a,

      nines_new = loop_steps nines nines nines (q_steps (nines+1) 0 nines nines nines) (case [q!=9 and q!=10 and nines!=0 => 0, true => nines]) nines,
      predigit_new = loop_steps predigit predigit predigit (q_steps predigit 0 q q q) predigit predigit,

      output_step_2 = (q_steps '' s"({predigit}+1)::string || repeat('0', {nines})" s"{predigit}::string || '.'" s"{predigit}::string" ''),
      output_step_3 = (case [q!=9 and q!=10 and nines!=0 => s"repeat('9', {nines})", true => '']),
      output_new = loop_steps '' '' '' output_step_2 output_step_3 '',
    ]
    select {
        num_digits,
        array_len,
        loop_len,
        j = j_new,
        k = k_new,
        q = q_new,
        a = a_new,
        nines = nines_new,
        predigit = predigit_new,
        output = output_new,
    }
)
aggregate [pi=s"string_agg({output}, '')"]

Go ahead and run this query right now in your browser with our Online PRQL Playground! There you can also see the SQL that is produced in the output.sql tab (for you convenience reproduced in the Appendix below) as well as the result of the calculation in the output.arrow tab.

The output you see will be something like the following:

┌────────────────────────────────────────────────────┐
│                         pi                         │
│                      varchar                       │
├────────────────────────────────────────────────────┤
│ 3.141592653589793238462643383279502884197169399375 │
└────────────────────────────────────────────────────┘

Why don’t you play around with the num_digits parameter to try and see how many digits you can get on your laptop? (Unfortunately this algorithm is quite a bit more inefficient than the equivalent Python implementation as Recursive CTEs currently don’t allow for state to be kept outside of the result set.)

Something you might have noticed is that there are some expressions surrounded by s"". This is called an s-string (s for SQL) and allows us to include raw SQL inside our PRQL queries. We use this to include features from DuckDB that haven’t made it into PRQL yet, such as the Pythonesque list comprehensions.

The loop functionality in PRQL is brand new and is marked as experimental and we will be working to stabilise the feature and iterate on the design to make it as easy and useful as possible. We felt that it was exciting enough to share it at this stage (and in time for Pi Day) and make it available for you to try out and play around with. PRQL is developed completely in the open so let us know your use cases so that we can make PRQL the best tool for the data challenges that you face.

Conclusion

Recursive CTEs in SQL make SQL Turing Complete and the same should hold for PRQL now that loop is included. As we saw above, not all algorithms are easily expressed in this paradigm, so while theoretically they could be, whether they should be is another question. However I hope this example demonstrates that loop brings great power to PRQL and in follow up posts I will demonstrate how we can use this to do tree and graph traversals which do come up in practice in the kind of analytical data work that PRQL is made for. After that I will also look at online algorithms such as moving averages an online gradient descent, so be sure to come back for those!

Appendix

The following is the SQL query that was produced and run in DuckDB:

WITH table_0 AS (
  SELECT
    50 AS num_digits
),
config AS (
  SELECT
    num_digits,
    10 * num_digits / 3 AS array_len,
    5 AS calc_len,
    10 * num_digits / 3 + 5 AS loop_len
  FROM
    table_0 AS table_1
),
table_6 AS (
  WITH RECURSIVE loop AS (
    SELECT
      num_digits,
      array_len,
      loop_len,
      0 AS _expr_0,
      0 AS _expr_1,
      0 AS _expr_2,
      [2 for i in range(array_len)] AS _expr_3,
      0 AS _expr_4,
      0 AS _expr_5,
      '' AS _expr_6
    FROM
      config
    UNION
    ALL
    SELECT
      num_digits,
      array_len,
      loop_len,
      _expr_12 AS _expr_15,
      _expr_11 AS _expr_16,
      _expr_10 AS _expr_17,
      _expr_9 AS _expr_18,
      _expr_8 AS _expr_19,
      _expr_7 AS _expr_20,
      CASE
        WHEN _expr_1 = 0 THEN ''
        WHEN 1 <= _expr_1
        AND _expr_1 <= array_len THEN ''
        WHEN _expr_1 = array_len + 1 THEN ''
        WHEN _expr_1 = array_len + 2 THEN _expr_13
        WHEN _expr_1 = array_len + 3 THEN _expr_14
        ELSE ''
      END
    FROM
      (
        SELECT
          num_digits,
          array_len,
          loop_len,
          CASE
            WHEN _expr_1 = 0 THEN _expr_5
            WHEN 1 <= _expr_1
            AND _expr_1 <= array_len THEN _expr_5
            WHEN _expr_1 = array_len + 1 THEN _expr_5
            WHEN _expr_1 = array_len + 2 THEN CASE
              WHEN _expr_2 = 9 THEN _expr_5
              WHEN _expr_2 = 10 THEN 0
              WHEN _expr_0 = 2 THEN _expr_2
              WHEN _expr_0 > 2 THEN _expr_2
              ELSE _expr_2
            END
            WHEN _expr_1 = array_len + 3 THEN _expr_5
            ELSE _expr_5
          END AS _expr_7,
          CASE
            WHEN _expr_1 = 0 THEN _expr_4
            WHEN 1 <= _expr_1
            AND _expr_1 <= array_len THEN _expr_4
            WHEN _expr_1 = array_len + 1 THEN _expr_4
            WHEN _expr_1 = array_len + 2 THEN CASE
              WHEN _expr_2 = 9 THEN _expr_4 + 1
              WHEN _expr_2 = 10 THEN 0
              WHEN _expr_0 = 2 THEN _expr_4
              WHEN _expr_0 > 2 THEN _expr_4
              ELSE _expr_4
            END
            WHEN _expr_1 = array_len + 3 THEN CASE
              WHEN _expr_2 <> 9
              AND _expr_2 <> 10
              AND _expr_4 <> 0 THEN 0
              ELSE _expr_4
            END
            ELSE _expr_4
          END AS _expr_8,
          CASE
            WHEN _expr_1 = 0 THEN _expr_3
            WHEN 1 <= _expr_1
            AND _expr_1 <= array_len THEN [CASE WHEN i==_expr_1 THEN (10*_expr_3[i] + _expr_2 *(array_len - i + 1)
          ) %(2 *(array_len - i) + 1)
          ELSE _expr_3 [i]
      END for i in generate_series(1, array_len) ]
      WHEN _expr_1 = array_len + 1 THEN [CASE WHEN i==array_len THEN _expr_2%10 ELSE _expr_3[i]
  END for i in generate_series(1, array_len) ]
  WHEN _expr_1 = array_len + 2 THEN _expr_3
  WHEN _expr_1 = array_len + 3 THEN _expr_3
  ELSE _expr_3
END AS _expr_9,
CASE
  WHEN _expr_1 = 0 THEN 0
  WHEN 1 <= _expr_1
  AND _expr_1 <= array_len THEN (
    10 * _expr_3 [_expr_1] + _expr_2 * (array_len - _expr_1 + 1)
  ) / (2 * (array_len - _expr_1) + 1)
  WHEN _expr_1 = array_len + 1 THEN _expr_2 / 10
  WHEN _expr_1 = array_len + 2 THEN _expr_2
  WHEN _expr_1 = array_len + 3 THEN _expr_2
  ELSE _expr_2
END AS _expr_10,
(_expr_1 + 1) % loop_len AS _expr_11,
CASE
  WHEN _expr_1 = 0 THEN _expr_0 + 1
  ELSE _expr_0
END AS _expr_12,
_expr_1,
CASE
  WHEN _expr_2 = 9 THEN ''
  WHEN _expr_2 = 10 THEN (_expr_5 + 1) :: string || repeat('0', _expr_4)
  WHEN _expr_0 = 2 THEN _expr_5 :: string || '.'
  WHEN _expr_0 > 2 THEN _expr_5 :: string
  ELSE ''
END AS _expr_13,
CASE
  WHEN _expr_2 <> 9
  AND _expr_2 <> 10
  AND _expr_4 <> 0 THEN repeat('9', _expr_4)
  ELSE ''
END AS _expr_14,
_expr_2,
_expr_4
FROM
  loop AS table_2
WHERE
  _expr_0 < num_digits + 1
) AS table_3
)
SELECT
  *
FROM
  loop
)
SELECT
  string_agg(_expr_6, '') AS pi
FROM
  table_6 AS table_5

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)