ishan's notes

Composable SQL abstractions with FunSQL


SQL queries get tricky to write as they get longer, and even trickier to read. FunSQL is a Julia/Python library that makes this a bit easier by constructing SQL queries one table operation at a time. This post provides a couple of examples to make this point.

Note: FunSQL was originally implemented as a Julia library by Clark Evans and Kyrylo Simonov. The examples in this post use a Python port I wrote, imitating the same API. While this post only gives a small overview, the original Julia docs are the best place to learn more.


How FunSQL works

SQL is tricky to write in a modular fashion since it is a DSL with its own grammar. So, while individual relational operations are all short and concise, figuring out how to put them together in the right order can get unwieldy.

The straightforward way to compose SQL query fragments, then, is string interpolation/concatenation, often extended with a templating language like Jinja. A very good example is the DBT library, which makes writing data transforms modular.
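As a minimal sketch of what that looks like (the table and column names here are made up for illustration), a "reusable" join fragment might be a Jinja template that gets rendered and spliced into a larger string:

from jinja2 import Template

# a reusable fragment, parameterized with Jinja placeholders
join_tmpl = Template(
    "{{ left }} join {{ right }} on {{ left }}.{{ fk }} = {{ right }}.id"
)

query = "select foo.id, bar.name from " + join_tmpl.render(
    left="foo", right="bar", fk="bar_id"
)
# select foo.id, bar.name from foo join bar on foo.bar_id = bar.id

This works, but the composition happens at the level of strings, with no awareness of SQL's structure.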

Another alternative is to replace writing SQL directly with a pipeline-style DSL, like dplyr, prql or ibis. These query languages define a set of verbs, each representing a table operation, and let us build analytics queries incrementally. The DSL engine translates the pipeline into SQL for us.

FunSQL occupies a slightly distinct point in the design space of relational query languages. It exposes the full expressive power of SQL by implementing the SQL verbs (FROM, WHERE, GROUP BY, ...) as regular objects in the host language (Julia/Python). The smaller SQL fragments can then be manipulated or combined freely to construct a single SQL query. Working in a regular language makes it easy to extend FunSQL with new syntactic features, or even to write your own query DSLs on top of it.

Note: I'd think the other pipelined query languages (prql) or libraries (ibis) can be used similarly, by working with their internal compiler representations. ORMs, on the other hand, might seem similar, but they generally abstract over the SQL grammar rather than modeling the data flow.

Edit: FunSQL creator Kyrylo Simonov wrote a new article illustrating the differences here.
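To make the idea concrete before the longer examples, here is a tiny sketch using the same constructs that appear throughout this post (the person table is made up): each verb is an ordinary Python value, so a fragment can be stored in a variable and extended later.

from funsql import *

person = SQLTable(S.person, ["id", "year_of_birth"])

# a query fragment is just a Python value ...
born_1990 = From(person) >> Where(Fun("=", Get.year_of_birth, 1990))

# ... which can be extended into a full query later
q = born_1990 >> Select(Get.id)
render(q)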

Illustration 1 - getting around SQL quirks

Jamie Brandon has a super thoughtful writeup titled Against SQL about how working with SQL is difficult and ways to improve upon it. While the latter task is comparable to implementing a new programming language, writing SQL queries can be made easier using regular programming abstractions.

We address the SQL query examples highlighted in the post using FunSQL. While not dissimilar to pipelined DSLs or ORMs in its objective, it stays close to SQL semantics and aims to feel just like writing SQL directly.

from funsql import *

SQL is verbose to express

The example given shows how SQL is verbose since we can't abstract over common patterns. While the SQL spec allows for function definitions, the article points out the limits on the types of their input arguments, and hence their reduced flexibility.

select foo.id, quux.value 
from foo, bar, quux 
where foo.bar_id = bar.id and bar.quux_id = quux.id

Regular programming languages like Python don't share the restrictions of SQL functions, so creating higher-level abstractions is convenient. Here, we create a function to join multiple tables through foreign key relationships.

foo = SQLTable(S.foo, ["id", "bar_id"])
bar = SQLTable(S.bar, ["id", "quux_id"])
quux = SQLTable(S.quux, ["id", "value"])

Since FunSQL constructs are regular functions and objects in the host language, we can use them to abstract over composite SQL clauses.

def fk_join(*args, id_column="id"):
    # args is an interleaved list of tables and foreign key names
    table = From(args[0])
    fk_name = None
    for i, arg in enumerate(args[1:]):
        if i % 2 == 0:
            fk_name = S(arg)
        else:
            joinee = From(arg)
            table = table >> Join(
                joinee, on=Fun("=", Get(fk_name), Get(id_column, over=joinee))
            )
    return table


q = fk_join(foo, "bar_id", bar, "quux_id", quux)
render(q)

FunSQL compiles this query object to the following SQL.

-- output query:
SELECT
  "foo_1"."bar_id", 
  "bar_1"."quux_id", 
  "quux_1"."value"
FROM "foo" AS "foo_1"
INNER JOIN "bar" AS "bar_1" ON ("foo_1"."bar_id" = "bar_1"."id")
INNER JOIN "quux" AS "quux_1" ON ("bar_1"."quux_id" = "quux_1"."id")

You'll note that writing FunSQL expressions in Python isn't particularly concise; Fun("=", Get(fk_name), Get(id_column, over=joinee)) is too verbose for a join condition. The Julia version uses operator overloading to simplify this, which I have been too lazy to copy. Besides, FunSQL being just a Python library, we can create our own abstractions to suit our ergonomic preferences.
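For instance, a couple of tiny helpers (hypothetical, not part of the library) tighten up the join condition while only using the constructs shown above:

def eq(left, right):
    # equality predicate, purely for readability
    return Fun("=", left, right)


def fk_condition(fk_name, joinee, id_column="id"):
    # foreign key equality against the joined table's id column
    return eq(Get(fk_name), Get(id_column, over=joinee))


# the Join inside fk_join then reads:
#   table >> Join(joinee, on=fk_condition(fk_name, joinee))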

SQL queries are fragile

The example shows how correlated subqueries in a SELECT clause can only return a single column, and must be swapped for lateral joins if we need flexibility in the output type.

-- inline for a single column
select
    manager.name,
    (select employee.name
    from employee
    where employee.manager = manager.name
    order by employee.salary desc
    limit 1)
from manager;

-- lateral join for multiple columns
select manager.name, employee.name, employee.salary
from manager
join lateral (
   select employee.name, employee.salary
   from employee
   where employee.manager = manager.name
   order by employee.salary desc
   limit 1
) as employee
on true;

I couldn't find the reason why inline subqueries are allowed in places where a scalar expression is required - arguments to a SELECT clause, WHERE expressions, and more - though this feels like an inconsistency coming from SQL's desire to be less verbose. The nested query is really more like a table than a column, so lateral joins are the "correct" choice. However, inline subqueries are slightly easier to read/write (and also test?).

To hide this detail from the query writer, we could compile to an inline query when a single column is selected, and use a lateral join otherwise. Alternatively, we could just output a lateral join every time. By creating an abstraction for the correlated join, we can keep the query syntax concise.

manager = SQLTable("manager", ["id", "name"])
employee = SQLTable("employee", ["id", "name", "salary", "manager"])

# returns highest paid employee for a given manager
def most_paid_employee(m_name):
    return (
        From(employee)
        >> Where(Fun("=", Get.manager, Var.MANAGER_NAME))
        >> Order(Get.salary >> Desc())
        >> Limit(1)
        >> Bind(aka(m_name, S.MANAGER_NAME))
    )


q = most_paid_employee("ABC") >> Select(Get.name, Get.salary)
render(q)
-- output query:
SELECT
  "employee_1"."name", 
  "employee_1"."salary"
FROM "employee" AS "employee_1"
WHERE ("employee_1"."manager" = 'ABC')
ORDER BY "employee_1"."salary" DESC
LIMIT 1

Now, we can also reuse this subquery to compute the top-salaried employee for each manager.

q = (
    From(manager)
    >> Join(most_paid_employee(Get.name) >> As("employee"), on=True, left=True)
    >> Select(
        Get.name,
        aka(Get.employee >> Get.name, "emp_name"),
        aka(Get.employee.salary, "emp_salary"),
    )
)
render(q)
-- output query:
SELECT
  "manager_1"."name", 
  "employee_2"."name" AS "emp_name", 
  "employee_2"."salary" AS "emp_salary"
FROM "manager" AS "manager_1"
LEFT JOIN LATERAL (
  SELECT
    "employee_1"."name", 
    "employee_1"."salary"
  FROM "employee" AS "employee_1"
  WHERE ("employee_1"."manager" = "manager_1"."name")
  ORDER BY "employee_1"."salary" DESC
  LIMIT 1
) AS "employee_2" ON TRUE

Since the columns selected are specified at the end, we don't have to go back and edit the correlated query whether we pick one, many, or no columns from it!
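For instance, picking just the salary needs no change to most_paid_employee; only the final Select differs (a sketch reusing the definitions above):

q = (
    From(manager)
    >> Join(most_paid_employee(Get.name) >> As("employee"), on=True, left=True)
    >> Select(Get.name, aka(Get.employee.salary, "emp_salary"))
)
render(q)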

SQL code is incompressible

The post provides multiple examples where SQL makes you tear your hair out. I concur.

Variables

Temporary scalar variables can't be created unless they are included in the output. For example, this arithmetic op can't be abstracted over without creating a subquery.

-- repeated structure
select a+((z*2)-1), b+((z*2)-1) from foo;

-- compressed?
select a2, b2 from (select a+tmp as a2, b+tmp as b2, (z*2)-1 as tmp from foo);

Since FunSQL nodes are regular Python variables, we can just reuse them and hope they will be compiled away.

foo = SQLTable("foo", ["a", "b", "z"])


def add_z(col):
    # shared expression: (z * 2) - 1
    z_expr = Fun("-", Fun("*", Get.z, 2), 1)
    return Fun("+", z_expr, col)


q = From(foo) >> Select(add_z(Get.a) >> As(S.a), add_z(Get.b) >> As(S.b))
render(q)
-- output query: 
SELECT
  ((("foo_1"."z" * 2) - 1) + "foo_1"."a") AS "a", 
  ((("foo_1"."z" * 2) - 1) + "foo_1"."b") AS "b"
FROM "foo" AS "foo_1"

Grouping by a custom column

SQL doesn't allow naming the expressions in a GROUP BY clause.

-- can't name this value
> select x2 from foo group by x+1 as x2;
ERROR:  syntax error at or near "as"
LINE 1: select x2 from foo group by x+1 as x2;

-- sprinkle some more select on it
> select x2 from (select x+1 as x2 from foo) group by x2;

FunSQL, fundamentally, just tracks the column/relation namespaces through each operation. It adds the columns created by the Group node to the namespace for that subquery, and moves the alias to the corresponding SELECT.

foo = SQLTable("foo", ["x", "y"])
q = (
    From(foo)
    >> Group(aka(Fun("+", Get.x, 1), S.x2))
    >> Select(Get.x2, Agg.count(Get.y))
)

render(q)
-- output query:
SELECT
  ("foo_1"."x" + 1) AS "x2", 
  count("foo_1"."y") AS "count"
FROM "foo" AS "foo_1"
GROUP BY ("foo_1"."x" + 1)

CTEs

SQL didn't have CTEs until SQL:1999. While most common SQL dialects support them now, they can lead to unexpected performance characteristics; Postgres, for example, treated CTEs as optimization fences until 2019.

-- repeated structure
select * 
from 
  (select x, x+1 as x2 from foo) as foo1 
left join 
  (select x, x+1 as x2 from foo) as foo2 
on 
  foo1.x2 = foo2.x;
  
-- compressed?
with foo_plus as 
  (select x, x+1 as x2 from foo)
select * 
from 
  foo_plus as foo1 
left join 
  foo_plus as foo2 
on 
  foo1.x2 = foo2.x;

With FunSQL, inline subqueries can be written similarly to CTEs without duplication. We just reuse the variable representing the subquery.

foo = SQLTable("foo", ["x", "y"])
foo_plus = From(foo) >> Select(Get.x, aka(Fun("+", Get.x, 1), S.x2))

The inline version gets rendered as,

q = foo_plus >> Join(
    aka(foo_plus, "foo_2"), left=True, on=Fun("=", Get.x2, Get.foo_2.x)
)
render(q)
-- output query:
SELECT
  "foo_2"."x", 
  "foo_2"."x2"
FROM (
  SELECT
    "foo_1"."x", 
    ("foo_1"."x" + 1) AS "x2"
  FROM "foo" AS "foo_1"
) AS "foo_2"
LEFT JOIN (
  SELECT
    "foo_3"."x", 
    ("foo_3"."x" + 1) AS "x2"
  FROM "foo" AS "foo_3"
) AS "foo_2_1" ON ("foo_2"."x2" = "foo_2_1"."x")

While with the foo_plus subquery defined as a CTE,

q = (
    From(S.foo_plus)
    >> Join(
        aka(From(S.foo_plus), "foo_plus_2"),
        left=True,
        on=Fun("=", Get.x2, Get.foo_plus_2.x),
    )
    >> With(foo_plus >> As(S.foo_plus))
)
render(q)
-- output query:
WITH "foo_plus_1" ("x", "x2")  AS (
  SELECT
    "foo_1"."x", 
    ("foo_1"."x" + 1) AS "x2"
  FROM "foo" AS "foo_1"
)
SELECT
  "foo_plus_2"."x", 
  "foo_plus_2"."x2"
FROM "foo_plus_1" AS "foo_plus_2"
LEFT JOIN "foo_plus_1" AS "foo_plus_3" ON ("foo_plus_2"."x2" = "foo_plus_3"."x")

Conclusion

While SQL definitely needs a redo for the long list of reasons given in the Against SQL post, FunSQL lets us get around some of SQL's lexical issues. It could be useful for querying systems that speak SQL directly, or for implementing a more concise DSL on top of them.

Illustration 2 - making your own DBT

DBT is a widely used tool for ETL. You define the data transformations as a set of SELECT statements with their dependencies declared, and DBT takes care of executing the full graph by running them in the right order. To make the table/view definitions parametric and reusable, DBT lets you combine SQL with Jinja, a templating language. Jinja provides some niceties, like control flow (if/for) and macros, making DBT useful almost everywhere.

Since the FunSQL library helps compose SQL queries, we could do away with a templating language like Jinja. That isn't necessarily desirable! DBT fits an interactive workflow - models are SQL queries introduced one at a time, with some Jinja markup/macros - which is a more declarative approach than combining Python functions.

However, writing data transformations using a Python DSL affords some flexibility.

The Jaffle shop

The DBT introductory tutorial uses the example of a jaffle shop, which we try to reproduce. Since this is an illustrative example, we use a SQLite database file and no other dependencies. The full pipeline involves multiple tables, so we just show the code for a single view below. The full script is available here.

class orders_final(DataModel):
    __materialize__ = True
    orders: stg_orders
    payments: order_payments

    def query(self, ctx: Context) -> SQLNode:
        payment_methods: list[str] = ctx["payment_methods"]
        return _join_on_key(
            self.orders(ctx), self.payments(ctx), "payments", "order_id"
        ) >> Select(
            Get.order_id,
            Get.customer_id,
            Get.order_date,
            Get.status,
            *[Get.payments >> Get(f"{method}_amount") for method in payment_methods],
            aka(Get.payments.total_amount, "amount"),
        )

We define each table/view as a class, with the other views it depends on declared as attributes. Now we can recover the dependency graph by inspecting the code, and decide how to go about generating the tables/views.

def fill_graph(model: Type[DataModel], node_children: dict, node_parent_count: dict):
    """utility routine to show how to construct a graph of data models"""
    if model in node_parent_count:
        # already visited; don't count its edges again
        return
    node_children[model] = []
    node_parent_count[model] = 0

    for _, parent in get_parent_models(model):
        # visit the parent first so its node_children entry exists
        fill_graph(parent, node_children, node_parent_count)
        node_children[parent].append(model)
        node_parent_count[model] += 1
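Given the graph built by fill_graph, a topological pass (Kahn's algorithm) yields an execution order. A sketch, where materialize is a hypothetical routine that renders a model's query and runs it against the database:

from collections import deque


def run_in_order(roots):
    node_children, node_parent_count = {}, {}
    for model in roots:
        fill_graph(model, node_children, node_parent_count)

    # models with no pending parents can be materialized right away
    ready = deque(m for m, count in node_parent_count.items() if count == 0)
    while ready:
        model = ready.popleft()
        materialize(model)  # hypothetical: render the model's query and execute it
        for child in node_children[model]:
            node_parent_count[child] -= 1
            if node_parent_count[child] == 0:
                ready.append(child)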

To make things resilient, we could go further and adopt a workflow tool like Prefect. By making each table materialization a discrete task, we can get caching/scheduling and the other good stuff.
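A rough sketch of what that wiring could look like with Prefect; the body of materialize_model and the ordering of models are placeholders, not a working pipeline:

from prefect import flow, task


@task
def materialize_model(name: str, sql: str) -> None:
    # hypothetical: run "CREATE TABLE <name> AS <sql>" against the database
    ...


@flow
def run_pipeline(models_in_order):
    # models_in_order: (name, rendered SQL) pairs, sorted in dependency order
    for name, sql in models_in_order:
        materialize_model(name, sql)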

To emphasize again, this would be a terrible idea for a production system. DBT is a great tool, and replicating it here isn't a serious proposal. But it serves as a good illustration of the compositionality FunSQL provides.