A principled approach to language integrated queries

Quelques λ dans un monde de 👊

Kim Nguyễn1

joined work with:

V. Benzaken1 G. Castagna2 L. Daynès3 J. Lopez1 R. Vernoux4
  1. LRI, Université Paris-Sud
  2. IRIF, Université Paris-Diderot
  3. Oracle Corp.
  4. (formerly) ENS Cachan

[…] strategic breakthrough will come from redoing the representation of your data or table. This is where the heart of a program lies. Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

The Mythical Man-Month (Frederick P. Brooks, Jr., 1975/1995)



Data dominates.

Rob Pike's rule number 5

Programming with a database

function printPeopleFromCity (db, city) { //db holds a connection to the database q = "SELECT firstname, lastname FROM PEOPLE P WHERE P.city = '" + city + "'"; result = db.query (q); while ((row = next(result)) != null) { println (row.get("lastname") + " " + row.get("firstname")); } }

Problems of the text interface : errors

Partial solutions

//Generates a fresh ID and insert a corresponding DB line People p1 = new People ("Nguyen", "Kim", "Saclay"); People p2 = new People ("Baelde", "David", "Cachan"); //Adds an entry in a join table ! p1.friends.add(p2); p2.friends.add(p1); // cb is a CriteriaBuilder object CriteriaQuery cq = cb.createQuery(People.class); Root p = cq.from(People.class); //People_ is a singleton object with one field for each //database column (with the name of that column inside) cq.where(cb.equal(p.get(People_.town), city)); //The typesystem ensures that queries are well formed and //well typed (city needs to be a String)

Language Integrated Queries (LINQ)

The host language integrates queries as syntactic elements.

.NET based languages support this in a statically typed setting :

//C# code out of the box! //(if you use Microsoft SQL Server) void printPeopleFromCity(DB db, string city) { var q = from p in db.people where p.city == city select new Person(p.firstname, p.lastname); foreach(p in q) { Console.WriteLine(p.toString ()); } }

Query avalanche

function allCapitals (db) { res = db.query ("SELECT capital FROM COUNTRY"); return res; } function printPeopleFromCity(db, city) { … } // unchanged function printPeopleFromCapital(db) { caps = allCapitals(db); while ((row = next(caps)) != null) { printPeopleFromCity(db, row.get("capital")); } } //What you really want is SELECT firstname,lastname // FROM COUNTRY C, PEOPLE P // WHERE P.city = C.capital

UDF

User-Defined Functions (UDF) encapsulate application logic, but …

function isGood (p) { … } //intricate condition function printGoodPeople (db) { //can't call isGood() in a string ! q = "SELECT * FROM PEOPLE P "; result = db.query (q); while ((row = next(result)) != null) { //filtering is done on the client ! if (isGood(row)) println (row.get("lastname") + " " + row.get("firstname")); } }

What about LINQ ?

May introduces query avalanches silently :

IEnumerable<City> allCapitals(DB db) { return (from c in db.country select new City(c.capital)); } void printPeopleFromCityList(DB db, IEnumerable<City> cities) { var q = from p in db.people //has the syntactic from c in cities //structure of a join where p.city == c.name select new Person(p.firstname, p.lastname); foreach (p in q) { Console.WriteLine(p.toString()); } } void test (db) { printPeopleFromCityList(allCapitals(db)); }

What about LINQ (2)?

Does not help with other kinds of UDFs :

void printGoodPeople(db) { var q = from p in db.people where isGood(p); select new Person (p.firstname, p.lastname); foreach(p in q) { Console.WriteLine(p.toString()); } } Unable to cast object of type 'System.Data.Linq.SqlClient.SqlColumn' to type 'System.Data.Linq.SqlClient.SqlMethodCall'.

Stored Procedure

Database side functions

-- in the database CREATE OR REPLACE FUNCTION isGood(p IN PEOPLE%ROWTYPE) IS BOOLEAN BEGIN … END; //in your programming language function printGoodPeople (db, city) { q = "SELECT * FROM PEOPLE P WHERE isGood(P)"; result = db.query (q); while ((row = next(result))!= null) { println (row.get("lastname") + " " + row.get("firstname")); } }

Problems with Stored Procedures

The BOLDR project

Breaking Boundaries Between Languages and Databases runtimes

A common framework to express query fragments in host languages :

BOLDR

Uses a common Query Intermediate Representation (QIR)

BOLDR
  1. At runtime, isolate query fragments
  2. Translate that part into a QIR term
  3. Apply some highlevel optimisation to produce ``good´´ queries
  4. Translate from QIR to your database dialect
  5. Execute the query
  6. Translate the result back to QIR
  7. Translate the result back to the host
  8. Profit !

QIR

Query Intermediate Representation

Untyped λ-calculus with data-operators and data-constructors:

  q ∷= x   | λx.q   | q @ q   | c   | if q then q else q   | { x : q, … , x : q }   | q · x   | q :: q   | ldestr q q q   | Oq, …, q(q, …, q)   | ■H(t)   | DB variable function application constants conditional record constructor record destructor list constructor ([ q1, …, qn]≡q1::…::qn::[]) list destructor data operator foreign H-language code data reference to a database B

Data extension

Data reference D : a database-side value. LD is the query language of the database (e.g. SQL).


Data operators :

Projectq1(q2) ⇝ [ q1@v | v q2 ] (a.k.a List.map) Filterq1(q2) ⇝ [ v | v q2, q1@v ⇝ true ] (a.k.a List.filter) Joinq(q1, q2) ⇝ [ v1⋈v2 | v1q1, v2q2, q @ v1 @ v2 ⇝ true ] ScanD : returns the sequence of elements in D

AST node (H(t)) : is a term from a host language H (e.g. Python, Javascript, …)

An interpreter for the QIR

(values) v ∷= c | { x : v, …, x : v } | [ v, …, v ] | λΓx.q
(x, v) Γ

Γ⊢ x ⇝ v
Γ⊢ q ⇝ true   Γ⊢ q1 ⇝ v1

Γ⊢ if q then q1 else q2 ⇝ v1
Γ⊢ q ⇝ false   Γ⊢ q2 ⇝ v2

Γ⊢ if q then q1 else q2 ⇝ v2
Γ⊢ q1 ⇝ λΓx.q   (x,v)::Γ'⊢ q2 ⇝ v'

Γ⊢ q1 @ q2 ⇝ v'

Γ⊢ λx.q ⇝ λΓx.q
Γ⊢ qi ⇝ vi   v2≡[…]

Γ⊢ q1 :: q2 ⇝ v1 :: v2
Γ⊢ q ⇝ { …, x : v, … }

Γ⊢ q · x ⇝ v
Γ⊢ x1 :qi ⇝ vi

Γ⊢ { x1 : q1, … , xn : qn } ⇝ { x1 : v1, … , xn : vn }
Γ⊢ q1 ⇝ []   Γ⊢ q2 @ [] ⇝ v1

Γ⊢ ldestr q1 q2 q3 ⇝ v
Γ⊢ q1 ⇝ v1::v2   Γ⊢ q3 @ v1 @ v2 ⇝ v

Γ⊢ ldestr q1 q2 q3 ⇝ v

Γ⊢ oq1,…,qn(q'1,…,q'm) ⇝o v
t ⇝H v

Γ⊢ ■H(t) ⇝ HtoQIR(v)

From Host language to QIR

Requirement for the host language

Given a host language H, we need :

HtoQIR(_)
translates values of H into QIR values
QIRtoH(_)
translates values of QIR into values of H
freeze(t)
freezes a term and returns ■(t)
_⇝H_
evaluation of (frozen) terms

… and of course, its abstract syntax

Simple Language

``JavaScripty´´ language (statements, closures, side-effects, …)

  e ::= c SL-constants   | x SL-variables   | fun (x,…,x) { s } SL-functions   | e (e,…,e) SL-application   | { e : e, …, e : e } SL-records   | e.e SL-field access   | e := e SL-assignment   s ::= e SL-expression   | s;s SL-sequence   | if (e) { s } else { s } SL-conditional   | while (e) { s } SL-loop   | return e SL-r​eturn

Translating SL-fragments to QIR

We define a translation s from SL programs to QIR terms :

c x fun (x₁,…,xₙ) { s } e₀ (e₁,…,eₙ) o (e₁,…,eₙ)   { "l₁" : e₁, …, "lₙ" : eₙ } e."l" x = e;s s₁;s₂ if (e) { s₁ } else { s₂ } return e s = = = = = = = = = = = = HtoQIR(c) x (λx₁.….λxₙ. s ) e₀ @ e₁ @ … @ eₙ oe₀,…,eₖ(eₖ₊₁,…,eₙ) o { Filter, Scan, Project, … } { l₁ : e₁ , …, lₙ : eₙ } e·l (λxs) @ e (x is in the local scope and …) (λ_s₂) @ s₁ if e then s₁ else s₂ e freeze(s) otherwise

When to translate ?

The semantics of H is modified as follows :

In a nutshell

function printPeopleFromCity (db, city) { q = Filter(fun (p) { return p.city == city; }, Scan(DPEOPLE)); result = force(q); while ((row = next(result)) != null) { println (row.lastname + " " + row.firstname); } }

becomes similar to:

function printPeopleFromCity (db, city) { result = QIRtoSL(eval((λq.q)@Filterλp. p·city == city(Scan(DPEOPLE))));   while ((row = next(result)) != null) { println (row.lastname + " " + row.firstname); } }

Translating to SQL/Hadoop/…

Handling several backends

We want to translate QIR terms

Joinλa.λb a·id == b·id(Scan(DPEOPLE),Scan(DACTOR))

should become (e.g.) :

SELECT * FROM PEOPLE A, ACTOR B WHERE A.id == B.id

if both tables are in a SQL database

Joinλa.λb a·id == b·id(HBaseToQIR@(get 'PEOPLE'), HBaseToQIR@(get 'ACTOR'))

if both tables are in a HBase database (which does not support joins)

Generic / Specific translations

Generic translation : q → t, LD where t is a term in LD
Specific translations : (Q[t1,…,tn])⇒LDt where Q is a QIR term context
Example of rules

q1t1,LD    q2t2,LD    Filtert1(t2)⇒LDt

Filterq1(q2) → t,LD
q1t1,LD    q2t2,LE    D ≠ E
Filterq1(q2) → FilterDtoQIR(evalD(t1))(EtoQIR(evalE(t2 )))
ScanDLDt

ScanD() → t,LD


Requirements for a database D

Query avalanche in QIR

Consider the QIR term :

(λa.λb. Joinλx.λy x·id = y·id (a, b)) @ (ScanSQL.emp()) (ScanSQL.dept())


Translation output :

(λa.λb. Joinλx.λy x·id = y·id (a, b)) @ (SQLtoQIR @ (evalSQL @ (SELECT * FROM emp))) @ (SQLtoQIR @ (evalSQL @ (SELECT * FROM dept)))


If we had performed the β-reduction :

(SQLtoQIR @ (evalSQL @ (SELECT * FROM (SELECT * FROM emp) AS x JOIN (SELECT * FROM dept) AS y ON x.id = y.id)))

β-reduction to the rescue !

Given a QIR term, we want to capture the following intuition

Reductions yielding ``good´´ terms depend on the target database language

Compatible operators and fragments

A data-operator or expression o is compatible with a database D if there exists a rule for o in the LD specific translation.

A fragment F is a subterm of a QIR term q such that

q ≡ C[T(q1,…,F[e1,…,ek],…,qn)]
and

Fragments

fragment
(λa.λb. Joinλx.λy x·id = y·id (Filterλx. x.age < 100(a,b))) @ (ScanSQL.emp()) @ (ScanSQL.dept())

A mesure of ``good´´ QIR terms

We define the measure M(q) on a QIR term q by :

M(q) = (Op(q) - Comp(q), Frag(q))lex

  1. Let f = F (initial fuel)
  2. If f = 0 or input q has no β-redex, stop
  3. Choose a β-redex in q
  4. Reduce q to q'
  5. If M(q') < M(q) set f = F, repeat from 2
  6. Else set f = f - 1,repeat from 2

Lot of heuristics, optimality of the strategy for some class of terms,…

Putting it into practice

The only thing you have to do is change the semantics of the host language […]

me, section 2 of this talk.

Truffle/Graal

Available languages : SimpleLanguage, R, Python, Ruby, JavaScript (closed-source)

BOLDR now supports SimpleLanguage and R (alpha-version), and main-memory, SQL and HBase as backends

Demo

Related work

Conclusion and future work

BOLDR

A modular framework that supports

What's next ?