Chapter 21. Rax SQL back ends

Table of Contents

Lazy execution
Handling differences in SQL dialects
Client-side processing

In this chapter, we will dive into some technical details to explain how Rax uses an SQL-based relational database back end. The knowledge of this chapter is not strictly necessary to use Rax, but it will help to understand some performance considerations.

Rax stores sets in an SQL database and translates set operations into SQL queries and executes them in the underlying SQL database. Rax can run on top of several brands of relational databases. Currently, Rax supports the following database systems:

Lazy execution

Since Rax is a functional language, it can use lazy execution to speed up execution. Lazy execution is a strategy of executing programs, in which evaluation of expressions is delayed until the value of an expression is actually needed. In the case of Rax, the execution of (possibly expensive) SQL statements is delayed until the user actually wants to see their results, typically until one of the output procedures is used: `print, `export, `barplot, etc. As long as the user doesn't require output, set expressions are only translated into SQL queries. If multiple transformations are applied to the same set, the SQL queries are combined into a single, nested SQL query, which will be executed in one go. For example, the following script:

    {[#:deptId, #:empId, $:name, &:salary]} :
    employees := import [
        (#)"DEPTID",
        (#)"EMPID",
        ($)"NAME",
        (&)"SALARY"
      ] "EMPLOYEES";

    \employees: newEmployees :=
      select [.empId > 7600]
        employees;

    \employees: convertedSalaries :=
      project [.deptId .. .name, .salary * 0.84]
        newEmployees;

    `print
      fold [.deptId, /sum(.salary)]
        convertedSalaries;
       

is translated into a single SQL query:

    SELECT C01, C02 FROM (
      SELECT 0 AS rownum, C01 AS C01, SUM(C02) AS C02
      FROM (
        SELECT A.rownum AS rownum, C01 AS C01, C04 AS C02
        FROM (
          SELECT A.rownum AS rownum, C01 AS C01, C02 AS C02,
                 C03 AS C03, (C04) * (0.840000::FLOAT(53)) AS C04
          FROM (
            SELECT A.rownum AS rownum, C01 AS C01, C02 AS C02,
                   C03 AS C03, C04 AS C04
            FROM (
              SELECT 0 AS rownum, DEPTID AS C01, EMPID AS C02,
                     NAME AS C03, SALARY AS C04 FROM EMPLOYEES
              ) AS A
            WHERE (CASE WHEN((C02) > (7600::BIGINT))THEN(1)ELSE(0)END)=1
          ) AS A
        ) AS A
      ) AS PRO
      GROUP BY C01
    ) AS A
    ORDER BY C01, C02
       

However, if we add `print or `log statements to print the intermediate results:

    %logfile stdout;

    {[#:deptId, #:empId, $:name, &:salary]} :
    employees :=
      import [
        (#)"DEPTID",
        (#)"EMPID",
        ($)"NAME",
        (&)"SALARY"
      ] "EMPLOYEES";

    `log employees;

    \employees: newEmployees :=
      select [.empId > 7600]
        employees;

    `log employees;

    \employees: convertedSalaries :=
      project [.deptId .. .name, .salary * 0.84]
        newEmployees;

    `log employees;

    `print
      fold [.deptId, /sum(.salary)]
        convertedSalaries;
       

Rax will execute one SQL query per statement. Note that if you switch logging off by changing the top line to

    %logfile off;
       

Rax will generate only one SQL statement again.

This combining of multiple SQL statements into one delivers typically a huge performance advantage, up to two orders of magnitude. The combined queries give the SQL optimizer many opportunities to optimize the execution. In the example above, the optimizer can direct the IO module to read only part of the data from the disk since it knows that it will need only two out of four columns and only rows meeting the criterion .empid > 7600. In fact, combining SQL queries is one of the first things that SQL-savvy analysts do when they want to speed up their SQL script. They have to do it by hand, though, while Rax does it automatically. What is worse, when such a large, combined query needs to be debugged, the analyst needs to decompose it again. This is a costly and error-prone process. Debugging Rax scripts is just a matter of adding `log statements. The folding and unfolding of SQL queries happens automatically.