Rax/MySQL vs. Rax/Azure vs. Rax/Redshift

Rax/MySQL vs. Rax/Azure vs. Rax/Redshift
By | 2018-01-17T16:59:55+01:00 November 14th, 2014|Categories: Coding, Rax, SQL Rants|

Our SQL-backend family is contantly growing. Rax could already connect to SQLite, MySQL and PostgreSQL databases. Now we have also ported Rax to two major cloud databases: Microsoft Azure and AWS Redshift. The port to Azure gave us some headache due to problems with their ODBC driver for Linux. The port to Redshift was straightforward, as it’s using PostgreSQL’s syntax and ODBC driver.

Before Rax, it was difficult to compare the performance of various database systems on behavioral data, as each database vendor provides a completely different set of date and time functions, which are essential when analyzing behavioral data. With Rax, we can easily do this comparison, as Rax offers a uniform interface for time-related operations.

For testing, we use a real-life analysis: a computation of the TV reach. Below is a Rax script computing the reach of various TV channels:

We’ve run this script on Rax/MySQL, Rax/PostgreSQL, Rax/Azure and Rax/Redshift. The script runs on all backends essentially without changes. For Postgres-like dialects (PostgreSQL, Redshift), we had to change one of the lines in the import statement (since import contains snippets of SQL):

was changed to:

The data size in this experiment was rather small: the TvExposures table has approximately 400000 rows. The runtimes of the script on various database systems are plotted below (the chart was, naturally, generated by Rax). For Redshift, we tried various cluster configurations.

Somewhat surprisingly, the Azure’s SQL Server was actually slower on this query than MySQL on my local machine (Mac mini with 2.3 GHz Intel Core i7 and 8GB of memory). PostgreSQL performs rather badly. It seems to spend a lot of time on the non-equi join between TvExposures and RespondentWeights. Azure and a single-node Redshift configuration perform similarly. Increasing Redshift’s cluster size and using larger machines gave a clear performance benefit. However, the largest cluster size that we used on Redshift was 2. Scaling beyond that simply didn’t make sense for the data size in this demo.

All in all, it was an interesting experiment. However, it will be even more interesting to compare the performance of this script on real Big Data. You will read about it in one of the following blog posts. Stay put.

And for the curious, this is the query generated from this script by Rax/Azure (don’t get scared):