Sort by Regex

Sort by Regex
By | 2018-01-17T16:21:58+01:00 June 6th, 2016|Categories: Coding|

Rax string reduction, /cat(), on the IMPALA backend was not trivial to implement. The IMPALA documentation states:

Effectively this means no ORDER BY on GROUP_CONCAT in IMPALA. This must be implemented eventually, because the whole GROUP_CONCAT has limited use without it. Meantime, regexes can be used to put things back in order like so:

Which would place eve first, anna second, and dave third. The rank number rnum of each name can be expressed like this:

The GROUP_CONCAT from rnum belonging to (anna)(dave)(eve) would be 231, which can be translated to \3\1\2 using regexp_replace and translate like so:

Giving us the third argument of the sorting regex. The first and second argument are simple. Likewise an SQL SELECT statement like this:

could be rewritten in IMPALA like (untested) so:

Your milage may vary using this code, but if you, like Rax, need a solution right now, try a variant of this code. (You might need to double up the \\.) Also note that it might be safer to use an escaped version of child to use in the second argument of the outer regexp_replace like so:

Finally, note, this only works with group sizes below ten. To up that, replace 123456789 by abcdefghijklmnop (or longer), replace CAST(rnum AS STRING) with something like substr('abcdefghijklmnop', rnum, 1) and add an extra regexp_replace in front of strleft to transform \a\b\c...\o\p into \1\2\3..\15\16.