The key to efficient data processing is handling rows of data in batches, rather than one row at a time. Older, file-oriented databases utilized the latter method, to their detriment. When SQL relational databases came on the scene, they provided a query grammar that was set-based, declarative and much more efficient. That was an improvement that's stuck with us.
But as evolved as we are at the query level, when we go all the way down to central processing units (CPUs) and the native code that runs on them, we are often still processing data using the much less efficient row-at-a-time approach. And because so much of analytics involves applying calculations over huge (HUGE) sets of data rows, this inefficiency has a massive, negative impact on the performance of our analytics engines.
Bundle up
So what do we do? Analytics platform company Dremio[1] is today announcing a new Apache-licensed open source technology, officially dubbed the "Gandiva Project for Apache Arrow," that can evaluate data expressions and compile them into efficient native code that processes data in batches.
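To make the idea of "evaluate an expression once, then run it over batches" concrete, here is a minimal sketch in plain Python. It is not Gandiva's API: the tuple-based expression format and the names `compile_expr` and `evaluate` are hypothetical, and Python closures stand in for the native code a real engine would generate. The point is only the shape of the work: the expression is analyzed once, and the resulting function is then applied to whole batches of values rather than to one row at a time.

```python
import operator

# Supported binary operators (an illustrative subset, chosen for this sketch).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def compile_expr(expr):
    """Turn a nested-tuple expression into a function over column batches.

    expr is ("col", name), ("lit", value), or (op, left, right).
    The tree is walked once here; the returned closure does no parsing.
    """
    kind = expr[0]
    if kind == "col":
        name = expr[1]
        return lambda batch, n: batch[name]
    if kind == "lit":
        value = expr[1]
        return lambda batch, n: [value] * n
    op = OPS[kind]
    left, right = compile_expr(expr[1]), compile_expr(expr[2])
    return lambda batch, n: [op(a, b) for a, b in zip(left(batch, n), right(batch, n))]

def evaluate(compiled, batch):
    """Apply a compiled expression to one batch (a dict of equal-length columns)."""
    n = len(next(iter(batch.values())))
    return compiled(batch, n)

# Expression: (price * qty) + 1.0, compiled once and reused for every batch.
expr = ("+", ("*", ("col", "price"), ("col", "qty")), ("lit", 1.0))
compiled = compile_expr(expr)

batches = [
    {"price": [10.0, 4.5], "qty": [3, 10]},
    {"price": [20.0], "qty": [1]},
]
results = [evaluate(compiled, b) for b in batches]
```

In this toy version the per-batch work is still an interpreted Python loop. A real engine would replace that inner loop with generated machine code, which is where the performance gain comes from.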
Dremio has been working hard on this problem for a while, actually. Even before the company emerged out of stealth, it captained the development of Apache Arrow[2] to solve one part of the problem. Arrow helps with the representation of data in columnar format, in memory. This, in turn, allows whole series of like numbers to be processed in bulk, by a class of CPU instructions called SIMD (single instruction, multiple data), using an approach to working with data called vector processing.
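The difference between row-oriented and columnar layout can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Arrow's actual memory format or API: the sample records and the helper names `to_columns` and `revenue_columnar` are made up for the example. Pure Python also does not issue SIMD instructions itself. What the sketch shows is the layout that makes them applicable: each field's values sit together in one contiguous, same-typed buffer.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"price": 10.0, "qty": 3},
    {"price": 4.5, "qty": 10},
    {"price": 20.0, "qty": 1},
]

def to_columns(rows):
    """Pivot records into one contiguous typed buffer per field."""
    return {
        "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
        "qty": array("q", (r["qty"] for r in rows)),      # 64-bit signed ints
    }

def revenue_columnar(cols):
    """Compute price * qty by walking two like-typed columns in step."""
    return array("d", (p * q for p, q in zip(cols["price"], cols["qty"])))

columns = to_columns(rows)
revenue = revenue_columnar(columns)
```

Because every value in a column has the same type and width, native code can load several adjacent values into one wide register and multiply them with a single instruction, which is the vector processing described above.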
Efficiency experts
Even though SIMD instructions were introduced by