Given the media spotlight on artificial intelligence[1], it shouldn't be surprising that AI will be all over the keynote and session schedule[2] for this year's Spark Summit[3].
The irony, of course, is that while Spark has become known as a workhorse for data engineering workloads, its original claim to fame was that it put machine learning[4] on the same engine as SQL, streaming, and graph. But Spark has also had its share of impedance mismatch issues, such as making R[5] and Python programs[6] first-class citizens, or adapting to more compute-intensive processing of AI models[7]. Of course, that hasn't stopped adventurous souls from breaking new ground[8].
Hold those thoughts for a moment.
Databricks[9], the company whose founders created the Apache Spark project[10], has sought to ride Spark's original claim to fame as a unified compute engine by billing itself as a unified analytics platform. Over the past year, Databricks has addressed some gaps -- such as Delta, which added the long-missing persistence layer to its cloud analytics service -- and expanded its reach with Azure Databricks.
This week at Spark Summit, Databricks is announcing that Delta[11] will hit general release later this month. The guiding notion behind Delta's reliability improvements is that it provides a landing zone for data pipelines, offering a more scalable option for staging and manipulating data than DataFrame or RDD constructs, which were never meant for anything beyond marshalling data for processing.
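To make that distinction concrete, here is a minimal sketch of the landing-zone pattern, assuming a Delta-enabled Spark session and hypothetical storage paths and column names (none of these come from Databricks' announcement); it shows raw data being persisted to a Delta table rather than held in a transient DataFrame between pipeline stages.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-landing-zone").getOrCreate()

# Read incoming raw events (hypothetical source path).
raw_events = spark.read.json("/mnt/raw/events/")

# Land the data in a Delta table: files are persisted (columnar Parquet plus a
# transaction log), so the staging area survives job restarts and supports
# concurrent readers, unlike an in-memory DataFrame or RDD.
raw_events.write.format("delta").mode("append").save("/mnt/delta/events_landing")

# Downstream stages read from the Delta table instead of recomputing the
# upstream DataFrame lineage.
landed = spark.read.format("delta").load("/mnt/delta/events_landing")
cleaned = landed.dropDuplicates(["event_id"]).filter("event_type IS NOT NULL")
cleaned.write.format("delta").mode("overwrite").save("/mnt/delta/events_clean")
```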
Delta is not a data warehouse in that it stores data as columnar