Given the media spotlight on artificial intelligence[1], it shouldn't be surprising that AI will be all over the keynote and session schedule[2] for this year's Spark Summit[3].
The irony, of course, is that while Spark has become known as a workhorse for data engineering workloads, its original claim to fame was that it put machine learning[4] on the same engine as SQL, streaming, and graph. But Spark has also had its share of impedance mismatch issues, such as making R[5] and Python programs[6] first-class citizens, or adapting to more compute-intensive processing of AI models[7]. Of course, that hasn't stopped adventurous souls from breaking new ground[8].
Hold those thoughts for a moment.
Databricks[9], the company whose founders created the Apache Spark project[10], has sought to ride Spark's original claim to fame as a unified compute engine by billing itself as a unified analytics platform. Over the past year, Databricks has addressed some gaps -- such as Delta, which added the long-missing persistence layer to its cloud analytics service -- and expanded its reach with Azure Databricks.
This week at Spark Summit, Databricks is announcing that Delta[11] will hit general release later this month. The guiding notion behind Delta's reliability improvements is that it provides a landing zone for data pipelines, offering a more scalable option for staging and manipulating data than DataFrame or RDD constructs, which were never meant for anything beyond marshalling data for processing.
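To make that distinction concrete, here is a minimal sketch of the landing-zone pattern, assuming a Delta-enabled Spark session and hypothetical storage paths and column names (none of these come from Databricks' announcement); it shows raw data being persisted to a Delta table rather than held in a transient DataFrame between pipeline stages.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-landing-zone").getOrCreate()

# Read incoming raw events (hypothetical source path).
raw_events = spark.read.json("/mnt/raw/events/")

# Land the data in a Delta table: files are persisted (columnar Parquet plus a
# transaction log), so the staging area survives job restarts and supports
# concurrent readers, unlike an in-memory DataFrame or RDD.
raw_events.write.format("delta").mode("append").save("/mnt/delta/events_landing")

# Downstream stages read from the Delta table instead of recomputing the
# upstream DataFrame lineage.
landed = spark.read.format("delta").load("/mnt/delta/events_landing")
cleaned = landed.dropDuplicates(["event_id"]).filter("event_type IS NOT NULL")
cleaned.write.format("delta").mode("overwrite").save("/mnt/delta/events_clean")
```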
Delta is not a data warehouse in that it stores data as columnar