Databricks, the company set up by the inventors of Apache Spark to commercialise their creation, has announced a new addition to its Databricks Unified Analytics Platform called Delta, which it claims combines the advantages of the data lake and data warehousing approaches to storing data at scale, while eliminating some of the disadvantages of each.
Until fairly recently, a central store of trustworthy data was generally maintained in a centralised enterprise data warehouse (EDW). However, this approach suffers from a lack of scalability, proprietary data formats and slow ETL processes, making it unsuitable for real-time analytics and machine learning.
These problems were tackled by the more modern data lake approach, which allows all sorts of data to be stored in the same ‘bucket’ (often on Hadoop), scales elastically and inexpensively thanks to its distributed architecture, uses open formats and allows for real-time querying.
The downside of data lakes, however, is that data is generally inconsistent when stored, requiring clean-up after the fact. The lack of an enforced schema also makes the data unreliable for some sorts of analytics, and performance is frequently slower than an EDW's. Meanwhile, streaming analytics allows for fast ingestion and analysis of data, but with obvious issues when it comes to historical data.
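The schema problem described above can be illustrated with a minimal sketch in plain Python (this is illustrative only, not Databricks or Delta code; the schema and function names are invented for the example). Enforcing a schema at write time rejects malformed records at ingestion, which is the kind of consistency guarantee traditional data lakes lack:

```python
# Illustrative sketch only -- not Databricks/Delta code. Shows why
# schema-on-write matters: malformed records are quarantined at
# ingestion instead of silently landing in the lake and breaking
# analytics later.

SCHEMA = {"user_id": int, "event": str, "amount": float}

def validate(record, schema=SCHEMA):
    """Return True if the record has exactly the expected fields and types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], ftype)
               for field, ftype in schema.items())

def ingest(records, schema=SCHEMA):
    """Split incoming records into accepted rows and quarantined rejects."""
    accepted, rejected = [], []
    for rec in records:
        (accepted if validate(rec, schema) else rejected).append(rec)
    return accepted, rejected

good = {"user_id": 1, "event": "click", "amount": 0.5}
bad = {"user_id": "oops", "event": "click"}  # wrong type, missing field
accepted, rejected = ingest([good, bad])
```

Without such a check, the malformed record would simply be stored alongside the good one, to be discovered only when a downstream query or model fails.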
Combining these systems has always required something of a fudge.
“For a while now people have been boasting about the amount of data in their data lakes,” Databricks CEO and co-founder Ali Ghodsi told Computing at the Spark Summit in Dublin. “‘Oh, I’ve got one petabyte, hey I’ve got two petabytes,’ but the problem comes when management says ‘hey, what are you going to do with that data, what sort of insight can you give me?'”
For applications designed to extract insight, and for programming AI and machine learning applications in particular (because of the number of iterations required to test and train the algorithms), it is vital that information can be both stored and extracted quickly and that the data is clean and accurate.
Delta is aimed at addressing the problems of the data engineer, the person responsible for cleaning, verifying, sorting, extracting and transforming data to make it ready for analysis.
“The hard thing about doing AI isn’t writing the algorithms, it’s preparing all the data to make sure they work properly and reliably. It’s very hard to build production AI on top of existing data lakes,” Ghodsi said, pointing to a statistic that 60 per cent of a data science team’s time is spent cleaning data and only a tiny fraction on coding algorithms. “They spend far too much time cleaning data.”
Asked about the name Delta, Ghodsi laughed. “We spent far too much time deciding on that,” he said, explaining that a river delta fits in with the analogy of a lake, that storing metadata along with the data allows only changed information to be processed rather than having to crunch through the whole lot, and that the triangle of the Greek symbol represents the three principles behind the technology.
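The idea Ghodsi describes, keeping metadata alongside the data so that only the changed portion (the "delta") is reprocessed, can be sketched in plain Python. This is a conceptual illustration under assumed names, not Delta's actual API or storage format:

```python
# Illustrative sketch, not Delta's actual API: stamping each write with
# a version number (metadata stored alongside the data) lets a job
# process only rows added since its last checkpoint, instead of
# rescanning the whole table.

table = []           # the "lake table": (version, record) pairs
current_version = 0

def append(records):
    """Write new records, stamping the batch with a fresh version number."""
    global current_version
    current_version += 1
    table.extend((current_version, rec) for rec in records)

def read_changes(since_version):
    """Return only records written after the given version -- the 'delta'."""
    return [rec for ver, rec in table if ver > since_version]

append(["a", "b"])             # first batch, version 1
checkpoint = current_version   # downstream job has processed up to here
append(["c"])                  # second batch, version 2
new_rows = read_changes(checkpoint)  # only the change since the checkpoint
```

The downstream consumer touches one new row rather than all three, and the saving grows with the size of the historical data it would otherwise rescan.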
“It combines the reliability and performance of data warehouses with the scale of data lakes and the low latency of streaming systems.”
Ghodsi maintains that this combination is unique in an increasingly crowded field of big data platforms that all aim, in different ways, to bring data processing and analytics at scale into real-time (see also MapR, Cloudera, Hortonworks, Datastax and cloud offerings from all the big players). The aim of Databricks from the start has been to bring the capabilities of the “one per cent – the Googles and Facebooks and Twitters – to the 99 per cent who don’t have thousands of engineers to work on this stuff,” he said.
“We’ve been looking at solving some of the issues with the way that Spark is deployed. There are other ways of doing what we’ve done here, and enterprises have been putting the different pieces together in their own way, but none of them is simple.”