Simplify your data lake. Simplify your data architecture. Simplify your data engineering.
Powered by Delta Lake, Databricks combines the best of data warehouses and data lakes into a lakehouse architecture, giving you one platform to collaborate on all of your data, analytics and AI workloads.
Apache Spark has a well-defined, layered architecture in which all of the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries. The Apache Spark architecture is built on two main abstractions:
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
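The key idea behind these two abstractions is lazy evaluation: transformations on an RDD only record lineage in a DAG, and nothing executes until an action is called. Here's a minimal sketch of that pattern in plain Python (no Spark required); the `LazyRDD` class is purely illustrative and is not part of the Spark API.

```python
# Illustrative sketch of RDD-style lazy evaluation: transformations build
# a lineage DAG; only an action (collect) triggers execution.
class LazyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data        # only set on the source node
        self._parent = parent    # lineage edge in the DAG
        self._fn = fn            # deferred transformation

    def map(self, f):
        # Transformation: records a new DAG node, computes nothing yet.
        return LazyRDD(parent=self, fn=lambda it: (f(x) for x in it))

    def filter(self, pred):
        # Transformation: also deferred.
        return LazyRDD(parent=self, fn=lambda it: (x for x in it if pred(x)))

    def collect(self):
        # Action: walks the lineage back to the source and executes the DAG.
        if self._parent is None:
            return list(self._data)
        return list(self._fn(self._parent.collect()))

rdd = LazyRDD(data=range(10))
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 10).collect()
print(result)  # [12, 14, 16, 18]
```

In real Spark, the equivalent chain (`sc.parallelize(range(10)).map(...).filter(...).collect()`) behaves the same way: the DAG scheduler only plans and runs stages when `collect()` is invoked.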
Microsoft Mechanics learns how UK-based data engineering consultancy endjin is evaluating Azure Synapse for on-demand serverless compute and querying.
Endjin specializes in big data analytics solutions for customers across a range of industries, including ocean research, financial services, and retail.
Host Jeremy Chapman speaks with Jess Panni, Principal and Data Architect at endjin, to discuss how they’re using SQL serverless for on-demand compute as well as visualization capabilities to help customers with big data challenges. If you are new to Azure Synapse, it’s Microsoft’s limitless analytics platform that brings enterprise data warehousing and big data processing together into a single service, removing the traditional constraints for analyzing data of all shapes and sizes.
For more information on endjin and how they help small teams achieve big things, check out their website at https://endjin.com
Databricks recently streamed this tech chat on SCD, or Slowly Changing Dimensions.
We will discuss a fundamental concept in online analytical processing (OLAP) – slowly changing dimensions (SCD) – specifically Type 2.
As we have discussed in various other Delta Lake tech talks, the reliability that Delta Lake brings to data lakes has driven a resurgence of many data warehousing fundamentals, such as Change Data Capture, in data lakes.
Type 2 SCD in data warehousing lets you keep track of both historical and current data over time. We will discuss how to apply these concepts to your data lake in the context of market segmentation for a climbing eCommerce site.
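To make the Type 2 mechanics concrete, here is a hedged sketch of the core logic in plain Python: when a tracked attribute changes, the current row is closed out and a new current row is appended, so history survives. This is the same effect a Delta Lake `MERGE` achieves at scale; the field names (`customer_id`, `segment`, `valid_from`, `valid_to`, `is_current`) are illustrative, not taken from the talk.

```python
from datetime import date

def scd2_upsert(dim_rows, key, new_segment, effective_date):
    """Type 2 upsert: close the current row for `key` if its tracked
    attribute changed, then append a new current row."""
    for row in dim_rows:
        if row["customer_id"] == key and row["is_current"]:
            if row["segment"] == new_segment:
                return dim_rows  # no change: nothing to do
            # Close out the old version, preserving history.
            row["is_current"] = False
            row["valid_to"] = effective_date
    dim_rows.append({
        "customer_id": key,
        "segment": new_segment,
        "valid_from": effective_date,
        "valid_to": None,          # open-ended: this is the current row
        "is_current": True,
    })
    return dim_rows

dim = [{"customer_id": 1, "segment": "casual",
        "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_upsert(dim, 1, "alpinist", date(2021, 6, 1))
# dim now holds two rows: the closed-out history row and the new current row.
```

In Delta Lake this pattern is typically expressed declaratively with a `MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED ...` statement against the dimension table.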
ThorogoodBI explores the use of Databricks for data engineering purposes in this webinar.
Whether you’re looking to transform and clean large volumes of data or collaborate with colleagues to build advanced analytics jobs that can be scaled and run automatically, Databricks offers a Unified Analytics Platform that promises to make your life easier.
In the second of two recorded webcasts, Thorogood Consultants Jon Ward and Robbie Shaw showcase Databricks' data transformation and data movement capabilities, show how the tool aligns with cloud computing services, and highlight the security, flexibility, and collaboration aspects of Databricks. We'll also look at Databricks Delta Lake and how it offers improved storage for both large-scale datasets and real-time streaming data.
Here’s a question for the ages and the wise old sages.
Although there are many similarities between software development and data science, they also have three main differences: processes, tooling, and behavior. In my previous article, I talked about model governance and holistic model management. I received a great response, along with some questions about the […]
Here’s an interesting article from CodeProject defining the cycles of data science and how they relate to business cycles and the fairly well-established framework of the SDLC. Although some will argue that data science is “pure science” and this cycle belongs under the “data engineering” label, organizations that fail to move innovations efficiently from “the lab” to production are not going to be competitive.
By its simple definition, data science is a multi-disciplinary field comprising multiple processes for extracting knowledge or useful output from input data. The output may be predictive or descriptive analysis, a report, business intelligence, etc. Data science has well-defined lifecycles, just like any other project, and CRISP-DM and TDSP are two of the proven standards.
Data integration is complex, with many moving parts. It helps organizations combine data and complex business processes in hybrid data environments. Failures are common in data integration workflows, whether due to data not arriving on time, code issues in your pipelines, infrastructure problems, and so on.
A common requirement is the ability to rerun failed activities within data integration workflows. In addition, sometimes you need to rerun activities to re-process data due to an error upstream in data processing. Azure Data Factory now enables you to rerun the entire pipeline or choose to rerun downstream from a particular activity inside a pipeline.
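The idea of rerunning downstream from a failed activity can be sketched in a few lines of plain Python; this is purely illustrative and is not Azure Data Factory's API, just a model of which activities get re-executed.

```python
# Illustrative model of "rerun from activity": given a linear pipeline and
# the activity that failed, only that activity and everything downstream
# are re-executed; completed upstream activities are skipped.
def rerun_from(activities, failed_activity):
    idx = next(i for i, a in enumerate(activities) if a == failed_activity)
    return activities[idx:]

pipeline = ["ingest", "validate", "transform", "load"]
print(rerun_from(pipeline, "transform"))  # ['transform', 'load']
```

Rerunning the entire pipeline is then just the special case of rerunning from the first activity.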