ADF Mapping Data Flows for the impatient — Introduction
TL;DR: ADF Mapping Data Flow is a new feature of Azure Data Factory (ADF) that lets users build data transformation pipelines (ELT jobs) using a graphical user interface. Essentially, Mapping Data Flow generates Spark code from the visual pipeline and executes it on Spark at scale, without requiring you to write a line of code, while adding the advantages of a GUI for pipeline management, data lineage, query push-down and, most importantly, embedding within existing ADF pipelines like any other activity.
How is Mapping Data Flow different from ADF?
To answer this question we first need to look briefly at the idea behind ADF and how it works.
ADF is built with the sole responsibility of orchestrating and scheduling data pipelines; it is not a data transformation tool. In other words, Microsoft intentionally separated the role of the job scheduler from that of the data processing platform. The advantages of this approach are:
- Leave every task to the platform that is best at it. Scheduling and orchestration are done by ADF, while depending on the type of transformation you can choose whichever tool does it best.
- Full flexibility to combine various processing engines in the same pipeline depending on the nature of each task, so that the pipeline is both technically and commercially optimal. For example, if a data pipeline starts with an event-driven data source, the preliminary data acquisition and processing can be done in an Azure Function while the primary data transformation happens on a dedicated data platform such as Azure Databricks or Azure Data Lake Analytics.
At the end of the day, this means ADF remains a very light, non-compute-intensive application and, as a result, very cost effective.
How about the Copy activity?
The only exception, where ADF brings its own compute power and relies on its Integration Runtime, is the Copy Data activity. The Copy activity moves data from a source to a target; the source and target can be any supported data store, whether on-premises, in Azure or in other clouds. Once the data arrives at the destination, the rest of the processing is handed over to one of the dedicated data processing platforms described above.
Now it is very easy to answer the question above: as far as ADF is concerned, Mapping Data Flow is just another "dedicated" data processing activity.
ADF Mapping Data Flow concepts
Mapping Data Flow is built around the concept of data streams (the same concept used in Apache NiFi). Simply put, every data stream starts with a source, data then flows through a series of transformations, and finally arrives at a data sink.
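To make the source → transformations → sink model concrete, here is a minimal plain-Python sketch of that shape. This is only an illustration of the concept; the function names and sample rows are invented for this example, and in reality Mapping Data Flow generates Spark code rather than anything like this.

```python
# Conceptual sketch of a data stream: source -> transformations -> sink.
# All names and sample data here are illustrative, not ADF APIs.

def source():
    # Read rows from somewhere; here, a hard-coded sample.
    return [
        {"id": 1, "amount": 120.0},
        {"id": 2, "amount": -15.0},
        {"id": 3, "amount": 42.5},
    ]

def transform(rows):
    # A chain of transformations: filter out invalid rows,
    # then derive a new column from an existing one.
    valid = [r for r in rows if r["amount"] >= 0]
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in valid]

def sink(rows):
    # Write to a target data store; here we simply return the rows.
    return rows

result = sink(transform(source()))
```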
Source and sink
The data source is where data is first read from and staged in Spark for further transformation. At this time there are four datasets available as part of the private preview, with many more to be released in the near future:
- Azure SQL DB
- Azure SQL DW
- Parquet: if you are not familiar with the Parquet file format, I suggest having a look at some of the resources available on the web. In a nutshell, Apache Parquet is a binary, column-oriented file format. The primary advantage of binary file formats over delimited or flat text files is that they include metadata (column names and types) alongside the data. For instance, if we offload information from a database to Parquet files, the column names and types are included in the data files. Also, being a binary format, it eliminates the classic problems of the separator appearing inside a data field and the need for escape characters.
- Delimited Text (e.g. CSV)
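The separator-in-the-data problem mentioned for delimited text can be shown with nothing but the standard library. In the sketch below (sample values are invented), a comma inside a field forces the CSV writer to quote it, and naively splitting the raw line mis-parses the row; binary formats such as Parquet avoid this entirely because values are not separator-delimited.

```python
import csv
import io

# A row whose second field contains the delimiter itself.
row = ["42", "Smith, John", "2019-01-01"]

# Writing it as CSV forces quoting around the offending field.
buf = io.StringIO()
csv.writer(buf).writerow(row)
line = buf.getvalue().strip()
print(line)  # 42,"Smith, John",2019-01-01

# Naive parsing breaks: splitting on "," yields 4 pieces, not 3.
naive = line.split(",")

# A proper CSV parser that honours quoting recovers the original 3 fields.
parsed = next(csv.reader(io.StringIO(line)))
```

This is exactly the escaping machinery that a schema-carrying binary format makes unnecessary.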
Similarly, data sinks are the datasets used for writing the transformed data to the target data store.
Transformations
This is where the actual magic of Mapping Data Flows takes place. Transformations, as the name suggests, are the activities that perform the actual data transformation logic on a data stream. Below is a list of the available transformations; if you need to read more, you can refer to this page.
- New Branch: Create a new flow branch with the same data.
- Join: Join data from two streams based on a condition.
- Conditional Split: Route data into different streams based on conditions.
- Union: Collect data from multiple streams.
- Lookup: Look up additional data from another stream.
- Derived Column: Compute new columns based on existing ones.
- Aggregate: Calculate aggregations on the stream.
- Surrogate Key: Add a surrogate key column to the output stream, starting from a specific value.
- Pivot: Transform row values into individual columns.
- Unpivot: Transform column values into individual rows.
- Exists: Check the existence of data in another stream.
- Select: Choose which columns flow to the next stream.
- Filter: Filter rows in the stream based on a condition.
- Sort: Order data in the stream based on column(s).
- Alter Row: Mark rows as insert, update, delete or upsert based on conditions.
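To give a feel for the semantics of a few of these, here are plain-Python sketches of Filter, Derived Column, Conditional Split and Aggregate over a list of rows. The sample data and the 1.2 "tax" factor are invented for illustration; these sketches only mirror the behaviour conceptually and are not ADF's expression language.

```python
# Illustrative input stream (rows as dicts); values are made up.
stream = [
    {"region": "EU", "amount": 100},
    {"region": "US", "amount": 250},
    {"region": "EU", "amount": 50},
]

# Filter: keep only rows matching a condition.
filtered = [r for r in stream if r["amount"] >= 100]

# Derived Column: compute a new column from existing ones.
derived = [{**r, "amount_with_tax": r["amount"] * 1.2} for r in stream]

# Conditional Split: route rows into different output streams.
eu = [r for r in stream if r["region"] == "EU"]
other = [r for r in stream if r["region"] != "EU"]

# Aggregate: group by a column and sum another.
totals = {}
for r in stream:
    totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
```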
Getting started with Mapping Data Flows
Before proceeding, you will need to request access to the limited private preview of Mapping Data Flow using the web form (http://aka.ms/dataflowpreview). Once your Azure subscription is whitelisted, you will need to create an Azure Data Factory V2 instance in order to start building your data flow pipelines.
Note: you may have noticed that previously you needed to create an "ADF V2 with data flows" instance in order to access Mapping Data Flow, but this is no longer the case; it is now included in ADF V2.
When you enter the edit-and-monitor GUI of ADF, you will notice the new Data Flows option in the sidebar, similar to below. Creating a new data flow opens a new canvas where you can start building your data streams.
From here you can start building a new data flow by adding a data source and data transformations.
Azure Data Factory Mapping Data Flow has a debug mode, which can be switched on with the Debug button at the top of the design surface. With debug mode on, you can interactively watch the data shape transform while you build and debug your data flows.
When you turn on debug mode, ADF will create a Spark cluster under the hood and harness its compute power to let you preview the transformations in real time.
Note: in previous versions of Data Flows you needed to create an Azure Databricks cluster and attach it to ADF through a linked service. You no longer have to bring your own Databricks cluster; ADF manages cluster creation and tear-down. This means you do not even need an Azure Databricks workspace in order to work with Mapping Data Flows.
How I built an SCD Type 2 data pipeline using Azure Mapping Data Flows
With all the concepts of Mapping Data Flows covered, I thought a good exercise would be to build my own Slowly Changing Dimension Type 2 pipeline using this shiny new toy. Below is an overview of the pipeline; in the next post I will go deep into how to build a full data pipeline like this using Mapping Data Flows, so stay tuned!
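As a teaser before that post, here is a minimal plain-Python sketch of the core SCD Type 2 logic: expire the current version of a changed row and append a fresh version with new effective dates. The column names (surrogate key, effective dates, current flag) and sample data are assumptions made for this illustration, not the actual pipeline's schema; in the real data flow this kind of logic is expressed with transformations such as Lookup, Derived Column and Alter Row.

```python
# SCD Type 2 sketch: all schema/column names here are illustrative assumptions.
from datetime import date

def apply_scd2(dimension, incoming, today):
    """Expire changed current rows and append new versions."""
    current = {r["customer_id"]: r for r in dimension if r["is_current"]}
    next_key = max((r["sk"] for r in dimension), default=0) + 1
    for row in incoming:
        existing = current.get(row["customer_id"])
        if existing and existing["city"] == row["city"]:
            continue  # unchanged: nothing to do
        if existing:  # changed: expire the current version
            existing["is_current"] = False
            existing["end_date"] = today
        dimension.append({  # new or changed: insert a fresh version
            "sk": next_key, "customer_id": row["customer_id"],
            "city": row["city"], "start_date": today,
            "end_date": None, "is_current": True,
        })
        next_key += 1
    return dimension

# One existing dimension row, then a change for C1 and a brand-new C2.
dim = [{"sk": 1, "customer_id": "C1", "city": "Sydney",
        "start_date": date(2018, 1, 1), "end_date": None, "is_current": True}]
dim = apply_scd2(dim, [{"customer_id": "C1", "city": "Melbourne"},
                       {"customer_id": "C2", "city": "Perth"}],
                 date(2019, 2, 1))
```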