Azure-Databricks

Introduction:
What did we have till now on Azure?

For auto-deployed or long-running clusters, we are used to working with Azure VMs that have HDP installed, aka HDInsight clusters. These give more control over configuration and compute-capacity management. The cluster is fairly static, and the auto-scaling feature is rudimentary: only data nodes of a pre-specified size can be added or removed, and only manually.

Quick overview of Azure offerings, on a scale of ease of use and reduced administration (read: cluster control).

What is this Azure-Databricks now?
Imagine a world with no Hadoop and a holistic data-compute architecture that decouples storage and compute for cloud-based applications. You are now imagining Azure-Databricks. Built for Spark on the cloud and just that, Azure-Databricks is a good starting point for compute-only (ephemeral, if you will) clusters. A snippet from the Azure blog on the benefits of using this offering:

It differs from HDI in that HDI is a PaaS-like experience that allows working with many more OSS tools at a less expensive cost. Databricks advantage is it is a Software-as-a-Service-like experience (or Spark-as-a-service) that is easier to use, has native Azure AD integration (HDI security is via Apache Ranger and is Kerberos based), has auto-scaling and auto-termination (like a pause/resume), has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark. Note that all clusters within the same workspace share data among all of those clusters.

Architecture for Azure-Databricks

Key things to note (pros & cons)

Quick cluster setup
Pros:
- It takes about 3-5 minutes to spin up a Databricks cluster.
- A cluster can be 'paused' when not in use and resumed programmatically.
- Pricing is per minute.
- Two cluster types: interactive clusters (targeted at data-science applications) and job clusters (batch jobs).
Cons / notes:
- There is no notion of default/primary storage for a cluster.
- Only Spark services are installed and active.
- Databricks Spark is 40% faster than open-source Spark.

Cluster usability
Pros:
- Each user has their own container: SQLContext is not shared, SparkContext is.
- Blob/ADLS storage can be mounted to the Databricks File System (DBFS) to abstract file-system calls.
- Automatic Spark UI and Spark log aggregation. We currently lose the Spark UI when we use ephemeral clusters on EMR/HDInsight.
- Security is through Azure AD integration only. No Ranger.
Cons / notes:
- VNET support is to come in the coming quarter.
- Relies on Spark data sources for the types of sources/targets it can support.
- One cluster per job is recommended.

Serverless
Pros:
- Pessimistic approach for the interactive cluster type, wherein node dissociation is aggressively managed.
- Optimistic approach for the job cluster type, wherein node association is aggressive.
- Query Watchdog
- Driver fault isolation
- Preemption of jobs
- Auto-scaling reference: https://docs.azuredatabricks.net/user-guide/clusters/sizing.html
Cons / notes:
- The VM size of the node that gets attached cannot be changed on the fly.
- Auto-scaling is managed entirely by the Databricks workload algorithm; it is not possible to provide a policy to override it.
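To make the auto-scaling and auto-termination points above concrete, here is a minimal sketch of a cluster specification. It assumes the field names of the Databricks Clusters REST API (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`); the helper name `build_cluster_spec`, the cluster name, the VM size, and the runtime version string are illustrative assumptions, not values taken from this post. The sketch only builds and validates the payload; it does not call the service.

```python
import json


def build_cluster_spec(name, node_type_id, min_workers, max_workers,
                       autotermination_minutes, spark_version):
    """Build a cluster-spec dict (hypothetical helper, for illustration only).

    Field names are assumed to match the Databricks Clusters REST API;
    check them against the current docs before relying on them.
    """
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    if autotermination_minutes < 0:
        raise ValueError("autotermination_minutes must be >= 0")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        # The VM size is fixed per cluster; only the worker count scales.
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Idle minutes after which the cluster is shut down.
        "autotermination_minutes": autotermination_minutes,
    }


spec = build_cluster_spec(
    name="interactive-demo",           # assumed cluster name
    node_type_id="Standard_DS3_v2",    # assumed VM size
    min_workers=2,
    max_workers=8,
    autotermination_minutes=30,
    spark_version="4.0.x-scala2.11",   # assumed runtime label
)
print(json.dumps(spec, indent=2))
```

Note that the spec only gives bounds for the worker count. Consistent with the notes above, when and how nodes are added or removed within those bounds is left to Databricks.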

Total Azure Integration (a snippet from databricks docs)

Diversity of VM types: Customers can use all existing VMs: F-series for machine learning scenarios, M-series for massive memory scenarios, D-series for general purpose, etc.
Security and Privacy: In Azure, ownership and control of data is with the customer. We have built Azure Databricks to adhere to these standards. We aim for Azure Databricks to provide all the compliance certifications that the rest of Azure adheres to.
Flexibility in network topology: Customers have a diversity of network infrastructure needs. Azure Databricks supports deployments in customer VNETs, which can control which sources and sinks can be accessed and how they are accessed.
Azure Storage and Azure Data Lake integration: these storage services are exposed to Databricks users via DBFS to provide caching and optimized analysis over existing data.
Azure Power BI: Users can connect Power BI directly to their Databricks clusters using JDBC in order to query data interactively at massive scale using familiar tools.
Azure Active Directory provides access control for resources and is already in use in most enterprises. Azure Databricks workspaces deploy in customer subscriptions, so naturally AAD can be used to control access to sources, results and jobs.
Azure SQL Data Warehouse, Azure SQL DB and Azure CosmosDB: Azure Databricks easily and efficiently uploads results into these services for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure.

Internally, we use Azure Container Services to run the Azure Databricks control-plane and data-planes via containers.
Accelerated Networking provides the fastest virtualized network infrastructure in the cloud. Azure Databricks utilizes this to further improve Spark performance.
The latest generation of Azure hardware (Dv3 VMs) comes with NVMe SSDs capable of blazing 100 µs latency on I/O. These make Databricks I/O performance even better.
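The storage integration mentioned above (Blob storage exposed through DBFS) can be sketched as a mount. This is a minimal sketch, assuming the `dbutils.fs.mount` utility available inside Databricks notebooks and the `wasbs://` URI scheme for Blob storage. The helper name `build_blob_mount_args`, the storage account, the container, and the file path are hypothetical. The helper only assembles the arguments, so it can be run anywhere; the actual mount call is shown as a comment because `dbutils` exists only on a cluster.

```python
def build_blob_mount_args(account, container, mount_name, account_key):
    """Return keyword arguments for dbutils.fs.mount (hypothetical helper)."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": f"/mnt/{mount_name}",
        # Access key for the storage account, passed as a Hadoop config entry.
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


args = build_blob_mount_args(
    account="mystorageacct",               # assumed storage account
    container="raw-data",                  # assumed container
    mount_name="raw",
    account_key="<storage-account-key>",   # placeholder; keep real keys in a secret store
)
print(args["source"])
print(args["mount_point"])

# Inside a Databricks notebook, where `dbutils` and `spark` are predefined:
#   dbutils.fs.mount(**args)
#   df = spark.read.csv("/mnt/raw/events.csv", header=True)  # hypothetical file
```

Once mounted, jobs read and write through ordinary `/mnt/...` paths, which is the file-system abstraction described earlier.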

References:
1) https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-databricks/
2) https://azure.microsoft.com/en-us/services/databricks/
3) https://databricks.com/product/azure
4) https://docs.azuredatabricks.net/user-guide/clusters/configure.html
