Introduction:
What did we have till now on Azure?
- For auto-deployed or long-running clusters, we are used to working with Azure VMs that have HDP installed, i.e. HDInsight clusters. These give more control over configuration and compute-capacity management. The cluster is fairly static, and the auto-scaling feature is rudimentary: only data nodes of a pre-specified size can be added or removed, and only manually. Quick overview of Azure offerings on the scale of ease of use and reduced administration (read: cluster control):
What is this Azure-Databricks now?
- Imagine a world with no Hadoop and a holistic data-compute architecture that decouples storage from compute for cloud-based applications. You are now imagining Azure Databricks. Built for Spark on the cloud and just that, Azure Databricks is a good start for compute-only (ephemeral, if you will) clusters. A snippet from the Azure blog on the benefits of using this offering:
It differs from HDI in that HDI is a PaaS-like experience that allows working with many more OSS tools at a less expensive cost. Databricks advantage is it is a Software-as-a-Service-like experience (or Spark-as-a-service) that is easier to use, has native Azure AD integration (HDI security is via Apache Ranger and is Kerberos based), has auto-scaling and auto-termination (like a pause/resume), has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark. Note that all clusters within the same workspace share data among all of those clusters.
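To make the auto-scaling and auto-termination points concrete, here is a minimal sketch of a cluster definition that asks for both. The field names (`autoscale`, `min_workers`, `max_workers`, `autotermination_minutes`) follow the Databricks Clusters REST API as I understand it; the helper `build_cluster_spec`, the cluster name, and the placeholder values for `spark_version` and `node_type_id` are assumptions for illustration and must be replaced with values valid in your workspace. The snippet only builds and prints the request body; it does not call any service.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes,
                       spark_version="<spark-version>",
                       node_type_id="<node-type-id>"):
    """Build a cluster request body with auto-scaling and auto-termination.

    Hypothetical helper: the placeholders for spark_version and node_type_id
    are not real values and must be filled in for an actual workspace.
    """
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    if idle_minutes < 0:
        raise ValueError("idle_minutes must not be negative")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # Worker count floats between these bounds depending on load.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # The cluster shuts itself down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("ephemeral-etl", min_workers=2, max_workers=8,
                          idle_minutes=30)
print(json.dumps(spec, indent=2))
```

Compared with an HDInsight cluster, where nodes are added or removed by hand, the bounds and the idle timeout here are declared once and the service is expected to act on them.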
Architecture for Azure-Databricks
Key things to note (pros & cons)
Quick cluster setup
Cluster usability
Serverless
Total Azure Integration (a snippet from databricks docs)
- Diversity of VM types: Customers can use all existing VMs: F-series for machine learning scenarios, M-series for massive memory scenarios, D-series for general purpose, etc.
- Security and Privacy: In Azure, ownership and control of data is with the customer. We have built Azure Databricks to adhere to these standards. We aim for Azure Databricks to provide all the compliance certifications that the rest of Azure adheres to.
- Flexibility in network topology: Customers have a diversity of network infrastructure needs. Azure Databricks supports deployments in customer VNETs, which can control which sources and sinks can be accessed and how they are accessed.
- Azure Storage and Azure Data Lake integration: these storage services are exposed to Databricks users via DBFS to provide caching and optimized analysis over existing data.
- Azure Power BI: Users can connect Power BI directly to their Databricks clusters using JDBC in order to query data interactively at massive scale using familiar tools.
- Azure Active Directory provides access control for resources and is already in use in most enterprises. Azure Databricks workspaces deploy in customer subscriptions, so naturally AAD can be used to control access to sources, results and jobs.
- Azure SQL Data Warehouse, Azure SQL DB and Azure CosmosDB: Azure Databricks easily and efficiently uploads results into these services for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure.
- Internally, we use Azure Container Services to run the Azure Databricks control-plane and data-planes via containers.
- Accelerated Networking provides the fastest virtualized network infrastructure in the cloud. Azure Databricks utilizes this to further improve Spark performance.
- The latest generation of Azure hardware (Dv3 VMs), with NVMe SSDs capable of blazing 100 µs latency on IO. These make Databricks I/O performance even better.
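As an illustration of the Azure Storage integration item in the list above, the sketch below prepares the arguments for mounting a Blob Storage container under DBFS. It assumes the `wasbs://` URI scheme and the `fs.azure.account.key.<account>.blob.core.windows.net` configuration key used by the Hadoop Azure connector; the helper `mount_args`, the container name `raw`, the account name `examplestore`, and the key placeholder are hypothetical. The actual mount call is left as a comment because `dbutils` exists only inside a Databricks notebook.

```python
def mount_args(container, account, account_key, mount_point):
    """Assemble arguments for mounting a Blob Storage container on DBFS.

    Hypothetical helper: it only builds strings and performs no I/O.
    """
    host = f"{account}.blob.core.windows.net"
    return {
        # Container addressed through the wasbs scheme (assumed here).
        "source": f"wasbs://{container}@{host}",
        # DBFS path under which the container contents become visible.
        "mount_point": mount_point,
        # Storage account key passed as a Hadoop configuration entry.
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


args = mount_args("raw", "examplestore", "<storage-account-key>", "/mnt/raw")
print(args["source"])

# Inside a Databricks notebook, where `dbutils` is predefined, the mount
# would then be (not runnable outside Databricks):
# dbutils.fs.mount(**args)
```

In practice the account key would come from a secret store and not from a literal in the notebook.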
1) https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-databricks/
2) https://azure.microsoft.com/en-us/services/databricks/
3) https://databricks.com/product/azure
4) https://docs.azuredatabricks.net/user-guide/clusters/configure.html