Security, Governance and CI/CD in Databricks

By:   |   Updated: 2022-02-14   |   Comments   |   Related: > Azure Databricks


Free SQL Server Performance and Monitoring Report


Dear Database Professional,

Download your free copy of the MSSQLTips.com SQL Server Performance and Monitoring Report. This survey was conducted in 2022 and polled 588 database professionals about various aspects of tuning and optimizing SQL Server.

Click here to download the free report

Problem

As big datasets within Lakehouses continue to grow, challenges around enforcing reliable data governance, security, and CI CD become more prevalent. The lowest levels of access controls within Data Lakes are typically granted at the file level which makes it difficult to provide row or columns level access to certain stakeholders. Also, when there is a need to alter the layout of the data, the corresponding security model will also need to be altered and updated. Some Lakehouses have corresponding meta stores such as the Hive Meta-store which may also end up being out of sync with data in the Lakehouse. SQL databases and ML Models may also be introduced into the Lakehouse as assets that will also have their own independent governance and security models which will need to be maintained and in-sync with up-to-date security models. As data products within the lake grows, organizations will need a way to share, govern and apply CI CD best promotion patterns for tables, files, dashboards, and models within the Lakehouse.

Solution

Security and Data Governance in Databricks includes the policies for securely managing data within your Lakehouse. Databricks offers robust capabilities around governing data at the row and column levels, and extends to covering cluster, pool, workspace, and job access controls which you will learn more about in this article. You will also learn about other access and visibility features including workspace, cluster, and job visibility controls. Finally, you will learn about Azure Purview and how it integrates with Databricks from a Data Governance standpoint along with tools that can be used to promote continuous integration and deployment best processes and best practices for Databricks.

Security & Governance

Databricks Unity Catalog solves many organizational challenges around sharing and governing of data in the Lakehouse by providing an interface to govern all Lakehouse data assets with a security model that is based on ANSI SQL GRANT command, along with federated access and easy integration with existing catalogs. It offers fine grained permissions on tables, views, rows, and columns from files persisted in the Lakehouse. It streamlines access to a variety of Lakehouse assets including Data Lake files, SQL Databases, Delta Shares, and ML Models. It also offers centralized audit logs for best practice compliance management standards.

Databricks offers numerous security features including tightly a coupled integration with the Azure platform with access through Azure Active Directory (AAD) credential pass-through. Databricks also supports SSO authentication for other identity and authentication methods for technologies such as Snowflake. Audit logging can be enabled to gain access to robust audit logs on actions and operations on the workspace.

Databricks offers numerous tools to safeguard and encrypt data securely through row and column level encryption, along with a variety of functions to sensitize PII data. Keys and secrets can be integrated with Azure Key Vault. Databricks also supports creating secret scopes which are stored in an encrypted database owned and managed by Databricks.

Databricks supports a number of compliance standards including GDPR, HIPAA, HITRUST and more. With its support for virtual networks (vNets) your infrastructure will have full control of network security rules and with Private Link, no data will be transmitted through public IP addresses.

When working with SQL in Databricks, the security model follows a similar pattern as you would find with traditional SQL databases which allows for setting of fine-grained access permissions using standard SQL statements such as GRANT REVOKE. With Access control lists (ACLs), administrators can configure and manage permissions to access Databricks SQL alerts, dashboards, data, queries, and SQL endpoints. The figure below shows a list of how the available visibility and access controls are presented within the Databricks Admin Console. The following list outlines some of these visibility and access controls:

  • Workspace Visibility Control: Prevents users from seeing objects in the workspace that they do not have access to.
  • Cluster Visibility Control: Prevents users from seeing clusters they do not have access to.
  • Job Visibility Control: Prevents users from seeing jobs they do not have access to.
  • Cluster, Pool, and Jobs Access Control: Applies controls related to who can create, attach, manage, and re-start clusters, who can attach and manage pools, and who can view and manage jobs.
  • Workspace Access Control: Applies controls related to who can view, edit, and run notebooks in a workspace.
  • Table Access Control: Applies controls related to who can select, create, and databases, views, and functions from Python and SQL. When enabled, users can set permissions for data objects on that cluster.
  • Alert Access Control: Applies controls related to who can create, run, and manage alerts in Databricks SQL.
  • Dashboard Access Control: Applies controls related to who can create, run, and manage dashboards in Databricks SQL.
  • Data Access Control: Applies data object related controls using SQL commands such as GRANT, DENY, REVOKE etc. to manage access to data objects.
  • Query Access Control: Applies controls related to who can run, edit, and manage queries in Databricks SQL.
  • SQL Endpoint Access Control: Applies controls related to who can use and manage SQL Endpoints in Databricks SQL. A SQL endpoint is a computation resource that lets you run SQL commands on data objects within the Databricks environment.
AccessVisibility Databricks Admin Console Access and Visibility controls

Azure Purview is a data governance service that helps manage and govern a variety of data sources including Azure Databricks and Spark Tables. Using Purview’s Apache Atlas API, developers can programmatically register data lineage, glossary terms, and much more into Purview directly from a Databricks notebook. The figure below illustrates how lineage from a Databricks transformation notebook could potentially be captured in Purview as a result of the Apache Atlas API running code to capture and register lineage programmatically.

Purview Purview Lineage tracking for Spark Applications

Continuous Integration and Deployment

Continuous integration and deployment standards are prevalent across a variety of Azure data services which use Azure DevOps (ADO) for this CI and CD process. Similarly, Databricks supports CI / CD within Azure DevOps by relying on Repos which store code which needs to be promoted to higher Databricks environments. Once these repositories are connected and synced to Azure DevOps, the CI and CD pipelines can be built using YAML code or the classic editor. The build pipeline will use the integrated source repo to build and publish the artifact based on automated continuous integration, and the release pipeline will continuously deploy the changes to the specified higher environments. The Databricks release pipeline tasks shown in the figure below requires the installation of the Databricks Script Deployment Task by Data Thirst from the Visual Studio Marketplace. The marketplace link can be found here. These tasks support the deployment of Databricks files such as .py, .csv, .jar, .whl etc. to DBFS. In addition, there are tasks available for the deployment of Databricks notebooks, secrets, and clusters to higher environments. As with any ADO CI / CD process, once the pipelines are built there is also the capability of adding manual approval gates, code quality tests, and more within the pipelines to ensure that the best quality code is being promoted to higher environments.

DatabricksCICD Databricks CI CD tasks available in Azure DevOps
Next Steps





get scripts

next tip button



About the author
MSSQLTips author Ron L'Esteve Ron L'Esteve is a seasoned Data Architect who holds an MBA and MSF. Ron has over 15 years of consulting experience with Microsoft Business Intelligence, data engineering, emerging cloud and big data technologies.

View all my tips


Article Last Updated: 2022-02-14

Comments For This Article

















get free sql tips
agree to terms