Git Integration Repo in Databricks Workspaces for Developer Collaboration

By: Temidayo Omoniyi   |   Updated: 2023-11-10


Problem

Software teams need an environment where developers can collaborate and review each other's code. Being able to vet code before it is pushed to the production environment is especially important. Manually moving notebooks from one workspace or folder to another is tedious, so a better solution is needed.

Solution

With the introduction of Git integration through Repos in Databricks workspaces, developers can collaborate on data engineering, data science, and analytics projects in a single workspace while keeping the code under version control at every stage.

What is GitHub?

GitHub is a cloud-based hosting platform that enables developers to store and manage their code and to track changes over time. GitHub is built on top of Git, a distributed version control system, and adds an intuitive graphical user interface (GUI) on top of it.

GitHub Features

GitHub is a version control platform that helps developers improve their code by following good software practices:

  1. Version Control: GitHub makes it easy for developers to track changes across multiple versions of the code, see what was changed, and revert to a previous version when needed.
  2. Code Review: This feature enables developers to review and modify code before it is merged into the main branch.
  3. Collaboration: This feature makes working together on code projects simple. Developers can create branches to work on new features or bug fixes without affecting the main branch. Once the changes are ready, they can be merged back into the main branch through a pull request.
  4. Continuous Integration and Continuous Delivery (CI/CD): GitHub can automate code building, testing, and deployment, making it possible to release new features rapidly and safely while keeping the code in a deployable state.

Databricks Repos

Azure Databricks Repos provides a graphical Git client and APIs. This enables standard Git activities such as cloning repositories, pushing and pulling, branch management, and visual comparison between different commits.

Within Databricks Repos, code developed for data-related projects can follow Git best practices for version control, collaboration, and CI/CD.

Possibilities with Databricks Repos

Databricks Repos supports the common Git operations:

  • Clone a remote Git repository, push to it, and pull from it.
  • Create and maintain feature branches before merging them into the main branch. Working this way reduces the risk of conflicting changes landing directly on the main branch.
  • Create, edit, and manage notebooks, including IPYNB notebooks.

Databricks Supported Git Providers

Azure Databricks supports the following providers:

  • GitHub and GitHub AE
  • GitLab
  • Azure DevOps
  • Bitbucket Cloud
  • Bitbucket Server
  • AWS CodeCommit

We will use the GitHub provider for this article; subsequent articles will explain the other providers.

Configure Git Integration for Databricks Workspace

Get Username and Personal Access Token

Step 1: Personal Access Token. To get the personal access token, log in to your GitHub.com account. On your GitHub homepage, click your profile icon at the top right corner and select Settings.

Personal Token Account

Step 2: Generate Token. On the Settings page, scroll to the bottom of the left pane and select Developer Settings. This should open another window.

Generate Token

In the Developer Settings window, click Personal access tokens and select Tokens (classic). This should open a new pane where you can generate a new token.

Generate Token

Note: You may be prompted to confirm your login credentials at this stage. For this article, I used the GitHub mobile app for the authentication.

Step 3: Setting New Personal Access Token. In the new window, fill in the following information:

  • Note: Provide a name to identify your token easily.
  • Expiration: Choose a timeframe. This is a tradeoff between convenience and security. The longer the token stays valid, the greater the risk if it falls into the wrong hands.
  • Repo: Check the repo box. This scope gives the token access to your repositories, including private ones.

Scroll to the bottom and select Generate token.

New personal access token

In the new window, copy the generated personal access token and store it in a private and secure place, as GitHub will not show it again.

Personal access tokens
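Before pasting the token into Databricks, it can help to confirm that it works. The following is a minimal sketch, not part of the original walkthrough, that builds a request to GitHub's REST API endpoint for the authenticated user. The function names are hypothetical, and how you store the token (for example, in an environment variable) is up to you.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_token_check_request(token: str) -> urllib.request.Request:
    """Build (but do not send) a GET /user request authenticated with the token."""
    return urllib.request.Request(
        f"{GITHUB_API}/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="GET",
    )


def fetch_login(token: str) -> str:
    """Send the request and return the GitHub login that owns the token.

    Raises urllib.error.HTTPError (401) if the token is invalid or expired.
    """
    with urllib.request.urlopen(build_token_check_request(token)) as resp:
        return json.load(resp)["login"]
```

Calling fetch_login with your token should return your GitHub username if the token is valid. Avoid hard-coding the token in a notebook or script.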

Integrate Workspace with GitHub

Now that we have generated our personal access token, we need to integrate the Databricks workspace with GitHub.

Use the following steps to integrate GitHub with the Databricks workspace:

Step 1: Link Account. In your Databricks workspace, click User Settings at the top right corner and select Linked accounts.

Link account

Step 2: Git Provider and Activate. For the next step, fill in the following configuration:

  • Git Provider: We will use the GitHub provider for this.
  • Link: Select the Personal access token.
  • Git provider username or email: Use the same email or username as your GitHub account.
  • Token: Paste the generated token from the GitHub account.

Git provider and activate

Now, click Save to complete the integration of GitHub with the Databricks workspace.
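The same link can be set up without the UI through the Databricks Git credentials REST API. The sketch below only builds the request. The workspace URL and the Databricks access token used to authenticate the call are assumptions that you would supply yourself, and the function name is hypothetical.

```python
import json
import urllib.request


def build_link_github_request(
    workspace_url: str,
    databricks_token: str,
    github_username: str,
    github_token: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that stores a GitHub credential."""
    body = {
        "git_provider": "gitHub",
        "git_username": github_username,
        "personal_access_token": github_token,
    }
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/git-credentials",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends it. This mirrors the fields filled in on the Linked accounts page: provider, username, and token.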

Create GitHub Repository

A GitHub repository is central storage for code, documents, and other project assets. It serves as a hub where developers collaborate, keep track of changes, and control code versions. Each Databricks repo is linked to a remote Git repository, in this case one hosted on GitHub.

Step 1: Add Repo. To add a new repository on GitHub, click Add Repo and fill in the information shown in the image below. We will use a private repo because it is intended for organizational use and should not be publicly visible.

Add repo

Step 2: Copy Repo Link. Click the Code button in the newly created repo, copy the HTTPS URL, and head back to your Databricks workspace.

Copy repo link

Add Databricks Repo

In your Databricks workspace, click Repos and create a new Repo.

Create new repo

In the new window, paste the Repo link (HTTPS) you copied from GitHub and click Create Repo. This creates a repo in your Databricks workspace that is linked to the GitHub repository.

Add repo
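For teams that script their workspace setup, the Databricks Repos REST API offers an equivalent of the Create Repo dialog. This is a minimal sketch that only builds the request; the workspace URL, access token, and target path under /Repos are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request


def build_create_repo_request(
    workspace_url: str,
    databricks_token: str,
    git_url: str,
    repo_path: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that clones a GitHub repo into the workspace."""
    body = {
        "url": git_url,          # HTTPS link copied from GitHub
        "provider": "gitHub",
        "path": repo_path,       # e.g. /Repos/<user>/<repo-name>
    }
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/repos",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The response to this call includes an id for the new repo, which later API calls use to refer to it.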

Create Branch

As standard practice, it is best to create a development branch where code is developed before it is moved to the main branch. Click the main branch button. This will open another window.

Main branch

In the new window, click Create Branch, name the branch Dev, and click Create. Make sure the Dev branch is selected before you continue.

Create a new branch
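A branch can also be created on the GitHub side through GitHub's REST API, which adds a new Git reference pointing at an existing commit. The sketch below assumes the main branch already has at least one commit and that you have looked up its SHA (for example, with GET /repos/{owner}/{repo}/git/ref/heads/main). The function name and all argument values are illustrative.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_create_branch_request(
    owner: str,
    repo: str,
    new_branch: str,
    base_sha: str,
    token: str,
) -> urllib.request.Request:
    """Build (but do not send) a request that creates new_branch at base_sha."""
    body = {
        "ref": f"refs/heads/{new_branch}",  # fully qualified branch name
        "sha": base_sha,                    # commit the new branch starts from
    }
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/git/refs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
```

A branch created this way appears in the Databricks branch list after the repo pulls from the remote.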

Create Notebook

Before creating a Notebook in Databricks workspace, create a Folder to house your different notebooks.

There are three ways to create notebooks in the Databricks Repo folder: creating a new notebook, importing a notebook, or cloning an existing notebook. For this article, let's clone an existing notebook from our Databricks workspace.

Clone Existing Notebook. To clone an existing notebook to the Dev Repo environment, navigate to the notebook you want to use, click on the three dots, and clone to the Repo directory.

Clone existing notebook

You can rename the cloned notebook and then click Clone.

Clone existing notebook

Commit & Push

Commit and Push are two key operations in Git, the version control system behind GitHub.

  • Commit saves a snapshot of your code changes in your local repository. This lets you track your development over time and, if necessary, go back to a previous version.
  • Push sends your commits to the branch in the GitHub repository. This allows other collaborators with access to the repository to see your modifications.

To commit and push your code, click the Dev branch button (image below). This will take you to another window.

Commit & push

In the new window, you will see the list of changes. Click Commit & Push. This sends the code to the Dev branch on GitHub.

Commit & push
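For readers who know Git from the command line, the Commit & Push button corresponds roughly to staging, committing, and pushing. The sketch below is an illustration for a local clone, not what Databricks runs internally; the function names are hypothetical and the remote is assumed to be called origin.

```python
import subprocess
from typing import List


def commit_and_push_commands(
    message: str, branch: str, remote: str = "origin"
) -> List[List[str]]:
    """Return the git commands that stage, commit, and push all changes."""
    return [
        ["git", "add", "--all"],            # stage every change in the working tree
        ["git", "commit", "-m", message],   # snapshot the staged changes locally
        ["git", "push", remote, branch],    # send the commit to the remote branch
    ]


def run_commit_and_push(repo_dir: str, message: str, branch: str) -> None:
    """Run the commands in repo_dir, stopping at the first failure."""
    for cmd in commit_and_push_commands(message, branch):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Separating the command list from the runner makes it easy to inspect what would be executed before touching a real repository.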

Compare & Pull Request

This GitHub feature lets you compare the changes on one branch with another branch before requesting that they be merged into the main branch.

To do this, head to GitHub.com and locate the repo we created earlier. Click the Compare & pull request button. This should take you to another window where you will create the pull request.

Compare & Pull Request

In the new window, we are comparing the Dev branch with the main branch. Click Create pull request.

Create pull request
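Pull requests can also be opened programmatically through GitHub's REST API, which is handy in automated workflows. The following is a hedged sketch that builds the request for a Dev to main pull request. The owner, repository, and title values are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_pull_request(
    owner: str,
    repo: str,
    title: str,
    token: str,
    head: str = "Dev",
    base: str = "main",
) -> urllib.request.Request:
    """Build (but do not send) a request that opens a pull request from head into base."""
    body = {"title": title, "head": head, "base": base}
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
```

Here head is the branch that contains the changes and base is the branch that will receive them, matching the two branches compared in the GitHub UI.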

Merge Pull Request

Now that we have successfully created a pull request, we need to merge it into the main branch by clicking Merge pull request. Add a comment if needed.

Merge pull request
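The merge step has a REST equivalent as well. This sketch builds the request that merges an open pull request, identified by its number, using a regular merge commit. The pull request number and other values are placeholders, and the function name is hypothetical.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_merge_request(
    owner: str,
    repo: str,
    pull_number: int,
    token: str,
    merge_method: str = "merge",
) -> urllib.request.Request:
    """Build (but do not send) a request that merges the given pull request."""
    body = {"merge_method": merge_method}  # "merge", "squash", or "rebase"
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{pull_number}/merge",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="PUT",
    )
```

In most teams a reviewer approves the pull request before anyone merges it, so automating this step is usually reserved for CI/CD pipelines.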

After successfully merging the notebook into the main branch, head back to your Databricks Repo and switch to the main branch. You will notice the notebook has been added to the main branch. If it does not appear, pull the latest changes from GitHub.

Notebook added to main branch
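Switching the Databricks repo to main and bringing it up to date can also be done through the Repos REST API, for example at the end of a deployment pipeline. The sketch below builds that update request. The repo ID is the one returned when the repo was created, and all values shown are placeholders.

```python
import json
import urllib.request


def build_checkout_request(
    workspace_url: str,
    databricks_token: str,
    repo_id: int,
    branch: str = "main",
) -> urllib.request.Request:
    """Build (but do not send) a request that checks out the latest commit of branch."""
    body = {"branch": branch}
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {databricks_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

Sending this request for a production copy of the repo after every merge is one simple way to keep production notebooks in step with the main branch.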

Conclusion

In this article, we learned how to generate a personal access token in GitHub and integrate it with the Databricks workspace. We also discussed the importance of GitHub and developer best practices for moving a codebase from the development stage to the production stage. In our next article, we will discuss Databricks Workflows and how to integrate our different GitHub repos to create a complete ETL pipeline.


About the author
Temidayo Omoniyi is a Microsoft Certified Data Analyst, Microsoft Certified Trainer, Azure Data Engineer, Content Creator, and Technical writer with over 3 years of experience.

This author pledges the content of this article is based on professional experience and not AI generated.
