Connecting Azure Databricks to a Storage Account
Azure Databricks can be connected to an Azure Storage account using several methods, including OAuth 2.0 with a Microsoft Entra ID service principal, Shared Access Signatures (SAS), and storage account keys. Here’s how you can do it:
Method 1: Using OAuth 2.0 with a Microsoft Entra ID Service Principal
This method is recommended for connecting to Azure Data Lake Storage Gen2. To use it, you need to create a Microsoft Entra ID service principal and grant it access to your storage account.
- Create a Service Principal: In the Azure portal, go to Microsoft Entra ID > App registrations > New registration, and follow the prompts to register a new application.
- Generate a Client Secret: In the application settings, find the “Certificates & secrets” section and add a new client secret.
- Assign Roles: Ensure the service principal has the necessary roles on the storage account, such as “Storage Blob Data Contributor” for Data Lake Storage Gen2.
- Configure in Databricks: Use the service principal credentials to configure Spark properties in your Databricks cluster or notebook.
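The steps above can be sketched in Python as a helper that assembles the Spark properties for OAuth access to ADLS Gen2. These property names belong to the Hadoop ABFS driver; the storage account, tenant, and secret-scope names in the usage comment are hypothetical placeholders.

```python
# Build the Spark configuration for OAuth 2.0 access to an ADLS Gen2 account
# via a Microsoft Entra ID service principal (a sketch, not production code).

def oauth_spark_conf(storage_account: str, client_id: str,
                     client_secret: str, tenant_id: str) -> dict:
    """Return the ABFS driver properties for service-principal OAuth."""
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# In a Databricks notebook you would apply these properties, reading the
# client secret from a secret scope rather than hard-coding it:
#   for key, value in oauth_spark_conf(
#           "mystorageacct",
#           client_id="<application-id>",
#           client_secret=dbutils.secrets.get("my-scope", "sp-secret"),
#           tenant_id="<tenant-id>").items():
#       spark.conf.set(key, value)
```

Setting the properties with `spark.conf.set` scopes them to the current session, which is usually preferable to baking credentials into cluster-wide configuration.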
Method 2: Using Shared Access Signatures (SAS)
SAS tokens provide temporary access to specific resources in your storage account without needing a service principal.
- Generate a SAS Token: In the Azure portal, navigate to your storage account > Shared access signatures. Generate a token with the desired permissions.
- Mount Storage in Databricks: Use the SAS token to mount the storage container in Databricks using the `dbutils.fs.mount` command.
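As a rough illustration, the mount call can be assembled like this. The container, account, mount point, and secret-scope names are hypothetical, and `dbutils` itself is only available inside a Databricks notebook.

```python
# Sketch: build the arguments for dbutils.fs.mount when authenticating a
# Blob container with a SAS token (names below are placeholders).

def sas_mount_args(container: str, storage_account: str,
                   mount_point: str, sas_token: str) -> dict:
    """Build the keyword arguments for dbutils.fs.mount using a SAS token."""
    conf_key = (f"fs.azure.sas.{container}.{storage_account}"
                ".blob.core.windows.net")
    return {
        "source": f"wasbs://{container}@{storage_account}.blob.core.windows.net",
        "mount_point": mount_point,
        "extra_configs": {conf_key: sas_token},
    }

# In a notebook, with the token read from a secret scope:
#   dbutils.fs.mount(**sas_mount_args(
#       "data", "mystorageacct", "/mnt/data",
#       dbutils.secrets.get("my-scope", "sas-token")))
```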
Method 3: Using Storage Account Keys
Storage account keys grant full access to the entire account, so this method is not recommended for production; use it only for testing.
- Retrieve Account Keys: In the Azure portal, go to your storage account > Access keys.
- Configure in Databricks: Use the account key to set Spark properties for accessing the storage account.
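A minimal sketch of the account-key configuration, assuming a hypothetical account and secret-scope name:

```python
# Sketch: the single Spark property that grants access via the storage
# account key (account name below is a placeholder).

def account_key_conf(storage_account: str, account_key: str) -> dict:
    """Return the Spark property for account-key access to the account."""
    return {
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net":
            account_key,
    }

# In a notebook (the key should come from a secret scope, never be
# hard-coded):
#   for k, v in account_key_conf(
#           "mystorageacct",
#           dbutils.secrets.get("my-scope", "account-key")).items():
#       spark.conf.set(k, v)
```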
Frequently Asked Questions
- Q: What is the recommended method for connecting to Azure Data Lake Storage Gen2?
A: Using OAuth 2.0 with a Microsoft Entra ID service principal is recommended for Azure Data Lake Storage Gen2.
- Q: Can I use Azure AD Passthrough for authentication?
A: Yes, if Azure AD Passthrough (credential passthrough) is enabled in your Databricks environment, you can authenticate with your own Azure AD identity without a service principal.
- Q: How do I manage credentials securely in Databricks?
A: Use secret scopes to store and retrieve credentials securely instead of hard-coding them in notebooks.
- Q: What are the benefits of using the `abfss` driver over `wasbs`?
A: The `abfss` (ABFS) driver is built for Azure Data Lake Storage Gen2, supports OAuth authentication, and is recommended over the legacy `wasbs` driver.
- Q: Can I use a SAS token for multiple storage accounts?
A: Yes, you can configure a separate SAS token for each storage account in the same Spark session.
- Q: How do I troubleshoot the “Filesystem not found” error in Azure Data Lake Storage Gen2?
A: Ensure that the hierarchical namespace is enabled on the storage account; if the error persists, try creating the container through the ABFS driver rather than the Azure portal.
- Q: Is it possible to connect to Azure Blob Storage using a service principal?
A: The legacy `wasbs` driver does not support OAuth with a service principal; use SAS tokens or account keys for Blob Storage, or switch to ADLS Gen2 with `abfss` if you need OAuth.
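To tie the FAQ answers together, here is a small sketch of building the `abfss://` URI used by the ABFS driver once authentication is configured. The container, account, and path names are placeholders.

```python
# Sketch: construct an abfss:// URI for ADLS Gen2 (names are hypothetical).

def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Return the abfss:// URI understood by the ABFS driver."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base

# In a notebook, once OAuth or another method is configured:
#   df = spark.read.parquet(abfss_uri("data", "mystorageacct", "events/2024"))
```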
Bottom Line: Connecting Azure Databricks to a storage account can be achieved through various methods, each with its own advantages and use cases. Choosing the right method depends on your specific security requirements and the type of storage you are using.