To read a file from Azure Data Lake Storage (ADLS) Gen2 in Databricks, you can follow these steps:
Set Up Authentication
First, configure authentication to access your ADLS Gen2 account. The recommended method is OAuth 2.0 with a Microsoft Entra ID service principal:
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net",
"<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net",
dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
"https://login.microsoftonline.com/<directory-id>/oauth2/token")
Replace <storage-account>, <application-id>, <secret-scope>, <service-credential-key>, and <directory-id> with your specific values.
Read the File
Once authentication is set up, you can read the file using Spark:
For CSV files:
file_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-file>/file.csv"
df = spark.read.csv(file_path, header=True, inferSchema=True)
For JSON files:
file_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-file>/file.json"
df = spark.read.json(file_path)
For Parquet files:
file_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-file>/file.parquet"
df = spark.read.parquet(file_path)
Replace <container-name>, <storage-account-name>, and <path-to-file> with your specific values.
Display the Data
After reading the file, you can display the contents:
display(df)
Additional Notes
- You can use dbutils.fs.ls() to list files in a directory:
  dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
- If you’re using Unity Catalog volumes (recommended), you can simplify file access using paths like /Volumes/my_catalog/my_schema/my_volume/data.csv (see the example after this list).
- For better security, store sensitive information like access keys in Azure Key Vault and access them using Databricks secret scopes.
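For example, here is a minimal sketch of reading a CSV file from a Unity Catalog volume; the catalog, schema, volume, and file names are placeholders:
file_path = "/Volumes/my_catalog/my_schema/my_volume/data.csv"  # placeholder volume path
df = spark.read.csv(file_path, header=True, inferSchema=True)
display(df)
No spark.conf settings are needed in this case, because access to volumes is governed by Unity Catalog permissions rather than storage credentials configured in the notebook.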
Remember to replace placeholder values with your actual ADLS Gen2 account details, container names, and file paths.