To read an Excel file with PySpark on Databricks, you can use one of the following methods:
Using the com.crealytics.spark.excel Library
- Install the library on your Databricks cluster:
  - Navigate to Compute > Select your cluster > Libraries
  - Click “Install New” > Choose “Maven”
  - Search for “spark-excel” and select the version that matches your cluster’s Scala and Spark versions (the Maven coordinates start with com.crealytics:spark-excel)
  - Install the library
- Read the Excel file:
spark_df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/your_excel_file.xlsx")  # Spark readers use the dbfs:/ URI, not the /dbfs FUSE path
spark_df.show()
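If the workbook has multiple sheets or the data does not start in the first cell, the spark-excel reader also accepts a dataAddress option. A minimal sketch, assuming a sheet named Sheet1 with data starting at cell A1 (the sheet name and path are placeholders):

# Sketch: read a specific sheet and starting cell with spark-excel
# "Sheet1" is an assumed sheet name; adjust it to your workbook
spark_df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dataAddress", "'Sheet1'!A1")  # sheet and top-left cell to start reading from
    .load("dbfs:/FileStore/your_excel_file.xlsx")
)
spark_df.show()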
Using pandas with openpyxl
- Install the required libraries:
%pip install pandas openpyxl
- Read the Excel file:
import pandas as pd

# Read the Excel file into a pandas DataFrame
# (local libraries such as pandas read DBFS files through the /dbfs FUSE mount)
pandas_df = pd.read_excel("/dbfs/FileStore/your_excel_file.xlsx", engine="openpyxl")

# Convert the pandas DataFrame to a PySpark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()
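If the workbook has several sheets, or pandas’ inferred types do not convert cleanly, you can pick a sheet explicitly and pass an explicit schema to createDataFrame. A minimal sketch; the sheet name and column names below are placeholders, not taken from your file:

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical sheet and columns, shown only to illustrate the pattern
pandas_df = pd.read_excel(
    "/dbfs/FileStore/your_excel_file.xlsx",
    sheet_name="Sheet1",
    engine="openpyxl",
)

schema = StructType([
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# An explicit schema avoids conversion errors when pandas infers mixed or object dtypes
spark_df = spark.createDataFrame(pandas_df, schema=schema)
spark_df.show()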
Using the PySpark pandas API (pyspark.pandas)
For Databricks Runtime 10.0 and above (Spark 3.2 or later), which includes the pandas API on Spark; on older runtimes, the separate Koalas library provided equivalent functionality:
import pyspark.pandas as ps

# Returns a pandas-on-Spark DataFrame; openpyxl is required for .xlsx parsing
psdf = ps.read_excel("dbfs:/FileStore/your_excel_file.xlsx", sheet_name="Sheet1")
spark_df = psdf.to_spark()
spark_df.show()
Remember to replace "/dbfs/FileStore/your_excel_file.xlsx" with the actual path to your Excel file in DBFS. Note that Spark readers expect a dbfs:/ URI, while local libraries such as pandas read through the /dbfs FUSE mount.
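If you are unsure of the exact path, you can list the directory first. A quick check, assuming the file was uploaded under FileStore:

# List files under FileStore to confirm the workbook's exact name and path
display(dbutils.fs.ls("dbfs:/FileStore/"))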
These methods allow you to read Excel files into PySpark DataFrames in Databricks, enabling further data processing and analysis using Spark's distributed computing capabilities. Keep in mind that the pandas-based approaches load the entire workbook into driver memory before the conversion, which matters for large files.
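As a quick illustration of that follow-on processing, here is a hypothetical aggregation on the resulting DataFrame; the column names are placeholders, not columns from your workbook:

from pyspark.sql import functions as F

# Hypothetical columns "category" and "amount", used only to show downstream processing
summary_df = (
    spark_df
    .groupBy("category")
    .agg(F.sum("amount").alias("total_amount"))
)
summary_df.show()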