Reading CSV Files in PySpark on Databricks
PySpark provides two primary methods to read CSV files into a DataFrame: the `csv("path")` method and the `format("csv").load("path")` method. Both methods allow you to specify various options to handle different aspects of the CSV file, such as headers, delimiters, and schema inference.
Method 1: Using csv("path")
This method is straightforward and can be combined with options like `header` and `delimiter` to customize the reading process.
```python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName('PySpark Read CSV').getOrCreate()

# Read a CSV file, treating the first row as column names
dataframe = spark.read.option("header", True).csv("/FileStore/tables/zipcodes-2.csv")

# Print the schema (all columns default to string unless inferSchema is enabled)
dataframe.printSchema()
```
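The example above shows the `header` option; `delimiter` works the same way. Here is a minimal sketch for a pipe-delimited file, assuming a hypothetical path `/FileStore/tables/pipe_delimited.csv`:

```python
# Read a pipe-delimited file; delimiter accepts any single character
df_pipe = (
    spark.read
    .option("header", True)
    .option("delimiter", "|")
    .csv("/FileStore/tables/pipe_delimited.csv")  # hypothetical path
)
```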
Method 2: Using format("csv").load("path")
This method offers more flexibility by allowing additional options like `multiLine` and an explicit `schema` to be specified.
```python
# Read a CSV file with explicit options and a predefined schema
# (assumes `schema` and `bronze_folder_path` are defined earlier; see the sketch below)
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("quote", '"')
    .option("delimiter", ",")
    .option("nullValue", "")
    .option("emptyValue", "NULL")
    .option("multiLine", True)
    .schema(schema)
    .load(f"{bronze_folder_path}/Test.csv")
)
```
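The snippet above references a `schema` object and a `bronze_folder_path` variable defined elsewhere in the notebook. A minimal sketch of what those definitions might look like, assuming hypothetical column names and a hypothetical folder path:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema; replace the field names and types with those of your file
schema = StructType([
    StructField("id", IntegerType(), nullable=True),
    StructField("name", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
])

# Hypothetical folder path; on Databricks this is often a mount point or volume
bronze_folder_path = "/mnt/bronze"
```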
Frequently Asked Questions
- Q: How do I handle malformed CSV records?
  A: You can use the `mode` option to handle malformed records. The available modes are `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. For example, use `.option("mode", "PERMISSIVE")` to insert nulls for fields that cannot be parsed; see the sketch after this list.
- Q: Can I read multiple CSV files at once?
  A: Yes. In PySpark you can pass a list of file paths to the `csv()` method, for example `spark.read.csv(["path/file1.csv", "path/file2.csv"])`. You can also point it at a directory or use a glob pattern such as `"path/*.csv"`.
- Q: How do I specify a custom schema for a CSV file?
  A: You can define a custom schema using `StructType` and pass it to the `schema()` method when reading the CSV file, as shown in the Method 2 example above.
- Q: What is the default delimiter for CSV files in PySpark?
  A: The default delimiter for CSV files in PySpark is a comma (`,`).
- Q: Can I automatically infer the schema of a CSV file?
  A: Yes, set the `inferSchema` option to `True` (for example, `.option("inferSchema", True)`) to infer column types automatically. However, this requires reading the data twice.
- Q: How do I handle CSV files with multiline records?
  A: You can handle multiline records by setting the `multiLine` option to `True`.
- Q: What is the purpose of the `badRecordsPath` option?
  A: The `badRecordsPath` option, available on Databricks, allows you to specify a path where malformed records will be written, providing a way to inspect and handle corrupted data; see the sketch after this list.
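To illustrate the malformed-record handling from the FAQ, here is a minimal sketch, assuming a hypothetical input file and hypothetical output path; `badRecordsPath` is a Databricks-specific option:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# PERMISSIVE mode (the default) inserts nulls for fields it cannot parse;
# including a _corrupt_record column in the schema captures the raw text of bad rows
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # holds unparsable rows
])

df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .schema(schema)
    .load("/FileStore/tables/input.csv")  # hypothetical path
)

# On Databricks, badRecordsPath instead writes malformed rows out as files
# so they can be inspected later
df_bad = (
    spark.read.format("csv")
    .option("header", "true")
    .option("badRecordsPath", "/tmp/bad_records")  # hypothetical path
    .load("/FileStore/tables/input.csv")
)
```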
Bottom Line
Reading CSV files in PySpark on Databricks is efficient and flexible, letting you handle files with varied layouts and structures. By leveraging options like headers, delimiters, and schema inference, you can tailor the reading process to fit your specific data needs.