Reading CSV Files in PySpark on Databricks

PySpark provides two primary methods to read CSV files into a DataFrame: using the csv("path") method or the format("csv").load("path") method. Both methods allow you to specify various options to handle different aspects of the CSV file, such as headers, delimiters, and schema inference.

Method 1: Using csv("path")

This method is straightforward and can be used with options like header and delimiter to customize the reading process.

      from pyspark.sql import SparkSession

      # Create a SparkSession (Databricks notebooks already provide one as `spark`)
      spark = SparkSession.builder.appName('PySpark Read CSV').getOrCreate()

      # Read CSV file, treating the first line as the header
      dataframe = spark.read.option("header", True).csv("/FileStore/tables/zipcodes-2.csv")

      # Print schema (all columns are read as strings unless inferSchema is enabled)
      dataframe.printSchema()
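
If you also want Spark to infer column types instead of reading every column as a string, the same reader accepts the inferSchema and delimiter options. A minimal sketch, reusing the file path from above:

      # Infer column types (costs an extra pass over the file)
      dataframe = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .option("delimiter", ",")  # adjust for pipe- or tab-separated files
        .csv("/FileStore/tables/zipcodes-2.csv")
      )
      dataframe.printSchema()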

Method 2: Using format("csv").load("path")

This method offers more flexibility, supporting additional options such as multiLine parsing and an explicit schema. The snippet below assumes a schema object and a bronze_folder_path variable; illustrative definitions are included so the example is self-contained.

      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      # Illustrative schema and path -- replace with your own columns and bronze folder
      schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
      ])
      bronze_folder_path = "/mnt/bronze"

      # Read CSV file with specified options
      df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("quote", '"')
        .option("delimiter", ",")
        .option("nullValue", "")
        .option("emptyValue", "NULL")
        .option("multiLine", True)
        .schema(schema)
        .load(f"{bronze_folder_path}/Test.csv")
      )
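
After loading, a quick sanity check confirms that the options behaved as expected:

      # Inspect the first rows and the applied schema
      df.show(5, truncate=False)
      df.printSchema()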

Bottom Line

Reading CSV files in PySpark on Databricks is efficient and flexible, letting you handle a wide variety of CSV layouts and structures. By combining options such as headers, delimiters, and schema handling, you can tailor the reading process to your specific data.