Reading a CSV File from DBFS in Databricks

To read a CSV file from the Databricks File System (DBFS), you can use either PySpark or SQL. Here are the steps for both methods:

Using PySpark

First, ensure you have a SparkSession initialized (in a Databricks notebook, one is already available as the `spark` variable). Then, you can read a CSV file using the `spark.read.csv()` method.

      from pyspark.sql import SparkSession

      # Initialize SparkSession
      spark = SparkSession.builder.appName('Read CSV').getOrCreate()

      # Read CSV file
      df = spark.read.csv("dbfs:/path/to/your/file.csv", header=True, inferSchema=True)

      # Display the DataFrame
      df.show()
    

Using SQL with `read_files` Function

Databricks recommends using the `read_files` table-valued function for SQL users. This function is available in Databricks Runtime 13.3 LTS and above.

      SELECT * FROM read_files(
        'dbfs:/path/to/your/file.csv',
        format => 'csv',
        header => true,
        mode => 'PERMISSIVE'
      )
    

Bottom Line: Reading CSV files from DBFS in Databricks is straightforward using either PySpark or SQL with the `read_files` function. Both methods offer flexibility in handling malformed records and specifying custom schemas or delimiters.


👉 Hop on a short call to discover how Fog Solutions helps navigate your sea of data and lights a clear path to grow your business.