Reading a CSV File in Databricks

Databricks provides several ways to read CSV files, with first-class support in Python, Scala, R, and SQL. Here are some examples:

Python Example

To read a CSV file in Python, you can use the following code:

      df = (spark.read.format("csv")
        .option("header", "true")        # first row supplies column names
        .option("delimiter", ",")
        .option("nullValue", "")         # fields matching this string are read as null
        .option("emptyValue", "NULL")    # string used to represent empty fields
        .load(f"{bronze_folder_path}/Test.csv"))
    
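If the file's structure is known up front, you can skip inference entirely and pass an explicit schema, which avoids an extra read of the data and guarantees stable column types. Here is a minimal sketch, assuming a hypothetical file with `id`, `name`, and `amount` columns (adjust the fields to match your data):

      from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

      # Hypothetical schema for illustration; replace with your file's actual columns
      schema = StructType([
          StructField("id", IntegerType(), True),
          StructField("name", StringType(), True),
          StructField("amount", DoubleType(), True),
      ])

      df = (spark.read.format("csv")
        .option("header", "true")
        .schema(schema)                  # explicit schema: no inference pass needed
        .load(f"{bronze_folder_path}/Test.csv"))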

Scala Example

In Scala, you can read a CSV file as follows:

      val df = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("delimiter", ",")
        .csv(s"${pathVolume}/${fileName}")
    
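Note that `inferSchema` makes an extra pass over the file to sample the data and guess column types; for large files or scheduled jobs, supplying an explicit schema (as in the Python sketch above) is usually faster and more predictable.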

R Example

For R, use the following code:

      library(SparkR)
      df <- read.df(paste0(path_volume, "/", file_name),
                    source = "csv",
                    header = TRUE,
                    inferSchema = TRUE,
                    delimiter = ",")
    

SQL Example Using `read_files`

SQL users can leverage the `read_files` function:

      SELECT * FROM read_files(
        's3://<bucket>/<path>/<file-name>.csv',
        format => 'csv',
        header => true,
        mode => 'FAILFAST')
    
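The same query can also be run from a Python notebook cell through `spark.sql`; a minimal sketch, with the placeholder path again standing in for a real bucket and file:

      # The S3 placeholders are illustrative; substitute a real bucket and path
      df = spark.sql("""
          SELECT * FROM read_files(
              's3://<bucket>/<path>/<file-name>.csv',
              format => 'csv',
              header => true,
              mode => 'FAILFAST')
      """)
      df.show(5)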

Handling Malformed Records

Databricks lets you control how malformed CSV records are handled via the `mode` option: `PERMISSIVE` keeps every row and nulls out fields it cannot parse, `DROPMALFORMED` discards bad rows, and `FAILFAST` aborts on the first bad record. For example:

      df = (spark.read.format("csv")
        .option("mode", "PERMISSIVE")    # keep every row; unparsable fields become null
        .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv"))
    
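In `PERMISSIVE` mode you can also keep the raw text of rows that failed to parse by declaring a corrupt-record column in the schema. A minimal sketch, assuming a hypothetical two-column file; the schema and path here are illustrative:

      from pyspark.sql.types import StructType, StructField, IntegerType, StringType

      # Include a _corrupt_record column so PERMISSIVE mode has somewhere to
      # park the raw text of rows it could not parse.
      schema = StructType([
          StructField("id", IntegerType(), True),
          StructField("name", StringType(), True),
          StructField("_corrupt_record", StringType(), True),
      ])

      df = (spark.read.format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .schema(schema)
        .load("/path/to/your/file.csv"))  # hypothetical path

      # Cache before filtering: Spark disallows queries that reference only the
      # corrupt-record column on an uncached CSV DataFrame.
      df.cache()
      df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)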

Bottom Line

Reading CSV files in Databricks is straightforward and flexible. Whether you're using Python, Scala, R, or SQL, the platform gives you robust options for headers, delimiters, schemas, and malformed records, so you can manage and analyze your data efficiently.


👉 Hop on a short call to discover how Fog Solutions helps navigate your sea of data and lights a clear path to grow your business.