Checking File Size in Databricks

To check the size of a file in Databricks, you can use the `dbutils.fs.ls()` command. It lists the contents of a path and returns, for each entry, details such as the path, name, and size in bytes.

Here’s an example of how to use it:

      dbutils.fs.ls("/mnt/path/to/your/file")
    
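Each entry in the returned list is a `FileInfo` object, so the size is available directly as an attribute. As a minimal sketch (the mount path is a placeholder), the following loop prints the name and size of every item in a directory:

      for f in dbutils.fs.ls("/mnt/path/to/your/directory"):
        # Each FileInfo exposes path, name, and size (in bytes)
        print(f.name, f.size)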

If you want to get the total size of a directory, including all its subdirectories and files, you can use a short Python script to sum up the sizes of all files.

For instance, the following function globs the directory through the `/dbfs` FUSE mount and sums the sizes of its Parquet files:

      import glob

      def get_directory_size_in_bytes(source_path):
        # Normalize 'dbfs:/path' or '/path' to the driver-local FUSE path '/dbfs/path'
        source_path = '/dbfs/' + source_path.replace('dbfs:', '').lstrip('/').rstrip('/')
        # recursive=True is needed for '**' to match files in nested subdirectories
        files = glob.glob(f'{source_path}/**/*.parquet', recursive=True)
        # dbutils.fs.ls() on a single file returns a one-element list of FileInfo objects
        directory_size = sum(dbutils.fs.ls(path.replace('/dbfs/', '/', 1))[0].size for path in files)
        return directory_size
    
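As a usage sketch (the mount path is again a placeholder), you can call the function and convert the byte count into a more readable unit:

      size_in_bytes = get_directory_size_in_bytes("dbfs:/mnt/path/to/your/directory")
      # 1 MiB = 1024 * 1024 bytes
      print(f"Total Parquet size: {size_in_bytes / (1024 ** 2):,.2f} MiB")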

Alternatively, because DBFS is mounted at `/dbfs` on the driver, you can use Unix commands in a Databricks notebook (via the `%sh` magic) to get the directory size:

      %sh du -h /dbfs/mnt/path/to/your/directory
    
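Note that `du -h` prints a size for every subdirectory it visits before the grand total; if you only want the total, add the `-s` flag, e.g. `du -sh /dbfs/mnt/path/to/your/directory`.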


Bottom Line

Checking file sizes in Databricks is straightforward with `dbutils.fs.ls()`, and file sizes in Delta Lake tables can be tuned by setting specific table properties. For more involved tasks, such as calculating the total size of a directory, a short Python script or a `%sh` Unix command gets the job done.


👉 Hop on a short call to discover how Fog Solutions helps navigate your sea of data and lights a clear path to grow your business.