BRIEF OVERVIEW
To check memory usage during a Databricks job, you can use several tools and methods that Databricks provides.
Here’s a guide to the main options for monitoring memory usage:
Using the Compute Metrics UI
Databricks offers a native compute metrics tool in the UI that provides real-time and historical data on hardware and Spark metrics, including memory usage.
- Access the Compute Metrics UI:
  - Click on “Compute” in the sidebar
  - Select the compute resource you want to monitor
  - Click on the “Metrics” tab
- View Memory Metrics:
  - By default, you’ll see hardware metrics, including memory usage
  - You can switch to Spark metrics for more detailed information
- Filter by Time Period:
  - Use the date picker to select a specific time range
  - Metrics are collected every minute and stored for up to 30 days
Monitoring Job Cluster Utilization
For a more programmatic approach on clusters running Databricks Runtime 12.2 LTS and below (where the Ganglia UI is still available), you can query the Ganglia API on the Spark driver to obtain raw memory usage data:
```python
import requests

# Query the Ganglia endpoint on the driver for the cluster-wide memory report
metric = "mem_report"
payload = requests.get(f"http://localhost/graph.php?c=cluster&json=1&r=4hr&g={metric}")
mem_report = payload.json()

# Extract memory usage from the report
bmem_used = mem_report["bmem_used"]
bmem_total = mem_report["bmem_total"]
```
This method allows you to capture memory usage at the system level, including both PySpark workloads in the JVM and Python workloads outside of it.
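For example, you could poll this endpoint periodically from driver-side code while the job runs and log utilization over time. The loop below is a minimal sketch: the field names `bmem_used` and `bmem_total` follow the report shown above, and the sample count and interval are arbitrary choices.

```python
import time
import requests

def sample_cluster_memory(samples=5, interval_s=60):
    """Poll the Ganglia mem_report and print cluster memory utilization."""
    for _ in range(samples):
        report = requests.get(
            "http://localhost/graph.php?c=cluster&json=1&r=4hr&g=mem_report"
        ).json()
        # Field names assume the report format shown above
        used, total = float(report["bmem_used"]), float(report["bmem_total"])
        print(f"cluster memory used: {used / total:.1%}")
        time.sleep(interval_s)

sample_cluster_memory()
```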
Using PySpark Memory Profiler
For more detailed memory profiling of PySpark User-Defined Functions (UDFs), you can use the PySpark Memory Profiler:
- Enable Memory Profiling:
  - Install the `memory_profiler` library on your cluster
  - Set the Spark configuration `spark.python.profile.memory` to `true`
- Profile UDF Memory Usage:
  - Use the `@profile` decorator on your UDF
  - Run your job and collect the memory profile results
Example:
```python
import pandas as pd
from memory_profiler import profile

@profile
def my_udf(pdf: pd.DataFrame) -> pd.DataFrame:
    # Your UDF code here (placeholder: passes the input through unchanged)
    result = pdf
    return result

# Use the UDF in your Spark job
result = df.groupby("id").applyInPandas(my_udf, schema=df.schema)
```
This will provide a detailed breakdown of memory usage by line of code within your UDF.
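Once the job has run, you can print the collected profiles from the driver. The snippet below is a minimal sketch, assuming a Databricks notebook where `spark` is the active session and `spark.python.profile.memory` was set to `true` before the cluster started; the output path is only an example:

```python
# Print the accumulated per-UDF profiles (memory profiles when
# spark.python.profile.memory is enabled) to the driver's stdout.
sc = spark.sparkContext
sc.show_profiles()

# Optionally write the profiles to a directory for later inspection
# (the path below is just an example location).
sc.dump_profiles("/dbfs/tmp/udf_memory_profiles")
```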
Best Practices for Memory Monitoring
- Regular Monitoring: Implement routine checks of memory usage across your Databricks jobs.
- Optimize Cluster Size: Use the memory usage data to right-size your clusters, avoiding over-provisioning or under-provisioning.
- Identify Memory Leaks: Look for patterns of increasing memory usage over time that might indicate memory leaks.
- Profile UDFs: Use the PySpark Memory Profiler to optimize memory-intensive UDFs.
- Clear Caches: If memory usage is high, consider clearing Spark caches with `spark.catalog.clearCache()` or unmounting file systems that are no longer needed with `dbutils.fs.unmount()` (see the sketch after this list).
- Monitor Multiple Metrics: In addition to memory, keep an eye on CPU utilization, disk I/O, and network usage for a comprehensive view of your job’s performance.
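A minimal sketch of the cache-clearing step, assuming a Databricks notebook where `spark` and `dbutils` are available; the DataFrame `cached_df` and the mount point `/mnt/raw-data` are hypothetical examples:

```python
# Drop all cached tables and DataFrames to free executor memory.
spark.catalog.clearCache()

# Release a specific DataFrame you cached earlier (hypothetical example).
cached_df.unpersist()

# Unmount a file system that is no longer needed (mount point is hypothetical).
dbutils.fs.unmount("/mnt/raw-data")
```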
Using these tools and practices, you can effectively monitor and optimize memory usage in your Databricks jobs, leading to better performance and cost-efficiency.