Uploading a PDF file to Databricks can be done through the Databricks workspace or by using the Databricks CLI.
Both methods provide an easy way to store and access your PDF files within your Databricks environment.
DATABRICKS WORKSPACE METHOD
- Log in to your Databricks account and navigate to the workspace.
- Create a new notebook or open an existing one where you want to upload the PDF file.
- In the notebook, click on “File” in the top menu and select “Upload Data”.
- Select the desired PDF file from your local machine and click “Open” or “Choose”.
- The PDF file will be uploaded to your current working directory in the workspace.
DATABRICKS CLI METHOD
- Install and configure the Databricks CLI on your local machine following the official documentation.
- Open a terminal or command prompt window.
- Navigate to the directory where you have saved your PDF file locally.
- Run this command: databricks fs cp [local-path] dbfs:[dbfs-path]
Replace [local-path] with the path of your local PDF file, and [dbfs-path] with the desired path in DBFS (Databricks File System).
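If you would rather script the copy from Python than type it by hand, the command above can be wrapped with the standard subprocess module. This is a minimal sketch, assuming the Databricks CLI is already installed and configured on the machine. The helper names build_cp_command and upload_pdf and the example paths are hypothetical and not part of the CLI.

```python
import subprocess
from typing import List


def build_cp_command(local_path: str, dbfs_path: str) -> List[str]:
    """Build the argument list for `databricks fs cp` (hypothetical helper)."""
    # Normalise the target to a dbfs:/ URI, matching the command shown above.
    if not dbfs_path.startswith("dbfs:"):
        dbfs_path = "dbfs:/" + dbfs_path.lstrip("/")
    return ["databricks", "fs", "cp", local_path, dbfs_path]


def upload_pdf(local_path: str, dbfs_path: str) -> None:
    """Run the copy; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_cp_command(local_path, dbfs_path), check=True)
```

For example, upload_pdf("report.pdf", "/FileStore/report.pdf") would run databricks fs cp report.pdf dbfs:/FileStore/report.pdf. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.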
DBFS FILE BROWSER METHOD
Here are the key steps to upload a PDF file to Databricks:
- Enable the DBFS File Browser in your workspace settings if not already enabled:
  - Go to Admin Console > Workspace Settings
  - Enable “DBFS File Browser”
- Use the DBFS File Browser to upload the PDF:
  - Go to Data > DBFS
  - Click “Upload” and select your PDF file
  - Choose a location in DBFS to upload to, like /FileStore/
  - Click “Upload” to complete the process
- Access the uploaded PDF file in a notebook:
  - Use the file path starting with “/dbfs”, e.g.:
python
pdf_path = "/dbfs/FileStore/your_pdf_file.pdf"
- To read the PDF contents, you can use a library like PyPDF2:
  - Install PyPDF2 using pip:
text
%pip install PyPDF2
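  - Read the PDF:
python
import PyPDF2

# pdf_path is the /dbfs/... path defined earlier
with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    text = ""
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. scanned images)
        text += page.extract_text() or ""

print(text)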
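- Alternatively, you can use the Databricks File System (DBFS) utilities to check that the file is in place. Note that dbutils.fs.head returns only the first bytes of the file as a string, so for a binary PDF it confirms the upload (the output should begin with %PDF) rather than extracting the text:
python
pdf_content = dbutils.fs.head("dbfs:/FileStore/your_pdf_file.pdf")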
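The examples above use two spellings of the same location: dbfs:/FileStore/... for dbutils and /dbfs/FileStore/... for ordinary Python file access. A small helper can convert one into the other and confirm that a file looks like a PDF before it is handed to a parser. This is a sketch under stated assumptions: the function names are hypothetical, the /dbfs mount is assumed to be available on your cluster, and the check only inspects the %PDF- marker that PDF files start with.

```python
def dbfs_uri_to_local_path(uri: str) -> str:
    """Convert 'dbfs:/FileStore/x.pdf' to '/dbfs/FileStore/x.pdf' (hypothetical helper)."""
    prefix = "dbfs:"
    if not uri.startswith(prefix):
        raise ValueError(f"not a dbfs: URI: {uri!r}")
    # Drop the scheme and any leading slashes, then root the path at /dbfs.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


def looks_like_pdf(path: str) -> bool:
    """Return True if the file begins with the '%PDF-' marker."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

In a notebook, looks_like_pdf(dbfs_uri_to_local_path("dbfs:/FileStore/your_pdf_file.pdf")) would be one way to catch a truncated or mislabelled upload before calling PyPDF2.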
Remember that the maximum file size for uploading through the UI is typically 2GB.
For larger files, you may need to use alternative methods like the Databricks CLI or directly uploading to cloud storage and then accessing from Databricks.
Also note that while you can upload and store PDF files in Databricks, processing them may require additional libraries depending on your specific needs.
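The size rule above can be turned into a quick pre-upload check. The sketch below assumes the 2 GB UI limit mentioned earlier, which may differ in your workspace. The function names and the returned labels are illustrative only.

```python
import os

UI_LIMIT_BYTES = 2 * 1024**3  # assumed 2 GB UI limit; adjust for your workspace


def suggest_upload_method(size_bytes: int, ui_limit: int = UI_LIMIT_BYTES) -> str:
    """Suggest 'ui' for files within the limit, otherwise 'cli-or-cloud-storage'."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "ui" if size_bytes <= ui_limit else "cli-or-cloud-storage"


def suggest_for_file(path: str) -> str:
    """Look up the file size on disk and suggest an upload method."""
    return suggest_upload_method(os.path.getsize(path))
```

Calling suggest_for_file on the local PDF before uploading gives a simple hint about whether the UI is likely to accept it.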
BOTTOM LINE
Uploading a PDF file to Databricks is simple and can be done through the workspace interface, the DBFS File Browser, or the CLI.
Choose whichever method suits your needs best and enjoy seamless access to your PDF files within your Databricks environment.