
PDF Processing

Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

Added 12/19/2025

Install via CLI

$ openskills install Justdvp/claude-code-templates
SKILL.md
---
name: PDF Processing
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Use pdfplumber to extract text from PDFs:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```

## Extracting tables

Extract tables from PDFs with automatic detection:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
```
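
`extract_tables()` returns each table as a list of rows, and each row as a list of cell values (strings, or `None` for empty cells). When the first row is a header, the rows can be turned into dictionaries. A minimal sketch, assuming the first row really holds the column names; `table_to_records` is a hypothetical helper name, not part of pdfplumber:

```python
def table_to_records(table):
    """Convert one extracted table into a list of dicts keyed by the header row.

    Assumes the first row holds column names; empty cells (None) become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cleaned = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cleaned)))
    return records


# Sample data shaped like pdfplumber's output (values are made up)
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
records = table_to_records(sample)
```

In practice, call `table_to_records(table)` for each table returned by `page.extract_tables()`. Tables without a header row need a different mapping.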

## Extracting all pages

Process multi-page documents efficiently:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
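    # extract_text() can return None or "" for pages without a text layer
    pages_text = [page.extract_text() or "" for page in pdf.pages]

# Join once instead of concatenating strings inside the loop
full_text = "\n\n".join(pages_text)
print(full_text)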
```

## Form filling

For PDF form filling, see [FORMS.md](FORMS.md) for the complete guide including field analysis and validation.

## Merging PDFs

Combine multiple PDF files:

```python
from pypdf import PdfWriter

# PdfMerger is deprecated in recent pypdf releases; PdfWriter.append() replaces it
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)

writer.write("merged.pdf")
writer.close()
```

## Splitting PDFs

Extract specific pages or ranges:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5 (reader.pages is zero-indexed, so indices 1-4)
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
```
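
When page ranges come from user input, it helps to convert a 1-based specification into the zero-based indices that `reader.pages` expects. The sketch below is one way to do that; `parse_page_ranges` and its `"2-5,8"` spec format are assumptions for illustration, not part of pypdf:

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based spec such as "2-5,8" into sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that fall outside the document or run backwards
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Invalid page range: {part!r}")
        indices.update(range(start - 1, end))
    return sorted(indices)


indices = parse_page_ranges("2-5,8", page_count=10)
```

Each resulting index can then be passed to `writer.add_page(reader.pages[index])`, with `page_count` taken from `len(reader.pages)`.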

## Available packages

- **pdfplumber** - Text and table extraction (recommended)
- **pypdf** - PDF manipulation, merging, splitting
- **pdf2image** - Convert PDFs to images (requires poppler)
- **pytesseract** - OCR for scanned PDFs (requires tesseract)
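
The last two packages are typically combined as a fallback for scanned pages, where pdfplumber finds no text layer. The following is a rough sketch of that pattern, not a tuned pipeline: `needs_ocr`, `extract_text_with_ocr_fallback`, the 20-character threshold, and the 300 DPI setting are all assumptions chosen for illustration.

```python
def needs_ocr(text, min_chars=20):
    """Heuristic: treat a page as scanned if it yields almost no text.

    The 20-character default is an arbitrary assumption; tune it per document set.
    """
    return text is None or len(text.strip()) < min_chars


def extract_text_with_ocr_fallback(path, page_number):
    """Return text for a 1-based page number, running OCR only when needed."""
    import pdfplumber  # imported lazily so needs_ocr() works without it

    with pdfplumber.open(path) as pdf:
        text = pdf.pages[page_number - 1].extract_text()
    if not needs_ocr(text):
        return text

    # OCR dependencies are only loaded for pages that need them
    import pytesseract
    from pdf2image import convert_from_path

    # Render just the requested page to an image, then run Tesseract on it
    images = convert_from_path(
        path, dpi=300, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])
```

Calling `extract_text_with_ocr_fallback("scanned.pdf", 1)` requires the poppler and tesseract binaries to be installed, as noted in the list above.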

## Common patterns

**Extract and save text:**
```python
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
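    # Fall back to "" for pages with no extractable text so join() never sees None
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f: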
    f.write(text)
```

**Extract tables to CSV:**
```python
import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for table in tables:
            writer.writerows(table)
```

## Error handling

Handle common PDF issues:

```python
import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
```

## Performance tips

- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use (a `with pdfplumber.open(...)` block does this automatically)
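
The first two tips can be sketched as follows, assuming the standard library's `concurrent.futures` for parallelism. The helper names (`page_batches`, `iter_text_batches`, `extract_first_page`, `extract_many`) and the batch size of 50 are hypothetical choices, not part of pdfplumber:

```python
from concurrent.futures import ProcessPoolExecutor


def page_batches(page_count, batch_size):
    """Yield (start, stop) zero-based index pairs covering all pages."""
    for start in range(0, page_count, batch_size):
        yield start, min(start + batch_size, page_count)


def iter_text_batches(path, batch_size=50):
    """Yield the text of one large PDF a batch of pages at a time."""
    import pdfplumber  # imported lazily so the pure helpers need no extra packages

    with pdfplumber.open(path) as pdf:
        for start, stop in page_batches(len(pdf.pages), batch_size):
            yield "\n\n".join(
                page.extract_text() or "" for page in pdf.pages[start:stop]
            )


def extract_first_page(path):
    """Worker: read only the page that is needed, not the whole document."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        text = pdf.pages[0].extract_text() if pdf.pages else ""
    return path, text or ""


def extract_many(paths, max_workers=None):
    """Process several files in parallel, one worker process per file at a time."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(extract_first_page, paths))
```

A caller can write each batch from `iter_text_batches("large.pdf")` to disk as it arrives instead of holding the full text in memory. On platforms that spawn worker processes, call `extract_many([...])` from inside an `if __name__ == "__main__":` guard.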
