Grep PDF | pdf-mcp Documentation

Search for text patterns within PDF documents. This endpoint works like grep for PDFs: it searches the text of each page and returns the matching text with page numbers, character positions, and surrounding context. Ideal for coding agents and automated document analysis.

Endpoint

POST /grepPdf

Authentication

Requires a valid API key or OAuth token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

See Authentication for details.

Request Body

Content-Type: application/json or multipart/form-data

Field	Type	Required	Description
`pdf_base64`	string	Conditional	Base64-encoded PDF file
`pdf_url`	string	Conditional	URL to fetch the PDF from
`file`	file	Conditional	PDF file upload (multipart only)
`pattern`	string	Yes	Search pattern (plain text or regex)
`regex`	boolean	No	Treat pattern as regex (default: false)
`ignore_case`	boolean	No	Case-insensitive search (default: true)
`pages`	array/string	No	Page numbers to search (1-indexed)
`context`	integer	No	Characters of context before and after each match (default: 100, max: 500)
`count_only`	boolean	No	Only return match counts, not text context (default: false)

Note: You must provide exactly one of: pdf_base64, pdf_url, or file.
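
The "exactly one source" rule can be checked on the client before a request is sent. The following is a minimal sketch for JSON requests only (the file field is multipart only, so it is left out); build_grep_payload is a hypothetical helper name, not part of the API.

```python
from typing import Any, Dict, Optional


def build_grep_payload(
    pattern: str,
    pdf_base64: Optional[str] = None,
    pdf_url: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build a JSON body for POST /grepPdf with exactly one PDF source.

    Hypothetical client-side helper; the server performs its own validation.
    """
    sources = {"pdf_base64": pdf_base64, "pdf_url": pdf_url}
    provided = {name: value for name, value in sources.items() if value}
    if len(provided) != 1:
        raise ValueError("provide exactly one of pdf_base64 or pdf_url")
    if not pattern:
        raise ValueError("pattern is required")
    # Optional fields (regex, ignore_case, pages, context, count_only)
    # are passed through unchanged.
    return {**provided, "pattern": pattern, **options}


payload = build_grep_payload(
    "invoice", pdf_url="https://example.com/documents/report.pdf", count_only=True
)
```

The resulting dictionary can be passed as the json argument of requests.post, as in the Python examples further down.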

Field Details

pdf_base64 (conditional)

A base64-encoded PDF file. Use this when you have the PDF data in memory or need to send it as part of a JSON payload. The encoded string should not include the data URI prefix.
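
As an illustration of the "no data URI prefix" requirement, the sketch below encodes raw bytes and strips a prefix from a string that came from elsewhere (for example, a browser FileReader). Both function names are hypothetical.

```python
import base64


def to_pdf_base64(data: bytes) -> str:
    """Encode raw PDF bytes as a single-line base64 string."""
    return base64.b64encode(data).decode("ascii")


def strip_data_uri(value: str) -> str:
    """Drop a leading 'data:...;base64,' prefix if one is present."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value


encoded = to_pdf_base64(b"%PDF-1.7")
cleaned = strip_data_uri("data:application/pdf;base64," + encoded)
```

Here cleaned and encoded hold the same string, which is the form pdf_base64 expects.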

pdf_url (conditional)

A publicly accessible URL where the PDF can be fetched. The server will download the PDF from this URL before processing. Supports redirects.

file (conditional, multipart only)

Direct file upload via multipart form data. This is the simplest option when you have the PDF file available locally.

pattern (required)

The search pattern to look for. By default, this is treated as a literal string. Set regex: true to use regular expression syntax.

regex (optional)

When true, the pattern is interpreted as a regular expression. When false (default), the pattern is treated as a literal string and special characters are escaped automatically.

ignore_case (optional)

When true (default), matching is case-insensitive. Set to false for case-sensitive searches.

pages (optional)

Specify which pages to search:

JSON format: Array of integers [1, 2, 5]
Multipart format: Comma-separated string "1,2,5"

Page numbers are 1-indexed (first page is 1, not 0). If omitted, all pages are searched.
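
Because the two transports expect different shapes for pages, a small conversion helper can keep client code consistent. This is a sketch only; the function names are hypothetical, and sorting and de-duplicating the page list is a choice made here, not something the API is documented to require.

```python
from typing import Iterable, List


def pages_for_json(pages: Iterable[int]) -> List[int]:
    """Return 1-indexed page numbers as a list for a JSON body."""
    result = sorted(set(pages))
    if any(p < 1 for p in result):
        raise ValueError("page numbers are 1-indexed; the first page is 1")
    return result


def pages_for_multipart(pages: Iterable[int]) -> str:
    """Return the same pages as a comma-separated string for form data."""
    return ",".join(str(p) for p in pages_for_json(pages))


json_pages = pages_for_json([5, 1, 2, 2])    # [1, 2, 5]
form_pages = pages_for_multipart([1, 2, 5])  # "1,2,5"
```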

context (optional)

Number of characters to include before and after each match for context. Default is 100, maximum is 500. Set to 0 to return only the matched text without surrounding context.

count_only (optional)

When true, only return match counts per page without the actual match text and context. Useful for quickly determining if and where matches exist.

Example Request

Basic Search (JSON with Base64)

# First, encode your PDF to base64 as a single line
# (wrapped output would put raw newlines inside the JSON string)
PDF_BASE64=$(base64 < document.pdf | tr -d '\n')

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$PDF_BASE64"'",
    "pattern": "invoice"
  }'

Search from URL

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf",
    "pattern": "revenue"
  }'

Case-Sensitive Regex Search

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/code-review.pdf",
    "pattern": "TODO|FIXME|HACK",
    "regex": true,
    "ignore_case": false
  }'

Search Specific Pages with Extended Context

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/manual.pdf",
    "pattern": "error",
    "pages": [1, 5, 10],
    "context": 200
  }'

Count Matches Only

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf",
    "pattern": "confidential",
    "count_only": true
  }'

Using File Upload (Multipart)

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "pattern=search term"

File Upload with Regex

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F 'pattern=\$[0-9,]+\.[0-9]{2}' \
  -F "regex=true"

File Upload with All Options

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "pattern=important" \
  -F "ignore_case=false" \
  -F "pages=1,2,3" \
  -F "context=150"

Response

Success

Returns JSON with matches organized by page.

{
  "matches": [
    {
      "page": 1,
      "match_count": 2,
      "matches": [
        {
          "match": "invoice",
          "position": 156,
          "context_before": "Please find attached the ",
          "context_after": " for services rendered in Q4 2024."
        },
        {
          "match": "Invoice",
          "position": 892,
          "context_before": "Terms and Conditions: ",
          "context_after": " must be paid within 30 days."
        }
      ]
    },
    {
      "page": 3,
      "match_count": 1,
      "matches": [
        {
          "match": "invoice",
          "position": 45,
          "context_before": "Reference this ",
          "context_after": " number for all inquiries."
        }
      ]
    }
  ],
  "total_matches": 3,
  "pages_with_matches": 2,
  "total_pages": 5,
  "pattern": "invoice",
  "flags": {
    "regex": false,
    "ignore_case": true
  }
}

Response with count_only

When count_only is true, each page entry contains only page and match_count; the per-match text and context are omitted:

{
  "matches": [
    {
      "page": 1,
      "match_count": 2
    },
    {
      "page": 3,
      "match_count": 1
    }
  ],
  "total_matches": 3,
  "pages_with_matches": 2,
  "total_pages": 5,
  "pattern": "invoice",
  "flags": {
    "regex": false,
    "ignore_case": true
  }
}

Response Fields:

Field	Type	Description
`matches`	array	Array of page objects with match information
`matches[].page`	integer	Page number (1-indexed)
`matches[].match_count`	integer	Number of matches found on this page
`matches[].matches`	array	Array of match details (omitted when count_only is true)
`matches[].matches[].match`	string	The matched text
`matches[].matches[].position`	integer	Character position in page text
`matches[].matches[].context_before`	string	Text before the match
`matches[].matches[].context_after`	string	Text after the match
`total_matches`	integer	Total number of matches across all pages
`pages_with_matches`	integer	Number of pages containing at least one match
`total_pages`	integer	Total number of pages in the PDF
`pattern`	string	The search pattern used
`flags`	object	Search flags applied
`flags.regex`	boolean	Whether regex mode was enabled
`flags.ignore_case`	boolean	Whether case-insensitive mode was enabled
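
To show how the nested structure can be consumed, the sketch below flattens a response into (page, position, match) tuples. The helper name iter_matches is hypothetical; the field names come from the table above.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_matches(result: Dict[str, Any]) -> Iterator[Tuple[int, int, str]]:
    """Yield (page, position, matched text) for every match in a response."""
    for page in result.get("matches", []):
        # The inner "matches" list is absent when count_only was true.
        for hit in page.get("matches", []):
            yield page["page"], hit["position"], hit["match"]


sample = {
    "matches": [
        {"page": 1, "match_count": 2, "matches": [
            {"match": "invoice", "position": 156,
             "context_before": "Please find attached the ",
             "context_after": " for services rendered in Q4 2024."},
            {"match": "Invoice", "position": 892,
             "context_before": "Terms and Conditions: ",
             "context_after": " must be paid within 30 days."},
        ]},
        {"page": 3, "match_count": 1},  # shaped like a count_only entry
    ],
    "total_matches": 3,
}

print(list(iter_matches(sample)))
# [(1, 156, 'invoice'), (1, 892, 'Invoice')]
```

Using page.get("matches", []) lets the same loop handle both full and count_only responses.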

Error

{
  "error": "Failed to grep PDF",
  "message": "Error description"
}

Invalid Regex Pattern:

{
  "error": "Invalid regex pattern",
  "message": "unbalanced parenthesis at position 5",
  "pattern": "test(("
}

Status Codes:

Code	Description
200	Success - Matches returned as JSON
400	Bad Request - Invalid regex pattern or missing required fields
401	Unauthorized - Missing or invalid Authorization header
403	Forbidden - Invalid API key or OAuth token
500	Internal Server Error - Search failed
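
One way to turn these status codes into exceptions on the client is sketched below. GrepPdfError and check_response are hypothetical names, and treating only 5xx responses as retryable is an assumption rather than documented behavior.

```python
from typing import Any, Dict


class GrepPdfError(Exception):
    """Raised when /grepPdf returns a non-200 status (hypothetical wrapper)."""

    def __init__(self, status: int, body: Dict[str, Any]):
        self.status = status
        self.body = body
        # Error bodies carry "error" and "message" fields.
        detail = body.get("message") or body.get("error") or "unknown error"
        super().__init__(f"HTTP {status}: {detail}")

    @property
    def retryable(self) -> bool:
        # Assumption: 4xx errors need a corrected request, 5xx may be transient.
        return self.status >= 500


def check_response(status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Return the body on 200, otherwise raise GrepPdfError."""
    if status == 200:
        return body
    raise GrepPdfError(status, body)
```

With the requests library this could be called as check_response(response.status_code, response.json()).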

How Grep Works

Search Process

The API uses PyPDF to extract text from each page, then applies pattern matching:

Extract text content from each target page
Apply the search pattern (literal or regex) with specified flags
For each match, capture position and surrounding context
Return structured results organized by page

Pattern Matching

Literal mode (default): Special regex characters are escaped, pattern matches exactly as typed
Regex mode: Full Python regex syntax supported (uses re module)
Case sensitivity: Controlled by ignore_case flag
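
The three rules above correspond closely to Python's re module. The sketch below is a local approximation of how a pattern might be compiled, not the service's actual code; compile_pattern is a hypothetical name.

```python
import re


def compile_pattern(pattern: str, regex: bool = False, ignore_case: bool = True) -> re.Pattern:
    """Compile a pattern the way the rules above describe (local approximation)."""
    # Literal mode: escape regex metacharacters so the text matches as typed.
    source = pattern if regex else re.escape(pattern)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(source, flags)


literal = compile_pattern("$5.00")
print(bool(literal.search("Total: $5.00")))  # True
print(bool(literal.search("Total: $5x00")))  # False, "." is not a wildcard here

strict = compile_pattern("TODO|FIXME", regex=True, ignore_case=False)
print(strict.findall("todo FIXME"))          # ['FIXME']
```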

Context Extraction

Context characters are extracted from the page text surrounding each match:

context_before: Characters immediately preceding the match
context_after: Characters immediately following the match
Context is truncated at page boundaries (won’t wrap to adjacent pages)
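
The context rules can be reproduced locally with string slicing. The sketch below searches a single page's text and clamps the context at the start and end of that page; it assumes position is a 0-based character offset, which the API does not state explicitly. grep_page is a hypothetical name.

```python
import re
from typing import Any, Dict, List


def grep_page(text: str, pattern: re.Pattern, context: int = 100) -> List[Dict[str, Any]]:
    """Return match entries for one page, shaped like matches[].matches[]."""
    hits = []
    for m in pattern.finditer(text):
        start, end = m.span()
        hits.append({
            "match": m.group(0),
            "position": start,  # assumption: 0-based offset into the page text
            # max(0, ...) keeps the slice inside the page; Python clamps
            # a slice that runs past the end of the string on its own.
            "context_before": text[max(0, start - context):start],
            "context_after": text[end:end + context],
        })
    return hits


page_text = "Please find attached the invoice for services."
hits = grep_page(page_text, re.compile("invoice", re.IGNORECASE), context=10)
print(hits[0]["context_before"] + "[" + hits[0]["match"] + "]" + hits[0]["context_after"])
# ached the [invoice] for servi
```

With context=0 both context strings are empty, matching the behavior described for the context field.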

Use Cases

Find All Mentions of a Term

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/contract.pdf",
    "pattern": "liability"
  }'

Quickly locate all occurrences of a specific term in a legal document.

Extract Dollar Amounts

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/financial-report.pdf",
    "pattern": "\\$[0-9,]+\\.?[0-9]*",
    "regex": true
  }'

Use regex to find all monetary values in a financial document.

Find Email Addresses

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/contacts.pdf",
    "pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    "regex": true
  }'

Extract email addresses from a document.

Check for Sensitive Information

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/document.pdf",
    "pattern": "SSN|social security|password|secret",
    "regex": true,
    "count_only": true
  }'

Quickly scan for potential sensitive information without retrieving full context.

Search Table of Contents Only

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/manual.pdf",
    "pattern": "chapter",
    "pages": [1, 2, 3]
  }'

Limit search to the first few pages where the table of contents typically appears.

Code Examples

Python

import requests
import base64

# Using URL
response = requests.post(
    "https://api.pdf-mcp.io/grepPdf",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "pdf_url": "https://example.com/document.pdf",
        "pattern": "search term",
        "ignore_case": True,
        "context": 100
    }
)

result = response.json()
print(f"Found {result['total_matches']} matches across {result['pages_with_matches']} pages")

for page in result['matches']:
    print(f"\nPage {page['page']} ({page['match_count']} matches):")
    for match in page.get('matches', []):
        print(f"  ...{match['context_before']}{match['match']}{match['context_after']}...")

# Using file upload
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.pdf-mcp.io/grepPdf",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "pattern": "important",
            "regex": "false",
            "ignore_case": "true"
        }
    )

result = response.json()

# Using base64
with open("document.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.pdf-mcp.io/grepPdf",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "pdf_base64": pdf_base64,
        "pattern": r"\d{3}-\d{2}-\d{4}",  # SSN pattern
        "regex": True
    }
)

JavaScript (Node.js)

const fs = require('fs');

const API_URL = 'https://api.pdf-mcp.io/grepPdf';
const AUTH_HEADER = { 'Authorization': 'Bearer YOUR_API_KEY' };

// Wrapped in an async function because top-level await is not available
// in CommonJS modules. Requires Node.js 18+ (built-in fetch, FormData, Blob).
async function main() {
  // Using URL
  const urlResponse = await fetch(API_URL, {
    method: 'POST',
    headers: { ...AUTH_HEADER, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pdf_url: 'https://example.com/document.pdf',
      pattern: 'search term',
      ignore_case: true,
      context: 100
    })
  });

  const urlResult = await urlResponse.json();
  console.log(`Found ${urlResult.total_matches} matches across ${urlResult.pages_with_matches} pages`);

  urlResult.matches.forEach(page => {
    console.log(`\nPage ${page.page} (${page.match_count} matches):`);
    page.matches?.forEach(match => {
      console.log(`  ...${match.context_before}${match.match}${match.context_after}...`);
    });
  });

  // Using file upload with the built-in FormData
  // (fetch sets the multipart Content-Type and boundary itself)
  const form = new FormData();
  const pdfBlob = new Blob([fs.readFileSync('document.pdf')], { type: 'application/pdf' });
  form.append('file', pdfBlob, 'document.pdf');
  form.append('pattern', 'important');
  form.append('ignore_case', 'true');

  const uploadResponse = await fetch(API_URL, {
    method: 'POST',
    headers: AUTH_HEADER,
    body: form
  });

  const uploadResult = await uploadResponse.json();
  console.log(`Upload search: ${uploadResult.total_matches} matches`);

  // Using base64
  const pdfBase64 = fs.readFileSync('document.pdf').toString('base64');

  const base64Response = await fetch(API_URL, {
    method: 'POST',
    headers: { ...AUTH_HEADER, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pdf_base64: pdfBase64,
      pattern: '\\d{3}-\\d{2}-\\d{4}',  // SSN pattern
      regex: true
    })
  });

  const base64Result = await base64Response.json();
  console.log(`Base64 search: ${base64Result.total_matches} matches`);
}

main().catch(console.error);

Tips and Best Practices

Choosing Input Method

File upload: Best for local files, simplest to implement
Base64: Best for programmatic access when PDF is already in memory
URL: Best for processing PDFs already hosted online

Pattern Design

Start with literal search for exact terms
Use regex for pattern matching (dates, numbers, emails)
Test regex patterns locally before API calls
In regex mode, escape special characters when searching for literal symbols like $, ., (, etc. In literal mode (the default) they are escaped automatically.

Performance Optimization

Use count_only: true for initial scans to quickly identify relevant pages
Limit pages with the pages parameter when you know where to look
Reduce context if you don’t need surrounding text
For large documents, consider searching in page batches
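
The first, second, and last tips can be combined into a two-pass scan: one count_only request to find pages with matches, then detail requests limited to those pages in batches. The sketch below takes the HTTP call as a parameter so the flow can be read (and exercised) without a network; two_pass_search, batched, and the default batch size of 20 are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List

# A callable that sends one payload to /grepPdf and returns the parsed JSON.
GrepFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def batched(pages: List[int], size: int) -> Iterator[List[int]]:
    """Split a page list into consecutive chunks of at most `size` pages."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]


def two_pass_search(grep: GrepFn, base: Dict[str, Any], batch_size: int = 20) -> List[Dict[str, Any]]:
    """Scan with count_only first, then fetch context only for pages with hits."""
    # Pass 1: counts only, no match text or context.
    counts = grep({**base, "count_only": True})
    hit_pages = [p["page"] for p in counts["matches"] if p["match_count"] > 0]

    # Pass 2: full results, restricted to the pages that had matches.
    results: List[Dict[str, Any]] = []
    for batch in batched(hit_pages, batch_size):
        detail = grep({**base, "pages": batch})
        results.extend(detail["matches"])
    return results
```

In real use, grep could be a small function that wraps requests.post and returns response.json(). Each batch is a separate request, so a smaller batch size means more requests.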

Common Regex Patterns

Use Case	Pattern
Email addresses	`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
Phone numbers	`\d{3}[-.\s]?\d{3}[-.\s]?\d{4}`
Dollar amounts	`\$[0-9,]+\.?[0-9]*`
Dates (MM/DD/YYYY)	`\d{1,2}/\d{1,2}/\d{4}`
URLs	`https?://[^\s]+`
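
Following the tip to test patterns locally first, the patterns in the table can be tried against sample text with Python's re module before any API call. The sample sentence and the find_common helper are made up for illustration.

```python
import re
from typing import Dict, List

COMMON_PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}",
    "dollars": r"\$[0-9,]+\.?[0-9]*",
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "url": r"https?://[^\s]+",
}


def find_common(text: str) -> Dict[str, List[str]]:
    """Run every pattern from the table against a piece of text."""
    return {name: re.findall(source, text) for name, source in COMMON_PATTERNS.items()}


sample_text = (
    "Contact ana@example.com or 555-123-4567 by 12/31/2024; "
    "total $1,250.00 (https://example.com/pay)"
)
for name, found in find_common(sample_text).items():
    print(name, found)
# email ['ana@example.com']
# phone ['555-123-4567']
# dollars ['$1,250.00']
# date ['12/31/2024']
# url ['https://example.com/pay)']
```

The URL result includes the trailing parenthesis because [^\s]+ consumes everything up to the next whitespace; local testing surfaces this kind of over-matching before it shows up in API results.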

Error Handling

Validate regex patterns before sending to avoid 400 errors
Handle cases where no matches are found (empty matches array)
Check total_pages to verify the PDF was parsed correctly
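
Since regex mode uses Python's re module, a pattern can be validated with re.compile before it is sent. The sketch below returns the error text instead of raising; regex_error is a hypothetical name, and the exact wording of the message depends on the local Python version, so it may differ from the message the API returns.

```python
import re
from typing import Optional


def regex_error(pattern: str) -> Optional[str]:
    """Return a description of the syntax error, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


for candidate in ["TODO|FIXME|HACK", "test(("]:
    problem = regex_error(candidate)
    print(candidate, "->", "ok" if problem is None else f"invalid: {problem}")
```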

Related Endpoints

Extract Text - Extract all text from a PDF
Page Count - Get the number of pages in a PDF
Extract Pages - Extract specific pages as a new PDF

Credit Usage

0.01 credits per request, regardless of PDF size or number of matches.
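
Because pricing is per request, the cost of a batched search depends only on how many requests are made. The following sketch estimates that cost for a search split into page batches; estimate_credits and the batch sizes are illustrative, and Decimal is used to avoid floating-point rounding.

```python
import math
from decimal import Decimal

CREDITS_PER_REQUEST = Decimal("0.01")  # from the pricing statement above


def estimate_credits(total_pages: int, batch_size: int) -> Decimal:
    """Credits needed to search every page when sending batch_size pages per request."""
    requests_needed = math.ceil(total_pages / batch_size)
    return CREDITS_PER_REQUEST * requests_needed


print(estimate_credits(250, 250))  # 0.01 (one request for the whole document)
print(estimate_credits(250, 20))   # 0.13 (13 requests)
```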