Search for text patterns within PDF documents. This endpoint works like grep for PDFs: it searches the text content of each page and returns the matching text with page numbers, character positions, and surrounding context. It is suited to coding agents and automated document analysis.

Endpoint

POST /grepPdf

Authentication

Requires a valid API key or OAuth token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

See Authentication for details.


Request Body

Content-Type: application/json or multipart/form-data

| Field | Type | Required | Description |
|---|---|---|---|
| pdf_base64 | string | Conditional | Base64-encoded PDF file |
| pdf_url | string | Conditional | URL to fetch the PDF from |
| file | file | Conditional | PDF file upload (multipart only) |
| pattern | string | Yes | Search pattern (plain text or regex) |
| regex | boolean | No | Treat pattern as regex (default: false) |
| ignore_case | boolean | No | Case-insensitive search (default: true) |
| pages | array/string | No | Page numbers to search (1-indexed) |
| context | integer | No | Characters of context around each match (default: 100, max: 500) |
| count_only | boolean | No | Only return match counts, not text context (default: false) |

Note: You must provide exactly one of: pdf_base64, pdf_url, or file.
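The "exactly one source" rule can be checked on the client before a request is sent. The following is a minimal sketch; the function name `build_source_fields` and its argument names are hypothetical and not part of the API.

```python
def build_source_fields(pdf_base64=None, pdf_url=None, file_path=None):
    """Return the single PDF source field to send, or raise if zero or several are given.

    Hypothetical client-side helper; the key names match the request fields above.
    """
    candidates = {"pdf_base64": pdf_base64, "pdf_url": pdf_url, "file": file_path}
    provided = {name: value for name, value in candidates.items() if value is not None}
    if len(provided) != 1:
        given = ", ".join(sorted(provided)) or "none"
        raise ValueError(f"expected exactly one of pdf_base64, pdf_url, file; got: {given}")
    return provided
```

Calling it with two sources, or with none, raises a ValueError locally instead of spending a request on a 400 response.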

Field Details

pdf_base64 (conditional)

A base64-encoded PDF file. Use this when you have the PDF data in memory or need to send it as part of a JSON payload. The encoded string should not include the data URI prefix.
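As an illustration, the helpers below encode raw PDF bytes and drop a data URI prefix if one is present. This is a sketch; both function names are hypothetical.

```python
import base64


def encode_pdf_bytes(data: bytes) -> str:
    """Base64-encode raw PDF bytes as an ASCII string with no line wrapping."""
    return base64.b64encode(data).decode("ascii")


def strip_data_uri_prefix(value: str) -> str:
    """Drop a leading 'data:...;base64,' prefix, leaving only the encoded payload."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value
```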

pdf_url (conditional)

A publicly accessible URL where the PDF can be fetched. The server will download the PDF from this URL before processing. Supports redirects.

file (conditional, multipart only)

Direct file upload via multipart form data. This is the simplest option when you have the PDF file available locally.

pattern (required)

The search pattern to look for. By default, this is treated as a literal string. Set regex: true to use regular expression syntax.

regex (optional)

When true, the pattern is interpreted as a regular expression. When false (default), the pattern is treated as a literal string and special characters are escaped automatically.

ignore_case (optional)

When true (default), matching is case-insensitive. Set to false for case-sensitive searches.

pages (optional)

Specify which pages to search:

  • JSON format: Array of integers [1, 2, 5]
  • Multipart format: Comma-separated string "1,2,5"

Page numbers are 1-indexed (first page is 1, not 0). If omitted, all pages are searched.
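The two page formats can be produced from the same list. The sketch below uses hypothetical helper names and rejects values below 1, on the assumption that the API would not accept them.

```python
def _validated(pages):
    """Coerce to integers and reject anything below 1 (pages are 1-indexed)."""
    pages = [int(p) for p in pages]
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-indexed; values below 1 are invalid")
    return pages


def pages_for_json(pages):
    """JSON bodies take an array of integers, e.g. [1, 2, 5]."""
    return _validated(pages)


def pages_for_multipart(pages):
    """Multipart bodies take a comma-separated string, e.g. "1,2,5"."""
    return ",".join(str(p) for p in _validated(pages))
```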

context (optional)

Number of characters to include before and after each match for context. Default is 100, maximum is 500. Set to 0 to return only the matched text without surrounding context.

count_only (optional)

When true, only return match counts per page without the actual match text and context. Useful for quickly determining if and where matches exist.


Example Request

Basic Search (JSON with Base64)

# First, encode your PDF to base64 (line wraps are stripped so the JSON string stays valid)
PDF_BASE64=$(base64 < document.pdf | tr -d '\n')

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$PDF_BASE64"'",
    "pattern": "invoice"
  }'

Search from URL

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf",
    "pattern": "revenue"
  }'

Regex Search (Case-Sensitive)

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/code-review.pdf",
    "pattern": "TODO|FIXME|HACK",
    "regex": true,
    "ignore_case": false
  }'

Search Specific Pages with Extended Context

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/manual.pdf",
    "pattern": "error",
    "pages": [1, 5, 10],
    "context": 200
  }'

Count Matches Only

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf",
    "pattern": "confidential",
    "count_only": true
  }'

Using File Upload (Multipart)

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "pattern=search term"

File Upload with Regex

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F 'pattern=\$[0-9,]+\.[0-9]{2}' \
  -F "regex=true"

File Upload with All Options

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "pattern=important" \
  -F "ignore_case=false" \
  -F "pages=1,2,3" \
  -F "context=150"

Response

Success

Returns JSON with matches organized by page.

{
  "matches": [
    {
      "page": 1,
      "match_count": 2,
      "matches": [
        {
          "match": "invoice",
          "position": 156,
          "context_before": "Please find attached the ",
          "context_after": " for services rendered in Q4 2024."
        },
        {
          "match": "Invoice",
          "position": 892,
          "context_before": "Terms and Conditions: ",
          "context_after": " must be paid within 30 days."
        }
      ]
    },
    {
      "page": 3,
      "match_count": 1,
      "matches": [
        {
          "match": "invoice",
          "position": 45,
          "context_before": "Reference this ",
          "context_after": " number for all inquiries."
        }
      ]
    }
  ],
  "total_matches": 3,
  "pages_with_matches": 2,
  "total_pages": 5,
  "pattern": "invoice",
  "flags": {
    "regex": false,
    "ignore_case": true
  }
}

Response with count_only

When count_only: true, matches are returned without text context:

{
  "matches": [
    {
      "page": 1,
      "match_count": 2
    },
    {
      "page": 3,
      "match_count": 1
    }
  ],
  "total_matches": 3,
  "pages_with_matches": 2,
  "total_pages": 5,
  "pattern": "invoice",
  "flags": {
    "regex": false,
    "ignore_case": true
  }
}

Response Fields:

| Field | Type | Description |
|---|---|---|
| matches | array | Array of page objects with match information |
| matches[].page | integer | Page number (1-indexed) |
| matches[].match_count | integer | Number of matches found on this page |
| matches[].matches | array | Array of match details (omitted if count_only) |
| matches[].matches[].match | string | The matched text |
| matches[].matches[].position | integer | Character position in page text |
| matches[].matches[].context_before | string | Text before the match |
| matches[].matches[].context_after | string | Text after the match |
| total_matches | integer | Total number of matches across all pages |
| pages_with_matches | integer | Number of pages containing at least one match |
| total_pages | integer | Total number of pages in the PDF |
| pattern | string | The search pattern used |
| flags | object | Search flags applied |
| flags.regex | boolean | Whether regex mode was enabled |
| flags.ignore_case | boolean | Whether case-insensitive mode was enabled |
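To show how the nested response shape can be consumed, the sketch below flattens a parsed response into simple tuples and per-page counts. The helper names are hypothetical. Both helpers also accept count_only responses, where the inner matches array is absent.

```python
def iter_matches(result):
    """Yield (page, position, matched_text) for every match in a parsed response."""
    for page in result.get("matches", []):
        # The inner "matches" key is omitted when count_only was set.
        for match in page.get("matches", []):
            yield page["page"], match["position"], match["match"]


def counts_by_page(result):
    """Map page number to match_count; works for full and count_only responses."""
    return {page["page"]: page["match_count"] for page in result.get("matches", [])}
```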

Error

{
  "error": "Failed to grep PDF",
  "message": "Error description"
}

Invalid Regex Pattern:

{
  "error": "Invalid regex pattern",
  "message": "unbalanced parenthesis at position 5",
  "pattern": "test(("
}

Status Codes:

| Code | Description |
|---|---|
| 200 | Success - Matches returned as JSON |
| 400 | Bad Request - Invalid regex pattern or missing required fields |
| 401 | Unauthorized - Missing or invalid Authorization header |
| 403 | Forbidden - Invalid API key or OAuth token |
| 500 | Internal Server Error - Search failed |
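One way to handle these codes on the client is to turn any non-200 response into an exception that carries the error and message fields. The sketch below is a hypothetical helper, not part of any official SDK. It takes the status code and the parsed JSON body, so it works with any HTTP library.

```python
class GrepPdfError(Exception):
    """Raised for non-200 responses; carries the fields from the error body."""

    def __init__(self, status, error, message):
        super().__init__(f"{status} {error}: {message}")
        self.status = status
        self.error = error
        self.message = message


def check_response(status_code, body):
    """Return the body on 200, otherwise raise GrepPdfError with the error details."""
    if status_code == 200:
        return body
    body = body if isinstance(body, dict) else {}
    raise GrepPdfError(
        status_code,
        body.get("error", "Unknown error"),
        body.get("message", ""),
    )
```

With requests, this could be called as `check_response(response.status_code, response.json())`.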

How Grep Works

Search Process

The API uses PyPDF to extract text from each page, then applies pattern matching:

  1. Extract text content from each target page
  2. Apply the search pattern (literal or regex) with specified flags
  3. For each match, capture position and surrounding context
  4. Return structured results organized by page
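The four steps can be approximated locally on text that has already been extracted. This is an illustrative sketch, not the service's implementation. It assumes 0-based character positions and that pages without matches are left out of the result, as in the sample response. Context capture is omitted for brevity, and `grep_pages` is a hypothetical name.

```python
import re


def grep_pages(page_texts, pattern):
    """Search pre-extracted page text; page_texts[0] is page 1."""
    # Request defaults: literal pattern, case-insensitive.
    regex = re.compile(re.escape(pattern), re.IGNORECASE)
    pages = []
    for number, text in enumerate(page_texts, start=1):  # step 1: one text per page
        found = [
            {"match": m.group(0), "position": m.start()}  # steps 2-3, without context
            for m in regex.finditer(text)
        ]
        if found:  # step 4: group results by page
            pages.append({"page": number, "match_count": len(found), "matches": found})
    return {
        "matches": pages,
        "total_matches": sum(p["match_count"] for p in pages),
        "pages_with_matches": len(pages),
        "total_pages": len(page_texts),
    }
```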

Pattern Matching

  • Literal mode (default): Special regex characters are escaped, so the pattern matches as typed (subject to ignore_case)
  • Regex mode: Full Python regex syntax supported (uses re module)
  • Case sensitivity: Controlled by ignore_case flag
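The modes above correspond to a small amount of standard `re` usage. The sketch below shows one plausible way to build the matcher locally. `compile_pattern` is a hypothetical name, and using `re.escape` for literal mode is an assumption about how the service escapes special characters.

```python
import re


def compile_pattern(pattern, regex=False, ignore_case=True):
    """Build a compiled matcher using the same defaults as the request fields."""
    source = pattern if regex else re.escape(pattern)  # literal mode escapes . $ ( etc.
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(source, flags)
```

For example, `compile_pattern("a.c")` matches only the literal text a.c, while `compile_pattern("a.c", regex=True)` also matches abc.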

Context Extraction

Context characters are extracted from the page text surrounding each match:

  • context_before: Characters immediately preceding the match
  • context_after: Characters immediately following the match
  • Context is truncated at page boundaries (won’t wrap to adjacent pages)
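Truncation at page boundaries falls out of ordinary string slicing. A minimal sketch follows, assuming the context value counts characters on each side of the match. `extract_context` is a hypothetical name.

```python
def extract_context(page_text, start, end, context=100):
    """Return (context_before, context_after) for the match at page_text[start:end].

    Slices are clamped to the page text, so context never extends past the
    first or last character of the page.
    """
    before = page_text[max(0, start - context):start]
    after = page_text[end:end + context]
    return before, after
```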

Use Cases

Find All Mentions of a Term

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/contract.pdf",
    "pattern": "liability"
  }'

Quickly locate all occurrences of a specific term in a legal document.

Extract Dollar Amounts

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/financial-report.pdf",
    "pattern": "\\$[0-9,]+\\.?[0-9]*",
    "regex": true
  }'

Use regex to find all monetary values in a financial document.

Find Email Addresses

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/contacts.pdf",
    "pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    "regex": true
  }'

Extract email addresses from a document.

Check for Sensitive Information

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/document.pdf",
    "pattern": "SSN|social security|password|secret",
    "regex": true,
    "count_only": true
  }'

Quickly scan for potential sensitive information without retrieving full context.

Search Table of Contents Only

curl -X POST https://api.pdf-mcp.io/grepPdf \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/manual.pdf",
    "pattern": "chapter",
    "pages": [1, 2, 3]
  }'

Limit search to the first few pages where the table of contents typically appears.


Code Examples

Python

import requests
import base64

# Using URL
response = requests.post(
    "https://api.pdf-mcp.io/grepPdf",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "pdf_url": "https://example.com/document.pdf",
        "pattern": "search term",
        "ignore_case": True,
        "context": 100
    }
)

result = response.json()
print(f"Found {result['total_matches']} matches across {result['pages_with_matches']} pages")

for page in result['matches']:
    print(f"\nPage {page['page']} ({page['match_count']} matches):")
    for match in page.get('matches', []):
        print(f"  ...{match['context_before']}{match['match']}{match['context_after']}...")

# Using file upload
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.pdf-mcp.io/grepPdf",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "pattern": "important",
            "regex": "false",
            "ignore_case": "true"
        }
    )

result = response.json()

# Using base64
with open("document.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.pdf-mcp.io/grepPdf",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "pdf_base64": pdf_base64,
        "pattern": r"\d{3}-\d{2}-\d{4}",  # SSN pattern
        "regex": True
    }
)

JavaScript (Node.js)

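// Run each snippet inside an async function; top-level await is not available in CommonJS scripts.
const fs = require('fs');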

// Using URL
const response = await fetch('https://api.pdf-mcp.io/grepPdf', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pdf_url: 'https://example.com/document.pdf',
    pattern: 'search term',
    ignore_case: true,
    context: 100
  })
});

const result = await response.json();
console.log(`Found ${result.total_matches} matches across ${result.pages_with_matches} pages`);

result.matches.forEach(page => {
  console.log(`\nPage ${page.page} (${page.match_count} matches):`);
  page.matches?.forEach(match => {
    console.log(`  ...${match.context_before}${match.match}${match.context_after}...`);
  });
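});

// Using file upload with the built-in FormData and Blob (Node.js 18+).
// The form-data npm package is not serialized correctly by the built-in fetch.
const form = new FormData();
form.append(
  'file',
  new Blob([fs.readFileSync('document.pdf')], { type: 'application/pdf' }),
  'document.pdf'
);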
form.append('pattern', 'important');
form.append('ignore_case', 'true');

const response = await fetch('https://api.pdf-mcp.io/grepPdf', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: form
});

const result = await response.json();

// Using base64
const pdfBuffer = fs.readFileSync('document.pdf');
const pdfBase64 = pdfBuffer.toString('base64');

const response = await fetch('https://api.pdf-mcp.io/grepPdf', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    pdf_base64: pdfBase64,
    pattern: '\\d{3}-\\d{2}-\\d{4}',  // SSN pattern
    regex: true
  })
});

Tips and Best Practices

Choosing Input Method

  • File upload: Best for local files, simplest to implement
  • Base64: Best for programmatic access when PDF is already in memory
  • URL: Best for processing PDFs already hosted online

Pattern Design

  • Start with literal search for exact terms
  • Use regex for pattern matching (dates, numbers, emails)
  • Test regex patterns locally before API calls
  • In regex mode, escape special characters when searching for literal symbols like $, ., (, etc.; literal mode escapes them automatically

Performance Optimization

  • Use count_only: true for initial scans to quickly identify relevant pages
  • Limit pages with the pages parameter when you know where to look
  • Reduce context if you don’t need surrounding text
  • For large documents, consider searching in page batches

Common Regex Patterns

| Use Case | Pattern |
|---|---|
| Email addresses | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} |
| Phone numbers | \d{3}[-.\s]?\d{3}[-.\s]?\d{4} |
| Dollar amounts | \$[0-9,]+\.?[0-9]* |
| Dates (MM/DD/YYYY) | \d{1,2}/\d{1,2}/\d{4} |
| URLs | https?://[^\s]+ |
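Since regex mode uses Python's re module, these patterns can be tried locally before any request is sent. The sketch below runs each one against a made-up sample string. The dictionary keys, the sample text, and the `find_all` helper are illustrative only.

```python
import re

# Patterns copied from the table above, as raw strings.
PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}",
    "dollar": r"\$[0-9,]+\.?[0-9]*",
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "url": r"https?://[^\s]+",
}

SAMPLE = (
    "Contact jane@example.com or 555-123-4567 by 3/15/2024. "
    "Total: $1,250.00. See https://example.com/terms"
)


def find_all(text=SAMPLE):
    """Return every match of each pattern in the given text."""
    return {name: re.findall(pattern, text) for name, pattern in PATTERNS.items()}
```

The URL pattern stops only at whitespace, so it also captures trailing punctuation when a URL ends a sentence.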

Error Handling

  • Validate regex patterns before sending to avoid 400 errors
  • Handle cases where no matches are found (empty matches array)
  • Check total_pages to verify the PDF was parsed correctly
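Because regex mode is described as using Python's re module, compiling the pattern locally is a reasonable pre-flight check. The sketch below uses a hypothetical `validate_regex` helper. Its error text may differ from the message the API returns.

```python
import re


def validate_regex(pattern):
    """Return None if the pattern compiles, otherwise a description of the problem."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None
```

A caller can skip the request and report the problem when the result is not None.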

Credit Usage

0.01 credits per request, regardless of PDF size or number of matches.
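Because pricing is flat per request, estimating usage is a single multiplication. The sketch below uses Decimal to avoid floating-point rounding in the total. The constant and function names are illustrative.

```python
from decimal import Decimal

CREDITS_PER_REQUEST = Decimal("0.01")  # flat rate stated above


def estimate_credits(request_count):
    """Total credits for a number of requests, independent of PDF size or match count."""
    return CREDITS_PER_REQUEST * request_count
```

For example, 250 requests come to 2.50 credits.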