Search for text patterns within PDF documents. This endpoint works like grep but for PDFs - it searches text content per-page and returns matching text with page numbers, character positions, and surrounding context. Ideal for coding agents and automated document analysis.
Endpoint
POST /grepPdf
Authentication
Requires a valid API key or OAuth token in the Authorization header:
Authorization: Bearer YOUR_API_KEY
See Authentication for details.
Request Body
Content-Type: application/json or multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| pdf_base64 | string | Conditional | Base64-encoded PDF file |
| pdf_url | string | Conditional | URL to fetch the PDF from |
| file | file | Conditional | PDF file upload (multipart only) |
| pattern | string | Yes | Search pattern (plain text or regex) |
| regex | boolean | No | Treat pattern as regex (default: false) |
| ignore_case | boolean | No | Case-insensitive search (default: true) |
| pages | array/string | No | Page numbers to search (1-indexed) |
| context | integer | No | Characters of context around each match (default: 100, max: 500) |
| count_only | boolean | No | Only return match counts, not text context (default: false) |
Note: You must provide exactly one of: pdf_base64, pdf_url, or file.
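The exactly-one-source rule can be checked on the client before a request is sent. The following is a minimal sketch; `pick_pdf_source` is a hypothetical helper name, not part of the API, and it only inspects a plain dictionary of request fields.

```python
# Hypothetical client-side check for the "exactly one PDF source" rule.
PDF_SOURCE_FIELDS = ("pdf_base64", "pdf_url", "file")


def pick_pdf_source(fields: dict) -> str:
    """Return the single PDF source field present, or raise ValueError."""
    present = [name for name in PDF_SOURCE_FIELDS if fields.get(name)]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {PDF_SOURCE_FIELDS}, got {present or 'none'}"
        )
    return present[0]


source = pick_pdf_source(
    {"pdf_url": "https://example.com/report.pdf", "pattern": "revenue"}
)
```

Failing early on the client avoids spending a request on a payload the server would presumably reject with a 400.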
Field Details
pdf_base64 (conditional)
A base64-encoded PDF file. Use this when you have the PDF data in memory or need to send it as part of a JSON payload. The encoded string should not include the data URI prefix.
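As an illustration of the "no data URI prefix" requirement, the sketch below encodes raw bytes and strips a prefix such as `data:application/pdf;base64,` if one is present. Both function names are hypothetical helpers, not part of the API.

```python
import base64


def encode_pdf_bytes(data: bytes) -> str:
    """Encode raw PDF bytes as a single-line base64 string."""
    return base64.b64encode(data).decode("ascii")


def strip_data_uri_prefix(value: str) -> str:
    """Drop a leading 'data:...;base64,' prefix, if there is one."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value


encoded = encode_pdf_bytes(b"%PDF-")
cleaned = strip_data_uri_prefix("data:application/pdf;base64," + encoded)
```

The second helper is useful when the base64 string comes from a browser `FileReader` or similar source that adds the prefix automatically.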
pdf_url (conditional)
A publicly accessible URL where the PDF can be fetched. The server will download the PDF from this URL before processing. Supports redirects.
file (conditional, multipart only)
Direct file upload via multipart form data. This is the simplest option when you have the PDF file available locally.
pattern (required)
The search pattern to look for. By default, this is treated as a literal string. Set regex: true to use regular expression syntax.
regex (optional)
When true, the pattern is interpreted as a regular expression. When false (default), the pattern is treated as a literal string and special characters are escaped automatically.
ignore_case (optional)
When true (default), matching is case-insensitive. Set to false for case-sensitive searches.
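The interaction between regex and ignore_case can be approximated locally. This sketch assumes the server behaves like Python's `re.escape` combined with `re.IGNORECASE`, which this page suggests (it mentions the `re` module) but does not spell out; `compile_pattern` is a hypothetical helper.

```python
import re


def compile_pattern(pattern: str, regex: bool = False, ignore_case: bool = True):
    """Approximate how the endpoint is described to interpret a pattern."""
    source = pattern if regex else re.escape(pattern)  # literal mode escapes
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(source, flags)


# Literal mode: "." and "$" are plain characters.
literal_hit = compile_pattern("$5.00").search("Total: $5.00")
literal_miss = compile_pattern("a.c").search("abc")
# Regex mode: "." matches any character.
regex_hit = compile_pattern("a.c", regex=True).search("abc")
# Case handling follows ignore_case.
case_hit = compile_pattern("Invoice").search("see invoice 42")
case_miss = compile_pattern("Invoice", ignore_case=False).search("see invoice 42")
```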
pages (optional)
Specify which pages to search:
- JSON format: Array of integers, e.g. [1, 2, 5]
- Multipart format: Comma-separated string, e.g. "1,2,5"
Page numbers are 1-indexed (first page is 1, not 0). If omitted, all pages are searched.
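Because the two request formats expect different shapes for pages, a small conversion helper can keep client code consistent. This is a sketch with hypothetical helper names; the 1-indexed check mirrors the rule stated above.

```python
def pages_for_json(pages) -> list:
    """Validate 1-indexed page numbers and return them as a list of ints."""
    numbers = [int(p) for p in pages]
    if any(n < 1 for n in numbers):
        raise ValueError("page numbers are 1-indexed; 0 and negatives are invalid")
    return numbers


def pages_for_multipart(pages) -> str:
    """Return the same pages as a comma-separated string for form uploads."""
    return ",".join(str(n) for n in pages_for_json(pages))


json_pages = pages_for_json([1, 2, 5])
form_pages = pages_for_multipart([1, 2, 5])
```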
context (optional)
Number of characters to include before and after each match for context. Default is 100, maximum is 500. Set to 0 to return only the matched text without surrounding context.
count_only (optional)
When true, only return match counts per page without the actual match text and context. Useful for quickly determining if and where matches exist.
Example Request
Basic Search (JSON with Base64)
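# First, encode your PDF to base64 as a single line (no wrapping)
# macOS syntax shown; with GNU coreutils use: base64 -w 0 document.pdf
PDF_BASE64=$(base64 -i document.pdf)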
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_base64": "'"$PDF_BASE64"'",
"pattern": "invoice"
}'
Search from URL
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/report.pdf",
"pattern": "revenue"
}'
Case-Sensitive Regex Search
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/code-review.pdf",
"pattern": "TODO|FIXME|HACK",
"regex": true,
"ignore_case": false
}'
Search Specific Pages with Extended Context
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/manual.pdf",
"pattern": "error",
"pages": [1, 5, 10],
"context": 200
}'
Count Matches Only
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/report.pdf",
"pattern": "confidential",
"count_only": true
}'
Using File Upload (Multipart)
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf" \
-F "pattern=search term"
File Upload with Regex
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf" \
-F 'pattern=\$[0-9,]+\.[0-9]{2}' \
-F "regex=true"
File Upload with All Options
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf" \
-F "pattern=important" \
-F "ignore_case=false" \
-F "pages=1,2,3" \
-F "context=150"
Response
Success
Returns JSON with matches organized by page.
{
"matches": [
{
"page": 1,
"match_count": 2,
"matches": [
{
"match": "invoice",
"position": 156,
"context_before": "Please find attached the ",
"context_after": " for services rendered in Q4 2024."
},
{
"match": "Invoice",
"position": 892,
"context_before": "Terms and Conditions: ",
"context_after": " must be paid within 30 days."
}
]
},
{
"page": 3,
"match_count": 1,
"matches": [
{
"match": "invoice",
"position": 45,
"context_before": "Reference this ",
"context_after": " number for all inquiries."
}
]
}
],
"total_matches": 3,
"pages_with_matches": 2,
"total_pages": 5,
"pattern": "invoice",
"flags": {
"regex": false,
"ignore_case": true
}
}
Response with count_only
When count_only is true, each page entry contains only page and match_count; the nested matches array with text and context is omitted:
{
"matches": [
{
"page": 1,
"match_count": 2
},
{
"page": 3,
"match_count": 1
}
],
"total_matches": 3,
"pages_with_matches": 2,
"total_pages": 5,
"pattern": "invoice",
"flags": {
"regex": false,
"ignore_case": true
}
}
Response Fields:
| Field | Type | Description |
|---|---|---|
| matches | array | Array of page objects with match information |
| matches[].page | integer | Page number (1-indexed) |
| matches[].match_count | integer | Number of matches found on this page |
| matches[].matches | array | Array of match details (omitted if count_only) |
| matches[].matches[].match | string | The matched text |
| matches[].matches[].position | integer | Character position in page text |
| matches[].matches[].context_before | string | Text before the match |
| matches[].matches[].context_after | string | Text after the match |
| total_matches | integer | Total number of matches across all pages |
| pages_with_matches | integer | Number of pages containing at least one match |
| total_pages | integer | Total number of pages in the PDF |
| pattern | string | The search pattern used |
| flags | object | Search flags applied |
| flags.regex | boolean | Whether regex mode was enabled |
| flags.ignore_case | boolean | Whether case-insensitive mode was enabled |
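To show how the nested fields fit together, the sketch below flattens a response into grep-style lines. `format_matches` is a hypothetical helper and the sample data is a trimmed copy of the example response above; the page entry without a nested matches array stands in for a count_only result.

```python
def format_matches(result: dict) -> list:
    """Turn a grepPdf response into 'page:position: context' lines."""
    lines = []
    for page in result.get("matches", []):
        # count_only responses have no nested "matches" key.
        for hit in page.get("matches", []):
            lines.append(
                f"{page['page']}:{hit['position']}: "
                f"{hit['context_before']}[{hit['match']}]{hit['context_after']}"
            )
    return lines


sample = {
    "matches": [
        {
            "page": 1,
            "match_count": 1,
            "matches": [
                {
                    "match": "invoice",
                    "position": 156,
                    "context_before": "Please find attached the ",
                    "context_after": " for services rendered in Q4 2024.",
                }
            ],
        },
        {"page": 3, "match_count": 1},
    ],
    "total_matches": 2,
}

lines = format_matches(sample)
```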
Error
{
"error": "Failed to grep PDF",
"message": "Error description"
}
Invalid Regex Pattern:
{
"error": "Invalid regex pattern",
"message": "unbalanced parenthesis at position 5",
"pattern": "test(("
}
Status Codes:
| Code | Description |
|---|---|
| 200 | Success - Matches returned as JSON |
| 400 | Bad Request - Invalid regex pattern or missing required fields |
| 401 | Unauthorized - Missing or invalid Authorization header |
| 403 | Forbidden - Invalid API key or OAuth token |
| 500 | Internal Server Error - Search failed |
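A client can branch on the status code and error body along these lines. This is a sketch only: `describe_response` is a hypothetical helper, and it assumes error bodies always carry the error and message keys shown in the examples above.

```python
def describe_response(status_code: int, body: dict) -> str:
    """Summarize a grepPdf response for logging, based on the table above."""
    if status_code == 200:
        return f"ok: {body.get('total_matches', 0)} matches"
    if status_code == 400 and body.get("error") == "Invalid regex pattern":
        return f"invalid regex {body.get('pattern')!r}: {body.get('message')}"
    labels = {
        400: "bad request",
        401: "unauthorized",
        403: "forbidden",
        500: "server error",
    }
    label = labels.get(status_code, f"unexpected status {status_code}")
    return f"{label}: {body.get('message', 'no message')}"


regex_error = describe_response(
    400,
    {
        "error": "Invalid regex pattern",
        "message": "unbalanced parenthesis at position 5",
        "pattern": "test((",
    },
)
```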
How Grep Works
Search Process
The API uses PyPDF to extract text from each page, then applies pattern matching:
- Extract text content from each target page
- Apply the search pattern (literal or regex) with specified flags
- For each match, capture position and surrounding context
- Return structured results organized by page
Pattern Matching
- Literal mode (default): Special regex characters are escaped, pattern matches exactly as typed
- Regex mode: Full Python regex syntax supported (uses re module)
- Case sensitivity: Controlled by ignore_case flag
Context Extraction
Context characters are extracted from the page text surrounding each match:
- context_before: Characters immediately preceding the match
- context_after: Characters immediately following the match
- Context is truncated at page boundaries (won’t wrap to adjacent pages)
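The search process and context rules described above can be emulated locally on already-extracted page text. This is a rough sketch, not the service's actual implementation: `grep_pages` is a hypothetical function, text extraction is skipped entirely, and it assumes position is a 0-based character offset, which this page does not confirm.

```python
import re


def grep_pages(page_texts, pattern, regex=False, ignore_case=True, context=100):
    """Search a list of page strings and return results shaped like the API's."""
    source = pattern if regex else re.escape(pattern)
    compiled = re.compile(source, re.IGNORECASE if ignore_case else 0)
    results = []
    for number, text in enumerate(page_texts, start=1):  # pages are 1-indexed
        hits = []
        for m in compiled.finditer(text):
            start, end = m.span()
            hits.append({
                "match": m.group(0),
                "position": start,
                # Slicing a single page's text keeps context within that page.
                "context_before": text[max(0, start - context):start],
                "context_after": text[end:end + context],
            })
        if hits:
            results.append(
                {"page": number, "match_count": len(hits), "matches": hits}
            )
    return results


pages = ["Please find the invoice here.", "Nothing relevant.", "Invoice due."]
found = grep_pages(pages, "invoice", context=5)
```

Because each page is sliced on its own, a match at the very start of a page gets an empty context_before rather than text from the previous page.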
Use Cases
Find All Mentions of a Term
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/contract.pdf",
"pattern": "liability"
}'
Quickly locate all occurrences of a specific term in a legal document.
Extract Dollar Amounts
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/financial-report.pdf",
"pattern": "\\$[0-9,]+\\.?[0-9]*",
"regex": true
}'
Use regex to find all monetary values in a financial document.
Find Email Addresses
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/contacts.pdf",
"pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
"regex": true
}'
Extract email addresses from a document.
Check for Sensitive Information
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/document.pdf",
"pattern": "SSN|social security|password|secret",
"regex": true,
"count_only": true
}'
Quickly scan for potential sensitive information without retrieving full context.
Search Table of Contents Only
curl -X POST https://api.pdf-mcp.io/grepPdf \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/manual.pdf",
"pattern": "chapter",
"pages": [1, 2, 3]
}'
Limit search to the first few pages where the table of contents typically appears.
Code Examples
Python
import requests
import base64

# Using URL
response = requests.post(
    "https://api.pdf-mcp.io/grepPdf",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "pdf_url": "https://example.com/document.pdf",
        "pattern": "search term",
        "ignore_case": True,
        "context": 100
    }
)
result = response.json()
print(f"Found {result['total_matches']} matches across {result['pages_with_matches']} pages")
for page in result['matches']:
    print(f"\nPage {page['page']} ({page['match_count']} matches):")
    for match in page.get('matches', []):
        print(f" ...{match['context_before']}{match['match']}{match['context_after']}...")

# Using file upload
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.pdf-mcp.io/grepPdf",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "pattern": "important",
            "regex": "false",
            "ignore_case": "true"
        }
    )
result = response.json()

# Using base64
with open("document.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode()
response = requests.post(
    "https://api.pdf-mcp.io/grepPdf",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "pdf_base64": pdf_base64,
        "pattern": r"\d{3}-\d{2}-\d{4}",  # SSN pattern
        "regex": True
    }
)
JavaScript (Node.js)
// Requires Node.js 18+ (built-in fetch, FormData, and Blob)
const fs = require('fs');

const API_URL = 'https://api.pdf-mcp.io/grepPdf';
const AUTH = { 'Authorization': 'Bearer YOUR_API_KEY' };

async function main() {
  // Using URL
  const urlResponse = await fetch(API_URL, {
    method: 'POST',
    headers: { ...AUTH, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pdf_url: 'https://example.com/document.pdf',
      pattern: 'search term',
      ignore_case: true,
      context: 100
    })
  });
  const result = await urlResponse.json();
  console.log(`Found ${result.total_matches} matches across ${result.pages_with_matches} pages`);
  result.matches.forEach(page => {
    console.log(`\nPage ${page.page} (${page.match_count} matches):`);
    // page.matches is absent when count_only is true
    page.matches?.forEach(match => {
      console.log(` ...${match.context_before}${match.match}${match.context_after}...`);
    });
  });

  // Using file upload with the built-in FormData
  // (fetch sets the multipart Content-Type and boundary itself)
  const form = new FormData();
  const pdfBlob = new Blob([fs.readFileSync('document.pdf')], { type: 'application/pdf' });
  form.append('file', pdfBlob, 'document.pdf');
  form.append('pattern', 'important');
  form.append('ignore_case', 'true');
  const uploadResponse = await fetch(API_URL, {
    method: 'POST',
    headers: AUTH,
    body: form
  });
  const uploadResult = await uploadResponse.json();
  console.log(`Upload search: ${uploadResult.total_matches} matches`);

  // Using base64
  const pdfBase64 = fs.readFileSync('document.pdf').toString('base64');
  const base64Response = await fetch(API_URL, {
    method: 'POST',
    headers: { ...AUTH, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pdf_base64: pdfBase64,
      pattern: '\\d{3}-\\d{2}-\\d{4}', // SSN pattern
      regex: true
    })
  });
  const base64Result = await base64Response.json();
  console.log(`Base64 search: ${base64Result.total_matches} matches`);
}

main().catch(console.error);
Tips and Best Practices
Choosing Input Method
- File upload: Best for local files, simplest to implement
- Base64: Best for programmatic access when PDF is already in memory
- URL: Best for processing PDFs already hosted online
Pattern Design
- Start with literal search for exact terms
- Use regex for pattern matching (dates, numbers, emails)
- Test regex patterns locally before API calls
- In regex mode, escape special characters when searching for literal symbols like $, ., (, etc. (literal mode escapes them automatically)
Performance Optimization
- Use count_only: true for initial scans to quickly identify relevant pages
- Limit pages with the pages parameter when you know where to look
- Reduce context if you don’t need surrounding text
- For large documents, consider searching in page batches
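One way to combine these tips is a two-pass search: a count_only request first, then a detailed request restricted to the pages that had hits. The sketch below only builds request payloads and makes no network calls; all three function names are hypothetical, and the batch size is an arbitrary illustration.

```python
def pages_with_hits(count_response: dict) -> list:
    """Pages reported by a count_only response as containing matches."""
    return [
        entry["page"]
        for entry in count_response.get("matches", [])
        if entry.get("match_count", 0) > 0
    ]


def build_detail_payload(count_response, pdf_url, pattern, context=100):
    """Second-pass payload limited to matching pages, or None if no hits."""
    pages = pages_with_hits(count_response)
    if not pages:
        return None
    return {"pdf_url": pdf_url, "pattern": pattern, "pages": pages, "context": context}


def page_batches(total_pages: int, batch_size: int) -> list:
    """Split 1..total_pages into consecutive batches for separate requests."""
    return [
        list(range(start, min(start + batch_size, total_pages + 1)))
        for start in range(1, total_pages + 1, batch_size)
    ]


count_response = {
    "matches": [{"page": 1, "match_count": 2}, {"page": 3, "match_count": 1}],
    "total_matches": 3,
    "total_pages": 5,
}
detail = build_detail_payload(
    count_response, "https://example.com/documents/report.pdf", "invoice", context=50
)
batches = page_batches(count_response["total_pages"], 2)
```

Since credits are charged per request, batching trades lower per-response size for more requests, so the batch size is worth choosing deliberately.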
Common Regex Patterns
| Use Case | Pattern |
|---|---|
| Email addresses | [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} |
| Phone numbers | \d{3}[-.\s]?\d{3}[-.\s]?\d{4} |
| Dollar amounts | \$[0-9,]+\.?[0-9]* |
| Dates (MM/DD/YYYY) | \d{1,2}/\d{1,2}/\d{4} |
| URLs | https?://[^\s]+ |
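These patterns can be tried locally with Python's `re` module before any API call. The sketch below runs each one against a made-up sample sentence; the sample text and the `PATTERNS` dictionary are illustrative only.

```python
import re

# Patterns copied from the table above.
PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "phone": r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}",
    "dollar": r"\$[0-9,]+\.?[0-9]*",
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "url": r"https?://[^\s]+",
}

SAMPLE = (
    "Contact jane.doe@example.com or 555-123-4567 by 3/14/2025. "
    "Fee: $1,250.00. See https://example.com/terms"
)

found = {name: re.findall(pattern, SAMPLE) for name, pattern in PATTERNS.items()}

# Edge case: the optional "\.?" also captures a sentence-ending period.
trailing = re.findall(PATTERNS["dollar"], "It costs $5.")
```

The last line shows why local testing is worthwhile: the dollar pattern returns `$5.` including the period, so a stricter variant such as the `\.[0-9]{2}` form used in the file upload example may be preferable for some documents.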
Error Handling
- Validate regex patterns before sending to avoid 400 errors
- Handle cases where no matches are found (empty matches array)
- Check total_pages to verify the PDF was parsed correctly
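Client-side regex validation might look like the following sketch. It assumes the client runs Python, so that local `re.compile` behavior is close to the server's; the exact error wording can differ between Python versions, and `regex_error` is a hypothetical helper.

```python
import re
from typing import Optional


def regex_error(pattern: str) -> Optional[str]:
    """Return a description of why the pattern is invalid, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


bad = regex_error("test((")
good = regex_error("TODO|FIXME|HACK")
```

A non-None result means the request would most likely come back as a 400, so it can be reported to the user without calling the API.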
Related Endpoints
- Extract Text - Extract all text from a PDF
- Page Count - Get the number of pages in a PDF
- Extract Pages - Extract specific pages as a new PDF
Credit Usage
0.01 credits per request, regardless of PDF size or number of matches.