Extract text content from PDF documents. This endpoint is suited to converting PDF content to searchable text, processing scanned documents that carry an OCR text layer, or extracting data from structured PDFs such as invoices and reports.
Endpoint
POST /extractText
Authentication
Requires a valid API key or OAuth token in the Authorization header:
Authorization: Bearer YOUR_API_KEY
See Authentication for details.
Request Body
Content-Type: application/json or multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| pdf_base64 | string | Conditional | Base64-encoded PDF file |
| pdf_url | string | Conditional | URL to fetch the PDF from |
| file | file | Conditional | PDF file upload (multipart only) |
| pages | array/string | No | Page numbers to extract (1-indexed) |
Note: You must provide exactly one of: pdf_base64, pdf_url, or file.
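The exactly-one-source rule can be checked on the client before a request is sent. The following is a minimal sketch for JSON bodies only (so `file`, which is multipart only, is left out); the helper name `build_json_payload` is hypothetical and not part of the API.

```python
import json


def build_json_payload(pdf_base64=None, pdf_url=None, pages=None):
    """Build a JSON request body, enforcing the exactly-one-source rule.

    Hypothetical helper: only the field names come from the request table.
    """
    sources = {"pdf_base64": pdf_base64, "pdf_url": pdf_url}
    provided = {name: value for name, value in sources.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("provide exactly one of pdf_base64 or pdf_url in a JSON body")
    payload = dict(provided)
    if pages is not None:
        payload["pages"] = list(pages)  # JSON bodies take an array of integers
    return json.dumps(payload)
```

Failing early on the client avoids spending a round trip on a request the server would have to reject.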
Field Details
pdf_base64 (conditional)
A base64-encoded PDF file. Use this when you have the PDF data in memory or need to send it as part of a JSON payload. The encoded string should not include the data URI prefix.
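As an illustration of preparing this field, the sketch below encodes PDF bytes and strips a data URI prefix if one is present (for example when the string came from a browser `FileReader`). Both function names are hypothetical.

```python
import base64


def encode_pdf(pdf_bytes: bytes) -> str:
    """Return the PDF bytes as a single-line base64 string."""
    return base64.b64encode(pdf_bytes).decode("ascii")


def strip_data_uri_prefix(value: str) -> str:
    """Drop a leading 'data:...;base64,' prefix, if there is one."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value
```

`base64.b64encode` does not insert line breaks, so the result can be placed directly in a JSON string.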
pdf_url (conditional)
A publicly accessible URL where the PDF can be fetched. The server will download the PDF from this URL before processing. Supports redirects.
file (conditional, multipart only)
Direct file upload via multipart form data. This is the simplest option when you have the PDF file available locally.
pages (optional)
Specify which pages to extract text from:
- JSON format: Array of integers, e.g. [1, 2, 5]
- Multipart format: Comma-separated string, e.g. "1,2,5"
Page numbers are 1-indexed (first page is 1, not 0). If omitted, text is extracted from all pages.
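The two formats can be produced from the same list of page numbers. A small sketch, with hypothetical helper names, that also rejects page 0 because numbering is 1-indexed:

```python
def pages_for_json(pages):
    """Return a validated list of 1-indexed integers for a JSON body."""
    cleaned = [int(p) for p in pages]
    if any(p < 1 for p in cleaned):
        raise ValueError("page numbers are 1-indexed; the first page is 1")
    return cleaned


def pages_for_multipart(pages):
    """Return the comma-separated string form used in multipart requests."""
    return ",".join(str(p) for p in pages_for_json(pages))
```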
Example Request
Basic Text Extraction (JSON with Base64)
# First, encode your PDF to base64 (strip line wraps so the JSON string stays valid)
PDF_BASE64=$(base64 < document.pdf | tr -d '\n')
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_base64": "'"$PDF_BASE64"'"
}'
Extract from URL
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/report.pdf"
}'
Extract Specific Pages
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/report.pdf",
"pages": [1, 3, 5]
}'
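For callers working in Python rather than the shell, the same JSON request can be prepared with the standard library. This is a sketch, not an official client: `build_extract_request` is a hypothetical name, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.pdf-mcp.io/extractText"


def build_extract_request(api_key, pdf_url, pages=None):
    """Prepare (but do not send) a JSON POST to /extractText."""
    payload = {"pdf_url": pdf_url}
    if pages is not None:
        payload["pages"] = list(pages)
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is left to the caller, for example:
# with urllib.request.urlopen(build_extract_request(key, url), timeout=60) as resp:
#     result = json.load(resp)
```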
Using File Upload (Multipart)
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf"
File Upload with Page Selection
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@document.pdf" \
-F "pages=1,2,3"
Response
Success
Returns JSON with extracted text organized by page.
{
"text": "Full combined text from all extracted pages...",
"pages": [
{
"page": 1,
"text": "Text content from page 1..."
},
{
"page": 2,
"text": "Text content from page 2..."
}
],
"total_pages": 5
}
Response Fields:
| Field | Type | Description |
|---|---|---|
| text | string | Combined text from all extracted pages, separated by double newlines |
| pages | array | Array of page objects with individual page text |
| pages[].page | integer | Page number (1-indexed) |
| pages[].text | string | Extracted text from this page |
| total_pages | integer | Total number of pages in the PDF |
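A short sketch of reading these fields, assuming the response body has the shape shown in the success example. The helper name `index_pages` and the sample values are illustrative only.

```python
import json


def index_pages(body):
    """Map page number to page text and return it with total_pages."""
    data = json.loads(body)
    by_page = {entry["page"]: entry["text"] for entry in data["pages"]}
    return by_page, data["total_pages"]


# Illustrative body: pages 1 and 2 were requested from a 5-page PDF.
sample = json.dumps({
    "text": "First page\n\nSecond page",
    "pages": [
        {"page": 1, "text": "First page"},
        {"page": 2, "text": "Second page"},
    ],
    "total_pages": 5,
})

by_page, total = index_pages(sample)
# Pages that exist in the PDF but were not part of this extraction.
not_extracted = [n for n in range(1, total + 1) if n not in by_page]
```

Because `total_pages` counts every page in the PDF, it can be larger than the length of `pages` when a page selection was used.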
Error
{
"error": "Failed to extract text",
"message": "Error description"
}
Status Codes:
| Code | Description |
|---|---|
| 200 | Success - Text extracted and returned as JSON |
| 401 | Unauthorized - Missing or invalid Authorization header |
| 403 | Forbidden - Invalid API key or OAuth token |
| 500 | Internal Server Error - Text extraction failed |
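One way to turn these status codes into client-side behavior is sketched below. It assumes, without confirmation from this page, that non-200 responses usually carry the JSON error shape shown above, and falls back to a generic label when they do not. `ExtractTextError` and `interpret_response` are hypothetical names.

```python
import json


class ExtractTextError(Exception):
    """Raised for any non-200 response."""

    def __init__(self, status, error, message):
        super().__init__(f"{status} {error}: {message}")
        self.status = status
        self.error = error
        self.message = message


# Fallback labels taken from the status code table.
STATUS_HINTS = {401: "Unauthorized", 403: "Forbidden", 500: "Internal Server Error"}


def interpret_response(status, body):
    """Return parsed JSON on 200, otherwise raise ExtractTextError."""
    if status == 200:
        return json.loads(body)
    try:
        detail = json.loads(body)
    except ValueError:  # body was not JSON (assumed possible)
        detail = {}
    if not isinstance(detail, dict):
        detail = {}
    raise ExtractTextError(
        status,
        detail.get("error", STATUS_HINTS.get(status, "Unexpected status")),
        detail.get("message", ""),
    )
```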
Text Extraction Details
How Text Extraction Works
The API uses PyPDF to extract text from PDFs. This works best with:
- Native PDFs: PDFs created digitally (Word, LaTeX, HTML-to-PDF converters)
- Text-layer PDFs: Scanned documents with an OCR text layer applied
- Structured PDFs: Documents with proper text encoding
Limitations
Text extraction may not work perfectly in all cases:
- Scanned images without OCR: Pure image scans won’t have extractable text
- Complex layouts: Multi-column layouts may have text in unexpected order
- Embedded fonts: Some custom fonts may not extract properly
- Encrypted PDFs: Password-protected PDFs cannot be processed
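Image-only scans typically show up as pages whose text is empty or whitespace. A minimal sketch for flagging them in a parsed response (the function name is hypothetical, and whether the API returns an empty string for such pages is an assumption):

```python
def pages_without_text(response):
    """Return page numbers whose extracted text is empty or whitespace only."""
    return [
        entry["page"]
        for entry in response.get("pages", [])
        if not entry.get("text", "").strip()
    ]
```

Flagged pages are candidates for OCR before text extraction is attempted again.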
Text Ordering
Text is extracted in the order determined by the PDF's internal structure. For complex layouts, such as multi-column pages, this order may not match the visual reading order. Consider post-processing if precise ordering is required.
Use Cases
Invoice Data Extraction
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/invoices/INV-2024-001.pdf"
}'
Process the response to extract invoice numbers, dates, amounts, and line items.
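As a rough illustration of that processing step, the sketch below pulls an invoice number, dates, and amounts out of the `text` field with regular expressions. The patterns assume a hypothetical layout (numbers like INV-2024-001, ISO dates, dollar amounts); real invoices will need patterns tuned to their own format.

```python
import re

# All three patterns are assumptions about the invoice layout.
INVOICE_NO = re.compile(r"\bINV-\d{4}-\d{3,}\b")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")


def extract_invoice_fields(text):
    """Collect the first invoice number plus every date and amount found."""
    number = INVOICE_NO.search(text)
    return {
        "invoice_number": number.group(0) if number else None,
        "dates": ISO_DATE.findall(text),
        "amounts": [float(a.replace(",", "")) for a in AMOUNT.findall(text)],
    }
```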
Document Search Indexing
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@legal-contract.pdf"
Use the extracted text to build a searchable index of your document library.
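A searchable index can be as simple as an inverted index from terms to the documents and pages that contain them. The sketch below assumes each document's parsed response is already available; `build_index` and the document IDs are hypothetical, and a production system would more likely use a dedicated search engine.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of (doc_id, page) pairs containing it.

    documents: {doc_id: parsed /extractText response}
    """
    index = defaultdict(set)
    for doc_id, response in documents.items():
        for entry in response["pages"]:
            for term in re.findall(r"[a-z0-9]+", entry["text"].lower()):
                index[term].add((doc_id, entry["page"]))
    return index
```

Indexing per page rather than per document lets search results point to the page where a term occurs.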
First Page Summary
curl -X POST https://api.pdf-mcp.io/extractText \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"pdf_url": "https://example.com/documents/whitepaper.pdf",
"pages": [1]
}'
Extract just the first page to get title, abstract, or summary information.
Tips and Best Practices
Choosing Input Method
- File upload: Best for local files, simplest to implement
- Base64: Best for programmatic access when PDF is already in memory
- URL: Best for processing PDFs already hosted online
Performance Optimization
- Extract only the pages you need using the pages parameter
- For large PDFs, consider extracting in batches
- Cache extracted text to avoid re-processing
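Batching can be done by splitting the page range into chunks and passing each chunk as the pages value of a separate request. A minimal sketch; the function name and batch size are illustrative, and the total page count is assumed to be known already.

```python
def page_batches(total_pages, batch_size):
    """Split pages 1..total_pages into consecutive lists of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(range(start, min(start + batch_size, total_pages + 1)))
        for start in range(1, total_pages + 1, batch_size)
    ]
```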
Text Processing
- Extracted text may contain extra whitespace - normalize as needed
- Page breaks are indicated by the pages array structure
- Use the combined text field for full-document search or analysis
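Whitespace normalization might look like the following sketch. The exact rules (collapse runs of spaces and tabs, trim around line breaks, cap blank lines at one) are one reasonable choice rather than something the API prescribes.

```python
import re


def normalize_whitespace(text):
    """Collapse redundant whitespace while keeping paragraph breaks."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\f\v]+", " ", text)   # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line
    return text.strip()
```

Keeping double newlines intact preserves the page separators in the combined text field.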
Error Handling
- Validate PDFs before sending (check file extension, magic bytes)
- Handle cases where pages contain no extractable text
- Implement retry logic for URL-based extraction (network failures)
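Two of these practices are sketched below: a magic-byte check and a retry wrapper with exponential backoff. Both names are hypothetical. The check only looks for the `%PDF-` header at the start of the data, and which exceptions count as retryable (here `OSError`, which covers connection and timeout errors in the standard library) is an assumption to adapt to your HTTP client.

```python
import time


def looks_like_pdf(data: bytes) -> bool:
    """Cheap pre-flight check: PDF files start with the bytes '%PDF-'."""
    return data.startswith(b"%PDF-")


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on OSError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the backoff can be replaced in tests; normal callers can leave it at its default.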
Related Endpoints
- Page Count - Get the number of pages in a PDF
- Extract Pages - Extract specific pages as a new PDF
- PDF to Image - Convert PDF pages to images
Credit Usage
Approximately 1 credit per page processed.