Extract text content from PDF documents. This endpoint is ideal for converting PDF content to searchable text, processing scanned documents, or extracting data from structured PDFs like invoices and reports.

Endpoint

POST /extractText

Authentication

Requires a valid API key or OAuth token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

See Authentication for details.


Request Body

Content-Type: application/json or multipart/form-data

Field        Type          Required     Description
pdf_base64   string        Conditional  Base64-encoded PDF file
pdf_url      string        Conditional  URL to fetch the PDF from
file         file          Conditional  PDF file upload (multipart only)
pages        array/string  No           Page numbers to extract (1-indexed)

Note: You must provide exactly one of: pdf_base64, pdf_url, or file.

Field Details

pdf_base64 (conditional)

A base64-encoded PDF file. Use this when you have the PDF data in memory or need to send it as part of a JSON payload. The encoded string should not include the data URI prefix.

pdf_url (conditional)

A publicly accessible URL where the PDF can be fetched. The server will download the PDF from this URL before processing. Supports redirects.

file (conditional, multipart only)

Direct file upload via multipart form data. This is the simplest option when you have the PDF file available locally.

pages (optional)

Specify which pages to extract text from:

  • JSON format: Array of integers [1, 2, 5]
  • Multipart format: Comma-separated string "1,2,5"

Page numbers are 1-indexed (first page is 1, not 0). If omitted, text is extracted from all pages.


Example Request

Basic Text Extraction (JSON with Base64)

# First, encode your PDF to base64. Note that GNU base64 wraps its output
# at 76 characters by default, which would corrupt the JSON payload, so
# strip newlines to be safe (this also works with the macOS base64).
PDF_BASE64=$(base64 < document.pdf | tr -d '\n')

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$PDF_BASE64"'"
  }'

Extract from URL

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf"
  }'

Extract Specific Pages

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf",
    "pages": [1, 3, 5]
  }'

Using File Upload (Multipart)

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

File Upload with Page Selection

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "pages=1,2,3"

Response

Success

Returns JSON with extracted text organized by page.

{
  "text": "Full combined text from all extracted pages...",
  "pages": [
    {
      "page": 1,
      "text": "Text content from page 1..."
    },
    {
      "page": 2,
      "text": "Text content from page 2..."
    }
  ],
  "total_pages": 5
}

Response Fields:

Field          Type     Description
text           string   Combined text from all extracted pages, separated by double newlines
pages          array    Array of page objects with individual page text
pages[].page   integer  Page number (1-indexed)
pages[].text   string   Extracted text from this page
total_pages    integer  Total number of pages in the PDF
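
A saved response can be picked apart with jq (assuming jq is installed; the sample document below is a stand-in for a real API response, with field names matching the table above):

```shell
# Sample response saved to disk; stands in for real curl output
cat > response.json <<'EOF'
{"text":"Page one text\n\nPage two text","pages":[{"page":1,"text":"Page one text"},{"page":2,"text":"Page two text"}],"total_pages":2}
EOF

# Pull out just the text of page 2
jq -r '.pages[] | select(.page == 2) | .text' response.json
```

In practice you would pipe the curl output straight into jq rather than saving it first.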

Error

{
  "error": "Failed to extract text",
  "message": "Error description"
}

Status Codes:

Code  Description
200   Success - Text extracted and returned as JSON
401   Unauthorized - Missing or invalid Authorization header
403   Forbidden - Invalid API key or OAuth token
500   Internal Server Error - Text extraction failed

Text Extraction Details

How Text Extraction Works

The API uses PyPDF to extract text from PDFs. This works best with:

  • Native PDFs: PDFs created digitally (Word, LaTeX, HTML-to-PDF converters)
  • Text-layer PDFs: Scanned documents with OCR text layer applied
  • Structured PDFs: Documents with proper text encoding

Limitations

Text extraction may not work perfectly in all cases:

  • Scanned images without OCR: Pure image scans won’t have extractable text
  • Complex layouts: Multi-column layouts may have text in unexpected order
  • Embedded fonts: Some custom fonts may not extract properly
  • Encrypted PDFs: Password-protected PDFs cannot be processed

Text Ordering

Text is extracted in reading order as determined by the PDF structure. For complex layouts, the order may not match visual reading order. Consider post-processing if precise ordering is required.


Use Cases

Invoice Data Extraction

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/invoices/INV-2024-001.pdf"
  }'

Process the response to extract invoice numbers, dates, amounts, and line items.

Document Search Indexing

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@legal-contract.pdf"

Use the extracted text to build a searchable index of your document library.

First Page Summary

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/whitepaper.pdf",
    "pages": [1]
  }'

Extract just the first page to get title, abstract, or summary information.


Tips and Best Practices

Choosing Input Method

  • File upload: Best for local files, simplest to implement
  • Base64: Best for programmatic access when PDF is already in memory
  • URL: Best for processing PDFs already hosted online

Performance Optimization

  • Extract only the pages you need using the pages parameter
  • For large PDFs, consider extracting in batches
  • Cache extracted text to avoid re-processing
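
One way to implement the caching suggestion, sketched in shell: key the cache on the PDF's content hash so renamed or re-downloaded copies still hit the cache. The `doc.pdf` file and the placeholder text below are illustrative, not real API output.

```shell
# Create a stand-in PDF file for the example
echo 'fake pdf bytes' > doc.pdf

mkdir -p .text-cache
# Hash the file contents (GNU coreutils; use `shasum -a 256` on macOS)
key=$(sha256sum doc.pdf | cut -d' ' -f1)

if [ -f ".text-cache/$key" ]; then
  # Cache hit: reuse previously extracted text
  cat ".text-cache/$key"
else
  # Cache miss: this is where the real /extractText call would go
  extracted='(extracted text would go here)'
  printf '%s\n' "$extracted" | tee ".text-cache/$key"
fi
```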

Text Processing

  • Extracted text may contain extra whitespace - normalize as needed
  • Page breaks are indicated by the pages array structure
  • Use the combined text field for full-document search or analysis
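
A minimal sketch of the whitespace normalization mentioned above, using a sample string rather than real extraction output:

```shell
# Collapse runs of spaces that often appear in extracted PDF text
raw='Invoice   No.  42'
echo "$raw" | tr -s ' '
# -> Invoice No. 42
```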

Error Handling

  • Validate PDFs before sending (check file extension, magic bytes)
  • Handle cases where pages contain no extractable text
  • Implement retry logic for URL-based extraction (network failures)
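
The magic-byte check can be done before upload with a one-liner: every valid PDF starts with the bytes %PDF-. The sample file here is generated purely for illustration:

```shell
# Create a minimal stand-in file that begins with the PDF magic bytes
printf '%%PDF-1.7\n' > document.pdf

# Check the first five bytes before spending a request (and credits) on a bad file
if [ "$(head -c 5 document.pdf)" = "%PDF-" ]; then
  echo "looks like a PDF"
else
  echo "not a PDF" >&2
  exit 1
fi
```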

Credit Usage

Approximately 1 credit per page processed.