Extract text content from PDF documents. This endpoint is ideal for converting PDF content to searchable text, processing scanned documents, or extracting data from structured PDFs like invoices and reports.

Endpoint

POST /extractText

Authentication

Requires a valid API key or OAuth token in the Authorization header:

Authorization: Bearer YOUR_API_KEY

See Authentication for details.


Request Body

Content-Type: application/json or multipart/form-data

Field        Type          Required     Description
pdf_base64   string        Conditional  Base64-encoded PDF file
pdf_url      string        Conditional  URL to fetch the PDF from
file         file          Conditional  PDF file upload (multipart only)
pages        array/string  No           Page numbers to extract (1-indexed)

Note: You must provide exactly one of: pdf_base64, pdf_url, or file.

Field Details

pdf_base64 (conditional)

A base64-encoded PDF file. Use this when you have the PDF data in memory or need to send it as part of a JSON payload. The encoded string should not include the data URI prefix.

pdf_url (conditional)

A publicly accessible URL where the PDF can be fetched. The server will download the PDF from this URL before processing. Supports redirects.

file (conditional, multipart only)

Direct file upload via multipart form data. This is the simplest option when you have the PDF file available locally.

pages (optional)

Specify which pages to extract text from:

  • JSON format: Array of integers [1, 2, 5]
  • Multipart format: Comma-separated string "1,2,5"

Page numbers are 1-indexed (first page is 1, not 0). If omitted, text is extracted from all pages.


Example Request

Basic Text Extraction (JSON with Base64)

# First, encode your PDF to base64. Note that GNU base64 wraps its output
# at 76 characters by default, which would corrupt the JSON payload, so
# strip newlines to be safe (this also works with the macOS base64).
PDF_BASE64=$(base64 < document.pdf | tr -d '\n')

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$PDF_BASE64"'"
  }'

Extract from URL

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf"
  }'

Extract Specific Pages

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/report.pdf",
    "pages": [1, 3, 5]
  }'

Using File Upload (Multipart)

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf"

File Upload with Page Selection

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@document.pdf" \
  -F "pages=1,2,3"

Response

Success

Returns JSON with extracted text organized by page.

{
  "text": "Full combined text from all extracted pages...",
  "pages": [
    {
      "page": 1,
      "text": "Text content from page 1..."
    },
    {
      "page": 2,
      "text": "Text content from page 2..."
    }
  ],
  "total_pages": 5
}

Response Fields:

Field          Type     Description
text           string   Combined text from all extracted pages, separated by double newlines
pages          array    Array of page objects with individual page text
pages[].page   integer  Page number (1-indexed)
pages[].text   string   Extracted text from this page
total_pages    integer  Total number of pages in the PDF
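
A saved response can be picked apart with jq (assuming jq is installed; the sample document below is a stand-in for a real API response, with field names matching the table above):

```shell
# Sample response saved to disk; stands in for real curl output
cat > response.json <<'EOF'
{"text":"Page one text\n\nPage two text","pages":[{"page":1,"text":"Page one text"},{"page":2,"text":"Page two text"}],"total_pages":2}
EOF

# Pull out just the text of page 2
jq -r '.pages[] | select(.page == 2) | .text' response.json
```

In practice you would pipe the curl output straight into jq rather than saving it first.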

Error

{
  "error": "Failed to extract text",
  "message": "Error description"
}

Status Codes:

Code  Description
200   Success - Text extracted and returned as JSON
401   Unauthorized - Missing or invalid Authorization header
403   Forbidden - Invalid API key or OAuth token
500   Internal Server Error - Text extraction failed

Text Extraction Details

How Text Extraction Works

The API uses PyPDF to extract text from PDFs. This works best with:

  • Native PDFs: PDFs created digitally (Word, LaTeX, HTML-to-PDF converters)
  • Text-layer PDFs: Scanned documents with OCR text layer applied
  • Structured PDFs: Documents with proper text encoding

Limitations

Text extraction may not work perfectly in all cases:

  • Scanned images without OCR: Pure image scans won’t have extractable text
  • Complex layouts: Multi-column layouts may have text in unexpected order
  • Embedded fonts: Some custom fonts may not extract properly
  • Encrypted PDFs: Password-protected PDFs cannot be processed

Text Ordering

Text is extracted in reading order as determined by the PDF structure. For complex layouts, the order may not match visual reading order. Consider post-processing if precise ordering is required.


Use Cases

Invoice Data Extraction

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/invoices/INV-2024-001.pdf"
  }'

Process the response to extract invoice numbers, dates, amounts, and line items.

Document Search Indexing

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@legal-contract.pdf"

Use the extracted text to build a searchable index of your document library.

First Page Summary

curl -X POST https://api.pdf-mcp.io/extractText \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_url": "https://example.com/documents/whitepaper.pdf",
    "pages": [1]
  }'

Extract just the first page to get title, abstract, or summary information.


Tips and Best Practices

Choosing Input Method

  • File upload: Best for local files, simplest to implement
  • Base64: Best for programmatic access when PDF is already in memory
  • URL: Best for processing PDFs already hosted online

Performance Optimization

  • Extract only the pages you need using the pages parameter
  • For large PDFs, consider extracting in batches
  • Cache extracted text to avoid re-processing
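
One way to implement the caching suggestion, sketched in shell: key the cache on the PDF's content hash so renamed or re-downloaded copies still hit the cache. The `doc.pdf` file and the placeholder text below are illustrative, not real API output.

```shell
# Create a stand-in PDF file for the example
echo 'fake pdf bytes' > doc.pdf

mkdir -p .text-cache
# Hash the file contents (GNU coreutils; use `shasum -a 256` on macOS)
key=$(sha256sum doc.pdf | cut -d' ' -f1)

if [ -f ".text-cache/$key" ]; then
  # Cache hit: reuse previously extracted text
  cat ".text-cache/$key"
else
  # Cache miss: this is where the real /extractText call would go
  extracted='(extracted text would go here)'
  printf '%s\n' "$extracted" | tee ".text-cache/$key"
fi
```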

Text Processing

  • Extracted text may contain extra whitespace - normalize as needed
  • Page breaks are indicated by the pages array structure
  • Use the combined text field for full-document search or analysis
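
A minimal sketch of the whitespace normalization mentioned above, using a sample string rather than real extraction output:

```shell
# Collapse runs of spaces that often appear in extracted PDF text
raw='Invoice   No.  42'
echo "$raw" | tr -s ' '
# -> Invoice No. 42
```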

Error Handling

  • Validate PDFs before sending (check file extension, magic bytes)
  • Handle cases where pages contain no extractable text
  • Implement retry logic for URL-based extraction (network failures)
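
The magic-byte check can be done before upload with a one-liner: every valid PDF starts with the bytes %PDF-. The sample file here is generated purely for illustration:

```shell
# Create a minimal stand-in file that begins with the PDF magic bytes
printf '%%PDF-1.7\n' > document.pdf

# Check the first five bytes before spending a request (and credits) on a bad file
if [ "$(head -c 5 document.pdf)" = "%PDF-" ]; then
  echo "looks like a PDF"
else
  echo "not a PDF" >&2
  exit 1
fi
```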

Credit Usage

Approximately 1 credit per page processed.