PDF to Text Advanced

Advanced PDF to text converter with page selection, formatting options, metadata extraction, and configurable text cleaning.

Features:

  • Extract text from PDF with high fidelity
  • Select specific pages or page ranges
  • Include PDF metadata (title, author, creation date)
  • Add page headers and/or line numbers
  • Preserve paragraph structure
  • Multiple output formats (plain text, structured, JSON)
  • Gentle or aggressive text cleaning, or none to keep the raw extracted text
  • Character encoding handling
  • Batch processing support

Example Results

Extract Text with Page Range

Extract text from specific pages of a PDF document

Output file: pdf-to-text-output.txt
Input parameters:
{ "sourceFile": "/public/samples/pdf/document.pdf", "pageRange": "1-5,10", "outputFormat": "structured", "includeMetadata": true }

Export to JSON

Export PDF content with metadata as JSON

Output file: pdf-to-text-output.json
Input parameters:
{ "sourceFile": "/public/samples/pdf/book.pdf", "outputFormat": "json", "includeMetadata": true, "pageRange": "all" }

Maximum file size: 100 MB
Supported formats: application/pdf

Key Facts

Category: Documents & PDF
Input Types: file, text, select, checkbox
Output Type: file
Sample Coverage: 4
API Ready: Yes

Overview

PDF to Text Advanced extracts text from PDF documents with high fidelity. It supports custom page ranges, metadata extraction, and three output formats: plain text, structured text, and JSON.

When to Use

  • When you need to extract text from specific page ranges of a PDF rather than the entire document.
  • When you want to convert PDF content into structured JSON format for data analysis or programmatic ingestion.
  • When you need to clean up extracted text by removing unwanted artifacts while preserving paragraph structure.

How It Works

  • Upload your PDF document using the file input field.
  • Configure extraction settings such as page range, output format (plain, structured, or JSON), and text cleaning level.
  • Toggle options to include metadata, page headers, line numbers, or preserve paragraph structure.
  • Click convert to process the file and download the extracted text output.

Use Cases

  • Converting academic papers or reports into clean plain text for research and analysis.
  • Parsing PDF invoices or books into structured JSON format for automated database entry.
  • Extracting specific chapters or sections from large PDF manuals using custom page ranges.

Examples

1. Extracting Specific Chapters for Research

Academic Researcher
Background: A researcher needs to extract text from only chapters 2 and 5 of a 300-page PDF report to run text analysis software.
Problem: Manually copying text from specific pages is tedious and loses paragraph formatting.
How to Use: Upload the PDF, set the Page Range to '15-45,90-120', select 'Structured' output format, and check 'Preserve Paragraph Structure'.
Example Config: Page Range: 15-45,90-120; Output Format: structured; Preserve Paragraph Structure: true
Outcome: A structured text file containing only the specified pages with paragraph layouts intact.

2. Converting PDF Manuals to JSON for AI Training

Data Engineer
Background: A data engineer needs to ingest technical manuals into a vector database for an AI chatbot.
Problem: Raw text lacks metadata and page boundaries, making it hard to chunk and tag the data properly.
How to Use: Upload the manual PDF, set the Output Format to 'JSON', and enable 'Include PDF Metadata' and 'Add Page Headers'.
Example Config: Output Format: json; Include PDF Metadata: true; Add Page Headers: true
Outcome: A clean JSON file containing the document text mapped to page numbers alongside metadata like title and author.

FAQ

Can I extract text from specific pages only?

Yes, you can specify individual pages or ranges, such as '1-5,7,10-12', in the Page Range field.
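The range syntax above can be illustrated with a small parser. This is only a sketch of how a string such as '1-5,7,10-12' might be expanded into page numbers; the function name `parse_page_range`, the handling of "all", and the choice to clamp ranges to the document length are assumptions, not the tool's documented behavior.

```python
def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a range string such as '1-5,7,10-12' into sorted 1-based page numbers."""
    # "all" appears in the sample input parameters; treating it as every page is an assumption.
    if spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages past the end of the document are silently dropped.
        pages.update(range(start, min(end, total_pages) + 1))
    return sorted(pages)


print(parse_page_range("1-5,7,10-12", total_pages=20))
```

Returning a sorted list without duplicates means overlapping ranges such as '1-5,3-8' yield each page once.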

What output formats are supported?

The tool supports Plain Text, Structured text with separators, and JSON formats.

Can I extract PDF metadata like author and title?

Yes, checking the 'Include PDF Metadata' option will append the document's metadata to the output.

What does the text cleaning option do?

It offers gentle or aggressive cleaning to remove unwanted artifacts, or 'none' to keep the raw extracted text.

Does the tool preserve paragraph layouts?

Yes, enabling the 'Preserve Paragraph Structure' option helps maintain the original paragraph formatting.

API Documentation

Request Endpoint

POST /en/api/tools/pdf-to-text-advanced

Request Parameters

Parameter Name              Type                    Required  Description
sourceFile                  file (upload required)  Yes       -
pageRange                   text                    No        -
outputFormat                select                  No        -
cleanLevel                  select                  No        -
includeMetadata             checkbox                No        -
includePageHeaders          checkbox                No        -
includeLineNumbers          checkbox                No        -
preserveParagraphStructure  checkbox                No        -

File parameters must be uploaded first: send the file to POST /upload/pdf-to-text-advanced to obtain a filePath, then pass that filePath as the value of the corresponding file field.
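The second step of that flow can be sketched as follows. The parameter names come from the table above; the JSON request body, the `Content-Type` header, and the `https://example.com` base URL are assumptions made for illustration, and the helper names are hypothetical. The request is only prepared here, not sent.

```python
import json
import urllib.request

# Optional parameter names taken from the Request Parameters table.
OPTIONAL_PARAMETERS = {
    "pageRange", "outputFormat", "cleanLevel", "includeMetadata",
    "includePageHeaders", "includeLineNumbers", "preserveParagraphStructure",
}


def build_conversion_payload(file_path: str, **options) -> dict:
    """Combine the uploaded file's filePath with optional settings."""
    unknown = sorted(set(options) - OPTIONAL_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {', '.join(unknown)}")
    return {"sourceFile": file_path, **options}


def build_conversion_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request. A JSON body is an assumption."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/en/api/tools/pdf-to-text-advanced",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The filePath value would come from the upload step; this one is a placeholder.
payload = build_conversion_payload(
    "/public/processing/randomid.pdf",
    pageRange="1-5,10",
    outputFormat="structured",
    includeMetadata=True,
)
request = build_conversion_request("https://example.com", payload)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it. Rejecting unknown option names locally catches typos before any network call is made.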

Response Format

{
  "filePath": "/public/processing/randomid.ext",
  "fileName": "output.ext",
  "contentType": "application/octet-stream",
  "size": 1024,
  "metadata": {
    "key": "value"
  },
  "error": "Error message (optional)",
  "message": "Notification message (optional)"
}
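A client might read this response along the following lines. The field names match the sample above; treating a non-empty "error" as a failure, and the `ConversionResult` and `parse_conversion_response` names, are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConversionResult:
    file_path: str
    file_name: str
    content_type: str
    size: int
    metadata: dict = field(default_factory=dict)
    message: Optional[str] = None


def parse_conversion_response(body: str) -> ConversionResult:
    """Parse the JSON response body, raising if it carries an error message."""
    data = json.loads(body)
    # Assumption: a present, non-empty "error" field means the conversion failed.
    if data.get("error"):
        raise RuntimeError(data["error"])
    return ConversionResult(
        file_path=data["filePath"],
        file_name=data["fileName"],
        content_type=data["contentType"],
        size=data["size"],
        metadata=data.get("metadata", {}),
        message=data.get("message"),
    )


sample = json.dumps({
    "filePath": "/public/processing/randomid.ext",
    "fileName": "output.ext",
    "contentType": "application/octet-stream",
    "size": 1024,
    "metadata": {"key": "value"},
})
result = parse_conversion_response(sample)
print(result.file_name, result.size)
```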

AI MCP Documentation

Add this tool to your MCP server configuration:

{
  "mcpServers": {
    "elysiatools-pdf-to-text-advanced": {
      "name": "pdf-to-text-advanced",
      "description": "Advanced PDF to text converter with page selection, formatting options, and metadata extraction",
      "baseUrl": "https://elysiatools.com/mcp/sse?toolId=pdf-to-text-advanced",
      "command": "",
      "args": [],
      "env": {},
      "isActive": true,
      "type": "sse"
    }
  }
}

You can chain up to 20 tools in a single URL by separating their IDs with commas, e.g.: `https://elysiatools.com/mcp/sse?toolId=png-to-webp,jpg-to-webp,gif-to-webp`.
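A chained URL can be assembled with a small helper like the one below. This is a minimal sketch: the function name `build_mcp_url` is hypothetical, and the limit check simply mirrors the 20-tool maximum stated above.

```python
MCP_SSE_ENDPOINT = "https://elysiatools.com/mcp/sse"
MAX_TOOLS = 20  # limit stated in the documentation above


def build_mcp_url(tool_ids: list[str]) -> str:
    """Join tool IDs with commas into a single toolId query parameter."""
    if not tool_ids:
        raise ValueError("at least one tool ID is required")
    if len(tool_ids) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} tools can be chained")
    # Commas are left unescaped to match the example URL in the documentation.
    return f"{MCP_SSE_ENDPOINT}?toolId={','.join(tool_ids)}"


print(build_mcp_url(["pdf-to-text-advanced"]))
```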

File parameters accept either a URL pointing to the file or Base64-encoded file content.

If you encounter any issues, please contact us at [email protected]