PDF to XML

Convert PDF documents to structured XML with the content hierarchy preserved, using pure Node.js.

Features:

  • Extracts text and structure from PDF files
  • Converts to well-formed, valid XML
  • Preserves document hierarchy (headings, paragraphs, lists)
  • Separates content into pages
  • Supports code blocks, blockquotes, and horizontal rules
  • Includes metadata attributes for document structure
  • Optimized for programmatic XML processing

Example Results

1 example

PDF Document to XML

Convert a PDF document to structured XML format

Output file: pdf-to-xml-output.xml
Input parameters:
{ "sourceFile": "/public/samples/pdf/document.pdf", "outputMode": "structured", "includeDeclaration": true }

Maximum file size: 50 MB
Supported formats: application/pdf

Key Facts

Category: Documents & PDF
Input Types: file, select, checkbox
Output Type: file
Sample Coverage: 4
API Ready: Yes

Overview

Convert your PDF documents into structured, well-formed XML files. This tool extracts text and preserves document hierarchy (headings, paragraphs, lists, blockquotes, and code blocks), and it separates content by page so the output can be processed programmatically.

When to Use

  • When you need to extract structured text data from PDF reports, manuals, or articles for downstream XML-based data pipelines.
  • When migrating legacy PDF documentation into structured content management systems that require XML input.
  • When parsing PDF content programmatically and you need to preserve structural elements like headings, lists, and page boundaries.
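
As an illustration of the last point, here is a minimal Python sketch of downstream parsing. This page does not document the output schema, so the element and attribute names used here (`document`, `page`, `number`, `heading`, `paragraph`) are hypothetical; adjust them to match the XML the tool actually produces.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample: the tool's real element and attribute names may differ.
SAMPLE_XML = """<document>
  <page number="1">
    <heading level="1">Installation</heading>
    <paragraph>Run the installer.</paragraph>
    <list><item>Step one</item><item>Step two</item></list>
  </page>
  <page number="2">
    <heading level="2">Troubleshooting</heading>
    <paragraph>Check the log file.</paragraph>
  </page>
</document>"""


def headings_by_page(xml_text: str) -> Dict[int, List[str]]:
    """Map each page number to the heading texts found on that page."""
    root = ET.fromstring(xml_text)
    result: Dict[int, List[str]] = {}
    for page in root.iter("page"):
        number = int(page.get("number", "0"))
        result[number] = [h.text or "" for h in page.iter("heading")]
    return result


if __name__ == "__main__":
    print(headings_by_page(SAMPLE_XML))
```

Because page boundaries are kept as elements rather than flattened away, a per-page index like this needs only one pass over the tree.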

How It Works

  • Upload your PDF document using the file input field.
  • Choose your preferred output mode (Compact XML or Pretty-printed XML) and decide whether to include the XML declaration.
  • Click the convert button to process the document and download the generated XML file containing the structured content.

Use Cases

  • Converting technical manuals from PDF to XML to feed into documentation databases.
  • Extracting structured text from academic papers for text mining and natural language processing.
  • Automating the ingestion of PDF-based invoices or reports into enterprise resource planning systems.

Examples

1. Converting a Technical Manual to Structured XML

Technical Writer
Background: A technical writer needs to migrate a 50-page PDF user manual into a DITA-based content management system that imports XML.
Problem: Manually copying text loses the hierarchy of headings, lists, and code blocks, making migration tedious.
How to Use: Upload the manual PDF, select 'Pretty-printed XML' for easy reading, keep 'Include XML Declaration' checked, and run the conversion.
Example Config: Output Mode: Pretty-printed XML, Include XML Declaration: Enabled
Outcome: A well-formatted XML file is generated, preserving the headings, lists, and page structures, ready for direct import into the CMS.

2. Extracting Academic Paper Content for Data Mining

Data Engineer
Background: A research team needs to parse thousands of PDF research papers to extract text paragraphs and blockquotes for an NLP model.
Problem: Raw text extraction loses the distinction between paragraphs, headings, and blockquotes, which degrades model training quality.
How to Use: Upload the research paper PDF, select 'Compact XML' to minimize file size, and disable the XML declaration if integrating into a larger XML document.
Example Config: Output Mode: Compact XML, Include XML Declaration: Disabled
Outcome: A compact XML file containing structured tags for paragraphs, headings, and blockquotes, optimized for automated parsing.

FAQ

Does this tool preserve the visual layout of the PDF?

No, it preserves the logical structure and content hierarchy, such as headings, paragraphs, lists, and page divisions, rather than the visual layout.

What output modes are supported?

You can choose between Compact XML for smaller file sizes or Pretty-printed XML for human-readable formatting.
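
The two modes should differ only in whitespace. As a rough illustration (not the tool's own implementation), the Python sketch below re-indents a compact document locally with the standard library. The sample element names are hypothetical.

```python
import xml.dom.minidom

# Hypothetical compact output; the tool's real element names may differ.
COMPACT = (
    '<document><page number="1">'
    '<heading level="1">Intro</heading>'
    "<paragraph>Hello.</paragraph>"
    "</page></document>"
)


def pretty_print(xml_text: str, indent: str = "  ") -> str:
    """Return an indented copy of an XML string, without an XML declaration."""
    dom = xml.dom.minidom.parseString(xml_text)
    # Calling toprettyxml() on the root element (rather than on the Document)
    # leaves the XML declaration out of the result.
    return dom.documentElement.toprettyxml(indent=indent)


if __name__ == "__main__":
    print(pretty_print(COMPACT))
```

If you only need readable output occasionally, requesting Compact XML and indenting locally keeps transfers small.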

Can I include or exclude the XML declaration?

Yes, you can toggle the 'Include XML Declaration' option to add or remove the standard XML header.
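
The declaration matters when outputs are combined: an XML declaration is only allowed at the very start of a document, so it must be absent from any document that is pasted inside another. The Python sketch below shows one way to handle files that were generated with the declaration enabled. The `collection` wrapper name and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterable

# Matches an XML declaration (optionally preceded by a BOM) at the start of a
# document. The required whitespace after "xml" avoids matching other
# processing instructions such as <?xml-stylesheet ...?>.
_DECLARATION = re.compile(r"^\ufeff?<\?xml\s[^>]*\?>\s*")


def strip_declaration(xml_text: str) -> str:
    """Remove a leading XML declaration, if there is one."""
    return _DECLARATION.sub("", xml_text, count=1)


def embed_documents(documents: Iterable[str], wrapper_tag: str = "collection") -> str:
    """Concatenate converted documents inside one wrapper element."""
    body = "".join(strip_declaration(doc) for doc in documents)
    return f"<{wrapper_tag}>{body}</{wrapper_tag}>"


if __name__ == "__main__":
    doc = '<?xml version="1.0" encoding="UTF-8"?>\n<document><page number="1"/></document>'
    combined = embed_documents([doc, doc])
    # Parsing succeeds only because the declarations were removed.
    print(len(ET.fromstring(combined)))
```

Disabling 'Include XML Declaration' at conversion time makes the stripping step unnecessary.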

Is there a file size limit for the PDF?

Yes, the maximum supported file size for PDF uploads is 50 MB.

Does the tool support scanned PDFs with OCR?

No, this tool extracts text and structure from digital PDFs; it does not perform OCR on scanned images.

API Documentation

Request Endpoint

POST /en/api/tools/pdf-to-xml

Request Parameters

Parameter Name      Type                    Required  Description
sourceFile          file (Upload required)  Yes       -
outputMode          select                  No        -
includeDeclaration  checkbox                No        -

File parameters must be uploaded first via POST /upload/pdf-to-xml, which returns a filePath. Pass that filePath as the value of the corresponding file field.
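
The two-step flow can be sketched in Python as follows. Several details are assumptions not confirmed by this page: the host `https://elysiatools.com` (taken from the MCP configuration on this page), the multipart field name `file`, and the use of a JSON body for the conversion call. The default `outputMode` value `structured` is copied from the sample input parameters above. The functions only build the requests; nothing is sent.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://elysiatools.com"  # assumption: host taken from the MCP baseUrl
UPLOAD_PATH = "/upload/pdf-to-xml"
CONVERT_PATH = "/en/api/tools/pdf-to-xml"


def build_upload_request(pdf_bytes: bytes, filename: str) -> urllib.request.Request:
    """Step 1: build a multipart upload request (field name "file" is assumed)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        BASE_URL + UPLOAD_PATH,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def build_convert_request(file_path: str, output_mode: str = "structured",
                          include_declaration: bool = True) -> urllib.request.Request:
    """Step 2: build the conversion request from the uploaded file's filePath.

    Assumption: the endpoint accepts a JSON body with the documented names.
    """
    payload = {
        "sourceFile": file_path,
        "outputMode": output_mode,
        "includeDeclaration": include_declaration,
    }
    return urllib.request.Request(
        BASE_URL + CONVERT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_convert_request("/public/processing/randomid.pdf")
    print(request.get_method(), request.full_url)
    # To actually send a request: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.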

Response Format

{
  "filePath": "/public/processing/randomid.ext",
  "fileName": "output.ext",
  "contentType": "application/octet-stream",
  "size": 1024,
  "metadata": {
    "key": "value"
  },
  "error": "Error message (optional)",
  "message": "Notification message (optional)"
}
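
A client might map this response onto a small typed structure. The Python sketch below assumes that a non-empty `error` field signals a failed conversion and that the other fields may then be missing; this page does not state that behavior explicitly. The `ToolResponse` and `ToolError` names are illustrative.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


class ToolError(Exception):
    """Raised when the response carries an "error" message."""


@dataclass
class ToolResponse:
    file_path: str
    file_name: str
    content_type: str
    size: int
    metadata: Dict[str, Any] = field(default_factory=dict)
    message: Optional[str] = None


def parse_response(body: str) -> ToolResponse:
    """Parse the JSON response body, raising ToolError if "error" is set."""
    data = json.loads(body)
    if data.get("error"):
        raise ToolError(data["error"])
    return ToolResponse(
        file_path=data["filePath"],
        file_name=data["fileName"],
        content_type=data["contentType"],
        size=int(data["size"]),
        metadata=data.get("metadata") or {},
        message=data.get("message"),
    )
```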

AI MCP Documentation

Add this tool to your MCP server configuration:

{
  "mcpServers": {
    "elysiatools-pdf-to-xml": {
      "name": "pdf-to-xml",
      "description": "Convert PDF documents to structured XML format with preserved content hierarchy",
      "baseUrl": "https://elysiatools.com/mcp/sse?toolId=pdf-to-xml",
      "command": "",
      "args": [],
      "env": {},
      "isActive": true,
      "type": "sse"
    }
  }
}

You can chain multiple tools by listing their IDs, comma-separated, in the `toolId` parameter (maximum 20 tools), e.g.: `https://elysiatools.com/mcp/sse?toolId=png-to-webp,jpg-to-webp,gif-to-webp`.
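
A small helper can assemble the chained URL and enforce the 20-tool limit before the configuration is written. This is a Python sketch; the function name is illustrative.

```python
from typing import Iterable

MCP_BASE = "https://elysiatools.com/mcp/sse"
MAX_TOOLS = 20  # limit stated in the documentation above


def build_mcp_url(tool_ids: Iterable[str]) -> str:
    """Join tool IDs with commas into the toolId query parameter."""
    ids = [tool_id.strip() for tool_id in tool_ids]
    if not ids or any(not tool_id for tool_id in ids):
        raise ValueError("at least one non-empty tool ID is required")
    if len(ids) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} tools can be chained, got {len(ids)}")
    # Commas are left unescaped to match the documented URL form.
    return f"{MCP_BASE}?toolId={','.join(ids)}"


if __name__ == "__main__":
    print(build_mcp_url(["pdf-to-xml"]))
```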

File parameters accept either a URL link to the file or Base64-encoded content.

If you encounter any issues, please contact us at [email protected]