PDF Reading Order Debugger

Compare raw PDF draw order against XY-Cut++ reading order to spot multi-column and layout-related extraction issues

Run the same PDF through OpenDataLoader with readingOrder=off and readingOrder=xycut, then inspect the per-page text difference. This is useful for multi-column reports, scientific papers, brochures, and any layout where plain text draw order may scramble reading flow.

Example Results

1 example

Debug a multi-column report before building citations

Compare raw extraction and XY-Cut++ so you can choose the safer reading-order mode for downstream RAG or summarization.

Reading order comparison report for 2 extracted pages with 0 changed pages between raw draw order and XY-Cut++.
Input parameters:
{ "pdfFile": "/public/samples/pdf/ebook-navigation-example1.pdf", "useStructTree": false, "includeHeaderFooter": false, "pages": "1-4" }

Maximum file size: 10MB. Supported formats: application/pdf.

Key Facts

Category
Developer & Web
Input Types
file, checkbox, text
Output Type
html
Sample Coverage
4
API Ready
Yes

Overview

The PDF Reading Order Debugger is a specialized tool for developers and data scientists to identify and resolve text extraction issues caused by complex document layouts. By comparing the raw draw order of a PDF against the advanced XY-Cut++ reading order algorithm, it provides a side-by-side HTML report that highlights where multi-column layouts, sidebars, or headers might scramble the logical flow of text.

When to Use

  • When extracted text from multi-column reports or academic papers appears out of order.
  • Before configuring RAG pipelines to ensure document context and citations remain logically sequenced.
  • When debugging whether structural tags or layout-aware algorithms are necessary for a specific set of documents.

How It Works

  • Upload your PDF file and optionally specify a page range to limit the analysis.
  • The tool processes the document twice: once using the raw draw order and once using the XY-Cut++ layout-aware algorithm.
  • It compares the results per page to detect differences in text sequencing and block identification.
  • An interactive HTML report is generated, visualizing the reading order differences to help you choose the best extraction settings.
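The per-page comparison step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: it assumes you already have one text string per page from each mode, and the function name `compare_pages` and the sample strings are hypothetical.

```python
import difflib


def compare_pages(raw_pages, xycut_pages):
    """Return a list of (page_number, similarity, changed) tuples.

    raw_pages and xycut_pages are lists of page texts, one string per page,
    extracted with the two reading-order modes (hypothetical inputs).
    """
    report = []
    for number, (raw, ordered) in enumerate(zip(raw_pages, xycut_pages), start=1):
        raw_lines = raw.splitlines()
        ordered_lines = ordered.splitlines()
        # ratio() is 1.0 when both line sequences are identical.
        similarity = difflib.SequenceMatcher(None, raw_lines, ordered_lines).ratio()
        report.append((number, similarity, raw_lines != ordered_lines))
    return report


# Made-up sample: page 1 is two-column and drawn row by row, page 2 is single-column.
raw = ["Title\nLeft 1\nRight 1\nLeft 2\nRight 2", "Single column page"]
xycut = ["Title\nLeft 1\nLeft 2\nRight 1\nRight 2", "Single column page"]

report = compare_pages(raw, xycut)
changed = [number for number, _, is_changed in report if is_changed]
print(f"{len(changed)} changed page(s): {changed}")
```

A page with zero differences suggests raw draw order is already safe for that page; a changed page is a candidate for layout-aware extraction.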

Use Cases

  • Validating the extraction logic for two-column scientific journals to prevent sentence fragmentation.
  • Optimizing financial report processing by identifying where sidebars interfere with table data.
  • Troubleshooting layout-related extraction errors in brochures and marketing materials with non-linear text flows.

Examples

1. Multi-Column Research Paper Debugging

AI Engineer
Background
An engineer is building a citation-aware RAG system but finds that text from the left column is merging with the right column.
Problem
The raw extraction order is reading horizontally across the entire page width instead of following the columns.
How to Use
Upload the research PDF, set the page range to '1-5', and run the debugger; it processes the document in both raw draw order and XY-Cut++ order.
Outcome
The HTML report shows that XY-Cut++ correctly isolates the columns, while the raw order fails, confirming that layout-aware extraction is required.

2. Financial Report Header Interference

Data Analyst
Background
A quarterly report contains page numbers and headers that appear in the middle of paragraphs when converted to plain text.
Problem
Headers are being injected into the text stream, breaking the continuity of financial narratives.
How to Use
Upload the report, run the debugger once with 'Include Header/Footer' checked and once without.
Outcome
The comparison identifies exactly which text blocks are headers, allowing the analyst to safely exclude them in the production extraction pipeline.
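One way to approximate this comparison offline is to diff the text of the two runs and keep the lines that only appear when headers and footers are included. The sketch below assumes you have the plain text of both runs available as strings; the function name `header_footer_lines` and the sample report text are made up for illustration.

```python
from collections import Counter


def header_footer_lines(with_hf, without_hf):
    """Return lines present only in the run that kept headers/footers, in order."""
    # Treat the body text as a multiset so repeated lines are matched one-to-one.
    body = Counter(without_hf.splitlines())
    extra = []
    for line in with_hf.splitlines():
        if body[line] > 0:
            body[line] -= 1
        else:
            extra.append(line)
    return extra


# Made-up sample text, not taken from a real report.
with_hf = "ACME Q3 Report\nRevenue grew year over year,\n12\ndriven by services."
without_hf = "Revenue grew year over year,\ndriven by services."

candidates = header_footer_lines(with_hf, without_hf)
print(candidates)
```

The resulting lines are only candidates; a quick manual review is still advisable before excluding them in production.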

FAQ

What is XY-Cut++?

It is a layout-analysis algorithm that recursively partitions a page into horizontal and vertical blocks to determine the correct human reading sequence.
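To make the idea of recursive partitioning concrete, here is a minimal sketch of a classic XY-cut, not the XY-Cut++ variant used by the tool. It assumes text blocks are given as `(x0, y0, x1, y1, text)` tuples with the origin at the top-left, and it always cuts at the widest whitespace band; the names `largest_gap` and `xy_cut` are illustrative.

```python
def largest_gap(boxes, lo, hi):
    """Find the widest empty band along one axis.

    lo and hi are the tuple indexes of the box edges on that axis.
    Returns (gap_size, cut_position); gap_size is 0.0 when no gap exists.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_size, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest edge covered so far
    for start, end in intervals[1:]:
        if start - reach > best_size:
            best_size, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_size, best_cut


def xy_cut(boxes, min_gap=1.0):
    """Order boxes by recursively splitting at the widest whitespace band."""
    if len(boxes) <= 1:
        return list(boxes)
    x_size, x_cut = largest_gap(boxes, 0, 2)
    y_size, y_cut = largest_gap(boxes, 1, 3)
    widest = max(x_size, y_size)
    if widest <= 0 or widest < min_gap:
        # No usable whitespace: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if y_size >= x_size:
        first = [b for b in boxes if b[3] <= y_cut]   # above the cut
        second = [b for b in boxes if b[1] >= y_cut]  # below the cut
    else:
        first = [b for b in boxes if b[2] <= x_cut]   # left of the cut
        second = [b for b in boxes if b[0] >= x_cut]  # right of the cut
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A full-width title above two columns, listed in row-by-row draw order.
page = [
    (50, 40, 550, 60, "Title"),
    (50, 100, 280, 115, "Left 1"),
    (320, 100, 550, 115, "Right 1"),
    (50, 120, 280, 135, "Left 2"),
    (320, 120, 550, 135, "Right 2"),
]
order = [box[4] for box in xy_cut(page)]
print(order)
```

In this sample the title is split off first, then the column gutter (wider than the line spacing) separates the two columns, so the left column is read in full before the right one.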

Why does the raw draw order often look scrambled?

PDFs store text based on the order it was added to the file, which frequently differs from the visual layout of columns and sidebars.

Can I test how headers and footers affect extraction?

Yes, you can toggle the 'Include Header/Footer' option to see if these elements disrupt the main content flow.

What does the 'Use Struct Tree' option do?

It attempts to use the internal structural tags (if present in the PDF) to determine the reading order instead of relying solely on visual layout.

What format is the final report?

The tool outputs an HTML file that provides a visual comparison of the text order differences for every processed page.

API Documentation

Request Endpoint

POST /en/api/tools/pdf-reading-order-debugger

Request Parameters

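Parameter Name       | Type                   | Required | Description
pdfFile              | file (upload required) | Yes      | The PDF to analyze (maximum 10MB).
useStructTree        | checkbox               | No       | Use the PDF's internal structural tags, if present, to determine reading order.
includeHeaderFooter  | checkbox               | No       | Include headers and footers in the extracted text.
pages                | text                   | No       | Page range to limit the analysis, e.g. "1-4".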

File parameters must be uploaded first via POST /upload/pdf-reading-order-debugger, which returns a filePath; pass that filePath in the corresponding file field.
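The second step of that flow might look like the sketch below, which builds the tool request from a filePath you already obtained. Several things here are assumptions, since this page does not specify them: the host (taken from the MCP URL on this page), the JSON request encoding, and the helper name `build_tool_request`. The upload call itself is not shown because its request format is not documented here.

```python
import json
import urllib.request

# Assumption: same host as the MCP URL shown on this page.
BASE_URL = "https://elysiatools.com"


def build_tool_request(file_path, pages="", use_struct_tree=False,
                       include_header_footer=False):
    """Build (but do not send) the POST request for the debugger endpoint.

    file_path is the filePath value returned by the upload step.
    Assumption: the endpoint accepts a JSON body; the page does not say.
    """
    payload = {
        "pdfFile": file_path,
        "useStructTree": use_struct_tree,
        "includeHeaderFooter": include_header_footer,
        "pages": pages,
    }
    return urllib.request.Request(
        BASE_URL + "/en/api/tools/pdf-reading-order-debugger",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_tool_request(
    "/public/samples/pdf/ebook-navigation-example1.pdf", pages="1-4"
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.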

Response Format

{
  "result": "Processed HTML content",
  "error": "Error message (optional)",
  "message": "Notification message (optional)",
  "metadata": { "key": "value" }
}
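A response in this shape could be handled along the following lines. This is a sketch under the assumption that a non-empty "error" field signals failure and that "result" carries the HTML report as a string; the helper name `extract_report_html` and the sample body are hypothetical.

```python
import json


def extract_report_html(response_text):
    """Return the HTML report from a response body, or raise if it reports an error."""
    body = json.loads(response_text)
    if body.get("error"):
        raise RuntimeError(body["error"])
    return body["result"]


# Made-up response body following the documented field names.
sample = json.dumps({
    "result": "<html><body>report</body></html>",
    "message": "ok",
    "metadata": {"key": "value"},
})
html = extract_report_html(sample)
print(len(html))
# To keep the report: pathlib.Path("report.html").write_text(html, encoding="utf-8")
```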

AI MCP Documentation

Add this tool to your MCP server configuration:

{
  "mcpServers": {
    "elysiatools-pdf-reading-order-debugger": {
      "name": "pdf-reading-order-debugger",
      "description": "Compare raw PDF draw order against XY-Cut++ reading order to spot multi-column and layout-related extraction issues",
      "baseUrl": "https://elysiatools.com/mcp/sse?toolId=pdf-reading-order-debugger",
      "command": "",
      "args": [],
      "env": {},
      "isActive": true,
      "type": "sse"
    }
  }
}

You can chain multiple tools, e.g.: `https://elysiatools.com/mcp/sse?toolId=png-to-webp,jpg-to-webp,gif-to-webp`, max 20 tools.
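If you generate the configuration programmatically, a small helper can build the chained URL and enforce the 20-tool limit. This is an illustrative sketch; the function name `build_mcp_url` is not part of any official client, and it joins the tool IDs with plain commas as in the example above.

```python
MAX_TOOLS = 20
MCP_SSE_URL = "https://elysiatools.com/mcp/sse"


def build_mcp_url(tool_ids):
    """Join tool IDs into a single MCP SSE URL, respecting the documented limit."""
    if not tool_ids:
        raise ValueError("at least one tool ID is required")
    if len(tool_ids) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} tools can be chained")
    return f"{MCP_SSE_URL}?toolId={','.join(tool_ids)}"


url = build_mcp_url(["png-to-webp", "jpg-to-webp", "gif-to-webp"])
print(url)
```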

Supports URL file links or Base64 encoding for file parameters.

If you encounter any issues, please contact us at [email protected]