Key Facts
- Category: Developer & Web
- Input Types: file, checkbox, text
- Output Type: html
- Sample Coverage: 4
- API Ready: Yes
Overview
The PDF Reading Order Debugger is a specialized tool that helps developers and data scientists identify and resolve text extraction issues caused by complex document layouts. It compares the raw draw order of a PDF against the order produced by the XY-Cut++ reading order algorithm and generates a side-by-side HTML report that highlights where multi-column layouts, sidebars, or headers scramble the logical flow of text.
When to Use
- When extracted text from multi-column reports or academic papers appears out of order.
- Before configuring a RAG pipeline, to confirm that document context and citations stay in logical sequence.
- When deciding whether structural tags or layout-aware algorithms are necessary for a specific set of documents.
How It Works
- Upload your PDF file and optionally specify a page range to limit the analysis.
- The tool processes the document twice: once using the raw draw order and once using the XY-Cut++ layout-aware algorithm.
- It compares the results per page to detect differences in text sequencing and block identification.
- An interactive HTML report is generated, visualizing the reading order differences to help you choose the best extraction settings.
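The per-page comparison step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: each page is assumed to be available as two lists of block labels (one in raw draw order, one in layout-aware order), labels are assumed to be unique within a page, and the names `diff_reading_order` and `compare_document` are hypothetical.

```python
def diff_reading_order(raw, layout):
    """List (block, raw_index, layout_index) for blocks whose position changed.

    Assumes block labels are unique within a page (hypothetical input format).
    """
    layout_pos = {block: i for i, block in enumerate(layout)}
    return [
        (block, i, layout_pos[block])
        for i, block in enumerate(raw)
        if block in layout_pos and layout_pos[block] != i
    ]


def compare_document(pages):
    """pages maps page number -> (raw_order, layout_order)."""
    return {
        page: diff_reading_order(raw, layout)
        for page, (raw, layout) in pages.items()
    }


# Illustrative data: page 1 is a two-column page whose draw order
# alternates between the columns; page 2 is single-column.
pages = {
    1: (["Title", "L1", "R1", "L2", "R2"], ["Title", "L1", "L2", "R1", "R2"]),
    2: (["Heading", "Body"], ["Heading", "Body"]),
}

for page, moved in compare_document(pages).items():
    print(page, moved)
```

Pages that map to an empty list read the same under both orders, so raw extraction is likely sufficient for them.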
Use Cases
Examples
1. Multi-Column Research Paper Debugging
AI Engineer
- Background: An engineer is building a citation-aware RAG system but finds that text from the left column is merging with the right column.
- Problem: The raw extraction order is reading horizontally across the entire page width instead of following the columns.
- How to Use: Upload the research PDF, set the page range to '1-5', and run the debugger with XY-Cut++ enabled.
- Outcome: The HTML report shows that XY-Cut++ correctly isolates the columns, while the raw order fails, confirming that layout-aware extraction is required.
2. Financial Report Header Interference
Data Analyst
- Background: A quarterly report contains page numbers and headers that appear in the middle of paragraphs when converted to plain text.
- Problem: Headers are being injected into the text stream, breaking the continuity of financial narratives.
- How to Use: Upload the report, then run the debugger twice: once with 'Include Header/Footer' checked and once without.
- Outcome: The comparison identifies exactly which text blocks are headers, allowing the analyst to safely exclude them in the production extraction pipeline.
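Once the header blocks are known, excluding them downstream might look like the sketch below. It is a separate heuristic, not something the debugger is documented to do: it assumes each page has already been extracted as a list of text lines, treats any line that recurs on most pages (after replacing digits, so "Page 1" and "Page 2" match) as a header or footer, and uses hypothetical names (`normalize`, `repeated_lines`, `strip_headers`) and a hypothetical `min_share` threshold.

```python
import re
from collections import Counter


def normalize(line):
    """Lowercase, trim, and mask digits so running page numbers compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def repeated_lines(pages, min_share=0.6):
    """Return normalized lines that occur on at least min_share of the pages."""
    counts = Counter()
    for lines in pages:
        # A set counts each distinct line once per page.
        counts.update({normalize(line) for line in lines})
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if n >= threshold}


def strip_headers(pages, min_share=0.6):
    """Drop lines flagged as repeated headers/footers from every page."""
    repeated = repeated_lines(pages, min_share)
    return [
        [line for line in lines if normalize(line) not in repeated]
        for lines in pages
    ]


# Illustrative data only.
report = [
    ["Acme Corp Quarterly Report", "Revenue grew in the quarter.", "Page 1"],
    ["Acme Corp Quarterly Report", "Costs fell.", "Page 2"],
    ["Acme Corp Quarterly Report", "Outlook is stable.", "Page 3"],
]

print(strip_headers(report))
```

Masking digits also affects body lines that contain numbers, so on real financial text the threshold and normalization would need checking against the debugger's report before use.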
FAQ
What is XY-Cut++?
It is a layout-analysis algorithm that recursively partitions a page into horizontal and vertical blocks to determine the correct human reading sequence.
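To make the idea of recursive partitioning concrete, here is a minimal sketch of a classic XY-cut, the family of algorithms the name refers to. It is not XY-Cut++ itself and not the tool's code: the `Box` type, the `widest_gap` and `xy_cut` names, the coordinates, and the rule of always cutting along the widest empty gap are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    label: str
    x0: float
    y0: float  # y grows downward, as in most page coordinate systems
    x1: float
    y1: float


def widest_gap(intervals) -> Optional[Tuple[float, float]]:
    """Return (gap_size, cut_position) for the widest empty gap, or None."""
    intervals = sorted(intervals)
    best = None
    reach = intervals[0][1]  # furthest end covered so far
    for start, end in intervals[1:]:
        if start > reach and (best is None or start - reach > best[0]):
            best = (start - reach, (reach + start) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Order boxes by recursively splitting along the widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap = widest_gap([(b.x0, b.x1) for b in boxes])
    y_gap = widest_gap([(b.y0, b.y1) for b in boxes])
    if x_gap is None and y_gap is None:
        # No clean cut exists: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b.y0, b.x0))
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]  # vertical cut: left part is read before right part
        first = [b for b in boxes if b.x1 <= cut]
        second = [b for b in boxes if b.x0 >= cut]
    else:
        cut = y_gap[1]  # horizontal cut: upper part is read before lower part
        first = [b for b in boxes if b.y1 <= cut]
        second = [b for b in boxes if b.y0 >= cut]
    return xy_cut(first) + xy_cut(second)


# A full-width title above two columns, listed in an interleaved draw order.
draw_order = [
    Box("Title", 50, 40, 550, 70),
    Box("L1", 50, 100, 280, 120),
    Box("R1", 320, 100, 550, 120),
    Box("L2", 50, 130, 280, 150),
    Box("R2", 320, 130, 550, 150),
]

print([b.label for b in xy_cut(draw_order)])
```

In this example the title is separated first because no vertical gap spans the whole page, then the column gutter (wider than the line spacing) splits the body into left and right columns, and each column is finally read top to bottom.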
Why does the raw draw order often look scrambled?
PDFs store text in the order it was drawn onto the page when the file was created, which frequently differs from the visual layout of columns and sidebars.
Can I test how headers and footers affect extraction?
Yes, you can toggle the 'Include Header/Footer' option to see if these elements disrupt the main content flow.
What does the 'Use Struct Tree' option do?
It attempts to use the internal structural tags (if present in the PDF) to determine the reading order instead of relying solely on visual layout.
What format is the final report?
The tool outputs an HTML file that provides a visual comparison of the text order differences for every processed page.