Key Facts
- Category: Developer & Web
- Input Types: file, text, checkbox
- Output Type: html
- Sample Coverage: 4
- API Ready: Yes
Overview
The Tagged PDF Inspector evaluates the semantic quality of your PDF documents by comparing extraction results with and without StructTree support. By running OpenDataLoader in both modes, it highlights differences in node counts, headings, lists, and tables, helping developers and data engineers determine if a document's internal tagging is reliable for accessibility, content conversion, or RAG pipelines.
When to Use
- When auditing a corpus of PDFs to determine whether they contain reliable semantic tags for accessibility compliance.
- Before building a Retrieval-Augmented Generation (RAG) pipeline, to see whether StructTree extraction yields better document chunking.
- When troubleshooting missing headings or broken tables during automated PDF-to-HTML or PDF-to-Markdown conversion.
How It Works
- Upload your target PDF file and optionally specify a page range to limit processing time.
- Choose whether to include headers and footers in the extraction analysis.
- The tool processes the document twice using OpenDataLoader: once with StructTree enabled and once without.
- Review the generated HTML report to compare semantic node counts, heading structures, and table recognition side by side.
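The comparison in the last step can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes each extraction pass has already been reduced to a list of node type strings (the names `heading`, `list`, `table`, and `paragraph` are hypothetical), and it makes no calls to OpenDataLoader itself.

```python
from collections import Counter


def compare_node_counts(with_struct_tree, without_struct_tree):
    """Return {node_type: (tagged_count, plain_count, delta)}.

    Both arguments are lists of node type strings, one entry per
    extracted node. The type names are assumptions for illustration.
    """
    tagged = Counter(with_struct_tree)
    plain = Counter(without_struct_tree)
    report = {}
    # Cover every type seen in either pass so missing types show up as 0.
    for node_type in sorted(set(tagged) | set(plain)):
        t, p = tagged[node_type], plain[node_type]
        report[node_type] = (t, p, t - p)
    return report


# Hypothetical node types from the two passes over the same pages.
tagged_nodes = ["heading", "heading", "paragraph", "table", "list"]
plain_nodes = ["paragraph", "paragraph", "paragraph", "heading"]

for node_type, (t, p, delta) in compare_node_counts(tagged_nodes, plain_nodes).items():
    print(f"{node_type:<10} tagged={t} plain={p} delta={delta:+d}")
```

A positive delta for headings, lists, or tables suggests the StructTree pass recovered structure that plain extraction flattened into paragraphs; a negative delta is a hint that the tags may be incomplete.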
Use Cases
Examples
1. Evaluating Brand Guidelines for RAG Ingestion
Data Engineer
- Background: A data engineer is preparing a set of corporate brand guidelines for a RAG-based chatbot.
- Problem: The PDF has complex layouts, and standard text extraction merges columns and loses heading hierarchy.
- How to Use: Upload the brand guidelines PDF, leave 'Include Header/Footer' unchecked, and run the inspector.
- Example Config: Pages: 1-10
- Outcome: The report shows that StructTree extraction identifies 20 semantic nodes with proper heading levels, whereas plain extraction yields 22 fragmented nodes, indicating that the tagged structure is the better basis for chunking.
2. Auditing Financial Reports for Accessibility
Accessibility Specialist
- Background: An accessibility specialist needs to verify whether a newly published annual report meets basic tagging requirements for screen readers.
- Problem: It is unclear whether the tables and lists in the PDF are properly tagged or just visually formatted to look like tables and lists.
- How to Use: Upload the annual report PDF, specify the pages containing tables, and generate the comparison report.
- Example Config: Pages: 15-20
- Outcome: The side-by-side HTML report reveals that StructTree mode groups table cells and list items correctly, which indicates that these elements carry real structural tags rather than visual formatting alone.
FAQ
What is a Tagged PDF?
A Tagged PDF contains hidden structural metadata (a StructTree) that defines reading order, headings, paragraphs, and tables, improving accessibility and data extraction.
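As a rough illustration of what "tagged" means at the file level: in the PDF specification, a tagged document's catalog carries a `/StructTreeRoot` entry and a `/MarkInfo` dictionary whose `/Marked` flag is true. The sketch below checks those two markers on a plain Python dict standing in for the catalog. How you obtain the catalog depends on your PDF library, and the helper name `looks_tagged` is hypothetical.

```python
def looks_tagged(catalog):
    """Heuristic: the catalog declares Marked=true and has a structure tree root.

    `catalog` is a plain dict standing in for the PDF document catalog;
    this is an assumption for illustration, not a specific library's type.
    """
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog


tagged_catalog = {
    "/Type": "/Catalog",
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot"},
}
untagged_catalog = {"/Type": "/Catalog"}

print(looks_tagged(tagged_catalog), looks_tagged(untagged_catalog))
```

Passing this check only shows that tags are present, not that they are accurate, which is why comparing the two extraction modes is still useful.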
Why compare extraction methods?
Many PDFs have poorly constructed or missing tags. Comparing the outputs reveals whether relying on the document's internal StructTree improves or degrades the extracted content.
Can I inspect specific pages instead of the whole document?
Yes, you can use the Pages input to specify a range, such as '1,3,5-7', to focus the analysis on specific sections and speed up processing.
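For readers scripting around the tool, a page specification of this shape can be expanded as shown below. This is a minimal sketch that assumes 1-based page numbers and inclusive ranges; the function name `parse_pages` is hypothetical, and the tool's own validation rules are not documented here.

```python
def parse_pages(spec):
    """Expand a spec like '1,3,5-7' into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges; duplicates are merged.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and whitespace
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```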
What does the HTML report show?
The report displays a side-by-side comparison of semantic node counts, text differences, and how elements like headings, lists, and tables are recognized in both modes.
Does this tool modify my original PDF?
No, the tool only reads the PDF to extract and analyze its structure. Your original file remains completely unchanged.