Tagged PDF Inspector

Compare StructTree-enabled and plain PDF extraction to see whether a document behaves like a tagged PDF and how much semantic structure it exposes

Run OpenDataLoader with and without StructTree support, then compare semantic node counts, headings, lists, and table recognition. This helps you understand whether a PDF carries useful tagged structure for accessibility, content conversion, and RAG ingestion.

Example Results

1 example

Inspect whether a brand PDF carries useful tagged structure

Run the same PDF with and without StructTree support, then compare semantic-node counts and sample headings.

Real sample report comparing StructTree and plain extraction; this sample produced 20 vs 22 semantic nodes and showed heading-text differences.
Input parameters:
{ "pdfFile": "/public/samples/pdf/brand-guidelines-pdf-example1.pdf", "pages": "", "includeHeaderFooter": false }

Maximum file size: 10 MB. Supported formats: application/pdf.

Key Facts

Category
Developer & Web
Input Types
file, text, checkbox
Output Type
html
Sample Coverage
4
API Ready
Yes

Overview

The Tagged PDF Inspector evaluates the semantic quality of your PDF documents by comparing extraction results with and without StructTree support. By running OpenDataLoader in both modes, it highlights differences in node counts, headings, lists, and tables, helping developers and data engineers determine if a document's internal tagging is reliable for accessibility, content conversion, or RAG pipelines.

When to Use

  • When auditing a corpus of PDFs to determine if they contain reliable semantic tags for accessibility compliance.
  • Before building a Retrieval-Augmented Generation (RAG) pipeline to see if StructTree extraction yields better document chunking.
  • When troubleshooting missing headings or broken tables during automated PDF-to-HTML or PDF-to-Markdown conversion.

How It Works

  • Upload your target PDF file and optionally specify a page range to limit the processing time.
  • Choose whether to include headers and footers in the extraction analysis.
  • The tool processes the document twice using OpenDataLoader: once with StructTree enabled and once without.
  • Review the generated HTML report to compare semantic node counts, heading structures, and table recognition side-by-side.
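The comparison step can be sketched in a few lines of Python. This is only an illustration: the node shape (`type` and `text` keys) and the helper names `summarize` and `compare_runs` are assumptions made for this sketch, not OpenDataLoader's actual output format or API.

```python
from collections import Counter


def summarize(nodes):
    """Count nodes by semantic type, e.g. {'heading': 1, 'paragraph': 2}."""
    return Counter(node["type"] for node in nodes)


def compare_runs(struct_nodes, plain_nodes):
    """Compare a StructTree-enabled run with a plain run (hypothetical node shape)."""
    struct_counts = summarize(struct_nodes)
    plain_counts = summarize(plain_nodes)
    types = sorted(set(struct_counts) | set(plain_counts))
    struct_headings = [n["text"] for n in struct_nodes if n["type"] == "heading"]
    plain_headings = [n["text"] for n in plain_nodes if n["type"] == "heading"]
    return {
        "total": (len(struct_nodes), len(plain_nodes)),
        # One row per type: (type, count with StructTree, count without)
        "by_type": [(t, struct_counts[t], plain_counts[t]) for t in types],
        "headings_only_in_struct": [h for h in struct_headings if h not in plain_headings],
        "headings_only_in_plain": [h for h in plain_headings if h not in struct_headings],
    }


# Made-up sample data: plain extraction splits one heading and misses the list.
struct_run = [
    {"type": "heading", "text": "Brand Colors"},
    {"type": "paragraph", "text": "Use the primary palette."},
    {"type": "list", "text": "Blue, Gray"},
]
plain_run = [
    {"type": "heading", "text": "Brand"},
    {"type": "heading", "text": "Colors"},
    {"type": "paragraph", "text": "Use the primary palette."},
    {"type": "paragraph", "text": "Blue, Gray"},
]
report = compare_runs(struct_run, plain_run)
```

A higher node count in one mode is not automatically better; the per-type rows and the heading differences are what show whether the tags add structure or only fragment the text.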

Use Cases

  • Evaluating document accessibility readiness by verifying the presence and quality of internal PDF tags.
  • Optimizing data ingestion for LLMs by determining the best extraction strategy for complex, multi-column PDFs.
  • Quality assurance testing for document generation software to ensure exported PDFs contain valid semantic structures.

Examples

1. Evaluating Brand Guidelines for RAG Ingestion

Data Engineer
Background
A data engineer is preparing a set of corporate brand guidelines for a RAG-based chatbot.
Problem
The PDF has complex layouts, and standard text extraction merges columns and loses heading hierarchy.
How to Use
Upload the brand guidelines PDF, leave 'Include Header/Footer' unchecked, and run the inspector.
Example Config
Pages: 1-10
Outcome
The report shows that StructTree extraction identifies 20 semantic nodes with proper heading levels, whereas plain extraction yields 22 more fragmented nodes, which suggests the tagged structure is the better basis for chunking this document.

2. Auditing Financial Reports for Accessibility

Accessibility Specialist
Background
An accessibility specialist needs to verify if a newly published annual report meets basic tagging requirements for screen readers.
Problem
It is unclear if the tables and lists in the PDF are properly tagged or just visually formatted to look like tables.
How to Use
Upload the annual report PDF, specify the pages containing tables, and generate the comparison report.
Example Config
Pages: 15-20
Outcome
The side-by-side HTML report reveals that the StructTree mode successfully groups table cells and list items, indicating that the tables and lists are actually tagged rather than only visually formatted.

FAQ

What is a Tagged PDF?

A Tagged PDF contains hidden structural metadata (a StructTree) that defines reading order, headings, paragraphs, and tables, improving accessibility and data extraction.

Why compare extraction methods?

Many PDFs have poorly constructed or missing tags. Comparing the outputs reveals whether relying on the document's internal StructTree improves or degrades the extracted content.

Can I inspect specific pages instead of the whole document?

Yes, you can use the Pages input to specify a range, such as '1,3,5-7', to focus the analysis on specific sections and speed up processing.
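To make the range syntax concrete, here is a minimal sketch of how a spec such as '1,3,5-7' could be expanded into page numbers. The function name `parse_pages` and its validation rules are assumptions for illustration; the tool's own parser is not documented here and may behave differently (for example, the sample input passes an empty string, which this sketch treats as "no restriction").

```python
def parse_pages(spec):
    """Expand a spec like '1,3,5-7' into a sorted list of 1-based page numbers.

    An empty spec returns an empty list, which this sketch treats as "all pages".
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as '5-7' covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


selected = parse_pages("1,3,5-7")
```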

What does the HTML report show?

The report displays a side-by-side comparison of semantic node counts, text differences, and how elements like headings, lists, and tables are recognized in both modes.

Does this tool modify my original PDF?

No, the tool only reads the PDF to extract and analyze its structure. Your original file remains completely unchanged.

API Documentation

Request Endpoint

POST /en/api/tools/tagged-pdf-inspector

Request Parameters

Parameter Name        Type                     Required   Description
pdfFile               file (Upload required)   Yes        -
pages                 text                     No         -
includeHeaderFooter   checkbox                 No         -

File-type parameters must be uploaded first via POST /upload/tagged-pdf-inspector to obtain a filePath; then pass that filePath in the corresponding file field (here, pdfFile).
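The second step of that flow might look like the sketch below, which builds the inspector request without sending it. Several details are assumptions rather than documented behavior: the host (taken from the MCP configuration on this page), the JSON request body, and the placeholder path `/uploads/report.pdf`. The upload step is left out because its multipart field name is not documented here.

```python
import json
import urllib.request

BASE_URL = "https://elysiatools.com"  # assumption: same host as the MCP endpoint


def build_inspect_request(file_path, pages="", include_header_footer=False):
    """Build (but do not send) the POST request for the inspector endpoint."""
    payload = {
        "pdfFile": file_path,  # the filePath returned by the upload step
        "pages": pages,
        "includeHeaderFooter": include_header_footer,
    }
    return urllib.request.Request(
        BASE_URL + "/en/api/tools/tagged-pdf-inspector",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical filePath; in practice use the value returned by the upload call.
request = build_inspect_request("/uploads/report.pdf", pages="1-10")
```

Calling `send(request)` would perform the actual HTTP call; it is kept separate so the payload can be inspected first.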

Response Format

{
  "result": "Processed HTML content",
  "error": "Error message (optional)",
  "message": "Notification message (optional)",
  "metadata": { "key": "value" }
}
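A caller would typically check the optional error field before using the result. The sketch below shows one way to do that; the helper name `extract_html` and the choice to raise exceptions are assumptions for illustration, and the sample dictionary is made up to match the field names above.

```python
def extract_html(response):
    """Return the HTML report, or raise if the response carries an error."""
    error = response.get("error")
    if error:
        raise RuntimeError(f"inspector failed: {error}")
    result = response.get("result")
    if not isinstance(result, str):
        raise ValueError("response has no 'result' HTML string")
    return result


# Made-up response using the documented field names.
ok = {"result": "<p>Processed HTML content</p>", "metadata": {"key": "value"}}
html = extract_html(ok)
```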

AI MCP Documentation

Add this tool to your MCP server configuration:

{
  "mcpServers": {
    "elysiatools-tagged-pdf-inspector": {
      "name": "tagged-pdf-inspector",
      "description": "Compare StructTree-enabled and plain PDF extraction to see whether a document behaves like a tagged PDF and how much semantic structure it exposes",
      "baseUrl": "https://elysiatools.com/mcp/sse?toolId=tagged-pdf-inspector",
      "command": "",
      "args": [],
      "env": {},
      "isActive": true,
      "type": "sse"
    }
  }
}

You can chain multiple tools, e.g.: `https://elysiatools.com/mcp/sse?toolId=png-to-webp,jpg-to-webp,gif-to-webp`, max 20 tools.
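As a small illustration of the chaining rule, the sketch below joins tool IDs into one SSE URL and enforces the 20-tool limit. The helper name `chained_mcp_url` is hypothetical; only the URL pattern and the limit come from the text above.

```python
MCP_BASE = "https://elysiatools.com/mcp/sse"
MAX_TOOLS = 20  # limit stated in the documentation above


def chained_mcp_url(tool_ids):
    """Join tool IDs into a single MCP SSE URL, enforcing the 20-tool limit."""
    if not tool_ids:
        raise ValueError("at least one tool ID is required")
    if len(tool_ids) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} tools can be chained")
    return f"{MCP_BASE}?toolId={','.join(tool_ids)}"


url = chained_mcp_url(["png-to-webp", "jpg-to-webp", "gif-to-webp"])
```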

Supports URL file links or Base64 encoding for file parameters.

If you encounter any issues, please contact us at [email protected]