Key Facts
- Category: Developer & Web
- Input Types: file, checkbox, text, select
- Output Type: html
- Sample Coverage: 4
- API Ready: Yes
Overview
The PDF to JSON Structure Explorer allows developers and data engineers to extract OpenDataLoader JSON from PDF documents and visualize the semantic structure in an interactive HTML view. By rendering headings, paragraphs, tables, lists, and bounding boxes, this tool makes it easy to debug parser outputs, verify page metadata, and inspect the exact hierarchy of extracted document elements without manually reading raw JSON.
When to Use
- When you need to debug the heading hierarchy and semantic parsing of a complex PDF document.
- When verifying if tables and lists are correctly identified and extracted by the OpenDataLoader parser.
- When inspecting bounding box coordinates and page metadata for specific text nodes within a PDF.
How It Works
- Upload a PDF file to start the OpenDataLoader JSON extraction.
- Optionally specify page ranges, toggle use of the PDF's structure tree, or apply a node filter to isolate headings, tables, or lists.
- Optionally enter a search term to locate specific content, and enable sensitive data sanitization if required.
- Open the generated HTML explorer report to browse the extracted semantic nodes, page metadata, and JSON previews interactively.
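The options in these steps can be collected into a small config object of the same shape as the example configs on this page. The following is a minimal sketch, not the tool's documented API: the keys `useStructTree`, `nodeFilter`, and `pages` are taken from those examples, while `searchTerm` and `sanitize` are hypothetical names used only for illustration, and the default of `True` for the struct tree option is an assumption.

```python
import json
from typing import Optional


def build_config(
    pages: Optional[str] = None,
    use_struct_tree: bool = True,       # assumed default; not confirmed by the tool
    node_filter: Optional[str] = None,
    search_term: Optional[str] = None,  # hypothetical option name
    sanitize: bool = False,             # hypothetical option name
) -> dict:
    """Assemble explorer options into a config dict, omitting unset values."""
    config = {"useStructTree": use_struct_tree}
    if pages:
        config["pages"] = pages
    if node_filter:
        config["nodeFilter"] = node_filter
    if search_term:
        config["searchTerm"] = search_term  # hypothetical key
    if sanitize:
        config["sanitize"] = True           # hypothetical key
    return config


# Serialize the options for a table check on pages 12 to 15.
print(json.dumps(build_config(pages="12-15", node_filter="table")))
```

Leaving unset options out of the dict keeps the config close to the short examples shown for each use case.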
Use Cases
Examples
1. Exploring a Brand Guidelines PDF
Data Engineer
- Background: A data engineer is building an ingestion pipeline for corporate brand guidelines and needs to ensure the parser correctly identifies section headers.
- Problem: Reading raw JSON output to verify heading hierarchies is tedious and error-prone.
- How to Use: Upload the brand guidelines PDF, leave 'Use Struct Tree' enabled, and set the Node Filter to 'Headings only'.
- Example Config: { "useStructTree": true, "nodeFilter": "heading" }
- Outcome: An HTML explorer view is generated, displaying only the heading nodes, allowing quick verification of the document's structural hierarchy.
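For pipelines that need the same heading check outside the browser, the idea can be approximated in a few lines of Python. This is a sketch under stated assumptions: the node shape used here (`type`, `level`, `content`, and nested `kids`) is hypothetical and should be compared against the actual extracted JSON before use.

```python
from typing import Iterator, List


def iter_headings(node: dict) -> Iterator[dict]:
    """Yield heading nodes in document order from a nested node tree."""
    # Assumed (hypothetical) node shape:
    # {"type": str, "level": int, "content": str, "kids": [child nodes]}
    if node.get("type") == "heading":
        yield node
    for kid in node.get("kids", []):
        yield from iter_headings(kid)


def outline(root: dict) -> List[str]:
    """Render headings as an outline, indented two spaces per level."""
    return [
        "  " * (h.get("level", 1) - 1) + h.get("content", "")
        for h in iter_headings(root)
    ]


# Made-up sample data for illustration only.
sample = {
    "type": "document",
    "kids": [
        {"type": "heading", "level": 1, "content": "Brand Guidelines"},
        {"type": "paragraph", "content": "Intro text."},
        {"type": "heading", "level": 2, "content": "Logo Usage"},
        {"type": "heading", "level": 2, "content": "Color Palette"},
    ],
}

for line in outline(sample):
    print(line)
```

A heading that appears at an unexpected indent level in the outline is a quick signal that the parser misread the hierarchy.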
2. Verifying Table Extraction in Financial Reports
Financial Analyst
- Background: An analyst needs to extract quarterly earnings tables from a 50-page PDF report.
- Problem: It is unclear whether the parser correctly identifies the complex financial tables on specific pages.
- How to Use: Upload the PDF, specify the exact pages containing the tables (e.g., '12-15'), and set the Node Filter to 'Tables only'.
- Example Config: { "pages": "12-15", "nodeFilter": "table" }
- Outcome: The explorer view isolates and displays only the table nodes from pages 12 to 15, so the analyst can check whether the tables were extracted accurately.
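A similar check can be scripted as a per-page table count, which makes pages with no detected table stand out. The sketch below assumes a flat list of nodes with hypothetical `type` and `page` fields; the real field names in the extracted JSON may differ.

```python
from typing import Dict, List


def tables_per_page(nodes: List[dict], first: int, last: int) -> Dict[int, int]:
    """Count table nodes on each page in the inclusive range first..last."""
    counts = {page: 0 for page in range(first, last + 1)}
    for node in nodes:
        page = node.get("page")  # hypothetical field name
        if node.get("type") == "table" and page in counts:
            counts[page] += 1
    return counts


# Made-up sample data for illustration only.
nodes = [
    {"type": "table", "page": 12},
    {"type": "table", "page": 13},
    {"type": "table", "page": 13},
    {"type": "paragraph", "page": 14},
    {"type": "table", "page": 15},
    {"type": "table", "page": 20},  # outside the requested range
]

counts = tables_per_page(nodes, 12, 15)
missing = [page for page, n in counts.items() if n == 0]
print(counts)
print("pages without tables:", missing)
```

Here page 14 is reported as having no table, which would prompt a closer look at that page in the explorer view.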
FAQ
What formats are supported for upload?
The tool accepts only PDF files.
Can I filter the output to only show tables?
Yes, you can use the Node Filter option to display only tables, headings, or lists.
What is the 'Use Struct Tree' option?
It tells the parser to utilize the PDF's internal structural tags, if available, to improve the accuracy of semantic extraction.
Can I extract data from specific pages only?
Yes, you can input a page range like '1,3,5-7' in the Pages field to limit the extraction to those specific pages.
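To illustrate how a value such as '1,3,5-7' expands, here is a minimal sketch of one possible reading of that syntax. It assumes 1-based page numbers and inclusive ranges; the tool's own parsing rules are not documented here, and `parse_pages` is a hypothetical helper, not part of the tool.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec like '1,3,5-7' into a sorted list of unique pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```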
Does this tool output raw JSON?
Not directly. The tool generates an interactive HTML explorer view that visualizes the semantic nodes and includes previews of the underlying JSON data.