PDF RAG Chunker & Citation Pack

Convert a PDF into heading-aware RAG chunks with page numbers, bounding boxes, and citation metadata

Upload a PDF to generate retrieval-friendly chunks with page references, heading paths, and bounding boxes. The output is a JSON pack that works well for vector stores, answer citation, and PDF-grounded chat systems.

Example Results

Prepare a financial report for RAG ingestion

Create chunks with page numbers and bounding boxes so answers can cite the original PDF precisely.

Output file: pdf-rag-chunker-citation-pack-example1.json
Input parameters:
{ "pdfFile": "/public/samples/pdf/financial-report-example1.pdf", "chunkMode": "heading-aware", "maxChars": 900, "useStructTree": true, "sanitizeSensitiveData": false, "includeTableNodes": true }

Maximum file size: 10 MB
Supported formats: application/pdf

Key Facts

Category: AI & Generators
Input Types: file, select, number, checkbox
Output Type: file
Sample Coverage: 4
API Ready: Yes

Overview

The PDF RAG Chunker & Citation Pack prepares PDF documents for Retrieval-Augmented Generation (RAG) systems. Upload a PDF and the tool generates a structured JSON file of retrieval-friendly text chunks, each enriched with page numbers, bounding boxes, and heading paths. It is aimed at developers building PDF-grounded chat applications who need chunks that can be cited back to the source and loaded into a vector database.

When to Use

  • Preparing PDF documents for ingestion into vector databases for semantic search.
  • Building PDF-grounded AI chat systems that require precise source citations and bounding box highlights.
  • Extracting structured, heading-aware text chunks from complex reports while preserving document hierarchy.

How It Works

  • Upload your target PDF file into the tool.
  • Select your preferred chunking mode, such as heading-aware or element-per-chunk, and set the maximum character limit.
  • Toggle advanced options like structural tree usage, sensitive data sanitization, or table inclusion based on your needs.
  • Download the generated JSON pack containing the text chunks, page references, and bounding box metadata ready for your RAG pipeline.
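The options chosen in these steps map onto the parameter names shown in the sample input. A minimal Python sketch of assembling and checking them before submission follows. The helper name `build_options` and its defaults are hypothetical; the two mode names and the 200 to 4000 range are taken from this page, and the tool may accept other modes.

```python
from typing import Any

# Mode names and the 200-4000 range come from this page; the tool may accept
# other modes. Defaults mirror the sample configuration and are assumptions.
CHUNK_MODES = {"heading-aware", "element-per-chunk"}
MIN_CHARS, MAX_CHARS = 200, 4000


def build_options(
    pdf_file: str,
    chunk_mode: str = "heading-aware",
    max_chars: int = 900,
    use_struct_tree: bool = True,
    sanitize_sensitive_data: bool = False,
    include_table_nodes: bool = True,
) -> dict[str, Any]:
    """Return a parameter dict shaped like the sample input on this page."""
    if chunk_mode not in CHUNK_MODES:
        raise ValueError(f"unknown chunkMode: {chunk_mode!r}")
    if not MIN_CHARS <= max_chars <= MAX_CHARS:
        raise ValueError(f"maxChars must be between {MIN_CHARS} and {MAX_CHARS}")
    return {
        "pdfFile": pdf_file,
        "chunkMode": chunk_mode,
        "maxChars": max_chars,
        "useStructTree": use_struct_tree,
        "sanitizeSensitiveData": sanitize_sensitive_data,
        "includeTableNodes": include_table_nodes,
    }
```

Validating locally is a convenience only; the service presumably applies its own checks.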

Use Cases

  • Ingesting financial reports and earnings call transcripts into a vector store for an AI financial analyst assistant.
  • Processing legal contracts to build a semantic search tool that links directly back to specific clauses in the original document.
  • Chunking technical manuals and product documentation to power a citation-backed customer support chatbot.

Examples

1. Prepare a financial report for RAG ingestion

AI Engineer
Background
An AI engineer is building a financial chatbot that answers questions based on quarterly earnings reports.
Problem
The chatbot needs to provide accurate answers and cite the exact page and paragraph in the original PDF to build user trust.
How to Use
Upload the financial report PDF, select 'Heading-aware' chunk mode, set the max characters to 900, and ensure 'Include Table Nodes' is checked.
Example Config
{
  "chunkMode": "heading-aware",
  "maxChars": 900,
  "useStructTree": true,
  "includeTableNodes": true
}
Outcome
A JSON file is generated containing text chunks grouped by financial headings, complete with page numbers and bounding boxes for precise frontend highlighting.

2. Chunking legal contracts with sensitive data sanitization

Legal Tech Developer
Background
A developer is creating a semantic search engine for a law firm's internal contract repository.
Problem
Contracts need to be broken down into granular, searchable elements while automatically redacting sensitive information before vectorization.
How to Use
Upload the contract PDF, choose 'Element per chunk' mode to isolate individual clauses, and enable the 'Sanitize Sensitive Data' option.
Example Config
{
  "chunkMode": "element-per-chunk",
  "sanitizeSensitiveData": true
}
Outcome
The tool outputs a JSON file where every clause is a distinct chunk with sanitized text, ready for secure vector database embedding.

FAQ

What format does this tool output?

The tool outputs a structured JSON file containing the text chunks along with their corresponding metadata, such as page numbers, heading paths, and bounding boxes.
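To illustrate how such metadata might be consumed, here is a small Python sketch that turns one chunk into a readable citation. The record layout and field names (`chunks`, `text`, `page`, `headingPath`, `bbox`) are assumptions made for illustration only; this page does not document the actual schema, so check a real output file before relying on them.

```python
import json

# Hypothetical chunk record: the field names below are assumptions for
# illustration, not the tool's documented schema.
SAMPLE_PACK = json.loads("""
{
  "chunks": [
    {
      "text": "Example chunk text.",
      "page": 4,
      "headingPath": ["Results", "Revenue"],
      "bbox": [72.0, 540.0, 523.0, 580.0]
    }
  ]
}
""")


def format_citation(chunk, source_name):
    """Build a human-readable citation from one chunk's metadata."""
    path = " > ".join(chunk["headingPath"])
    return f"{source_name}, p. {chunk['page']} ({path})"
```

A chat frontend could append a string like this to each answer and keep the bounding box alongside it for highlighting.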

What is the difference between heading-aware and element-per-chunk modes?

Heading-aware mode groups content under its respective section titles up to the maximum character limit, while element-per-chunk treats every individual paragraph, list, or table as a separate, isolated chunk.
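The difference can be sketched in a few lines of Python. This is an illustrative approximation of the two strategies as described above, not the tool's actual implementation; the `Element` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Element:
    heading: str  # section title the element sits under
    text: str     # text of one paragraph, list, or table


def element_per_chunk(elements):
    """Every element becomes its own chunk."""
    return [{"heading": e.heading, "text": e.text} for e in elements]


def heading_aware(elements, max_chars):
    """Merge consecutive elements under the same heading up to max_chars."""
    chunks = []
    for e in elements:
        last = chunks[-1] if chunks else None
        fits = (
            last is not None
            and last["heading"] == e.heading
            # +1 accounts for the newline used to join the two texts
            and len(last["text"]) + 1 + len(e.text) <= max_chars
        )
        if fits:
            last["text"] += "\n" + e.text
        else:
            chunks.append({"heading": e.heading, "text": e.text})
    return chunks
```

In this sketch a single element longer than `max_chars` is kept whole as one oversized chunk; how the tool itself handles that case is not stated on this page.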

Can I control the size of the generated chunks?

Yes, you can set a maximum character limit per chunk, ranging from 200 to 4000 characters, to optimize retrieval performance for your specific vector store.

Does the tool extract tables from the PDF?

Yes, as long as the 'Include Table Nodes' option is enabled, the tool will extract tables and include them in the generated RAG chunks.

What are bounding boxes used for in the output?

Bounding boxes provide the exact spatial coordinates of the text on the original PDF page, allowing frontend applications to visually highlight the cited source text for users.
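As a rough illustration of the highlighting step, the Python sketch below converts a box into pixel coordinates for an overlay on a rendered page. It assumes the box is given as `[x0, y0, x1, y1]` in PDF points with the origin at the bottom-left of the page, which is the usual PDF convention; this page does not confirm the tool's coordinate format, so verify it against real output. The function name is hypothetical.

```python
def bbox_to_pixels(bbox, page_height_pt, scale):
    """Convert a bottom-left-origin PDF box to a top-left-origin pixel rect.

    bbox is assumed to be [x0, y0, x1, y1] in PDF points (an assumption, not
    a documented format). scale is rendered pixels per PDF point.
    """
    x0, y0, x1, y1 = bbox
    return {
        "left": x0 * scale,
        # Flip the vertical axis: the top edge is measured down from the page top.
        "top": (page_height_pt - y1) * scale,
        "width": (x1 - x0) * scale,
        "height": (y1 - y0) * scale,
    }
```

For a US Letter page (792 points tall) rendered at 2 pixels per point, a box of `[72, 700, 300, 720]` would map to a 456 by 40 pixel rectangle placed 144 pixels from the left and 144 pixels from the top.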

API Documentation

Request Endpoint

POST /en/api/tools/pdf-rag-chunker-citation-pack

Request Parameters

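Parameter Name | Type | Required | Description
pdfFile | file (upload required) | Yes | filePath returned by the upload endpoint
chunkMode | select | No | Chunking mode, e.g. heading-aware or element-per-chunk
maxChars | number | No | Maximum characters per chunk (200 to 4000)
useStructTree | checkbox | No | Use the PDF's structure tree
sanitizeSensitiveData | checkbox | No | Sanitize sensitive data in the chunk text
includeTableNodes | checkbox | No | Include tables in the generated chunks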

File parameters must be uploaded first via POST /upload/pdf-rag-chunker-citation-pack, which returns a filePath. Pass that filePath in the corresponding file field of the tool request.
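The two-step flow can be sketched in Python as follows. To avoid guessing at details this page does not specify (host, multipart field name, headers), the HTTP calls are passed in as functions; `run_chunker`, `upload`, and `post_json` are hypothetical names. The sketch also assumes the upload response is a JSON object with a `filePath` key and that a non-empty `error` field signals failure.

```python
from typing import Any, Callable

# Paths as given on this page.
UPLOAD_PATH = "/upload/pdf-rag-chunker-citation-pack"
TOOL_PATH = "/en/api/tools/pdf-rag-chunker-citation-pack"


def run_chunker(
    upload: Callable[[str, bytes], dict[str, Any]],
    post_json: Callable[[str, dict[str, Any]], dict[str, Any]],
    pdf_bytes: bytes,
    options: dict[str, Any],
) -> dict[str, Any]:
    """Upload the PDF, then call the tool with the returned filePath."""
    uploaded = upload(UPLOAD_PATH, pdf_bytes)
    # Assumption: the upload response carries the path under "filePath".
    body = {**options, "pdfFile": uploaded["filePath"]}
    result = post_json(TOOL_PATH, body)
    # Assumption: a non-empty "error" field means the request failed.
    if result.get("error"):
        raise RuntimeError(result["error"])
    return result
```

Injecting the transport keeps the sequencing logic testable without network access; in practice `upload` and `post_json` would wrap whichever HTTP client you already use.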

Response Format

{
  "filePath": "/public/processing/randomid.ext",
  "fileName": "output.ext",
  "contentType": "application/octet-stream",
  "size": 1024,
  "metadata": {
    "key": "value"
  },
  "error": "Error message (optional)",
  "message": "Notification message (optional)"
}

AI MCP Documentation

Add this tool to your MCP server configuration:

{
  "mcpServers": {
    "elysiatools-pdf-rag-chunker-citation-pack": {
      "name": "pdf-rag-chunker-citation-pack",
      "description": "Convert a PDF into heading-aware RAG chunks with page numbers, bounding boxes, and citation metadata",
      "baseUrl": "https://elysiatools.com/mcp/sse?toolId=pdf-rag-chunker-citation-pack",
      "command": "",
      "args": [],
      "env": {},
      "isActive": true,
      "type": "sse"
    }
  }
}

You can chain multiple tools, e.g.: `https://elysiatools.com/mcp/sse?toolId=png-to-webp,jpg-to-webp,gif-to-webp`, max 20 tools.
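A small Python helper for building such a URL might look like this. The function name `mcp_url` is hypothetical; the base URL, the comma-separated `toolId` format, and the limit of 20 come from this page.

```python
MCP_BASE = "https://elysiatools.com/mcp/sse"
MAX_TOOLS = 20  # limit stated on this page


def mcp_url(tool_ids):
    """Build an MCP SSE URL that chains one or more tools."""
    if not tool_ids:
        raise ValueError("at least one tool id is required")
    if len(tool_ids) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} tools can be chained")
    # Tool ids are joined with literal commas, matching the example URL.
    return f"{MCP_BASE}?toolId={','.join(tool_ids)}"
```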

Supports URL file links or Base64 encoding for file parameters.

If you encounter any issues, please contact us at [email protected]