Dataset Quality Profiler

Profile CSV or JSON datasets for missing values, duplicate rows, format drift, type inference, and numeric outliers

Paste a CSV dataset into "Dataset Input" or upload a CSV/JSON file. The profiler inspects each column and gives you a quick quality snapshot before the data moves into BI, ETL, or machine-learning steps.

What the tool checks:

  • Missing values per column
  • Duplicate rows, or duplicate combinations based on the columns you list in "Duplicate Key Columns"
  • Column type inference: number, boolean, date, string, or empty
  • Numeric outliers using an IQR-style rule
  • Format drift for string/date-like columns, such as mixed date styles or code-vs-free-text inconsistencies
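The IQR-style outlier rule can be sketched as follows. This is a minimal illustration of the general technique, assuming the common 1.5×IQR fence; the tool's exact thresholds and quantile method may differ.

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (illustrative sketch)."""
    xs = sorted(values)
    n = len(xs)
    if n < 4:
        return []  # too few points to estimate quartiles meaningfully

    def quantile(q):
        # Linear interpolation between closest ranks.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([120, 85, 85, 90, 110, 9999]))  # [9999]
```

With the sample amounts above, only the extreme value 9999 falls outside the fences, which matches the kind of anomaly the report highlights.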

How to fill the fields:

  • Dataset Input: paste CSV text directly when you want a quick profile
  • Dataset File: upload CSV or JSON if the dataset is larger or already saved locally
  • Duplicate Key Columns: optional comma-separated keys such as id,email to detect duplicates by business key instead of whole-row matching
  • Sample Rows: controls how many example rows appear in the report preview
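Duplicate detection by business key, as the "Duplicate Key Columns" field describes, can be sketched like this. The function and sample data are illustrative, not the tool's internals:

```python
import csv
import io
from collections import Counter

def duplicate_keys(csv_text, key_columns):
    """Return key tuples that appear more than once (illustrative sketch)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    return [k for k, count in Counter(keys).items() if count > 1]

sample = "id,name,amount\n1,Alice,120\n2,Bob,85\n2,Bob,85\n3,Charlie,9999\n"
print(duplicate_keys(sample, ["id"]))  # [('2',)]
```

Passing `["id", "email"]` instead would only flag rows where both values repeat together, which is the whole-key behaviour the field implies.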

How to read the report:

  • Quality score is a simple 0-100 summary where more missing cells, duplicate rows, and anomaly signals reduce the score
  • Missing shows how many blank/null cells were found in that column
  • Distinct shows how many unique values appear in the sampled dataset
  • Anomalies highlights numeric outliers
  • Format drift highlights columns where values look structurally inconsistent
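The page only says the score is a 0-100 summary reduced by missing cells, duplicate rows, and anomaly signals, so the weights in this sketch are assumptions for illustration, not the tool's actual formula:

```python
def quality_score(total_cells, missing_cells, total_rows,
                  duplicate_rows, anomaly_count):
    """Illustrative 0-100 score; all penalty weights are assumed."""
    score = 100.0
    # Penalise the share of missing cells and duplicate rows,
    # plus a flat deduction per anomaly signal.
    score -= 100.0 * missing_cells / max(total_cells, 1)
    score -= 100.0 * duplicate_rows / max(total_rows, 1)
    score -= 5.0 * anomaly_count
    return max(0.0, round(score, 1))

print(quality_score(total_cells=20, missing_cells=1,
                    total_rows=4, duplicate_rows=1, anomaly_count=1))  # 65.0
```

The point of the sketch is the shape of the signal: a clean dataset stays near 100, and each class of problem pulls the score down independently.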

Current scope:

  • CSV and JSON are supported
  • JSON should be an array of objects or an object with a rows array
  • The score is meant as a quick operational signal, not a formal data-governance grade
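The two accepted JSON shapes can be normalized to a single row list with a small helper. The normalization logic here is an assumption about how such input would be handled, written against the shapes the scope above describes:

```python
import json

def to_rows(text):
    """Accept an array of objects, or an object with a 'rows' array."""
    data = json.loads(text)
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and isinstance(data.get("rows"), list):
        return data["rows"]
    raise ValueError("expected an array of objects or an object with a 'rows' array")

print(to_rows('[{"id": 1}]'))            # [{'id': 1}]
print(to_rows('{"rows": [{"id": 2}]}'))  # [{'id': 2}]
```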

Example Results

Profile a transactional CSV before loading it into BI

Spot missing cells, outliers, duplicate records, and type drift before the dataset reaches dashboards.

Input parameters:

{
  "datasetInput": "id,name,email,amount,created_at\n1,Alice,[email protected],120,2026-03-01\n2,Bob,,85,2026-03-02\n2,Bob,[email protected],85,03/02/2026\n3,Charlie,[email protected],9999,2026-03-03",
  "datasetFile": "",
  "duplicateKeyColumns": "id",
  "sampleRows": 8
}

The output is a dataset quality report rendered as HTML.

Maximum file size: 15 MB. Supported formats: text/csv, application/json, text/plain.

Key Facts

Category
Data & Tables
Input Types
textarea, file, text, number
Output Type
html
Sample Coverage
4
API Ready
Yes

Overview

The Dataset Quality Profiler is a fast, browser-based tool that inspects CSV and JSON files to generate an instant data quality report. It automatically detects missing values, duplicate records, numeric outliers, and format drift across your columns. Use it to get a quick operational snapshot of your dataset's health before moving data into BI dashboards, ETL pipelines, or machine learning models.

When to Use

  • Before loading raw data into a database or BI tool to catch structural errors and missing values.
  • When auditing a new dataset from a third-party vendor or client to quickly assess data completeness and anomalies.
  • During data preparation for machine learning to identify numeric outliers and inconsistent data types.

How It Works

  • Paste your CSV data directly into the input field or upload a CSV or JSON file.
  • Optionally specify duplicate key columns (like 'id,email') to check for business-logic duplicates instead of exact row matches.
  • Adjust the sample rows setting to control how many example records appear in the final preview.
  • View the generated HTML report, which includes a 0-100 quality score, missing value counts, outlier detection, and format drift alerts.

Use Cases

Profiling transactional sales data to ensure no missing revenue values or duplicate order IDs before building financial dashboards.
Inspecting customer lead exports from marketing platforms to spot invalid email formats or empty contact fields.
Validating sensor or IoT data logs to quickly identify extreme numeric outliers and missing timestamps.

Examples

1. Profile a transactional CSV before loading it into BI

Data Analyst
Background
An analyst receives a weekly CSV export of customer transactions that needs to be visualized in a BI dashboard.
Problem
The raw export often contains duplicate transaction IDs, missing amounts, and mixed date formats that break the dashboard.
How to Use
Paste the CSV into the Dataset Input, set 'Duplicate Key Columns' to 'id', and generate the report.
Example Config
Duplicate Key Columns: id
Sample Rows: 8
Outcome
The report flags duplicate 'id' rows, highlights missing values in the 'amount' column, and detects format drift in the 'created_at' dates.

2. Auditing a JSON user dataset for anomalies

Data Engineer
Background
A data engineer is integrating a new JSON feed of user profiles from a third-party API.
Problem
The engineer needs to quickly verify if the API is sending complete records without extreme outliers in the 'age' or 'score' fields.
How to Use
Upload the JSON file via the 'Dataset File' input and review the Anomalies and Missing metrics in the generated report.
Outcome
The profiler assigns a quality score, identifies numeric outliers in the 'score' column using IQR, and confirms no missing values in critical fields.

FAQ

What file formats does the profiler support?

The tool supports CSV and JSON files. For JSON, the data should be formatted as an array of objects or an object containing a 'rows' array.

How is the overall quality score calculated?

The score is a 0-100 operational summary. It decreases based on the frequency of missing cells, duplicate rows, format drift, and numeric anomalies found in the dataset.

Can I check for duplicates using specific columns?

Yes. By entering comma-separated column names in the 'Duplicate Key Columns' field (e.g., 'id,email'), the tool will flag duplicate combinations based only on those business keys.

What does 'format drift' mean in the report?

Format drift highlights columns where the data structure is inconsistent, such as mixing different date formats or combining numeric codes with free-text strings.

How does the tool detect numeric outliers?

The profiler uses an Interquartile Range (IQR) style rule to identify and flag numeric values that fall significantly outside the normal distribution of a column.

API Documentation

Request Endpoint

POST /en/api/tools/dataset-quality-profiler

Request Parameters

Parameter Name       Type      Required              Description
datasetInput         textarea  No                    CSV text pasted directly
datasetFile          file      No (upload required)  Uploaded CSV or JSON file
duplicateKeyColumns  text      No                    Comma-separated business-key columns
sampleRows           number    No                    Number of example rows in the report preview

File parameters must be uploaded first via POST /upload/dataset-quality-profiler, which returns a filePath; pass that filePath in the corresponding file field.
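A request to the documented endpoint can be built as below using only the standard library. The full host URL is an assumption based on this page (the docs give only the path), and the response handling in the comment follows the response format described here rather than a tested client:

```python
import json
import urllib.request

# Assumed full URL; the docs specify only the path portion.
ENDPOINT = "https://elysiatools.com/en/api/tools/dataset-quality-profiler"

def build_request(dataset_input, duplicate_key_columns="", sample_rows=8):
    """Build a POST request with the documented JSON parameters."""
    payload = {
        "datasetInput": dataset_input,
        "duplicateKeyColumns": duplicate_key_columns,
        "sampleRows": sample_rows,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("id,amount\n1,120\n2,\n2,85\n", duplicate_key_columns="id")
# To send: urllib.request.urlopen(req), json.loads(...) the body,
# then read the "result" field for the HTML report.
print(req.method, req.full_url)
```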

Response Format

{
  "result": "Processed HTML content",
  "error": "Error message (optional)",
  "message": "Notification message (optional)",
  "metadata": { "key": "value" }
}

The `result` field contains the rendered HTML report.

AI MCP Documentation

Add this tool to your MCP server configuration:

{
  "mcpServers": {
    "elysiatools-dataset-quality-profiler": {
      "name": "dataset-quality-profiler",
      "description": "Profile CSV or JSON datasets for missing values, duplicate rows, format drift, type inference, and numeric outliers",
      "baseUrl": "https://elysiatools.com/mcp/sse?toolId=dataset-quality-profiler",
      "command": "",
      "args": [],
      "env": {},
      "isActive": true,
      "type": "sse"
    }
  }
}

You can chain multiple tools, e.g.: `https://elysiatools.com/mcp/sse?toolId=png-to-webp,jpg-to-webp,gif-to-webp`, max 20 tools.

File parameters also accept URL links or Base64-encoded content.

If you encounter any issues, please contact us at [email protected]