DAVA Dedup.

Find duplicate rows in a messy dataset. Pick the columns that should be unique; Dedup groups likely-duplicate rows into clusters with a suggested canonical row per group.

Dedup runs on the Norm pipeline. After cleaning, rows with the same normalized key cluster together. The canonical row is the one with the fewest empty cells; ties go to the lowest row index.
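Conceptually, the pass reduces to a short loop: normalize the key columns, hash them, group by hash, then pick the fullest row. A minimal sketch in Python, assuming dict-shaped rows; the normalize rules and the SHA-256 digest here are illustrative stand-ins, not the shipped Norm pipeline:

import hashlib
from collections import defaultdict

def normalize(value: str) -> str:
    # Illustrative cleaning: trim, collapse whitespace, lowercase.
    return " ".join(value.split()).lower()

def empty_cells(row: dict) -> int:
    return sum(1 for v in row.values() if v in (None, ""))

def cluster_and_pick(rows: list[dict], key_columns: list[str]) -> dict[str, dict]:
    """Group rows by a normalized key hash; pick one canonical row per cluster."""
    clusters: defaultdict = defaultdict(list)
    for idx, row in enumerate(rows):
        key = "|".join(normalize(str(row.get(col, ""))) for col in key_columns)
        clusters[hashlib.sha256(key.encode()).hexdigest()].append((idx, row))
    # Canonical row: fewest empty cells; ties go to the lowest row index.
    return {
        digest: min(members, key=lambda m: (empty_cells(m[1]), m[0]))[1]
        for digest, members in clusters.items()
    }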

Product · 02

What Dedup ships with.

Each capability runs in the shared engine — the Norm pipeline, the Trust audit chain, the Decisioning mode toggle. Same substrate as the other four products.

01

Pick your unique key

Choose any subset of columns. Email, customer ID, full name + DOB: whatever the business says is the dedup key. Dedup builds a normalized hash from those columns.

02

Canonical row selection

Per cluster, the row with the fewest empty cells wins. You see the selection rationale alongside the cluster, so reviewers can override.

03

Audit trail per cluster

Every cluster decision writes a chain block. Auditors get a tamper-evident record of which rows were merged, when, and why; the chaining idea is sketched just after this list.

04

Reviewer queue (optional)

In Decisioning mode, low-confidence clusters route to a human reviewer before merging. Staff can approve, reject, or split clusters. All actions audited.
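A chain block is simple to picture: each record folds the previous block's hash into its own. A minimal sketch, assuming a JSON payload and SHA-256 linking; the real Trust chain format is not shown here:

import hashlib
import json
import time

def append_chain_block(chain: list[dict], decision: dict) -> dict:
    """Append a tamper-evident block recording one cluster decision."""
    # A genesis block links to a zero hash; later blocks link to their parent.
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = {
        "decision": decision,      # which rows merged, and why
        "timestamp": time.time(),  # when
        "prev_hash": prev_hash,    # link to the prior block
    }
    # Hash a canonical serialization so the same payload always hashes alike.
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    block = {**payload, "hash": digest}
    chain.append(block)
    return block

Verification is the same walk in reverse: recompute each hash and compare; any edited block breaks every link after it.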

SDK · python

Wire it in under a minute.

The Python SDK is the most mature. TypeScript follows the same shape. Both ship with strict types and async-first APIs.

Install
pip install dava-dedup
main.py
import asyncio
from dava_dedup import Client

async def main():
    async with Client(api_key="dava_live_…") as c:
        with open("customers.csv", "rb") as f:
            data = f.read()
        # Fetch the header row so we can pick key columns by name.
        cols = await c.list_columns("customers.csv", data)
        keys = [name for name in cols if "email" in name.lower()]
        # Cluster rows on the chosen key columns.
        result = await c.find_duplicates("customers.csv", data, key_columns=keys)
        print(f"{result.duplicate_rows}/{result.rows_total} rows "
              f"in {len(result.clusters)} clusters")

asyncio.run(main())
API surface

Endpoints you'll actually hit.

Same auth as the rest of DAVA: a bearer API key in the Authorization: Bearer dava_live_… header, or session cookie + CSRF for browser flows.

Method  Endpoint            Purpose
POST    /v1/dedup/columns   Return the header row of an upload so the UI can render a key picker.
POST    /v1/dedup/preview   Find clusters keyed on the chosen columns. Form fields: upload (file), key_columns (JSON-encoded array).
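Without the SDK, the preview endpoint is one multipart POST. A sketch using the requests library; the base URL is a placeholder and the response is assumed to be JSON:

import json
import requests

API_BASE = "https://api.example.com"  # placeholder; substitute your DAVA base URL

with open("customers.csv", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/v1/dedup/preview",
        headers={"Authorization": "Bearer dava_live_…"},  # same bearer key as the SDK
        files={"upload": ("customers.csv", f, "text/csv")},
        data={"key_columns": json.dumps(["email"])},  # JSON-encoded array
    )
resp.raise_for_status()
print(resp.json())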

Bring us a dedup problem we can prove.

We'll run your hardest dataset through DAVA Dedup during a 5-day pilot. You keep the cleaned output and the evidence pack either way.