DAVA Dedup

Find duplicate rows in a messy dataset. Pick the columns that should be unique; Dedup groups likely-duplicate rows into clusters with a suggested canonical row per group.

Quickstart (Python)

```bash
pip install dava-dedup
```
```python
import asyncio
from dava_dedup import Client

async def main():
    async with Client(api_key="dava_live_…") as c:
        with open("customers.csv", "rb") as f:
            data = f.read()
        cols = await c.list_columns("customers.csv", data)
        keys = [name for name in cols if "email" in name.lower()]
        result = await c.find_duplicates("customers.csv", data, key_columns=keys)
        print(f"{result.duplicate_rows}/{result.rows_total} rows "
              f"in {len(result.clusters)} clusters")
        for cluster in result.clusters:
            canonical = next(m for m in cluster.members if m.is_canonical)
            print(f"  cluster {cluster.cluster_id}: keep row {canonical.row_index}")

asyncio.run(main())
```

Quickstart (TypeScript)

```typescript
import { readFile } from "node:fs/promises";
import { Client } from "@avaresearch/dava-dedup";

const c = new Client({ apiKey: process.env.DAVA_API_KEY! });
const file = await readFile("customers.csv");
const cols = await c.listColumns("customers.csv", file);
const result = await c.findDuplicates("customers.csv", file, ["email"]);
```

How clustering works

For each row, Dedup builds a normalized key from the chosen key columns (lowercased, whitespace-collapsed, non-alphanumerics stripped for text fields). Rows that share a key form a cluster. The canonical row is the one with the fewest empty cells; ties are broken by the lowest row index.
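The rules above can be sketched in plain Python. This is not the SDK's implementation, just an illustration of the described behavior; the function names and the choice to strip non-alphanumerics from every key column (the service does this only for text fields) are assumptions:

```python
import re
from collections import defaultdict

def normalize_key(row: dict, key_columns: list[str]) -> str:
    """Build a cluster key: lowercase, collapse whitespace, then strip
    non-alphanumerics from each key column's value."""
    parts = []
    for col in key_columns:
        value = str(row.get(col, "")).lower()
        value = re.sub(r"\s+", " ", value).strip()
        value = re.sub(r"[^a-z0-9 ]", "", value)
        parts.append(value)
    return "|".join(parts)

def cluster_rows(rows: list[dict], key_columns: list[str]) -> list[dict]:
    """Group rows by normalized key; the canonical row is the member with
    the fewest empty cells, ties broken by lowest row index."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[normalize_key(row, key_columns)].append((idx, row))
    clusters = []
    for key, members in groups.items():
        if len(members) < 2:
            continue  # a singleton is not a duplicate cluster
        canonical_idx, _ = min(
            members,
            key=lambda m: (sum(1 for v in m[1].values() if not str(v).strip()), m[0]),
        )
        clusters.append({
            "key": key,
            "rows": [i for i, _ in members],
            "canonical": canonical_idx,
        })
    return clusters
```

For example, `"A@x.com "` and `"a@x.com"` both normalize to `axcom`, so those rows land in the same cluster.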

The full Dedup pipeline (planned for a later release) adds blocking, embedding-based fuzzy matching, and survivorship rules. The current SDK and API surface expose the deterministic preview.

API surface

| Method | Endpoint | Purpose |
| --- | --- | --- |
| `POST` | `/v1/dedup/columns` | Return the header row of an upload so the UI can render a key picker. |
| `POST` | `/v1/dedup/preview` | Find clusters keyed on the chosen columns. Form fields: `upload` (file), `key_columns` (JSON-encoded array). |
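Calling the preview endpoint directly (outside the SDK) mostly means getting the multipart form right: `key_columns` is a single JSON-encoded field, not a repeated form field. A minimal sketch; the base URL and the bearer-token auth scheme are assumptions, not documented here:

```python
import json

API_BASE = "https://api.example.com"  # placeholder — substitute the real DAVA host

def build_preview_request(filename: str, content: bytes, key_columns: list[str]):
    """Assemble the pieces of a POST /v1/dedup/preview call.

    Returns (url, files, data) in the shape expected by e.g.
    requests.post(url, files=files, data=data).
    """
    url = f"{API_BASE}/v1/dedup/preview"
    files = {"upload": (filename, content, "text/csv")}
    data = {"key_columns": json.dumps(key_columns)}  # JSON-encoded array, per the table above
    return url, files, data

# Hypothetical usage with the requests library (auth scheme assumed):
#   url, files, data = build_preview_request("customers.csv", raw_bytes, ["email"])
#   resp = requests.post(url, files=files, data=data,
#                        headers={"Authorization": f"Bearer {api_key}"})
```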