# DAVA Dedup
Find duplicate rows in a messy dataset. Pick the columns that should be unique; Dedup groups likely-duplicate rows into clusters with a suggested canonical row per group.
## Quickstart (Python)

```bash
pip install dava-dedup
```

```python
import asyncio

from dava_dedup import Client


async def main():
    async with Client(api_key="dava_live_…") as c:
        with open("customers.csv", "rb") as f:
            data = f.read()

        cols = await c.list_columns("customers.csv", data)
        keys = [name for name in cols if "email" in name.lower()]

        result = await c.find_duplicates("customers.csv", data, key_columns=keys)
        print(f"{result.duplicate_rows}/{result.rows_total} rows "
              f"in {len(result.clusters)} clusters")
        for cluster in result.clusters:
            canonical = next(m for m in cluster.members if m.is_canonical)
            print(f"  cluster {cluster.cluster_id}: keep row {canonical.row_index}")


asyncio.run(main())
```

## Quickstart (TypeScript)
```typescript
import { Client } from "@avaresearch/dava-dedup";

const c = new Client({ apiKey: process.env.DAVA_API_KEY! });

const cols = await c.listColumns("customers.csv", file);
const result = await c.findDuplicates("customers.csv", file, ["email"]);
```

## How clustering works
For each row, Dedup builds a normalized key from the chosen key columns (lowercased, whitespace collapsed, non-alphanumerics stripped for text fields). Rows with the same key form a cluster. The canonical row is the one with the fewest empty cells; ties are broken by the lowest row index.
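The deterministic preview above can be sketched in a few lines. This is an illustration, not the SDK's implementation: `normalize_key` and `cluster_rows` are hypothetical names, all cells are assumed to be strings, and singleton groups are skipped here on the assumption that a "cluster" means two or more matching rows.

```python
import re
from collections import defaultdict


def normalize_key(values):
    """Lowercase, strip non-alphanumerics, collapse whitespace, join columns."""
    parts = []
    for v in values:
        v = re.sub(r"[^a-z0-9 ]+", "", (v or "").lower())
        parts.append(re.sub(r"\s+", " ", v).strip())
    return "|".join(parts)


def cluster_rows(rows, key_indices):
    """Group row indices by normalized key; the canonical row has the
    fewest empty cells, with ties broken by the lowest row index."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[normalize_key([row[j] for j in key_indices])].append(i)

    clusters = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a row with a unique key is not a duplicate
        canonical = min(
            members,
            key=lambda i: (sum(1 for cell in rows[i] if not cell.strip()), i),
        )
        clusters.append({"members": members, "canonical": canonical})
    return clusters


rows = [
    ["Ann Lee", "ANN@x.com", ""],      # duplicate of row 1 on email
    ["ann lee", " ann@x.com ", "NY"],  # fewest empty cells -> canonical
    ["Bob", "bob@x.com", "LA"],        # unique key, no cluster
]
print(cluster_rows(rows, [1]))  # → [{'members': [0, 1], 'canonical': 1}]
```

Normalizing before grouping is what makes `"ANN@x.com"` and `" ann@x.com "` collide; without it, every formatting variant would land in its own cluster.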
The full Dedup pipeline (later release) adds blocking, embedding-based fuzzy match, and survivorship rules. The current SDK + API surface is the deterministic preview.
## API surface
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /v1/dedup/columns | Return the header row of an upload so the UI can render a key picker. |
| POST | /v1/dedup/preview | Find clusters keyed on the chosen columns. Form fields: upload (file), key_columns (JSON-encoded array). |
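One detail worth calling out: `key_columns` is a single JSON-encoded form field, not a repeated field, which is easy to get wrong with multipart helpers. A minimal sketch of preparing the two fields for `POST /v1/dedup/preview` (field names from the table above; the helper name and CSV content are illustrative):

```python
import json


def preview_fields(filename, data, key_columns):
    """Multipart parts for POST /v1/dedup/preview: "upload" carries the
    file; "key_columns" is a JSON-encoded array in a single form field."""
    files = {"upload": (filename, data, "text/csv")}
    fields = {"key_columns": json.dumps(key_columns)}
    return files, fields


files, fields = preview_fields("customers.csv", b"email\nann@x.com\n", ["email"])
print(fields["key_columns"])  # → ["email"]
```

The same shape works with any HTTP client that accepts separate file and form-field mappings.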