DAVA Connect

Discover relationships across datasets. Heuristic structural matches (FK candidates, value overlap, name match) plus an opt-in LLM second pass for semantic links.

Quickstart (Python)

```bash
pip install dava-connect
```

```python
import asyncio
from dava_connect import Client

async def main():
    async with Client(api_key="dava_live_…") as c:
        a = await c.upload_file(
            "customers.csv",
            b"customer_id,email\n1,a@b.c\n2,c@d.e\n",
        )
        b = await c.upload_file(
            "orders.csv",
            b"order_id,customer_id\n10,1\n11,2\n",
        )
        result = await c.discover([a.file_id, b.file_id])
        for edge in result.edges:
            print(
                f"{edge.source_column} ↔ {edge.target_column} "
                f"({edge.kind}, {edge.confidence:.0%})"
            )

asyncio.run(main())
```

Quickstart (TypeScript)

```bash
npm install @avaresearch/dava-connect
```

```typescript
import { Client } from "@avaresearch/dava-connect";
import { readFile } from "node:fs/promises";

const c = new Client({ apiKey: process.env.DAVA_API_KEY! });

const customers = await c.uploadFile("customers.csv", await readFile("customers.csv"));
const orders = await c.uploadFile("orders.csv", await readFile("orders.csv"));

const result = await c.discover([customers.file_id, orders.file_id]);
for (const edge of result.edges) {
  console.log(`${edge.source_column} ↔ ${edge.target_column} (${edge.kind}, ${(edge.confidence * 100).toFixed(0)}%)`);
}
```

How discovery works

Connect runs in two passes. Pass 1 (always on) is a deterministic heuristic: column-name Jaccard, sample-value Jaccard, and inferred-type compatibility scored together into a confidence in [0, 1]. Pass 2 (opt-in via enable_semantic=true) is an LLM call that looks at the file profiles and proposes additional semantic relationships the heuristic misses.
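To make Pass 1 concrete, here is a minimal sketch of how the three signals might be combined. It is not Connect's actual scoring code: the tokenization rule, the weighted-sum formula, and the weights (`w_name`, `w_value`, `w_type`) are all assumptions chosen for illustration. Only the ingredients (column-name Jaccard, sample-value Jaccard, type compatibility, a result in [0, 1]) come from the description above.

```python
import re


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def name_tokens(column: str) -> set:
    """Split a column name into lowercase tokens (assumed tokenization rule)."""
    return {t for t in re.split(r"[^a-z0-9]+", column.lower()) if t}


def pass1_score(name_a, values_a, name_b, values_b, types_compatible,
                w_name=0.25, w_value=0.5, w_type=0.25):
    """Hypothetical weighted sum of the three Pass 1 signals.

    The weights are illustrative and sum to 1, so the result stays in [0, 1].
    """
    name_sim = jaccard(name_tokens(name_a), name_tokens(name_b))
    value_sim = jaccard(set(values_a), set(values_b))
    type_sim = 1.0 if types_compatible else 0.0
    return w_name * name_sim + w_value * value_sim + w_type * type_sim


# The customer_id columns from the quickstart files share both name and values.
score = pass1_score("customer_id", ["1", "2"], "customer_id", ["1", "2"], True)
print(f"{score:.0%}")
```

Under these assumed weights, value overlap counts for more than name similarity, which matches the intuition that shared values are stronger evidence of a relationship than shared naming.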

Edge kinds

| Kind | When the heuristic emits it |
| --- | --- |
| fk_candidate | Strong value overlap AND at least one side reads as an ID. Most likely a real foreign key. |
| value_overlap | Strong value overlap, neither side is ID-shaped. Could be a shared dimension. |
| name_match | Names align but values don't overlap. Possible rename, possible coincidence; surface for review. |
| composite | No single signal dominates; the score got over the bar via combination. The catch-all. |
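The edge kinds can be read as a small decision procedure. The sketch below is one possible reading, not the heuristic Connect ships: the function name `classify_edge`, its inputs, and the `strong` threshold of 0.5 are hypothetical, and "values don't overlap" is approximated as "value overlap below the threshold".

```python
def classify_edge(value_overlap: float, name_similarity: float,
                  source_is_id: bool, target_is_id: bool,
                  strong: float = 0.5) -> str:
    """Map per-signal scores to an edge kind (illustrative thresholds only)."""
    if value_overlap >= strong:
        # Strong overlap: an ID-shaped side suggests a foreign key.
        if source_is_id or target_is_id:
            return "fk_candidate"
        return "value_overlap"
    if name_similarity >= strong:
        # Names align while values do not: flag for human review.
        return "name_match"
    # No single signal dominates; the edge qualified on the combined score.
    return "composite"


print(classify_edge(0.9, 1.0, source_is_id=True, target_is_id=False))
```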

LLM-proposed edges (when enable_semantic=true) come back tagged evidence.source = "llm_semantic" with a reason string the LLM produced.
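Because LLM-proposed edges carry a distinct tag, a consumer can separate them from heuristic edges before deciding what to trust. The sketch below assumes edges have been parsed from the JSON result into plain dicts with a nested `evidence` object; that shape, the `reason` key name, and the sample edges are assumptions for illustration, not a documented schema.

```python
def split_by_source(edges):
    """Return (heuristic_edges, llm_edges) based on evidence.source."""
    heuristic, llm = [], []
    for edge in edges:
        evidence = edge.get("evidence") or {}  # tolerate a missing evidence object
        if evidence.get("source") == "llm_semantic":
            llm.append(edge)
        else:
            heuristic.append(edge)
    return heuristic, llm


# Illustrative edges only; real payloads may carry more fields.
sample_edges = [
    {"kind": "fk_candidate", "confidence": 0.9, "evidence": {"source": "heuristic"}},
    {"kind": "composite", "confidence": 0.6,
     "evidence": {"source": "llm_semantic", "reason": "example reason text"}},
]
heuristic_edges, llm_edges = split_by_source(sample_edges)
print(len(heuristic_edges), len(llm_edges))
```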

API surface

| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /v1/connect/upload | Multipart upload of one CSV/TSV (≤ 10 MB). |
| POST | /v1/connect/jobs | Submit discovery on ≥ 2 file_ids. Body: {file_ids, min_confidence?, enable_semantic?}. |
| GET | /v1/connect/jobs | List jobs for the active org. |
| GET | /v1/connect/jobs/{id} | Job header + status + counters. |
| GET | /v1/connect/jobs/{id}/result | Graph payload: nodes (one per file) + edges, sorted by confidence. |
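For callers not using an SDK, the job-submission request can be assembled by hand. The sketch below only builds the request and does not send it. The endpoint path, HTTP method, and body fields come from the table above; the base URL (`https://api.example.com`), the `Authorization: Bearer` header scheme, and the file IDs are placeholders and assumptions, so check them against your account's actual settings.

```python
import json
import urllib.request


def build_job_request(base_url, api_key, file_ids,
                      min_confidence=None, enable_semantic=None):
    """Build (but do not send) a POST /v1/connect/jobs request."""
    if len(file_ids) < 2:
        raise ValueError("discovery needs at least 2 file_ids")
    body = {"file_ids": list(file_ids)}
    # Optional fields are omitted entirely when not set.
    if min_confidence is not None:
        body["min_confidence"] = min_confidence
    if enable_semantic is not None:
        body["enable_semantic"] = enable_semantic
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/connect/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the real header scheme may differ.
            "Authorization": f"Bearer {api_key}",
        },
    )


# Hypothetical base URL, key, and file IDs, used only to show the call shape.
req = build_job_request("https://api.example.com", "YOUR_API_KEY",
                        ["file_a", "file_b"], enable_semantic=True)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req)` would send it; the returned job can then be polled at `/v1/connect/jobs/{id}` until the result endpoint is ready.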

Embed the graph

```bash
npm install @avaresearch/dava-connect-react @avaresearch/dava-connect react
```

```tsx
import { ConnectGraph } from "@avaresearch/dava-connect-react";

export function MyView({ jobId }: { jobId: string }) {
  return <ConnectGraph apiKey={process.env.NEXT_PUBLIC_DAVA_API_KEY!} jobId={jobId} />;
}
```