Product

DAVA Norm

Drop a messy CSV, get a clean structured table back. Snake-case headers, type inference + coercion, whitespace trim, drop empty rows. Plus Smart Tables: per-column PII tags and outlier counts.
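To make the cleanup steps concrete, here is a minimal local sketch of three of them: snake-case headers, whitespace trim, and dropping empty rows. It uses only the Python standard library. The function names `snake_case` and `normalize` are hypothetical, and the exact rules the service applies (for example, how it splits camelCase or treats punctuation) are assumptions. Type inference and coercion are left out.

```python
import csv
import io
import re


def snake_case(header: str) -> str:
    """Lower-case a header and join its alphanumeric words with underscores."""
    # Assumption: camelCase boundaries count as word breaks ("orderID" -> "order ID").
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", header.strip())
    words = re.findall(r"[A-Za-z0-9]+", spaced)
    return "_".join(word.lower() for word in words)


def normalize(csv_text: str) -> tuple[list[str], list[list[str]], int]:
    """Return (headers, rows, dropped_empty_rows) for a CSV string."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = [snake_case(h) for h in next(reader)]
    rows: list[list[str]] = []
    dropped = 0
    for raw in reader:
        row = [cell.strip() for cell in raw]  # whitespace trim
        if not any(row):  # blank line, or every cell empty after trimming
            dropped += 1
            continue
        rows.append(row)
    return headers, rows, dropped
```

Calling `normalize(open("messy.csv", newline="").read())` would return the renamed headers, the trimmed rows, and a count comparable to `dropped_rows_empty` in the preview result.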

Quickstart (Python)

pip install dava-norm

import asyncio
from dava_norm import Client

async def main():
    async with Client(api_key="dava_live_…") as c:
        with open("messy.csv", "rb") as f:
            result = await c.preview("messy.csv", f.read())
        print(f"{result.rows_in} → {result.rows_out} rows "
              f"({result.dropped_rows_empty} empty rows dropped)")
        for col in result.columns:
            print(f"  {col.name_in!r} → {col.name_out!r}  "
                  f"({col.inferred_type}, {col.sensitivity_tag}, "
                  f"{col.outlier_count} outliers)")
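        # newline="" keeps the csv line endings as returned instead of translating them.
        with open("clean.csv", "w", newline="", encoding="utf-8") as f:
            f.write(result.cleaned_csv)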

asyncio.run(main())

Quickstart (TypeScript)

import { Client } from "@avaresearch/dava-norm";
import { readFile } from "node:fs/promises";

const c = new Client({ apiKey: process.env.DAVA_API_KEY! });
const result = await c.preview("messy.csv", await readFile("messy.csv"));
console.log(`${result.rows_in} rows → ${result.rows_out} rows`);

Smart Tables

Every column comes back tagged with a sensitivity inference (PII detection) and, for numeric columns, an outlier count. The sensitivity tag is a hint surfaced in the dashboard so customers can mask sensitive columns before exporting. It is not a security boundary: Trust Layer policies do the actual access enforcement.
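One way to act on the sensitivity hint is to mask tagged columns locally before writing an export. The sketch below is an illustration under assumptions, not part of the SDK: `mask_sensitive` is a hypothetical helper, the cleaned CSV is assumed to be a string, and `tags` is assumed to map each output column name to its tag.

```python
import csv
import io


def mask_sensitive(cleaned_csv: str, tags: dict[str, str], mask: str = "***") -> str:
    """Replace every value in columns whose sensitivity tag is not "none"."""
    reader = csv.DictReader(io.StringIO(cleaned_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for name in row:
            # Columns missing from `tags` are treated as untagged (assumption).
            if tags.get(name, "none") != "none":
                row[name] = mask
        writer.writerow(row)
    return out.getvalue()
```

With a preview result in hand, the mapping could be built as `{col.name_out: col.sensitivity_tag for col in result.columns}`. This only masks the local copy; access enforcement still belongs to Trust Layer policies.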

Tag	Detector
`email`	Standard email regex; ≥ 80% of sample matches.
`phone`	10-15 digit phone-shaped values.
`ssn_us`	NNN-NN-NNNN.
`credit_card`	13-19 digits passing Luhn check.
`iban`	Country code + check digits + BBAN.
`dob`	ISO date or DD/MM/YYYY / MM/DD/YYYY.
`name_like`	1-3 capitalized words, conservative threshold.
`none`	Nothing matched. Most columns end up here.
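As a rough illustration of how two of these detectors could work, the sketch below implements a Luhn check for `credit_card` and a share-of-sample threshold for `email`. It is not the service's implementation. The email pattern is a simplified stand-in for the "standard email regex", reusing the 80% threshold for card numbers is an assumption (the table states it only for `email`), and all function names are hypothetical.

```python
import re

# Simplified stand-in pattern; the service's actual regex is not documented here.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def luhn_ok(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_card(value: str) -> bool:
    """13-19 digits (spaces and dashes ignored) that pass the Luhn check."""
    digits = re.sub(r"[ -]", "", value)
    return digits.isascii() and digits.isdigit() and 13 <= len(digits) <= 19 and luhn_ok(digits)


def tag_column(values: list[str], threshold: float = 0.8) -> str:
    """Tag a column when at least `threshold` of its non-empty values match."""
    sample = [v.strip() for v in values if v.strip()]
    if not sample:
        return "none"
    detectors = {
        "email": lambda v: EMAIL_RE.fullmatch(v) is not None,
        "credit_card": looks_like_card,
    }
    for tag, matches in detectors.items():
        hits = sum(1 for v in sample if matches(v))
        if hits / len(sample) >= threshold:
            return tag
    return "none"
```

A threshold below 100% tolerates a few malformed cells in an otherwise sensitive column, which is presumably why the `email` detector is specified as a share of the sample and not an all-or-nothing match.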

API surface

Method	Endpoint	Purpose
POST	`/v1/norm/preview`	Multipart upload of one CSV/TSV (≤ 5 MB). Returns cleaned bytes inline + per-column stats.
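Because the endpoint accepts a single CSV/TSV of at most 5 MB, a client-side check before uploading can fail fast on oversized files. The sketch below is a hypothetical helper, not part of the SDK. It assumes "5 MB" means 5 × 1024 × 1024 bytes and that file type can be judged by extension; the server may apply different rules.

```python
from pathlib import Path

# Assumption: the documented 5 MB limit is interpreted as 5 MiB.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024
ALLOWED_SUFFIXES = {".csv", ".tsv"}


def preflight(filename: str, data: bytes) -> None:
    """Raise ValueError if the upload would likely be rejected."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"expected a .csv or .tsv file, got {suffix or 'no extension'!r}")
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError(f"{len(data)} bytes exceeds the {MAX_UPLOAD_BYTES}-byte limit")
```

Calling `preflight("messy.csv", payload)` just before `c.preview("messy.csv", payload)` avoids sending a request that is expected to fail.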