dc43 Pipeline Demo

From contracts to trusted data

Why Data Contracts?

Without clear agreements, data pipelines rely on tribal knowledge and manual checks.

Manual Step 1: Infer Schema

  • Developers inspect files to guess structure
  • Ad-hoc scripts enforce types
  • Evolution requires email coordination

Manual Step 2: Validate Inputs

  • Custom validators scattered across jobs
  • Late discovery of wrong or missing fields

Manual Step 3: Compute Metrics

  • Separate jobs count rows and nulls
  • Hard to compare across runs

Manual Step 4: Track Versions

  • Spreadsheet or wiki for dataset history
  • No link between code and documentation

Manual Step 5: Communicate Changes

  • Emails and meetings to share updates
  • Consumers discover breaking changes too late

Manual Pipeline Pain

Error-prone, slow to give feedback, and weak on governance.

Enter dc43

A thin wrapper around Spark that enforces contracts and records lineage.

1. Define Data Contract

from open_data_contract_standard.model import OpenDataContractStandard

contract = OpenDataContractStandard(
    name="orders",
    version="1.0.0",
    fields=[
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "customer_id", "type": "string"}
    ],
    expectations=["amount > 0", "customer_id not null"]
)

Contract JSON

{
  "name": "orders",
  "version": "1.0.0",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "customer_id", "type": "string"}
  ],
  "expectations": ["amount > 0", "customer_id not null"]
}
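Because the contract is plain JSON, it can be loaded and inspected with the standard library alone. A minimal sketch, assuming the simplified layout shown on this slide; the helper name field_types is hypothetical and not part of dc43:

```python
import json

# Contract document copied from the slide above (simplified layout).
CONTRACT_JSON = """
{
  "name": "orders",
  "version": "1.0.0",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "customer_id", "type": "string"}
  ],
  "expectations": ["amount > 0", "customer_id not null"]
}
"""


def field_types(contract: dict) -> dict:
    """Map each field name to its declared type (hypothetical helper)."""
    return {f["name"]: f["type"] for f in contract["fields"]}


contract_doc = json.loads(CONTRACT_JSON)
types = field_types(contract_doc)
print(types)
```

A name-to-type map like this is a convenient starting point for comparing a contract against an incoming dataset.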

2. Read with Contract

orders_df, status = read_with_contract(
    spark,
    path="orders.json",
    contract=contract,
    dq_client=dq
)

Read Status

{
  "status": "fail",
  "violations": [
    {"row": 42, "field": "amount", "message": "amount must be > 0"}
  ]
}

Manual Alternative

df = spark.read.json("orders.json")
# validate_schema is a hand-written helper, maintained separately in each job
errors = validate_schema(df)
if errors:
    raise ValueError(errors)
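The deck does not show what validate_schema does. One possible sketch, assuming it compares column types against the orders contract: instead of a DataFrame it takes the (column, type) pairs that Spark exposes as df.dtypes, so it runs without a cluster. The function body and the EXPECTED mapping are illustrative assumptions, not dc43 code:

```python
# Expected column types, taken from the orders contract on the earlier slide.
EXPECTED = {"id": "string", "amount": "double", "customer_id": "string"}


def validate_schema(dtypes, expected=EXPECTED):
    """Return a list of human-readable schema errors (empty when valid).

    `dtypes` is a list of (column, type) pairs, the shape of Spark's `df.dtypes`.
    """
    actual = dict(dtypes)
    errors = []
    for name, wanted in expected.items():
        if name not in actual:
            errors.append(f"missing field: {name}")
        elif actual[name] != wanted:
            errors.append(f"{name}: expected {wanted}, got {actual[name]}")
    return errors


# Example: amount arrived as a string and customer_id is absent.
errors = validate_schema([("id", "string"), ("amount", "string")])
print(errors)
```

Every job that reads orders.json would need its own copy of a check like this, which is the duplication a shared contract is meant to remove.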

3. Transform with Spark

enriched = (
    orders_df.join(customers_df, "customer_id")
    .withColumn("total", orders_df.amount * 1.2)  # total = amount plus 20%
)

Transformation Output

[
  {"id": "1", "total": 12.0},
  {"id": "2", "total": -6.0}
]

The negative total comes from a negative amount, which breaks the contract's amount > 0 expectation and is flagged at write time.
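The totals above follow directly from total = amount * 1.2. A plain-Python sketch of the same arithmetic; the amount of -5.0 for id "2" appears in the violation report later in the deck, while 10.0 for id "1" is an assumption inferred from its total of 12.0:

```python
# Order amounts: id "2" = -5.0 appears in the violation report later in the deck;
# id "1" = 10.0 is an assumption inferred from its total of 12.0.
orders = [{"id": "1", "amount": 10.0}, {"id": "2", "amount": -5.0}]

# Same arithmetic as the Spark expression: total = amount * 1.2.
# round() guards against floating-point noise in the printed values.
totals = [{"id": o["id"], "total": round(o["amount"] * 1.2, 2)} for o in orders]
print(totals)
```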

4. Write with Contract

result, status, draft = write_with_contract(
    enriched,
    contract=contract,
    path="out/orders",
    dq_client=dq,
    draft_on_mismatch=True
)

Write Result

{
  "metrics": {"row_count": 2, "negative_total": 1},
  "draft": {
    "version": "1.1.0",
    "changes": ["allow negative total"]
  }
}

Metrics vs Manual

# manual
row_count = enriched.count()
negatives = enriched.filter("total < 0").count()
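For comparison, the same two metrics can be gathered in a single pass over the rows. A plain-Python sketch over the transformation output shown earlier; compute_metrics is a hypothetical helper, not a dc43 function:

```python
# Transformation output rows copied from the earlier slide.
rows = [{"id": "1", "total": 12.0}, {"id": "2", "total": -6.0}]


def compute_metrics(rows):
    """Count rows and negative totals in one pass (hypothetical helper)."""
    metrics = {"row_count": 0, "negative_total": 0}
    for row in rows:
        metrics["row_count"] += 1
        if row["total"] < 0:
            metrics["negative_total"] += 1
    return metrics


metrics = compute_metrics(rows)
print(metrics)
```

The manual Spark version above triggers two separate count() actions, one per metric.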

5. Inspect Violations

status = attach_failed_expectations(
    enriched,
    contract,
    status,
    collect_examples=True
)

Violation Report

[
  {
    "expectation": "amount > 0",
    "examples": [{"id": "2", "amount": -5.0}]
  }
]
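A report of this shape could be assembled roughly as follows. This is a plain-Python sketch, not the implementation of attach_failed_expectations: the predicates are written by hand for each expectation string, and all rows except the id "2" example are illustrative assumptions:

```python
# Rows to check; id "2" matches the example in the report above,
# the other values are illustrative assumptions.
rows = [
    {"id": "1", "amount": 10.0, "customer_id": "c1"},
    {"id": "2", "amount": -5.0, "customer_id": "c2"},
]

# Each contract expectation paired with a hand-written Python predicate.
expectations = {
    "amount > 0": lambda r: r["amount"] is not None and r["amount"] > 0,
    "customer_id not null": lambda r: r["customer_id"] is not None,
}


def failed_expectations(rows, expectations, max_examples=3):
    """List expectations with at least one failing row, plus sample rows."""
    report = []
    for name, holds in expectations.items():
        failures = [r for r in rows if not holds(r)]
        if failures:
            report.append({"expectation": name, "examples": failures[:max_examples]})
    return report


report = failed_expectations(rows, expectations)
print(report)
```

Capping the examples per expectation keeps the report small even when many rows fail.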

6. Track Dataset Versions

records.append(DatasetRecord(
    name="orders_enriched",
    version=1,
    status=status.status,
    metrics=result.metrics
))
save_records(records)

Version History

[
  {"version": 1, "row_count": 2, "status": "fail"}
]
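The deck does not define DatasetRecord or save_records. A minimal sketch of what such a record store might look like, assuming a dataclass with the four fields used above and JSON as the storage format; both definitions here are stand-ins:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Stand-in for the record type used on the slide; fields are assumed."""
    name: str
    version: int
    status: str
    metrics: dict = field(default_factory=dict)


def dump_records(records):
    """Serialise records to a JSON string (a real helper might write a file)."""
    return json.dumps([asdict(r) for r in records], indent=2)


records = [DatasetRecord("orders_enriched", 1, "fail", {"row_count": 2})]
payload = dump_records(records)
print(payload)
```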

Pipeline Comparison

Manual

  • Separate scripts for validation and metrics
  • Manual tracking of versions
  • Inconsistent rules

dc43

  • Contracts enforce schema and rules
  • Metrics captured on write
  • History recorded automatically

With vs Without Contracts

  • Without: implicit schemas, late errors, manual docs
  • With: versioned definitions, early validation, consistent governance

Benefits for Data Engineers

  • Less boilerplate Spark code
  • Early detection of issues
  • Automatic metrics for monitoring

Benefits for Governance

  • Traceable changes across versions
  • Clear contracts between producers and consumers
  • Audit-friendly metrics and violations

Contract Evolution

# v1.0.0
{"fields": [{"name": "amount", "type": "double"}]}
# v1.1.0
{"fields": [{"name": "amount", "type": "double", "nullable": true}]}
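The difference between two contract versions can be listed mechanically. A small sketch, assuming the simplified field layout used in this deck; diff_fields is a hypothetical helper and not part of dc43:

```python
v1_0_0 = {"fields": [{"name": "amount", "type": "double"}]}
v1_1_0 = {"fields": [{"name": "amount", "type": "double", "nullable": True}]}


def diff_fields(old, new):
    """Describe added, removed, and changed fields between two contracts."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    changes = []
    for name in sorted(old_fields.keys() | new_fields.keys()):
        if name not in old_fields:
            changes.append(f"added {name}")
        elif name not in new_fields:
            changes.append(f"removed {name}")
        elif old_fields[name] != new_fields[name]:
            changes.append(f"changed {name}")
    return changes


changes = diff_fields(v1_0_0, v1_1_0)
print(changes)
```

A change list like this gives reviewers something concrete to approve before a draft version is published.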

Summary

  • Contracts define, validate, and document data
  • dc43 automates metrics and versioning
  • Manual steps shrink, reliability grows

Get Started

pip install dc43 → build your first contract today.