Description
I have identified a critical bug where dataset_fields are duplicated indefinitely if the field type is null (or omitted). This behavior triggers a cascading, exponential explosion of rows in the column_lineage table, leading to severe database performance degradation.
Root Cause
The issue stems from the ON CONFLICT clause of the PostgreSQL INSERT statement for dataset_fields:
INSERT INTO dataset_fields ...
ON CONFLICT(dataset_uuid, name, type) DO UPDATE ...
In standard SQL, NULL is never equal to NULL, and PostgreSQL unique constraints treat NULLs as distinct by default. Therefore, if a field's type is provided as NULL, the unique constraint never matches the existing row. As a result, Marquez treats every incoming event for such a field as a "new" entry and performs an INSERT instead of an UPDATE.
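The behavior can be reproduced outside Marquez. The sketch below uses SQLite (whose unique indexes, like PostgreSQL's default NULLS DISTINCT, treat NULLs as distinct) with a simplified, hypothetical version of the dataset_fields schema; the table and column names only mirror the ones discussed here:

```python
import sqlite3

# Simplified stand-in for Marquez's dataset_fields table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_fields (
        dataset_uuid TEXT NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT,
        UNIQUE (dataset_uuid, name, type)
    )
""")

upsert = """
    INSERT INTO dataset_fields (dataset_uuid, name, type)
    VALUES (?, ?, ?)
    ON CONFLICT (dataset_uuid, name, type) DO UPDATE SET type = excluded.type
"""

# A non-NULL type conflicts as expected: 10 upserts leave a single row.
for _ in range(10):
    conn.execute(upsert, ("ds-1", "created_at", "TIMESTAMP"))

# A NULL type never matches the unique constraint: every upsert inserts.
for _ in range(10):
    conn.execute(upsert, ("ds-1", "modified_at", None))

counts = conn.execute(
    "SELECT name, COUNT(*) FROM dataset_fields GROUP BY name ORDER BY name"
).fetchall()
print(counts)  # [('created_at', 1), ('modified_at', 10)]
```

The same ten events produce one row when the type is set and ten rows when it is NULL, which is exactly the duplication described above.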
Impact: Exponential Data Explosion
This bug turns what should be idempotent ingestion into unbounded growth: every event adds a duplicate field row, and every duplicate is multiplied again by each lineage edge that touches it.
- Field Duplication:
  - A single field (e.g., modified_at) can create thousands of duplicate rows in the dataset_fields table over time if the type is consistently null.
- Lineage Explosion (The Multiplier Effect):
  - When a job runs to link this duplicated input field to an output field, Marquez generates lineage edges for ALL duplicates.
  - Expected: 1 Input Field UUID → 1 Output Field UUID = 1 row in column_lineage.
  - Actual: N Input Field UUIDs → 1 Output Field UUID = N rows in column_lineage.
- Consequence:
  - In my environment, I observed that the column_lineage table contained over 1,000x more rows than necessary.
  - Nearly 99.9% of the lineage data was garbage caused by this duplication.
  - This massive overhead causes API timeouts when querying lineage and significantly increases database storage usage.
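The multiplier effect can be sketched with the same SQLite stand-in. The schema below is a hypothetical simplification (real Marquez tables have more columns); it only illustrates how resolving a duplicated input field by name fans one lineage event out into N rows:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_fields (uuid TEXT PRIMARY KEY, "
             "dataset_uuid TEXT, name TEXT, type TEXT)")
conn.execute("CREATE TABLE column_lineage (input_field_uuid TEXT, "
             "output_field_uuid TEXT)")

# 10 duplicate rows for the same logical input field (type is NULL in all).
for _ in range(10):
    conn.execute(
        "INSERT INTO dataset_fields VALUES (?, 'ds-in', 'modified_at', NULL)",
        (str(uuid.uuid4()),),
    )

# One lineage event: link the input field (resolved by dataset + name) to one
# output field. The lookup matches every duplicate, so the edge is fanned out.
output_field_uuid = str(uuid.uuid4())
conn.execute("""
    INSERT INTO column_lineage (input_field_uuid, output_field_uuid)
    SELECT uuid, ? FROM dataset_fields
    WHERE dataset_uuid = 'ds-in' AND name = 'modified_at'
""", (output_field_uuid,))

n_rows = conn.execute("SELECT COUNT(*) FROM column_lineage").fetchone()[0]
print(n_rows)  # 10 rows for what should be a single relationship
```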
Steps to Reproduce
- Emit an OpenLineage event for a dataset where a field's type is explicitly null.
- Emit the same event multiple times (e.g., 10 times).
- Check the dataset_fields table; you will see 10 duplicate rows for the same field name.
- Emit a columnLineage event involving that field.
- Check the column_lineage table; you will see 10 lineage rows created for what should be a single relationship.
Suggested Fix
Modify the INSERT query in the DAO layer to handle NULL types explicitly.
Using COALESCE(:type, 'UNKNOWN') (or a similar default value) ensures that the ON CONFLICT clause correctly matches existing rows and performs an update instead of a duplicate insert.
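A minimal sketch of the proposed fix, again using the simplified SQLite stand-in and assuming 'UNKNOWN' as the placeholder default: coalescing the bound type to a sentinel before the insert lets the unique constraint (and thus the ON CONFLICT target) match on repeated events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_fields (
        dataset_uuid TEXT NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT,
        UNIQUE (dataset_uuid, name, type)
    )
""")

# NULL types are normalized to 'UNKNOWN' before hitting the unique constraint.
fixed_upsert = """
    INSERT INTO dataset_fields (dataset_uuid, name, type)
    VALUES (?, ?, COALESCE(?, 'UNKNOWN'))
    ON CONFLICT (dataset_uuid, name, type) DO UPDATE SET type = excluded.type
"""

# Ten events with a NULL type now collapse into a single row.
for _ in range(10):
    conn.execute(fixed_upsert, ("ds-1", "modified_at", None))

row_count = conn.execute("SELECT COUNT(*) FROM dataset_fields").fetchone()[0]
print(row_count)  # 1
```

An alternative on PostgreSQL 15+ would be declaring the constraint UNIQUE NULLS NOT DISTINCT, which makes NULL types conflict without rewriting the values.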
marquez/api/src/main/java/marquez/db/DatasetFieldDao.java
Line 276 in a89b89c