Description
I have identified a critical bug where dataset_fields are duplicated indefinitely if the field type is null (or omitted). This behavior triggers a cascading, exponential explosion of rows in the column_lineage table, leading to severe database performance degradation.
Root Cause
The issue stems from the ON CONFLICT clause of the PostgreSQL INSERT statement for dataset_fields:
INSERT INTO dataset_fields ...
ON CONFLICT(dataset_uuid, name, type) DO UPDATE ...
In standard SQL, NULL is never equal to NULL, and PostgreSQL unique constraints treat NULLs as distinct by default. Therefore, if a field's type is provided as NULL, the unique constraint never matches the existing row. As a result, Marquez treats every incoming event for such a field as a "new" entry and performs an INSERT instead of an UPDATE.
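The behavior can be reproduced outside Marquez. The sketch below uses SQLite (whose unique indexes, like PostgreSQL's default NULLS DISTINCT, treat NULLs as distinct) with a simplified, hypothetical version of the dataset_fields schema; the table and column names only mirror the ones discussed here:

```python
import sqlite3

# Simplified stand-in for Marquez's dataset_fields table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_fields (
        dataset_uuid TEXT NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT,
        UNIQUE (dataset_uuid, name, type)
    )
""")

upsert = """
    INSERT INTO dataset_fields (dataset_uuid, name, type)
    VALUES (?, ?, ?)
    ON CONFLICT (dataset_uuid, name, type) DO UPDATE SET type = excluded.type
"""

# A non-NULL type conflicts as expected: 10 upserts leave a single row.
for _ in range(10):
    conn.execute(upsert, ("ds-1", "created_at", "TIMESTAMP"))

# A NULL type never matches the unique constraint: every upsert inserts.
for _ in range(10):
    conn.execute(upsert, ("ds-1", "modified_at", None))

counts = conn.execute(
    "SELECT name, COUNT(*) FROM dataset_fields GROUP BY name ORDER BY name"
).fetchall()
print(counts)  # [('created_at', 1), ('modified_at', 10)]
```

The same ten events produce one row when the type is set and ten rows when it is NULL, which is exactly the duplication described above.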
Impact: Exponential Data Explosion
This bug turns what should be idempotent ingestion into unbounded growth: every event adds a duplicate field row, and every duplicate is multiplied again by each lineage edge that touches it.
- Field Duplication:
  - A single field (e.g., modified_at) can create thousands of duplicate rows in the dataset_fields table over time if the type is consistently null.
- Lineage Explosion (The Multiplier Effect):
  - When a job runs to link this duplicated input field to an output field, Marquez generates lineage edges for ALL duplicates.
  - Expected: 1 Input Field UUID → 1 Output Field UUID = 1 row in column_lineage.
  - Actual: N Input Field UUIDs → 1 Output Field UUID = N rows in column_lineage.
- Consequence:
  - In my environment, I observed that the column_lineage table contained over 1,000x more rows than necessary.
  - Nearly 99.9% of the lineage data was garbage caused by this duplication.
  - This massive overhead causes API timeouts when querying lineage and significantly increases database storage usage.
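The multiplier effect can be sketched with the same SQLite stand-in. The schema below is a hypothetical simplification (real Marquez tables have more columns); it only illustrates how resolving a duplicated input field by name fans one lineage event out into N rows:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_fields (uuid TEXT PRIMARY KEY, "
             "dataset_uuid TEXT, name TEXT, type TEXT)")
conn.execute("CREATE TABLE column_lineage (input_field_uuid TEXT, "
             "output_field_uuid TEXT)")

# 10 duplicate rows for the same logical input field (type is NULL in all).
for _ in range(10):
    conn.execute(
        "INSERT INTO dataset_fields VALUES (?, 'ds-in', 'modified_at', NULL)",
        (str(uuid.uuid4()),),
    )

# One lineage event: link the input field (resolved by dataset + name) to one
# output field. The lookup matches every duplicate, so the edge is fanned out.
output_field_uuid = str(uuid.uuid4())
conn.execute("""
    INSERT INTO column_lineage (input_field_uuid, output_field_uuid)
    SELECT uuid, ? FROM dataset_fields
    WHERE dataset_uuid = 'ds-in' AND name = 'modified_at'
""", (output_field_uuid,))

n_rows = conn.execute("SELECT COUNT(*) FROM column_lineage").fetchone()[0]
print(n_rows)  # 10 rows for what should be a single relationship
```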
Steps to Reproduce
- Emit an OpenLineage event for a dataset where a field's type is explicitly null.
- Emit the same event multiple times (e.g., 10 times).
- Check the dataset_fields table; you will see 10 duplicate rows for the same field name.
- Emit a columnLineage event involving that field.
- Check the column_lineage table; you will see 10 lineage rows created for what should be a single relationship.
Suggested Fix
Modify the INSERT query in the DAO layer to handle NULL types explicitly.
Using COALESCE(:type, 'UNKNOWN') (or a similar default value) ensures that the ON CONFLICT clause correctly matches existing rows and performs an update instead of a duplicate insert.
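A minimal sketch of the proposed fix, again using the simplified SQLite stand-in and assuming 'UNKNOWN' as the placeholder default: coalescing the bound type to a sentinel before the insert lets the unique constraint (and thus the ON CONFLICT target) match on repeated events.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_fields (
        dataset_uuid TEXT NOT NULL,
        name         TEXT NOT NULL,
        type         TEXT,
        UNIQUE (dataset_uuid, name, type)
    )
""")

# NULL types are normalized to 'UNKNOWN' before hitting the unique constraint.
fixed_upsert = """
    INSERT INTO dataset_fields (dataset_uuid, name, type)
    VALUES (?, ?, COALESCE(?, 'UNKNOWN'))
    ON CONFLICT (dataset_uuid, name, type) DO UPDATE SET type = excluded.type
"""

# Ten events with a NULL type now collapse into a single row.
for _ in range(10):
    conn.execute(fixed_upsert, ("ds-1", "modified_at", None))

row_count = conn.execute("SELECT COUNT(*) FROM dataset_fields").fetchone()[0]
print(row_count)  # 1
```

An alternative on PostgreSQL 15+ would be declaring the constraint UNIQUE NULLS NOT DISTINCT, which makes NULL types conflict without rewriting the values.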
marquez/api/src/main/java/marquez/db/DatasetFieldDao.java
Line 276 in a89b89c