Move taxonomy_id to TargetGene; populate via mapping for accession-based targets #697
Open
Labels
app: backend, app: database, app: frontend, app: mapper, app: worker, type: enhancement, type: maintenance
Background
Taxonomy currently lives on `TargetSequence`, but organism is a property of the gene target, not of how its sequence is represented. Accession-based targets have no taxonomy representation at all, despite every accession-based target being implicitly Homo sapiens by virtue of CDOT's current scope. This makes it impossible to query or filter accession-based score sets by organism in a structured way.
Proposed Changes
Move `taxonomy_id` to `TargetGene`
`taxonomy_id` moves from `target_sequences` to `target_genes`, applying uniformly to both sequence-based and accession-based targets.
For accession-based targets, taxonomy is populated by the mapping job via CDOT lookup. It is never user-supplied. While CDOT is human-only this will always resolve to Homo sapiens, but the design is forward-compatible with potential future multi-organism support in accession-based targets.
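
The accession-based flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `TargetGene` dataclass is a simplified stand-in for the real model, and `cdot_taxonomy_lookup` and `populate_taxonomy_from_mapping` are hypothetical names. The only external fact relied on is that 9606 is the NCBI Taxonomy ID for Homo sapiens.

```python
from dataclasses import dataclass
from typing import Callable, Optional

HOMO_SAPIENS_TAX_ID = 9606  # NCBI Taxonomy ID for Homo sapiens


@dataclass
class TargetGene:
    """Simplified stand-in for the real model (assumption, not the actual schema)."""

    name: str
    accession: Optional[str] = None    # set for accession-based targets
    sequence: Optional[str] = None     # set for sequence-based targets
    taxonomy_id: Optional[int] = None  # lives on the gene, not on the sequence


def cdot_taxonomy_lookup(accession: str) -> Optional[int]:
    """Hypothetical lookup. While CDOT is human-only, any accession it knows
    resolves to Homo sapiens; unknown accessions yield None."""
    known_accessions = {"NM_000546.6"}  # illustrative data only
    return HOMO_SAPIENS_TAX_ID if accession in known_accessions else None


def populate_taxonomy_from_mapping(
    gene: TargetGene,
    lookup: Callable[[str], Optional[int]] = cdot_taxonomy_lookup,
) -> None:
    """Hypothetical mapping-job step: derive taxonomy for accession-based targets.

    Sequence-based targets are skipped, so user-supplied taxonomy is never
    overwritten. A failed lookup leaves taxonomy_id as None.
    """
    if gene.accession is None:
        return
    tax_id = lookup(gene.accession)
    if tax_id is not None:
        gene.taxonomy_id = tax_id
```

Passing the lookup as a parameter keeps the sketch independent of any particular CDOT client and would let a future multi-organism lookup be swapped in without changing the mapping step.
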
For sequence-based targets, taxonomy remains user-supplied but moves up to `TargetGeneCreate` rather than being nested inside `TargetSequenceCreate`.
Preserve non-breaking response shapes
The view model serialization layer absorbs the storage change so existing clients are unaffected:
- `target_gene.target_sequence.taxonomy` is preserved by populating `taxonomy` in `SavedTargetSequence` from `TargetGene.taxonomy` during serialization rather than from the sequence row directly.
- `target_gene.taxonomy` is a new field on accession-based responses, populated after mapping. Clients that do not know about it are unaffected.
- `target_gene.taxonomy` is intentionally not added to sequence-based responses at this time to avoid taxonomy appearing in two places in the same response. Normalizing the response shape across both types is a separate future change.
Breaking Changes
- `TargetSequenceCreate` `taxonomy` field
- `TargetGeneCreate` `taxonomy` field (sequence-based targets)
- `TargetSequence` response `taxonomy` stays in place (serialized from `TargetGene`)
- `TargetGene` response `taxonomy` for accession-based targets
When Taxonomy Is Null vs. Populated
For sequence-based targets, taxonomy is user-supplied at creation time and will always be populated. A null value indicates a data integrity problem.
For accession-based targets, taxonomy is derived by the mapping job and will be null until mapping completes successfully. This is an expected transient state. Consumers of the API should treat a null taxonomy on an accession-based target as "mapping has not yet run or has not yet succeeded" rather than as an absent or unknown organism. A null value on a published accession-based target is valid and simply means the mapping job has not yet run, or has not yet succeeded, for that score set.
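
The response shapes and null semantics described in the two sections above can be sketched as follows. This is an illustration under assumptions, not the actual view models: `serialize_target_gene` and `taxonomy_status` are hypothetical helpers, the `target_accession` key and the status strings are invented for the example, and the taxonomy payload is treated as an opaque dictionary.

```python
from typing import Any, Dict, Optional


def serialize_target_gene(
    name: str,
    taxonomy: Optional[Dict[str, Any]],
    sequence: Optional[str] = None,
    accession: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical serializer: taxonomy is stored on the gene either way."""
    if sequence is not None:
        # Sequence-based: keep the legacy location and do not add a second
        # copy at the gene level.
        return {
            "name": name,
            "target_sequence": {"sequence": sequence, "taxonomy": taxonomy},
        }
    # Accession-based: new gene-level field, null until mapping succeeds.
    return {
        "name": name,
        "target_accession": {"accession": accession},  # assumed key name
        "taxonomy": taxonomy,
    }


def taxonomy_status(response: Dict[str, Any]) -> str:
    """Hypothetical client-side interpretation of a null taxonomy."""
    if "target_sequence" in response:
        # Null here indicates a data integrity problem, not a pending job.
        populated = response["target_sequence"].get("taxonomy") is not None
        return "populated" if populated else "integrity error"
    populated = response.get("taxonomy") is not None
    return "populated" if populated else "awaiting mapping"
```

A client written this way never reads a missing organism as "unknown"; for accession-based targets it reports that mapping is still pending.
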
Migration Notes
- `taxonomy_id` column moves from `target_sequences` to `target_genes`; data migration required
- Existing accession-based targets will have `null` taxonomy until their mapping jobs are re-run or a backfill migration runs a CDOT lookup for each accession
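
The two migration steps above could look roughly like the sketch below. It uses Python's built-in `sqlite3` purely so the example is self-contained; the real migration would use the project's own database and migration tooling. The schema is an assumption: `target_genes` is given a `target_sequence_id` foreign key and an `accession` column for simplicity, and `migrate_taxonomy` and its `lookup` argument are hypothetical. Dropping the old `target_sequences.taxonomy_id` column is left out.

```python
import sqlite3
from typing import Callable, Optional


def migrate_taxonomy(
    conn: sqlite3.Connection,
    lookup: Callable[[str], Optional[int]],
) -> None:
    """Move taxonomy_id onto target_genes, then backfill accession-based rows."""
    # 1. Add the new nullable column on target_genes.
    conn.execute("ALTER TABLE target_genes ADD COLUMN taxonomy_id INTEGER")

    # 2. Copy existing values up from each gene's sequence row.
    conn.execute(
        """
        UPDATE target_genes
        SET taxonomy_id = (
            SELECT ts.taxonomy_id
            FROM target_sequences AS ts
            WHERE ts.id = target_genes.target_sequence_id
        )
        WHERE target_sequence_id IS NOT NULL
        """
    )

    # 3. Optional backfill: look up each accession that still has no taxonomy.
    rows = conn.execute(
        "SELECT id, accession FROM target_genes "
        "WHERE accession IS NOT NULL AND taxonomy_id IS NULL"
    ).fetchall()
    for gene_id, accession in rows:
        tax_id = lookup(accession)
        if tax_id is not None:  # leave null if the lookup fails
            conn.execute(
                "UPDATE target_genes SET taxonomy_id = ? WHERE id = ?",
                (tax_id, gene_id),
            )
    conn.commit()
```

If step 3 is skipped, accession-based rows simply stay null until their mapping jobs are re-run, which matches the transient state described earlier.
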