A compliance analysis tool for LinkML data files. Measures how well your data populates recommended: true slots defined in LinkML schemas.
- Hierarchical scoring: Calculate compliance at multiple levels (global, path-level, per-item)
- Aggregated list scoring: Roll up scores across list elements using jq-style [] notation
- Configurable weights: Assign importance weights to paths and slots
- Threshold violations: Set minimum compliance requirements and detect violations
- Multiple output formats: JSON, CSV, and human-readable text
- Multi-file reports: Aggregate compliance across an entire knowledge base
- HTML dashboards: Generate static dashboard sites for GitHub Pages (see dismech example)
pip install linkml-data-qc
Or with uv:
uv add linkml-data-qc
from linkml_data_qc import ComplianceAnalyzer
# Basic usage
analyzer = ComplianceAnalyzer("path/to/schema.yaml")
report = analyzer.analyze_file("path/to/data.yaml", "TargetClass")
print(f"Global compliance: {report.global_compliance:.1f}%")
print(f"Total checks: {report.total_checks}")
print(f"Total populated: {report.total_populated}")
# With configuration for weights and thresholds
from linkml_data_qc import QCConfig, SlotQCConfig
config = QCConfig(
    default_weight=1.0,
    slots={
        "term": SlotQCConfig(weight=2.0, min_compliance=80.0),
        "description": SlotQCConfig(weight=0.5)
    }
)
analyzer = ComplianceAnalyzer("schema.yaml", config)
report = analyzer.analyze_file("data.yaml", "Disease")
if report.threshold_violations:
    print(f"Found {len(report.threshold_violations)} violations!")
# Single file analysis
linkml-data-qc data.yaml -s schema.yaml -t TargetClass -f text
# Analyze all files in a directory
linkml-data-qc data/ -s schema.yaml -t TargetClass -f json
# With configuration and threshold enforcement
linkml-data-qc data/ -s schema.yaml -t TargetClass \
    -c qc_config.yaml --fail-on-violations

| Option | Description |
|---|---|
| DATA_PATH... | Data file(s) or directory to analyze (positional) |
| -s, --schema | Path to LinkML schema YAML (required) |
| -t, --target-class | Target class name for validation (required) |
| -c, --config | Path to QC configuration YAML file |
| -f, --format | Output format: json, csv, text (default: text) |
| -o, --output | Output file path (default: stdout) |
| --min-compliance | Minimum global compliance percentage (exit 1 if below) |
| --fail-on-violations | Exit with error code if any threshold violations occur |
| --pattern | Glob pattern for directory search (default: *.yaml) |

The tool uses LinkML's SchemaView to identify slots marked with recommended: true:
# In your LinkML schema
slots:
  description:
    description: Human-readable description
    recommended: true  # This slot will be tracked
  term:
    description: Ontology term binding
    recommended: true  # This slot will be tracked
The analyzer recursively traverses your data, tracking:
- Which recommended slots are present at each location
- The path to each object (e.g., pathophysiology[0].cell_types[2])
- The LinkML class of each object
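
The traversal described above can be sketched in plain Python. This is a minimal illustration and not the tool's actual implementation: the `RECOMMENDED_SLOTS` and `SLOT_RANGES` tables are hypothetical, hand-written stand-ins for what the tool derives from the schema, and `walk` is a hypothetical helper name.

```python
from typing import Iterator, Tuple

# Hypothetical stand-ins for schema information (assumptions for this sketch):
# recommended slots per class, and the class of objects held by nesting slots.
RECOMMENDED_SLOTS = {
    "Disease": ["description"],
    "Pathophysiology": ["description", "term"],
}
SLOT_RANGES = {"pathophysiology": "Pathophysiology"}


def walk(obj: dict, class_name: str, path: str = "") -> Iterator[Tuple[str, str, str, bool]]:
    """Yield (path, class_name, slot, is_populated) for every recommended slot."""
    for slot in RECOMMENDED_SLOTS.get(class_name, []):
        # Treat missing, None, and empty values as "not populated".
        populated = obj.get(slot) not in (None, "", [], {})
        yield path or "(root)", class_name, slot, populated
    for slot, value in obj.items():
        child_class = SLOT_RANGES.get(slot)
        if child_class is None:
            continue  # not a nesting slot in this sketch
        child_path = f"{path}.{slot}" if path else slot
        if isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    yield from walk(item, child_class, f"{child_path}[{i}]")
        elif isinstance(value, dict):
            yield from walk(value, child_class, child_path)


# Placeholder data; the term value is an invented example identifier.
data = {
    "name": "Asthma",
    "description": "Chronic airway disease",
    "pathophysiology": [
        {"description": "Airway inflammation", "term": "EXAMPLE:0001"},
        {"description": "Bronchoconstriction"},
    ],
}
checks = list(walk(data, "Disease"))
for check in checks:
    print(check)
```

Each yielded tuple records where the object sits, which class it was checked against, and whether the recommended slot was filled in; the second list item above is reported with `term` unpopulated.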
Results are computed at multiple levels:
- Per-item scores: Each object gets compliance scores for its recommended slots
- Aggregated list scores: Rolled up by normalized path with [] notation
- Global scores: Overall compliance across all paths
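
As a rough sketch of the roll-up, per-item checks can be grouped by replacing list indices with `[]` and counting populated slots per normalized path. The helper names `normalize` and `aggregate` are hypothetical, and the tool's own aggregation may differ in detail.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def normalize(path: str) -> str:
    """Replace concrete list indices such as [0] with [] (jq-style)."""
    return re.sub(r"\[\d+\]", "[]", path)


def aggregate(checks: Iterable[Tuple[str, str, bool]]) -> Dict[str, Tuple[int, int, float]]:
    """Map 'normalized_path.slot' to (populated, total, percent)."""
    counts = defaultdict(lambda: [0, 0])
    for path, slot, populated in checks:
        key = f"{normalize(path)}.{slot}"
        counts[key][0] += int(populated)
        counts[key][1] += 1
    return {k: (p, t, 100.0 * p / t) for k, (p, t) in counts.items()}


# Five list items: every description is populated, one term is missing.
checks = []
for i in range(5):
    checks.append((f"pathophysiology[{i}]", "description", True))
    checks.append((f"pathophysiology[{i}]", "term", i != 4))

by_path = aggregate(checks)
populated = sum(p for p, _, _ in by_path.values())
total = sum(t for _, t, _ in by_path.values())
global_pct = 100.0 * populated / total  # unweighted global score

for key, (p, t, pct) in by_path.items():
    print(f"{key}: {pct:.1f}% ({p}/{t})")
print(f"Global: {global_pct:.1f}% ({populated}/{total})")
```

The global figure here is an unweighted ratio of populated checks to total checks; how weights enter the weighted score is not shown in this sketch.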
Create a YAML configuration file to customize weights and thresholds:
# qc_config.yaml
default_weight: 1.0
default_min_compliance: null
# Per-slot configuration
slots:
  term:
    weight: 2.0
    min_compliance: 80.0
  description:
    weight: 0.5
# Per-path overrides
paths:
  "phenotypes[].phenotype_term.term":
    weight: 3.0
    min_compliance: 95.0

Settings are resolved in this order:
- Path-specific config (highest priority)
- Slot-specific config
- Default values
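
The precedence order above can be illustrated with a small lookup function. This is a sketch under stated assumptions: `Setting`, `Config`, and `resolve` are hypothetical names rather than the library's classes, and the sketch assumes each field (weight, minimum compliance) falls back independently when a more specific entry leaves it unset.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Setting:
    """Hypothetical per-slot or per-path entry; None means 'not set here'."""
    weight: Optional[float] = None
    min_compliance: Optional[float] = None


@dataclass
class Config:
    """Hypothetical mirror of the YAML layout shown above."""
    default_weight: float = 1.0
    default_min_compliance: Optional[float] = None
    slots: Dict[str, Setting] = field(default_factory=dict)
    paths: Dict[str, Setting] = field(default_factory=dict)


def resolve(config: Config, full_path: str) -> Tuple[float, Optional[float]]:
    """Return (weight, min_compliance) for a normalized path ending in a slot name."""
    slot = full_path.rsplit(".", 1)[-1]
    # Candidates in priority order: path entry first, then slot entry.
    candidates = [config.paths.get(full_path), config.slots.get(slot)]
    weight = config.default_weight
    min_compliance = config.default_min_compliance
    # Apply lowest priority first so higher-priority values overwrite.
    for entry in reversed(candidates):
        if entry is None:
            continue
        if entry.weight is not None:
            weight = entry.weight
        if entry.min_compliance is not None:
            min_compliance = entry.min_compliance
    return weight, min_compliance


cfg = Config(
    slots={
        "term": Setting(weight=2.0, min_compliance=80.0),
        "description": Setting(weight=0.5),
    },
    paths={"phenotypes[].phenotype_term.term": Setting(weight=3.0, min_compliance=95.0)},
)

print(resolve(cfg, "phenotypes[].phenotype_term.term"))  # path override applies
print(resolve(cfg, "pathophysiology[].term"))            # slot config applies
print(resolve(cfg, "genes[].name"))                      # defaults apply
```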
Compliance Report: data/Asthma.yaml
Target Class: Disease
Global Compliance: 65.3% (125/191)
Weighted Compliance: 71.2%
Summary by Slot:
description: 78.4%
term: 72.1%
Aggregated Scores by List Path:
pathophysiology[].description: 100.0% (5/5)
pathophysiology[].term: 80.0% (4/5)
{
  "file_path": "data/Asthma.yaml",
  "target_class": "Disease",
  "global_compliance": 65.3,
  "weighted_compliance": 71.2,
  "total_checks": 191,
  "total_populated": 125,
  "summary_by_slot": {
    "description": 78.4,
    "term": 72.1
  }
}
Use exit codes for CI integration:
# Fail if global compliance is below 70%
linkml-data-qc data/ -s schema.yaml -t Disease --min-compliance 70
# Fail if any configured threshold is violated
linkml-data-qc data/ -s schema.yaml -t Disease \
    -c qc_config.yaml --fail-on-violations
Exit codes:
- 0: All checks passed
- 1: Compliance below threshold or violations detected
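
The exit-code behavior described by the `--min-compliance` and `--fail-on-violations` options can be summarized as a small decision function. This is an illustrative sketch only; `exit_code` is a hypothetical helper and not part of the package.

```python
from typing import Optional, Sequence


def exit_code(
    global_compliance: float,
    violations: Sequence[str],
    min_compliance: Optional[float] = None,
    fail_on_violations: bool = False,
) -> int:
    """Return 0 when all requested checks pass, 1 otherwise."""
    # --min-compliance: fail when the global percentage is below the minimum.
    if min_compliance is not None and global_compliance < min_compliance:
        return 1
    # --fail-on-violations: fail when any configured threshold was violated.
    if fail_on_violations and violations:
        return 1
    return 0


# A 65.3% report fails a 70% minimum but passes when no minimum is requested.
print(exit_code(65.3, [], min_compliance=70))
print(exit_code(65.3, []))
```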
https://linkml.github.io/linkml-data-qc
# Install dependencies
uv sync --group dev
# Run tests
just test
# Run doctests only
just doctest
# Run type checking
just mypy
# Run linting
just format
This project uses the template monarch-project-copier