- Overview
- Quick Start
- Command Reference
- Advanced Configuration
- Metadata & Caching
- Advanced (optional)
- Developer Setup
- Contributing
A comprehensive Python package for analyzing eduGAIN federation metadata quality, privacy statement coverage, security compliance, and SIRTFI certification. Built following modern Python standards with PEP 517/518/621 compliance.
Version 3.0 Breaking Changes: Privacy statement tracking now includes Identity Providers (IdPs) alongside Service Providers (SPs). This affects CSV format, filter behavior, and statistics output. See Migration Guide and CHANGELOG for details.
- SIRTFI Coverage Tracking: Comprehensive SIRTFI certification tracking across all CLI outputs (summary, CSV, markdown reports)
- SIRTFI Compliance Tools: Two specialized CLI tools for security contact and SIRTFI certification validation
- Privacy Statement Monitoring: HTTP accessibility validation for privacy statement URLs (both SPs and IdPs) with dedicated broken links detection tool
- Universal Privacy Tracking: Privacy statements tracked for both Service Providers and Identity Providers (v3.0+)
- Federation Intelligence: Automatic mapping from registration authorities to friendly names via eduGAIN API
- XDG-Compliant Caching: Smart caching system with configurable expiry (metadata: 12h, federations: 30d, URLs: 7d)
- Multiple Output Formats: Summary statistics, detailed CSV exports, markdown reports, and visual PDF reports
- Modern Architecture: Modular design with comprehensive testing (coverage: 100% for CLI, 90%+ for core modules)
- Fast Tooling: Ruff for linting and formatting
- Comprehensive Reporting: Split statistics for SPs vs IdPs with federation-level breakdowns
python -m pip install --upgrade pip
python -m pip install edugain-analysis

Prefer running from a clone? Install in editable mode instead:
git clone https://github.com/maartenk/edugain-qual-improv.git
cd edugain-qual-improv
python -m pip install -e .

After installation, your environment exposes the CLI entry points edugain-analyze, edugain-seccon, edugain-sirtfi, and edugain-broken-privacy. They live in the Python environment's bin/ (or Scripts\ on Windows) like any other console scripts.
Want to run from a clone without installing?
# Shell wrappers (from repo root)
./scripts/app/edugain-analyze.sh --report
./scripts/app/edugain-seccon.sh
./scripts/app/edugain-sirtfi.sh
./scripts/app/edugain-broken-privacy.sh
# Direct module invocation
python -m edugain_analysis.cli.main
# Convenience wrapper (repo root)
python analyze.py

# Summary of privacy, security contacts, and SIRTFI coverage
edugain-analyze
# Detailed markdown report grouped by federation
edugain-analyze --report
# CSV export of entities missing privacy statements
edugain-analyze --csv missing-privacy
# Validate privacy statement URLs while producing the summary
edugain-analyze --validate

| Command | Purpose | Helpful options |
|---|---|---|
| edugain-analyze | Main privacy/security/SIRTFI analysis | --report, --csv <type>, --validate, --source <file-or-url> |
| edugain-seccon | Entities with security contacts but no SIRTFI | --local-file, --no-headers |
| edugain-sirtfi | Entities with SIRTFI but no security contact | --local-file, --no-headers |
| edugain-broken-privacy | Entities (SPs and IdPs) with broken privacy links | --local-file, --no-headers, --url <metadata-url> |
All commands default to the live eduGAIN aggregate metadata. Supply --source path.xml or --source https://custom to work with alternative metadata files.
The primary CLI for day-to-day reporting. Generates summaries, CSV exports, and markdown reports in one pass.
# Quick snapshot with SIRTFI coverage in the summary
edugain-analyze
# Markdown report plus live privacy URL checks
edugain-analyze --report-with-validation
# Export only SPs missing both safeguards
edugain-analyze --csv missing-both --no-headers > missing.csv

Surfaces entities that publish a security contact but have not completed SIRTFI certification, which makes it ideal for prioritizing follow-up.
# Live metadata
edugain-seccon
# Offline review against a cached aggregate
edugain-seccon --local-file reports/metadata.xml

Flags entities that claim SIRTFI compliance yet fail to list a security contact, highlighting policy violations.
# Focus on current gaps
edugain-sirtfi
# Custom feed (e.g., private federation snapshot)
edugain-sirtfi --url https://example/federation.xml

Identifies entities (both SPs and IdPs) with privacy statement URLs that fail a lightweight accessibility check.
# Default live run with 10 parallel validators
edugain-broken-privacy
# Skip headers when piping into other tooling
edugain-broken-privacy --no-headers | tee broken-urls.csv
# Filter to SPs only if needed
edugain-broken-privacy | grep ',SP,' > sp-broken-privacy.csv

edugain-analyze --csv supports the following types (all include SIRTFI columns):
- entities: complete view of all entities (both SPs and IdPs with privacy data)
- federations: per-federation roll-up (includes IdP privacy statistics)
- missing-privacy: entities without privacy statements (both SPs and IdPs)
- missing-security, missing-both: security-focused exports
- urls: privacy statement URLs for all entities (SPs and IdPs)
- urls-validated: includes HTTP status data (enables live validation)
Markdown reports are produced with --report (or --report-with-validation to perform live URL checks while generating the report).
CSV Columns (Entity Export)
| Column | Description |
|---|---|
| Federation | Friendly federation name (API-backed) |
| EntityType | SP or IdP |
| OrganizationName | Display name from metadata |
| EntityID | SAML entity identifier |
| HasPrivacyStatement | Yes/No (both SPs and IdPs) |
| PrivacyStatementURL | Declared privacy URL (both SPs and IdPs) |
| HasSecurityContact | Yes/No |
| HasSIRTFI | Yes/No |
| URLStatusCode, FinalURL, URLAccessible, RedirectCount, ValidationError | Added only when validation is enabled |
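As a rough illustration of consuming the entity export, the sketch below reads CSV text in the column layout listed above and picks out entities without a privacy statement. The sample rows and the helper name entities_missing_privacy are hypothetical; only the column names come from this README.

```python
import csv
import io

# Hypothetical sample rows in the entity-export column layout.
SAMPLE = """Federation,EntityType,OrganizationName,EntityID,HasPrivacyStatement,PrivacyStatementURL,HasSecurityContact,HasSIRTFI
InCommon,SP,Example University,https://sp.example.edu,Yes,https://example.edu/privacy,Yes,Yes
DFN-AAI,IdP,Test Institute,https://idp.test.de,No,,Yes,No
"""


def entities_missing_privacy(csv_text, entity_type=None):
    """Return EntityIDs whose HasPrivacyStatement is 'No'.

    entity_type may be 'SP' or 'IdP' to narrow the result; None keeps both.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["EntityID"]
        for row in reader
        if row["HasPrivacyStatement"] == "No"
        and (entity_type is None or row["EntityType"] == entity_type)
    ]


print(entities_missing_privacy(SAMPLE))
print(entities_missing_privacy(SAMPLE, entity_type="SP"))
```

Parsing with csv.DictReader instead of matching on ',SP,' keeps the filter correct even when an organization name contains a comma.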
CSV Columns (Federation Export)
Federation-level statistics include:
- Basic counts: TotalEntities, TotalSPs, TotalIdPs
- SP privacy: SPsWithPrivacy, SPsMissingPrivacy
- IdP privacy (v3.0+): IdPsWithPrivacy, IdPsMissingPrivacy
- Security contacts: EntitiesWithSecurity, SPsWithSecurity, IdPsWithSecurity (and Missing variants)
- SIRTFI: EntitiesWithSIRTFI, SPsWithSIRTFI, IdPsWithSIRTFI (and Missing variants)
- Combined SP metrics: SPsWithBoth, SPsWithAtLeastOne, SPsMissingBoth
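The federation export carries raw counts, so percentages have to be derived by the consumer. The following is a minimal sketch under that assumption: the helper coverage_percent is hypothetical, and the sample keeps only a subset of the columns to stay short.

```python
import csv
import io

# Hypothetical excerpt of a federation export (subset of the columns above).
SAMPLE = """Federation,TotalEntities,TotalSPs,TotalIdPs,SPsWithPrivacy,SPsMissingPrivacy,EntitiesWithSIRTFI
InCommon,3450,2100,1350,1890,210,1552
"""


def coverage_percent(row, with_column, total_column):
    """Return with/total as a percentage rounded to one decimal (0.0 if total is 0)."""
    total = int(row[total_column])
    if total == 0:
        return 0.0
    return round(100 * int(row[with_column]) / total, 1)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(
        row["Federation"],
        coverage_percent(row, "SPsWithPrivacy", "TotalSPs"),
        coverage_percent(row, "EntitiesWithSIRTFI", "TotalEntities"),
    )
```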
Enabling --validate, --report-with-validation, or --csv urls-validated triggers live HTTP checks for privacy statement URLs. Results are cached for seven days to avoid repeat lookups. When validation runs, extra columns are appended to CSV exports:
URLStatusCode,FinalURL,URLAccessible,RedirectCount,ValidationError
--validate-content goes beyond HTTP status checks and inspects the actual HTML of each privacy statement page. It detects soft-404s, thin content, and missing GDPR-related language, problems that a plain HTTP 200 would never surface.
This flag implies --validate, so you do not need to pass both.
# Content quality analysis with terminal summary
edugain-analyze --validate-content
# Export per-URL content analysis to CSV
edugain-analyze --csv urls-content-analysis

Each privacy page receives a score from 0 to 100:
| Tier | Score range | Meaning |
|---|---|---|
| Excellent | 90–100 | Page passes all quality checks |
| Good | 70–89 | Minor issues (e.g. few GDPR keywords, slightly slow) |
| Fair | 50–69 | Notable issues needing attention |
| Poor | 30–49 | Significant problems affecting compliance |
| Broken | 0–29 | Soft-404, empty page, or severe combined issues |
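The tier boundaries above translate directly into a lookup. This is an illustrative sketch only; quality_tier is a hypothetical helper name, not a function the package is known to expose.

```python
def quality_tier(score):
    """Map a 0-100 content quality score to the tier names from the table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    if score >= 30:
        return "Poor"
    return "Broken"


for sample in (95, 72, 50, 31, 10):
    print(sample, quality_tier(sample))
```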
| Issue string | What it means |
|---|---|
| soft-404 | Page returns HTTP 200 but displays error content (e.g. "Page not found") |
| thin-content | HTML content under 500 bytes; almost certainly not a real privacy page |
| empty-content | HTML under 100 bytes; effectively blank |
| non-https | Privacy URL uses plain HTTP instead of HTTPS |
| no-gdpr-keywords | Page contains fewer than 2 recognised GDPR-related terms |
| few-gdpr-keywords | Page contains only 2 keywords (3 or more expected) |
| slow-response | Page took longer than 5 seconds to respond |
| very-slow-response | Page took longer than 10 seconds to respond |
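To make the thresholds concrete, here is a minimal sketch that derives issue strings from a few measurements. The function quality_issues and its parameters are hypothetical. It also assumes that only the more severe issue of each pair is reported (for example empty-content rather than both empty-content and thin-content), which this README does not specify, and it leaves out soft-404 detection because that needs the page text.

```python
def quality_issues(url, content_length, keyword_count, response_seconds):
    """Return issue strings for one privacy page, using the documented thresholds."""
    issues = []
    # Assumption: report only the more severe of empty/thin content.
    if content_length < 100:
        issues.append("empty-content")
    elif content_length < 500:
        issues.append("thin-content")
    if not url.lower().startswith("https://"):
        issues.append("non-https")
    if keyword_count < 2:
        issues.append("no-gdpr-keywords")
    elif keyword_count == 2:
        issues.append("few-gdpr-keywords")
    # Assumption: report only the more severe of slow/very slow.
    if response_seconds > 10:
        issues.append("very-slow-response")
    elif response_seconds > 5:
        issues.append("slow-response")
    return issues


found = quality_issues("http://example.org/privacy", 300, 2, 6.0)
print("|".join(found))  # pipe-separated, like the QualityIssues CSV column
```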
GDPR keyword detection covers five languages: English, German, French, Spanish, and Swedish. Language is auto-detected from the page's <html lang> attribute or by keyword frequency.
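The two detection routes described here (the lang attribute first, keyword frequency as fallback) could look roughly like the sketch below. The keyword samples and the detect_language helper are assumptions for illustration; the package's actual keyword lists are not shown in this README.

```python
from html.parser import HTMLParser

# Hypothetical keyword samples per language (not the package's real lists).
KEYWORDS = {
    "en": ("personal data", "data protection", "privacy"),
    "de": ("personenbezogene daten", "datenschutz"),
    "fr": ("données personnelles", "protection des données"),
    "es": ("datos personales", "protección de datos"),
    "sv": ("personuppgifter", "dataskydd"),
}


class _LangParser(HTMLParser):
    """Captures the lang attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def detect_language(html):
    """Return an ISO 639-1 code, or '' when nothing matches."""
    parser = _LangParser()
    parser.feed(html)
    if parser.lang:
        code = parser.lang.split("-")[0].lower()  # "de-DE" -> "de"
        if code in KEYWORDS:
            return code
    # Fallback: pick the language whose keywords occur most often.
    text = html.lower()
    counts = {
        lang: sum(text.count(word) for word in words)
        for lang, words in KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else ""


print(detect_language('<html lang="de-DE"><body></body></html>'))
```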
Exports one row per SP privacy URL with the following columns:
| Column | Description |
|---|---|
| Federation | Friendly federation name |
| EntityID | SAML entity identifier |
| PrivacyURL | Privacy statement URL |
| StatusCode | HTTP response code |
| ContentQualityScore | Score 0–100 |
| HTTPS | True/False |
| ContentLength | Raw HTML size in bytes |
| HasGDPRKeywords | True if 2+ keywords found |
| KeywordCount | Number of GDPR keywords matched |
| IsSoft404 | True if a soft-404 was detected |
| DetectedLanguage | ISO 639-1 code (e.g. en, de) or blank |
| ResponseTimeMs | Fetch time in milliseconds |
| QualityIssues | Pipe-separated list of issue strings (e.g. thin-content\|non-https) |
See docs/content-quality-analysis.md for a full reference including the scoring algorithm and remediation guidance.
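Because QualityIssues packs several issue strings into one pipe-separated field, a quick tally across an export needs a split step. A minimal sketch, assuming a trimmed-down export with hypothetical rows (issue_counts is an illustrative helper, not part of the package):

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a urls-content-analysis export (subset of columns).
SAMPLE = """EntityID,ContentQualityScore,QualityIssues
https://sp1.example.org,35,thin-content|non-https
https://sp2.example.org,95,
https://sp3.example.org,60,non-https|few-gdpr-keywords
"""


def issue_counts(csv_text):
    """Count how often each issue string appears across all rows."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # An empty QualityIssues field means the page had no issues.
        counts.update(issue for issue in row["QualityIssues"].split("|") if issue)
    return counts


for issue, count in issue_counts(SAMPLE).most_common():
    print(issue, count)
```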
Runtime defaults live in src/edugain_analysis/config/settings.py. Tweak them if you need to point at alternative metadata sources or adjust validation behaviour.
- EDUGAIN_METADATA_URL, EDUGAIN_FEDERATIONS_API: swap these when working with staging aggregates or private federation indexes.
- Cache knobs (METADATA_CACHE_HOURS, FEDERATION_CACHE_DAYS, URL_VALIDATION_CACHE_DAYS) define how long downloads are reused; shorten them during rapid testing, extend them to reduce network traffic.
- URL validation controls (URL_VALIDATION_TIMEOUT, URL_VALIDATION_DELAY, URL_VALIDATION_THREADS, MAX_CONTENT_SIZE) let you balance accuracy with load on remote sites.
- REQUEST_TIMEOUT covers metadata and federation lookups; bump it for slow-on-purpose mirrors.

After adjusting the settings file, re-run the CLI; changes take effect immediately because the module is imported at runtime.
- Default source: eduGAIN aggregate metadata (https://mds.edugain.org/edugain-v2.xml)
- Overrides: Use --source path.xml for local files or --url (where available) for alternate feeds
- Caching: Metadata (12h), federation mapping (30d), and URL validation results (7d) are cached in the XDG cache directory (typically ~/.cache/edugain-analysis/)
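One simple way to think about these expiry windows is a freshness check against a file's modification time. The sketch below assumes mtime-based expiry, which this README does not confirm; is_cache_fresh is a hypothetical helper, and the constants reuse the setting names with the default values quoted above.

```python
import time
from pathlib import Path

# Default windows quoted in this README, using the setting names from settings.py.
METADATA_CACHE_HOURS = 12
FEDERATION_CACHE_DAYS = 30
URL_VALIDATION_CACHE_DAYS = 7


def is_cache_fresh(path, max_age_seconds, now=None):
    """Return True if path exists and was modified within max_age_seconds.

    Assumption: expiry is judged by file modification time.
    """
    path = Path(path)
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) <= max_age_seconds


metadata_file = Path.home() / ".cache" / "edugain-analysis" / "metadata.xml"
print(is_cache_fresh(metadata_file, METADATA_CACHE_HOURS * 3600))
```

Passing now explicitly keeps the check easy to test without waiting for real time to pass.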
- Run from source: python analyze.py mirrors edugain-analyze
- Use Docker: docker compose build then docker compose run --rm cli -- --help (defaults to edugain-analyze; swap in edugain-seccon, edugain-sirtfi, or edugain-broken-privacy as needed)
- Batch everything locally: scripts/dev/local-ci.sh runs linting, tests, coverage, and Docker smoke tests
- Tweak the helper via env vars: SKIP_COVERAGE=1 or SKIP_DOCKER=1 to skip heavier steps
The Makefile now provides an end-user-friendly workflow for managing the virtual environment and tooling:
# Create/update .venv and install dev+test extras (Python 3.11+)
make install EXTRAS=dev,tests PYTHON=python3.11
# Drop into an activated shell (exit with 'exit' or Ctrl-D)
make shell
# Run linting or the full test suite
make lint
make test

Prefer scripts? scripts/dev/dev-env.sh remains available if you would rather drive the setup directly:
# Fresh environment with test tooling and coverage plugins
./scripts/dev/dev-env.sh --fresh --with-tests --with-coverage
# Add parallel pytest workers
./scripts/dev/dev-env.sh --with-parallel

- DEVENV_PYTHON overrides the interpreter search (order: python3.14, python3.13, python3.12, python3.11, python3). Example: DEVENV_PYTHON=python3.11 ./scripts/dev/dev-env.sh.
- Manual setup: create a virtualenv and run pip install -e ".[dev]". Layer optional extras like .[tests], .[coverage], or .[parallel] as needed.
- scripts/maintenance/clean-env.sh and scripts/maintenance/clean-artifacts.sh remove virtualenvs and cached outputs when you need a clean slate. Use make clean-artifacts or make clean-cache for the common paths, or call the script with --artifacts-only / --cache-only yourself; add --reports to prune reports/.
- Test coverage outputs land in artifacts/coverage/ (HTML + XML). The reports/ directory is reserved for CLI exports or cached metadata snapshots and is only pruned when you run scripts/maintenance/clean-artifacts.sh --reports.
- Script layout: scripts/app/ (CLI wrappers), scripts/dev/ (developer tooling), scripts/maintenance/ (cache & environment cleanup).
Developer Resources:
- CLAUDE.md - Architecture, testing, and development workflows for contributors and AI assistants
- docs/ROADMAP.md - Comprehensive feature roadmap with prioritization and timelines (12-month plan)
- Dockerfile & docker-compose.yml - Containerized workflow for CLI and tests

Scripts & Utilities:
- scripts/app/ - CLI wrapper scripts for running without installation
- scripts/dev/ - Developer tooling (env bootstrap, linting, local CI, Docker helpers)
- scripts/maintenance/ - Cache and environment cleanup utilities

Artifacts:
- artifacts/coverage/ - Test coverage reports (HTML + XML)
- reports/ - CLI exports and downloaded metadata snapshots
The package follows Python best practices with a modular structure:
src/edugain_analysis/
├── core/                  # Core analysis logic
│   ├── analysis.py        # Main analysis functions
│   ├── metadata.py        # Metadata handling and XDG-compliant caching
│   └── validation.py      # URL validation with parallel processing
├── formatters/            # Output formatting
│   └── base.py            # Text, CSV, and markdown formatters
├── cli/                   # Command-line interfaces
│   ├── main.py            # Primary CLI (edugain-analyze)
│   ├── seccon.py          # Security contact CLI (edugain-seccon)
│   ├── sirtfi.py          # SIRTFI compliance CLI (edugain-sirtfi)
│   └── broken_privacy.py  # Broken privacy links CLI (edugain-broken-privacy)
└── config/                # Configuration and patterns
    └── settings.py        # Constants and validation patterns
The package includes a fast privacy statement URL validation system that checks link accessibility across eduGAIN federations for both SPs and IdPs. This helps identify broken privacy statement links that need attention.
- URL Collection: Extracts privacy statement URLs from entity metadata (both SPs and IdPs)
- Parallel Checking: Tests URLs concurrently using 16 threads for fast processing
- HTTP Status Validation: Simple status code check:
  - 200-399: Accessible (working link) ✅
  - 400-599: Broken (needs fixing) ❌
- Real-time Progress: Shows validation progress with visual indicators
- Smart Caching: Results cached for 1 week to avoid re-checking unchanged URLs
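The status rule and the parallel checking step can be sketched with the standard library alone. This is not the package's implementation: check_urls and is_accessible are hypothetical names, the fetcher is injected so the example runs without network access, and treating codes outside 200-599 as not accessible is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def is_accessible(status_code):
    """200-399 counts as accessible; anything else is treated as broken."""
    return 200 <= status_code <= 399


def check_urls(urls, fetch_status, max_workers=16):
    """Check URLs concurrently.

    fetch_status is any callable taking a URL and returning an HTTP status code.
    Returns a dict mapping each URL to True (accessible) or False (broken).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(fetch_status, urls))  # keeps input order
    return {url: is_accessible(status) for url, status in zip(urls, statuses)}


# Stand-in for real HTTP requests: a fixed table of hypothetical responses.
FAKE_STATUSES = {
    "https://a.example/privacy": 200,
    "https://b.example/privacy": 404,
    "https://c.example/privacy": 301,
}

results = check_urls(list(FAKE_STATUSES), FAKE_STATUSES.__getitem__)
broken = [url for url, ok in results.items() if not ok]
print(f"{len(results) - len(broken)} working, {len(broken)} broken")
```

In a real run, fetch_status would wrap an HTTP client call with a timeout; injecting it keeps the classification logic testable on its own.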
# Basic validation with user-friendly summary
python analyze.py --validate
# Get detailed CSV with HTTP status codes for each URL
python analyze.py --csv urls --validate
# Export entities missing privacy statements
python analyze.py --csv missing-privacy

The summary shows simple, actionable information:
Privacy Statement URL Check:
Checked 2,683 privacy statement links
╭─ LINK STATUS SUMMARY ──────────────────────────╮
│ ✅ 2,267 links working (84.5%)                 │
│ ❌ 416 links broken (15.5%)                    │
╰────────────────────────────────────────────────╯
When using --csv urls --validate, you get detailed technical information:
| Field | Description | Example |
|---|---|---|
| StatusCode | HTTP response code | 200, 404, 500 |
| FinalURL | URL after redirects | https://example.org/privacy |
| Accessible | Working status | Yes / No |
| ContentType | MIME type | text/html |
| RedirectCount | Number of redirects | 0, 1, 2 |
| ValidationError | Error details | Connection timeout |
This gives technical staff the specific information needed to fix broken links.
The main analysis tools (edugain-analyze, python analyze.py) now include comprehensive SIRTFI (Security Incident Response Trust Framework for Federated Identity) certification tracking across all output formats.
SIRTFI is a framework that enables the coordination of incident response activities for federated identity services. Entities with SIRTFI certification have committed to specific incident response capabilities and communication practices.
Summary Statistics (includes SIRTFI section):
=== eduGAIN Quality Analysis: Privacy, Security & SIRTFI Coverage ===
Total entities analyzed: 10,234 (SPs: 6,145, IdPs: 4,089)
SIRTFI Certification Coverage:
✅ Total entities with SIRTFI: 4,623 out of 10,234 (45.2%)
❌ Total entities without SIRTFI: 5,611 out of 10,234 (54.8%)
SPs: 2,768 with / 3,377 without (45.0% coverage)
IdPs: 1,855 with / 2,234 without (45.4% coverage)
CSV Entity Export (includes HasSIRTFI column):
Federation,EntityType,OrganizationName,EntityID,HasPrivacyStatement,PrivacyStatementURL,HasSecurityContact,HasSIRTFI
InCommon,SP,Example University,https://sp.example.edu,Yes,https://example.edu/privacy,Yes,Yes
DFN-AAI,IdP,Test Institute,https://idp.test.de,N/A,N/A,Yes,No

Federation Statistics CSV (includes SIRTFI columns):
Federation,TotalEntities,TotalSPs,TotalIdPs,SPsWithPrivacy,SPsMissingPrivacy,EntitiesWithSecurity,EntitiesMissingSecurity,SPsWithSecurity,SPsMissingSecurity,IdPsWithSecurity,IdPsMissingSecurity,EntitiesWithSIRTFI,EntitiesMissingSIRTFI,SPsWithSIRTFI,SPsMissingSIRTFI,IdPsWithSIRTFI,IdPsMissingSIRTFI,SPsWithBoth,SPsWithAtLeastOne,SPsMissingBoth
InCommon,3450,2100,1350,1890,210,2760,690,1680,420,1080,270,1552,1898,945,1155,607,743,1575,2058,42

Markdown Reports (includes per-federation SIRTFI statistics):
## Federation Analysis
### InCommon (3,450 entities: 2,100 SPs, 1,350 IdPs)
**SIRTFI Certification:** 🟢 1,552/3,450 (45.0%)
├─ SPs: 🟡 945/2,100 (45.0%)
└─ IdPs: 🟡 607/1,350 (45.0%)

- Applies to Both SPs and IdPs: Like privacy statements since v3.0 (which were SP-only in v2.x), SIRTFI applies to both entity types
- Always Included: The HasSIRTFI column is automatically included in all entity CSV exports
- Federation Breakdown: Per-federation statistics show SIRTFI coverage at the federation level
- Color-Coded Status: 🟢 (≥80%), 🟡 (50-79%), 🔴 (<50%) for visual feedback
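The color thresholds amount to a small classification over the coverage percentage. A minimal sketch, using color names rather than symbols; coverage_status is a hypothetical helper, and treating fractional values between 79% and 80% as yellow is an assumption.

```python
def coverage_status(with_count, total):
    """Classify coverage as 'green' (>=80%), 'yellow' (50% up to 80%), or 'red' (<50%)."""
    percent = 100 * with_count / total if total else 0.0
    if percent >= 80:
        return "green"
    if percent >= 50:
        return "yellow"
    return "red"


print(coverage_status(1890, 2100))  # 90.0% coverage
print(coverage_status(1552, 3450))  # roughly 45% coverage
```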
For SIRTFI compliance validation and finding violations, use the specialized tools:
- edugain-seccon: Find entities with security contacts but without SIRTFI (potential candidates)
- edugain-sirtfi: Find entities with SIRTFI but without security contacts (compliance violations)
All data is stored in XDG-compliant cache directories:
Cache Location by Platform:
- macOS: ~/Library/Caches/edugain-analysis/
- Linux: ~/.cache/edugain-analysis/
- Windows: %LOCALAPPDATA%\edugain\edugain-analysis\Cache\
Cache Files:
- metadata.xml - eduGAIN metadata (expires after 12 hours)
- federations.json - Federation name mappings (expires after 30 days)
- url_validation.json - URL validation results (expires after 7 days)
Cache Management Commands:
# View cache location
python -c "from platformdirs import user_cache_dir; print(user_cache_dir('edugain-analysis', 'edugain'))"
# Clear cache to force fresh download
rm -rf ~/Library/Caches/edugain-analysis/metadata.xml # macOS
rm -rf ~/.cache/edugain-analysis/metadata.xml # Linux

=== eduGAIN Quality Analysis: Privacy, Security & SIRTFI Coverage ===
Total entities analyzed: 8,234 (SPs: 3,849, IdPs: 4,385)
Privacy Statement URL Coverage: 🟡 3,720/8,234 (45.2%)
├─ SPs: 🟡 2,681/3,849 (69.7%)
└─ IdPs: 🔴 1,039/4,385 (23.7%)
❌ Missing: 4,514/8,234 (54.8%)
🛡️ Security Contact Coverage: 🟡 4,096/8,234 (49.8%)
├─ SPs: 🟢 1,205/3,849 (31.3%)
└─ IdPs: 🟢 2,891/4,385 (65.9%)
SIRTFI Certification Coverage: 🟡 3,715/8,234 (45.1%)
├─ SPs: 🟡 1,732/3,849 (45.0%)
└─ IdPs: 🟡 1,983/4,385 (45.2%)
Federation Coverage: 73 federations analyzed
Note: Privacy coverage now includes both SPs and IdPs. Previously (v2.x) only SPs were tracked.
- entities: All entities with privacy/security status (IdPs include privacy data)
- federations: Federation-level statistics (includes IdP privacy columns)
- missing-privacy: Entities without privacy statements (both SPs and IdPs)
- missing-security: Entities without security contacts
- missing-both: SPs missing both privacy and security
- urls: URL validation results (with --validate) for all entity types
Breaking Changes:
- Privacy statement tracking extended to IdPs (previously SP-only)
- CSV format changes: IdP rows now show Yes/No instead of N/A for privacy
- Federation CSV adds IdPsWithPrivacy and IdPsMissingPrivacy columns
- Filter --csv missing-privacy now returns both SPs and IdPs
- See Migration Guide for upgrade instructions
New Features:
- Universal privacy tracking across all entity types
- PDF reports include IdP privacy KPI and comparative charts
- Enhanced statistics with SP vs IdP breakdown
- edugain-broken-privacy validates IdP privacy URLs
- Reorganized script structure for clarity
- Enhanced Makefile with user-focused guidance
- Local CI script for offline quality checks
The Docker image ships with the package pre-installed inside /opt/venv and exposes the CLI entry points through a lightweight entrypoint script:
# Build once (respects DEVENV_PYTHON and INSTALL_EXTRAS build args)
docker compose build
# Run the main analyzer (defaults to edugain-analyze when no args are given)
docker compose run --rm cli -- --report --output artifacts/report.md
# Run other CLIs by overriding the command
docker compose run --rm cli edugain-seccon --summary
docker compose run --rm cli edugain-broken-privacy --validate

Pip downloads and eduGAIN metadata caches are persisted in the named volumes declared in docker-compose.yml (pip-cache, edugain-cache). Remove them with docker volume rm if you need a cold start. The project folder is still bind-mounted into /app, so anything written to reports/ or artifacts/ is immediately available on the host.
- Python: 3.11 or later (tested on 3.11, 3.12, 3.13, 3.14)
- Dependencies:
  - requests (≥2.28.0) - HTTP requests
  - platformdirs (≥3.0.0) - XDG-compliant directories
- Development Dependencies (install with [dev]):
  - pytest, pytest-cov, pytest-xdist - Testing and coverage
  - ruff - Linting and formatting
  - pre-commit - Git hooks for quality assurance
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes following the coding standards
- Run tests and linting (pytest && scripts/dev/lint.sh)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on the original work from eduGAIN contacts analysis
- Built for the eduGAIN community to improve federation metadata quality
- Follows Python packaging standards (PEP 517/518/561/621)
- Issues: GitHub Issues
- Documentation: See README.md for full user documentation
- Development: See CLAUDE.md for development guidelines and architecture
- Roadmap: See docs/ROADMAP.md for feature roadmap and planning