cifflow

Parse, store, validate, and emit Crystallographic Information Files (CIF).


What it does

  • Focused on multi-block powder CIF files, but generally applicable
  • Constructs CIF files
    • Parses CIF 1.1 and CIF 2.0 files, including all string types (triple-quoted, multiline text fields, embedded quotes) and save frames
    • Constructs CifFile objects programmatically from Python values (CifWriter), and performs arbitrary edits: add/remove/rename tags, loops, blocks, and save frames
    • Populates default values as defined in the dictionary
  • Loads DDLm dictionaries with full _import.get resolution, producing a typed schema
  • Removes common parse-time artefacts automatically (clean): orphan error tags, duplicate blocks/save frames/tags, loop padding; for anything beyond these automatic fixes, use CifWriter
  • Ingests parsed or constructed CIF data into DuckDB using the dictionary-derived schema: one table per category, foreign keys enforced, unknown tags routed to a fallback tier
  • Emits valid CIF from a populated database in four modes: ORIGINAL, GROUPED, ONE_BLOCK, ALL_BLOCKS
  • Trusts the user: if you pass in multiple blocks, cifflow assumes that they all belong together and that, barring key value clashes, they can be interpreted as a single database/experiment
  • Visualises a schema as a Graphviz DOT string or a self-contained interactive HTML file

Documentation

The full API reference is at rowlesmr.github.io/cifflow.


Key properties

Error-tolerant. The parser never raises on malformed input. Every structural problem produces an explicit error event; parsing continues and all recoverable data is preserved.

No silent data loss. Duplicate tag values are preserved. Tags not mapped by the dictionary go to a fallback table, not a discard pile.

Round-trip fidelity. For well-formed input, emitted CIF re-parses to the same data. All values are stored and emitted as raw strings; ValueType provenance (placeholder . and ? vs quoted equivalents) is preserved throughout.
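
A shallow way to sanity-check a round trip is to compare the "shape" of the parsed input with the shape of the re-parsed output. The sketch below is an illustration only, not the library's own fidelity check: it relies solely on the attributes used in the Quick start (`cif.blocks`, `cif[name]`, `block.tags`, `block.loops`), and it compares counts rather than values. The helper names `shape` and `same_shape` are hypothetical.

```python
def shape(cif) -> dict[str, tuple[int, int]]:
    """Map each block name to (number of tags, number of loops)."""
    return {
        name: (len(cif[name].tags), len(cif[name].loops))
        for name in cif.blocks
    }


def same_shape(original, reparsed) -> bool:
    """True when both objects have the same blocks with the same tag/loop counts."""
    return shape(original) == shape(reparsed)
```

With `output` being emitted CIF text, a call such as `same_shape(cif, build(output)[0])` would flag gross differences. It is presumably only meaningful when the emitted block layout matches the input layout.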

Canonical caseless names. Block names, save frame names, and tag names are stored in Unicode canonical caseless form (NFC(casefold(NFD(x)))). Lookups are automatically casefolded: cif["ABC"] finds a block stored as "abc".
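
The normalisation formula above can be reproduced with the standard library alone. This is a minimal sketch of NFC(casefold(NFD(x))) for illustration; the function name `canonical_caseless` is hypothetical and is not claimed to be part of the cifflow API.

```python
import unicodedata


def canonical_caseless(name: str) -> str:
    """Return NFC(casefold(NFD(name)))."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFC", decomposed.casefold())


# Differently cased or differently composed spellings collapse to one key.
assert canonical_caseless("ABC") == "abc"
assert canonical_caseless("\u00c9") == canonical_caseless("E\u0301")
```

Casefolding is stronger than lowercasing (for example, it expands the German sharp s), which is why two names that differ only in case or in composed versus decomposed accents end up as the same lookup key.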

Streaming parser. The parser is event-driven. CIF source is consumed in a single pass; the IR accumulates events incrementally. The Rust extension provides high-throughput Arrow output without any Python file objects.
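
To make the idea of a single-pass, error-tolerant event stream concrete, here is a toy generator. It is not cifflow's parser and does not implement CIF: it assumes a deliberately simplified grammar with one `data_` header or one `_tag value` pair per line, and the `Event` class and `toy_events` function are hypothetical names. The point is only that malformed input becomes an explicit error event instead of an exception, and that later lines are still processed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    kind: str                 # "block", "tag", or "error"
    line: int                 # 1-based line number in the source
    payload: tuple[str, ...]


def toy_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield events in one pass; never raise on malformed lines."""
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text or text.startswith("#"):
            continue  # skip blank lines and comments
        if text.lower().startswith("data_"):
            yield Event("block", number, (text[5:],))
        elif text.startswith("_"):
            parts = text.split(None, 1)
            if len(parts) == 2:
                yield Event("tag", number, (parts[0], parts[1]))
            else:
                yield Event("error", number, (f"tag {parts[0]} has no value",))
        else:
            yield Event("error", number, (f"unexpected text: {text}",))
```

Because the function is a generator, a consumer can accumulate events incrementally while the source is still being read.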


Installation

Prebuilt wheels are available for Python 3.10 to 3.14 on macOS, Windows, and Linux.

pip install cifflow

To install from source (includes the Rust extension):

git clone https://github.com/rowlesmr/cifflow.git
cd cifflow
pip install -e ".[dev]"
maturin develop

Quick start

Parse a CIF file

from cifflow import build

with open('structure.cif', encoding='utf-8') as f:
    text = f.read()
cif, errors = build(text)   # never raises; errors is a list[ParseError]

for block_name in cif.blocks:          # block names are stored in canonical caseless form
    block = cif[block_name]
    print(f'{block_name}: {len(block.tags)} tags, {len(block.loops)} loops')

The best way to resolve errors is to inspect the list of errors, edit the file accordingly, and try again. No assumptions are made about how to correct errors automatically.
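
One way to inspect the error list is to print it before deciding what to edit. The sketch below uses only `build` as shown above. Because the fields of `ParseError` are not described here, it prints each error's `repr` rather than assuming attribute names. `format_errors` and `check_file` are hypothetical helper names.

```python
import sys


def format_errors(errors) -> list[str]:
    """Number each error using its repr, making no assumption about its fields."""
    return [f"[{i}] {err!r}" for i, err in enumerate(errors, start=1)]


def check_file(path: str) -> int:
    """Parse a CIF file, print any errors to stderr, and return the error count."""
    # Imported here so format_errors can be reused without cifflow installed.
    from cifflow import build

    with open(path, encoding="utf-8") as handle:
        _cif, errors = build(handle.read())
    for line in format_errors(errors):
        print(line, file=sys.stderr)
    return len(errors)
```

A return value of zero from `check_file('structure.cif')` would indicate that the parser reported no errors for that file.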

Full pipeline: dictionary → DuckDB → CIF

To see what is going on in the lexing, parsing, ingestion, and schema generation phases, see the inspect module.

import pathlib
from cifflow import (
    DictionaryLoader, directory_resolver,
    save_dictionary, load_dictionary,
    generate_schema,
    build, ingest, emit, EmitMode,
)
from cifflow.types import CifVersion

# 1. Load dictionary (with JSON cache to avoid re-parsing on every run)
cache = pathlib.Path('cif_pow_cache.json')
resolver = directory_resolver('data/dictionaries')
if cache.exists():
    dictionary = load_dictionary(cache)
else:
    dictionary = DictionaryLoader(resolver=resolver).load(
        open('data/dictionaries/cif_pow.dic', encoding='utf-8').read())
    save_dictionary(dictionary, cache)

# 2. Derive schema
schema = generate_schema(dictionary)

# 3. Parse CIF
cif, errors = build(open('all_the_data.cif', encoding='utf-8').read())

# # 3.5 Edit CIF - to fix errors or alter content
# writer = CifWriter(cif.version, cif)
# for block in writer.blocks:
#    # Do alterations
# cif = writer.build() # raises ValueError if any errors present

# 4. Ingest into an in-memory DuckDB database
#    Pass a file path string to persist: ingest(cif, 'output.db', schema=schema)
conn, warnings = ingest(cif, schema=schema)

# # 4.5 Generate any default values
# generate_defaults(conn, schema)

# 5. Emit CIF
output = emit(conn, schema, mode=EmitMode.ORIGINAL, version=CifVersion.CIF_2_0)
with open('output.cif', 'w', encoding='utf-8') as f:
    f.write(output)

See example_workflow.py in the repository root for a fully annotated end-to-end demonstration covering all four emission modes, type-cast export, and fidelity checking.
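
As a shorter illustration of the four emission modes, the following sketch loops over them and writes one file per mode. It assumes that `conn` and `schema` come from steps 2 and 4 of the pipeline above and that `emit` returns a string, as step 5 suggests. The helper names `output_name` and `emit_all_modes`, and the output file naming, are hypothetical.

```python
from pathlib import Path

# The four mode names listed under "What it does".
MODE_NAMES = ("ORIGINAL", "GROUPED", "ONE_BLOCK", "ALL_BLOCKS")


def output_name(mode_name: str) -> str:
    """Build a file name such as 'output_one_block.cif' for a mode name."""
    return f"output_{mode_name.lower()}.cif"


def emit_all_modes(conn, schema, out_dir: str = ".") -> list[Path]:
    """Emit CIF 2.0 text once per mode and return the paths written."""
    # Imported lazily so the helpers above can be used without cifflow installed.
    from cifflow import emit, EmitMode
    from cifflow.types import CifVersion

    written = []
    for mode_name in MODE_NAMES:
        mode = getattr(EmitMode, mode_name)
        text = emit(conn, schema, mode=mode, version=CifVersion.CIF_2_0)
        path = Path(out_dir) / output_name(mode_name)
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```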


Architecture

Parser → Event Stream → IR → Dictionary-aware Mapping → DuckDB → Output/API

| Layer | Responsibility |
|---|---|
| Lexer | Tokenisation, ValueType assignment |
| Parser | Token sequence interpretation, error recovery, event emission |
| IR (CIFModel) | Event accumulation, loop validation, multiline text transformation |
| Dictionary | DDLm parsing, schema derivation |
| DuckDB | Persistent storage: structured tables when a dictionary is present, fallback tier otherwise |
| Output | Valid CIF regeneration |

Layer responsibilities are strictly separated. The parser does not know about the dictionary. The dictionary does not know about the IR. The output layer only reads from DuckDB.


Status

Everything should work. Issues, suggestions, and requests are gladly received.


Development

Run the fast test suite (excludes tests that load large real-world CIF files):

python -m pytest -m "not slow"

Run the full suite including slow tests:

python -m pytest

After modifying the Rust extension, recompile before running Python tests:

maturin develop

License

Apache 2.0. See LICENSE.

The bundled JavaScript files (viz.js 2.1.2 and svg-pan-zoom 3.6.1) used by visualise_schema_html are MIT-licensed. Licence notices are in src/cifflow/dictionary/js/LICENSES.txt.
