Topics
Algorithmic notes and conventions used across the mc3d-source pipeline. Each section describes one stage or concept; the order roughly mirrors the order in which the corresponding CLI commands are typically run.
Semantics
The vocabulary used throughout this documentation:
- Curated structure: a StructureData produced by a CifCleanWorkChain that passed both checks (no partial occupancies and no formula mismatch). Curated structures are the input to the uniqueness analysis.
- Duplicate family: a set of source strings whose curated structures are deemed structurally equivalent by pymatgen's StructureMatcher (see Uniqueness).
- Golden structure: the one source string chosen to represent a duplicate family in the final MC3D set (see Select). All other sources in the family are listed in the golden structure's duplicates extra.
Data
The pipeline lives in an AiiDA database and is built around a few conventions on top of the AiiDA data types.
Group labels
Pipeline state is tracked through AiiDA groups. By convention, labels are slash-separated paths with a (mostly) stable shape:
<database>[/<version>]/<data_type>/<stage>
The <version> segment is only used for databases that have been re-imported (currently only MPDS, with v1 and v2). Using MPDS v2 as the running example, the labels written across the pipeline are:
| Label | Contents |
|---|---|
| mpds/v2/cif/raw | Raw CifData nodes fetched from the MPDS. Written by mc3d-source import. |
| mpds/v2/workchain/clean | The CifCleanWorkChain runs themselves, one per raw CIF. Written by pipeline/cif_clean/. |
| mpds/v2/cif/clean | The cleaned CifData outputs of the CifCleanWorkChains. Written by pipeline/cif_clean/. |
| mpds/v2/structure/parsed | The parsed StructureData outputs of the CifCleanWorkChains. Input to curate. Written by pipeline/cif_clean/. |
| mpds/v2/structure/curated | Parsed StructureData that survived curation (no partial occupancies, no formula mismatch). All parsed structures also gain extras at this stage; only the clean subset is added to this group. Written by mc3d-source curate. |
| mpds/v2/structure/final | Reconciled structures for v2, with structures from v1 carried over where possible. Only produced for re-imported databases. Written by mc3d-source update. |
| global/uniques/new | Golden StructureData selected as the new MC3D set. Default name; configurable via --new-uniques-group. Written by mc3d-source select. |
Group labels are passed explicitly to every CLI command — the package does not assume any defaults.
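The label convention can be made concrete with a small helper. This is a minimal sketch and not part of mc3d-source: the names GroupLabel, parse_group_label, and format_group_label are hypothetical, and it assumes that a three-segment label has no version while a four-segment label does.

```python
from typing import NamedTuple, Optional


class GroupLabel(NamedTuple):
    """Hypothetical container for the parts of a pipeline group label."""

    database: str
    version: Optional[str]
    data_type: str
    stage: str


def parse_group_label(label: str) -> GroupLabel:
    """Split '<database>[/<version>]/<data_type>/<stage>' into its parts.

    Assumption: three segments mean the version is absent, four mean it is present.
    """
    parts = label.split("/")
    if len(parts) == 3:
        database, data_type, stage = parts
        return GroupLabel(database, None, data_type, stage)
    if len(parts) == 4:
        return GroupLabel(*parts)
    raise ValueError(f"unexpected group label shape: {label!r}")


def format_group_label(parts: GroupLabel) -> str:
    """Inverse of parse_group_label: drop the version segment when it is None."""
    return "/".join(p for p in parts if p is not None)
```

For example, parse_group_label("global/uniques/new") yields a label whose version is None, while mpds/v2/structure/curated carries the version v2.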
StructureData extras
Once the CifCleanWorkChain runs and curate have both completed, every parsed StructureData carries extras from two sources: SeeKpath (via aiida-codtools) sets the structural and formula extras, and mc3d-source curate adds the provenance and curation flags.
Set by aiida-codtools
The CifCleanWorkChain calls the primitive_structure_from_cif calcfunction (in aiida_codtools.calculations.functions.primitive_structure_from_cif), which runs SeeKpath and attaches the following extras to the primitive StructureData it returns:
| Extra (with type) | Description |
|---|---|
| formula_hill (str) | Hill-notation chemical formula of the primitive cell. |
| formula_hill_compact (str) | Hill-compact formula; used for the sort key by uniq along with the CIF space group number. |
| chemical_system (str) | Hyphen-padded sorted set of element symbols, e.g. -Fe-O-. Used by uniq's --contains / --skip filters and by select's COD-hydrogen exclusion. |
| spacegroup_international (str) | SeeKpath international space-group symbol. |
| spacegroup_number (int) | SeeKpath space-group number. Read by select to populate the spglib_space_group field of new-mc3d-data.json. Distinct from cif_spacegroup_number (see below), which comes from the cleaned CIF. |
| bravais_lattice (str) | SeeKpath Bravais-lattice short label. |
| bravais_lattice_extended (str) | SeeKpath Bravais-lattice extended label. |
These extras are present on every parsed StructureData, regardless of whether curate later admits it to the curated group.
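The hyphen padding in chemical_system is what makes element filters a plain substring test: -H- cannot accidentally match -He- or -Hf-. The following sketch illustrates the idea; it is not the aiida-codtools code, the function names are hypothetical, and alphabetical sorting of the symbols is an assumption.

```python
def chemical_system(symbols):
    """Build a hyphen-padded chemical system string such as '-Fe-O-'."""
    # Deduplicate, sort alphabetically (assumed), and pad both ends with '-'.
    return "-" + "-".join(sorted(set(symbols))) + "-"


def contains_element(system, symbol):
    """Check membership by matching the padded token, e.g. '-H-'."""
    return f"-{symbol}-" in system
```

Without the padding, a naive test such as "H" in "He-O" would report hydrogen in a helium compound; with it, contains_element("-He-O-", "H") is False.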
Set by mc3d-source curate
| Extra (with type) | Description |
|---|---|
| source (dict with keys database, version, id) | Provenance of the structure in the upstream database. Copied from the raw CifData, mapping the upstream db_name to a short code: Crystallography Open Database → cod, Icsd → icsd, Materials Platform for Data Science → mpds. |
| cif_spacegroup_number (int) | Used first as a sort key in the uniqueness analysis. Taken from the spacegroup_numbers attribute of the cleaned CifData produced by CifCleanWorkChain. Only set when the workflow reports exactly one space group. |
| partial_occupancies (bool) | Flags non-stoichiometric structures. Set to structure.is_alloy or structure.has_vacancies. |
| incorrect_formula (str) | Flags a formula mismatch between the cleaned CIF and the parsed structure. Set only when the CifCleanWorkChain exit code is one of: 430 → missing_elements, 431 → different_comp, 432 → check_failed. |
Created later by uniq
Once a structure leaves the AiiDA database (i.e. is serialised to JSON), it is referred to by a "source string" of the form:
<database>|<version>|<id>
e.g. cod|1521121|176429. This is the canonical key used in all JSON outputs — the unique families file from uniq, the deprecation reports from analyse, the new MC3D data file from select, and the duplicates extras written back by select.
The string is constructed by mc3d_source.tools.source.get_source_string, which understands both the raw CifData.source format (db_name + version + id) and the curated extras format (database + version + id).
Source strings are first emitted by uniq: it groups curated StructureData into duplicate families and writes a JSON list of source-string lists. Subsequent stages consume that file by source string rather than by AiiDA UUID.
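As an illustration of the format, the sketch below builds and parses source strings from plain dictionaries. It is a hypothetical stand-in, not the code of get_source_string: the names build_source_string, parse_source_string, and DB_SHORT_CODES are invented here, and applying the short-code mapping to the raw db_name format is an assumption based on the mapping listed for the source extra.

```python
# Long upstream names mapped to short codes, as listed for the source extra.
DB_SHORT_CODES = {
    "Crystallography Open Database": "cod",
    "Icsd": "icsd",
    "Materials Platform for Data Science": "mpds",
}


def build_source_string(source):
    """Return '<database>|<version>|<id>' for either dictionary format."""
    if "database" in source:
        # Curated-extras format: the short code is already stored.
        database = source["database"]
    else:
        # Raw CifData.source format: translate db_name (assumed behaviour).
        database = DB_SHORT_CODES[source["db_name"]]
    return f"{database}|{source['version']}|{source['id']}"


def parse_source_string(source_string):
    """Split a source string back into its three fields."""
    database, version, source_id = source_string.split("|")
    return {"database": database, "version": version, "id": source_id}
```

With the example above, build_source_string({"database": "cod", "version": "1521121", "id": "176429"}) gives cod|1521121|176429, and parse_source_string inverts it.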
Added later by select
select writes a duplicates extra (list[str]) on each golden structure with the list of source strings that collapsed into it (excluding the golden source itself).
Curate
curate takes a CifCleanWorkChain group (e.g. mpds/v2/workchain/clean) and writes results into a curated StructureData group (curated_structure_group, e.g. mpds/v2/structure/curated). See the group labels table for the full set of groups in play.
For each CifCleanWorkChain in the input group, curate adds the four extras described above to the parsed StructureData:
- source from the raw CIF.
- cif_spacegroup_number from the cleaned CIF (only if a single space group is reported).
- partial_occupancies based on is_alloy / has_vacancies.
- incorrect_formula based on the work chain exit status.
Stoichiometric structures without formula-mismatch issues are added to the curated group; the rest get the extras but stay out of the curated group. This way nothing is lost, but the curated group is exactly the set fit for uniqueness analysis.
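The curation decision can be summarised in a few lines. The sketch below works on plain values rather than AiiDA nodes and leaves out the source extra; the function names and argument layout are hypothetical, while the exit codes and extra names are the ones documented above.

```python
# Exit codes that signal a formula mismatch, with the label stored in the extra.
FORMULA_EXIT_CODES = {
    430: "missing_elements",
    431: "different_comp",
    432: "check_failed",
}


def curation_extras(is_alloy, has_vacancies, exit_status, spacegroup_numbers):
    """Return the curation extras for one parsed structure (source omitted)."""
    extras = {"partial_occupancies": bool(is_alloy or has_vacancies)}
    # The CIF space group is only recorded when it is unambiguous.
    if len(spacegroup_numbers) == 1:
        extras["cif_spacegroup_number"] = int(spacegroup_numbers[0])
    # The formula flag is only present for the three mismatch exit codes.
    if exit_status in FORMULA_EXIT_CODES:
        extras["incorrect_formula"] = FORMULA_EXIT_CODES[exit_status]
    return extras


def is_curated(extras):
    """A structure joins the curated group only if both checks pass."""
    return not extras["partial_occupancies"] and "incorrect_formula" not in extras
```

Every structure receives the extras; is_curated only decides group membership, which mirrors the "nothing is lost" behaviour described above.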
Update
update handles the case where you re-import a source database and want to fold the new import into the previous curated set without redoing the unique-family analysis from scratch.
It takes three group labels — old_group, new_group, and target_group. For the MPDS re-import these would be mpds/v1/structure/curated, mpds/v2/structure/curated, and mpds/v2/structure/final respectively. See the group labels table for context.
Warning
The current logic was written for the MPDS, where every entry's version field is bumped on every import. For databases where only some entries change version (e.g. COD), the logic needs to be revisited.
The procedure has two steps:
1. Import and process the new version. Do a full import, run CifCleanWorkChain for all the new raw CIFs, then curate to obtain e.g. mpds/v2/structure/curated.
2. Reconcile against the previous curated set. For each StructureData in the new curated group:
   - If a structure with the same source ID is already in the target group, skip it (this makes update re-runnable on a partial target).
   - Else if no entry with the same source ID exists in the old curated group, take the new structure.
   - Else, compare the old and new structures via StructureMatcher: if they match, keep the old structure (preserves the older version string and avoids spurious churn); otherwise take the new structure.
The output goes into the target group (the "final" group for the new version, e.g. mpds/v2/structure/final).
This handles three real cases:
- Entry removed upstream → absent from latest, will be flagged later by analyse id-removed.
- Entry updated but structurally unchanged → old source is preserved.
- Entry updated and structurally different → the version string is bumped.
One case is not handled: when the new CifCleanWorkChain fails but the old one succeeded, we currently drop the entry even if the old version is still valid upstream. If this becomes important, the logic needs to consult the previous curated set as a fallback.
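The reconciliation step of the procedure can be sketched with plain dictionaries. In this hypothetical version, each group is a dict mapping a source ID to a (version, structure) pair, and structures_match is a callable standing in for a StructureMatcher comparison; neither is how mc3d-source stores its data.

```python
def reconcile(old, new, target, structures_match):
    """Fill target from new, preferring structurally matching entries from old.

    old, new, target: dicts mapping source ID -> (version, structure).
    structures_match: callable(structure_a, structure_b) -> bool.
    """
    for source_id, new_entry in new.items():
        if source_id in target:
            # Already reconciled on an earlier run; keeps the step re-runnable.
            continue
        old_entry = old.get(source_id)
        if old_entry is None:
            # Entry is new in this import.
            target[source_id] = new_entry
        elif structures_match(old_entry[1], new_entry[1]):
            # Structurally unchanged: keep the old entry and its version string.
            target[source_id] = old_entry
        else:
            # Structurally different: take the new entry.
            target[source_id] = new_entry
    return target
```

Entries that exist only in old never reach target, which is why removed entries have to be caught separately by analyse id-removed. The unhandled case is also visible here: an ID whose new work chain failed is absent from new, so its old entry is never considered.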
Uniqueness
uniq collapses curated structures into duplicate families. The procedure:
1. Pre-filter. Skip structures whose extras include incorrect_formula. The --contains / --skip options apply additional chemical_system filters at the query level.
2. Sort key. Each structure gets a key based on its Hill-compact formula and, by default, its space group number (taken from the cif_spacegroup_number extra or computed with spglib if missing). Structures with different keys can never match.
3. Compare within a bucket. For each bucket of structures sharing a key, run a similarity analysis via pymatgen's StructureMatcher. Three methods are available:
   - first (default): walk the bucket linearly, comparing each structure against the existing representatives; the first matching representative absorbs it.
   - seb: build a full adjacency matrix and split into connected components.
   - pymatgen: defer to StructureMatcher.group_structures.
4. Default matcher settings. ltol=0.2, stol=0.3, angle_tol=5, primitive_cell=False (structures are already primitivised by the cleaning step), scale=True, attempt_supercell=False. Override with --matcher-settings <yaml>.
5. Parallelism and checkpointing. Buckets are processed in a multiprocessing.Pool of size --parallelize (default 5). If --chunk-size is set, partial results are written to checkpoint.json after each chunk and reloaded on the next run.
The output is result.json: a list of families, where each family is a list of source strings that ended up matching.
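The difference between the first and seb methods is easiest to see on a toy example. The sketch below is not the uniq implementation: the function names are hypothetical, match is a generic callable standing in for a StructureMatcher comparison, and treating the first member of each family as its representative is an assumption.

```python
from collections import defaultdict


def bucket_by_key(items, key):
    """Group items by sort key; items in different buckets are never compared."""
    buckets = defaultdict(list)
    for item in items:
        buckets[key(item)].append(item)
    return buckets


def group_first(bucket, match):
    """Greedy grouping: the first matching representative absorbs the item."""
    families = []
    for item in bucket:
        for family in families:
            if match(family[0], item):  # family[0] acts as the representative
                family.append(item)
                break
        else:
            families.append([item])  # no representative matched: new family
    return families


def group_connected(bucket, match):
    """Connected components of the full pairwise-match graph."""
    n = len(bucket)
    adjacency = [
        [i != j and match(bucket[i], bucket[j]) for j in range(n)]
        for i in range(n)
    ]
    seen, families = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if adjacency[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        families.append([bucket[i] for i in sorted(component)])
    return families
```

The two strategies agree when matching is transitive but can differ otherwise. With numbers that "match" when they differ by at most 1, group_first([1, 2, 3], ...) gives [[1, 2], [3]] because 3 is only compared against the representative 1, whereas group_connected gives [[1, 2, 3]] because 2 links 1 and 3.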
Select
select turns the unique-families JSON into the final MC3D set. It needs three inputs:
- unique_families.json: output of uniq.
- The previous MC3D data file (mc3d_id_file), mapping MC3D IDs to their golden source and previous duplicate family.
- deprecation.json: output of analyse (see below).
The logic is:
1. Re-use existing MC3D IDs. For each old MC3D ID, find the new family/families that contain any of its previous family's sources. If exactly one new family lines up, the MC3D ID inherits it. If multiple new families match, families containing an existing "golden" source from any other MC3D entry are dropped from the candidate set, so two old MC3D entries don't end up pointing to the same new family.
2. Drop fully deprecated MC3D IDs. If none of the previous family's sources appear in any new family and every source in the previous family appears in deprecation.json, the MC3D ID is marked deprecated.
3. Pick new families. Families that no MC3D ID has inherited are candidates for new entries. Two filters apply: families with a "golden" source from a previous MC3D entry are skipped (handled by step 1), and families whose sources are entirely a subset of the COD-with-hydrogen set (curated COD structures with partial_occupancies=False whose chemical_system contains -H-) are skipped.
4. Pick golden sources. For each new family, pick one source as the golden structure with effective priority MPDS > ICSD > COD: the code assigns from each bucket in turn with plain if (not elif), so the last one set wins. (The inline comment in select.py calls this "by permissions"; the rationale is the license / accessibility hierarchy of the underlying databases.) The chosen source's StructureData becomes the golden structure for that family; all other sources are written to its duplicates extra.
5. Persist. The golden structures are collected into a new AiiDA group (default global/uniques/new) and the per-MC3D-ID data is written to new-mc3d-data.json. The output also includes a spglib_space_group field per golden structure, read from the SeeKpath-set spacegroup_number extra on the StructureData (see the extras table).
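The "last one set wins" behaviour of the golden-source step can be sketched as follows. This is an illustration rather than the code in select.py: the function name pick_golden is hypothetical, and taking the first source within each database bucket is an assumption, since the text does not say how ties inside a bucket are broken.

```python
def pick_golden(family):
    """Pick the golden source for a family of '<db>|<version>|<id>' strings."""
    by_db = {"cod": [], "icsd": [], "mpds": []}
    for source in family:
        by_db[source.split("|")[0]].append(source)

    golden = None
    # Plain `if` statements: each later assignment overrides the earlier one,
    # so the effective priority is MPDS > ICSD > COD.
    if by_db["cod"]:
        golden = by_db["cod"][0]
    if by_db["icsd"]:
        golden = by_db["icsd"][0]
    if by_db["mpds"]:
        golden = by_db["mpds"][0]

    # Everything except the golden source goes into the duplicates extra.
    duplicates = [s for s in family if s != golden]
    return golden, duplicates
```

Reordering the three if blocks would change the priority, which is why the order of assignment matters more than it would with an if / elif chain.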
Deprecation
"Deprecating" a source or an MC3D ID means flagging it as no longer valid: it must be removed from the MC3D frontend and excluded from future uniqueness analyses.
Source-level deprecation
The analyse subcommands populate a single deprecation.json, keyed by source string, with one of three reasons (defined in mc3d_source.contants.SourceDeprecation):
- id_removed: flagged by analyse id-removed, which takes an old and a new raw CifData group and writes the source strings whose IDs are present in the old group but not the new one. The stored source string is the old one (the one being removed).
- structure_updated: flagged by analyse structure-updated, which takes an old curated structure group and a new "final" structure group and writes the source strings whose ID is present in both but whose version/structure changed in the new group. The stored source string is the old one (the now-superseded version).
- incorrect_formula: flagged by analyse incorrect-formula, which scans all StructureData in the AiiDA database (no group filter) for the incorrect_formula extra set by curate. The stored source string is the affected structure's source.
Each subcommand merges its findings into the existing deprecation.json (if any), but the overlap policies differ:
- id-removed: silently overwrites existing keys with the new value.
- structure-updated: aborts (prints Critical and returns) if any keys overlap.
- incorrect-formula: prompts via typer.confirm; on confirm, silently overwrites.
The asymmetry is incidental and probably worth aligning.
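To make the three overlap policies concrete, and to show what a single aligned entry point might look like, here is a minimal sketch. It rests on several assumptions: deprecation.json is modelled as a dict from source string to reason, the policy labels and the merge_deprecations name are hypothetical, confirm stands in for the typer.confirm prompt, and declining the prompt is assumed to leave the file untouched.

```python
def merge_deprecations(existing, found, policy, confirm=lambda: False):
    """Merge found into existing under one of three overlap policies.

    existing, found: dicts mapping source string -> reason.
    policy: 'overwrite' (id-removed), 'abort' (structure-updated),
            or 'confirm' (incorrect-formula). These labels are hypothetical.
    Returns the merged dict, or None when nothing should be written.
    """
    overlap = existing.keys() & found.keys()
    if overlap and policy == "abort":
        return None  # the caller reports the problem and stops
    if overlap and policy == "confirm" and not confirm():
        return None  # the user declined the overwrite (assumed behaviour)
    # Later keys win, so overlapping entries take the newly found reason.
    return {**existing, **found}
```

Without overlapping keys, all three policies behave identically and simply union the two mappings.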
MC3D-ID-level deprecation
An MC3D ID is deprecated when all sources in its previous duplicate family are deprecated (see step 2 of Select). If only some sources are deprecated — including the "golden" one — the MC3D ID is kept and a warning is surfaced on the frontend instead, so the entry remains findable.
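The rule can be written as a small classifier. This is a sketch under stated assumptions: the function name and the three return labels are hypothetical, families are plain lists of source strings, and deprecated is the set (or dict) of keys from deprecation.json.

```python
def classify_mc3d_id(previous_family, new_families, deprecated):
    """Return 'deprecated', 'kept_with_warning', or 'kept' for one MC3D ID."""
    # All source strings that appear in any new family.
    surviving = set().union(*new_families)
    appears_in_new = any(s in surviving for s in previous_family)
    all_deprecated = all(s in deprecated for s in previous_family)

    if not appears_in_new and all_deprecated:
        return "deprecated"
    if any(s in deprecated for s in previous_family):
        # Partially deprecated (possibly including the golden source).
        return "kept_with_warning"
    return "kept"
```

Note that deprecating the golden source alone is not enough to deprecate the MC3D ID; at least one non-deprecated source keeps the entry alive.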