DanielYang59
Contributor

@DanielYang59 DanielYang59 commented Mar 28, 2025

Summary

Generate JSON/YAML directly from the source data through a unified parser interface:

Source recordings (different parsers for csv/yaml/...)
↓
Each parser generates one or more Property objects (property (holds unit/reference) -> element -> data)
↓
YAML for development and JSON for production
  • Assume all values for the same property have the same unit
  • Assume a single reference per property for now
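
As a rough illustration of the flow above, here is a minimal sketch. The names `Property`, `parse_csv`, and `to_element_dict`, as well as the CSV layout, are assumptions made for this example and are not taken from the PR's code.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Property:
    """Hypothetical container: one property with its unit/reference and per-element data."""

    name: str
    unit: Optional[str] = None       # assumed identical for every element
    reference: Optional[str] = None  # assumed to be a single reference per property
    data: dict = field(default_factory=dict)  # element symbol -> value


def parse_csv(text, name, unit=None):
    """Hypothetical CSV parser: expects a header row 'element,value'."""
    prop = Property(name=name, unit=unit)
    for row in csv.DictReader(io.StringIO(text)):
        prop.data[row["element"]] = float(row["value"])
    return prop


def to_element_dict(properties):
    """Pivot property -> element -> data into element -> property -> data."""
    out = {}
    for prop in properties:
        for element, value in prop.data.items():
            out.setdefault(element, {})[prop.name] = value
    return out


# Toy source data, for illustration only
source = "element,value\nH,20.28\nHe,4.22\n"
boiling = parse_csv(source, name="Boiling point", unit="K")
print(json.dumps(to_element_dict([boiling]), indent=2))
```

Parsers for other formats would return the same `Property` type, so the YAML/JSON writers only need to understand one structure.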

Follow up PRs

  • Have the JSON generator update core.periodic_table.json directly
  • Store the unit in a separate field, which requires updating the unit-handling mechanism in core.Element
  • Lazily load optional/extra properties
  • Some properties are missing from the current JSON (but present in the YAML): "Ground level", "NMR Quadrupole Moment"
  • "Electronic structure" seems to differ considerably between the current JSON and YAML (the new YAML for this PR follows the JSON to stay consistent with production code)
  • Resolve known differences in _compare_json.py
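
For the lazy-loading item, one possible shape is sketched below. This is only an illustration: `ElementData`, `extra`, and the JSON string standing in for a separate file are hypothetical and are not pymatgen's API.

```python
import json
from functools import cached_property


class ElementData:
    """Hypothetical holder: core data is eager, extra properties load on first access."""

    def __init__(self, core, extra_json):
        self.core = core
        self._extra_json = extra_json  # stand-in for the contents of a separate file
        self.load_count = 0            # only here to show that parsing happens once

    @cached_property
    def extra(self):
        # Runs on first access; the result is then cached on the instance.
        self.load_count += 1
        return json.loads(self._extra_json)


# Toy values, for illustration only
element = ElementData(core={"Atomic no": 1}, extra_json='{"Ground level": "2S1/2"}')
print(element.load_count)             # nothing has been parsed yet
print(element.extra["Ground level"])
print(element.extra["Ground level"])  # served from the cache
print(element.load_count)
```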

Compare new JSON with current

A copy of the current (master branch) JSON/YAML is attached for quick lookup (zipped, since uploading YAML is not allowed): master_json_yaml.zip

pymatgen/dev_scripts/periodic_table_resources/_compare_json.py
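
The general idea of such a comparison can be sketched as a recursive diff over two loaded JSON dictionaries. This is not the content of _compare_json.py; `diff_nested` and the `MISSING` marker are hypothetical names, and the data is a toy example.

```python
MISSING = "<missing>"  # hypothetical marker for a key present on only one side


def diff_nested(current, new, path=""):
    """Yield (path, current_value, new_value) for every entry that differs."""
    if isinstance(current, dict) and isinstance(new, dict):
        # JSON object keys are strings, so sorting the union is safe
        for key in sorted(set(current) | set(new)):
            sub_path = f"{path}/{key}" if path else str(key)
            if key not in current:
                yield sub_path, MISSING, new[key]
            elif key not in new:
                yield sub_path, current[key], MISSING
            else:
                yield from diff_nested(current[key], new[key], sub_path)
    elif current != new:
        yield path, current, new


# Toy data, for illustration only
current = {"H": {"Atomic mass": 1.00794, "X": 2.2}}
new = {"H": {"Atomic mass": 1.008, "X": 2.2, "Ground level": "2S1/2"}}
for path, old_value, new_value in diff_nested(current, new):
    print(f"{path}: {old_value!r} -> {new_value!r}")
```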

@shyuep
Member

shyuep commented Mar 28, 2025

I want to point out that ChatGPT can make this extremely easy. I passed ChatGPT the yaml and prompted it to "separate out the units of each property for each element from the value, keeping the dictionary schema consistent". Below is the result.

Below is a Python script that reads your existing YAML file, scans every property for values that appear to include a numeric (or “no data”) part followed by units, and then rewrites those entries as dictionaries with separate “value” and “units” keys. The overall schema (i.e. the element keys and the structure for sub‐dictionaries, lists, etc.) remains unchanged.

You can add this file (for example, name it transform_periodic_table.py) to your project and run it. It will load periodic_table.yaml and produce a new file called periodic_table_transformed.yaml with the modified structure.

Below is the complete code:

import yaml
import re

def separate_units(val):
    """
    If val is a string of the form "<number> <unit>" or "no data <unit>",
    separate out the numeric part (or "no data") and the unit.
    Otherwise, return the original value.
    """
    if not isinstance(val, str):
        return val
    # Match a number (optionally signed, integer, float, or scientific notation)
    # or the string "no data", optionally followed by whitespace and a unit.
    m = re.match(r"^(no data|[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)(?:\s+(.*))?$", val)
    if m is None:
        return val
    num_part = m.group(1)
    unit_part = (m.group(2) or "").strip()
    # Keep "no data" as a string; every other match is a valid float literal.
    num_val = num_part if num_part == "no data" else float(num_part)
    return {"value": num_val, "units": unit_part}

def transform(data):
    """
    Recursively process the YAML data.
    If a value is a string, attempt to separate out a measurement and its unit.
    If the value is a dict or list, recursively process it.
    """
    # Handle strings here so that strings inside lists are processed too.
    if isinstance(data, str):
        return separate_units(data)
    if isinstance(data, dict):
        return {k: transform(v) for k, v in data.items()}
    if isinstance(data, list):
        return [transform(item) for item in data]
    return data

# Load the original periodic table YAML file
with open("periodic_table.yaml", "r", encoding="utf-8") as f:
    data = yaml.safe_load(f)

# Transform the data to separate units from the values
transformed_data = transform(data)

# Write out the transformed data to a new YAML file
with open("periodic_table_transformed.yaml", "w", encoding="utf-8") as f:
    yaml.dump(transformed_data, f, sort_keys=False)

print("Transformation complete. New file saved as 'periodic_table_transformed.yaml'.")

How It Works
	•	separate_units function:
This function uses a regular expression to detect values that start with either a number (integer or float) or the phrase "no data", optionally followed by whitespace and a unit. If the pattern matches, it returns a dictionary with the keys "value" and "units". If not, it simply returns the original value.
	•	transform function:
This function recursively walks through the entire YAML data structure. For each string value it finds, it applies the separate_units function. If the value is already a dictionary or a list, it recurses into that structure.
	•	I/O:
The script reads your original periodic_table.yaml and writes the transformed version to periodic_table_transformed.yaml.

Once you run this script, your properties (for example, "Boiling point: 3573 K") will be converted into a nested structure like:

Boiling point:
  value: 3573.0
  units: K

This keeps the overall dictionary schema (each element remains a top-level key with its properties) consistent.
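
Code that consumes the transformed file has to cope with both shapes, since values that do not match the pattern stay as they were. A minimal sketch, assuming a hypothetical helper `get_quantity` (not part of pymatgen or of the script above):

```python
def get_quantity(element_data, name):
    """Return (value, units) whether or not the property was split into a dict."""
    raw = element_data[name]
    if isinstance(raw, dict) and "value" in raw:
        return raw["value"], raw.get("units", "")
    # Untransformed entries (plain strings, numbers, lists) pass through unchanged.
    return raw, ""


# Toy element entry, for illustration only
element = {
    "Boiling point": {"value": 3573.0, "units": "K"},
    "Name": "Example",
}
print(get_quantity(element, "Boiling point"))
print(get_quantity(element, "Name"))
```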

You can now run this script in your PyCharm project, and the transformed file will be available for further use.

@DanielYang59
Contributor Author

DanielYang59 commented Mar 28, 2025

Beautiful, thanks a ton for chiming in! That would be really helpful. I will likely work on this next week. Enjoy the weekend :)

@DanielYang59 DanielYang59 marked this pull request as ready for review April 1, 2025 10:18
@DanielYang59
Contributor Author

DanielYang59 commented Apr 1, 2025

@shyuep this PR is ready from my side. Since it might change core functionality of pymatgen, I would like to move forward as slowly and carefully as possible, so I will split the workflow into multiple steps:

  • This PR only generates a duplicate JSON saved in dev_scripts, without changing our production data
  • Any change/fix to the production JSON will be submitted as a separate PR for visibility
  • For more detailed TODOs, see "Follow up PRs" in the description of this PR

@DanielYang59 DanielYang59 marked this pull request as draft April 1, 2025 11:25
@DanielYang59 DanielYang59 marked this pull request as ready for review April 1, 2025 11:36
@DanielYang59 DanielYang59 marked this pull request as draft April 3, 2025 14:56
@DanielYang59 DanielYang59 force-pushed the simplify-ptable-json-generator branch from a1ba60b to 946fda0 Compare April 3, 2025 17:53
@DanielYang59 DanielYang59 marked this pull request as ready for review April 3, 2025 17:54
@shyuep shyuep merged commit 9657a69 into materialsproject:master Apr 20, 2025
42 checks passed
@shyuep
Member

shyuep commented Apr 20, 2025

This is merged since it only affects the dev directory. Thanks.

@DanielYang59 DanielYang59 deleted the simplify-ptable-json-generator branch April 20, 2025 20:25
@DanielYang59
Contributor Author

No problem at all, thanks for reviewing!

Yes, I was trying to split the JSON generator and the rewrite of the JSON loader/parser for ptable.py into separate PRs.
