Skip to content

Conversation

danielzuegner
Copy link
Contributor

@danielzuegner danielzuegner commented May 21, 2025

Summary

This PR speeds up the instantiation of Structure objects by preventing hash collisions in the lru_cache of get_el_sp and increasing its maxsize. The issue is that currently Element objects are hashed to the same value as the integer atomic numbers (e.g., Element[H] maps to the same hash as int(1)). This forces the lru_hash to perform an expensive __eq__ comparison between the two, which reduces the performance of instantiating many Structure objects. Also here we increase the maxsize of get_el_sp's lru_cache to 1024 for further performance improvements.

This reduces time taken to instantiate 100,000 Structure objects from 31 seconds to 8.7s (avoid hash collisions) to 6.1s (also increase maxsize to 1024). Test snippet:

from pymatgen.core import Structure
import numpy as np
from time import time

t = time()
for _ in range(100_000):
    struc = Structure(
        coords = np.random.rand(10, 3),
        species = np.random.randint(1, 100, size=(10,)),
        lattice = np.random.rand(3, 3),
        coords_are_cartesian = False,
    )
print("Time taken to create 100,000 structures:", time() - t)

Unless I'm missing some deep reason for why the hash of Element objects should be the same as their atomic number hashes, this should not lead to any different behavior of the code.

Major changes:

  • Update ElementBase __hash__ function to map to a different value than the atomic number
  • Increase the maxsize of get_el_sp's lru_cache to 1024

Checklist

  • [ x ] Google format doc strings added. Check with ruff.
  • [ x ] Type annotations included. Check with mypy.
  • Tests added for new features/fixes.
  • [ x ] If applicable, new classes/functions/modules have duecredit @due.dcite decorators to reference relevant papers by DOI (example)

Tip: Install pre-commit hooks to auto-check types and linting before every commit:

pip install -U pre-commit
pre-commit install

Signed-off-by: Daniel Zügner <daniel.zuegner@gmail.com>
@shyuep
Copy link
Member

shyuep commented May 21, 2025

Thanks. There is no deep reason why the hashes need to be the atomic number. But I am confused why hash collisions would occur if it is the atomic number. The atomic number is pretty much unique (except for isotopes, which is an esoteric use case)? Why would a hash collision occur for different elements with the atomic number as the hash? Is the main performance benefit just the increase in lru_cache size?

@danielzuegner
Copy link
Contributor Author

The problem is that the lru_cache maps from [Element | int | ...] to Element basically. When we now call get_el_sp(1), it will add a cache entry {1: Element[H]}. When we then call get_el_sp(Element[H]) later, it will map to the same hash, so when deciding whether to add a new key, Python will compare Element[H] == 1, which is quite expensive, and then decide to add a new entry {Element[H]: Element[H]} because this will evaluate to False. But what's worse, any time we now pass either 1 or Element[H] to get_el_sp, they will again map to the same hash as the two existing entries and we have to compare against both {1: Element[H]} and {Element[H]: Element[H]}. This leads to the degraded performance.

Only increasing maxsize to 1024 leads to a runtime of 19.6s for the snippet, so the main benefit is indeed from the hash function fix.

@shyuep
Copy link
Member

shyuep commented May 21, 2025

I see. Thanks. I am merging this now. The test errors are the result of poorly written tests comparing strings. I will fix those on my end.

@shyuep shyuep merged commit 33f16ba into materialsproject:master May 21, 2025
39 of 43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants