speed-up Structure instantiation #4415

danielzuegner · 2025-05-21T15:37:14Z

Summary

This PR speeds up the instantiation of Structure objects by preventing hash collisions in the lru_cache of get_el_sp and increasing its maxsize. The issue is that currently Element objects are hashed to the same value as the integer atomic numbers (e.g., Element[H] maps to the same hash as int(1)). This forces the lru_hash to perform an expensive __eq__ comparison between the two, which reduces the performance of instantiating many Structure objects. Also here we increase the maxsize of get_el_sp's lru_cache to 1024 for further performance improvements.

This reduces time taken to instantiate 100,000 Structure objects from 31 seconds to 8.7s (avoid hash collisions) to 6.1s (also increase maxsize to 1024). Test snippet:

from pymatgen.core import Structure
import numpy as np
from time import time

t = time()
for _ in range(100_000):
    struc = Structure(
        coords = np.random.rand(10, 3),
        species = np.random.randint(1, 100, size=(10,)),
        lattice = np.random.rand(3, 3),
        coords_are_cartesian = False,
    )
print("Time taken to create 100,000 structures:", time() - t)

Unless I'm missing some deep reason for why the hash of Element objects should be the same as their atomic number hashes, this should not lead to any different behavior of the code.

Major changes:

Update ElementBase __hash__ function to map to a different value than the atomic number
Increase the maxsize of get_el_sp's lru_cache to 1024

Checklist

[ x ] Google format doc strings added. Check with ruff.
[ x ] Type annotations included. Check with mypy.
Tests added for new features/fixes.
[ x ] If applicable, new classes/functions/modules have duecredit @due.dcite decorators to reference relevant papers by DOI (example)

Tip: Install pre-commit hooks to auto-check types and linting before every commit:

pip install -U pre-commit
pre-commit install

Signed-off-by: Daniel Zügner <daniel.zuegner@gmail.com>

shyuep · 2025-05-21T15:44:21Z

Thanks. There is no deep reason why the hashes need to be the atomic number. But I am confused why hash collisions would occur if it is the atomic number. The atomic number is pretty much unique (except for isotopes, which is an esoteric use case)? Why would a hash collision occur for different elements with the atomic number as the hash? Is the main performance benefit just the increase in lru_cache size?

danielzuegner · 2025-05-21T15:50:22Z

The problem is that the lru_cache maps from [Element | int | ...] to Element basically. When we now call get_el_sp(1), it will add a cache entry {1: Element[H]}. When we then call get_el_sp(Element[H]) later, it will map to the same hash, so when deciding whether to add a new key, Python will compare Element[H] == 1, which is quite expensive, and then decide to add a new entry {Element[H]: Element[H]} because this will evaluate to False. But what's worse, any time we now pass either 1 or Element[H] to get_el_sp, they will again map to the same hash as the two existing entries and we have to compare against both {1: Element[H]} and {Element[H]: Element[H]}. This leads to the degraded performance.

Only increasing maxsize to 1024 leads to a runtime of 19.6s for the snippet, so the main benefit is indeed from the hash function fix.

shyuep · 2025-05-21T15:54:03Z

I see. Thanks. I am merging this now. The test errors are the result of poorly written tests comparing strings. I will fix those on my end.

speed-up Structure instantiation

127c5a6

Signed-off-by: Daniel Zügner <daniel.zuegner@gmail.com>

danielzuegner requested review from shyuep and mkhorton as code owners May 21, 2025 15:37

shyuep merged commit 33f16ba into materialsproject:master May 21, 2025
39 of 43 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

speed-up Structure instantiation #4415

speed-up Structure instantiation #4415

Uh oh!

danielzuegner commented May 21, 2025 •

edited

Loading

Uh oh!

shyuep commented May 21, 2025

Uh oh!

danielzuegner commented May 21, 2025

Uh oh!

shyuep commented May 21, 2025

Uh oh!

Uh oh!

Uh oh!

speed-up Structure instantiation #4415

speed-up Structure instantiation #4415

Uh oh!

Conversation

danielzuegner commented May 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Checklist

Uh oh!

shyuep commented May 21, 2025

Uh oh!

danielzuegner commented May 21, 2025

Uh oh!

shyuep commented May 21, 2025

Uh oh!

Uh oh!

Uh oh!

danielzuegner commented May 21, 2025 •

edited

Loading