Skip to content

Conversation

OliMarc
Copy link
Contributor

@OliMarc OliMarc commented Mar 21, 2025

Summary

Major Changes:

This PR enhances the from_cube function in io.common.VolumetricData to significantly improve performance. When processing large .cube files, the original implementation took minutes to read and set Pymatgen objects. The optimized version incorporates several key improvements: file reading is now handled with readlines() instead of multiple readline() calls, reducing I/O operations. Voxel data parsing has been rewritten to use NumPy vectorized parsing instead of loops, making numerical processing faster. Atom site parsing has been improved by replacing the loop-based readline() approach with list comprehensions. Additionally, volumetric data parsing now leverages np.fromstring() instead of converting lists to NumPy arrays.

Performance Improvements

The Figures below illustrate the execution time and memory usage improvements:

Figure 1: Performance comparison for large .cube files (GB-scale).

Figure 2: Performance comparison for small .cube files (MB-scale).

To quantify improvements, I benchmarked 13 .cube files of various structures, volumetric data, and sizes. The optimized from_cube() is 55-65% faster in parsing while maintaining comparable memory usage.

Interestingly, preloading the entire file did not significantly impact RAM usage. While the optimized version appears to use slightly more RAM for smaller files (~20-100 MB difference), this is negligible given the short allocation time.

For File 3, a high memory usage difference was observed across multiple runs. While I am unsure of the exact cause, it might be related to NumPy's internal memory management and temporary storage allocation when parsing large files.


To-Do

Simple upgrade completed.

Note: All tested .cube files were generated from GCMC, probability plots (aka density plots). The improvements should be applicable to DFT-generated .cube files all the same (electron densities, molecular orbitals).


Checklist

[Y] Google format doc strings added. Check with ruff.

Formatting & linting checks passed (pre-commit run --all-files results below):

ruff.....................................................................Passed
ruff-format..............................................................Passed
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mypy.....................................................................Passed
codespell................................................................Passed
cython-lint..............................................................Passed
double-quote Cython strings..............................................Passed
blacken-docs.............................................................Passed
markdownlint.............................................................Passed
nbstripout...........................................(no files to check)Skipped
pyright..................................................................Passed

[Y] Type annotations included. Check with mypy.

Since only optimizations were made, type annotations remain unchanged from the original function.

[Y] Tests added for new features/fixes.

No new tests were added specifically, but I verified that the changes did not break anything by running pytest. The results were identical to those obtained on a fresh fork:

4 failed, 2973 passed, 161 skipped, 103 warnings

I also developed a local pytest to validate that the optimized from_cube() generates identical results as the original.

Tested parameters:

  • Lattice parameters
  • Structure equality
  • Volume consistency
  • Volumetric data integrity
  • Atomic species and coordinates
  • Density statistics
  • Lattice angles
  • Structure symmetry

All tests passed across 13 benchmark .cube files.


This is my first contribution, so let me know if I'm missing anything!

@OliMarc OliMarc requested review from shyuep and mkhorton as code owners March 21, 2025 03:07
@shyuep
Copy link
Member

shyuep commented Mar 22, 2025

Nice! Thanks a lot!

@shyuep shyuep enabled auto-merge (squash) March 22, 2025 15:27
@shyuep shyuep merged commit b5c65b3 into materialsproject:master Mar 22, 2025
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants