Optimized cube file parsing in from_cube for improved speed #4331

OliMarc · 2025-03-21T03:07:49Z

Summary

Major Changes:

This PR enhances the from_cube function in io.common.VolumetricData to significantly improve performance. When processing large .cube files, the original implementation took minutes to read and set Pymatgen objects. The optimized version incorporates several key improvements: file reading is now handled with readlines() instead of multiple readline() calls, reducing I/O operations. Voxel data parsing has been rewritten to use NumPy vectorized parsing instead of loops, making numerical processing faster. Atom site parsing has been improved by replacing the loop-based readline() approach with list comprehensions. Additionally, volumetric data parsing now leverages np.fromstring() instead of converting lists to NumPy arrays.

Performance Improvements

The Figures below illustrate the execution time and memory usage improvements:

Figure 1: Performance comparison for large .cube files (GB-scale).

Figure 2: Performance comparison for small .cube files (MB-scale).

To quantify improvements, I benchmarked 13 .cube files of various structures, volumetric data, and sizes. The optimized from_cube() is 55-65% faster in parsing while maintaining comparable memory usage.

Interestingly, preloading the entire file did not significantly impact RAM usage. While the optimized version appears to use slightly more RAM for smaller files (~20-100 MB difference), this is negligible given the short allocation time.

For File 3, a high memory usage difference was observed across multiple runs. While I am unsure of the exact cause, it might be related to NumPy's internal memory management and temporary storage allocation when parsing large files.

To-Do

Simple upgrade completed.

Note: All tested .cube files were generated from GCMC, probability plots (aka density plots). The improvements should be applicable to DFT-generated .cube files all the same (electron densities, molecular orbitals).

Checklist

[Y] Google format doc strings added. Check with ruff.

Formatting & linting checks passed (pre-commit run --all-files results below):

ruff.....................................................................Passed
ruff-format..............................................................Passed
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mypy.....................................................................Passed
codespell................................................................Passed
cython-lint..............................................................Passed
double-quote Cython strings..............................................Passed
blacken-docs.............................................................Passed
markdownlint.............................................................Passed
nbstripout...........................................(no files to check)Skipped
pyright..................................................................Passed

[Y] Type annotations included. Check with mypy.

Since only optimizations were made, type annotations remain unchanged from the original function.

[Y] Tests added for new features/fixes.

No new tests were added specifically, but I verified that the changes did not break anything by running pytest. The results were identical to those obtained on a fresh fork:

4 failed, 2973 passed, 161 skipped, 103 warnings

I also developed a local pytest to validate that the optimized from_cube() generates identical results as the original.

Tested parameters:

Lattice parameters
Structure equality
Volume consistency
Volumetric data integrity
Atomic species and coordinates
Density statistics
Lattice angles
Structure symmetry

All tests passed across 13 benchmark .cube files.

This is my first contribution, so let me know if I'm missing anything!

shyuep · 2025-03-22T15:27:23Z

Nice! Thanks a lot!

Optimized cube file parsing in from_cube for improved speed

37f8aef

OliMarc requested review from shyuep and mkhorton as code owners March 21, 2025 03:07

Merge branch 'master' into optimize-cube-parsing

eeb8d33

shyuep enabled auto-merge (squash) March 22, 2025 15:27

Merge branch 'master' into optimize-cube-parsing

f30688c

shyuep merged commit b5c65b3 into materialsproject:master Mar 22, 2025
42 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimized cube file parsing in from_cube for improved speed #4331

Optimized cube file parsing in from_cube for improved speed #4331

Uh oh!

OliMarc commented Mar 21, 2025

Uh oh!

shyuep commented Mar 22, 2025

Uh oh!

Uh oh!

Uh oh!

Optimized cube file parsing in from_cube for improved speed #4331

Optimized cube file parsing in from_cube for improved speed #4331

Uh oh!

Conversation

OliMarc commented Mar 21, 2025

Summary

Major Changes:

Performance Improvements

To-Do

Checklist

Tested parameters:

Uh oh!

shyuep commented Mar 22, 2025

Uh oh!

Uh oh!

Uh oh!