Skip to content

Bugfix: Fix memory management issues by replacing variable length arrays with STL vectors and arrays #3075

@georgemccabe

Description

@georgemccabe

Describe the Problem

Multiple runtime problems were discovered when MET was compiled and tested with the -O2 level of optimization in support of creating a METplus build through Conda. This issue is to resolve these runtime problems related to optimization. It will only be complete when (1) all MET unit tests run to completion with no error and (2) no significant differences remain between the output from MET compiled with optimization and without.

Additional recommendations:

  • Update the Dockerfile to routinely test MET with optimization (GNU -O2) enabled to more closely match how it is run in production.
  • GNU unit testing.
    • Compiled and tested with -O1, -O2, and -O3.
    • Unit tests run to completion and output is compared to main_v12.0 NB20250204 nightly build output.
    • For -O1, 3 files have size diffs > 1% and real diffs are flagged in 1 NetCDF file:
file1: ../../test_output/pb2nc_indy/nam.20210311.t00z.prepbufr.tm00.pbl.nc
file2: /d1/projects/MET/MET_regression/main_v12.0/NB20250204/MET-main_v12.0/test_output/pb2nc_indy/nam.20210311.t00z.prepbufr.tm00.pbl.nc
ERROR: found     4 differences in var obs_val                                - max abs: 0.05004883
  • For -O2, same 4 diff files as -O1 plus 15 more related to BCLP derivation in TC-Pairs.
  • For -O3, same 19 diff files as -O2.
  • So not optimizing with FFLAGS did NOT solve the BCLP derivation diffs.
  • Intel unit testing.
    • Compiled and tested with -O1, -O2, and -O3.
    • Note that unit_modis.xml and unit_lidar2nc.xml are skipped because those tools are not compiled.
    • Also note that unit_tc_diag.xml and unit_python.xml are skipped because of issues running Python.
    • unit_point2grid.xml segfaults on the TEST: point2grid_GOES_16_AOD_TO_G212_GAUSSIAN test:
DEBUG 1: Reading data file: /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M3_G16_s20181341702215_e20181341704588_c20181341711418.nc
terminate called after throwing an instance of 'netCDF::exceptions::NcAttMeta'
  what():  NetCDF: Can't open HDF5 attribute
file: ncCheck.cpp  line:92
FATAL ERROR (SEGFAULT): Process 3478583 got signal 11 @ local time = 2025-02-05 18:09:28Z
  • For -O1, ...
  • For -O2, ...
  • For -O3, ...

The runtime problems, along with their fixes, are described below:

The point2grid_GOES_16_ADP unit test fails to run when installed via conda. I suspect that this is due to optimization being turned on, similar to the bug from #3054.
Here is an example command that fails on seneca:

point2grid \
/d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20192662141196_e20192662143569_c20192662145547.nc \
G212 \
/d1/personal/mccabe/met_test_output/point2grid/point2grid_GOES_16_ADP.nc \
-field name="AOD_Smoke";  level="(*,*)"; \
-adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20192662141196_e20192662143569_c20192662144526.nc \
-goes_qc 0,1 -method MAX -v 1

This produces a segfault when compiled with the -O2 optimization flag.
This problem has been fixed on the bugfix branch for this issue and was due to buffer overflow, reading > 512 characters into a buffer of that size.

  • unit_tc_pairs.xml produces bad output in the TCDIAG lines, where the diagnostic names contains garbage strings.

While tc_pairs completes without error, the unit test validation logic fails when reading the corrupted strings. This was fixed by updating the TrackInfo::diag_name() accessor function to return a string copy of the diagnostic name rather than a const char * pointer to temporary memory.

  • unit_point2grid.xml aborts on the point2grid_2D_time test on line 442 of nc_cf_file.cc. The debugger indicates that the NetCDF groupId value being read is corrupted:
p *_yDim
$31 = {nullObject = false, myId = 0, groupId = 1167840000}

When it should have the same groupId as _xDim (confirmed by debugging the non-optimized version):

p *_xDim
$32 = {nullObject = false, myId = 1, groupId = 131072}

The problem lies in the read_netcdf_grid() function in nc_cf_file.cc where the _xDim and _yDim pointers were pointing to a local variable. The fix is to search for and point them to class variables that won't go out of scope.

  • All unit tests now run to completion when compiled using -g -O2. However, comparing to the output from the unoptimized nightly build runs reveals differences, including differences in the number of TCST lines written by tc_pairs:
file1: test_output/tc_pairs/al022013_interp12_fill.tcst
file2: /d1/projects/MET/MET_regression/main_v12.0/NB20250204/MET-main_v12.0/test_output/tc_pairs/al022013_interp12_fill.tcst
ERROR: differing number of rows 1162 vs. 1198 for row type TCST_TCMPR between versions 12_0 vs. 12_0 

However, all the differences in these 2 files are for the BCLP model, which is computed using NHC-provided FORTRAN code. So it's possible an optimization problem also resides in the in FORTRAN code within MET. For the time being, I'll retest WITHOUT setting FFLAGS='-O2' to see if differences remain.

  • unit_tc_diag.xml segfaults the same way on Intel builds for all optimization levels. The problem occurs in the Python diagnostic computation step. unit_python.xml also fails on the python_plot_point_obs_met_nc_to_pandas test after completing many successful tests.

These Python failures could very well be related to us linking to this /nrit/ral/met-python3/bin/python3.10 version of Python built with GNU. Perhaps we need to link to an Intel Python build instead? For the time being, I've just skipped over unit_tc_diag.xml and unit_python.xml.

Expected Behavior

All MET unit tests should run to completion and create numerically equivalent output regardless of the optimization level.

Environment

Describe your runtime environment:
Reproduced on the MET development machine named seneca.

To Reproduce

Describe the steps to reproduce the behavior:

  • Compile the MET main_v12.0 branch with CFLAGS, CXXFLAGS, and FFLAGS set to include -O2.
  • Run bin/unit_test.sh to run all of the MET unit tests.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

2702701 to support conda builds.

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Review default alert labels
  • Select component(s)
  • Select priority
  • Select requestor(s)

Milestone and Projects

  • Select Milestone as the next bugfix version
  • Select Coordinated METplus-X.Y Support project for support of the current coordinated release
  • Select MET-X.Y.Z Development project for development toward the next official release

Define Related Issue(s)

Consider the impact to the other METplus components.

Bugfix Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of main_<Version>.
    Branch name: bugfix_<Issue Number>_main_<Version>_<Description>
  • Fix the bug and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into main_<Version>.
    Pull request: bugfix <Issue Number> main_<Version> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next bugfix version
    Select: Coordinated METplus-X.Y Support project for support of the current coordinated release
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Complete the steps above to fix the bug on the develop branch.
    Branch name: bugfix_<Issue Number>_develop_<Description>
    Pull request: bugfix <Issue Number> develop <Description>
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next official version
    Select: MET-X.Y.Z Development project for development toward the next official release
  • Close this issue.

Metadata

Metadata

Type

No type

Projects

Status

🏁 Done

Relationships

None yet

Development

No branches or pull requests

Issue actions