[BLOCKING] Segmentation fault while using external memory version #4037

@kambstreat

Hi, I am using the following code to run the external memory version of XGBoost because my dataset is very large.

Python code for XGBoost external memory training:

import sys
import xgboost as xgb
import numpy as np
import multiprocessing

train_file = 'train_data.txt'
n_trees = 500
num_cpus = multiprocessing.cpu_count()
train_file_with_cache = train_file + "#dtrain.cache"

dtrain = xgb.DMatrix(train_file_with_cache)
param = {
    'max_depth': 5,
    'eta': 0.1,
    'objective': 'binary:logistic',
    'nthread': num_cpus
}
bst = xgb.train(param, dtrain, num_boost_round=n_trees)

Sample rows from train_data.txt:

0 1:0.05 2:0.2886751346 3:0.1428571429 4:14.5185062529 5:18.9276504219 8:0.6666666667 9:0.8164965809 10:1000.0 11:10.0 12:0.0023387311 13:0.0002435188 14:0.0000963849 15:0.0063504397 16:0.00048034 17:0.0000850515 18:2.0 20:2.0 21:0.0235294118 22:11.0 23:3.0 24:2.0 25:1.0 26:11.0 27:3.0 28:2.0 29:1.0 31:1.0 34:1.0 37:1.0 40:20.0 41:0.5773502692 42:0.3333333333
0 1:0.1315789474 2:0.2672612419 3:0.15 4:26.789135688 5:16.1891025786 8:1.0 9:1.0 10:2899.0 11:29.99 12:0.0183876189 13:0.000405548 14:0.0000554602 15:0.0101011466 16:0.0005706792 17:0.000093875 18:199.0 20:199.0 21:1.0310880829 22:1.0 23:1.0 24:1.0 25:1.0 26:1.0 27:1.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:7.0 41:0.5773502692 42:0.3333333333
1 1:0.1111111111 2:0.4082482905 3:0.2222222222 4:23.1104210691 5:16.863139254 8:1.0 9:1.0 10:1700.0 11:17.0 12:0.0062726159 13:0.000491584 14:0.0003132148 15:0.0065093033 16:0.0005302828 17:0.0001788845 18:25.0 20:14.0 21:0.0370919881 22:4.0 23:1.0 24:1.0 25:1.0 26:4.0 27:1.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:2.0 41:1.0 42:1.0
1 1:0.0178571429 2:0.2611164839 3:0.12 4:4.2300158408 5:16.6917285999 8:0.2 9:0.3333333333 10:1600.0 11:16.0 12:0.0042197019 13:0.000288361 14:0.0001064819 15:0.0076858998 16:0.0007074996 17:0.000212954 22:61.0 23:13.0 24:7.0 25:4.0 26:61.0 27:13.0 28:7.0 29:4.0 31:1.0 34:1.0 37:1.0 40:15.0 41:0.5773502692 42:0.3333333333
0 1:0.0416666667 2:0.2581988897 3:0.0952380952 4:18.8293173733 5:12.3037244648 10:1995.0 11:20.95 12:0.0129559065 13:0.0008584607 14:0.0001920785 15:0.0102470342 16:0.0006167027 17:0.0002648885 18:57.0 20:57.0 21:0.2701421801 22:1.0 23:1.0 24:1.0 25:1.0 26:1.0 27:1.0 28:1.0 29:1.0 32:1.0 34:1.0 37:1.0 40:14.0 41:0.5 42:0.25
1 1:0.0588235294 2:0.3396831102 3:0.1875 4:14.5185062529 5:20.3706356214 8:1.0 9:1.0 10:800.0 11:8.0 12:0.0027076417 13:0.0001515364 14:0.0000630457 15:0.0035674091 16:0.0002520206 17:0.0000883569 18:3.0 20:3.0 21:0.0357142857 22:6.0 23:2.0 24:1.0 25:1.0 26:6.0 27:2.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:6.0 41:0.5773502692 42:0.3333333333
0 1:0.0625 2:0.3202563076 3:0.1428571429 4:25.767381348 5:19.023385007 8:0.3333333333 9:0.5773502692 10:1699.0 11:17.99 12:0.0167115449 13:0.0016991411 14:0.0004404174 15:0.007886134 16:0.0007152095 17:0.0002080609 18:1423.0 19:6.0 20:1277.0 21:3.3403755869 22:1.0 23:1.0 24:1.0 25:1.0 26:1.0 27:1.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:1.0 41:0.5773502692 42:0.3333333333
0 1:0.0545454545 2:0.246182982 3:0.0869565217 4:12.2680659626 5:6.0386632079 10:2149.0 11:21.49 12:0.0100405638 13:0.0003642296 14:0.000050019 15:0.0073479124 16:0.0003586877 17:0.0000985799 18:245.0 20:200.0 21:0.374617737 22:1.0 23:1.0 24:1.0 25:1.0 26:1.0 27:1.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:3.0 41:0.5 42:0.25
1 1:0.1272727273 2:0.298142397 3:0.16 4:43.0184411519 5:36.749580197 10:199.0 11:2.99 12:0.006621655 13:0.0012702954 14:0.0003402279 15:0.00645962 16:0.0006557264 17:0.0001699916 18:61.0 20:61.0 21:0.2489795918 22:6.0 23:2.0 24:1.0 25:1.0 26:6.0 27:2.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:7.0
1 1:0.0263157895 2:0.15430335 3:0.0625 4:13.1159362977 5:15.5852962034 8:0.6 9:0.75 10:7900.0 11:79.0 12:0.0026556091 13:0.0002660176 14:0.0000962959 15:0.0075640712 16:0.0000997316 17:0.0000178409 18:16.0 20:4.0 21:0.0146386093 22:4.0 23:1.0 24:1.0 25:1.0 26:4.0 27:1.0 28:1.0 29:1.0 31:1.0 34:1.0 37:1.0 40:34.0 41:0.5 42:0.25
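
The rows above are in LibSVM text format: a label followed by sparse `index:value` pairs. As a minimal sketch (not part of the original script; `parse_libsvm_line` and `summarize_libsvm` are hypothetical helper names), the following reports the row count and the largest feature index seen in a set of rows, which may help when comparing subsets of the file of different sizes:

```python
def parse_libsvm_line(line):
    """Split one LibSVM row into (label, {feature_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def summarize_libsvm(lines):
    """Return (row_count, max_feature_index) over an iterable of rows."""
    rows = 0
    max_index = -1  # stays -1 if no feature is ever seen
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, features = parse_libsvm_line(line)
        rows += 1
        if features:
            max_index = max(max_index, max(features))
    return rows, max_index
```

A file object can be passed directly, for example `summarize_libsvm(open('train_data.txt'))`, since iterating over it yields one row per line.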

I ran the code under valgrind. With 200k data instances I got the errors shown below; with 100k data instances I did not get any error. Please help me with this issue.

==27020== Invalid read of size 4
==27020==    at 0x1A1F2CF3: xgboost::tree::CQHistMaker<xgboost::tree::GradStats>::InitWorkSet(xgboost::DMatrix*, xgboost::RegTree const&, std::vector<unsigned int, std::allocator<unsigned int> >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A1F697D: xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A25D75F: xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A25EC1E: xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A0E200D: xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A276664: XGBoosterUpdateOneIter (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x715EEB1: ffi_call_unix64 (unix64.S:76)
==27020==    by 0x715E8D3: ffi_call (ffi64.c:525)
==27020==    by 0x714D8F7: _call_function_pointer (callproc.c:836)
==27020==    by 0x714E520: _ctypes_callproc (callproc.c:1179)
==27020==    by 0x714710C: PyCFuncPtr_call (_ctypes.c:3965)
==27020==    by 0x14B743: PyObject_Call (abstract.c:2546)
==27020==  Address 0x1396a040 is 8 bytes after a block of size 344 alloc'd
==27020==    at 0x4C2C21F: operator new(unsigned long) (vg_replace_malloc.c:334)
==27020==    by 0x1A1F3384: xgboost::tree::CQHistMaker<xgboost::tree::GradStats>::InitWorkSet(xgboost::DMatrix*, xgboost::RegTree const&, std::vector<unsigned int, std::allocator<unsigned int> >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A1F697D: xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A25D75F: xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A25EC1E: xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A0E200D: xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x1A276664: XGBoosterUpdateOneIter (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==27020==    by 0x715EEB1: ffi_call_unix64 (unix64.S:76)
==27020==    by 0x715E8D3: ffi_call (ffi64.c:525)
==27020==    by 0x714D8F7: _call_function_pointer (callproc.c:836)
==27020==    by 0x714E520: _ctypes_callproc (callproc.c:1179)
==27020==    by 0x714710C: PyCFuncPtr_call (_ctypes.c:3965)
==27020== 

==6607== Invalid write of size 4
==6607== at 0x1A1F2D21: xgboost::tree::CQHistMaker<xgboost::tree::GradStats>::InitWorkSet(xgboost::DMatrix*, xgboost::RegTree const&, std::vector<unsigned int, std::allocator<unsigned int> >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A1F697D: xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A25D75F: xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A25EC1E: xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A0E200D: xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A276664: XGBoosterUpdateOneIter (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x715EEB1: ffi_call_unix64 (unix64.S:76)
==6607== by 0x715E8D3: ffi_call (ffi64.c:525)
==6607== by 0x714D8F7: _call_function_pointer (callproc.c:836)
==6607== by 0x714E520: _ctypes_callproc (callproc.c:1179)
==6607== by 0x714710C: PyCFuncPtr_call (_ctypes.c:3965)
==6607== by 0x14B743: PyObject_Call (abstract.c:2546)
==6607== Address 0x5dca6f4 is 12 bytes after a block of size 344 alloc'd
==6607== at 0x4C2C21F: operator new(unsigned long) (vg_replace_malloc.c:334)
==6607== by 0x1A1F3384: xgboost::tree::CQHistMaker<xgboost::tree::GradStats>::InitWorkSet(xgboost::DMatrix*, xgboost::RegTree const&, std::vector<unsigned int, std::allocator<unsigned int> >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A1F697D: xgboost::tree::HistMaker<xgboost::tree::GradStats>::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A25D75F: xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A25EC1E: xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::ObjFunction*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A0E200D: xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x1A276664: XGBoosterUpdateOneIter (in /home/username/python_debug_env/lib/python2.7/site-packages/xgboost-0.81-py2.7.egg/xgboost/lib/libxgboost.so)
==6607== by 0x715EEB1: ffi_call_unix64 (unix64.S:76)
==6607== by 0x715E8D3: ffi_call (ffi64.c:525)
==6607== by 0x714D8F7: _call_function_pointer (callproc.c:836)
==6607== by 0x714E520: _ctypes_callproc (callproc.c:1179)
==6607== by 0x714710C: PyCFuncPtr_call (_ctypes.c:3965)
==6607==
==6607== Invalid read of size 8
==6607== at 0x24D762: visit_decref (gcmodule.c:360)
==6607== by 0x18797A: dict_traverse (dictobject.c:2114)
==6607== by 0x24D853: subtract_refs (gcmodule.c:385)
==6607== by 0x24E8B5: collect (gcmodule.c:925)
==6607== by 0x24F639: PyGC_Collect (gcmodule.c:1440)
==6607== by 0x233A07: Py_Finalize (pythonrun.c:448)
==6607== by 0x140308: Py_Main (main.c:665)
==6607== by 0x13EDFF: main (python.c:23)
==6607== Address 0x3f8000000051f238 is not stack'd, malloc'd or (recently) free'd
==6607==
==6607==
==6607== Process terminating with default action of signal 11 (SIGSEGV)
==6607== General Protection Fault
==6607== at 0x24D762: visit_decref (gcmodule.c:360)
==6607== by 0x18797A: dict_traverse (dictobject.c:2114)
==6607== by 0x24D853: subtract_refs (gcmodule.c:385)
==6607== by 0x24E8B5: collect (gcmodule.c:925)
==6607== by 0x24F639: PyGC_Collect (gcmodule.c:1440)
==6607== by 0x233A07: Py_Finalize (pythonrun.c:448)
==6607== by 0x140308: Py_Main (main.c:665)
==6607== by 0x13EDFF: main (python.c:23)
==6607==
==6607== HEAP SUMMARY:
==6607== in use at exit: 214,472,668 bytes in 68,975 blocks
==6607== total heap usage: 197,280 allocs, 128,305 frees, 967,454,846 bytes allocated
==6607==
==6607== LEAK SUMMARY:
==6607== definitely lost: 104 bytes in 2 blocks
==6607== indirectly lost: 0 bytes in 0 blocks
==6607== possibly lost: 3,167,370 bytes in 18,705 blocks
==6607== still reachable: 211,305,162 bytes in 50,267 blocks
==6607== of which reachable via heuristic:
==6607== newarray : 560 bytes in 35 blocks
==6607== suppressed: 32 bytes in 1 blocks
==6607== Rerun with --leak-check=full to see details of leaked memory
==6607==
==6607== For counts of detected and suppressed errors, rerun with: -v
==6607== ERROR SUMMARY: 69 errors from 5 contexts (suppressed: 0 from 0)
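
Since the errors show up with 200k rows but not with 100k, one way to narrow down the row count at which they first appear is to train on progressively larger prefixes of the file. The helper below is only a sketch of that idea; `write_head` is a hypothetical name and is not part of XGBoost:

```python
import itertools


def write_head(src_path, dst_path, n_rows):
    """Copy the first n_rows lines of src_path into dst_path.

    Returns the number of lines actually written, which is smaller
    than n_rows when the source file is shorter.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in itertools.islice(src, n_rows):
            dst.write(line)
            written += 1
    return written
```

For example, `write_head('train_data.txt', 'train_150k.txt', 150000)` would produce a 150k-row subset (the output file name is illustrative). Assuming stale cache files could interfere between runs, it is probably safest to give each subset its own cache name after the `#` when building the `DMatrix`.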
