Training behaviour difference between v1.1.0 and v1.3.1 #6552

Description

@fordicus

Merry Christmas, all.

I am seeing different behaviour between an old version and the newest
version of XGBoost. Here is some information.

  • 1.3.1 (pip install xgboost) — the newest one
  • 1.1.0 (pip install -Iv xgboost==1.1.0) — the old one that I used
  • Ubuntu 18.04.4 LTS
  • CUDA 11.1.74
  • Quadro RTX 8000 48 GB

The data file (7.49 GB) can be downloaded from Mega:
https://mega.nz/file/irQXCSbA#rfl53ewh88k_pBsVLw_fA5hC6mz91Vi9yRUrMqEMYno

or from Google Drive:
https://drive.google.com/file/d/14RCv2COi6rw7JWDkYp9825z7zSHY1ogQ/view?usp=sharing

The issue with the old version was that I could not dump the trained model,
although the training itself progressed very nicely.
( https://discuss.xgboost.ai/t/serialisation-of-large-models/1897/2 )

Since I am aware there has been a lot of effort to improve the dump process,
I tried the newest version. But with the newest version, the issue is that
I do not see the same training behaviour as with the old one. Simply put,
it takes far longer for the rounds to proceed, and I eventually terminated
the run with a keyboard interrupt.

With v1.3.1, not a single round completed within 30 minutes.
With v1.1.0, after waiting 30 minutes, the terminal output reads:

[0] convergence-merror:0.10068
[1] convergence-merror:0.10052
[2] convergence-merror:0.09998
[3] convergence-merror:0.09930
[4] convergence-merror:0.09852
[5] convergence-merror:0.09773
[6] convergence-merror:0.09690
[7] convergence-merror:0.09608
[8] convergence-merror:0.09524
[9] convergence-merror:0.09439
[10] convergence-merror:0.09355
[11] convergence-merror:0.09268
[12] convergence-merror:0.09182
[13] convergence-merror:0.09097
[14] convergence-merror:0.09013
[15] convergence-merror:0.08930
[16] convergence-merror:0.08855
[17] convergence-merror:0.08781
[18] convergence-merror:0.08708
[19] convergence-merror:0.08636
[20] convergence-merror:0.08557
[21] convergence-merror:0.08477

Note that removing the newest version and reinstalling the old one recovers
the previous good training behaviour.
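
To make the comparison concrete, here is a minimal timing sketch that can be run
unchanged under both versions. The random stand-in data and the feature count are
placeholders for illustration only (the real experiment uses the .mat file above);
the parameters mirror those in the full script at the end of this post.

import time
import numpy as np
import xgboost as xgb

X = np.random.rand(200000, 50)                   # stand-in data for illustration only
y = np.random.randint(0, 3, size = 200000)       # three classes, as in the real data
dtrain = xgb.DMatrix(data = X, label = y)

params = {'tree_method': 'gpu_hist', 'grow_policy': 'lossguide',
          'max_depth': 21, 'num_class': 3, 'objective': 'multi:softmax'}

t0 = time.perf_counter()
xgb.train(params, dtrain, num_boost_round = 1)   # time a single boosting round
print('xgboost %s: first round took %.1f s' % (xgb.__version__, time.perf_counter() - t0))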

Since the code I use matters, I have put it at the end of this post.
When running the script, you pass a command-line argument for the number of training rounds;
the path to the data file mentioned above is hard-coded. I would appreciate any advice.
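
For concreteness, a hypothetical invocation (assuming the script is saved as train_xgb.py)
would be

python train_xgb.py 200

which trains for 200 rounds and then attempts to write the trained model to 200.json
under the hard-coded output directory.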

Besides, I also want to study why this is happening on my side.
Could anyone let me know how to download the full source code of v1.1.0,
the version I can install with ( pip install -Iv xgboost==1.1.0 )?

I want to study the routines and learn about the implementation myself.
Since I am testing on Ubuntu, I would study and build the full source of the
old version on Ubuntu. I am not very experienced with GitHub and do not know
how to download the buildable source of an old release.
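
A minimal sketch of what I think should work, assuming the release is tagged v1.1.0
in the dmlc/xgboost repository (I would appreciate confirmation that this matches
what pip installs):

git clone --recursive --branch v1.1.0 --depth 1 https://github.com/dmlc/xgboost.git xgboost-1.1.0

The --recursive flag should also fetch the submodules that the build needs.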

Thanks a lot in advance.

Comment: I copied the code from Sublime Text, so some spacing may look
irregular here; hopefully it looks better in a text editor.

#-----------------------------------------------------------------------------------------
# modules
#-----------------------------------------------------------------------------------------
from   sklearn.metrics import accuracy_score # convenient accuracy measure
import xgboost         as     xgb            # XGBoost software package
import numpy           as     np             # numerical arrays (predictions are cast below)
import h5py                                  # deal with .mat data files
import sys                                   # deal with command arguments
import gc; gc.enable()                       # garbage memory collection


#-------------------------------------------------------------------------------
# functions
#-------------------------------------------------------------------------------
def serialise_model(__pne__, __model__):
    __model__.save_model(__pne__); gc.collect()

def deserialise_model(__pne__):
    __model__ = xgb.Booster(); __model__.load_model(__pne__)
    return __model__

def load_mat(__pne__):  # load .mat data and return xgboost.DMatrix
    print('load_mat(%s): Commence.' % __pne__, flush = 1)
    Mat = h5py.File(__pne__, 'r')  # open read-only
    X__ = (Mat.get('X')[()]).transpose();
    y__ = (Mat.get('y')[()]).transpose().flatten(); del Mat;
    Xy_ = xgb.DMatrix(data = X__, label = y__)
    __shape__ = X__.shape
    print('load_mat(): DMatrix ready. X__.shape = ', end = '', flush = 1)
    print(__shape__, flush = 1)
    del X__; gc.collect();
    return Xy_, y__, __shape__;

def set_model_params(__max_depth__, __num_class__): 
    xgb_params = {
        'process_type':               'default',        # \in {default, update} for the continuation of a tree.
        'tree_method':                'gpu_hist',
        'booster':                    'gbtree',
        'grow_policy':                'lossguide',      # lossguide: split at nodes with highest loss change.
        'num_parallel_tree':          1,                # number of trees in boosted random forest.
        'min_split_loss':             0,                # 0: split whenever it improves.
        'learning_rate':              1.0,              # 1: no decay. weight on previous trees.
        'max_depth':                  __max_depth__,    # \in [0, Inf], 0 accepted if (hist | lossguided). cf. memory (!)
        'max_leaves':                 0,
        'reg_lambda':                 0.0,              # L2-regularisation
        'reg_alpha':                  0.0,              # L1-regularisation
        'num_class':                  __num_class__,
        'objective':                  'multi:softmax',
        'eval_metric':                'merror',
        'predictor':                  'gpu_predictor',
        'verbosity':                  1,
        'validate_parameters':        0,
        'single_precision_histogram': 0,                # gpu_hist can fail with single precision
        'deterministic_histogram':    0                 # default 1 applies rounding, which can lose accuracy
    }

    return xgb_params


#-----------------------------------------------------------------------------------------
# data
#-----------------------------------------------------------------------------------------
data_pne         = '/workspace/temp/data.mat'
my_model_path    = '/workspace/temp/'
num_classes      = 3
my_max_depth     = 21 # 21
print('from: %s' % (data_pne))
print('to:   %s' % (my_model_path))


#-----------------------------------------------------------------------------------------
# user parameters
#-----------------------------------------------------------------------------------------
if len(sys.argv) != 2: 
    print('Give number of rounds');
    exit()

dump_model    = 1
my_print_freq = 1
xgb_params    = set_model_params(my_max_depth, num_classes)


#-----------------------------------------------------------------------------------------
# ----------------------------------------------------- using xgboost-core
#-----------------------------------------------------------------------------------------
Xy, y, shape  = load_mat(data_pne)
progress      = dict()
my_num_rounds = int(sys.argv[1])
my_model_pne  = '%s/%03d.json' % (my_model_path, my_num_rounds)

try:
    # resume from an existing model if one was already saved for this round count
    M0 = deserialise_model(my_model_pne)
    print('Existing model detected.\n')

except Exception:
    # no saved model found: train a new one from scratch
    M0 = None; print('Model will be newly trained.\n')
    M0 = xgb.train(xgb_params, Xy, my_num_rounds, [(Xy, 'convergence')],
        evals_result = progress,
        verbose_eval = my_print_freq,
        xgb_model    = M0)

## report result
y0  = np.ubyte(M0.predict(Xy))
ACC = accuracy_score(y0, np.ubyte(y))
print('\nAccuracy = %6.2f%%' % (ACC * 100.0) )

## secure memory as much as possible
del Xy; del y; gc.collect()

## saving the trained model
if dump_model == 1:
    print('Dumping model M0: ', end = '', flush = 1);
    try:
        serialise_model(my_model_pne, M0)
    except:
        print('serialisation failure.')
