
Writing to hdf5 file slows down enormously for many long keys in the same group #1055

@mariaangelapellegrino

Description

To assist reproducing bugs, please include the following:

  • Operating System: Ubuntu 16.04 4.4.0-119-generic
  • Python version: 3.5.2
  • Where Python was acquired: system
  • h5py version: 2.8.0
  • HDF5 version: 1.10.2
  • The full traceback/stack trace shown: none

We are converting data that maps keys to vectors into HDF5 format. After some time the writing slows down.

After several experiments we found that this happens when many datasets are written to a single group and these datasets have long names (the longer the name, the earlier the problem occurs). In the beginning the writing is rather fast (~20 MB/s), but at some point the speed drops to about 10 KB/s and does not recover afterwards.

The size of the data has no effect. There seems to be some sort of limit on the number and length of the keys.
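A workaround that is often suggested for per-group scaling limits like this (an assumption here, not something tested in this report) is to shard the datasets across subgroups keyed by a short prefix of the name, so that no single group accumulates hundreds of thousands of links. A minimal sketch, with an illustrative two-character sharding scheme:

```python
import hashlib

import numpy as np
import h5py

def write_sharded(path, n):
    """Write n tiny datasets, sharded into subgroups by name prefix."""
    with h5py.File(path, 'w') as f:
        data = np.asarray([1.0], dtype='float64')
        for i in range(n):
            # Same long-name scheme as the repro: 40-char digest repeated 6x
            name = hashlib.sha1(str(i).encode('utf-8')).hexdigest() * 6
            # First two hex chars select one of up to 256 subgroups,
            # keeping each group's link count small
            group = f.require_group(name[:2])
            group.create_dataset(name, data=data)
```

Whether this helps depends on how HDF5 stores the links internally, but it bounds the number of entries any single group has to manage.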

The issue is reproducible with the code below:

import numpy as np
import h5py
import hashlib

def writeManyDatasets():
    file = h5py.File("myfile.h5", 'w')
    for i in range(500000):
        data = np.asarray([1.0], dtype='float64')
        # 40-character SHA-1 hex digest, repeated to get a 240-character name
        encodedName = hashlib.sha1(str(i).encode('utf-8')).hexdigest()
        encodedName = encodedName * 6
        dataset = file.create_dataset(encodedName, data=data)
        if i % 10000 == 0:
            print("Done with " + str(i))
    file.close()

writeManyDatasets()
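One mitigation worth trying (an assumption based on general HDF5 behavior, not something verified in this report) is to create the file with libver='latest'. This allows HDF5 to use its newer group storage for links instead of the old symbol-table format, which is known to degrade when a single group holds very many long-named links. A minimal sketch; the padded numeric names are illustrative, chosen to match the 240-character names in the repro:

```python
import numpy as np
import h5py

def write_many_datasets(path, n):
    """Write n tiny datasets with long names, using the newest file format."""
    # libver='latest' opts in to the newer group/link storage layout
    with h5py.File(path, 'w', libver='latest') as f:
        data = np.asarray([1.0], dtype='float64')
        for i in range(n):
            name = ('%040d' % i) * 6  # 240-character dataset name
            f.create_dataset(name, data=data)

write_many_datasets("myfile_latest.h5", 1000)
```

Note that files written with libver='latest' may not be readable by older HDF5 library versions, so this trades compatibility for scalability.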
