Pandas 2.0.0 Compatibility #419

tempoxylophone · 2023-04-03T22:14:57Z

Pandas 2.0.0 was released today. The read_csv() function gained a new dtype_backend argument, and the release appears to change the default behavior when reading null values.

When the respective master.csv files are read with Pandas 2.0.0, a cell containing the string "None" now appears to be parsed as NaN by default. This is problematic wherever the meta_info dictionary's 'additional edge files' and 'additional node files' entries are assumed to be strings.

See lines 106 and 111 in ogb/linkproppred/dataset.py:

if self.meta_info['additional node files'] == 'None':
    additional_node_files = []
else:
    additional_node_files = self.meta_info['additional node files'].split(',')

Calling .split(',') on such a value raises AttributeError: 'float' object has no attribute 'split', because NaN is a float rather than a string.
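The failure can be reproduced without pandas, since NaN is an ordinary Python float. The sketch below first triggers the same AttributeError, then shows a defensive helper that tolerates NaN. The name split_file_list is hypothetical and is not part of ogb; it only illustrates the failure mode and is not the approach taken in this pull request.

```python
import math

# NaN is a float, so it has no string methods such as .split().
try:
    float("nan").split(",")
except AttributeError as exc:
    message = str(exc)


def split_file_list(value):
    """Turn a metadata cell into a list of file names.

    Hypothetical helper (not in ogb): treats NaN, 'None' and the
    empty string as "no additional files".
    """
    if isinstance(value, float) and math.isnan(value):
        return []
    if value in ("None", ""):
        return []
    return value.split(",")
```

A guard like this would have to be repeated at every place the metadata is consumed, which is one reason to fix the parsing at read time instead.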

There are more instances that can cause this exception beyond the two above.

I considered two options: either edit the scripts and the corresponding master.csv files to contain the empty string instead of 'None', which is parsed as "" instead of NaN, or add the keyword argument keep_default_na=False to every pd.read_csv call where this could be an issue. This keyword argument prevents the "None" values from being parsed as NaN.

Since the latter option touches more call sites and would require a larger diff, I opted for the former approach. This involved editing the make_master_file.py files in their respective directories. I may also have discovered a small inconsistency in ogb/linkproppred/make_master_file.py for the has_edge_attr property of the ogbl-vessel dataset: the script sets this property to False, but the csv file committed in the latest release has it set to True.


tempoxylophone · 2023-04-03T22:28:37Z

Closing this because editing the csv files in this way produces the same problem in earlier versions of pandas. A better solution is to add the keep_default_na=False keyword argument everywhere the csv files are read.
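A minimal sketch of the difference, assuming a made-up two-dataset excerpt laid out like a master.csv (the dataset names and file names below are hypothetical, not taken from the real files). With default settings an empty cell is read as NaN, which is why swapping 'None' for the empty string does not help. With keep_default_na=False both 'None' and the empty string stay strings.

```python
import io

import pandas as pd

# Hypothetical excerpt; rows are properties, columns are datasets.
CSV_TEXT = (
    ",dataset-a,dataset-b\n"
    "additional node files,None,\n"
    'additional edge files,"edge_year,edge_weight",None\n'
)

# Default parsing: empty cells become NaN (and, per the report above,
# so does the string "None" under pandas 2.0.0).
default = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

# With keep_default_na=False and no na_values, nothing is converted to NaN.
literal = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, keep_default_na=False)

empty_cell_is_nan = pd.isna(default.loc["additional node files", "dataset-b"])
none_cell = literal.loc["additional node files", "dataset-a"]
empty_cell = literal.loc["additional node files", "dataset-b"]
edge_files = literal.loc["additional edge files", "dataset-a"].split(",")
```

Because every cell read with keep_default_na=False is a string, the existing comparisons against 'None' and the later .split(',') calls keep working.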

tempoxylophone added 2 commits April 3, 2023 17:49

pandas 2.0.0 read_csv compat.

1f7ae4f

changed attribute 'has_edge_attr' in linkpropped/make_master_file fro…

f498add

…m False to True to be consistent with .csv file presently in release on main branch.

tempoxylophone closed this Apr 3, 2023

tempoxylophone deleted the pandas_2_0_0_csv_loader branch April 3, 2023 22:28

tempoxylophone mentioned this pull request Apr 3, 2023

Pandas 2.0.0 Compatibility with read_csv #420

Merged
