tempoxylophone
Contributor

Pandas 2.0.0 was released today. A new argument called dtype_backend was added to the read_csv() function that appears to affect the default behavior when reading null values.

When the respective master.csv files are read with Pandas 2.0.0, the string value "None" now appears to be parsed as NaN by default. This is problematic wherever the additional edge files and additional node files entries of the meta_info dictionary are assumed to be strings.

  • see lines 106 and 111 in ogb/linkproppred/dataset.py:
if self.meta_info['additional node files'] == 'None':
    additional_node_files = []
else:
    additional_node_files = self.meta_info['additional node files'].split(',')

Calling .split(",") on these values raises AttributeError: 'float' object has no attribute 'split', because NaN is a float rather than a string.
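A minimal reproduction sketch of the failure follows. The CSV text and the dataset name toy-dataset are hypothetical stand-ins, not the real master.csv, and na_values=["None"] is passed explicitly so that the sketch behaves the same on any pandas version; it only emulates the default parsing described above.

```python
import io

import pandas as pd

# Hypothetical, trimmed-down stand-in for a master.csv file:
# rows are properties, columns are datasets.
CSV_TEXT = """,toy-dataset
eval metric,mrr
additional node files,None
additional edge files,None
"""

# Assumption: treating "None" as NA is forced here via na_values so the
# outcome does not depend on the installed pandas version.
master = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, na_values=["None"])
meta_info = master["toy-dataset"]
value = meta_info["additional node files"]


def try_split(v):
    """Return v.split(",") or the AttributeError message if v is not a string."""
    try:
        return v.split(",")
    except AttributeError as exc:
        return str(exc)


result = try_split(value)
print(type(value).__name__, "->", result)
```

Because the value is NaN, the comparison with 'None' in the snippet above is False, so the else branch runs and the .split call fails.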

There are more instances that can cause this exception beyond the two above.

I considered two options: either edit the scripts and the corresponding master.csv files to contain the empty string instead of 'None', which is parsed as "" instead of NaN, or add the keyword argument keep_default_na=False to every pd.read_csv call where this could be an issue. This keyword argument prevents the "None" values from being parsed as NaN.
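As a rough sketch of the keep_default_na=False option, the snippet below reads a small in-memory CSV and then applies the same branching as dataset.py. The CSV contents, the dataset name toy-dataset, the file names extra_a and extra_b, and the helper parse_file_list are all hypothetical and only illustrate the idea.

```python
import io

import pandas as pd

# Hypothetical stand-in for a master.csv file (not the real contents).
CSV_TEXT = """,toy-dataset
eval metric,mrr
additional node files,None
additional edge files,"extra_a,extra_b"
"""

# With keep_default_na=False and no na_values, pandas keeps every field
# as the literal string found in the file, including "None".
master = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0, keep_default_na=False)
meta_info = master["toy-dataset"]


def parse_file_list(raw):
    """Mirror the dataset.py logic: 'None' means no additional files."""
    if raw == "None":
        return []
    return raw.split(",")


additional_node_files = parse_file_list(meta_info["additional node files"])
additional_edge_files = parse_file_list(meta_info["additional edge files"])
print(additional_node_files, additional_edge_files)
```

Since "None" stays a string here, the existing comparison against 'None' keeps working and .split is only ever called on strings.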

Since the latter option affects more call sites and would require a larger diff, I opted for the former approach. This involved editing the make_master_file.py files in their respective directories. While doing so, I may have found a small inconsistency in ogb/linkproppred/make_master_file.py concerning the has_edge_attr property of the ogbl-vessel dataset: the script sets this property to False, but the csv file committed in the latest release has it set to True.

…m False to True to be consistent with .csv file presently in release on main branch.
@tempoxylophone
Contributor Author

tempoxylophone commented Apr 3, 2023

Closing this because editing the csv files in this way produces the same problem in earlier versions of pandas. A better solution would be to add the keep_default_na=False keyword argument everywhere the csv files are read.
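To illustrate why the empty-string edit does not help, here is a small sketch comparing the default read with keep_default_na=False on an empty field. The CSV text and the dataset name toy-dataset are hypothetical; the sketch relies on the empty string being one of pandas' default NA markers.

```python
import io

import pandas as pd

# Hypothetical stand-in for a master.csv whose 'None' was replaced by
# an empty field.
CSV_TEXT = """,toy-dataset
eval metric,mrr
additional node files,
"""

# Default read: the empty field is treated as a missing value.
default_read = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)
default_value = default_read["toy-dataset"]["additional node files"]

# Explicit read: the empty field is kept as the empty string.
explicit_read = pd.read_csv(
    io.StringIO(CSV_TEXT), index_col=0, keep_default_na=False
)
explicit_value = explicit_read["toy-dataset"]["additional node files"]

print("default read is NaN:", pd.isna(default_value))
print("explicit read value:", repr(explicit_value))
```

With the default settings the empty field still comes back as NaN, so only the explicit keyword argument keeps the value a string.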

@tempoxylophone tempoxylophone deleted the pandas_2_0_0_csv_loader branch April 3, 2023 22:28