Contributor

@leon-g-xu leon-g-xu commented May 24, 2024

This exposes the dtype in the data config so that we can support reading memmap files with different dtypes.
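A minimal sketch of why the dtype has to be configurable, assuming the token files are raw arrays read through `numpy.memmap`. The file name and token values are made up for illustration and are not taken from this PR. Reading a `uint32` file with a `uint16` dtype does not fail; it silently yields twice as many, wrong, values.

```python
import os
import tempfile

import numpy as np

# Hypothetical token IDs; two of them exceed the uint16 maximum (65535),
# as can happen with a large vocabulary.
token_ids = np.array([5, 70_000, 123_456, 9], dtype=np.uint32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")  # hypothetical file name
    token_ids.tofile(path)  # raw bytes, no header

    # Matching dtype: the original IDs come back unchanged.
    right = np.memmap(path, dtype=np.uint32, mode="r")
    # Mismatched dtype: every 4-byte ID is split into two 2-byte values.
    wrong = np.memmap(path, dtype=np.uint16, mode="r")

    right_values = right.tolist()
    wrong_length = len(wrong)

    # Release the mappings before the temporary directory is removed.
    del right, wrong
```

Because the mismatch raises no error, the dtype used for reading needs to be set explicitly per dataset rather than assumed.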

@2015aroras 2015aroras self-requested a review June 6, 2024 22:53
Collaborator

@2015aroras 2015aroras left a comment

Thank you for the change! Just left some small comments.

@@ -36,6 +36,7 @@ def build_memmap_dataset(
return MemMapDataset(
*paths,
chunk_size=train_config.model.max_sequence_length,
memmap_dtype=train_config.data.effective_memmap_dtype,
Collaborator

This method is also used to set up the memmaps for evaluation. In that case, data_config is not the same as train_config.data. We should respect the setting in data_config.

Suggested change
- memmap_dtype=train_config.data.effective_memmap_dtype,
+ memmap_dtype=data_config.effective_memmap_dtype,
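The difference can be sketched with simplified stand-in classes. `DataConfig`, `TrainConfig`, and `memmap_kwargs` below are hypothetical reductions for illustration, not the OLMo definitions, and a plain `memmap_dtype` string stands in for `effective_memmap_dtype`, whose implementation is not shown in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataConfig:
    # Hypothetical, reduced stand-in for the real data config.
    paths: List[str] = field(default_factory=list)
    memmap_dtype: str = "uint16"


@dataclass
class TrainConfig:
    # Hypothetical, reduced stand-in for the real training config.
    max_sequence_length: int = 2048
    data: DataConfig = field(default_factory=DataConfig)


def memmap_kwargs(train_config: TrainConfig, data_config: DataConfig) -> Dict[str, Any]:
    """Collect dataset arguments for the dataset described by data_config."""
    return {
        # Sequence length is a model-wide setting, so it comes from train_config.
        "chunk_size": train_config.max_sequence_length,
        # The dtype belongs to the specific dataset, so it comes from data_config.
        "memmap_dtype": data_config.memmap_dtype,
    }


train_config = TrainConfig(data=DataConfig(paths=["train.bin"], memmap_dtype="uint16"))
eval_data = DataConfig(paths=["eval.bin"], memmap_dtype="uint32")

train_kwargs = memmap_kwargs(train_config, train_config.data)
eval_kwargs = memmap_kwargs(train_config, eval_data)
```

With `train_config.data` hard-coded inside the function, the evaluation dataset here would have been read as `uint16` even though its own config says `uint32`.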

Contributor Author

Updated

olmo/config.py Outdated
@@ -555,6 +556,7 @@ class InstanceFilterConfig(BaseConfig):
@dataclass
class DataConfig(BaseConfig):
paths: Optional[List[str]] = None
memmap_dtype: Optional[str] = "uint16"
Collaborator

Could you make this non-optional? I don't think None is useful here.

Suggested change
- memmap_dtype: Optional[str] = "uint16"
+ memmap_dtype: str = "uint16"

Contributor Author

Updated

@leon-g-xu
Contributor Author

@2015aroras Thanks for the review. I've updated the PR to address the comments.

Collaborator

@2015aroras 2015aroras left a comment

There are style issues due to imports. Make sure to follow the steps here so that the required automatic checks pass.

@leon-g-xu
Contributor Author

leon-g-xu commented Jun 7, 2024

> There are style issues due to imports. Make sure to follow the steps here so that the required automatic checks pass.

@2015aroras I went through the instructions and added everything that was missing. Can I get another approval so that it kicks off all the automatic checks?

@2015aroras 2015aroras merged commit 2639279 into allenai:main Jun 7, 2024
@leon-g-xu leon-g-xu deleted the lx/expose-data-dtype branch June 18, 2024 17:48