Skip to content

Conversation

younik
Copy link
Member

@younik younik commented Jun 23, 2023

Description

Add support for gym.spaces.Text

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have run pytest -v and no errors are present.
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@younik
Copy link
Member Author

younik commented Jun 23, 2023

For now it only supports UTF-8; it should be possible to support unicode using NumPy object arrays; it is just less memory efficient

), "Nan found after cast to nump array, check the type of 'data'."
if all(map(lambda elem: isinstance(elem, str), data)):
data_shape = (len(data),)
dtype = h5py.string_dtype(encoding="utf-8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see how this is a bottleneck as there's only this option or ascii (which is less chars), so given that I think it's fine to do utf-8 https://docs.h5py.org/en/stable/special.html#h5py.string_dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some more searching around and this shouldn't be a problem after all, you can still encode full unicode characters using utf-8 (on unix they are encoded that way by default, on windows apparently it's utf-16), it just uses multi-byte encoding (https://unicodebook.readthedocs.io/unicode_encodings.html). Not sure how exactly it would work with gymnasium spaces but I think it should be good enough, utf-8 is the standard for pretty much everything. If a user really needs to use unicode characters in their gymnasium action/obs space I think they should be able to set the charset as just utf-8 and encode the unicode characters into utf-8.

tests/common.py Outdated
def __init__(self):
self.action_space = spaces.Text(max_length=10, min_length=2, charset="01")

self.observation_space = spaces.Text(max_length=20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably set Text(max_length=20, charset=string.printable) as that will be a list of all the printable ascii chars (length 100) rather than just the letters a-z and digits which is the default for gymnasium. You could have the charset be all of the utf-8 characters but that would be a million (as all unicode chars can be encoded) so sampling from that will probably take a while and not be ideal.
If you want the printable unicode multilingual characters (65,000) length it would be utf8_characters = [chr(i) for i in range(0x10000) if unicodedata.category(chr(i)) != "Cc"], but that's probably not necessary besides maybe showing people that it can be done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying with all unicode it fails with an informative error UnicodeEncodeError: 'utf-8' codec can't encode character '\udb25' in position 2: surrogates not allowed

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to somehow create a charset for all legal unicode characters, and decide what our stance is on surrogate characters, this may be nontrivial. Naively, I tried:

utf8_charset = "".join([chr(i) for i in range(sys.maxunicode)])

and I got a similar error

E   UnicodeEncodeError: 'utf-8' codec can't encode character '\ud937' in position 3: surrogates not allowed

Here is a h5py doc page talking about surrogate characters:
https://docs.h5py.org/en/3.2.1/strings.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean with "decide what our stance is on surrogate characters"

Copy link
Collaborator

@balisujohn balisujohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a test which uses full UTF-8 as the charset for a Text Space, similarly to how @elliottower described for the unicode multilingual characters. https://github.com/Farama-Foundation/Minari/pull/99/files#r1240189639

For this, I'm curious if, we might need a special flag to mean full utf-8 when serializing, instead of serializing a 1mil+ character string.

I'm leaning towards asking for a way to test also that the strings which get saved are exactly the same as the strings you get when loading the episode and not just check that they are in the space like we are currently doing. We could improve the dataset creation tests generally to enforce this for all space types, but let's say this last paragraph is a suggestion for a later PR.

@elliottower
Copy link
Contributor

I think there should be a test which uses full UTF-8 as the charset for a Text Space, similarly to how @elliottower described for the unicode multilingual characters. https://github.com/Farama-Foundation/Minari/pull/99/files#r1240189639

For this, I'm curious if, we might need a special flag to mean full utf-8 when serializing, instead of serializing a 1mil+ character string.

I'm leaning towards asking for a way to test also that the strings which get saved are exactly the same as the strings you get when loading the episode and not just check that they are in the space like we are currently doing. We could improve the dataset creation tests generally to enforce this for all space types, but let's say this last paragraph is a suggestion for a later PR.

More I think about it the more feel like the gymnasium spaces are kind of irrelevant to serializing, no matter what it’ll always go into the memory as utf8 encoding, people can do whatever subset of Unicode that they want and it’ll always be encodable that way.

@balisujohn
Copy link
Collaborator

balisujohn commented Jun 23, 2023

In general, I think that providing space serialization means a contract that the user will be able to fully recover the space specification beyond what is implicitly present in the data itself. We provide bounds on box and discrete spaces, and I think we need to provide similar metadata on serialized text spaces. Even if we don't need to know the space to faithfully reproduce the data, the user still needs to be able to recover the space from the serialization. I don't think it will add too much complexity if we add a boolean flag to a Text space serialized json that is set to true if all unicode characters are supported and false otherwise, and if it is set to true, the charset is left empty in the serialization.

We might even want to consider upstreaming something similar into Gymnasium.spaces.Text (because with the actual Text spaces in mempry, we will stilll have to use the million plus character long charset string, which could be represented as a couple bytes with either boundary-based charsets or a special "all utf-8" flag.

@elliottower
Copy link
Contributor

In general, I think that providing space serialization means a contract that the user will be able to fully recover the space specification beyond what is implicitly present in the data itself. We provide bounds on box and discrete spaces, and I think we need to provide similar metadata on serialized text spaces. Even if we don't need to know the space to faithfully reproduce the data, the user still needs to be able to recover the space from the serialization. I don't think it will add too much complexity if we add a boolean flag to a Text space serialized json that is set to true if all unicode characters are supported and false otherwise, and if it is set to true, the charset is left empty in the serialization.

We might even want to consider upstreaming something similar into Gymnasium.spaces.Text (because with the actual Text spaces in mempry, we will stilll have to use the million plus character long charset string, which could be represented as a couple bytes with either boundary-based charsets or a special "all utf-8" flag.

Fair enough about users needing to know. We should talk to Mark about it, I think a reasonable thing for gymnasium spaces would be to have encoding options like ascii or utf-8, but we should probably look into things more and get a better idea of what users will want. Last we talked Mark and Ariel said it probably didn’t make sense to have the charset argument or at least there should be easy ways to do utf-8 without specifying a list of every valid character

@younik
Copy link
Member Author

younik commented Jun 24, 2023

For how Text space is right now, I don't think it makes sense to have a special case for the full UTF-8 charset. As the user needs to specify it and all the characters are saved in a variable, I think it makes sense to save all the characters as well.
If in Gymnasium they change it (for example, they make the full UTF-8 set as the default value), then it makes sense to support it better here. In that case, a boolean flag or simply not saving the field charset (and using the UTF-8 charset as default while loading) makes more sense.

@younik younik requested a review from balisujohn June 24, 2023 10:59
tests/common.py Outdated
if not data.observation_space.contains(obs):
import pdb

pdb.set_trace()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is leftover from debugging

@@ -143,6 +143,23 @@ def reset(self, seed=None, options=None):
return self.observation_space.sample(), {}


class DummyTextEnv(gym.Env):
def __init__(self):
self.action_space = spaces.Text(max_length=10, min_length=2, charset="01")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to create a dummy env with a full unicode charset minus surrogate chars that returns emoji as observations, as a stress test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case for DummyTextEnv for observation, isn't it?
Do you want it also for action space?

Copy link
Collaborator

@balisujohn balisujohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to merge, once the doc is updated to reflect text space support :^)

@younik younik requested a review from balisujohn June 29, 2023 23:10
Copy link
Collaborator

@balisujohn balisujohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :^)

@balisujohn balisujohn merged commit c4f8d95 into Farama-Foundation:main Jun 30, 2023
@younik younik deleted the support-text-space branch May 26, 2024 09:52
younik added a commit to younik/Minari that referenced this pull request Mar 21, 2025
* support text spaces

* test full utf-8 charset

* reformat

* test utf-16 without surrogate

* remove debugging leftover

* fix bug in data_collector

* add cursed string to test

* add documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants