-
-
Notifications
You must be signed in to change notification settings - Fork 62
ENH: Support text spaces #99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
For now it only supports |
), "Nan found after cast to nump array, check the type of 'data'." | ||
if all(map(lambda elem: isinstance(elem, str), data)): | ||
data_shape = (len(data),) | ||
dtype = h5py.string_dtype(encoding="utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see how this is a bottleneck as there's only this option or ascii (which is less chars), so given that I think it's fine to do utf-8 https://docs.h5py.org/en/stable/special.html#h5py.string_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some more searching around and this shouldn't be a problem after all, you can still encode full unicode characters using utf-8 (on unix they are encoded that way by default, on windows apparently it's utf-16), it just uses multi-byte encoding (https://unicodebook.readthedocs.io/unicode_encodings.html). Not sure how exactly it would work with gymnasium spaces but I think it should be good enough, utf-8 is the standard for pretty much everything. If a user really needs to use unicode characters in their gymnasium action/obs space I think they should be able to set the charset as just utf-8 and encode the unicode characters into utf-8.
tests/common.py
Outdated
def __init__(self): | ||
self.action_space = spaces.Text(max_length=10, min_length=2, charset="01") | ||
|
||
self.observation_space = spaces.Text(max_length=20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably set Text(max_length=20, charset=string.printable)
as that will be a list of all the printable ascii chars (length 100) rather than just the letters a-z and digits which is the default for gymnasium. You could have the charset be all of the utf-8 characters but that would be a million (as all unicode chars can be encoded) so sampling from that will probably take a while and not be ideal.
If you want the printable unicode multilingual characters (65,000) length it would be utf8_characters = [chr(i) for i in range(0x10000) if unicodedata.category(chr(i)) != "Cc"]
, but that's probably not necessary besides maybe showing people that it can be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying with all unicode it fails with an informative error UnicodeEncodeError: 'utf-8' codec can't encode character '\udb25' in position 2: surrogates not allowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to somehow create a charset for all legal unicode characters, and decide what our stance is on surrogate characters, this may be nontrivial. Naively, I tried:
utf8_charset = "".join([chr(i) for i in range(sys.maxunicode)])
and I got a similar error
E UnicodeEncodeError: 'utf-8' codec can't encode character '\ud937' in position 3: surrogates not allowed
Here is a h5py doc page talking about surrogate characters:
https://docs.h5py.org/en/3.2.1/strings.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean with "decide what our stance is on surrogate characters"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be a test which uses full UTF-8 as the charset for a Text Space, similarly to how @elliottower described for the unicode multilingual characters. https://github.com/Farama-Foundation/Minari/pull/99/files#r1240189639
For this, I'm curious if, we might need a special flag to mean full utf-8 when serializing, instead of serializing a 1mil+ character string.
I'm leaning towards asking for a way to test also that the strings which get saved are exactly the same as the strings you get when loading the episode and not just check that they are in the space like we are currently doing. We could improve the dataset creation tests generally to enforce this for all space types, but let's say this last paragraph is a suggestion for a later PR.
More I think about it the more feel like the gymnasium spaces are kind of irrelevant to serializing, no matter what it’ll always go into the memory as utf8 encoding, people can do whatever subset of Unicode that they want and it’ll always be encodable that way. |
In general, I think that providing space serialization means a contract that the user will be able to fully recover the space specification beyond what is implicitly present in the data itself. We provide bounds on box and discrete spaces, and I think we need to provide similar metadata on serialized text spaces. Even if we don't need to know the space to faithfully reproduce the data, the user still needs to be able to recover the space from the serialization. I don't think it will add too much complexity if we add a boolean flag to a We might even want to consider upstreaming something similar into |
Fair enough about users needing to know. We should talk to Mark about it, I think a reasonable thing for gymnasium spaces would be to have encoding options like ascii or utf-8, but we should probably look into things more and get a better idea of what users will want. Last we talked Mark and Ariel said it probably didn’t make sense to have the charset argument or at least there should be easy ways to do utf-8 without specifying a list of every valid character |
For how |
tests/common.py
Outdated
if not data.observation_space.contains(obs): | ||
import pdb | ||
|
||
pdb.set_trace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is leftover from debugging
@@ -143,6 +143,23 @@ def reset(self, seed=None, options=None): | |||
return self.observation_space.sample(), {} | |||
|
|||
|
|||
class DummyTextEnv(gym.Env): | |||
def __init__(self): | |||
self.action_space = spaces.Text(max_length=10, min_length=2, charset="01") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to create a dummy env with a full unicode charset minus surrogate chars that returns emoji as observations, as a stress test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case for DummyTextEnv
for observation, isn't it?
Do you want it also for action space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to merge, once the doc is updated to reflect text space support :^)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :^)
* support text spaces * test full utf-8 charset * reformat * test utf-16 without surrogate * remove debugging leftover * fix bug in data_collector * add cursed string to test * add documentation
Description
Add support for
gym.spaces.Text
Type of change
Checklist:
pre-commit
checks withpre-commit run --all-files
(seeCONTRIBUTING.md
instructions to set it up)pytest -v
and no errors are present.pytest -v
has generated that are related to my code to the best of my knowledge.