ENH: Support text spaces #99

younik · 2023-06-23T17:57:12Z

Description

Add support for gym.spaces.Text

Type of change

New feature (non-breaking change which adds functionality)

Checklist:

I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
I have run pytest -v and no errors are present.
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I solved any possible warnings that pytest -v has generated that are related to my code to the best of my knowledge.
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

minari/data_collector/data_collector.py

younik · 2023-06-23T18:13:36Z

For now it only supports UTF-8; it should be possible to support unicode using NumPy object arrays; it is just less memory efficient

elliottower · 2023-06-23T18:24:40Z

minari/data_collector/data_collector.py

-                    ), "Nan found after cast to nump array, check the type of 'data'."
+                    if all(map(lambda elem: isinstance(elem, str), data)):
+                        data_shape = (len(data),)
+                        dtype = h5py.string_dtype(encoding="utf-8")


Oh I see how this is a bottleneck as there's only this option or ascii (which is less chars), so given that I think it's fine to do utf-8 https://docs.h5py.org/en/stable/special.html#h5py.string_dtype

Did some more searching around and this shouldn't be a problem after all, you can still encode full unicode characters using utf-8 (on unix they are encoded that way by default, on windows apparently it's utf-16), it just uses multi-byte encoding (https://unicodebook.readthedocs.io/unicode_encodings.html). Not sure how exactly it would work with gymnasium spaces but I think it should be good enough, utf-8 is the standard for pretty much everything. If a user really needs to use unicode characters in their gymnasium action/obs space I think they should be able to set the charset as just utf-8 and encode the unicode characters into utf-8.

elliottower · 2023-06-23T18:51:29Z

tests/common.py

+    def __init__(self):
+        self.action_space = spaces.Text(max_length=10, min_length=2, charset="01")
+
+        self.observation_space = spaces.Text(max_length=20)


You should probably set Text(max_length=20, charset=string.printable) as that will be a list of all the printable ascii chars (length 100) rather than just the letters a-z and digits which is the default for gymnasium. You could have the charset be all of the utf-8 characters but that would be a million (as all unicode chars can be encoded) so sampling from that will probably take a while and not be ideal.
If you want the printable unicode multilingual characters (65,000) length it would be utf8_characters = [chr(i) for i in range(0x10000) if unicodedata.category(chr(i)) != "Cc"], but that's probably not necessary besides maybe showing people that it can be done.

Trying with all unicode it fails with an informative error UnicodeEncodeError: 'utf-8' codec can't encode character '\udb25' in position 2: surrogates not allowed

We need to somehow create a charset for all legal unicode characters, and decide what our stance is on surrogate characters, this may be nontrivial. Naively, I tried:

utf8_charset = "".join([chr(i) for i in range(sys.maxunicode)])

and I got a similar error

E UnicodeEncodeError: 'utf-8' codec can't encode character '\ud937' in position 3: surrogates not allowed

Here is a h5py doc page talking about surrogate characters:
https://docs.h5py.org/en/3.2.1/strings.html

I don't understand what you mean with "decide what our stance is on surrogate characters"

balisujohn

I think there should be a test which uses full UTF-8 as the charset for a Text Space, similarly to how @elliottower described for the unicode multilingual characters. https://github.com/Farama-Foundation/Minari/pull/99/files#r1240189639

For this, I'm curious if, we might need a special flag to mean full utf-8 when serializing, instead of serializing a 1mil+ character string.

I'm leaning towards asking for a way to test also that the strings which get saved are exactly the same as the strings you get when loading the episode and not just check that they are in the space like we are currently doing. We could improve the dataset creation tests generally to enforce this for all space types, but let's say this last paragraph is a suggestion for a later PR.

elliottower · 2023-06-23T23:31:17Z

I think there should be a test which uses full UTF-8 as the charset for a Text Space, similarly to how @elliottower described for the unicode multilingual characters. https://github.com/Farama-Foundation/Minari/pull/99/files#r1240189639

For this, I'm curious if, we might need a special flag to mean full utf-8 when serializing, instead of serializing a 1mil+ character string.

I'm leaning towards asking for a way to test also that the strings which get saved are exactly the same as the strings you get when loading the episode and not just check that they are in the space like we are currently doing. We could improve the dataset creation tests generally to enforce this for all space types, but let's say this last paragraph is a suggestion for a later PR.

More I think about it the more feel like the gymnasium spaces are kind of irrelevant to serializing, no matter what it’ll always go into the memory as utf8 encoding, people can do whatever subset of Unicode that they want and it’ll always be encodable that way.

balisujohn · 2023-06-23T23:50:49Z

In general, I think that providing space serialization means a contract that the user will be able to fully recover the space specification beyond what is implicitly present in the data itself. We provide bounds on box and discrete spaces, and I think we need to provide similar metadata on serialized text spaces. Even if we don't need to know the space to faithfully reproduce the data, the user still needs to be able to recover the space from the serialization. I don't think it will add too much complexity if we add a boolean flag to a Text space serialized json that is set to true if all unicode characters are supported and false otherwise, and if it is set to true, the charset is left empty in the serialization.

We might even want to consider upstreaming something similar into Gymnasium.spaces.Text (because with the actual Text spaces in mempry, we will stilll have to use the million plus character long charset string, which could be represented as a couple bytes with either boundary-based charsets or a special "all utf-8" flag.

elliottower · 2023-06-24T00:48:19Z

In general, I think that providing space serialization means a contract that the user will be able to fully recover the space specification beyond what is implicitly present in the data itself. We provide bounds on box and discrete spaces, and I think we need to provide similar metadata on serialized text spaces. Even if we don't need to know the space to faithfully reproduce the data, the user still needs to be able to recover the space from the serialization. I don't think it will add too much complexity if we add a boolean flag to a Text space serialized json that is set to true if all unicode characters are supported and false otherwise, and if it is set to true, the charset is left empty in the serialization.

We might even want to consider upstreaming something similar into Gymnasium.spaces.Text (because with the actual Text spaces in mempry, we will stilll have to use the million plus character long charset string, which could be represented as a couple bytes with either boundary-based charsets or a special "all utf-8" flag.

Fair enough about users needing to know. We should talk to Mark about it, I think a reasonable thing for gymnasium spaces would be to have encoding options like ascii or utf-8, but we should probably look into things more and get a better idea of what users will want. Last we talked Mark and Ariel said it probably didn’t make sense to have the charset argument or at least there should be easy ways to do utf-8 without specifying a list of every valid character

younik · 2023-06-24T10:24:21Z

For how Text space is right now, I don't think it makes sense to have a special case for the full UTF-8 charset. As the user needs to specify it and all the characters are saved in a variable, I think it makes sense to save all the characters as well.
If in Gymnasium they change it (for example, they make the full UTF-8 set as the default value), then it makes sense to support it better here. In that case, a boolean flag or simply not saving the field charset (and using the UTF-8 charset as default while loading) makes more sense.

minari/data_collector/data_collector.py

tests/common.py

balisujohn · 2023-06-27T23:40:28Z

tests/common.py

+            if not data.observation_space.contains(obs):
+                import pdb
+
+                pdb.set_trace()


I think this is leftover from debugging

balisujohn · 2023-06-27T23:42:22Z

tests/common.py

@@ -143,6 +143,23 @@ def reset(self, seed=None, options=None):
        return self.observation_space.sample(), {}


+class DummyTextEnv(gym.Env):
+    def __init__(self):
+        self.action_space = spaces.Text(max_length=10, min_length=2, charset="01")


I think it would be good to create a dummy env with a full unicode charset minus surrogate chars that returns emoji as observations, as a stress test.

This is the case for DummyTextEnv for observation, isn't it?
Do you want it also for action space?

balisujohn

Ready to merge, once the doc is updated to reflect text space support :^)

balisujohn

LGTM :^)

* support text spaces * test full utf-8 charset * reformat * test utf-16 without surrogate * remove debugging leftover * fix bug in data_collector * add cursed string to test * add documentation

support text spaces

ef1db04

younik commented Jun 23, 2023

View reviewed changes

minari/data_collector/data_collector.py Outdated Show resolved Hide resolved

younik requested review from rodrigodelazcano and balisujohn June 23, 2023 18:12

elliottower reviewed Jun 23, 2023

View reviewed changes

balisujohn requested changes Jun 23, 2023

View reviewed changes

younik added 2 commits June 24, 2023 12:55

test full utf-8 charset

c528132

reformat

97dbe68

younik requested a review from balisujohn June 24, 2023 10:59

rodrigodelazcano requested changes Jun 24, 2023

View reviewed changes

minari/data_collector/data_collector.py Outdated Show resolved Hide resolved

tests/common.py Show resolved Hide resolved

test utf-16 without surrogate

f6acd9a

balisujohn requested changes Jun 27, 2023

View reviewed changes

younik added 3 commits June 28, 2023 13:34

remove debugging leftover

4446f0f

fix bug in data_collector

8135832

add cursed string to test

8b2e3e1

balisujohn reviewed Jun 28, 2023

View reviewed changes

rodrigodelazcano approved these changes Jun 29, 2023

View reviewed changes

add documentation

106d99c

younik requested a review from balisujohn June 29, 2023 23:10

balisujohn approved these changes Jun 30, 2023

View reviewed changes

balisujohn merged commit c4f8d95 into Farama-Foundation:main Jun 30, 2023

younik deleted the support-text-space branch May 26, 2024 09:52

Uh oh!

ENH: Support text spaces #99

ENH: Support text spaces #99

Uh oh!

Conversation

younik commented Jun 23, 2023

Description

Type of change

Checklist:

Uh oh!

Uh oh!

younik commented Jun 23, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

balisujohn left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elliottower commented Jun 23, 2023

Uh oh!

balisujohn commented Jun 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elliottower commented Jun 24, 2023

Uh oh!

younik commented Jun 24, 2023

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

balisujohn left a comment

Choose a reason for hiding this comment

Uh oh!

balisujohn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

balisujohn left a comment •

edited

Loading

balisujohn commented Jun 23, 2023 •

edited

Loading