Description
New Operator: RegexFullMatch
Describe the operator
ONNX currently lacks support for regex matching. Regex matching is an incredibly useful string operation that is pervasive in feature preprocessing. Given an input tensor of type `string`, the operator would return a `boolean` array denoting whether a full match has been achieved. The output tensor would have the same shape as the input, as matching is done elementwise.
In TensorFlow this is directly supported via https://www.tensorflow.org/api_docs/python/tf/strings/regex_full_match, which supports regex patterns based on RE2 syntax (https://github.com/google/re2/wiki/Syntax). The behaviour of this operator should reflect the examples in the TensorFlow documentation.
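To make the intended semantics concrete, here is a minimal reference sketch in plain Python. It is an illustration only: the function name `regex_full_match` is hypothetical, nested lists stand in for a string tensor, and Python's `re` module is used as a stand-in engine, so its syntax differs in places from RE2, ECMA and POSIX.

```python
import re
from typing import List, Union

# Nested lists stand in for string / boolean tensors of arbitrary rank.
NestedStr = Union[str, List["NestedStr"]]
NestedBool = Union[bool, List["NestedBool"]]


def regex_full_match(x: NestedStr, pattern: str) -> NestedBool:
    """Hypothetical reference: elementwise full match, same shape as the input."""
    compiled = re.compile(pattern)  # assumption: Python `re` as stand-in engine

    def apply(item: NestedStr) -> NestedBool:
        if isinstance(item, str):
            # A full match requires the whole string to match, not a substring.
            return compiled.fullmatch(item) is not None
        return [apply(sub) for sub in item]

    return apply(x)


result = regex_full_match([["abc", "abcd"], ["", "xabc"]], r"abc")
print(result)
```

Only `"abc"` matches in this example, because a full match rejects strings that merely contain the pattern.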
Attributes:
- Given the lack of complete standardisation of regex syntax, I would propose an attribute called `syntax`, which is a literal `string` denoting the specific accepted syntax. Options can be `"ECMA"` and `"posix"`. For runtimes/backends, there is a clear implementation path for these options across different languages.
- The regex pattern to match on can be specified as a `string` attribute called `pattern`.
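As a rough sketch of how a backend might validate these two attributes before running the operator, consider the following. The class name `RegexFullMatchAttrs` and the constant `ALLOWED_SYNTAXES` are hypothetical and not part of ONNX; no default for `syntax` is assumed, since the proposal does not specify one.

```python
from dataclasses import dataclass

# The two options named in the proposal; the constant name is hypothetical.
ALLOWED_SYNTAXES = frozenset({"ECMA", "posix"})


@dataclass(frozen=True)
class RegexFullMatchAttrs:
    """Hypothetical container for the two proposed attributes."""

    pattern: str
    syntax: str

    def __post_init__(self) -> None:
        if not isinstance(self.pattern, str):
            raise TypeError("pattern must be a string")
        if self.syntax not in ALLOWED_SYNTAXES:
            raise ValueError(
                f"syntax must be one of {sorted(ALLOWED_SYNTAXES)}, got {self.syntax!r}"
            )


attrs = RegexFullMatchAttrs(pattern=r"[0-9]+", syntax="ECMA")
print(attrs)
```

Rejecting unknown `syntax` values up front keeps the choice of regex engine explicit instead of silently falling back to whatever the runtime happens to ship.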
Can this operator be constructed using existing onnx operators?
No.
Is this operator used by any model currently? Which one?
This operator is very commonly used during data preprocessing where we have non-categorical string data. It is used to generate masks for filtering, validation and replacement operations.
Commonly used DataFrame-oriented libraries such as `pandas` and `polars` provide capabilities for regex-based replacement (pandas, polars) and for filtering/membership checking (pandas, polars). Combining a `RegexFullMatch` operator with the existing operator set will unlock these very common operations without requiring numerous additional operators in the standard.
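The following plain-Python sketch illustrates how a boolean mask from a full-match operation can drive both filtering and replacement. All function names are hypothetical, Python's `re` module stands in for the regex engine, and the two mask-consuming helpers only approximate the roles that existing ONNX operators such as `Compress` and `Where` would play in a real graph.

```python
import re
from typing import List


def full_match_mask(values: List[str], pattern: str) -> List[bool]:
    """Stand-in for the proposed operator on a 1-D string tensor."""
    compiled = re.compile(pattern)
    return [compiled.fullmatch(v) is not None for v in values]


def filter_by_mask(values: List[str], mask: List[bool]) -> List[str]:
    """Keep only the elements whose mask entry is True (filtering)."""
    return [v for v, keep in zip(values, mask) if keep]


def replace_by_mask(values: List[str], mask: List[bool], replacement: str) -> List[str]:
    """Substitute `replacement` wherever the mask entry is True (replacement)."""
    return [replacement if hit else v for v, hit in zip(values, mask)]


codes = ["AB-123", "bad code", "ZX-9", "AB-1234"]
mask = full_match_mask(codes, r"[A-Z]{2}-[0-9]{3}")

valid_only = filter_by_mask(codes, mask)
# Validation: flag every entry that does NOT fully match the expected format.
flagged = replace_by_mask(codes, [not m for m in mask], "INVALID")
print(valid_only, flagged)
```

Here `"AB-1234"` is rejected even though it contains a valid code as a prefix, which is the behaviour that distinguishes a full match from a search.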
Are you willing to contribute it? (Y/N)
Y