New Operator: RegexFullMatch #5317

@adityagoel4512


Describe the operator

ONNX currently lacks support for regex matching, a string operation that is pervasive in feature preprocessing. Given an input tensor of type string, the operator would return a boolean tensor denoting, for each element, whether the entire string matches the pattern. The output tensor would have the same shape as the input because matching is done elementwise.

In TensorFlow this is directly supported via https://www.tensorflow.org/api_docs/python/tf/strings/regex_full_match, which supports regex patterns based on re2 syntax (https://github.com/google/re2/wiki/Syntax). The behaviour of this operator should mirror the examples in the link.

Attributes:

  • Given the lack of complete standardisation of regex syntax, I would propose an attribute called syntax which is a literal string denoting the specific accepted syntax. Options can be "ECMA" and "posix". For runtimes/backends, there is a clear implementation path for these options across different languages.
  • The regex pattern to match on can be specified as a string attribute called pattern.
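To make the intended semantics concrete, here is a minimal sketch of the elementwise behaviour in plain Python. It is an illustration only, not a proposed reference implementation: the helper name `regex_full_match` is hypothetical, nested lists stand in for a string tensor, and Python's `re` module stands in for whichever syntax the `syntax` attribute would select (its dialect differs from re2, ECMA and posix in places, so only simple patterns are used here).

```python
import re


def regex_full_match(x, pattern):
    """Return a structure shaped like `x` with True where the whole string matches.

    `x` is a string or an arbitrarily nested list of strings (a stand-in for a
    string tensor). `pattern` plays the role of the proposed `pattern` attribute.
    """
    compiled = re.compile(pattern)  # compile once, reuse for every element

    def visit(item):
        if isinstance(item, str):
            # fullmatch succeeds only if the entire string matches the pattern
            return compiled.fullmatch(item) is not None
        return [visit(child) for child in item]

    return visit(x)


# A 2x2 "tensor" of strings; the output keeps the same 2x2 shape.
emails = [["www.example.org", "user@example.org"],
          ["not an address", "a@b.io"]]
mask = regex_full_match(emails, r"[\w.]+@[\w.]+\.[a-z]+")

# A full match is stricter than a substring search: "xabcx" contains "abc"
# but does not fully match it.
strict = regex_full_match(["abc", "xabcx"], r"abc")
```

The point of the second call is the distinction between a full match and a partial match: only elements that match the pattern from start to end yield `True`.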

Can this operator be constructed using existing onnx operators?

No.

Is this operator used by any model currently? Which one?

This operator is very commonly used during data preprocessing where we have non-categorical string data. It is used to generate masks for filtering, validation and replacement operations.

Widely used DataFrame-oriented libraries such as pandas and polars provide regex-based replacement (pandas, polars) and filtering/membership checking (pandas, polars). Combining a RegexFullMatch operator with the existing operator set will unlock these very common operations without requiring numerous additional operators in the standard.
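As a rough illustration of how such a mask drives filtering, validation and replacement, the sketch below uses plain Python lists. The helper `full_match_mask` and the sample data are hypothetical; in an ONNX graph the boolean output would instead feed existing operators (for example `Where` for replacement), which is an assumption about how a model author might wire it up rather than part of this proposal.

```python
import re


def full_match_mask(values, pattern):
    """Hypothetical stand-in for RegexFullMatch on a 1-D string tensor."""
    compiled = re.compile(pattern)
    return [compiled.fullmatch(v) is not None for v in values]


postcodes = ["12345", "1234", "98765", "abcde"]
mask = full_match_mask(postcodes, r"[0-9]{5}")

# Filtering: keep only the elements that fully match.
kept = [v for v, ok in zip(postcodes, mask) if ok]

# Replacement: substitute a sentinel where the match failed.
replaced = [v if ok else "INVALID" for v, ok in zip(postcodes, mask)]

# Validation: check that every element conforms.
all_valid = all(mask)
```

All three operations reuse the same boolean mask, which is why a single matching operator would cover them.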

Are you willing to contribute it? (Y/N)

Y
