-
Notifications
You must be signed in to change notification settings - Fork 272
feat: add support for named vectorizers to ai.vectorizer_errors #740
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for named vectorizers to ai.vectorizer_errors #740
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are the changes introduced compared to #683
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are the changes introduced compared to #683
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are the changes introduced compared to #683
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -212,6 +212,18 @@ def migrate_config_to_new_version(cls, data: Any) -> Any: | |||
logger.warning("Unable to migrate configuration: raw data type is unknown") | |||
return data # type: ignore[reportUnknownVariableType] | |||
|
|||
def ensure_features(self, features: Features): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea is to have this method in the Vectorizer rather than adding the logic to the Worker directly, as I believe it is really responsibility of the first.
However, I could revert this and move the logic to the worker here if you believe it is not clear/explicit enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
# Newer versions of pgai lib have the ai.vectorizer_errors view. | ||
# The table has been renamed to ai._vectorizer_errors | ||
query = """ | ||
SELECT table_name | ||
FROM information_schema.views | ||
WHERE table_schema = 'ai' AND table_name = 'vectorizer_errors'; | ||
""" | ||
cur.execute(query) | ||
has_vectorizer_errors_view = cur.fetchone() is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would've probably checked if the _vectorizer_errors
table exists and not if vectorizer_errors
is a view. But I guess this works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any benefit from that approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just one step closer to what the vectorizer actually needs.
It doesn't really care about the view it only needs to know which table it should insert the errors into. But effectively your are checking this since the view and _ table are created in the same migration.
""" | ||
Ensure the vectorizer is compatible with all possible features. | ||
This modifies the vectorizer to make it compatible with the max | ||
amount of features as possible. | ||
""" | ||
if ( | ||
self.errors_table == DEFAULT_VECTORIZER_ERRORS_TABLE | ||
and not features.has_vectorizer_errors_view | ||
): | ||
self.errors_table = "vectorizer_errors" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't really know why we configure this on the vectorizer in the first place and not on the executor. The vectorizer doesn't really use the errors table in any way its only useful for the processing. So that could just have a switch flip on insert.
But I guess this works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, don't want to early optimize. I moved back that logic to the worker.
This is an improved version of #683, which was reverted.
This PR considers retrocompatibility with previous versions that have
ai.vectorizer_errors
as table.