-
Notifications
You must be signed in to change notification settings - Fork 1.5k
[red-knot] Initial support for dataclass
es
#17353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
66a8262
to
f547a69
Compare
dataclass
es
29d4d21
to
107851e
Compare
if let Some(Type::BooleanLiteral(value)) = ty { | ||
*value | ||
} else { | ||
// TODO: emit a diagnostic if we receive `bool` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mypy emits "init" argument must be a True or False literal
in this case, Pyright just ignores it (falls back to the default value), which is what we also do at the moment. It does not seem particularly important.
def _(flag: bool):
@dataclass(order=flag)
class A:
x: int
A(1) < A(2) # supported or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
107851e
to
0a3a38d
Compare
0a3a38d
to
696ac00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
@@ -98,6 +101,8 @@ pub struct Class<'db> { | |||
pub(crate) body_scope: ScopeId<'db>, | |||
|
|||
pub(crate) known: Option<KnownClass>, | |||
|
|||
pub(crate) dataclass_metadata: Option<DataclassMetadata>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than storing a field on Class
, did you consider making dataclass_metadata
a lazy Salsa-tracked method (and move the logic currently in infer.rs
into that method), similar to our existing Class::explicit_bases_query
? There might be lots of dataclasses in third-party modules that first-party code depends on, but I'm guessing we'd rarely see constructor calls to most of those classes in the first-party code we're actually checking. So most of the time we probably don't need to know whether a class is a dataclass or not. It's probably wasted computation (and memory) to check (and store) whether these are all dataclasses unless we actually see any constructor calls for these classes? (I wouldn't expect this to make a difference in our benchmarks because we don't have any dataclass-heavy benchmarks right now.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that as we increase the accuracy of our modeling of dataclasses, we will have more and more cases where we do need to know whether a class is a dataclass or not (e.g. it affects inference of final attributes with an assignment on the class body, too), which may reduce the effectiveness of making that determination lazily?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe! I'm not sure, which is why I was asking if it was considered ;)
I suppose another reason why explicit_bases_query
is a lazy method -- which probably does not apply for dataclasses -- is to support self-referential class definitions such as class str(Sequence[str]): ...
-- deferring inference of the bases helps avoid otherwise-unnecessary Salsa cycles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which is why I was asking if it was considered
I did not consider this.
So most of the time we probably don't need to know whether a class is a dataclass or not. It's probably wasted computation (and memory) to check (and store) whether these are all dataclasses unless we actually see any constructor calls for these classes?
We have to infer types for all class decorators anyway. So the additional work that we do here is to check if the inferred type is one of two possible options. I'm just guessing here as well, but I just can't imagine that this will ever be a performance-critical code path. You would have to have loads of different dataclasses that are never constructed anywhere for this to be relevant, and even then, there's so much more we have to do when type-checking a class definition.
Memory-wise, MetaclassMetadata
is bitset with a size of 2 bytes. I think we could get that down to one byte potentially, as we might not need the information whether repr
and eq
are set to true (because those methods are defined in any case, it's just the implementation which changes?).
we don't have any dataclass-heavy benchmarks
I'm happy to run a benchmark locally if you have any suggestions for codebases to check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to infer types for all class decorators anyway. So the additional work that we do here is to check if the inferred type is one of two possible options.
Hmm, good point. I suppose doing it lazily would only improve performance if you also changed our dataclass handling to be more localised, as per #17353 (comment) — but as discussed in that thread, that's probably not a good idea for other reasons.
The memory savings might still be worth it? Large codebases could have lots of class definitions; the new field on Class
increases the amount of data we're storing in the salsa database even for non-dataclasses. As per your other question, though — no, I don't have a great suggestion for a codebase to run performance or memory benchmarks on.
we might not need the information whether
repr
andeq
are set to true (because those methods are defined in any case, it's just the implementation which changes?).
It's a bit of a stretch, but you could have something like this — we'd emit an error about attempting to instantiate an abstract class if we didn't synthesise the __repr__
method on Bar
(once we support ABCs and @abstractmethod
properly):
from abc import ABC, abstractmethod
from dataclasses import dataclass
class Foo(ABC):
@abstractmethod
def __repr__(self) -> str: ...
@dataclass
class Bar(Foo): ...
Bar()
For mypy, it's also important for it to synthesise the __eq__
method for this error code: https://mypy.readthedocs.io/en/stable/error_code_list2.html#check-that-comparisons-are-overlapping-comparison-overlap
And there could be possible use cases from type-aware lints in the future, too? Not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory savings might still be worth it? Large codebases could have lots of class definitions; the new field on
Class
increases the amount of data we're storing in the salsa database even for non-dataclasses.
Yes, by two bytes (or three, for the Option
, if there are no niches) per class. Worst case eight bytes, due to alignment requirements. If you have 10 000 classes in a project, that's 80kB. Or am I missing something?
It's a bit of a stretch, but you could have something like this
Thanks — I'll keep the REPR
and EQ
flags for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worst case eight bytes, due to alignment requirements. If you have 10 000 classes in a project, that's 80kB.
Alright, thanks, probably not worth worrying about then 😄
let decorator_ty = self.infer_decorator(decorator); | ||
if decorator_ty | ||
.into_function_literal() | ||
.is_some_and(|function| function.is_known(self.db(), KnownFunction::Dataclass)) | ||
{ | ||
dataclass_metadata = Some(DataclassMetadata::default()); | ||
continue; | ||
} | ||
|
||
if let Type::DataclassDecorator(metadata) = decorator_ty { | ||
dataclass_metadata = Some(metadata); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting -- rather than creating a separate Type
variant, I was assuming we'd do some more localised special-casing for @dataclass
with arguments passed. Something like this:
let decorator_ty = self.infer_decorator(decorator); | |
if decorator_ty | |
.into_function_literal() | |
.is_some_and(|function| function.is_known(self.db(), KnownFunction::Dataclass)) | |
{ | |
dataclass_metadata = Some(DataclassMetadata::default()); | |
continue; | |
} | |
if let Type::DataclassDecorator(metadata) = decorator_ty { | |
dataclass_metadata = Some(metadata); | |
continue; | |
} | |
let decorator_ty = self.infer_decorator(decorator); | |
if decorator_ty | |
.into_function_literal() | |
.is_some_and(|function| function.is_known(self.db(), KnownFunction::Dataclass)) | |
{ | |
dataclass_metadata = Some(DataclassMetadata::default()); | |
continue; | |
} | |
if let ast::Expr::Call(ast::ExprCall { func, arguments, range } = decorator { | |
if self | |
.expression_ty(func) | |
.into_function_literal() | |
.is_some_and(|function| function.is_known(self.db(), KnownFunction::Dataclass)) { | |
dataclass_metadata = Some(DataclassMetadata::from_arguments(arguments)); | |
} | |
} | |
} |
The advantage would be that we wouldn't have to worry about all the questions everywhere such as "is this type a subtype of DataclassMetadata
?", etc.
The disadvantage is that having a new Type
variant means that we have a pretty good understanding of the dataclass
function and its associated metadata even when the dataclass()
call is long way removed from the class it's decorating. E.g. you can do things like this, and red-knot still understands that one of the classes has an autogenerated __init__
while the other doesn't: https://playknot.ruff.rs/5876dcbd-aaa4-4dc7-a0f2-121052dcd277. And I think I have seen issues at mypy where users have been upset that mypy doesn't support things like that. So it's pretty cool that we do support this kind of thing!
It might be good to add some tests demonstrating this capability, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it's useful to support my_dataclass = dataclasses.dataclass(...)
followed by later (even in a different module) use of @my_dataclass
, and I agree we should add a test demonstrating that that works!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was assuming we'd do some more localised special-casing for
@dataclass
with arguments passed.
This is something I briefly considered. I decided against it, mainly because I wanted to model it "correctly", and also because I saw that other type checkers support it.
The advantage would be that we wouldn't have to worry about all the questions everywhere such as "is this type a subtype of
DataclassMetadata
?", etc.
Yes, I agree that this is annoying. I'm hopeful that we can merge some of the recently added special callable types into something more generally useful. Maybe that merged variant could have some methods like .to_callable()
(which would return a corresponding Callable
type) and .to_instance_fallback()
(which would return somthing like MethodWrapperType
), and then we could formulate a lot of those type properties in terms of these.
It might be good to add some tests demonstrating this capability, though?
Yes 👍
The "Using dataclass
as a function" test which I added in this PR was supposed to demonstrate something related: that you can also transform a class after-the-fact (class C: ...; C = dataclass(order=True)(C)
). That doesn't work yet, but it would also require us to actually generate a new callable type for the dataclass(order=True)
call, instead of just trying to detect the syntactic structure of dataclass
being used as a decorator.
* main: (31 commits) [red-knot] Add some knowledge of `__all__` to `*`-import machinery (#17373) Update taiki-e/install-action digest to be7c31b (#17379) Update Rust crate mimalloc to v0.1.46 (#17382) Update PyO3/maturin-action action to v1.49.1 (#17384) Update Rust crate anyhow to v1.0.98 (#17380) dependencies: switch from `chrono` to `jiff` Update Rust crate bstr to v1.12.0 (#17385) [red-knot] Further optimize `*`-import visibility constraints (#17375) [red-knot] Minor 'member_lookup_with_policy' fix (#17407) [red-knot] Initial support for `dataclass`es (#17353) Sync vendored typeshed stubs (#17402) [red-knot] improve function/bound method type display (#17294) [red-knot] Move relation methods from `CallableType` to `Signature` (#17365) [syntax-errors] `await` outside async functions (#17363) [red-knot] optimize is_subtype_of for literals (#17394) [red-knot] add a large-union-of-string-literals benchmark (#17393) Update pre-commit dependencies (#17383) [red-knot] mypy_primer: Fail job on panic or internal errors (#17389) [red-knot] Document limitations of diagnostics-silencing in unreachable code (#17387) [red-knot] detect unreachable attribute assignments (#16852) ...
Summary
Add very early support for dataclasses. This is mostly to make sure that we do not emit false positives on dataclass construction, but it also lies some foundations for future extensions.
This seems like a good initial step to merge to me, as it basically removes all false positives on dataclass constructor calls. This allows us to use the ecosystem checks for making sure we don't introduce new false positives as we continue to work on dataclasses.
ticket: #16651
Ecosystem analysis
I re-ran the mypy_primer evaluation of the
__init__
PR locally with our current mypy_primer version and project selection. It introduced 1597 new diagnostics. Filtering those by searching for__init__
and rejecting those that containinvalid-argument-type
(those could not possibly be solved by this PR) leaves 1281 diagnostics. The current version of this PR removes 1171 diagnostics, which leaves 110 unaccounted for. I extracted the lint + file path for all of these diagnostics and generated a diff (of diffs), to see which__init__
-diagnostics remain. I looked at a subset of these: There are a lot ofSomeClass(*args)
calls where we don't understand the unpacking yet (this is not even related to__init__
). Some others are related toNamedTuple
, which we also don't support yet. And then there are some errors related to@attrs.define
-decorated classes, which would probably require support fordataclass_transform
, which I made no attempt to include in this PR.Test Plan
New Markdown tests.