[Python Dev] Import items lazily #8741
Conversation
I just realized, this might need a lock in the import cache
…time is a problem
Thanks! LGTM
Merge pull request duckdb/duckdb#9164 from Mause/feature/jdbc-uuid-param
Merge pull request duckdb/duckdb#9185 from pdet/adbc_07
Merge pull request duckdb/duckdb#9126 from Maxxen/parquet-kv-metadata
Merge pull request duckdb/duckdb#9123 from lnkuiper/parquet_schema
Merge pull request duckdb/duckdb#9086 from lnkuiper/json_inconsistent_structure
Merge pull request duckdb/duckdb#8977 from Tishj/python_readcsv_multi_v2
Merge pull request duckdb/duckdb#9279 from hawkfish/nsdate-cast
Merge pull request duckdb/duckdb#8851 from taniabogatsch/binary_lambdas
Merge pull request duckdb/duckdb#8983 from Maxxen/types/fixedsizelist
Merge pull request duckdb/duckdb#9318 from Maxxen/fix-unused
Merge pull request duckdb/duckdb#9220 from hawkfish/exclude
Merge pull request duckdb/duckdb#9230 from Maxxen/json-plan-serialization
Merge pull request duckdb/duckdb#9011 from Tmonster/add_create_statement_support_to_fuzzer
Merge pull request duckdb/duckdb#9400 from Maxxen/array-fixes
Merge pull request duckdb/duckdb#8741 from Tishj/python_import_cache_upgrade
Merge fixes
Merge pull request duckdb/duckdb#9395 from taniabogatsch/lambda-performance
Merge pull request duckdb/duckdb#9427 from Tishj/python_table_support_replacement_scan
Merge pull request duckdb/duckdb#9516 from carlopi/fixformat
Merge pull request duckdb/duckdb#9485 from Maxxen/fix-parquet-serialization
Merge pull request duckdb/duckdb#9388 from chrisiou/issue217
Merge pull request duckdb/duckdb#9565 from Maxxen/fix-array-vector-sizes
Merge pull request duckdb/duckdb#9583 from carlopi/feature
Merge pull request duckdb/duckdb#8907 from cryoEncryp/new-list-functions
Merge pull request duckdb/duckdb#8642 from Virgiel/capi-streaming-arrow
Merge pull request duckdb/duckdb#8658 from Tishj/pytype_optional
Merge pull request duckdb/duckdb#9040 from Light-City/feature/set_mg
@@ -453,8 +453,8 @@ Value TransformPythonValue(py::handle ele, const LogicalType &target_type, bool
 case PythonObjectType::Datetime: {
 auto &import_cache = *DuckDBPyConnection::ImportCache();
 bool is_nat = false;
-if (import_cache.pandas().isnull.IsLoaded()) {
-auto isnull_result = import_cache.pandas().isnull()(ele);
+if (import_cache.pandas.isnull(false)) {
This behavior isn't entirely equivalent?
The old version loaded pandas entirely, caching all of the sub-attributes we care about, and then checked whether isnull was present. If it was present in the installed version of pandas, it would be loaded.
The new behavior does not try to load anything; it merely checks whether another code path has already loaded isnull.
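To make the distinction above concrete, here is a minimal Python sketch of the two lookup behaviors. This is illustrative only: the real cache lives in DuckDB's C++ bindings, and the EagerItem/LazyItem names and the load_module parameter are hypothetical stand-ins.

```python
# Hypothetical sketch contrasting the two behaviors discussed above.
# (Not DuckDB's actual implementation; names are illustrative.)
import importlib


class EagerItem:
    """Old behavior: checking availability triggers a full import."""

    def __init__(self, module_name, attr):
        self.module_name, self.attr = module_name, attr
        self._obj = None

    def is_loaded(self):
        # Imports the module (and so everything it pulls in) just to check.
        module = importlib.import_module(self.module_name)
        self._obj = getattr(module, self.attr, None)
        return self._obj is not None


class LazyItem:
    """New behavior: calling with load_module=False only inspects the cache."""

    def __init__(self, module_name, attr):
        self.module_name, self.attr = module_name, attr
        self._obj = None

    def __call__(self, load_module=True):
        if self._obj is None and load_module:
            module = importlib.import_module(self.module_name)
            self._obj = getattr(module, self.attr, None)
        return self._obj  # None if never loaded and load_module is False


eager = EagerItem("math", "isnan")
assert eager.is_loaded()          # check alone forces the import

lazy = LazyItem("math", "isnan")
assert lazy(False) is None        # nothing imported yet, the check comes back empty
assert lazy() is not None         # an explicit call imports and caches
assert lazy(False) is not None    # now the cached attribute is visible
```

So with the new scheme, a `(false)` check is cheap but answers a different question: "has this been loaded already?" rather than "is this available?".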
@@ -340,10 +340,10 @@ void NumpyScan::Scan(PandasColumnBindData &bind_data, idx_t count, idx_t offset,
 out_mask.SetInvalid(row);
 continue;
 }
-if (import_cache.pandas().libs.NAType.IsLoaded()) {
+if (import_cache.pandas._libs.missing.NAType(false)) {
Same comment as before
@@ -340,10 +340,10 @@ void NumpyScan::Scan(PandasColumnBindData &bind_data, idx_t count, idx_t offset,
 out_mask.SetInvalid(row);
 continue;
 }
-if (import_cache.pandas().libs.NAType.IsLoaded()) {
+if (import_cache.pandas.libs.NAType(false)) {
Same comment as above
(this PR contains #8738 to avoid merge conflicts)
History
When the PythonImportCache was first added, it eagerly imported everything.
A while back, Mytherin made a PR to lazily import modules only when they are first used.
One remaining problem is that once a module is imported, we still load all of the deeper modules and attributes associated with it, which can add up.
Lazy loading ++
This PR continues down the path of lazy imports, importing only the objects that are required.
By importing only once operator() is called on the PythonImportCacheItem, we can traverse the hierarchy of required items, starting at the first module found.
We now generate the PythonImportCache code
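The traversal idea can be sketched in Python: each cache item knows its parent, and resolving the leaf walks up to the nearest module, imports it, then resolves attributes downward. This is a hedged sketch, not DuckDB's actual C++ API; CacheItem and its fields are illustrative names.

```python
# Illustrative sketch of operator()-style lazy traversal.
# (Names are hypothetical; the real cache is implemented in C++.)
import importlib


class CacheItem:
    def __init__(self, name, parent=None, is_module=False):
        self.name, self.parent, self.is_module = name, parent, is_module
        self._obj = None

    def __call__(self):
        if self._obj is None:
            if self.is_module:
                # First module found in the chain: import it for real.
                self._obj = importlib.import_module(self.name)
            else:
                # Resolve the parent first, then fetch this attribute from it.
                self._obj = getattr(self.parent(), self.name)
        return self._obj


# Mirrors usage like import_cache.pyarrow.Table(), using the stdlib
# json module as a stand-in so the example is self-contained.
json_module = CacheItem("json", is_module=True)
loads = CacheItem("loads", parent=json_module)
assert loads()("[1, 2]") == [1, 2]  # resolving the leaf imports json once
```

Nothing is imported until the leaf item is actually called, and each resolved object is cached so later calls are plain attribute reads.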
Which part of it you ask? All of it
Using:
We can generate all the code necessary to support, for example,
import_cache.pyarrow.Table()
This also lets us verify that the imported paths are correct
From the python file we generate a JSON file.
We can edit the JSON manually to add other fields as needed.
For example we can mark some of the imports as not required by adding:
This is preserved when we regenerate the JSON from the python file.
Using the JSON, we then generate all the headers and the PythonImportCache class required to use the import cache.
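As a rough illustration of that generation step, the sketch below turns a JSON description of the module tree into cache-item declarations. Both the JSON shape and the emitted lines are assumptions for illustration; they are not the actual schema or output of DuckDB's generator.

```python
# Hypothetical sketch: emit cache-item declarations from a JSON spec.
# (The JSON shape and output format are assumed, not DuckDB's real ones.)
import json

spec = json.loads("""
{
  "pyarrow": {
    "children": {"Table": {"required": true}, "dataset": {"required": false}}
  }
}
""")


def emit_items(spec):
    lines = []
    for module, info in spec.items():
        lines.append(f"PythonImportCacheItem {module};")
        for child, opts in info.get("children", {}).items():
            required = opts.get("required", True)
            comment = "" if required else "  // optional import"
            lines.append(f"PythonImportCacheItem {module}_{child};{comment}")
    return lines


generated = emit_items(spec)
assert generated[0] == "PythonImportCacheItem pyarrow;"
assert any("optional import" in line for line in generated)
```

Driving the generator from JSON keeps hand edits (such as the not-required flag mentioned above) stable across regenerations, since the script round-trips the same file.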