[Python Dev] Import items lazily #8741

Tishj · 2023-08-30T21:06:00Z

(this PR contains #8738 to avoid merge conflicts)

History

When the PythonImportCache was first added, it eagerly imported everything.
A while back, Mytherin made a PR to import modules lazily, only when they are first used.

One remaining problem is that once a module is imported, we also load all of the deeper modules and attributes associated with that module, which can add up.

Lazy loading ++

This PR continues down the path of lazy importing by importing only the objects that are actually required.
Because an item is imported only once operator() is called on its PythonImportCacheItem, we can traverse the hierarchy of required items, starting at the first module found.
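
To illustrate the idea, here is a minimal Python sketch of an item that imports nothing until it is called and then resolves its parents first. It is not the C++ PythonImportCacheItem from this PR; the class name LazyImportItem, its fields, and the use of the standard library json package as a stand-in are all assumptions made for the example.

```python
import importlib


class LazyImportItem:
    """Hypothetical Python analogue of a lazily resolved import cache item."""

    def __init__(self, full_path, kind, parent=None):
        self.full_path = full_path
        self.name = full_path.rsplit(".", 1)[-1]
        self.kind = kind  # "module" or "attribute"
        self.parent = parent
        self._object = None  # nothing is imported at construction time

    def is_loaded(self):
        return self._object is not None

    def __call__(self):
        # Resolve on the first call only; later calls reuse the cached object.
        if self._object is None:
            # Walk up the hierarchy so every ancestor is resolved first.
            parent_object = self.parent() if self.parent is not None else None
            if self.kind == "module":
                self._object = importlib.import_module(self.full_path)
            else:
                self._object = getattr(parent_object, self.name)
        return self._object


# Declaring the hierarchy imports nothing yet.
json_module = LazyImportItem("json", "module")
json_decoder = LazyImportItem("json.decoder", "module", parent=json_module)
json_decoder_cls = LazyImportItem("json.decoder.JSONDecoder", "attribute", parent=json_decoder)
json_dumps = LazyImportItem("json.dumps", "attribute", parent=json_module)

# Calling one leaf resolves only the chain that leads to it.
decoder_class = json_decoder_cls()
```

After the last line, json_module and json_decoder are loaded because they sit on the path to JSONDecoder, while json_dumps stays untouched until something calls it.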

We now generate the PythonImportCache code.
Which part of it, you ask? All of it.

Using:

import pyarrow
import pyarrow.dataset

pyarrow.dataset.Scanner
pyarrow.dataset.Dataset
pyarrow.Table
pyarrow.RecordBatchReader

we can generate all the code needed to write, for example, import_cache.pyarrow.Table().
This also allows us to verify that the imported paths are correct.

From the Python file we generate a JSON file:

{
    "pyarrow": {
        "type": "module",
        "full_path": "pyarrow",
        "name": "pyarrow",
        "children": [
            "pyarrow.dataset",
            "pyarrow.Table",
            "pyarrow.RecordBatchReader"
        ]
    },
    "pyarrow.dataset": {
        "type": "module",
        "full_path": "pyarrow.dataset",
        "name": "dataset",
        "children": [
            "pyarrow.dataset.Scanner",
            "pyarrow.dataset.Dataset"
        ]
    },
    "pyarrow.dataset.Scanner": {
        "type": "attribute",
        "full_path": "pyarrow.dataset.Scanner",
        "name": "Scanner",
        "children": []
    }
}
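
As a rough sketch of how such a JSON structure could be derived from the Python file, the snippet below parses the source with the standard ast module and records each import as a module and each dotted expression as an attribute. This is not the generator script from this PR; the function names build_import_tree and dotted_name are hypothetical, and the sketch only parses the text, so it does not need pyarrow installed and does not verify that the paths exist.

```python
import ast
import json

SOURCE = """
import pyarrow
import pyarrow.dataset

pyarrow.dataset.Scanner
pyarrow.dataset.Dataset
pyarrow.Table
pyarrow.RecordBatchReader
"""


def dotted_name(node):
    """Turn an ast.Attribute chain such as a.b.C into the string 'a.b.C'."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if not isinstance(node, ast.Name):
        raise ValueError("expected a plain dotted name")
    parts.append(node.id)
    return ".".join(reversed(parts))


def build_import_tree(source):
    """Map every full path to an entry keyed like the JSON shown above."""
    entries = {}

    def add(full_path, kind):
        entries[full_path] = {
            "type": kind,
            "full_path": full_path,
            "name": full_path.rsplit(".", 1)[-1],
            "children": [],
        }
        # Register with the parent if the parent was declared earlier.
        parent = full_path.rpartition(".")[0]
        if parent in entries:
            entries[parent]["children"].append(full_path)

    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                add(alias.name, "module")
        elif isinstance(node, ast.Expr) and isinstance(node.value, ast.Attribute):
            add(dotted_name(node.value), "attribute")
    return entries


tree = build_import_tree(SOURCE)
print(json.dumps(tree, indent=4))
```

Keying the dictionary on the full path rather than the short name keeps entries unique when the same short name appears at different depths.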

We can edit the JSON manually to add other fields as needed.
For example, we can mark some of the imports as not required by adding:

"required": false

This field is preserved when we regenerate the JSON from the Python file.
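
One way such preservation could work is to carry selected hand-edited keys from the existing JSON over to the regenerated entries. The sketch below assumes both are dictionaries keyed by full path; the function name merge_manual_fields and the preserved_keys parameter are hypothetical and not taken from the PR.

```python
def merge_manual_fields(regenerated, existing, preserved_keys=("required",)):
    """Copy hand-edited keys from the existing JSON into regenerated entries."""
    merged = {}
    for full_path, entry in regenerated.items():
        entry = dict(entry)  # do not mutate the caller's data
        old_entry = existing.get(full_path, {})
        for key in preserved_keys:
            if key in old_entry:
                entry[key] = old_entry[key]
        merged[full_path] = entry
    return merged


existing = {
    "pyarrow": {"type": "module", "full_path": "pyarrow", "name": "pyarrow",
                "children": [], "required": False},
}
regenerated = {
    "pyarrow": {"type": "module", "full_path": "pyarrow", "name": "pyarrow",
                "children": ["pyarrow.Table"]},
    "pyarrow.Table": {"type": "attribute", "full_path": "pyarrow.Table",
                      "name": "Table", "children": []},
}
merged = merge_manual_fields(regenerated, existing)
```

Entries that no longer appear in the regenerated data are dropped, and new entries simply have no manual fields yet.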

Using the JSON, we then generate all the headers and the PythonImportCache class required to use the import cache.

Tishj · 2023-09-04T13:25:18Z

I just realized this might need a lock in the import cache;
otherwise we could get race conditions when two threads try to add an import.
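
The concern can be sketched in Python as a cache whose check-then-insert step is guarded by a mutex. This is only an illustration of the locking pattern, not the C++ change itself; the class LockedImportCache and its import_count counter are hypothetical names introduced for the example.

```python
import importlib
import threading


class LockedImportCache:
    """Hypothetical cache that serializes the check-then-insert step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._modules = {}
        self.import_count = 0  # how many times the cache itself did an import

    def get(self, name):
        # Without the lock, two threads could both miss and both insert.
        with self._lock:
            module = self._modules.get(name)
            if module is None:
                module = importlib.import_module(name)
                self._modules[name] = module
                self.import_count += 1
            return module


cache = LockedImportCache()
threads = [threading.Thread(target=cache.get, args=("json",)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Holding the lock across the import keeps the sketch simple; a finer-grained scheme is possible but is not shown here.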

Mytherin · 2023-10-20T12:41:26Z

Thanks! LGTM

Tishj · 2024-02-28T16:37:48Z

tools/pythonpkg/src/native/python_conversion.cpp

@@ -453,8 +453,8 @@ Value TransformPythonValue(py::handle ele, const LogicalType &target_type, bool
 	case PythonObjectType::Datetime: {
 		auto &import_cache = *DuckDBPyConnection::ImportCache();
 		bool is_nat = false;
-		if (import_cache.pandas().isnull.IsLoaded()) {
-			auto isnull_result = import_cache.pandas().isnull()(ele);
+		if (import_cache.pandas.isnull(false)) {


This behavior isn't entirely equivalent?
The old version loaded pandas entirely, caching all of the sub-attributes we care about, and then checked whether isnull was present.
If isnull was present in the installed version of pandas, it would be loaded.

The new behavior does not try to load anything; it merely checks whether isnull has already been loaded through some other code path.
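
The difference can be shown with a small Python sketch. It assumes that the boolean argument in the new call means "load if missing"; that reading, the class name CachedAttribute, and the use of math.isnan as a stand-in for pandas.isnull are assumptions made for illustration.

```python
import importlib


class CachedAttribute:
    """Hypothetical cached attribute with an optional 'do not load' lookup."""

    def __init__(self, module_name, attr_name):
        self.module_name = module_name
        self.attr_name = attr_name
        self._object = None

    def __call__(self, load=True):
        # With load=False this only reports what an earlier call cached.
        if self._object is None and load:
            module = importlib.import_module(self.module_name)
            self._object = getattr(module, self.attr_name, None)
        return self._object


isnan = CachedAttribute("math", "isnan")

before = isnan(load=False)  # nothing has loaded it yet, so this is None
loaded = isnan()            # loads the module and caches the attribute
after = isnan(load=False)   # now returns the cached function
```

The check-only call therefore answers "has some other path loaded this already?", not "is this available in the installed library?", which is the non-equivalence raised above.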

Tishj · 2024-02-28T16:38:49Z

tools/pythonpkg/src/numpy/numpy_scan.cpp

@@ -340,10 +340,10 @@ void NumpyScan::Scan(PandasColumnBindData &bind_data, idx_t count, idx_t offset,
 					out_mask.SetInvalid(row);
 					continue;
 				}
-				if (import_cache.pandas().libs.NAType.IsLoaded()) {
+				if (import_cache.pandas._libs.missing.NAType(false)) {


Same comment as before

Tishj · 2024-02-28T16:56:53Z

tools/pythonpkg/src/numpy/numpy_scan.cpp

@@ -340,10 +340,10 @@ void NumpyScan::Scan(PandasColumnBindData &bind_data, idx_t count, idx_t offset,
 					out_mask.SetInvalid(row);
 					continue;
 				}
-				if (import_cache.pandas().libs.NAType.IsLoaded()) {
+				if (import_cache.pandas.libs.NAType(false)) {


Same comment as above

Tishj added 5 commits August 30, 2023 22:35

only import the required items when () is called

0d9ac46

fix pyduckdb.Value import

db7fb2d

fix version conflict marker

55e103f

Merge branch 'main' into python_import_cache_upgrade

fde6bc1

Merge branch 'main' into python_import_cache_upgrade

ecd78bb

Tishj marked this pull request as ready for review September 2, 2023 16:32

Merge branch 'main' into python_import_cache_upgrade

4b3d4ee

github-actions bot marked this pull request as draft September 4, 2023 13:26

Tishj added 15 commits September 7, 2023 13:02

Merge branch 'main' into python_import_cache_upgrade

c16f830

Merge remote-tracking branch 'upstream/main' into python_import_cache_upgrade

30e4faf

generate json from 'import.py' file

2649516

add missing modules+attributes

f457617

use the full_path as dict key to ensure uniqueness, i.e datetime.datetime is a problem

74e2c93

fix missing attributes

39d7fd7

generate import cache HPP from json

fa787d4

deal with the 'short' and 'ushort' cases we need to avoid in cpp variable names

4947465

write the includes_file, import_cache and write the files

d8fd66c

revisions

4c6ab38

renamed arrow -> pyarrow

091b31e

preserve manually added 'required' flags in the json

ddf9630

Merge remote-tracking branch 'upstream/main' into python_import_cache_upgrade_v2

b339dee

Merge remote-tracking branch 'upstream/main' into python_import_cache_upgrade

144c555

Merge branch 'python_import_cache_upgrade_v2' into python_import_cache_upgrade

5d4eaad

Tishj changed the base branch from main to feature October 5, 2023 08:05

Tishj added 3 commits October 5, 2023 10:06

Merge remote-tracking branch 'upstream/feature' into python_import_cache_upgrade

f6d6342

pyduckdb -> duckdb

b8aede5

Merge remote-tracking branch 'upstream/feature' into python_import_cache_upgrade

d0fc4af

Tishj marked this pull request as ready for review October 7, 2023 09:32

Mytherin merged commit e72ab94 into duckdb:feature Oct 20, 2023

Tishj mentioned this pull request Feb 28, 2024

[Python][Dev] Fix issue in PythonImportCache Tishj/duckdb#110

Open
