
Slow collection time when tests are not in a relative folder to the current working folder #13420

@sashko1988

Created after this discussion - #13413

OSes, python and pytest versions

OS: macOS 15.4.1, Ubuntu 22.04
Python 3.12.8
Pytest 8.3.4

Problem description

I need to run a large number of non-Python tests that are stored in deeply nested folders, and I found that pytest struggles during collection.

Some code context:

from collections.abc import Iterable

import pytest


@pytest.hookimpl(wrapper=True)
def pytest_collection(session):
    # resolve_suites is a project helper, not shown here.
    resolved_paths = resolve_suites(session)
    session.config.args.extend(resolved_paths)
    return (yield)


def pytest_collect_file(parent, file_path):
    if file_path.suffix == ".yaml":
        return YamlFile.from_parent(parent, path=file_path)


class YamlFile(pytest.File):
    def collect(self) -> Iterable[pytest.Item | pytest.Collector]:
        # YamlTestResolver is a leftover from the previous runner, but it resolves the needed stuff.
        test_cases = YamlTestResolver().from_file(f"{self.path}")
        for tc in test_cases:
            yield YamlTest.from_parent(self, name=tc.name, tc_spec=tc)


class YamlTest(pytest.Item):
    # The parameter name must match the tc_spec keyword passed to from_parent above.
    def __init__(self, tc_spec, **kwargs) -> None:
        super().__init__(**kwargs)
        self.tc_spec = tc_spec

Consider this folder structure:

root_working_folder
├── framework_repo
│   └── framework_internal_folder
└── repo_with_tests
    └── tests
        ├── test_folder_1
        │   └── inner_folder
        └── test_folder_2
            └── inner_folder
                └── even_more_depth

The real repo_with_tests contains many more subfolders than shown here.

The pytest call is: pytest --collect-only ${list with 1k non-python tests} (one test per file).

When I run the command above from framework_internal_folder, it takes 56 minutes with cProfile and 23 minutes without. When I make the same call from root_working_folder or repo_with_tests, it takes ~2 minutes with cProfile and 38 seconds without.

The most significant difference between the two runs is the cumulative time of one function, nodes.py:546(_check_initialpaths_for_relpath):

# from framework_internal_folder
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   237033   85.508    0.000 3176.004    0.013 ../_pytest/nodes.py:546(_check_initialpaths_for_relpath)

# from root_working_folder
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      135    0.051    0.000    1.772    0.013 ../_pytest/nodes.py:546(_check_initialpaths_for_relpath)

According to the stats, when running from framework_internal_folder, most of the time is spent in commonpath:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
206063304  262.471    0.000 1580.672    0.000 ../_pytest/pathlib.py:990(commonpath)

# and stats for callers of that function:
Function                            was called by...
                                        ncalls  tottime  cumtime
pathlib.py:990(commonpath)          <- 205937164/3722648  262.308   28.844  nodes.py:546(_check_initialpaths_for_relpath)
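
For anyone who wants to produce tables like the ones above, here is a minimal sketch using the standard-library cProfile and pstats modules. The functions slow_helper and caller are hypothetical stand-ins (they are not pytest code), and profile_report is an illustrative name; the exact command used for the numbers in this issue is not shown here.

```python
import cProfile
import io
import pstats


def slow_helper(n: int) -> int:
    # Hypothetical stand-in for an expensive helper such as commonpath.
    return sum(i * i for i in range(n))


def caller() -> int:
    # Hypothetical stand-in for a function that calls the helper many times.
    total = 0
    for _ in range(50):
        total += slow_helper(200)
    return total


def profile_report(func) -> str:
    """Profile func and return the cumulative-time and callers tables as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Restrict both tables to entries whose name matches "slow_helper".
    stats.sort_stats("cumulative").print_stats("slow_helper")
    stats.print_callers("slow_helper")
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(caller))
```

For a whole pytest run, one option is to write the profile to a file with python -m cProfile -o collect.prof -m pytest ... and then load it with pstats.Stats("collect.prof"); the file name collect.prof is only an example.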

Possible solutions

Cache for _check_initialpaths_for_relpath

I experimented with adding lru_cache to _check_initialpaths_for_relpath:

from functools import lru_cache

# initialpaths is passed as a frozenset so that both arguments are hashable, as lru_cache requires.
@lru_cache(maxsize=1000)
def _check_initialpaths_for_relpath(initialpaths: frozenset[Path], path: Path) -> str | None:
    for initial_path in initialpaths:
        if commonpath(path, initial_path) == initial_path:
            rel = str(path.relative_to(initial_path))
            return "" if rel == "." else rel
    return None

That change decreased the overall collection time to 4 minutes.
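
To illustrate how the cache behaves, here is a self-contained sketch that can be run outside pytest. The commonpath function below is a stand-in written for this example (an assumption about what the internal helper does, not a copy of it), and relpath_from_initialpaths is a hypothetical name for the cached function.

```python
from __future__ import annotations

import os
from functools import lru_cache
from pathlib import Path


def commonpath(path1: Path, path2: Path) -> Path | None:
    # Stand-in helper (assumed behaviour): longest common sub-path, or None.
    try:
        return Path(os.path.commonpath((str(path1), str(path2))))
    except ValueError:
        return None


@lru_cache(maxsize=1000)
def relpath_from_initialpaths(initialpaths: frozenset[Path], path: Path) -> str | None:
    # Same loop as in the experiment above; repeated (initialpaths, path)
    # pairs are answered from the cache without calling commonpath again.
    for initial_path in initialpaths:
        if commonpath(path, initial_path) == initial_path:
            rel = str(path.relative_to(initial_path))
            return "" if rel == "." else rel
    return None


initial = frozenset({Path("/repo/tests/a"), Path("/repo/tests/b")})
target = Path("/repo/tests/a/inner/case.yaml")

for _ in range(3):
    result = relpath_from_initialpaths(initial, target)

# cache_info() reports how many lookups were served from the cache.
info = relpath_from_initialpaths.cache_info()
print(result, info)
```

The cache only helps when the same pair of arguments repeats, and with maxsize=1000 older entries are evicted once more than 1000 distinct pairs have been seen, so the best size likely depends on the number of collected paths.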

Stats are also impressive:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5798    2.109    0.000   79.265    0.014 nodes.py:545(_check_initialpaths_for_relpath)

I'm not sure if commonpath needs caching as well.
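
One alternative to caching, which I have not benchmarked and which is not the approach tried above, would be to avoid commonpath entirely: walk the parents of the path and test each one for membership in the frozenset. This is only a sketch; relpath_via_parents is a hypothetical name, and it assumes the initial paths and the collected path are normalized the same way.

```python
from __future__ import annotations

from pathlib import Path


def relpath_via_parents(initialpaths: frozenset[Path], path: Path) -> str | None:
    # The path itself is an initial path.
    if path in initialpaths:
        return ""
    # Walk from the nearest ancestor up to the root; each membership test
    # is a hash lookup, so the cost grows with the depth of the path
    # rather than with the number of initial paths.
    for parent in path.parents:
        if parent in initialpaths:
            return str(path.relative_to(parent))
    return None


initial = frozenset({Path("/repo/tests/a"), Path("/repo/tests/b")})
print(relpath_via_parents(initial, Path("/repo/tests/a/inner/case.yaml")))
```

Note one behavioural difference: if initial paths are nested inside each other, this version always picks the nearest ancestor, whereas the loop over the frozenset returns whichever matching initial path happens to be iterated first.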

Anything else on the collection mechanism?

Other optimizations in directory/file collections
