Skip to content

Conversation

Mytherin
Copy link
Collaborator

This PR reworks several parts of the file system and multi file reader to work on an extensible OpenFileInfo struct instead of a string path:

  • OpenFileExtended takes an OpenFileInfo
  • Glob returns a list of OpenFileInfo
  • Multi file readers operate on OpenFileInfo (including the MultiFileList that returns a list of OpenFileInfo)

The idea here is that we can pass per-file information down to the readers and file system that can avoid doing unnecessary work. For example:

  • When performing an S3 glob, the result set contains information about each file (file size, eTag and last modified time). When reading the file, we then perform a head request to obtain this information again. This is an unnecessary round-trip per-file as we already have this information. Using the new infrastructure we can extend the Glob/OpenFileExtended to avoid this round-trip.
  • When reading from Lakehouse formats, they generally store the file and footer size for the data files - again allowing us to avoid the subsequent head request and avoiding a round trip if we can pass down this information.

This PR does not change anything aside from wrapping the paths into a struct - but that will allow these optimizations in a follow-up PR.

@Tishj Tishj mentioned this pull request Apr 11, 2025
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 11, 2025 10:28
@Maxxen
Copy link
Member

Maxxen commented Apr 11, 2025

I like the idea of having a OpenFileInfo struct that can carry more info than just the string path. But have you considered also making Glob return something more than just a vector of OpenFileInfos? E.g. some sort of OpenFileIterator that can potentially lazily expand into a set of OpenFIleInfos?

Also I think OpenFIleInfo needs a virtual destructor

@Mytherin
Copy link
Collaborator Author

The idea is to add extra members to OpenFileInfo in a follow-up PR - it can't itself be inherited from because we rely on copy constructors all over.

The design I have for it in my follow-up branch is this:

struct ExtendedOpenFileInfo {
	unordered_map<string, Value> options;
};

struct OpenFileInfo {
	OpenFileInfo() = default;
	OpenFileInfo(string path_p) // NOLINT: allow implicit conversion from string
	    : path(std::move(path_p)) {
	}

	string path;
	shared_ptr<ExtendedOpenFileInfo> extended_info;

public:
	bool operator<(const OpenFileInfo &rhs) const {
		return path < rhs.path;
	}
};

I think reworking glob to return an iterator is also a good idea - but somewhat orthogonal to this PR.

@Maxxen
Copy link
Member

Maxxen commented Apr 11, 2025

Ah alright, I guess that makes sense. I guess struct ExtendedOpenFileInfo might be able to be virtual then ;)

@Mytherin Mytherin marked this pull request as ready for review April 11, 2025 12:20
@Mytherin Mytherin merged commit c69efd5 into duckdb:main Apr 11, 2025
52 checks passed
Mytherin added a commit that referenced this pull request Apr 14, 2025
…on keys and footer size to Parquet reader (#17085)

Follow-up to #17071

This PR extends the  `OpenFileInfo` struct with `ExtendedOpenFileInfo`:

```cpp
struct ExtendedOpenFileInfo {
	unordered_map<string, Value> options;
};
```

The Parquet reader is extended to support two keys in this struct:

* `encryption_key`: used to pass a global encryption key for the Parquet
file to the reader
* `footer_size`: used to pass the footer size of the Parquet file to the
reader. This can be used to skip the separate reading of the footer
length from the back of the file - instead we can directly read the
footer. Note that this must be exactly correct - an error is thrown if
an incorrect footer size is provided.
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
MultiFileReader Rework (part 18): Replace file path with `OpenFileInfo` struct (duckdb/duckdb#17071)
update julia to v1.2.2 (duckdb/duckdb#17074)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
MultiFileReader Rework (part 18): Replace file path with `OpenFileInfo` struct (duckdb/duckdb#17071)
update julia to v1.2.2 (duckdb/duckdb#17074)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
MultiFileReader Rework (part 18): Replace file path with `OpenFileInfo` struct (duckdb/duckdb#17071)
update julia to v1.2.2 (duckdb/duckdb#17074)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 18, 2025
MultiFileReader Rework (part 18): Replace file path with `OpenFileInfo` struct (duckdb/duckdb#17071)
update julia to v1.2.2 (duckdb/duckdb#17074)
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request May 19, 2025
MultiFileReader Rework (part 18): Replace file path with `OpenFileInfo` struct (duckdb/duckdb#17071)
update julia to v1.2.2 (duckdb/duckdb#17074)
@Mytherin Mytherin deleted the openfileinfo branch June 12, 2025 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants