-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat #2746: [Continued] refactor additional loaders phase 1 #2819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes rename and refactor Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Loader (e.g. JinaURLLoader/MarkItDownLoader/PandasLoader)
participant DataSource (File/URL/DF)
User->>Loader: load(source)
alt Single source
Loader->>DataSource: Fetch/read/convert single source
Loader-->>User: {"contents": [result]}
else Multiple sources
loop For each source
Loader->>DataSource: Fetch/read/convert source
end
Loader-->>User: {"contents": [result1, result2, ...]}
end
Poem
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @jajajalalalala ! Left some comments below also one commit to update jina_url_reader as example: 0d14a36, we also need to rename the file naming, like from jina_url_reader
to jina_url_loader
camel/loaders/jina_url_reader.py
Outdated
source: str, | ||
**kwargs: Any, | ||
) -> Dict[str, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the type should follow the base class
camel/loaders/jina_url_reader.py
Outdated
f"Failed to read content from {source}: {e}" | ||
) from e | ||
|
||
def load( # type: ignore[override] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, the type should follow the base class
camel/loaders/markitdown.py
Outdated
@@ -202,3 +203,56 @@ def convert_files( | |||
converted_files[path] = f"Error: {e}" | |||
|
|||
return converted_files | |||
|
|||
def _load_single( # type: ignore[override] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the type should follow the base class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we rename the orignial converted_files and converted_file, to make the code more tidy?
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
camel/loaders/markitdown.py (1)
207-221
: LGTM!The
_load_single
method is correctly implemented and returns the expected dictionary format. The type annotation appropriately uses# type: ignore[override]
as noted in past reviews.
🧹 Nitpick comments (2)
examples/loaders/pandas_example.py (1)
84-95
: Update example output comments to match the code changes.The example output comments still reference "PandasReader" but the code now uses "PandasLoader".
-Example 1: PandasReader without LLM +Example 1: PandasLoader without LLM Loaded DataFrame shape: (10, 2) Top 5 countries by sales: country sales @@ -91,7 +91,7 @@ 3 Germany 4100 1 United Kingdom 3200 -Example 2: PandasReader with LLM +Example 2: PandasLoader with LLM Querying data with LLM:camel/loaders/pandas_reader.py (1)
135-141
: Fix docstring parameter name and incomplete text.The docstring refers to parameter "data" but the actual parameter is "source". Also, there's an incomplete text "(de)" on line 139.
Args: - data: The data source(s) to load from. Can be: + source: The data source(s) to load from. Can be: - A pandas DataFrame - A single path/URL (str or Path) - A list of paths/DataFrames - (de) *args: Additional positional arguments. **kwargs: Additional keyword arguments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
camel/loaders/__init__.py
(2 hunks)camel/loaders/jina_url_reader.py
(3 hunks)camel/loaders/markitdown.py
(2 hunks)camel/loaders/pandas_reader.py
(3 hunks)examples/loaders/jina_url_reader_example.py
(1 hunks)examples/loaders/markitdown_example.py
(2 hunks)examples/loaders/pandas_example.py
(2 hunks)test/loaders/test_jina_url_reader.py
(3 hunks)test/loaders/test_markitdown.py
(2 hunks)test/loaders/test_pandasai.py
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Review the python code against the google python style guide and point out any mismatches.
**/*.py
: Review the python code against the google python style guide and point out any mismatches.
⚙️ Source: CodeRabbit Configuration File
List of files the instruction was applied to:
examples/loaders/markitdown_example.py
test/loaders/test_pandasai.py
camel/loaders/__init__.py
test/loaders/test_markitdown.py
examples/loaders/pandas_example.py
examples/loaders/jina_url_reader_example.py
test/loaders/test_jina_url_reader.py
camel/loaders/pandas_reader.py
camel/loaders/markitdown.py
camel/loaders/jina_url_reader.py
🧬 Code Graph Analysis (1)
examples/loaders/markitdown_example.py (1)
camel/loaders/markitdown.py (1)
load
(223-249)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: pytest_examples
- GitHub Check: pytest_apps
- GitHub Check: pre-commit
- GitHub Check: pytest_package_llm_test
- GitHub Check: pytest_package_fast_test
- GitHub Check: Analyze (python)
🔇 Additional comments (25)
examples/loaders/jina_url_reader_example.py (3)
16-16
: LGTM: Consistent class renameThe import update from
JinaURLReader
toJinaURLLoader
aligns with the refactoring effort to standardize loader naming conventions.
22-22
: LGTM: Variable name follows class renameThe variable name change maintains consistency with the updated class name.
25-25
: LGTM: Correctly adapted to new APIThe method call has been properly updated to use the new standardized
load()
API, accessing the content through the["contents"][0]
pattern as expected by the refactored interface.camel/loaders/__init__.py (2)
20-20
: LGTM: Consistent import updatesThe import statements have been properly updated to reflect the class renamings from
JinaURLReader
toJinaURLLoader
andPandasReader
toPandasLoader
, maintaining consistency with the refactoring effort.Also applies to: 24-24
33-33
: LGTM: Export list updated consistentlyThe
__all__
list has been properly updated to export the renamed classes, ensuring the public API remains consistent with the implementation changes.Also applies to: 36-36
test/loaders/test_pandasai.py (3)
22-22
: LGTM: Import updated consistentlyThe import has been properly updated to use the renamed
PandasLoader
class.
26-26
: LGTM: Class instantiations updated consistentlyAll test functions have been updated to instantiate
PandasLoader
instead ofPandasReader
, maintaining consistency with the class renaming.Also applies to: 63-63, 114-114, 147-147
53-53
: LGTM: API calls updated to new interfaceAll method calls have been properly updated to use the new
load()
API with the standardized result access pattern["contents"][0]
. The test logic remains intact while adapting to the refactored interface.Also applies to: 102-102, 126-126, 152-152
test/loaders/test_markitdown.py (1)
54-55
: LGTM: API updated correctlyThe method call has been properly updated to use the new
load()
API. The result access pattern correctly extracts the content from the standardized dictionary structure.examples/loaders/markitdown_example.py (2)
40-40
: LGTM: Single file conversion updated correctlyThe method call has been properly updated to use the new unified
load()
API with the correct result access pattern.
51-53
: LGTM: Multiple file conversion updated correctlyThe method call has been properly updated to use the new unified
load()
API, correctly passing the options (parallel=True
,skip_failed=True
) and accessing the result through the standardized["contents"][0]
pattern.test/loaders/test_jina_url_reader.py (3)
18-18
: LGTM!The import correctly reflects the renamed class from
JinaURLReader
toJinaURLLoader
.
53-57
: LGTM!The test correctly uses the new
JinaURLLoader
class and properly accesses the content from the new dictionary structure returned by theload
method.
69-72
: LGTM!The error handling test is properly updated to use the new interface and expects the exception when accessing the nested content structure.
examples/loaders/pandas_example.py (1)
21-21
: LGTM!The code correctly uses the new
PandasLoader
class and properly accesses results using the new dictionary structure with["contents"][0]
.Also applies to: 53-57, 63-74
camel/loaders/markitdown.py (2)
16-18
: LGTM!The imports and class inheritance are correctly updated to inherit from
BaseLoader
.Also applies to: 24-24
251-258
: LGTM!The
supported_formats
property correctly returns the set of supported file extensions.camel/loaders/jina_url_reader.py (5)
16-17
: LGTM!The imports and class definition correctly implement the refactoring to
JinaURLLoader
inheriting fromBaseLoader
.Also applies to: 22-22, 27-27
60-61
: LGTM!The constructor correctly calls the parent class initializer and stores the timeout as an instance variable.
Also applies to: 83-83
85-138
: Excellent error handling and defensive programming!The
_load_single
method has comprehensive validation for Path objects, string URLs, and URL schemes. The error messages are clear and helpful.
139-167
: LGTM!The
load
method correctly handles both single and multiple sources, returning the standardized dictionary format with proper delegation to_load_single
.
169-176
: LGTM!The
supported_formats
property correctly returns the set of supported URL schemes.camel/loaders/pandas_reader.py (3)
16-26
: LGTM!The imports and class definition correctly implement the refactoring to
PandasLoader
inheriting fromBaseLoader
.Also applies to: 59-59
150-173
: LGTM!The load method correctly handles both single and multiple sources, with proper SmartDataframe conversion when LLM is configured.
175-182
: LGTM!The
supported_formats
property correctly returns the set of supported file extensions from the loader mapping.
test/loaders/test_markitdown.py
Outdated
@@ -61,13 +62,13 @@ def test_convert_file_not_found(): | |||
with pytest.raises( | |||
FileNotFoundError, match="File not found: nonexistent.txt" | |||
): | |||
converter.convert_file("nonexistent.txt") | |||
converter.load("nonexistent.txt")["contents"][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix exception handling in tests
The tests expect exceptions to be raised, but the code tries to access ["contents"][0]
after calling load()
. If an exception is raised during the load()
call, the result access will never be reached.
Move the result access inside the exception context:
- converter.load("nonexistent.txt")["contents"][0]
+ converter.load("nonexistent.txt")
- converter.load(mock_files["unsupported_xxx"])['contents'][0]
+ converter.load(mock_files["unsupported_xxx"])
Also applies to: 71-71
🤖 Prompt for AI Agents
In test/loaders/test_markitdown.py at lines 65 and 71, the test code calls
converter.load("nonexistent.txt") and then accesses ["contents"][0] outside the
exception handling block. This causes the test to fail to properly catch
exceptions raised by load(). To fix this, move the ["contents"][0] access inside
the exception context manager so that the exception is correctly caught during
the load call and subsequent access.
camel/loaders/markitdown.py
Outdated
result = [] | ||
if isinstance(source, str): | ||
result.append(self._load_single(source, **kwargs)) | ||
else: | ||
result.append(self.convert_files(source, **kwargs)) | ||
return {"contents": result} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the return format for multiple files.
When handling a list of sources, the method wraps the result of convert_files
in a list, but convert_files
returns a flat dictionary mapping file paths to content. This doesn't match the expected format where the "contents" list should contain individual dictionaries.
result = []
if isinstance(source, str):
result.append(self._load_single(source, **kwargs))
else:
- result.append(self.convert_files(source, **kwargs))
+ converted = self.convert_files(source, **kwargs)
+ result.extend({path: content} for path, content in converted.items())
return {"contents": result}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
result = [] | |
if isinstance(source, str): | |
result.append(self._load_single(source, **kwargs)) | |
else: | |
result.append(self.convert_files(source, **kwargs)) | |
return {"contents": result} | |
result = [] | |
if isinstance(source, str): | |
result.append(self._load_single(source, **kwargs)) | |
else: | |
converted = self.convert_files(source, **kwargs) | |
result.extend({path: content} for path, content in converted.items()) | |
return {"contents": result} |
🤖 Prompt for AI Agents
In camel/loaders/markitdown.py around lines 244 to 249, when the source is a
list, the code incorrectly wraps the dictionary returned by convert_files in a
list under "contents". Instead, extract the individual file content dictionaries
from the convert_files result and append each one separately to the "contents"
list, so the returned dictionary's "contents" key holds a list of dictionaries
as expected.
# Get the appropriate loader method based on file extension | ||
loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None) | ||
if not loader_method: | ||
raise ValueError(f"No loader method found for {path.suffix} files") | ||
|
||
return loader_method(file_path, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the loader method retrieval logic.
The code tries to dynamically get a loader method using getattr
, but self.__LOADER
already contains direct method references, not method names. You should use the mapping directly.
-# Get the appropriate loader method based on file extension
-loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None)
-if not loader_method:
- raise ValueError(f"No loader method found for {path.suffix} files")
-
-return loader_method(file_path, **kwargs)
+# Get the appropriate loader method based on file extension
+loader_method = self.__LOADER.get(path.suffix.lower())
+if not loader_method:
+ raise ValueError(f"No loader method found for {path.suffix} files")
+
+return loader_method(file_path, **kwargs)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Get the appropriate loader method based on file extension | |
loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None) | |
if not loader_method: | |
raise ValueError(f"No loader method found for {path.suffix} files") | |
return loader_method(file_path, **kwargs) | |
# Get the appropriate loader method based on file extension | |
loader_method = self.__LOADER.get(path.suffix.lower()) | |
if not loader_method: | |
raise ValueError(f"No loader method found for {path.suffix} files") | |
return loader_method(file_path, **kwargs) |
🤖 Prompt for AI Agents
In camel/loaders/pandas_reader.py around lines 120 to 125, the code incorrectly
uses getattr to retrieve a loader method by name, but self.__LOADER already
holds direct method references. Fix this by accessing the loader method directly
from self.__LOADER using the file extension key instead of using getattr. Raise
an error if the extension is not found in the mapping, then call the retrieved
method with file_path and kwargs.
0d14a36
to
23e1a75
Compare
Description
refactor additional loaders
pandas loader
jina loader
markitdown loader
Checklist
Go over all the following points, and put an
x
in all the boxes that apply.Fixes #issue-number
in the PR description (required)pyproject.toml
anduv lock
If you are unsure about any of these, don't hesitate to ask. We are here to help!