Skip to content

Conversation

jajajalalalala
Copy link
Collaborator

@jajajalalalala jajajalalalala commented Jul 7, 2025

Description

refactor additional loaders

pandas loader
jina loader
markitdown loader

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

Copy link
Contributor

coderabbitai bot commented Jul 7, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes rename and refactor JinaURLReader to JinaURLLoader and PandasReader to PandasLoader, updating their interfaces to inherit from BaseLoader and standardize loading methods and outputs. The MarkItDownLoader also now inherits from BaseLoader and provides a unified load interface. Example scripts and tests were updated to reflect these new class names and interfaces.

Changes

File(s) Change Summary
camel/loaders/init.py Updated imports and __all__ to rename JinaURLReaderJinaURLLoader, PandasReaderPandasLoader.
camel/loaders/jina_url_reader.py Renamed and refactored JinaURLReader to JinaURLLoader with new methods, error handling, and output format.
camel/loaders/pandas_reader.py Renamed and refactored PandasReader to PandasLoader, inheriting from BaseLoader, modularized loading.
camel/loaders/markitdown.py Updated MarkItDownLoader to inherit from BaseLoader, added unified load method and supported formats prop.
examples/loaders/jina_url_reader_example.py Updated to use JinaURLLoader and new load method interface.
examples/loaders/markitdown_example.py Updated to use load method of MarkItDownLoader and changed result extraction.
examples/loaders/pandas_example.py Updated to use PandasLoader and new load method interface.
test/loaders/test_jina_url_reader.py Updated to test JinaURLLoader and its new interface.
test/loaders/test_markitdown.py Updated to test MarkItDownLoader and its new load method output structure.
test/loaders/test_pandasai.py Updated to test PandasLoader and its new output structure.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Loader (e.g. JinaURLLoader/MarkItDownLoader/PandasLoader)
    participant DataSource (File/URL/DF)
    User->>Loader: load(source)
    alt Single source
        Loader->>DataSource: Fetch/read/convert single source
        Loader-->>User: {"contents": [result]}
    else Multiple sources
        loop For each source
            Loader->>DataSource: Fetch/read/convert source
        end
        Loader-->>User: {"contents": [result1, result2, ...]}
    end
Loading

Poem

In the warren of code, new loaders hop in,
Jina and Pandas, with fresh names begin.
MarkItDown joins with a hop and a bound,
Unified loading, in carrots abound!
Tests and examples all join the parade,
🥕 Hooray for the loaders, so neatly remade!

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch loader_improvement1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@jajajalalalala jajajalalalala changed the title add additional loaders phase 1 feat #2746: [Continued] refactor additional loaders phase 1 Jul 7, 2025
@jajajalalalala jajajalalalala self-assigned this Jul 7, 2025
Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jajajalalalala ! Left some comments below also one commit to update jina_url_reader as example: 0d14a36, we also need to rename the file naming, like from jina_url_reader to jina_url_loader

Comment on lines 86 to 88
source: str,
**kwargs: Any,
) -> Dict[str, str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type should follow the base class

f"Failed to read content from {source}: {e}"
) from e

def load( # type: ignore[override]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, the type should follow the base class

@@ -202,3 +203,56 @@ def convert_files(
converted_files[path] = f"Error: {e}"

return converted_files

def _load_single( # type: ignore[override]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type should follow the base class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we rename the orignial converted_files and converted_file, to make the code more tidy?

@Wendong-Fan
Copy link
Member

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Jul 9, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
camel/loaders/markitdown.py (1)

207-221: LGTM!

The _load_single method is correctly implemented and returns the expected dictionary format. The type annotation appropriately uses # type: ignore[override] as noted in past reviews.

🧹 Nitpick comments (2)
examples/loaders/pandas_example.py (1)

84-95: Update example output comments to match the code changes.

The example output comments still reference "PandasReader" but the code now uses "PandasLoader".

-Example 1: PandasReader without LLM
+Example 1: PandasLoader without LLM
 Loaded DataFrame shape: (10, 2)
 Top 5 countries by sales:
           country  sales
@@ -91,7 +91,7 @@
 3         Germany   4100
 1  United Kingdom   3200

-Example 2: PandasReader with LLM
+Example 2: PandasLoader with LLM
 Querying data with LLM:
camel/loaders/pandas_reader.py (1)

135-141: Fix docstring parameter name and incomplete text.

The docstring refers to parameter "data" but the actual parameter is "source". Also, there's an incomplete text "(de)" on line 139.

 Args:
-    data: The data source(s) to load from. Can be:
+    source: The data source(s) to load from. Can be:
         - A pandas DataFrame
         - A single path/URL (str or Path)
         - A list of paths/DataFrames
-        (de)
     *args: Additional positional arguments.
     **kwargs: Additional keyword arguments.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f03bcf3 and 0d14a36.

📒 Files selected for processing (10)
  • camel/loaders/__init__.py (2 hunks)
  • camel/loaders/jina_url_reader.py (3 hunks)
  • camel/loaders/markitdown.py (2 hunks)
  • camel/loaders/pandas_reader.py (3 hunks)
  • examples/loaders/jina_url_reader_example.py (1 hunks)
  • examples/loaders/markitdown_example.py (2 hunks)
  • examples/loaders/pandas_example.py (2 hunks)
  • test/loaders/test_jina_url_reader.py (3 hunks)
  • test/loaders/test_markitdown.py (2 hunks)
  • test/loaders/test_pandasai.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Review the python code against the google python style guide and point out any mismatches.

**/*.py: Review the python code against the google python style guide and point out any mismatches.

⚙️ Source: CodeRabbit Configuration File

List of files the instruction was applied to:

  • examples/loaders/markitdown_example.py
  • test/loaders/test_pandasai.py
  • camel/loaders/__init__.py
  • test/loaders/test_markitdown.py
  • examples/loaders/pandas_example.py
  • examples/loaders/jina_url_reader_example.py
  • test/loaders/test_jina_url_reader.py
  • camel/loaders/pandas_reader.py
  • camel/loaders/markitdown.py
  • camel/loaders/jina_url_reader.py
🧬 Code Graph Analysis (1)
examples/loaders/markitdown_example.py (1)
camel/loaders/markitdown.py (1)
  • load (223-249)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: pytest_examples
  • GitHub Check: pytest_apps
  • GitHub Check: pre-commit
  • GitHub Check: pytest_package_llm_test
  • GitHub Check: pytest_package_fast_test
  • GitHub Check: Analyze (python)
🔇 Additional comments (25)
examples/loaders/jina_url_reader_example.py (3)

16-16: LGTM: Consistent class rename

The import update from JinaURLReader to JinaURLLoader aligns with the refactoring effort to standardize loader naming conventions.


22-22: LGTM: Variable name follows class rename

The variable name change maintains consistency with the updated class name.


25-25: LGTM: Correctly adapted to new API

The method call has been properly updated to use the new standardized load() API, accessing the content through the ["contents"][0] pattern as expected by the refactored interface.

camel/loaders/__init__.py (2)

20-20: LGTM: Consistent import updates

The import statements have been properly updated to reflect the class renamings from JinaURLReader to JinaURLLoader and PandasReader to PandasLoader, maintaining consistency with the refactoring effort.

Also applies to: 24-24


33-33: LGTM: Export list updated consistently

The __all__ list has been properly updated to export the renamed classes, ensuring the public API remains consistent with the implementation changes.

Also applies to: 36-36

test/loaders/test_pandasai.py (3)

22-22: LGTM: Import updated consistently

The import has been properly updated to use the renamed PandasLoader class.


26-26: LGTM: Class instantiations updated consistently

All test functions have been updated to instantiate PandasLoader instead of PandasReader, maintaining consistency with the class renaming.

Also applies to: 63-63, 114-114, 147-147


53-53: LGTM: API calls updated to new interface

All method calls have been properly updated to use the new load() API with the standardized result access pattern ["contents"][0]. The test logic remains intact while adapting to the refactored interface.

Also applies to: 102-102, 126-126, 152-152

test/loaders/test_markitdown.py (1)

54-55: LGTM: API updated correctly

The method call has been properly updated to use the new load() API. The result access pattern correctly extracts the content from the standardized dictionary structure.

examples/loaders/markitdown_example.py (2)

40-40: LGTM: Single file conversion updated correctly

The method call has been properly updated to use the new unified load() API with the correct result access pattern.


51-53: LGTM: Multiple file conversion updated correctly

The method call has been properly updated to use the new unified load() API, correctly passing the options (parallel=True, skip_failed=True) and accessing the result through the standardized ["contents"][0] pattern.

test/loaders/test_jina_url_reader.py (3)

18-18: LGTM!

The import correctly reflects the renamed class from JinaURLReader to JinaURLLoader.


53-57: LGTM!

The test correctly uses the new JinaURLLoader class and properly accesses the content from the new dictionary structure returned by the load method.


69-72: LGTM!

The error handling test is properly updated to use the new interface and expects the exception when accessing the nested content structure.

examples/loaders/pandas_example.py (1)

21-21: LGTM!

The code correctly uses the new PandasLoader class and properly accesses results using the new dictionary structure with ["contents"][0].

Also applies to: 53-57, 63-74

camel/loaders/markitdown.py (2)

16-18: LGTM!

The imports and class inheritance are correctly updated to inherit from BaseLoader.

Also applies to: 24-24


251-258: LGTM!

The supported_formats property correctly returns the set of supported file extensions.

camel/loaders/jina_url_reader.py (5)

16-17: LGTM!

The imports and class definition correctly implement the refactoring to JinaURLLoader inheriting from BaseLoader.

Also applies to: 22-22, 27-27


60-61: LGTM!

The constructor correctly calls the parent class initializer and stores the timeout as an instance variable.

Also applies to: 83-83


85-138: Excellent error handling and defensive programming!

The _load_single method has comprehensive validation for Path objects, string URLs, and URL schemes. The error messages are clear and helpful.


139-167: LGTM!

The load method correctly handles both single and multiple sources, returning the standardized dictionary format with proper delegation to _load_single.


169-176: LGTM!

The supported_formats property correctly returns the set of supported URL schemes.

camel/loaders/pandas_reader.py (3)

16-26: LGTM!

The imports and class definition correctly implement the refactoring to PandasLoader inheriting from BaseLoader.

Also applies to: 59-59


150-173: LGTM!

The load method correctly handles both single and multiple sources, with proper SmartDataframe conversion when LLM is configured.


175-182: LGTM!

The supported_formats property correctly returns the set of supported file extensions from the loader mapping.

@@ -61,13 +62,13 @@ def test_convert_file_not_found():
with pytest.raises(
FileNotFoundError, match="File not found: nonexistent.txt"
):
converter.convert_file("nonexistent.txt")
converter.load("nonexistent.txt")["contents"][0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix exception handling in tests

The tests expect exceptions to be raised, but the code tries to access ["contents"][0] after calling load(). If an exception is raised during the load() call, the result access will never be reached.

Move the result access inside the exception context:

-        converter.load("nonexistent.txt")["contents"][0]
+        converter.load("nonexistent.txt")
-        converter.load(mock_files["unsupported_xxx"])['contents'][0]
+        converter.load(mock_files["unsupported_xxx"])

Also applies to: 71-71

🤖 Prompt for AI Agents
In test/loaders/test_markitdown.py at lines 65 and 71, the test code calls
converter.load("nonexistent.txt") and then accesses ["contents"][0] outside the
exception handling block. This causes the test to fail to properly catch
exceptions raised by load(). To fix this, move the ["contents"][0] access inside
the exception context manager so that the exception is correctly caught during
the load call and subsequent access.

Comment on lines 244 to 249
result = []
if isinstance(source, str):
result.append(self._load_single(source, **kwargs))
else:
result.append(self.convert_files(source, **kwargs))
return {"contents": result}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix the return format for multiple files.

When handling a list of sources, the method wraps the result of convert_files in a list, but convert_files returns a flat dictionary mapping file paths to content. This doesn't match the expected format where the "contents" list should contain individual dictionaries.

 result = []
 if isinstance(source, str):
     result.append(self._load_single(source, **kwargs))
 else:
-    result.append(self.convert_files(source, **kwargs))
+    converted = self.convert_files(source, **kwargs)
+    result.extend({path: content} for path, content in converted.items())
 return {"contents": result}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
result = []
if isinstance(source, str):
result.append(self._load_single(source, **kwargs))
else:
result.append(self.convert_files(source, **kwargs))
return {"contents": result}
result = []
if isinstance(source, str):
result.append(self._load_single(source, **kwargs))
else:
converted = self.convert_files(source, **kwargs)
result.extend({path: content} for path, content in converted.items())
return {"contents": result}
🤖 Prompt for AI Agents
In camel/loaders/markitdown.py around lines 244 to 249, when the source is a
list, the code incorrectly wraps the dictionary returned by convert_files in a
list under "contents". Instead, extract the individual file content dictionaries
from the convert_files result and append each one separately to the "contents"
list, so the returned dictionary's "contents" key holds a list of dictionaries
as expected.

Comment on lines +120 to +126
# Get the appropriate loader method based on file extension
loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None)
if not loader_method:
raise ValueError(f"No loader method found for {path.suffix} files")

return loader_method(file_path, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix the loader method retrieval logic.

The code tries to dynamically get a loader method using getattr, but self.__LOADER already contains direct method references, not method names. You should use the mapping directly.

-# Get the appropriate loader method based on file extension
-loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None)
-if not loader_method:
-    raise ValueError(f"No loader method found for {path.suffix} files")
-
-return loader_method(file_path, **kwargs)
+# Get the appropriate loader method based on file extension
+loader_method = self.__LOADER.get(path.suffix.lower())
+if not loader_method:
+    raise ValueError(f"No loader method found for {path.suffix} files")
+
+return loader_method(file_path, **kwargs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Get the appropriate loader method based on file extension
loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None)
if not loader_method:
raise ValueError(f"No loader method found for {path.suffix} files")
return loader_method(file_path, **kwargs)
# Get the appropriate loader method based on file extension
loader_method = self.__LOADER.get(path.suffix.lower())
if not loader_method:
raise ValueError(f"No loader method found for {path.suffix} files")
return loader_method(file_path, **kwargs)
🤖 Prompt for AI Agents
In camel/loaders/pandas_reader.py around lines 120 to 125, the code incorrectly
uses getattr to retrieve a loader method by name, but self.__LOADER already
holds direct method references. Fix this by accessing the loader method directly
from self.__LOADER using the file extension key instead of using getattr. Raise
an error if the extension is not found in the mapping, then call the retrieved
method with file_path and kwargs.

@Wendong-Fan Wendong-Fan added the Waiting for Update PR has been reviewed, need to be updated based on review comment label Jul 9, 2025
@Wendong-Fan Wendong-Fan mentioned this pull request Aug 30, 2025
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for Update PR has been reviewed, need to be updated based on review comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants