feat #2746: [Continued] refactor additional loaders phase 1 #2819

jajajalalalala · 2025-07-07T08:23:46Z

Description

refactor additional loaders

pandas loader
jina loader
markitdown loader

Checklist

Go over all the following points, and put an x in all the boxes that apply.

I have read the CONTRIBUTION guide (required)
I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
I have updated the tests accordingly (required for a bug fix or a new feature)
I have updated the documentation if needed:
I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

coderabbitai · 2025-07-07T08:23:51Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes rename and refactor JinaURLReader to JinaURLLoader and PandasReader to PandasLoader, updating their interfaces to inherit from BaseLoader and standardize loading methods and outputs. The MarkItDownLoader also now inherits from BaseLoader and provides a unified load interface. Example scripts and tests were updated to reflect these new class names and interfaces.

Changes

File(s)	Change Summary
camel/loaders/init.py	Updated imports and `__all__` to rename `JinaURLReader`→`JinaURLLoader`, `PandasReader`→`PandasLoader`.
camel/loaders/jina_url_reader.py	Renamed and refactored `JinaURLReader` to `JinaURLLoader` with new methods, error handling, and output format.
camel/loaders/pandas_reader.py	Renamed and refactored `PandasReader` to `PandasLoader`, inheriting from `BaseLoader`, modularized loading.
camel/loaders/markitdown.py	Updated `MarkItDownLoader` to inherit from `BaseLoader`, added unified `load` method and supported formats prop.
examples/loaders/jina_url_reader_example.py	Updated to use `JinaURLLoader` and new `load` method interface.
examples/loaders/markitdown_example.py	Updated to use `load` method of `MarkItDownLoader` and changed result extraction.
examples/loaders/pandas_example.py	Updated to use `PandasLoader` and new `load` method interface.
test/loaders/test_jina_url_reader.py	Updated to test `JinaURLLoader` and its new interface.
test/loaders/test_markitdown.py	Updated to test `MarkItDownLoader` and its new `load` method output structure.
test/loaders/test_pandasai.py	Updated to test `PandasLoader` and its new output structure.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Loader (e.g. JinaURLLoader/MarkItDownLoader/PandasLoader)
    participant DataSource (File/URL/DF)
    User->>Loader: load(source)
    alt Single source
        Loader->>DataSource: Fetch/read/convert single source
        Loader-->>User: {"contents": [result]}
    else Multiple sources
        loop For each source
            Loader->>DataSource: Fetch/read/convert source
        end
        Loader-->>User: {"contents": [result1, result2, ...]}
    end

Poem

In the warren of code, new loaders hop in,
Jina and Pandas, with fresh names begin.
MarkItDown joins with a hop and a bound,
Unified loading, in carrots abound!
Tests and examples all join the parade,
🥕 Hooray for the loaders, so neatly remade!

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch loader_improvement1

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

Wendong-Fan

thanks @jajajalalalala ! Left some comments below also one commit to update jina_url_reader as example: 0d14a36, we also need to rename the file naming, like from jina_url_reader to jina_url_loader

Wendong-Fan · 2025-07-09T16:11:26Z

camel/loaders/jina_url_reader.py

+        source: str,
+        **kwargs: Any,
+    ) -> Dict[str, str]:


the type should follow the base class

Wendong-Fan · 2025-07-09T16:48:18Z

camel/loaders/jina_url_reader.py

+                f"Failed to read content from {source}: {e}"
+            ) from e
+
+    def load(  # type: ignore[override]


same here, the type should follow the base class

Wendong-Fan · 2025-07-09T16:48:49Z

camel/loaders/markitdown.py

@@ -202,3 +203,56 @@ def convert_files(
                        converted_files[path] = f"Error: {e}"

        return converted_files
+
+    def _load_single(  # type: ignore[override]


the type should follow the base class

Wendong-Fan · 2025-07-09T16:49:25Z

camel/loaders/markitdown.py

should we rename the orignial converted_files and converted_file, to make the code more tidy?

Wendong-Fan · 2025-07-09T16:52:24Z

@coderabbitai review

coderabbitai · 2025-07-09T16:52:30Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

camel/loaders/markitdown.py (1)

207-221: LGTM!

The _load_single method is correctly implemented and returns the expected dictionary format. The type annotation appropriately uses # type: ignore[override] as noted in past reviews.

🧹 Nitpick comments (2)

examples/loaders/pandas_example.py (1)
84-95: Update example output comments to match the code changes.

The example output comments still reference "PandasReader" but the code now uses "PandasLoader".
-Example 1: PandasReader without LLM
+Example 1: PandasLoader without LLM
 Loaded DataFrame shape: (10, 2)
 Top 5 countries by sales:
           country  sales
@@ -91,7 +91,7 @@
 3         Germany   4100
 1  United Kingdom   3200

-Example 2: PandasReader with LLM
+Example 2: PandasLoader with LLM
 Querying data with LLM:
camel/loaders/pandas_reader.py (1)
135-141: Fix docstring parameter name and incomplete text.

The docstring refers to parameter "data" but the actual parameter is "source". Also, there's an incomplete text "(de)" on line 139.
 Args:
-    data: The data source(s) to load from. Can be:
+    source: The data source(s) to load from. Can be:
         - A pandas DataFrame
         - A single path/URL (str or Path)
         - A list of paths/DataFrames
-        (de)
     *args: Additional positional arguments.
     **kwargs: Additional keyword arguments.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f03bcf3 and 0d14a36.

📒 Files selected for processing (10)

camel/loaders/__init__.py (2 hunks)
camel/loaders/jina_url_reader.py (3 hunks)
camel/loaders/markitdown.py (2 hunks)
camel/loaders/pandas_reader.py (3 hunks)
examples/loaders/jina_url_reader_example.py (1 hunks)
examples/loaders/markitdown_example.py (2 hunks)
examples/loaders/pandas_example.py (2 hunks)
test/loaders/test_jina_url_reader.py (3 hunks)
test/loaders/test_markitdown.py (2 hunks)
test/loaders/test_pandasai.py (7 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.py`: Review the python code against the google python style guide and point out any mismatches.

**/*.py: Review the python code against the google python style guide and point out any mismatches.

⚙️ Source: CodeRabbit Configuration File

List of files the instruction was applied to:

examples/loaders/markitdown_example.py
test/loaders/test_pandasai.py
camel/loaders/__init__.py
test/loaders/test_markitdown.py
examples/loaders/pandas_example.py
examples/loaders/jina_url_reader_example.py
test/loaders/test_jina_url_reader.py
camel/loaders/pandas_reader.py
camel/loaders/markitdown.py
camel/loaders/jina_url_reader.py

🧬 Code Graph Analysis (1)

examples/loaders/markitdown_example.py (1)

camel/loaders/markitdown.py (1)

load (223-249)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: pytest_examples
GitHub Check: pytest_apps
GitHub Check: pre-commit
GitHub Check: pytest_package_llm_test
GitHub Check: pytest_package_fast_test
GitHub Check: Analyze (python)

🔇 Additional comments (25)

examples/loaders/jina_url_reader_example.py (3)

16-16: LGTM: Consistent class rename

The import update from JinaURLReader to JinaURLLoader aligns with the refactoring effort to standardize loader naming conventions.

22-22: LGTM: Variable name follows class rename

The variable name change maintains consistency with the updated class name.

25-25: LGTM: Correctly adapted to new API

The method call has been properly updated to use the new standardized load() API, accessing the content through the ["contents"][0] pattern as expected by the refactored interface.

camel/loaders/__init__.py (2)

20-20: LGTM: Consistent import updates

The import statements have been properly updated to reflect the class renamings from JinaURLReader to JinaURLLoader and PandasReader to PandasLoader, maintaining consistency with the refactoring effort.

Also applies to: 24-24

33-33: LGTM: Export list updated consistently

The __all__ list has been properly updated to export the renamed classes, ensuring the public API remains consistent with the implementation changes.

Also applies to: 36-36

test/loaders/test_pandasai.py (3)

22-22: LGTM: Import updated consistently

The import has been properly updated to use the renamed PandasLoader class.

26-26: LGTM: Class instantiations updated consistently

All test functions have been updated to instantiate PandasLoader instead of PandasReader, maintaining consistency with the class renaming.

Also applies to: 63-63, 114-114, 147-147

53-53: LGTM: API calls updated to new interface

All method calls have been properly updated to use the new load() API with the standardized result access pattern ["contents"][0]. The test logic remains intact while adapting to the refactored interface.

Also applies to: 102-102, 126-126, 152-152

test/loaders/test_markitdown.py (1)

54-55: LGTM: API updated correctly

The method call has been properly updated to use the new load() API. The result access pattern correctly extracts the content from the standardized dictionary structure.

examples/loaders/markitdown_example.py (2)

40-40: LGTM: Single file conversion updated correctly

The method call has been properly updated to use the new unified load() API with the correct result access pattern.

51-53: LGTM: Multiple file conversion updated correctly

The method call has been properly updated to use the new unified load() API, correctly passing the options (parallel=True, skip_failed=True) and accessing the result through the standardized ["contents"][0] pattern.

test/loaders/test_jina_url_reader.py (3)

18-18: LGTM!

The import correctly reflects the renamed class from JinaURLReader to JinaURLLoader.

53-57: LGTM!

The test correctly uses the new JinaURLLoader class and properly accesses the content from the new dictionary structure returned by the load method.

69-72: LGTM!

The error handling test is properly updated to use the new interface and expects the exception when accessing the nested content structure.

examples/loaders/pandas_example.py (1)

21-21: LGTM!

The code correctly uses the new PandasLoader class and properly accesses results using the new dictionary structure with ["contents"][0].

Also applies to: 53-57, 63-74

camel/loaders/markitdown.py (2)

16-18: LGTM!

The imports and class inheritance are correctly updated to inherit from BaseLoader.

Also applies to: 24-24

251-258: LGTM!

The supported_formats property correctly returns the set of supported file extensions.

camel/loaders/jina_url_reader.py (5)

16-17: LGTM!

The imports and class definition correctly implement the refactoring to JinaURLLoader inheriting from BaseLoader.

Also applies to: 22-22, 27-27

60-61: LGTM!

The constructor correctly calls the parent class initializer and stores the timeout as an instance variable.

Also applies to: 83-83

85-138: Excellent error handling and defensive programming!

The _load_single method has comprehensive validation for Path objects, string URLs, and URL schemes. The error messages are clear and helpful.

139-167: LGTM!

The load method correctly handles both single and multiple sources, returning the standardized dictionary format with proper delegation to _load_single.

169-176: LGTM!

The supported_formats property correctly returns the set of supported URL schemes.

camel/loaders/pandas_reader.py (3)

16-26: LGTM!

The imports and class definition correctly implement the refactoring to PandasLoader inheriting from BaseLoader.

Also applies to: 59-59

150-173: LGTM!

The load method correctly handles both single and multiple sources, with proper SmartDataframe conversion when LLM is configured.

175-182: LGTM!

The supported_formats property correctly returns the set of supported file extensions from the loader mapping.

coderabbitai · 2025-07-09T16:56:44Z

test/loaders/test_markitdown.py

@@ -61,13 +62,13 @@ def test_convert_file_not_found():
    with pytest.raises(
        FileNotFoundError, match="File not found: nonexistent.txt"
    ):
-        converter.convert_file("nonexistent.txt")
+        converter.load("nonexistent.txt")["contents"][0]


⚠️ Potential issue

Fix exception handling in tests

The tests expect exceptions to be raised, but the code tries to access ["contents"][0] after calling load(). If an exception is raised during the load() call, the result access will never be reached.

Move the result access inside the exception context:

- converter.load("nonexistent.txt")["contents"][0] + converter.load("nonexistent.txt")

- converter.load(mock_files["unsupported_xxx"])['contents'][0] + converter.load(mock_files["unsupported_xxx"])

Also applies to: 71-71

🤖 Prompt for AI Agents

In test/loaders/test_markitdown.py at lines 65 and 71, the test code calls converter.load("nonexistent.txt") and then accesses ["contents"][0] outside the exception handling block. This causes the test to fail to properly catch exceptions raised by load(). To fix this, move the ["contents"][0] access inside the exception context manager so that the exception is correctly caught during the load call and subsequent access.

coderabbitai · 2025-07-09T16:56:44Z

camel/loaders/markitdown.py

+        result = []
+        if isinstance(source, str):
+            result.append(self._load_single(source, **kwargs))
+        else:
+            result.append(self.convert_files(source, **kwargs))
+        return {"contents": result}


⚠️ Potential issue

Fix the return format for multiple files.

When handling a list of sources, the method wraps the result of convert_files in a list, but convert_files returns a flat dictionary mapping file paths to content. This doesn't match the expected format where the "contents" list should contain individual dictionaries.

result = [] if isinstance(source, str): result.append(self._load_single(source, **kwargs)) else: - result.append(self.convert_files(source, **kwargs)) + converted = self.convert_files(source, **kwargs) + result.extend({path: content} for path, content in converted.items()) return {"contents": result}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

result = []

if isinstance(source, str):

result.append(self._load_single(source, **kwargs))

else:

result.append(self.convert_files(source, **kwargs))

return {"contents": result}

result = []

if isinstance(source, str):

result.append(self._load_single(source, **kwargs))

else:

converted = self.convert_files(source, **kwargs)

result.extend({path: content} for path, content in converted.items())

return {"contents": result}

🤖 Prompt for AI Agents

In camel/loaders/markitdown.py around lines 244 to 249, when the source is a list, the code incorrectly wraps the dictionary returned by convert_files in a list under "contents". Instead, extract the individual file content dictionaries from the convert_files result and append each one separately to the "contents" list, so the returned dictionary's "contents" key holds a list of dictionaries as expected.

coderabbitai · 2025-07-09T16:56:44Z

camel/loaders/pandas_reader.py

+        # Get the appropriate loader method based on file extension
+        loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None)
+        if not loader_method:
+            raise ValueError(f"No loader method found for {path.suffix} files")
+
+        return loader_method(file_path, **kwargs)


⚠️ Potential issue

Fix the loader method retrieval logic.

The code tries to dynamically get a loader method using getattr, but self.__LOADER already contains direct method references, not method names. You should use the mapping directly.

-# Get the appropriate loader method based on file extension -loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None) -if not loader_method: - raise ValueError(f"No loader method found for {path.suffix} files") - -return loader_method(file_path, **kwargs) +# Get the appropriate loader method based on file extension +loader_method = self.__LOADER.get(path.suffix.lower()) +if not loader_method: + raise ValueError(f"No loader method found for {path.suffix} files") + +return loader_method(file_path, **kwargs)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Get the appropriate loader method based on file extension

loader_method = getattr(self, f"read_{path.suffix[1:].lower()}", None)

if not loader_method:

raise ValueError(f"No loader method found for {path.suffix} files")

return loader_method(file_path, **kwargs)

# Get the appropriate loader method based on file extension

loader_method = self.__LOADER.get(path.suffix.lower())

if not loader_method:

raise ValueError(f"No loader method found for {path.suffix} files")

return loader_method(file_path, **kwargs)

🤖 Prompt for AI Agents

In camel/loaders/pandas_reader.py around lines 120 to 125, the code incorrectly uses getattr to retrieve a loader method by name, but self.__LOADER already holds direct method references. Fix this by accessing the loader method directly from self.__LOADER using the file extension key instead of using getattr. Raise an error if the extension is not found in the mapping, then call the retrieved method with file_path and kwargs.

add pandas loadeer

86078c7

jajajalalalala added 6 commits July 7, 2025 18:51

add jina loader

575b261

add markitdown loader

51f0a34

fix format

27156e1

Merge branch 'master' into loader_improvement1

5fc7946

fix override error

4589b2a

remove cache

f568b73

jajajalalalala changed the title ~~add additional loaders phase 1~~ feat #2746: [Continued] refactor additional loaders phase 1 Jul 7, 2025

jajajalalalala self-assigned this Jul 7, 2025

remove charts

aae480f

jajajalalalala requested a review from Wendong-Fan July 7, 2025 10:48

Wendong-Fan reviewed Jul 9, 2025

View reviewed changes

coderabbitai bot reviewed Jul 9, 2025

View reviewed changes

Wendong-Fan added the Waiting for Update PR has been reviewed, need to be updated based on review comment label Jul 9, 2025

jajajalalalala added 2 commits July 12, 2025 15:52

Merge branch 'master' into loader_improvement1

a27b0d9

improve code based on comments

23e1a75

jajajalalalala force-pushed the loader_improvement1 branch from 0d14a36 to 23e1a75 Compare July 12, 2025 07:20

jajajalalalala added 10 commits July 12, 2025 17:28

fix markit down

d6a439d

remove override

933ff03

Merge branch 'master' into loader_improvement1

e616c10

Merge branch 'master' into loader_improvement1

7cd50f8

Merge branch 'master' into loader_improvement1

c1d07f7

change loader key logic

8a37d89

fix pre commit

0ba1e1e

Merge branch 'master' into loader_improvement1

ebf7cdd

remove key generation part

2b5f275

Merge branch 'master' into loader_improvement1

01d9482

jajajalalalala added 2 commits August 27, 2025 16:29

change file name

c819126

revert name changing

9ecbec0

Wendong-Fan mentioned this pull request Aug 30, 2025

feat: Dynamic File Editing #3063

Merged

6 tasks

Merge branch 'master' into loader_improvement1

abaa735

feat #2746: [Continued] refactor additional loaders phase 1 #2819

Are you sure you want to change the base?

feat #2746: [Continued] refactor additional loaders phase 1 #2819

Uh oh!

Conversation

jajajalalalala commented Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Uh oh!

coderabbitai bot commented Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

Wendong-Fan left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Wendong-Fan Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

Wendong-Fan Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

Wendong-Fan Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

Wendong-Fan Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

Wendong-Fan commented Jul 9, 2025

Uh oh!

coderabbitai bot commented Jul 9, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jul 9, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jajajalalalala commented Jul 7, 2025 •

edited

Loading

coderabbitai bot commented Jul 7, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

Wendong-Fan left a comment •

edited

Loading