A clean, modular RAG (Retrieval-Augmented Generation) system for academic papers built with ChromaDB and OpenAI.
This system is designed to work with papers from the LLM Agent Papers repository - a comprehensive collection of must-read papers on Large Language Model Agents. The repository contains papers covering:
- Agent Systems: Personality, Memory, Planning, Tool Use, RL Training
- Multi-Agent Systems: Collaborative, Adversarial, and Casual Conversations
- Applications: Real-world agent implementations
- Frameworks: Development frameworks and tools
- Resources: Benchmarks, tool lists, and evaluation metrics
Visit LLM Agent Papers to explore the full collection of papers and resources.
With 200+ papers in the collection, vectorization can be computationally intensive even on high-end hardware. For optimal performance, this system was developed using:
```bash
# DigitalOcean H100 GPU Droplet for vectorization
# (size and region slugs inferred from the droplet name)
doctl compute droplet create \
  --image 191457505 \
  --size gpu-h100x1-80gb \
  --region nyc2 \
  --vpc-uuid a59c66ae-d131-49e6-a1b4-88535df82c14 \
  ml-ai-ubuntu-gpu-h100x1-80gb-nyc2
```
Why H100? The NVIDIA H100 GPU provides:
- 80GB HBM3 Memory for large-scale vector operations
- 4th Gen Tensor Cores for accelerated AI workloads
- Multi-Instance GPU (MIG) for efficient resource utilization
- PCIe 5.0 for high-speed data transfer
- Cloud Processing: Vectorize papers on H100 GPU droplet
- Local Development: SSH copy vectorized data to local machine
- Fast Queries: Run RAG queries locally with pre-computed embeddings
This approach ensures rapid development cycles while leveraging cloud GPU power for the heavy computational work.
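The cloud-to-local step above can be sketched with `scp` (the host and remote path here are placeholders, not values from this project; adjust them to your droplet and checkout location):

```shell
# Copy the pre-computed ChromaDB store from the H100 droplet to the local checkout.
# <droplet-ip> and the remote path are placeholders -- substitute your own values.
scp -r root@<droplet-ip>:~/PaperRAG/src/utils/chroma/ ./src/utils/chroma/
```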
- Modular Design: Clean base classes for easy extension and customization
- Paper Processing: Automatic PDF text extraction and chunking with caching
- Smart Query Enhancement: Uses OpenAI to improve search queries
- Academic Focus: Optimized for research paper Q&A with proper citations
- Interactive Interface: User-friendly command-line interface
```
PaperRAG/
├── src/
│   ├── rag/
│   │   ├── __init__.py      # RAG module exports
│   │   ├── base.py          # Base RAG classes
│   │   └── PaperRAG.py      # Paper-specific RAG implementation
│   └── utils/
│       ├── __init__.py      # Utils module exports
│       ├── paper_chunks.py  # PDF processing and chunking
│       └── chroma/          # ChromaDB storage
├── assets/
│   └── papers/              # Place your PDF papers here
├── main.py                  # Interactive application
└── requirements.txt         # Dependencies
```
Abstract base class for all RAG systems:

- `gen(user_query: str) -> str`: Generate a response to the user query
- `setup() -> None`: Initialize the RAG system
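The interface can be sketched as an abstract base class; the method names come from the list above, while the docstrings are illustrative:

```python
from abc import ABC, abstractmethod


class BaseRAG(ABC):
    """Abstract base class for all RAG systems (sketch of the documented interface)."""

    @abstractmethod
    def setup(self) -> None:
        """Initialize the RAG system (load data, create collections, etc.)."""

    @abstractmethod
    def gen(self, user_query: str) -> str:
        """Generate a response to the user's query."""
```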
Base class for ChromaDB-based RAG systems:

- Handles collection management
- Provides the query pipeline
- Abstract methods for customization:
  - `_augment_user_query()`: Query enhancement
  - `_generate_answer()`: Answer generation
  - `_load_data()`: Data loading
Concrete implementation for academic papers:
- Paper-specific query enhancement
- Academic citation formatting
- Automatic paper chunking and loading
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set up the environment:

   ```bash
   cp .env.example .env  # Add your OpenAI API key to .env
   ```

3. Add papers: place PDF files in `assets/papers/`

4. Run the interactive system:

   ```bash
   python main.py
   ```
```python
from src.rag import PaperRAG
import chromadb

# Initialize
chroma_client = chromadb.PersistentClient(path="chroma")
rag_system = PaperRAG(chroma_client)

# Setup (loads papers and creates collection)
rag_system.setup()

# Ask questions
answer = rag_system.gen("What are the best practices for AI agents?")
print(answer)
```
```python
from src.rag.base import ChromaRAG

class CustomRAG(ChromaRAG):
    def _augment_user_query(self, user_query: str) -> str:
        # Custom query enhancement
        return f"enhanced: {user_query}"

    def _generate_answer(self, user_query: str, rag_results) -> str:
        # Custom answer generation
        return "Custom answer based on RAG results"

    def _load_data(self) -> None:
        # Custom data loading
        pass
```
- `OPENAI_API_KEY`: Your OpenAI API key
- `chunk_size`: Default 800 characters per chunk
- `chunk_overlap`: Default 200 characters of overlap
- Cache is automatically managed based on file modifications
- Persistent storage in `src/utils/chroma/`
- Collection name: `paper_collection`
- Automatic collection creation and management
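The `chunk_size` and `chunk_overlap` defaults can be illustrated with a minimal sliding-window chunker (a sketch, not the actual implementation in `paper_chunks.py`):

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters, so sentences spanning a boundary are not lost."""
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```

With the defaults, a 1,000-character paper yields two chunks whose last/first 200 characters coincide.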
To add a new RAG system:

- Extend `ChromaRAG` or `BaseRAG`
- Implement the required abstract methods
- Add it to `src/rag/__init__.py`
To add a new data source:

- Override the `_load_data()` method and implement your data loading logic
- Ensure proper document formatting for ChromaDB
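ChromaDB's `collection.add()` takes parallel lists of documents, metadatas, and unique ids. A hypothetical helper (`to_chroma_records` is not part of this codebase) might shape loaded chunks like this:

```python
def to_chroma_records(source: str, chunks: list[str]) -> dict:
    """Shape text chunks into the parallel lists that ChromaDB's
    collection.add() expects: documents, metadatas, and unique ids."""
    return {
        "documents": chunks,
        "metadatas": [{"source": source, "chunk_index": i} for i in range(len(chunks))],
        "ids": [f"{source}-{i}" for i in range(len(chunks))],
    }
```

The resulting dict can be passed straight through as `collection.add(**records)`.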
To customize query enhancement:

- Override the `_augment_user_query()` method and implement your query improvement logic
- Return the enhanced query string
- `BaseRAG`: Abstract base class
- `ChromaRAG`: ChromaDB-specific base class
  - Common functionality for all RAG systems
- `PaperRAG`: Academic paper RAG implementation
  - OpenAI integration for query enhancement and answer generation
  - Paper-specific metadata handling
- Paper processing utilities:
  - PDF text extraction with PyPDF2
  - Intelligent chunking with overlap
  - Caching system for performance
  - File change detection
- `PaperRAGApp`: Main application class
  - Interactive command-line interface
  - Error handling and user experience
- Caching: Paper chunks are cached to avoid reprocessing
- Incremental Updates: Only new/modified papers are processed
- Efficient Storage: ChromaDB handles vector storage and retrieval
- Smart Queries: OpenAI-enhanced search queries for better results
1. OpenAI API Key Missing:
   - Ensure the `.env` file exists with `OPENAI_API_KEY=your_key`

2. No Papers Found:
   - Check that PDF files are in the `assets/papers/` directory

3. ChromaDB Errors:
   - Delete the `src/utils/chroma/` directory to reset the database
   - Ensure write permissions in the directory

4. Import Errors:
   - Ensure all dependencies are installed: `pip install -r requirements.txt`
   - Check that the Python path includes the `src/` directory
Enable debug logging to see detailed processing:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
- Follow the existing class structure
- Add type hints to all functions
- Include docstrings for all classes and methods
- Test with different paper types and queries
- Update README for new features
This project is open source. Feel free to use and modify for your research needs.