A clean, modular RAG (Retrieval-Augmented Generation) system for academic papers built with ChromaDB and OpenAI.
This system is designed to work with papers from the LLM Agent Papers repository - a comprehensive collection of must-read papers on Large Language Model Agents. The repository contains papers covering:
- Agent Systems: Personality, Memory, Planning, Tool Use, RL Training
- Multi-Agent Systems: Collaborative, Adversarial, and Casual Conversations
- Applications: Real-world agent implementations
- Frameworks: Development frameworks and tools
- Resources: Benchmarks, tool lists, and evaluation metrics
Visit LLM Agent Papers to explore the full collection of papers and resources.
With 200+ papers in the collection, vectorization can be computationally intensive even on high-end hardware. For optimal performance, this system was developed using:
```bash
# DigitalOcean H100 GPU Droplet for vectorization
# (size and region slugs inferred from the droplet name)
doctl compute droplet create \
  --image 191457505 \
  --size gpu-h100x1-80gb \
  --region nyc2 \
  --vpc-uuid a59c66ae-d131-49e6-a1b4-88535df82c14 \
  ml-ai-ubuntu-gpu-h100x1-80gb-nyc2
```
Why H100? The NVIDIA H100 GPU provides:
- 80GB HBM3 Memory for large-scale vector operations
- 4th Gen Tensor Cores for accelerated AI workloads
- Multi-Instance GPU (MIG) for efficient resource utilization
- PCIe 5.0 for high-speed data transfer
- Cloud Processing: Vectorize papers on H100 GPU droplet
- Local Development: SSH copy vectorized data to local machine
- Fast Queries: Run RAG queries locally with pre-computed embeddings
This approach ensures rapid development cycles while leveraging cloud GPU power for the heavy computational work.
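The cloud-to-local step above can be sketched with `scp` (the host and remote path here are placeholders, not values from this project; adjust them to your droplet and checkout location):

```shell
# Copy the pre-computed ChromaDB store from the H100 droplet to the local checkout.
# <droplet-ip> and the remote path are placeholders -- substitute your own values.
scp -r root@<droplet-ip>:~/PaperRAG/src/utils/chroma/ ./src/utils/chroma/
```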
- Modular Design: Clean base classes for easy extension and customization
- Paper Processing: Automatic PDF text extraction and chunking with caching
- Smart Query Enhancement: Uses OpenAI to improve search queries
- Academic Focus: Optimized for research paper Q&A with proper citations
- Interactive Interface: User-friendly command-line interface
```
PaperRAG/
├── src/
│   ├── rag/
│   │   ├── __init__.py      # RAG module exports
│   │   ├── base.py          # Base RAG classes
│   │   └── PaperRAG.py      # Paper-specific RAG implementation
│   └── utils/
│       ├── __init__.py      # Utils module exports
│       ├── paper_chunks.py  # PDF processing and chunking
│       └── chroma/          # ChromaDB storage
├── assets/
│   └── papers/              # Place your PDF papers here
├── main.py                  # Interactive application
└── requirements.txt         # Dependencies
```
Abstract base class for all RAG systems:

- `gen(user_query: str) -> str`: Generate a response to the user query
- `setup() -> None`: Initialize the RAG system
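The interface can be sketched as an abstract base class; the method names come from the list above, while the docstrings are illustrative:

```python
from abc import ABC, abstractmethod


class BaseRAG(ABC):
    """Abstract base class for all RAG systems (sketch of the documented interface)."""

    @abstractmethod
    def setup(self) -> None:
        """Initialize the RAG system (load data, create collections, etc.)."""

    @abstractmethod
    def gen(self, user_query: str) -> str:
        """Generate a response to the user's query."""
```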
Base class for ChromaDB-based RAG systems:

- Handles collection management
- Provides the query pipeline
- Abstract methods for customization:
  - `_augment_user_query()`: Query enhancement
  - `_generate_answer()`: Answer generation
  - `_load_data()`: Data loading
Concrete implementation for academic papers:
- Paper-specific query enhancement
- Academic citation formatting
- Automatic paper chunking and loading
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set up the environment:

   ```bash
   cp .env.example .env  # Add your OpenAI API key to .env
   ```

3. Add papers: place PDF files in `assets/papers/`

4. Run the interactive system:

   ```bash
   python main.py
   ```
```python
from src.rag import PaperRAG
import chromadb

# Initialize
chroma_client = chromadb.PersistentClient(path="chroma")
rag_system = PaperRAG(chroma_client)

# Setup (loads papers and creates collection)
rag_system.setup()

# Ask questions
answer = rag_system.gen("What are the best practices for AI agents?")
print(answer)
```
```python
from src.rag.base import ChromaRAG

class CustomRAG(ChromaRAG):
    def _augment_user_query(self, user_query: str) -> str:
        # Custom query enhancement
        return f"enhanced: {user_query}"

    def _generate_answer(self, user_query: str, rag_results) -> str:
        # Custom answer generation
        return "Custom answer based on RAG results"

    def _load_data(self) -> None:
        # Custom data loading
        pass
```
- `OPENAI_API_KEY`: Your OpenAI API key
- `chunk_size`: Default 800 characters per chunk
- `chunk_overlap`: Default 200 characters of overlap
- Cache is automatically managed based on file modifications
- Persistent storage in `src/utils/chroma/`
- Collection name: `paper_collection`
- Automatic collection creation and management
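The `chunk_size` and `chunk_overlap` defaults can be illustrated with a minimal sliding-window chunker (a sketch, not the actual implementation in `paper_chunks.py`):

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters, so sentences spanning a boundary are not lost."""
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```

With the defaults, a 1,000-character paper yields two chunks whose last/first 200 characters coincide.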
To add a new RAG system:

- Extend `ChromaRAG` or `BaseRAG`
- Implement the required abstract methods
- Add it to `src/rag/__init__.py`
To add a new data source:

- Override the `_load_data()` method and implement your data loading logic
- Ensure proper document formatting for ChromaDB
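ChromaDB's `collection.add()` takes parallel lists of documents, metadatas, and unique ids. A hypothetical helper (`to_chroma_records` is not part of this codebase) might shape loaded chunks like this:

```python
def to_chroma_records(source: str, chunks: list[str]) -> dict:
    """Shape text chunks into the parallel lists that ChromaDB's
    collection.add() expects: documents, metadatas, and unique ids."""
    return {
        "documents": chunks,
        "metadatas": [{"source": source, "chunk_index": i} for i in range(len(chunks))],
        "ids": [f"{source}-{i}" for i in range(len(chunks))],
    }
```

The resulting dict can be passed straight through as `collection.add(**records)`.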
To customize query enhancement:

- Override the `_augment_user_query()` method and implement your query improvement logic
- Return the enhanced query string
- `BaseRAG`: Abstract base class
- `ChromaRAG`: ChromaDB-specific base class
  - Common functionality for all RAG systems
- `PaperRAG`: Academic paper RAG implementation
  - OpenAI integration for query enhancement and answer generation
  - Paper-specific metadata handling
- Paper processing utilities:
  - PDF text extraction with PyPDF2
  - Intelligent chunking with overlap
  - Caching system for performance
  - File change detection
- `PaperRAGApp`: Main application class
  - Interactive command-line interface
  - Error handling and user experience
- Caching: Paper chunks are cached to avoid reprocessing
- Incremental Updates: Only new/modified papers are processed
- Efficient Storage: ChromaDB handles vector storage and retrieval
- Smart Queries: OpenAI-enhanced search queries for better results
1. OpenAI API Key Missing:
   - Ensure the `.env` file exists with `OPENAI_API_KEY=your_key`

2. No Papers Found:
   - Check that PDF files are in the `assets/papers/` directory

3. ChromaDB Errors:
   - Delete the `src/utils/chroma/` directory to reset the database
   - Ensure write permissions in the directory

4. Import Errors:
   - Ensure all dependencies are installed: `pip install -r requirements.txt`
   - Check that the Python path includes the `src/` directory
Enable debug logging to see detailed processing:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
- Follow the existing class structure
- Add type hints to all functions
- Include docstrings for all classes and methods
- Test with different paper types and queries
- Update README for new features
This project is open source. Feel free to use and modify for your research needs.