Keep your dependency documentation lean, current, and AI-ready. This toolkit helps you extract clean, focused documentation from any framework or library website, perfect for both human readers and LLM consumption.
In today's fast-paced development environment, you need:
- Quick access to dependency documentation without the bloat
- Documentation in a format that's ready for RAG systems and LLMs
- Focused content without navigation elements, ads, or irrelevant sections
- A fast, efficient way to keep documentation up to date
- Clean Markdown output for easy integration with documentation tools
Traditional web scraping often gives you everything, including navigation menus, footers, ads, and other noise. This toolkit is specifically designed to extract only what matters: the actual documentation content.
- **Clean Documentation Output**
  - Markdown format for content-focused documentation
  - JSON format for structured menu data
  - Perfect for documentation sites, wikis, and knowledge bases
  - Ideal format for LLM training and RAG systems
- **Smart Content Extraction**
  - Automatically identifies main content areas
  - Strips away navigation, ads, and irrelevant sections
  - Preserves code blocks and technical formatting
  - Maintains proper Markdown structure
- **Flexible Crawling Strategies**
  - Single page for quick reference docs
  - Multi-page for comprehensive library documentation
  - Sitemap-based for complete framework coverage
  - Menu-based for structured documentation hierarchies
- **LLM and RAG Ready**
  - Clean Markdown text suitable for embeddings
  - Preserved code blocks for technical accuracy
  - Structured menu data in JSON format
  - Consistent formatting for reliable processing
A comprehensive Python toolkit for scraping documentation websites using different crawling strategies. Built using the Crawl4AI library for efficient web crawling.
- Multiple crawling strategies
- Automatic nested menu expansion
- Handles dynamic content and lazy-loaded elements
- Configurable selectors
- Clean Markdown output for documentation
- JSON output for menu structure
- Colorful terminal feedback
- Smart URL processing
- Asynchronous execution
- **Single URL Crawler** (`single_url_crawler.py`)
  - Extracts content from a single documentation page
  - Outputs clean Markdown format
  - Perfect for targeted content extraction
  - Configurable content selectors
- **Multi URL Crawler** (`multi_url_crawler.py`)
  - Processes multiple URLs in parallel
  - Generates individual Markdown files per page
  - Efficient batch processing
  - Shared browser session for better performance
- **Sitemap Crawler** (`sitemap_crawler.py`)
  - Automatically discovers and crawls sitemap.xml
  - Creates Markdown files for each page
  - Supports recursive sitemap parsing
  - Handles gzipped sitemaps
- **Menu Crawler** (`menu_crawler.py`)
  - Extracts all menu links from documentation
  - Outputs structured JSON format
  - Handles nested and dynamic menus
  - Smart menu expansion
Requirements:
- Python 3.7+
- Virtual environment (recommended)
Installation:

- Clone the repository:

  ```bash
  git clone https://github.com/felores/crawl4ai_docs_scraper.git
  cd crawl4ai_docs_scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Single URL Crawler usage:

```bash
python single_url_crawler.py https://docs.example.com/page
```
Arguments:
- URL: Target documentation URL (required, first argument)
Note: Use quotes only if your URL contains special characters or spaces.
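For example, a URL containing query parameters would need quoting (the URL below is purely illustrative):

```bash
# Quotes keep the shell from interpreting ? and & in the URL
python single_url_crawler.py "https://docs.example.com/api?version=2&lang=en"
```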
Output format (Markdown):
````markdown
# Page Title

## Section 1

Content with preserved formatting, including:
- Lists
- Links
- Tables

### Code Examples

```python
def example():
    return "Code blocks are preserved"
```
````
Multi URL Crawler usage:

```bash
# Using a text file with URLs
python multi_url_crawler.py urls.txt

# Using JSON output from the menu crawler
python multi_url_crawler.py menu_links.json

# Using a custom output prefix
python multi_url_crawler.py menu_links.json --output-prefix custom_name
```
Arguments:
- URLs file: Path to a file containing URLs (required, first argument)
  - Can be a .txt file with one URL per line
  - Or a .json file from the menu crawler output
- `--output-prefix`: Custom prefix for the output markdown file (optional)
Note: Use quotes only if your file path contains spaces.
Output filename format:
- Without `--output-prefix`: `domain_path_docs_content_timestamp.md` (e.g., `cloudflare_agents_docs_content_20240323_223656.md`)
- With `--output-prefix`: `custom_prefix_docs_content_timestamp.md` (e.g., `custom_name_docs_content_20240323_223656.md`)
The crawler accepts two types of input files:
- Text file with one URL per line:

  ```
  https://docs.example.com/page1
  https://docs.example.com/page2
  https://docs.example.com/page3
  ```

- JSON file (compatible with menu crawler output):

  ```json
  {
    "menu_links": [
      "https://docs.example.com/page1",
      "https://docs.example.com/page2"
    ]
  }
  ```
Sitemap Crawler usage:

```bash
python sitemap_crawler.py https://docs.example.com/sitemap.xml
```

Options:
- `--max-depth`: Maximum sitemap recursion depth (optional)
- `--patterns`: URL patterns to include (optional)
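Both options can be combined; the values below are illustrative, and the exact pattern-matching syntax is an assumption, so check the script's help output:

```bash
# Follow nested sitemaps at most two levels deep and keep only API pages
python sitemap_crawler.py https://docs.example.com/sitemap.xml --max-depth 2 --patterns "/api/"
```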
Menu Crawler usage:

```bash
python menu_crawler.py https://docs.example.com
```

Options:
- `--selectors`: Custom menu selectors (optional)
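A hypothetical invocation with a custom selector; the CSS selector value is an assumption and depends entirely on the target site's markup:

```bash
# Target the site's sidebar navigation instead of the default selectors
python menu_crawler.py https://docs.example.com --selectors "nav.sidebar a"
```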
The menu crawler now saves its output to the `input_files` directory, making it ready for immediate use with the multi-url crawler. The output JSON has this format:
```json
{
  "start_url": "https://docs.example.com/",
  "total_links_found": 42,
  "menu_links": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
  ]
}
```
After running the menu crawler, you'll get a command to run the multi-url crawler with the generated file.
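A typical two-step workflow might look like this (the JSON filename is illustrative; use the path the menu crawler prints):

```bash
# 1. Extract the documentation menu; the JSON file lands in input_files/
python menu_crawler.py https://docs.example.com

# 2. Crawl every discovered page with the multi-url crawler
python multi_url_crawler.py input_files/menu_links.json
```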
Project structure:

```
crawl4ai_docs_scraper/
├── input_files/            # Input files for URL processing
│   ├── urls.txt            # Text file with URLs
│   └── menu_links.json     # JSON output from menu crawler
├── scraped_docs/           # Output directory for markdown files
│   └── docs_timestamp.md   # Generated documentation
├── multi_url_crawler.py
├── menu_crawler.py
└── requirements.txt
```
All crawlers include comprehensive error handling with colored terminal output:
- Green: Success messages
- Cyan: Processing status
- Yellow: Warnings
- Red: Error messages
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This project uses Crawl4AI for web data extraction.