kimdonghwi94/web-analyzer-mcp

🔍 Web Analyzer MCP

A powerful MCP (Model Context Protocol) server for intelligent web content analysis and summarization. Built with FastMCP, this server provides smart web scraping, content extraction, and AI-powered question-answering capabilities.

✨ Features

🎯 Core Tools

  1. url_to_markdown - Extract and summarize key web page content

    • Analyzes content importance using custom algorithms
    • Removes ads, navigation, and irrelevant content
    • Keeps only essential information (tables, images, key text)
    • Outputs structured markdown optimized for analysis
  2. web_content_qna - AI-powered Q&A about web content

    • Extracts relevant content sections from web pages
    • Uses intelligent chunking and relevance matching
    • Answers questions using OpenAI GPT models

🚀 Key Features

  • Smart Content Ranking: Algorithm-based content importance scoring
  • Essential Content Only: Removes clutter, keeps what matters
  • Multi-IDE Support: Works with Claude Desktop, Cursor, VS Code, PyCharm
  • Flexible Models: Choose from GPT-3.5, GPT-4, GPT-4 Turbo, or GPT-5

📦 Installation

Prerequisites

  • uv (Python package manager)
  • Chrome/Chromium browser (for Selenium)
  • OpenAI API key (for Q&A functionality)
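Before the first run, a quick check of the prerequisites listed above can save a confusing startup failure. The following is a minimal sketch and not part of this project: `check_prerequisites` is a hypothetical helper, and the Chrome executable names are assumptions that vary by platform (on macOS, for example, Chrome is usually an app bundle rather than a command on `PATH`).

```python
import os
import shutil

# Assumed executable names; adjust these for your platform.
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def check_prerequisites(env=None, which=shutil.which):
    """Return a dict mapping each prerequisite to True (found) or False."""
    env = os.environ if env is None else env
    return {
        # uv must be on PATH to launch the server.
        "uv": which("uv") is not None,
        # Any one of the candidate browser commands is enough.
        "chrome": any(which(name) is not None for name in CHROME_CANDIDATES),
        # Only needed for the Q&A tool.
        "OPENAI_API_KEY": bool(env.get("OPENAI_API_KEY")),
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.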

🚀 Quick Start with uv (Recommended)

# Clone the repository
git clone https://github.com/kimdonghwi94/web-analyzer-mcp.git
cd web-analyzer-mcp

# Run directly with uv (auto-installs dependencies)
uv run mcp-webanalyzer

Installing via Smithery

To install web-analyzer-mcp for Claude Desktop automatically via Smithery:

npx -y @smithery/cli install @kimdonghwi94/web-analyzer-mcp --client claude

IDE/Editor Integration

Install Claude Desktop

Add the following to your claude_desktop_config.json file. See the Claude Desktop MCP documentation for more details.

{
  "mcpServers": {
    "web-analyzer": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/web-analyzer-mcp",
        "run", 
        "mcp-webanalyzer"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here",
        "OPENAI_MODEL": "gpt-4"
      }
    }
  }
}

Install Claude Code (VS Code Extension)

Add the server using the Claude Code CLI:

claude mcp add web-analyzer -e OPENAI_API_KEY=your_api_key_here -e OPENAI_MODEL=gpt-4 -- uv --directory /path/to/web-analyzer-mcp run mcp-webanalyzer

Install Cursor IDE

Add to your Cursor settings (File > Preferences > Settings > Extensions > MCP):

{
  "mcpServers": {
    "web-analyzer": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/web-analyzer-mcp",
        "run", 
        "mcp-webanalyzer"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here",
        "OPENAI_MODEL": "gpt-4"
      }
    }
  }
}

Install JetBrains AI Assistant

See JetBrains AI Assistant Documentation for more details.

  1. In JetBrains IDEs go to Settings → Tools → AI Assistant → Model Context Protocol (MCP)
  2. Click + Add
  3. Click on Command in the top-left corner of the dialog and select the As JSON option from the list
  4. Add this configuration and click OK:

{
  "mcpServers": {
    "web-analyzer": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/web-analyzer-mcp",
        "run", 
        "mcp-webanalyzer"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key_here",
        "OPENAI_MODEL": "gpt-4"
      }
    }
  }
}
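
The three JSON configurations above share the same `web-analyzer` entry, so it can be generated once and merged into whichever client config you use. This is a sketch under assumptions: `build_server_entry` and `merge_into_config` are hypothetical helpers, not part of this project, and reading or writing the actual config file is left out because its location depends on the client and operating system.

```python
import json


def build_server_entry(project_dir, api_key, model="gpt-4"):
    """Build the "web-analyzer" entry used in the JSON configs above."""
    return {
        "command": "uv",
        "args": ["--directory", project_dir, "run", "mcp-webanalyzer"],
        "env": {"OPENAI_API_KEY": api_key, "OPENAI_MODEL": model},
    }


def merge_into_config(config, entry, name="web-analyzer"):
    """Return a copy of config with entry added under "mcpServers"."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry  # other servers already configured are preserved
    merged["mcpServers"] = servers
    return merged


# Example: an existing config that already lists another (made-up) server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
entry = build_server_entry("/path/to/web-analyzer-mcp", "your_openai_api_key_here")
merged = merge_into_config(existing, entry)
print(json.dumps(merged, indent=2))
```

Merging into a copy rather than editing in place keeps any servers you have already configured and leaves the original dict untouched if something goes wrong.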

🎛️ Tool Descriptions

url_to_markdown

Converts web pages to clean markdown format with essential content extraction.

Parameters:

  • url (string): The web page URL to analyze

Returns: Clean markdown content with structured data preservation

web_content_qna

Answers questions about web page content using intelligent content analysis.

Parameters:

  • url (string): The web page URL to analyze
  • question (string): Question about the page content

Returns: AI-generated answer based on page content
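
For reference, MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds the request payloads for the two tools described above. It does not open a connection or perform the initialization handshake, which an MCP client normally handles for you. `tool_call_request` is a hypothetical helper and the example URL and question are placeholders.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(tool_name, arguments):
    """Build a JSON-RPC 2.0 "tools/call" request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# url_to_markdown takes a single "url" argument.
markdown_req = tool_call_request("url_to_markdown", {"url": "https://example.com"})

# web_content_qna takes "url" and "question".
qna_req = tool_call_request(
    "web_content_qna",
    {"url": "https://example.com", "question": "What is this page about?"},
)

print(json.dumps(qna_req, indent=2))
```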

🏗️ Architecture

Content Extraction Pipeline

  1. URL Validation - Ensures proper URL format
  2. HTML Fetching - Uses Selenium for dynamic content
  3. Content Parsing - BeautifulSoup for HTML processing
  4. Element Scoring - Custom algorithm ranks content importance
  5. Content Filtering - Removes duplicates and low-value content
  6. Markdown Conversion - Structured output generation
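
To make the pipeline above concrete, here is a standard-library sketch of steps 1, 4, 5 and 6. It is an illustration, not the project's implementation: the `Block` type, the scoring heuristic (text length weighted by link density) and the threshold are all assumptions, and steps 2 and 3 (Selenium fetching and BeautifulSoup parsing) are replaced by hand-written sample blocks.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Block:
    tag: str             # HTML tag the text came from, e.g. "h1", "p", "nav"
    text: str
    link_chars: int = 0  # number of characters that sit inside links


def is_valid_url(url):
    """Step 1: accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def score(block):
    """Step 4 (assumed heuristic): favor longer text with few links."""
    if block.tag in ("nav", "footer", "aside") or not block.text:
        return 0.0
    link_density = block.link_chars / len(block.text)
    bonus = 2.0 if block.tag in ("h1", "h2", "h3") else 1.0
    return bonus * min(len(block.text), 500) / 500 * (1.0 - link_density)


def filter_blocks(blocks, threshold=0.02):
    """Step 5: drop low-scoring and duplicate blocks, keeping page order."""
    seen, kept = set(), []
    for block in blocks:
        key = block.text.strip().lower()
        if score(block) >= threshold and key not in seen:
            seen.add(key)
            kept.append(block)
    return kept


def to_markdown(blocks):
    """Step 6: render headings with '#' prefixes, other blocks as paragraphs."""
    lines = []
    for block in blocks:
        if block.tag in ("h1", "h2", "h3"):
            lines.append("#" * int(block.tag[1]) + " " + block.text)
        else:
            lines.append(block.text)
    return "\n\n".join(lines)


sample = [
    Block("nav", "Home | About | Contact", link_chars=22),
    Block("h1", "Release notes"),
    Block("p", "Version 2 adds caching and fixes several parsing bugs."),
    Block("p", "Version 2 adds caching and fixes several parsing bugs."),
    Block("p", "Click here", link_chars=10),
]
print(to_markdown(filter_blocks(sample)))
```

With this sample, the navigation bar, the duplicated paragraph and the link-only paragraph are dropped, leaving the heading and one paragraph.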

Q&A Processing Pipeline

  1. Content Chunking - Intelligent text segmentation
  2. Relevance Scoring - Matches content to questions
  3. Context Selection - Picks most relevant chunks
  4. Answer Generation - OpenAI GPT integration
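
The Q&A steps above can be sketched in the same spirit. This is a deliberately simple stand-in, not the project's RAG processor: it chunks by sentence, scores relevance by word overlap with the question, and stops at building the prompt text rather than calling OpenAI. All function names and the `max_words` and `top_k` defaults are assumptions.

```python
import re


def chunk_text(text, max_words=40):
    """Step 1: split into sentences, then pack them into chunks."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate.split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(chunk, question):
    """Step 2: fraction of the question's words that appear in the chunk."""
    question_words = tokenize(question)
    if not question_words:
        return 0.0
    return len(question_words & tokenize(chunk)) / len(question_words)


def select_context(chunks, question, top_k=2):
    """Step 3: keep the top_k most relevant chunks, in original order."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: relevance(chunks[i], question),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:top_k])]


def build_prompt(context, question):
    """Step 4 input: the text that would be sent to the chat model."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


text = (
    "The server is built with FastMCP. "
    "It uses Selenium to fetch pages. "
    "Answers come from OpenAI models."
)
question = "What does it use to fetch pages?"
chunks = chunk_text(text, max_words=8)
context = select_context(chunks, question, top_k=1)
print(build_prompt(context, question))
```

Word overlap is a crude relevance signal. Embedding-based similarity would be a common alternative, at the cost of an extra dependency or API call.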

🏗️ Project Structure

web-analyzer-mcp/
├── web_analyzer_mcp/          # Main Python package
│   ├── __init__.py           # Package initialization
│   ├── server.py             # FastMCP server with tools
│   ├── web_extractor.py      # Web content extraction engine
│   └── rag_processor.py      # RAG-based Q&A processor
├── scripts/                   # Build and utility scripts
│   └── build.js              # Node.js build script
├── README.md                 # English documentation
├── README.ko.md              # Korean documentation
├── package.json              # npm configuration and scripts
├── pyproject.toml            # Python package configuration
├── .env.example              # Environment variables template
└── dist-info.json            # Build information (generated)

🛠️ Development

Modern Development with uv

# Clone repository
git clone https://github.com/kimdonghwi94/web-analyzer-mcp.git
cd web-analyzer-mcp

# Development commands
uv run mcp-webanalyzer     # Start development server
uv run python -m pytest   # Run tests
uv run ruff check .        # Lint code
uv run ruff format .       # Format code
uv sync                    # Sync dependencies

# Install development dependencies
uv add --dev pytest ruff mypy

# Create production build
npm run build

Alternative: Traditional Python Development

# Setup Python environment (if not using uv)
pip install -e ".[dev]"            # Quoted so shells like zsh do not expand the brackets

# Development commands
python -m web_analyzer_mcp.server  # Start server
python -m pytest tests/            # Run tests
python -m ruff check .             # Lint code
python -m ruff format .            # Format code
python -m mypy web_analyzer_mcp/   # Type checking

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📋 Roadmap

  • Support for more content types (PDFs, videos)
  • Multi-language content extraction
  • Custom extraction rules
  • Caching for frequently accessed content
  • Webhook support for real-time updates

⚠️ Limitations

  • Requires Chrome/Chromium for JavaScript-heavy sites
  • OpenAI API key needed for Q&A functionality
  • Rate limited to prevent abuse
  • Some sites may block automated access

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙋‍♂️ Support

  • Create an issue for bug reports or feature requests
  • Contribute to discussions in the GitHub repository
  • Check the documentation for detailed guides

🌟 Acknowledgments

  • Built with FastMCP framework
  • Inspired by HTMLRAG techniques for web content processing
  • Thanks to the MCP community for feedback and contributions

Made with ❤️ for the MCP community

About

An MCP server for web page analysis and summarization.
