Micro Web Crawler in PHP & Manticore
Yo! is a super-thin client-server crawler built on Manticore full-text search.
It is compatible with different networks and includes flexible settings, snap history, CLI tools and a UI for the Gemini Protocol.
To use the HTTP version, please check out the main branch!
- MIME-based crawler with flexible filter settings: regular expressions, selectors, external links, etc.
- Page snap history with local and remote mirrors support (including FTP protocol)
- CLI tools for index administration and crontab tasks
- Gemini Protocol UI (coming soon)
- Manticore Server
- PHP library for Manticore
- PHP library for Gemini Protocol
- PHP library for Network operations
- FTP client for snap mirrors
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
dpkg -i manticore-repo.noarch.deb
apt update
apt install git composer manticore manticore-extra memcached php-fpm php-mbstring php-memcached
The Yo search engine uses Manticore as its primary database. If your server is sensitive to power loss,
change the default binlog flush strategy to binlog_flush = 1
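To confirm which flush strategy is currently in effect, a small helper like the one below can read the directive back from the Manticore configuration. This is a minimal sketch: the function name `binlog_flush_value` is hypothetical, and it assumes `binlog_flush` is written as a plain `key = value` line in the `searchd` section of the config file.

```shell
# Print the last binlog_flush value set in a Manticore config file,
# or "unset" when the directive is missing or commented out.
# (Hypothetical helper, not part of Yo or Manticore.)
binlog_flush_value() {
    conf_file="$1"
    # Match only active (non-comment) "binlog_flush = N" lines.
    value=$(sed -n 's/^[[:space:]]*binlog_flush[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf_file" | tail -n 1)
    printf '%s\n' "${value:-unset}"
}
```

On a Debian-style install the file is usually `/etc/manticoresearch/manticore.conf` (an assumption, adjust to your setup), so the call would be `binlog_flush_value /etc/manticoresearch/manticore.conf`.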
git clone https://github.com/YGGverse/Yo.git
cd Yo
git checkout gemini
composer update
git clone https://github.com/YGGverse/Yo.git
cd Yo
git checkout gemini
git checkout -b pr-branch
git commit -m 'new fix'
git push -u origin pr-branch
cd Yo
git pull
composer update
cp example/config.json config.json
php src/cli/index/init.php
php src/cli/document/add.php URL
php src/cli/document/crawl.php
php src/cli/document/search.php '*'
Coming soon..
Create initial index
php src/cli/index/init.php [reset]
reset - optional, reset existing index
Change existing index
php src/cli/index/alter.php {operation} {column} {type}
operation - operation name, supported values: add | drop
column - target column name
type - target column type, supported values: text | integer
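Since the alter tool accepts only a fixed set of operations and types, it can help to validate the arguments before touching the index. The wrapper below is a sketch under that assumption; `alter_index` is a hypothetical name, and it must be run from the repository root so that the relative script path resolves.

```shell
# Validate arguments against the documented values, then call the alter tool.
# (Hypothetical wrapper; only the php command line comes from this README.)
alter_index() {
    if [ "$#" -ne 3 ]; then
        echo "usage: alter_index {add|drop} {column} {text|integer}" >&2
        return 2
    fi
    operation="$1"; column="$2"; type="$3"
    case "$operation" in
        add|drop) ;;
        *) echo "unsupported operation: $operation" >&2; return 2 ;;
    esac
    case "$type" in
        text|integer) ;;
        *) echo "unsupported type: $type" >&2; return 2 ;;
    esac
    php src/cli/index/alter.php "$operation" "$column" "$type"
}
```

For example, `alter_index add description text` would add a text column; the column name `description` is only an illustration.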
php src/cli/document/add.php URL
URL - new URL to add to the crawl queue
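Because the add tool takes one URL per call, seeding the queue from a list needs a small loop. The sketch below assumes a plain text file with one URL per line, where blank lines and lines starting with `#` are skipped; the function name `queue_urls` and the file format are assumptions, not part of Yo.

```shell
# Queue every URL listed in a text file, one per line.
# (Hypothetical helper; run it from the repository root.)
queue_urls() {
    list_file="$1"
    # "|| [ -n "$url" ]" also handles a last line without a trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        # </dev/null keeps the PHP process from consuming the rest of the list.
        php src/cli/document/add.php "$url" < /dev/null
    done < "$list_file"
}
```

Calling `queue_urls seeds.txt` would then add each listed address in turn (`seeds.txt` is a placeholder name).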
php src/cli/document/crawl.php
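When the crawler is scheduled from crontab, a slow run may still be active when the next one starts. One way to avoid overlapping runs is a simple lock around the command; the sketch below uses an atomic `mkdir` for that. The function name `crawl_once` and the default lock path are assumptions, and whether Yo already guards against parallel crawls is not stated here.

```shell
# Run the crawler only if no other run holds the lock.
# (Hypothetical wrapper; run it from the repository root.)
crawl_once() {
    lock_dir="${1:-/tmp/yo-crawl.lock}"
    # mkdir is atomic: it fails if another run already created the lock.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "crawl already running, skipping" >&2
        return 0
    fi
    status=0
    php src/cli/document/crawl.php || status=$?
    rmdir "$lock_dir"
    return "$status"
}
```

A crontab entry could then change into the repository directory, source the file containing this function and call `crawl_once`; the schedule and paths depend on your setup. If a run is killed before it removes the lock, the directory has to be deleted by hand.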
Optimize the index and apply new configuration rules
php src/cli/document/clean.php [limit]
limit - integer, number of documents per queue
php src/cli/document/search.php '@title "*"' [limit]
query - required
limit - optional search results limit
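The query contains double quotes and characters the shell would otherwise expand, so it has to reach the tool as a single argument. The sketch below builds a title-only query in the same `@title "..."` form as the example above and passes the limit only when one is given. The function name `search_title` is hypothetical, and the sketch assumes the term itself contains no double quotes.

```shell
# Search the title field for a phrase, with an optional result limit.
# (Hypothetical helper; run it from the repository root.)
search_title() {
    term="$1"
    limit="${2:-}"
    # Build the query as one string so it is passed as a single argument.
    query=$(printf '@title "%s"' "$term")
    if [ -n "$limit" ]; then
        php src/cli/document/search.php "$query" "$limit"
    else
        php src/cli/document/search.php "$query"
    fi
}
```

For example, `search_title 'gemini protocol' 5` runs the search tool with the query `@title "gemini protocol"` and a limit of 5.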
SQL text dumps can be useful for public index distribution, but require more computing resources.
Better for infrastructure administration and includes original data binaries.
Coming soon..