Tweak README

Ivan Kozik 2015-07-18 09:54:07 +00:00
parent 266cf34a23
commit 8e47415e83

@@ -2,13 +2,20 @@ grab-site
===
grab-site is an easy preconfigured web crawler designed for backing up websites. Give
grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
grab-site a URL and it will recursively crawl the site and write
[WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling. The wpull options are
preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling.
The wpull options are preconfigured based on Archive Team's experience with
[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
crawl is already running. This allows you to skip the crawling of junk URLs that would
grab-site gives you
* a dashboard that displays all of your crawls, showing which URLs are being
grabbed, how many URLs are left in the queue, and more.
* the ability to add ignore patterns when the crawl is already running.
This allows you to skip the crawling of junk URLs that would
otherwise prevent your crawl from ever finishing. See below.
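
As a rough illustration of what an ignore pattern does, the sketch below drops queued URLs that match any of a set of regular expressions. It is not grab-site's or wpull's actual code: the function name `filter_queue`, the sample URLs, and the sample pattern are all hypothetical, and it assumes patterns are matched with a regex search anywhere in the URL.

```python
import re


def filter_queue(urls, ignore_patterns):
    """Return the URLs that match none of the ignore patterns.

    Hypothetical helper for illustration only; assumes each pattern is a
    regular expression searched anywhere in the URL.
    """
    compiled = [re.compile(p) for p in ignore_patterns]
    return [u for u in urls if not any(c.search(u) for c in compiled)]


# Hypothetical queue containing an endless calendar "trap".
queue = [
    "http://example.com/about",
    "http://example.com/calendar?month=2015-08",
    "http://example.com/calendar?month=2015-09",
]

# A pattern added mid-crawl removes the junk URLs from consideration.
remaining = filter_queue(queue, [r"/calendar\?month="])
print(remaining)
```

Because the patterns are applied to whatever is still queued, adding one partway through a crawl can stop a URL trap without restarting the crawl.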
@@ -84,7 +91,8 @@ These environmental variables control what the server listens on:
* `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_WS_PORT` (default 29001)
`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.
`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`,
or else you will have to add `?host=IP:PORT` to your dashboard URL.
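
The port rule above can be sketched as a small helper that decides whether the `?host=IP:PORT` suffix is needed. This is only an illustration: the function name `dashboard_url`, the `http://HOST:PORT/` URL shape, and the example port numbers other than 29001 are assumptions, not something taken from grab-site itself.

```python
def dashboard_url(host, http_port, ws_port):
    """Build a dashboard URL, adding ?host= only when it is required.

    Hypothetical helper: assumes the dashboard is served at
    http://HOST:HTTP_PORT/ and that the WebSocket server is on the same host.
    """
    base = f"http://{host}:{http_port}/"
    if ws_port == http_port + 1:
        # WS port is exactly one above the HTTP port: no suffix needed.
        return base
    # Otherwise the dashboard must be told where the WebSocket server is.
    return f"{base}?host={host}:{ws_port}"


# Assumed example values: HTTP on 29000 with WS on the default 29001.
print(dashboard_url("127.0.0.1", 29000, 29001))
# Assumed example values: WS moved to an unrelated port.
print(dashboard_url("127.0.0.1", 29000, 30000))
```

Keeping the two ports adjacent avoids having to hand out a longer dashboard URL.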
These environmental variables control which server each `grab-site` process connects to: