Tweak README
parent 266cf34a23
commit 8e47415e83
README.md (20 lines changed)
@@ -2,13 +2,20 @@ grab-site
 ===
 
 grab-site is an easy preconfigured web crawler designed for backing up websites. Give
-grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
+grab-site a URL and it will recursively crawl the site and write
+[WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
 
-grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling. The wpull options are
-preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
+grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling.
+The wpull options are preconfigured based on Archive Team's experience with
+[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
 
-grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
-crawl is already running. This allows you to skip the crawling of junk URLs that would
+grab-site gives you
+
+* a dashboard that displays all of your crawls, showing which URLs are being
+grabbed, how many URLs are left in the queue, and more.
+
+* the ability to add ignore patterns when the crawl is already running.
+This allows you to skip the crawling of junk URLs that would
 otherwise prevent your crawl from ever finishing. See below.
 
 
@@ -84,7 +91,8 @@ These environmental variables control what the server listens on:
 * `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
 * `GRAB_SITE_WS_PORT` (default 29001)
 
-`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.
+`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`,
+or else you will have to add `?host=IP:PORT` to your dashboard URL.
 
 These environmental variables control which server each `grab-site` process connects to:
 
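
The port rule in the second hunk can be illustrated with a small sketch. It is not part of grab-site: `dashboard_url` is a hypothetical helper, and the `29000` default for `GRAB_SITE_HTTP_PORT` is an assumption inferred from the documented `GRAB_SITE_WS_PORT` default of 29001 and the "1 port higher" rule. It builds a dashboard URL and appends `?host=IP:PORT` only when the WebSocket port is not exactly one above the HTTP port.

```python
import os


def dashboard_url(host, env=os.environ):
    """Hypothetical helper: build a dashboard URL from the port variables.

    The 29000 HTTP default is an assumption; only the 29001 WebSocket
    default appears in the README text shown in the diff.
    """
    http_port = int(env.get("GRAB_SITE_HTTP_PORT", "29000"))
    ws_port = int(env.get("GRAB_SITE_WS_PORT", "29001"))
    url = f"http://{host}:{http_port}/"
    # Per the README, the extra query parameter is only needed when the
    # WebSocket port is not exactly HTTP port + 1.
    if ws_port != http_port + 1:
        url += f"?host={host}:{ws_port}"
    return url
```

Passing a plain dict as `env` makes it easy to try different port combinations without touching the real environment.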