Tweak README
parent 266cf34a23
commit 8e47415e83
20
README.md
@@ -2,13 +2,20 @@ grab-site
 ===
 
 grab-site is an easy preconfigured web crawler designed for backing up websites. Give
-grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
+grab-site a URL and it will recursively crawl the site and write
+[WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
 
-grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling. The wpull options are
-preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
+grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling.
+The wpull options are preconfigured based on Archive Team's experience with
+[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
 
-grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
-crawl is already running. This allows you to skip the crawling of junk URLs that would
+grab-site gives you
+
+* a dashboard that displays all of your crawls, showing which URLs are being
+grabbed, how many URLs are left in the queue, and more.
+
+* the ability to add ignore patterns when the crawl is already running.
+This allows you to skip the crawling of junk URLs that would
 otherwise prevent your crawl from ever finishing. See below.
 
 
@@ -84,7 +91,8 @@ These environmental variables control what the server listens on:
 * `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
 * `GRAB_SITE_WS_PORT` (default 29001)
 
-`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.
+`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`,
+or else you will have to add `?host=IP:PORT` to your dashboard URL.
 
 These environmental variables control which server each `grab-site` process connects to:
 
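The port rule in the second hunk can be illustrated with a small sketch. This is not code from grab-site: `dashboard_url` is a hypothetical helper, the `http://` scheme is assumed, and the reading that `IP:PORT` in `?host=` is the WebSocket address is an assumption drawn from the README wording. The HTTP port 29000 in the first call is also assumed, chosen only because it sits one below the WS default of 29001 shown above.

```python
def dashboard_url(host: str, http_port: int, ws_port: int) -> str:
    """Build a dashboard URL for a server reachable at `host` (hypothetical helper)."""
    base = f"http://{host}:{http_port}/"
    # Per the README text: when the WS port is exactly one above the HTTP port,
    # no extra query parameter is needed.
    if ws_port == http_port + 1:
        return base
    # Otherwise the README says to append ?host=IP:PORT; here we assume that
    # IP:PORT means the WebSocket listener's address.
    return f"{base}?host={host}:{ws_port}"


# Adjacent ports: plain URL.
print(dashboard_url("127.0.0.1", 29000, 29001))
# Non-adjacent ports: the ?host= parameter is appended.
print(dashboard_url("127.0.0.1", 8080, 9000))
```

Keeping the two ports adjacent is the simpler setup, since the dashboard URL then needs no query string at all.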