From 8e47415e839b96f0517f085030fdfc96782e3bb0 Mon Sep 17 00:00:00 2001
From: Ivan Kozik <ivan@ludios.org>
Date: Sat, 18 Jul 2015 09:54:07 +0000
Subject: [PATCH] Tweak README

---
 README.md | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 397e71c..8c59db0 100644
--- a/README.md
+++ b/README.md
@@ -2,14 +2,21 @@ grab-site
 ===
 
 grab-site is an easy preconfigured web crawler designed for backing up websites. Give
-grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
+grab-site a URL and it will recursively crawl the site and write
+[WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).
 
-grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling. The wpull options are
-preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
+grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling.
+The wpull options are preconfigured based on Archive Team's experience with
+[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).
 
-grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
-crawl is already running. This allows you to skip the crawling of junk URLs that would
-otherwise prevent your crawl from ever finishing. See below.
+grab-site gives you
+
+* a dashboard that displays all of your crawls, showing which URLs are being
+  grabbed, how many URLs are left in the queue, and more.
+
+* the ability to add ignore patterns when the crawl is already running.
+  This allows you to skip the crawling of junk URLs that would
+  otherwise prevent your crawl from ever finishing. See below.
 
 
 Installation
@@ -84,7 +91,8 @@ These environmental variables control what the server listens on:
 * `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
 * `GRAB_SITE_WS_PORT` (default 29001)
 
-`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.
+`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`,
+or else you will have to add `?host=IP:PORT` to your dashboard URL.
 
 These environmental variables control which server each `grab-site` process connects to:
 