grab-site
===

grab-site is an easy, preconfigured web crawler designed for backing up websites.  Give
grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).

grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling.  The wpull options are
preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).

grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
crawl is already running.  This allows you to skip the crawling of junk URLs that would
otherwise prevent your crawl from ever finishing.  See "Changing ignores during the crawl" below.


Installation
---

On Ubuntu 14.04.1 or newer:

```
sudo apt-get install --no-install-recommends git build-essential python3-dev python3-pip
pip3 install --user wpull manhole lmdb autobahn trollius
git clone https://github.com/ludios/grab-site
cd grab-site
```


Usage
---

```
./grab-site URL
./grab-site URL --igsets=blogs,forums
./grab-site URL --igsets=blogs,forums --no-offsite-links
```

Note: `URL` must come before the options.

Note: `--igsets=` means "ignore sets" and must have the `=`.

`forums` and `blogs` are some frequently-used ignore sets.
See [the full list of available ignore sets](https://github.com/ArchiveTeam/ArchiveBot/tree/master/db/ignore_patterns).

Just as with ArchiveBot, the [global](https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/global.json)
ignore set is implied and enabled.

grab-site always grabs page requisites (e.g. inline images and stylesheets), even if
they are on other domains.  By default, grab-site also grabs linked pages to a depth
of 1 on other domains.  To turn off this behavior, use `--no-offsite-links`.

Using `--no-offsite-links` may prevent useful images, video, audio, downloads, etc.
from being grabbed, because these are often hosted on a CDN or subdomain and are
reached only through offsite links, so the recursive crawl would not otherwise include them.


Changing ignores during the crawl
---

While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
changes will be applied as soon as the next URL is grabbed.

`DIR/igsets` is a comma-separated list of ignore sets to use.

`DIR/ignores` is a newline-separated list of [Python 3 regular expressions](http://pythex.org/)
to use in addition to the ignore sets.
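
To illustrate the two file formats, here is a minimal sketch of how their contents could be parsed and applied to a URL. It is not grab-site's actual implementation: the function names `parse_igsets`, `parse_ignores` and `is_ignored`, the example patterns, and the choice of `re.search` (matching anywhere in the URL) are all assumptions made for illustration.

```python
import re

def parse_igsets(text):
    """Split the comma-separated contents of DIR/igsets into set names."""
    return [name.strip() for name in text.split(",") if name.strip()]

def parse_ignores(text):
    """Compile one regular expression per non-empty line of DIR/ignores."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]

def is_ignored(url, patterns):
    """Return True if any pattern matches somewhere in the URL (assumed semantics)."""
    return any(p.search(url) for p in patterns)

# Hypothetical file contents, one pattern per line.
example_ignores = r"""\?replytocom=\d+
/calendar/\d{4}/
"""

patterns = parse_ignores(example_ignores)
print(parse_igsets("blogs,forums"))
print(is_ignored("http://example.com/post?replytocom=42", patterns))
print(is_ignored("http://example.com/post", patterns))
```

Testing a pattern this way (or on pythex) before adding it to `DIR/ignores` helps avoid accidentally ignoring URLs you want to keep.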

You can `touch DIR/igoff` to stop `IGNOR` message spew, and `rm DIR/igoff`
to turn it back on again.


Monitoring all of your crawls with the dashboard
---

Start the dashboard with:

`./server.py`

and point your browser to http://127.0.0.1:29000/

These environment variables control what the server listens on:

*	`GRAB_SITE_HTTP_INTERFACE` (default 0.0.0.0)
*	`GRAB_SITE_HTTP_PORT` (default 29000)
*	`GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
*	`GRAB_SITE_WS_PORT` (default 29001)

`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.
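
The defaults and the port relationship above can be sketched as follows. This is an illustrative reading of the documented behavior, not the code in `server.py`; the function name `server_settings` and the `needs_host_param` key are hypothetical.

```python
import os

def server_settings(environ=os.environ):
    """Resolve the dashboard's listen settings, using the documented defaults."""
    http_port = int(environ.get("GRAB_SITE_HTTP_PORT", 29000))
    ws_port = int(environ.get("GRAB_SITE_WS_PORT", 29001))
    return {
        "http_interface": environ.get("GRAB_SITE_HTTP_INTERFACE", "0.0.0.0"),
        "http_port": http_port,
        "ws_interface": environ.get("GRAB_SITE_WS_INTERFACE", "0.0.0.0"),
        "ws_port": ws_port,
        # If the WebSocket port is not HTTP port + 1, the dashboard URL
        # needs an explicit ?host=IP:PORT parameter.
        "needs_host_param": ws_port != http_port + 1,
    }

# Passing a plain dict instead of os.environ makes the sketch easy to try out.
print(server_settings({}))
print(server_settings({"GRAB_SITE_HTTP_PORT": "8000"}))
```

In the second call only the HTTP port is changed, so the WebSocket port stays at 29001 and `needs_host_param` becomes true; setting `GRAB_SITE_WS_PORT` to 8001 as well would avoid that.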

These environment variables control which server each `grab-site` process connects to:

*	`GRAB_SITE_WS_HOST` (default 127.0.0.1)
*	`GRAB_SITE_WS_PORT` (default 29001)