grab-site
===

grab-site is an easy preconfigured web crawler designed for backing up websites. Give
grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).

grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling. The wpull options are
preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).

grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
crawl is already running. This allows you to skip the crawling of junk URLs that would
otherwise prevent your crawl from ever finishing. See "Changing ignores during the crawl" below.

Installation
---

On Ubuntu 14.04.1 or newer:

```
sudo apt-get install --no-install-recommends git build-essential python3-dev python3-pip
pip3 install --user wpull manhole lmdb autobahn trollius
git clone https://github.com/ludios/grab-site
cd grab-site
```

Usage
---

```
./grab-site URL
./grab-site URL --igsets=blogs,forums
./grab-site URL --igsets=blogs,forums --no-offsite-links
```

Note: `URL` must come before the options.

Note: `--igsets=` means "ignore sets" and must have the `=`.

`forums` and `blogs` are some frequently-used ignore sets.
See [the full list of available ignore sets](https://github.com/ArchiveTeam/ArchiveBot/tree/master/db/ignore_patterns).

Just as with ArchiveBot, the [global](https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/global.json)
ignore set is implied and enabled.

grab-site always grabs page requisites (e.g. inline images and stylesheets), even if
they are on other domains. By default, grab-site also grabs linked pages to a depth
of 1 on other domains. To turn off this behavior, use `--no-offsite-links`.

Using `--no-offsite-links` may prevent all kinds of useful images, video, audio, downloads,
etc. from being grabbed, because these are often hosted on a CDN or subdomain, and
thus would not otherwise be included in the recursive crawl.

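The two notes above (URL first, `--igsets=` joined to its value with `=`) are easy to get wrong when launching crawls from a script. The following is a minimal sketch of a wrapper that assembles the argument list in that order. The function name `build_grab_site_command` and its parameters are hypothetical and are not part of grab-site; only the `./grab-site` command and the `--igsets=` and `--no-offsite-links` options come from this README.

```python
import shlex


def build_grab_site_command(url, igsets=(), offsite_links=True):
    """Build an argv list for ./grab-site (hypothetical helper, not part of grab-site)."""
    # The URL must come before any options.
    argv = ["./grab-site", url]
    if igsets:
        # --igsets must be joined to its comma-separated value with "=".
        argv.append("--igsets=" + ",".join(igsets))
    if not offsite_links:
        argv.append("--no-offsite-links")
    return argv


if __name__ == "__main__":
    cmd = build_grab_site_command(
        "http://example.com/", igsets=["blogs", "forums"], offsite_links=False
    )
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from inside the `grab-site` checkout.
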
Changing ignores during the crawl
---

While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
changes will be applied as soon as the next URL is grabbed.

`DIR/igsets` is a comma-separated list of ignore sets to use.

`DIR/ignores` is a newline-separated list of [Python 3 regular expressions](http://pythex.org/)
to use in addition to the ignore sets.

You can `touch DIR/igoff` to stop `IGNOR` message spew, and `rm DIR/igoff`
to turn it back on again.

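Before adding a pattern to `DIR/ignores` on a live crawl, it can help to try it against a few sample URLs. The sketch below parses newline-separated Python 3 regular expressions and reports which URLs they would match. It assumes a URL is ignored when any pattern matches anywhere in it (`re.search`); grab-site's exact matching rule is not stated here, so treat that as an assumption. The patterns and URLs are made-up examples, and `load_ignores` and `is_ignored` are hypothetical helper names.

```python
import re


def load_ignores(text):
    """Parse ignores-file contents: one Python 3 regular expression per non-empty line."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]


def is_ignored(url, patterns):
    # Assumption: a URL is skipped if any pattern matches anywhere in it.
    return any(pattern.search(url) for pattern in patterns)


if __name__ == "__main__":
    # Made-up example patterns, in the same newline-separated form as DIR/ignores.
    ignores_text = "\n".join([
        r"/calendar/\d{4}/",
        r"[?&]replytocom=\d+",
    ])
    patterns = load_ignores(ignores_text)
    for url in [
        "http://example.com/calendar/2015/03/",
        "http://example.com/post/1?replytocom=42",
        "http://example.com/post/1",
    ]:
        print(url, "ignored" if is_ignored(url, patterns) else "kept")
```
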
Monitoring all of your crawls with the dashboard
---

Start the dashboard with:

`./server.py`

and point your browser to http://127.0.0.1:29000/

These environment variables control what the server listens on:

* `GRAB_SITE_HTTP_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_HTTP_PORT` (default 29000)
* `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_WS_PORT` (default 29001)

`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.

These environment variables control which server each `grab-site` process connects to:

* `GRAB_SITE_WS_HOST` (default 127.0.0.1)
* `GRAB_SITE_WS_PORT` (default 29001)
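
The port rule above can be expressed as a small helper that works out which dashboard URL to open for a given environment. This is only a sketch: `dashboard_url` is a hypothetical function, and it assumes that `IP:PORT` in `?host=IP:PORT` refers to the address and port of the WebSocket listener, which this README does not spell out. The variable names and defaults are the ones listed above.

```python
import os


def dashboard_url(env=os.environ, host="127.0.0.1"):
    """Return the dashboard URL to open, given the server's environment variables."""
    http_port = int(env.get("GRAB_SITE_HTTP_PORT", "29000"))
    ws_port = int(env.get("GRAB_SITE_WS_PORT", "29001"))
    url = "http://{}:{}/".format(host, http_port)
    if ws_port != http_port + 1:
        # Assumption: ?host= takes the WebSocket listener's IP:PORT.
        url += "?host={}:{}".format(host, ws_port)
    return url


if __name__ == "__main__":
    print(dashboard_url())
    print(dashboard_url({"GRAB_SITE_HTTP_PORT": "8080", "GRAB_SITE_WS_PORT": "9000"}))
```
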