grab-site
===
grab-site is an easy preconfigured web crawler designed for backing up websites. Give
grab-site a URL and it will recursively crawl the site and write [WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).

grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling. The wpull options are
preconfigured based on Archive Team's experience with [ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).

grab-site includes ArchiveBot's killer feature of being able to add ignore patterns while the
crawl is already running. This allows you to skip the crawling of junk URLs that would
otherwise prevent your crawl from ever finishing. See "Changing ignores during the crawl" below.

Installation
---
On Ubuntu 14.04.1 or newer:

```
sudo apt-get install --no-install-recommends git build-essential python3-dev python3-pip
pip3 install --user wpull manhole lmdb autobahn trollius
git clone https://github.com/ludios/grab-site
cd grab-site
```

Usage
---
```
./grab-site URL
./grab-site URL --igsets=blogs,forums
./grab-site URL --igsets=blogs,forums --no-offsite-links
```
Note: `URL` must come before the options.

Note: `--igsets=` means "ignore sets" and must have the `=`.

`forums` and `blogs` are some frequently-used ignore sets.
See [the full list of available ignore sets](https://github.com/ArchiveTeam/ArchiveBot/tree/master/db/ignore_patterns).

Just as with ArchiveBot, the [global](https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/global.json)
ignore set is implied and enabled.

grab-site always grabs page requisites (e.g. inline images and stylesheets), even if
they are on other domains. By default, grab-site also grabs linked pages to a depth
of 1 on other domains. To turn off this behavior, use `--no-offsite-links`.

Using `--no-offsite-links` may prevent all kinds of useful images, video, audio, downloads,
etc. from being grabbed, because these are often hosted on a CDN or subdomain, and
thus would otherwise not be included in the recursive crawl.

Changing ignores during the crawl
---
While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
changes will be applied as soon as the next URL is grabbed.

`DIR/igsets` is a comma-separated list of ignore sets to use.

`DIR/ignores` is a newline-separated list of [Python 3 regular expressions](http://pythex.org/)
to use in addition to the ignore sets.
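
Before adding a pattern to `DIR/ignores`, it can help to try it against a few URLs. The snippet below is a minimal sketch, not grab-site's own code: the patterns and URLs are made up for illustration, `is_ignored` is a hypothetical helper, and it assumes patterns are applied with an unanchored `re.search`, which this README does not confirm.

```python
import re


def is_ignored(url, patterns):
    """Return True if any pattern matches somewhere in the URL.

    Assumption: patterns are applied with re.search (unanchored).
    """
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical patterns, one per line, as they would appear in DIR/ignores.
ignores_text = r"""
^https?://example\.com/calendar/
[?&]sort=
"""
patterns = [line for line in ignores_text.splitlines() if line.strip()]

# Hypothetical URLs to check before saving the patterns.
samples = [
    "http://example.com/calendar/2015/03/",
    "http://example.com/posts/1?sort=asc",
    "http://example.com/posts/1",
]

for url in samples:
    print("IGNORE" if is_ignored(url, patterns) else "KEEP  ", url)
```

If a pattern ignores more than intended, anchoring it with `^` or escaping literal dots usually narrows it.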

You can `touch DIR/igoff` to stop `IGNOR` message spew, and `rm DIR/igoff`
to turn it back on again.

Monitoring all of your crawls with the dashboard
---
Start the dashboard with:

`./server.py`

and point your browser to http://127.0.0.1:29000/

These environment variables control what the server listens on:

* `GRAB_SITE_HTTP_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_HTTP_PORT` (default 29000)
* `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_WS_PORT` (default 29001)

`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`, or else you will have to add `?host=IP:PORT` to your dashboard URL.

These environment variables control which server each `grab-site` process connects to:

* `GRAB_SITE_WS_HOST` (default 127.0.0.1)
* `GRAB_SITE_WS_PORT` (default 29001)
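
The port rule for the dashboard server can be sketched in Python as follows. This is only an illustration of the defaults and the "WS port is HTTP port + 1" rule described above, not the code in `server.py`; `server_settings` and `dashboard_url` are hypothetical helpers, and the sketch assumes that `IP:PORT` in `?host=` is the address of the WebSocket listener.

```python
import os


def server_settings(env=os.environ):
    """Read the listener settings, falling back to the documented defaults."""
    return {
        "http_interface": env.get("GRAB_SITE_HTTP_INTERFACE", "0.0.0.0"),
        "http_port": int(env.get("GRAB_SITE_HTTP_PORT", "29000")),
        "ws_interface": env.get("GRAB_SITE_WS_INTERFACE", "0.0.0.0"),
        "ws_port": int(env.get("GRAB_SITE_WS_PORT", "29001")),
    }


def dashboard_url(settings, browser_host="127.0.0.1"):
    """Build the URL to open in a browser.

    Assumption: IP:PORT in ?host= points at the WebSocket listener.
    """
    url = "http://{}:{}/".format(browser_host, settings["http_port"])
    # ?host= is only needed when the WS port is not HTTP port + 1.
    if settings["ws_port"] != settings["http_port"] + 1:
        url += "?host={}:{}".format(browser_host, settings["ws_port"])
    return url


# Defaults: no ?host= needed.
print(dashboard_url(server_settings({})))
# Non-adjacent ports: ?host= must be added.
print(dashboard_url(server_settings(
    {"GRAB_SITE_HTTP_PORT": "8000", "GRAB_SITE_WS_PORT": "9000"})))
```

Note that changing only `GRAB_SITE_HTTP_PORT` also breaks the adjacency, since `GRAB_SITE_WS_PORT` stays at its default of 29001.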