grab-site
===

grab-site is an easy-to-use, preconfigured web crawler designed for backing up websites. Give
grab-site a URL and it will recursively crawl the site and write
[WARC files](http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem).

grab-site uses [wpull](https://github.com/chfoo/wpull) for crawling.
The wpull options are preconfigured based on Archive Team's experience with
[ArchiveBot](https://github.com/ArchiveTeam/ArchiveBot).

grab-site gives you

* a dashboard with all of your crawls, showing which URLs are being
grabbed, how many URLs are left in the queue, and more.

* the ability to add ignore patterns when the crawl is already running.
This allows you to skip the crawling of junk URLs that would
otherwise prevent your crawl from ever finishing. See below.

* an extensively tested default ignore set ("[global](https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/global.json)")
as well as additional (optional) ignore sets for blogs, forums, etc.

* duplicate page detection: links are not followed on pages whose
content duplicates an already-seen page.

![dashboard screenshot](https://raw.githubusercontent.com/ludios/grab-site/master/images/dashboard.png)


Install on Ubuntu
---

On Ubuntu 14.04.1 or newer:

```
sudo apt-get install --no-install-recommends git build-essential python3-dev python3-pip
pip3 install --user git+https://github.com/ludios/grab-site
```

To avoid having to type out `~/.local/bin/` below, add this to your
`~/.bashrc` or `~/.zshrc`:

```
PATH="$PATH:$HOME/.local/bin"
```


Install on OS X
---

On OS X 10.10:

1. If Xcode is not already installed, type `gcc` in Terminal; you will be
prompted to install the command-line developer tools. Click 'Install'.

2. If Python 3 is not already installed, install Python 3.4.3 using the
installer from https://www.python.org/downloads/release/python-343/

3. `pip3 install --user git+https://github.com/ludios/grab-site`

**Important usage note**: Use `~/Library/Python/3.4/bin/` instead of
`~/.local/bin/` for all instructions below!

To avoid having to type out `~/Library/Python/3.4/bin/` below,
add this to your `~/.bash_profile` (which may not exist yet):

```
PATH="$PATH:$HOME/Library/Python/3.4/bin"
```


Usage
---

First, start the dashboard with:

```
~/.local/bin/gs-server
```

and point your browser to http://127.0.0.1:29000/

Then, start as many crawls as you want with:

```
~/.local/bin/grab-site URL
```

Do this inside tmux unless they're very short crawls.
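
For example, you might run a long crawl inside a named tmux session (the session name and URL here are illustrative):

```
tmux new-session -s my-crawl                  # start a detachable terminal session
~/.local/bin/grab-site http://example.com/    # run the crawl inside it
# detach with ctrl-b d; reattach later with: tmux attach -t my-crawl
```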
Options:

* `--igsets=blogs,forums`: use ignore sets `blogs` and `forums`.

Example: `~/.local/bin/grab-site URL --igsets=blogs,forums`

Note: `--igsets` must be followed by `=`, not a space.

* `--no-offsite-links`: do not follow links to other domains (by default, grab-site follows them to a depth of 1; see below).

* `--1`: grab just `URL` and its page requisites, without recursing.

* `--level=N`: recurse `N` levels instead of `inf` levels.

Note: `URL` must always come **before** the options.
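
For example, a crawl that combines several of these options might look like this (the URL is illustrative):

```
~/.local/bin/grab-site http://example.com/blog/ --igsets=blogs --level=3
```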
`forums` and `blogs` are some frequently-used ignore sets.
See [the full list of available ignore sets](https://github.com/ArchiveTeam/ArchiveBot/tree/master/db/ignore_patterns).

Just as with ArchiveBot, the [global](https://github.com/ArchiveTeam/ArchiveBot/blob/master/db/ignore_patterns/global.json)
ignore set is implied and enabled.

grab-site always grabs page requisites (e.g. inline images and stylesheets), even if
they are on other domains. By default, grab-site also grabs linked pages to a depth
of 1 on other domains. To turn off this behavior, use `--no-offsite-links`.

Using `--no-offsite-links` may prevent all kinds of useful images, video, audio, downloads,
etc. from being grabbed, because these are often hosted on a CDN or subdomain and
would therefore not be included in the recursive crawl.
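
If you do want to restrict the recursion to the original domain, the flag goes after the URL like any other option (the URL is illustrative):

```
~/.local/bin/grab-site http://example.com/ --no-offsite-links
```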
Changing ignores during the crawl
---

`grab-site` outputs WARCs, logs, and control files to a new subdirectory in the
directory from which you launched `grab-site`, referred to here as "DIR".
(Use `ls -lrt` to find it.)

While the crawl is running, you can edit `DIR/ignores` and `DIR/igsets`; the
changes will be applied as soon as the next URL is grabbed.

`DIR/igsets` is a comma-separated list of ignore sets to use.

`DIR/ignores` is a newline-separated list of [Python 3 regular expressions](http://pythex.org/)
to use in addition to the ignore sets.
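
For example, to switch a running crawl to the `blogs` and `forums` ignore sets and add one custom ignore pattern (the regular expression is just an illustration):

```
echo "blogs,forums" > DIR/igsets
echo 'calendar\.php\?' >> DIR/ignores
```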
You can `rm DIR/igoff` to display all URLs that are being filtered out
by the ignores, and `touch DIR/igoff` to turn this display back off.


Stopping a crawl
---

You can `touch DIR/stop` or press ctrl-c, which will do the same. You will
have to wait for the current downloads to finish.


Advanced `gs-server` options
---

These environment variables control what `gs-server` listens on:

* `GRAB_SITE_HTTP_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_HTTP_PORT` (default 29000)
* `GRAB_SITE_WS_INTERFACE` (default 0.0.0.0)
* `GRAB_SITE_WS_PORT` (default 29001)

`GRAB_SITE_WS_PORT` should be 1 port higher than `GRAB_SITE_HTTP_PORT`,
or else you will have to add `?host=IP:PORT` to your dashboard URL.
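
For example, to run the dashboard on a different pair of ports (the port numbers are illustrative):

```
GRAB_SITE_HTTP_PORT=29100 GRAB_SITE_WS_PORT=29101 ~/.local/bin/gs-server
```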
These environment variables control which server each `grab-site` process connects to:

* `GRAB_SITE_WS_HOST` (default 127.0.0.1)
* `GRAB_SITE_WS_PORT` (default 29001)
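
For example, to report a crawl's progress to a dashboard running on another machine (the IP address is illustrative):

```
GRAB_SITE_WS_HOST=192.0.2.10 GRAB_SITE_WS_PORT=29001 ~/.local/bin/grab-site URL
```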
Help
---

Bugs, discussion, and ideas are welcome in [grab-site/issues](https://github.com/ludios/grab-site/issues).

If a problem happens when running just `~/.local/bin/wpull -r URL` (no grab-site),
you may want to report it to [wpull/issues](https://github.com/chfoo/wpull/issues) instead.