README: Fix GCE name

Ivan Kozik 2016-02-18 04:13:18 +00:00
parent 08a90865a2
commit 6189c7c124

@@ -262,7 +262,7 @@ If you pay no attention to your crawls, a crawl may head down some infinite bot
grab-site does not respect `robots.txt` files, because they frequently [whitelist only approved robots](https://github.com/robots.txt), [hide embarassing news stories](https://web.archive.org/web/20140401024610/http://www.thecrimson.com/robots.txt), or block image or stylesheet resources needed for proper archival. [See also](http://www.archiveteam.org/index.php?title=Robots.txt). Because of this, very rarely you might run into a robot honeypot and receive an abuse@ complaint.
-Do not run grab-site from Google Cloud Engine; as happened to me, your entire API project will probably get nuked after a few days of crawling the web, with no recourse. Good alternatives include OVH (sold under [OVH](https://www.ovh.com/us/dedicated-servers/), [So You Start](http://www.soyoustart.com/us/essential-servers/), and [Kimsufi](http://www.kimsufi.com/us/en/index.xml)) and online.net (with [dedicated](https://www.online.net/en/dedicated-server) or [puny ARM server](https://www.scaleway.com/) offerings).
+Do not run grab-site on GCE (Google Compute Engine); as happened to me, your entire API project will probably get nuked after a few days of crawling the web, with no recourse. Good alternatives include OVH (sold under [OVH](https://www.ovh.com/us/dedicated-servers/), [So You Start](http://www.soyoustart.com/us/essential-servers/), and [Kimsufi](http://www.kimsufi.com/us/en/index.xml)) and online.net (with [dedicated](https://www.online.net/en/dedicated-server) or [puny ARM server](https://www.scaleway.com/) offerings).
### Tips for specific websites