From 6d3f0f901d0fc3dd3eb4e73eb30540520ac42db1 Mon Sep 17 00:00:00 2001
From: Ivan Kozik
Date: Fri, 11 Dec 2015 08:40:30 +0000
Subject: [PATCH] README: Fix headings

---
 README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 604e6a2..a85b0b4 100644
--- a/README.md
+++ b/README.md
@@ -232,13 +232,13 @@ Options can come before or after the URL. The defaults usually work fine.
 
-### Blogger / blogspot.com blogs
+#### Blogger / blogspot.com blogs
 
 If you want to archive X.blogspot.com from outside the US, start the crawl on http://X.blogspot.com/ncr (ncr = no country redirect) to avoid getting redirected to another TLD. Note that /ncr sets an `NCR` cookie that expires after a few weeks.
 
 Some blogspot.com blogs use "[Dynamic Views](https://support.google.com/blogger/answer/1229061?hl=en)" themes that require JavaScript and serve absolutely no HTML content. In rare cases, you can get JavaScript-free pages by appending `?m=1` (e.g. http://happinessbeyondthought.blogspot.com/?m=1). Otherwise, you can archive parts of these blogs through Google Cache instead (e.g. https://webcache.googleusercontent.com/search?q=cache:http://blog.datomic.com/) or by using http://archive.is/ instead of grab-site. If neither of these options work, try [using grab-site with phantomjs](https://github.com/ludios/grab-site/issues/55#issuecomment-162118702).
 
-### Tumblr blogs
+#### Tumblr blogs
 
 Use `--igsets=singletumblr` to avoid crawling the homepages of other tumblr blogs.
 
@@ -246,27 +246,27 @@ If you don't care about who liked or reblogged a post, add `\?from_c=` to the cr
 Some tumblr blogs appear to require JavaScript, but they are actually just hiding the page content with CSS. You are still likely to get a complete crawl. (See the links in the page source for http://X.tumblr.com/archive).
 
-### Directory listings ("Index of ...")
+#### Directory listings ("Index of ...")
 
 Use `--no-dupespotter` to avoid triggering false positives on the duplicate page detector. Without it, the crawl may miss large parts of the directory tree.
 
-### Very large websites
+#### Very large websites
 
 Use `--no-offsite-links` to stay on the main website and avoid crawling linked pages on other domains.
 
-### Websites that are likely to ban you for crawling fast
+#### Websites that are likely to ban you for crawling fast
 
 Use `--concurrency=1 --delay=500-1500`.
 
-### MediaWiki sites with English language
+#### MediaWiki sites with English language
 
 Use `--igsets=mediawiki`. Note that this ignore set ignores old page revisions.
 
-### MediaWiki sites with non-English language
+#### MediaWiki sites with non-English language
 
 You will probably have to add ignores with translated `Special:*` URLs based on [ignore_sets/mediawiki](https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/mediawiki).
 
-### Forums
+#### Forums
 
 Forums require more manual intervention with ignore patterns. `--igsets=[forums](https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/forums)` is often useful for non-SMF forums, but you will have to add other ignore patterns, including one to ignore individual-forum-post pages if there are too many posts to crawl. (Generally, crawling the thread pages is enough.)
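
---

Not part of the patch itself, but for context: the README sections touched above each recommend a grab-site flag, and those flags are passed on the command line. A sketch of the invocations, using only the flags named in the patched text; the target URLs are hypothetical placeholders:

```shell
# Hypothetical example URLs; flags are the ones named in the README sections above.

# Tumblr blog: avoid crawling the homepages of other tumblr blogs
grab-site --igsets=singletumblr http://example.tumblr.com/

# Directory listing ("Index of ..."): disable the duplicate-page detector
grab-site --no-dupespotter http://example.com/pub/

# Very large website: stay on the main domain
grab-site --no-offsite-links http://example.com/

# Site likely to ban fast crawlers: one connection, 500-1500 ms delay
grab-site --concurrency=1 --delay=500-1500 http://example.com/

# English-language MediaWiki (this ignore set skips old page revisions)
grab-site --igsets=mediawiki http://wiki.example.com/
```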