README: Fix headings

Author: Ivan Kozik
Date:   2015-12-11 08:40:30 +00:00
Parent: 140efe082a
Commit: 6d3f0f901d


@ -232,13 +232,13 @@ Options can come before or after the URL.
The defaults usually work fine.
-### Blogger / blogspot.com blogs
+#### Blogger / blogspot.com blogs
If you want to archive X.blogspot.com from outside the US, start the crawl on http://X.blogspot.com/ncr (ncr = no country redirect) to avoid getting redirected to another TLD. Note that /ncr sets an `NCR` cookie that expires after a few weeks.
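For example, a crawl might be started like this (X.blogspot.com is a placeholder blog name):

```
# Start on /ncr so the NCR cookie is set before any country redirect happens.
grab-site http://X.blogspot.com/ncr
```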
Some blogspot.com blogs use "[Dynamic Views](https://support.google.com/blogger/answer/1229061?hl=en)" themes that require JavaScript and serve absolutely no HTML content. In rare cases, you can get JavaScript-free pages by appending `?m=1` (e.g. http://happinessbeyondthought.blogspot.com/?m=1). Otherwise, you can archive parts of these blogs through Google Cache instead (e.g. https://webcache.googleusercontent.com/search?q=cache:http://blog.datomic.com/) or by using http://archive.is/ instead of grab-site. If neither of these options works, try [using grab-site with phantomjs](https://github.com/ludios/grab-site/issues/55#issuecomment-162118702).
-### Tumblr blogs
+#### Tumblr blogs
Use `--igsets=singletumblr` to avoid crawling the homepages of other tumblr blogs.
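A sketch of a typical invocation (X.tumblr.com is a placeholder):

```
# singletumblr keeps the crawl on this blog instead of wandering into
# the homepages of every other tumblr blog it links to.
grab-site http://X.tumblr.com/ --igsets=singletumblr
```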
@ -246,27 +246,27 @@ If you don't care about who liked or reblogged a post, add `\?from_c=` to the cr
Some tumblr blogs appear to require JavaScript, but they are actually just hiding the page content with CSS. You are still likely to get a complete crawl. (See the links in the page source for http://X.tumblr.com/archive).
-### Directory listings ("Index of ...")
+#### Directory listings ("Index of ...")
Use `--no-dupespotter` to avoid triggering false positives on the duplicate page detector. Without it, the crawl may miss large parts of the directory tree.
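For example (the URL is a placeholder):

```
# Directory listings contain many near-identical pages, which can trigger
# false positives in the duplicate-page detector; this turns it off.
grab-site 'http://www.example.com/pub/' --no-dupespotter
```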
-### Very large websites
+#### Very large websites
Use `--no-offsite-links` to stay on the main website and avoid crawling linked pages on other domains.
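For example (the URL is a placeholder):

```
# Stay on the start domain; do not crawl linked pages on other domains.
grab-site http://www.example.com/ --no-offsite-links
```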
-### Websites that are likely to ban you for crawling fast
+#### Websites that are likely to ban you for crawling fast
Use `--concurrency=1 --delay=500-1500`.
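For example (the URL is a placeholder; the delay range is in milliseconds):

```
# One request at a time, with a random 500-1500 ms pause between requests.
grab-site http://www.example.com/ --concurrency=1 --delay=500-1500
```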
-### MediaWiki sites with English language
+#### MediaWiki sites with English language
Use `--igsets=mediawiki`. Note that this ignore set ignores old page revisions.
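For example (the URL is a placeholder):

```
# The mediawiki ignore set skips wiki plumbing such as old page revisions.
grab-site https://wiki.example.com/ --igsets=mediawiki
```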
-### MediaWiki sites with non-English language
+#### MediaWiki sites with non-English language
You will probably have to add ignores with translated `Special:*` URLs based on [ignore_sets/mediawiki](https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/mediawiki).
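For example, for a German-language wiki you could append a translated pattern to the crawl's `ignores` file (DIR is the directory grab-site created for the crawl; `Spezial:` is the German equivalent of `Special:` and is only an illustration):

```
# One regex per line; grab-site picks up edits to DIR/ignores while running.
echo 'Spezial:' >> DIR/ignores
```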
-### Forums
+#### Forums
Forums require more manual intervention with ignore patterns. `--igsets=forums` ([ignore_sets/forums](https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/forums)) is often useful for non-SMF forums, but you will have to add other ignore patterns, including one to ignore individual-forum-post pages if there are too many posts to crawl. (Generally, crawling the thread pages is enough.)
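A sketch of a starting point (the URL is a placeholder; expect to add site-specific patterns as the crawl runs):

```
# Begin with the forums ignore set, then add patterns (e.g. for
# individual-post permalinks) to DIR/ignores as needed.
grab-site http://forum.example.com/ --igsets=forums
```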