699 Commits

Author SHA1 Message Date
Ivan Kozik
8260a0d7cf Add note about Python 3.5 2015-09-30 23:44:25 +00:00
Ivan Kozik
bfe92c9605 Document how to upgrade grab-site 2015-09-30 22:31:46 +00:00
Ivan Kozik
7a63a3dcd1 Add --no-dupespotter for turning off dupespotter which sometimes has false positives 2015-09-30 22:16:56 +00:00
Ivan Kozik
17c1b9caaa Update default user agent 2015-09-25 20:32:08 +00:00
Ivan Kozik
f1548521ec Write URLs skipped by --max-content-length= to DIR/skipped_max_content_length 2015-09-02 19:15:00 +00:00
Ivan Kozik
3def2a79bc Fix for 32-bit machines: don't crash on startup with lmdb.MemoryError
lmdb.MemoryError: [...]/dupes_db: Cannot allocate memory
2015-09-02 19:04:56 +00:00
Ivan Kozik
e0ad2e9a25 Bump version 2015-08-28 04:29:24 +00:00
Ivan Kozik
5a04a38f59 Prevent twitter crawls from endlessly downloading [\?&]nav=
Credit to garyrh
2015-08-28 04:28:19 +00:00
Ivan Kozik
c28c593a83 Explain imdb ignore set 2015-08-21 08:35:57 +00:00
Ivan Kozik
b782c23389 Add --no-video option to skip the download of videos 2015-08-21 08:28:27 +00:00
Arkiver2
6f6754f81e Add --warc-max-size=BYTES option for controlling WARC size 2015-08-21 07:47:12 +00:00
Ivan Kozik
ee2684941d Add support for passing multiple URLs to grab-site 2015-08-21 07:18:31 +00:00
Ivan Kozik
524cdf2cec Fix blogspot search? ignore 2015-08-21 05:35:30 +00:00
Ivan Kozik
f379264ed1 Don't crash on --igsets=blogs even though it's gone 2015-08-21 05:31:51 +00:00
Ivan Kozik
49aa6d0dcf Remove mentions of blogs ignore set 2015-08-21 05:20:26 +00:00
Ivan Kozik
6a6dff0083 Add comment to reddit ignore set 2015-08-21 05:18:35 +00:00
Ivan Kozik
4b50c6de67 Migrate all other ignores from blogs to the global set 2015-08-21 05:16:26 +00:00
Ivan Kozik
954ab31acb Remove ignores that probably only wget needed 2015-08-21 04:48:47 +00:00
Ivan Kozik
b83bce1f2a Migrate some ignores from blogs to global set 2015-08-21 04:48:19 +00:00
Ivan Kozik
1b8b4b0077 Ignore per-post and per-comment Atom feeds on blogspot.com 2015-08-21 04:23:54 +00:00
Ivan Kozik
291b3e939b Fix: --offsite-links should be on by default 2015-08-13 12:29:11 +00:00
Ivan Kozik
a3f1ff7ed9 Cache control files for just 1.5 sec instead of 3 sec 2015-08-12 08:52:56 +00:00
Ivan Kozik
050dbc44d8 Fix very recent regression: report the pattern instead of the regexp 2015-08-12 08:51:41 +00:00
Ivan Kozik
b8b6248aab Remove no-longer-needed workaround in ignore sets 2015-08-12 07:55:08 +00:00
Ivan Kozik
1d52a28fac Increase size of compiled regexp cache; remove unused code 2015-08-12 07:52:24 +00:00
Ivan Kozik
13d73ad59f README: tweak 2015-08-12 07:49:20 +00:00
Ivan Kozik
26c7ea84d8 Implement --wpull-args for passing additional arguments to wpull 2015-08-12 06:39:49 +00:00
Ivan Kozik
1674751b1c Don't crash if DIR/concurrency is set to 0 2015-08-12 05:57:56 +00:00
Ivan Kozik
db212f6716 Reorder options in README 2015-08-12 05:39:16 +00:00
Ivan Kozik
28f5652404 Bump version 2015-08-12 05:29:44 +00:00
Ivan Kozik
ba823a34f8 Print ignores without doubling up backslashes 2015-08-12 05:26:11 +00:00
Ivan Kozik
668c03d5d2 Implement -i / --input-file, supporting both local input files and URLs 2015-08-12 05:24:09 +00:00
Ivan Kozik
37a6cb655c README: tweak 2015-08-10 13:40:52 +00:00
Ivan Kozik
9989eb5b70 Pretend to be Firefox 40; it's out tomorrow 2015-08-10 13:38:54 +00:00
Ivan Kozik
b7743e780a Implement --ua= for setting the User-Agent 2015-08-10 13:38:00 +00:00
Ivan Kozik
ee4dbe162e Implement --igon / --igoff 2015-08-10 13:23:43 +00:00
Ivan Kozik
76ba117d34 Document DIR/max_content_length 2015-08-10 13:15:20 +00:00
Ivan Kozik
bf080c7cb4 Implement --max-content-length=N for skipping large responses 2015-08-10 13:12:34 +00:00
Ivan Kozik
8b1791475d Remove unused import 2015-08-10 13:00:37 +00:00
Ivan Kozik
dfd1e8cd47 singletumblr igset: explain 2015-08-10 11:51:33 +00:00
Ivan Kozik
1cb9331939 nosortedindex igset: add comment 2015-08-10 11:49:45 +00:00
Ivan Kozik
33cc3040ed mediawiki igset: add comments 2015-08-10 11:48:53 +00:00
Ivan Kozik
40cae40dc5 blogs igset: comment more 2015-08-10 11:46:32 +00:00
Ivan Kozik
4e517e2994 blogs igset: remove ignores that are already covered by 'global' 2015-08-10 11:45:28 +00:00
Ivan Kozik
4d570d88bd Add some comments to 'blogs' ignore set 2015-08-10 11:44:20 +00:00
Ivan Kozik
6f03c5137d Move pixel.redditmedia.com from reddit to global ignore set 2015-08-10 11:42:03 +00:00
Ivan Kozik
e304c60586 Describe why various ignores are in the 'global' ignore set; add support for comments in ignore sets 2015-08-10 11:41:16 +00:00
Ivan Kozik
aa9b877843 Don't crash with "error: unrecognized arguments" if cwd contains space
Closes #32.
2015-08-02 03:51:37 +00:00
Ivan Kozik
9f071a706d setup.py: specify minimum version for all dependencies
Specifically, this solves a problem where trollius is too old to have ensure_future.
2015-08-02 01:47:03 +00:00
Ivan Kozik
e55fa13004 Make wpull write .cdx file (its impl does one .cdx covering all WARC files) 2015-07-31 23:55:27 +00:00