1158 Commits (master)

Author SHA1 Message Date
Ivan Kozik eabcf70141 README: tweak wording 2018-08-28 01:27:28 +00:00
Ivan Kozik cdd7928750 Use DIR/scrape file to control whether to scrape for new URLs in responses
present = do scrape
missing = don't scrape
2018-08-28 01:22:19 +00:00
Ivan Kozik 90c37526e1 reddit igset: apply to old.reddit.com as well 2018-08-25 14:29:14 +00:00
Ivan Kozik bf0d7d28a9 README: using Googlebot UA on tumblr no longer works 2018-08-24 00:41:35 +00:00
Ivan Kozik 0ea3d40938 Add default get_urls hook to get :orig images on Twitter and ?share=1 pages on Quora 2018-08-20 15:11:10 +00:00
Ivan Kozik a3f1c51f55 global igset: ignore amazon logging 2018-08-16 15:48:56 +00:00
Ivan Kozik 4899dcd51b global igset: ignore sitemeter.com counters 2018-08-15 17:14:18 +00:00
Ivan Kozik ca8fd22c02 singletumblr igset: don't ignore non-tumblr domains; don't apply ignores to start URLs
https://github.com/ludios/grab-site/issues/126
2018-08-06 23:36:50 +00:00
Ivan Kozik fbc0475157 dashboard: keep table aligned when a crawl has > 9 connections 2018-08-04 21:54:55 +00:00
Ivan Kozik 6d76cf5903 dashboard: keep stats rows aligned when using San Francisco font 2018-08-02 18:40:04 +00:00
Ivan Kozik 398c0cf8e6 grab-site --help: link to README.md 2018-07-28 12:36:29 +00:00
Ivan Kozik 644260c479 README: document how to bypass tumblr's GDPR consent page 2018-07-07 12:05:34 +00:00
Ivan Kozik a3537c7f2c Revert Googlebot UA to avoid breaking reddit crawls
With Googlebot in the UA, reddit says:

429 Too Many Requests https://www.reddit.com/...
2018-07-07 12:03:23 +00:00
Ivan Kozik aa01eb8293 README: mention updated UA 2018-06-25 02:13:09 +00:00
Ivan Kozik 5bc2069d9b Bump Firefox version in UA string and add Googlebot to UA to archive tumblr blogs from Europe without GDPR cookie 2018-06-25 01:51:57 +00:00
Ivan Kozik 1069dedfcd global igset: ignore two more share links 2018-05-25 06:42:22 +00:00
Ivan Kozik f47fc0a899 global igset: ignore beacon.wikia-services.com 2018-05-24 20:03:38 +00:00
Ivan Kozik a2e751f9dc README: Ubuntu 17.10 -> 18.04; show newer-distro instructions first 2018-05-19 19:31:19 +00:00
Ivan Kozik e79cbac070 README: fix macOS install steps for PyPI now requiring TLS 1.2 support
Fixes https://github.com/ludios/grab-site/issues/121
2018-05-15 20:57:38 +00:00
Ivan Kozik b97414c5a4 README: Python 3.4.7 -> 3.4.8 2018-05-15 20:51:02 +00:00
Ivan Kozik 8e8cd5895b global igset: block more reddit tracking pixels 2018-05-06 03:28:29 +00:00
Ivan Kozik 1bfb5eca99 global igset: ignore new reddit tracking pixel 2018-05-05 21:09:21 +00:00
Ivan Kozik 42ba39afb4 global igset: ignore getpocket.com/edit 2018-04-12 08:44:40 +00:00
Ivan Kozik bbe36cbe39 global igset: ignore jp.pinterest.com/pin/create/ 2018-04-12 08:43:15 +00:00
Ivan Kozik fe5dd47df8 Bump UA lie to Firefox 59 2018-04-08 07:09:59 +00:00
Ivan Kozik 5a05fa9761 Lock tornado version to 4.5.3 to avoid 5.0, which breaks with:
  File "[...]/lib/python3.4/site-packages/wpull/abstract/client.py", line 9, in <module>
    from wpull.connection import ConnectionPool
  File "[...]/lib/python3.4/site-packages/wpull/connection.py", line 11, in <module>
    from tornado.netutil import SSLCertificateError
ImportError: cannot import name 'SSLCertificateError'
2018-03-06 06:29:07 +00:00
Ivan Kozik 82de2f2b2b Add --import-ignores for starting with a non-empty DIR/ignores file 2017-12-27 13:48:20 +00:00
Ivan Kozik 6b6d5785e2 README: adjust logo size 2017-12-27 13:36:32 +00:00
Ivan Kozik cea5a1f90d default_cookies.txt: skip the age gate on store.steampowered.com 2017-12-14 12:14:06 +00:00
Ivan Kozik 6d1b24f903 extra_docs/pause_resume_grab_sites.sh: only resume grab-sites if we paused the grab-sites 2017-12-14 05:25:48 +00:00
Ivan Kozik 97caf59705 README: add BrowserStack logo per terms 2017-12-13 23:05:41 +00:00
Ivan Kozik fe38081834 README: thank BrowserStack 2017-12-13 23:00:26 +00:00
Ivan Kozik 2eeab5b2bc reddit igset: ignore out.reddit.com; appears to be safe to ignore because the tracking links are redundant with the non-tracking links 2017-12-13 04:23:02 +00:00
Ivan Kozik a5b13a8393 global igset: ignore another /search.*updated-(min|max)= pattern on blogspot:
*.blogspot.com/search?q=QUERY&updated-max=2011-08-23T15:10:00-07:00&max-results=20&start=79&by-date=true
2017-12-12 03:40:15 +00:00
Ivan Kozik 9e24731262 global igset: ignore 16x16 tumblr avatars with .pnj extension (typo-prone tumblr programmer?) 2017-12-12 03:11:09 +00:00
Ivan Kozik 2f95d7f652 Bump UA lie to Firefox 57 on Windows 10 2017-12-11 11:06:37 +00:00
Ivan Kozik 703534a0ee reddit igset: ignore URLs with [\?&]utm_ 2017-12-11 02:22:01 +00:00
Ivan Kozik ff33ab8295 dashboard: adjust color to make it more obvious that stats line is a click target 2017-12-07 05:57:44 +00:00
Ivan Kozik 4568dd46f4 dashboard: help text: job -> crawl; 'job' is ArchiveBot terminology 2017-12-07 05:54:42 +00:00
Ivan Kozik c6c5bdefc7 dashboard: for Chrome 63+, use the faster `overscroll-behavior: contain` instead of attaching an onwheel event. 2017-12-07 05:19:15 +00:00
Ivan Kozik 70dc5cbe0b dashboard: add a subtle box-shadow to the log windows 2017-12-07 05:11:26 +00:00
Ivan Kozik 3b787cda83 dashboard: make the background a little less saturated 2017-12-07 05:01:25 +00:00
Ivan Kozik 4699e581fc README: add install steps for Debian 8 (jessie) 2017-12-07 02:36:14 +00:00
Ivan Kozik 26655fb28c README: switch from PPA-based python3.4 install to pyenv-based install; add install steps for Debian 9 and 10 2017-12-07 02:28:45 +00:00
Ivan Kozik 95e98ecefe README: link to wpull v1.2.3 2017-11-22 18:34:50 +00:00
Ivan Kozik b3c83f203c README: add note about gs-server listening on all interfaces by default 2017-11-22 18:09:49 +00:00
Ivan Kozik 62d4575b0c README: point to the newer ppa:deadsnakes/ppa PPA with Python 3.4.7 2017-11-22 17:57:36 +00:00
Ivan Kozik 2276adefe8 README: be less confusing about "start a new shell" 2017-11-22 17:25:31 +00:00
Ivan Kozik fc09d22028 README: ask users to file issues 2017-11-19 04:11:57 +00:00
Ivan Kozik c677c29aaf global igset: ignore new facebook like.php links
e.g. https://www.facebook.com/v2.9/plugins/like.php?href=
2017-11-16 20:23:28 +00:00