Ivan Kozik
29b9825dc5
Bump version
2018-10-02 04:07:16 +00:00
Ivan Kozik
9575ed4ec2
global igset: remove Google Finance ignore as the site no longer exists
2018-10-02 04:05:31 +00:00
Ivan Kozik
fa68cc68f0
global igset: combine some ignores
2018-10-02 04:05:08 +00:00
Ivan Kozik
2ad9d18d41
global igset: ignore telegram share URL
2018-10-02 04:02:22 +00:00
Ivan Kozik
6ea44ae862
global igset: combine some ignores
2018-10-02 04:01:58 +00:00
Ivan Kozik
a045da3b82
default_cookies.txt: skip the quarantine gate on reddit.com
2018-09-28 04:27:15 +00:00
Ivan Kozik
e664e4fd54
README: mention cookies.txt extension for Firefox
2018-09-08 04:30:01 +00:00
Ivan Kozik
424e58a173
README: document DIR/scrape
2018-08-28 01:30:35 +00:00
Ivan Kozik
eabcf70141
README: tweak wording
2018-08-28 01:27:28 +00:00
Ivan Kozik
cdd7928750
Use DIR/scrape file to control whether to scrape for new URLs in responses
present = do scrape
missing = don't scrape
2018-08-28 01:22:19 +00:00
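The present/missing rule in the commit above can be sketched in Python as follows. This is a minimal illustration, not grab-site's actual code: the function name `should_scrape` and the `control_dir` parameter are hypothetical, and only the file name `scrape` comes from the commit message.

```python
import os


def should_scrape(control_dir):
    """Return True if a file named 'scrape' exists in control_dir.

    Hypothetical helper: only the DIR/scrape convention is taken from
    the commit message; the name and signature are assumptions.
    """
    return os.path.exists(os.path.join(control_dir, "scrape"))
```

Under this sketch, creating the file turns scraping on and deleting it turns scraping off, with no other state to track.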
Ivan Kozik
90c37526e1
reddit igset: apply to old.reddit.com as well
2018-08-25 14:29:14 +00:00
Ivan Kozik
bf0d7d28a9
README: using Googlebot UA on tumblr no longer works
2018-08-24 00:41:35 +00:00
Ivan Kozik
0ea3d40938
Add default get_urls hook to get :orig images on Twitter and ?share=1 pages on Quora
2018-08-20 15:11:10 +00:00
Ivan Kozik
a3f1c51f55
global igset: ignore amazon logging
2018-08-16 15:48:56 +00:00
Ivan Kozik
4899dcd51b
global igset: ignore sitemeter.com counters
2018-08-15 17:14:18 +00:00
Ivan Kozik
ca8fd22c02
singletumblr igset: don't ignore non-tumblr domains; don't apply ignores to start URLs
https://github.com/ludios/grab-site/issues/126
2018-08-06 23:36:50 +00:00
Ivan Kozik
fbc0475157
dashboard: keep table aligned when a crawl has > 9 connections
2018-08-04 21:54:55 +00:00
Ivan Kozik
6d76cf5903
dashboard: keep stats rows aligned when using San Francisco font
2018-08-02 18:40:04 +00:00
Ivan Kozik
398c0cf8e6
grab-site --help: link to README.md
2018-07-28 12:36:29 +00:00
Ivan Kozik
644260c479
README: document how to bypass tumblr's GDPR consent page
2018-07-07 12:05:34 +00:00
Ivan Kozik
a3537c7f2c
Revert Googlebot UA to avoid breaking reddit crawls
With Googlebot in the UA, reddit says:
429 Too Many Requests https://www.reddit.com/...
2018-07-07 12:03:23 +00:00
Ivan Kozik
aa01eb8293
README: mention updated UA
2018-06-25 02:13:09 +00:00
Ivan Kozik
5bc2069d9b
Bump Firefox version in UA string and add Googlebot to UA to archive tumblr blogs from Europe without GDPR cookie
2018-06-25 01:51:57 +00:00
Ivan Kozik
1069dedfcd
global igset: ignore two more share links
2018-05-25 06:42:22 +00:00
Ivan Kozik
f47fc0a899
global igset: ignore beacon.wikia-services.com
2018-05-24 20:03:38 +00:00
Ivan Kozik
a2e751f9dc
README: Ubuntu 17.10 -> 18.04; show newer-distro instructions first
2018-05-19 19:31:19 +00:00
Ivan Kozik
e79cbac070
README: fix macOS install steps for PyPI now requiring TLS 1.2 support
Fixes https://github.com/ludios/grab-site/issues/121
2018-05-15 20:57:38 +00:00
Ivan Kozik
b97414c5a4
README: Python 3.4.7 -> 3.4.8
2018-05-15 20:51:02 +00:00
Ivan Kozik
8e8cd5895b
global igset: block more reddit tracking pixels
2018-05-06 03:28:29 +00:00
Ivan Kozik
1bfb5eca99
global igset: ignore new reddit tracking pixel
2018-05-05 21:09:21 +00:00
Ivan Kozik
42ba39afb4
global igset: ignore getpocket.com/edit
2018-04-12 08:44:40 +00:00
Ivan Kozik
bbe36cbe39
global igset: ignore jp.pinterest.com/pin/create/
2018-04-12 08:43:15 +00:00
Ivan Kozik
fe5dd47df8
Bump UA lie to Firefox 59
2018-04-08 07:09:59 +00:00
Ivan Kozik
5a05fa9761
Lock tornado version to 4.5.3 to avoid 5.0, which breaks with:
  File "[...]/lib/python3.4/site-packages/wpull/abstract/client.py", line 9, in <module>
    from wpull.connection import ConnectionPool
  File "[...]/lib/python3.4/site-packages/wpull/connection.py", line 11, in <module>
    from tornado.netutil import SSLCertificateError
ImportError: cannot import name 'SSLCertificateError'
2018-03-06 06:29:07 +00:00
Ivan Kozik
82de2f2b2b
Add --import-ignores for starting with a non-empty DIR/ignores file
2017-12-27 13:48:20 +00:00
Ivan Kozik
6b6d5785e2
README: adjust logo size
2017-12-27 13:36:32 +00:00
Ivan Kozik
cea5a1f90d
default_cookies.txt: skip the age gate on store.steampowered.com
2017-12-14 12:14:06 +00:00
Ivan Kozik
6d1b24f903
extra_docs/pause_resume_grab_sites.sh: only resume grab-sites if we paused the grab-sites
2017-12-14 05:25:48 +00:00
Ivan Kozik
97caf59705
README: add BrowserStack logo per terms
2017-12-13 23:05:41 +00:00
Ivan Kozik
fe38081834
README: thank BrowserStack
2017-12-13 23:00:26 +00:00
Ivan Kozik
2eeab5b2bc
reddit igset: ignore out.reddit.com; appears to be safe to ignore because the tracking links are redundant with the non-tracking links
2017-12-13 04:23:02 +00:00
Ivan Kozik
a5b13a8393
global igset: ignore another /search.*updated-(min|max)= pattern on blogspot:
*.blogspot.com/search?q=QUERY&updated-max=2011-08-23T15:10:00-07:00&max-results=20&start=79&by-date=true
2017-12-12 03:40:15 +00:00
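To illustrate the kind of pattern the commit above describes, here is a small Python sketch that matches the example URL. The regular expression is an assumption built from the `/search.*updated-(min|max)=` fragment in the commit message; the actual entry in the global igset may be written differently, and the names `BLOGSPOT_SEARCH` and `is_ignored` are hypothetical.

```python
import re

# Assumed pattern, derived from the "/search.*updated-(min|max)=" fragment
# in the commit message; the real ignore entry may differ.
BLOGSPOT_SEARCH = re.compile(
    r"^https?://[^/]+\.blogspot\.com/search.*updated-(min|max)="
)


def is_ignored(url):
    """Return True if url looks like a paginated blogspot search URL."""
    return BLOGSPOT_SEARCH.search(url) is not None
```

With this sketch, a URL such as `http://example.blogspot.com/search?q=QUERY&updated-max=2011-08-23T15:10:00-07:00&max-results=20&start=79&by-date=true` would be ignored, while an ordinary post URL on the same host would not.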
Ivan Kozik
9e24731262
global igset: ignore 16x16 tumblr avatars with .pnj extension (typo-prone tumblr programmer?)
2017-12-12 03:11:09 +00:00
Ivan Kozik
2f95d7f652
Bump UA lie to Firefox 57 on Windows 10
2017-12-11 11:06:37 +00:00
Ivan Kozik
703534a0ee
reddit igset: ignore URLs with [\?&]utm_
2017-12-11 02:22:01 +00:00
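The `[\?&]utm_` fragment in the commit above can be tried out with a short Python sketch. Only the fragment itself comes from the commit message; the names `UTM_PARAM` and `has_utm_param` are hypothetical, and the sketch does not restrict matching to reddit hosts the way an igset entry presumably would.

```python
import re

# Fragment quoted in the commit message: a "utm_" parameter that directly
# follows "?" or "&" in the URL.
UTM_PARAM = re.compile(r"[\?&]utm_")


def has_utm_param(url):
    """Return True if url carries a utm_* query parameter."""
    return UTM_PARAM.search(url) is not None
```

Because the pattern requires a preceding `?` or `&`, a path segment that merely contains `utm_` would not match.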
Ivan Kozik
ff33ab8295
dashboard: adjust color to make it more obvious that stats line is a click target
2017-12-07 05:57:44 +00:00
Ivan Kozik
4568dd46f4
dashboard: help text: job -> crawl; 'job' is ArchiveBot terminology
2017-12-07 05:54:42 +00:00
Ivan Kozik
c6c5bdefc7
dashboard: for Chrome 63+, use the faster overscroll-behavior: contain
instead of attaching an onwheel event.
2017-12-07 05:19:15 +00:00
Ivan Kozik
70dc5cbe0b
dashboard: add a subtle box-shadow to the log windows
2017-12-07 05:11:26 +00:00
Ivan Kozik
3b787cda83
dashboard: make the background a little less saturated
2017-12-07 05:01:25 +00:00