562 Commits

Author SHA1 Message Date
Ivan Kozik
0f9ccc4846 Ignore another share link 2015-07-29 07:33:50 +00:00
Ivan Kozik
3f7b022e7c Ignore another share link 2015-07-29 07:33:49 +00:00
Ivan Kozik
b55a89ecb0 Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100 2015-07-29 07:33:49 +00:00
Ivan Kozik
12c8536cd3 Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100 2015-07-29 07:33:49 +00:00
Ivan Kozik
3817170f6d Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100 2015-07-29 07:33:49 +00:00
David Yip
4b192e63c5 db: Use correct delimiter for {primary_netloc} in singletumblr. #104. 2015-07-29 07:33:49 +00:00
David Yip
483c9ac2d2 db: Remove trailing space in singletumblr ignore set. #104. 2015-07-29 07:33:49 +00:00
David Yip
6be228fe0b pipeline: Switch to templates for placeholders. #104.
str.format() replaces every {token} with the matching value from the
formatting map.  Unfortunately, {m,} is also regex syntax for
"match m or more repetitions of the preceding regex", and we use {3,} in a
global ignore.

Solution: Use a different delimiter.  Python's string templates look
like they give us enough power to do what we need to do, and they won't
clobber repetition ranges.

Unfortunately, we can't use the default $ delimiter, because $ is a
regex metacharacter.  %# seems sufficiently unlikely to appear in URLs.
2015-07-29 07:33:49 +00:00
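
Illustration of the commit above (a minimal sketch, not the project's actual code): the class name IgnoreTemplate, the helper names, and the sample pattern are hypothetical. Only the %# delimiter, the {primary_netloc} placeholder name, and the {3,} repetition range come from the commit messages. It assumes placeholders are written as %#{name} and uses safe_substitute so that unknown or malformed placeholders are left untouched.

```python
import re
from string import Template


class IgnoreTemplate(Template):
    # Hypothetical subclass: '%#' replaces the default '$' delimiter,
    # which would clash with the regex end-of-string anchor.
    delimiter = "%#"


def expand(pattern, params):
    # safe_substitute leaves unknown or malformed placeholders as-is
    # instead of raising, and never touches regex ranges like {3,}.
    return IgnoreTemplate(pattern).safe_substitute(params)


def format_fails(pattern, params):
    # str.format() tries to read {3,} as a replacement field and raises.
    try:
        pattern.format(**params)
    except (KeyError, IndexError, ValueError):
        return True
    return False


# Assumed example values; re.escape keeps the dots in the host literal.
params = {"primary_netloc": re.escape("example.tumblr.com")}

# Hypothetical ignore pattern containing a repetition range.
templated = r"^https?://%#{primary_netloc}/(?:[a-z]/){3,}"
expanded = expand(templated, params)

# The same pattern written with str.format() placeholders breaks on {3,}.
braced = r"^https?://{primary_netloc}/(?:[a-z]/){3,}"
broken = format_fails(braced, params)
```

With these assumed inputs, the expanded pattern keeps {3,} intact and can be passed straight to re.search, while the str.format() variant fails before any substitution happens.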
David Yip
fd1d4f74d3 db: Add an ignore set to restrict !a *.tumblr.com to the target. #104.
(This is the sort of thing that #104 is useful for.)
2015-07-29 07:33:49 +00:00
Ivan Kozik
673f23960c Fix typo in /js/chartbeat.js 2015-07-29 07:33:49 +00:00
Ivan Kozik
1126169737 Ignore Special:Log/ 2015-07-29 07:33:49 +00:00
Ivan Kozik
1114e93271 Ignore another streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
6366e07906 Ignore more of streamtheworld.com
Sample URL:
http://7579.live.streamtheworld.com/977_90?type=.flv
2015-07-29 07:33:49 +00:00
Ivan Kozik
cc13f8f7cc Ignore imageshack.com/lost 2015-07-29 07:33:49 +00:00
David Yip
543c0ca86d Ignore sets: fix JSON errors. 2015-07-29 07:33:49 +00:00
Start
ae33daa88d fix ignore 2015-07-29 07:33:49 +00:00
PressStartandSelect
13d921a2a0 add social media ignores and safari user agent 2015-07-29 07:33:49 +00:00
Ivan Kozik
46aae55eaa Add blogspot.sg 2015-07-29 07:33:49 +00:00
David Yip
c46406bb43 Add Meetup Everywhere ignore set.
Added to help out with a bunch of Meetup Everywhere jobs.
2015-07-29 07:33:49 +00:00
Ivan Kozik
6cb33929b2 Ignore Windows 7 .iso's that we've already grabbed 2015-07-29 07:33:49 +00:00
Ivan Kozik
5cb7e2acca Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
51dfe02202 Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
27b64dd2a7 Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
584746b60f Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
7748204e2f Ignore another share link 2015-07-29 07:33:49 +00:00
Ivan Kozik
fc51c61050 Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
89565717af Ignore another share link 2015-07-29 07:33:49 +00:00
Ivan Kozik
ca85f5f803 Ignore more share links 2015-07-29 07:33:49 +00:00
Ivan Kozik
e3c8b96b82 Ignore more do=markread 2015-07-29 07:33:49 +00:00
Ivan Kozik
7ea9331fd6 Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
2179192043 Ignore some vbulletin loops 2015-07-29 07:33:49 +00:00
Ivan Kozik
46a45eb391 Ignore /ucp\.php\?mode=delete_cookies 2015-07-29 07:33:49 +00:00
Ivan Kozik
644f787151 Fix licdn.com ignore for new wpull URL encoding behavior 2015-07-29 07:33:49 +00:00
Ivan Kozik
d46def8308 Ignore blogger.com/blog_this.pyra 2015-07-29 07:33:49 +00:00
Ivan Kozik
ec8151fcb6 Move blogger.com ignore to global 2015-07-29 07:33:49 +00:00
Ivan Kozik
7483dcbae7 Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
74b96843c5 Ignore more JavaScript non-URLs 2015-07-29 07:33:49 +00:00
Ivan Kozik
c05ecaf70e Ignore more mp3 streaming sites 2015-07-29 07:33:49 +00:00
Ivan Kozik
5dc41cf274 Ignore *.corp.ne1.yahoo.com - drops traffic 2015-07-29 07:33:49 +00:00
Ivan Kozik
02bb21afd2 Ignore more mp3 streaming sites 2015-07-29 07:33:49 +00:00
Ivan Kozik
fef513ef9d Ignore more mp3 streaming sites 2015-07-29 07:33:49 +00:00
Ivan Kozik
27451df729 Ignore more mp3 streaming sites 2015-07-29 07:33:49 +00:00
Ivan Kozik
71164d0f8a Update global.json 2015-07-29 07:33:49 +00:00
Ivan Kozik
4f0295f473 Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
60c8f47f72 Remove unnecessary ignore
" is quoted
2015-07-29 07:33:49 +00:00
Ivan Kozik
dd85e1f295 Add ignores for wpull@develop
It does not quote as many URLs
2015-07-29 07:33:49 +00:00
Ivan Kozik
5021267c8c Ignore bad /js/chartbeat.js links 2015-07-29 07:33:49 +00:00
Ivan Kozik
884dac1e51 Ignore bad linkedin URLs found by wpull 2015-07-29 07:33:49 +00:00
Ivan Kozik
a1da3de9af Ignore more twitter share links 2015-07-29 07:33:49 +00:00
Ivan Kozik
14843fe6dd Ignore &action=edit&section=new 2015-07-29 07:33:49 +00:00