776 Commits

Author SHA1 Message Date
David Yip
d5ca9e0ce9 db: More ic.cz patterns.
In particular:

- harizzzma.com and nahraj.net no longer resolve, so don't waste time
  trying
- ignore new/register links for forums
- ignore another "add to cart" link
2015-07-29 07:33:50 +00:00
David Yip
174b1815ef db: ic.cz ignore set - further refinements.
In particular:

- ignore more guestbook links
- remove viewtopic.php.*start= from set, because as it turns out that's
  a totally valid method for paging through a thread (one way of many,
  sigh)
2015-07-29 07:33:50 +00:00
David Yip
ebc858ae32 db: ic.cz: Also ignore &start=\d+ on forums.
This appears to be a pagination thing that we don't need.
2015-07-29 07:33:50 +00:00
David Yip
d19bea710a db: More troublesome infinite-calendar loops on ic.cz. 2015-07-29 07:33:50 +00:00
David Yip
a709dfa6c2 db: An ignore set for unwanted URLs on ic.cz.
This could be broken up later, but this is much more convenient for now.
2015-07-29 07:33:50 +00:00
David Yip
089faa5cf9 db: coppermine: also ignore last-commented-by order. 2015-07-29 07:33:50 +00:00
David Yip
a3e21ad5fc db: Restrict Coppermine album selector to displayimage.php. 2015-07-29 07:33:50 +00:00
David Yip
2ba9dc0187 db: Also ignore Coppermine's lastupby pseudo-album. 2015-07-29 07:33:50 +00:00
David Yip
da76445850 db: Also ignore addfav.php for Coppermine. 2015-07-29 07:33:50 +00:00
David Yip
85e8113f6a db: Add an ignore set for Coppermine Photo Gallery.
ic.cz has TONS of these things.
2015-07-29 07:33:50 +00:00
Ivan Kozik
4ad23c6118 Ignore more twitter share links 2015-07-29 07:33:50 +00:00
Ivan Kozik
661f8be5a7 Ignore non-Icecast mp3 streaming sites 2015-07-29 07:33:50 +00:00
Ivan Kozik
97db1927ac Ignore more dokuwiki nonsense 2015-07-29 07:33:50 +00:00
Ivan Kozik
aacc472354 Ignore some junk wordpress URLs 2015-07-29 07:33:50 +00:00
Ivan Kozik
0f9ccc4846 Ignore another share link 2015-07-29 07:33:50 +00:00
Ivan Kozik
3f7b022e7c Ignore another share link 2015-07-29 07:33:49 +00:00
Ivan Kozik
b55a89ecb0 Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100 2015-07-29 07:33:49 +00:00
Ivan Kozik
12c8536cd3 Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100 2015-07-29 07:33:49 +00:00
Ivan Kozik
3817170f6d Work around https://github.com/ArchiveTeam/ArchiveBot/issues/138#issuecomment-68352100 2015-07-29 07:33:49 +00:00
David Yip
4b192e63c5 db: Use correct delimiter for {primary_netloc} in singletumblr. #104. 2015-07-29 07:33:49 +00:00
David Yip
483c9ac2d2 db: Remove trailing space in singletumblr ignore set. #104. 2015-07-29 07:33:49 +00:00
David Yip
6be228fe0b pipeline: Switch to templates for placeholders. #104.
str.format() substitutes every occurrence of {token} with the matching
value in the formatting map.  Unfortunately, {m,} is also regex syntax for
"match m or more repetitions of the preceding regex", and we use {3,} in a
global ignore.

Solution: Use a different delimiter.  Python's string templates look
like they give us enough power to do what we need to do, and they won't
clobber repetition ranges.

Unfortunately, we can't use the default $ delimiter, because $ is a
regex metacharacter.  %# seems sufficiently unlikely to appear in URLs.
2015-07-29 07:33:49 +00:00
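The placeholder problem described in 6be228fe0b can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the class name IgnoreTemplate, the helper names, and the sample pattern are all hypothetical. Only the %# delimiter and the {primary_netloc} token come from the commit messages.

```python
import re
import string


class IgnoreTemplate(string.Template):
    # Hypothetical class name; only the "%#" delimiter is taken from the
    # commit message.
    delimiter = "%#"


def expand(pattern, values):
    """Fill %#{name} placeholders, leaving regex repetition ranges alone."""
    return IgnoreTemplate(pattern).substitute(values)


def format_fails(pattern, values):
    """Return True if str.format() cannot process the pattern."""
    try:
        pattern.format(**values)
    except (KeyError, IndexError, ValueError):
        return True
    return False


# Assumed example value; the netloc is escaped so it is safe inside a regex.
values = {"primary_netloc": re.escape("example.tumblr.com")}

# str.format() treats {3,} as a replacement field and raises.
print(format_fails(r"(/[^/]+){3,}", values))

# With the custom delimiter, only %#{primary_netloc} is replaced.
pattern = r"^https?://(?!%#{primary_netloc})[^/]+\.tumblr\.com/(/?[^/]+){3,}"
print(expand(pattern, values))
```

Because the template only reacts to the %# prefix, any {m,n} repetition range in an ignore pattern passes through unchanged. A literal %# in a pattern would need to be written as %#%#.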
David Yip
fd1d4f74d3 db: Add an ignore set to restrict !a *.tumblr.com to the target. #104.
(This is the sort of thing that #104 is useful for.)
2015-07-29 07:33:49 +00:00
Ivan Kozik
673f23960c Fix typo in /js/chartbeat.js 2015-07-29 07:33:49 +00:00
Ivan Kozik
1126169737 Ignore Special:Log/ 2015-07-29 07:33:49 +00:00
Ivan Kozik
1114e93271 Ignore another streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
6366e07906 Ignore more of streamtheworld.com
Sample URL:
http://7579.live.streamtheworld.com/977_90?type=.flv
2015-07-29 07:33:49 +00:00
Ivan Kozik
cc13f8f7cc Ignore imageshack.com/lost 2015-07-29 07:33:49 +00:00
David Yip
543c0ca86d Ignore sets: fix JSON errors. 2015-07-29 07:33:49 +00:00
Start
ae33daa88d fix ignore 2015-07-29 07:33:49 +00:00
PressStartandSelect
13d921a2a0 add social media ignores and safari user agent 2015-07-29 07:33:49 +00:00
Ivan Kozik
46aae55eaa Add blogspot.sg 2015-07-29 07:33:49 +00:00
David Yip
c46406bb43 Add Meetup Everywhere ignore set.
Added to help out with a bunch of Meetup Everywhere jobs.
2015-07-29 07:33:49 +00:00
Ivan Kozik
6cb33929b2 Ignore Windows 7 .iso's that we've already grabbed 2015-07-29 07:33:49 +00:00
Ivan Kozik
5cb7e2acca Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
51dfe02202 Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
27b64dd2a7 Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
584746b60f Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
7748204e2f Ignore another share link 2015-07-29 07:33:49 +00:00
Ivan Kozik
fc51c61050 Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00
Ivan Kozik
89565717af Ignore another share link 2015-07-29 07:33:49 +00:00
Ivan Kozik
ca85f5f803 Ignore more share links 2015-07-29 07:33:49 +00:00
Ivan Kozik
e3c8b96b82 Ignore more do=markread 2015-07-29 07:33:49 +00:00
Ivan Kozik
7ea9331fd6 Ignore another Icecast site 2015-07-29 07:33:49 +00:00
Ivan Kozik
2179192043 Ignore some vbulletin loops 2015-07-29 07:33:49 +00:00
Ivan Kozik
46a45eb391 Ignore /ucp\.php\?mode=delete_cookies 2015-07-29 07:33:49 +00:00
Ivan Kozik
644f787151 Fix licdn.com ignore for new wpull URL encoding behavior 2015-07-29 07:33:49 +00:00
Ivan Kozik
d46def8308 Ignore blogger.com/blog_this.pyra 2015-07-29 07:33:49 +00:00
Ivan Kozik
ec8151fcb6 Move blogger.com ignore to global 2015-07-29 07:33:49 +00:00
Ivan Kozik
7483dcbae7 Ignore another mp3 streaming site 2015-07-29 07:33:49 +00:00