1189 Commits

Author SHA1 Message Date
Mike Fährmann
0b2ff406f6
[plurk] add timeline- and post-extractors (#212) 2019-04-14 21:48:38 +02:00
Mike Fährmann
d6ddb74cde
update test results
- deviantart: 'index' is now an integer
- flickr: image file with lower quality
- paheal: image server name changed
- rule34: post got deleted
2019-04-12 09:59:48 +02:00
Mike Fährmann
87b0929bec
Revert "[flickr] restore image quality"
This reverts commit 3f513f10564a10ece8650e64d2233d8482fc14c7.

Both live.staticflickr and farmN.staticflickr servers now produce the
same image file with a lower overall quality than before this change on
Flickr's end.
2019-04-11 20:31:05 +02:00
Mike Fährmann
e7cd5510d5
[pixnet] add extractors (closes #177)
for:
- users/blogs: http://albertayu773.pixnet.net/
- folders: https://albertayu773.pixnet.net/album/folder/1405768
- sets   : https://albertayu773.pixnet.net/album/set/15078995
- photos : https://albertayu773.pixnet.net/album/photo/159443828
2019-04-11 19:27:02 +02:00
Mike Fährmann
155e1faeaf
[imagebam] support galleries with >100 images (fixes #219) 2019-04-11 19:12:27 +02:00
Mike Fährmann
9587aea98f
[deviantart] don't rewrite URLs for newer deviations
The '/intermediary/' trick stopped working for recently posted
deviations, but it still appears to be functional for older ones.
2019-04-11 10:37:01 +02:00
Mike Fährmann
f2220938cb
[mangoxo] improve channel extraction (#184) 2019-04-10 18:56:21 +02:00
Mike Fährmann
d9b94a585d
[mangoxo] add login support (#184)
A very recent change: It is now only possible to see more
than the first 5 images of an album if you are logged in.
2019-04-10 18:55:25 +02:00
Mike Fährmann
49a6522c38
ensure consistent headers and params ordering
Necessary to avoid being labeled a bot and getting a CAPTCHA response
after solving a Cloudflare challenge.
2019-04-09 10:52:27 +02:00
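The commit message does not say how the ordering is enforced. As a rough sketch of the idea only: headers can be rebuilt in one fixed order before every request, so that repeated requests look identical. The `ordered_headers` helper and the `HEADER_ORDER` tuple below are hypothetical and not taken from the project; `OrderedDict` is used because plain `dict` ordering is only guaranteed by the language from Python 3.7 on.

```python
from collections import OrderedDict


def ordered_headers(headers, order):
    """Return a copy of 'headers' with its keys arranged as in 'order'.

    Keys missing from 'order' are appended afterwards, sorted by name,
    so the result is the same no matter how 'headers' was built.
    """
    result = OrderedDict()
    for name in order:
        if name in headers:
            result[name] = headers[name]
    for name in sorted(headers):
        if name not in result:
            result[name] = headers[name]
    return result


# Hypothetical fixed order (an assumption for this example)
HEADER_ORDER = ("User-Agent", "Accept", "Accept-Language", "Accept-Encoding")

headers = ordered_headers(
    {
        "Accept-Encoding": "gzip",
        "X-Extra": "1",
        "User-Agent": "Mozilla/5.0",
        "Accept": "*/*",
    },
    HEADER_ORDER,
)
```

The same helper could be applied to query parameters, assuming they are also kept in a mapping.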
Mike Fährmann
e730fc9045
[twitter] add login support (#214) 2019-04-09 09:27:49 +02:00
Mike Fährmann
2c32dc76cb
[yaplog] update metadata structure (#190)
Put all blog post related fields in their own dict.

'image_id' -> 'id'
'post_id'  -> 'post[id]'
'title'    -> 'post[title]'
etc ...
2019-04-06 16:40:07 +02:00
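A minimal sketch of the renaming described above, covering only the three fields the message names. The `restructure` function and the sample values are made up for illustration; the actual extractor code may build the nested dict differently.

```python
def restructure(old):
    """Move post-related fields of a flat metadata dict into a 'post' dict."""
    return {
        "id": old["image_id"],        # 'image_id' -> 'id'
        "post": {
            "id": old["post_id"],     # 'post_id'  -> 'post[id]'
            "title": old["title"],    # 'title'    -> 'post[title]'
        },
    }


old = {"image_id": 1, "post_id": 2, "title": "example"}
new = restructure(old)

# Nested fields can be addressed with str.format-style index access
name = "{post[id]}_{id}".format(**new)
```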
Mike Fährmann
35919a9bb8
[livedoor] add blog- and post-extractors (#190) 2019-04-06 16:27:48 +02:00
Mike Fährmann
3f513f1056
[flickr] restore image quality
Flickr started serving images from live.staticflickr.com (see ec88ff1),
but the old farmN.staticflickr.com URLs still work - at least for the
time being.
Filesize (and most likely quality as well) for images from live.…  is
severely reduced compared to images from farmN.… for non-original files,
so all live URLs are replaced to point to a randomly chosen farm server.
2019-04-06 11:26:10 +02:00
Mike Fährmann
060859cc68
fix URL patterns
allow https:// as well as http://
2019-04-05 23:15:19 +02:00
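For illustration, a pattern that accepts both schemes typically makes the "s" optional. The host and path below are placeholders, not one of the patterns changed by this commit.

```python
import re

# 'https?' matches "http" as well as "https"; example.org is a placeholder
pattern = re.compile(r"https?://(?:www\.)?example\.org/post/(\d+)")

plain = pattern.match("http://example.org/post/12")
secure = pattern.match("https://www.example.org/post/12")
other = pattern.match("ftp://example.org/post/12")
```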
Mike Fährmann
13526f3624
[yaplog] fix archive_id and posts with more than 24 images
- 'post_id' and 'image_id' are only unique per user
- /image/ pages only show a maximum of 24 images, but there can be more
  images than that in a blog post
- let extraction run in its own thread and maybe improve speed
- #190
2019-04-05 23:15:03 +02:00
Mike Fährmann
2ff043edfa
[yaplog] add user- and post-extractors (#190) 2019-04-04 17:56:56 +02:00
Mike Fährmann
790f15a56f
[photobucket] use HTTPS 2019-04-03 18:30:45 +02:00
Mike Fährmann
6da665f32e
[mangoxo] add album- and channel-extractors (closes #184) 2019-04-03 07:55:51 +02:00
Mike Fährmann
21e80d60ff
[wikiart] docstring fixes 2019-04-03 07:28:10 +02:00
Mike Fährmann
c70b21248d
[wikiart] add extractors (#179)
for
- artists:          https://www.wikiart.org/en/thomas-cole
- artist-listings:  https://www.wikiart.org/en/artists-by-century/12
- artwork-listings: https://www.wikiart.org/en/paintings-by-media/grisaille
2019-04-02 17:34:57 +02:00
Mike Fährmann
0f02e85961
[reactor] use "/full/" URLs (closes #210)
Putting a "/full/" in image URLs potentially gives higher resolution
and better quality.
2019-03-30 22:14:57 +01:00
Mike Fährmann
17c11393f5
[weibo] allow user-ids in status URLs 2019-03-30 18:38:58 +01:00
Mike Fährmann
ec88ff1562
[flickr] relax unit test results
Images are now randomly served from the 'live.staticflickr.com' domain
instead of the "old" 'farmN.staticflickr.com' one, making it impossible
to use static 'url' and 'keyword' hashes as results.

Image quality doesn't appear to be affected by which image-server is
used. Files from 'farmN' and 'live' are the same.
2019-03-30 18:31:59 +01:00
Mike Fährmann
00d604cafb
[luscious] fix SearchExtractor URL-pattern 2019-03-29 15:58:08 +01:00
Mike Fährmann
1384ebf907
[luscious] fix metadata extraction
- remove 'artist', 'language', and 'lang' fields
- replace 'section' with 'genre'
- provide 'tags' as list
- use GalleryExtractor as base class
2019-03-29 13:06:02 +01:00
Mike Fährmann
5398bfbd69
[exhentai] fix search and favorite extraction
removes basically all metadata, but that can be compensated for with the
right search query. writing "parsers" for all 4 possible views that have
been introduced in the latest changes is too much of a hassle ...
2019-03-28 16:22:02 +01:00
Leonardo Taccari
790b1336a6
[instagram] Add support for hashtags
Add support for hashtags (TagPage-s), i.e. explore/tags/<tag> URLs.

This also introduces a get_metadata() method in order to append
possible further metadata per-(sub)extractor.

Refactor and generalize _extract_profilepage() to _extract_page()
in order to be reused by _extract_profilepage() and _extract_tagpage()
simply by passing the type of page (`ProfilePage' or `TagPage') and picking up
the respective fields in shared data.
2019-03-24 14:05:34 +01:00
Mike Fährmann
a9bdd0f153
[instagram] fix syntax for Python 3.4
Python 3.4 doesn't like '**common' in dict literals.
This also makes '_ytdl_index' zero-based.
2019-03-24 11:25:42 +01:00
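Dict unpacking inside a dict literal was added by PEP 448 in Python 3.5, so it is a syntax error on 3.4. A sketch of equivalent forms that also run on 3.4 follows; the field names are illustrative and not copied from the extractor.

```python
common = {"shortcode": "abc", "owner": "someone"}

# Python 3.5+ only (PEP 448):
#   media = {**common, "media_id": 1}

# Works on Python 3.4 as well: copy first, then add the extra key
media = common.copy()
media["media_id"] = 1

# The same in a single expression (extra keys must be valid identifiers)
media2 = dict(common, media_id=1)
```

Both variants leave `common` untouched, which matters when it is reused for several media entries.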
Mike Fährmann
eacebf41e4
fix typo in README 2019-03-24 11:03:02 +01:00
Leonardo Taccari
1e38f65996
[instagram] Add support for GraphSidecar media types (#201)
* [instagram] Add support for GraphSidecar media types

Refactor _extract_postpage() to always return a list of medias.

Fetch common keywords and gracefully handle GraphSidecar media type
by extracting each single media and adding `sidecar_media_id' and
`sidecar_shortcode' keywords to indicate the parent of sidecar
children.

While here join the copyright comment lines in a single one.

Closes #178.

* [instagram] Use `yield from' instead of `for ... yield' (thanks @mikf)!

* [instagram] Adjust filename for GraphSidecar medias

Add a possible leading `media_id' of the sidecar for GraphSidecar
media.

Thanks to @mikf for the suggestion!

* [instagram] Add extra metadata for youtube-dl in GraphSidecar children

GraphSidecar children's ytdl: URLs, when consumed by youtube-dl,
redirect to the URL of their parent.  In GraphSidecar-s with
multiple GraphVideo-s this leads to downloading the same video
multiple times.

Add a `_ytdl_index' field to indicate the index of the youtube-dl
playlist corresponding to the children of the sidecar.

This will be used by the `ytdl' downloader.
2019-03-24 11:02:32 +01:00
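The `yield from' change mentioned in the second bullet can be sketched as follows. The function names and the data layout are hypothetical stand-ins, not the extractor's real methods.

```python
def extract_media(post):
    """Stand-in for a per-post extraction that yields one or more media."""
    for media_id in post["media"]:
        yield media_id


def media_with_loop(posts):
    # Explicit form: re-yield every item of the inner generator
    for post in posts:
        for media in extract_media(post):
            yield media


def media_with_delegation(posts):
    # Equivalent form: delegate to the inner generator with 'yield from'
    for post in posts:
        yield from extract_media(post)


posts = [{"media": [1, 2]}, {"media": [3]}]
looped = list(media_with_loop(posts))
delegated = list(media_with_delegation(posts))
```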
Mike Fährmann
6ba67b0537
[hypnohub] add extractors (closes #196) 2019-03-23 09:50:39 +01:00
Mike Fährmann
fe27154a10
[komikcast] fix extraction
... again
2019-03-23 09:50:39 +01:00
Mike Fährmann
5ec55ec4fc
[deviantart] improve URLs for non-downloadable deviations 2019-03-21 15:37:22 +01:00
Mike Fährmann
c7a6b0ed90
[deviantart] add 'metadata' option (#189) 2019-03-21 14:49:42 +01:00
Mike Fährmann
8d96a8ce4c
[500px] add user-, gallery-, and image-extractors (#185) 2019-03-20 17:32:36 +01:00
Mike Fährmann
d0f88c35be
[komikcast] fix extraction 2019-03-18 11:12:19 +01:00
Mike Fährmann
6277a739e4
[35photo] add user-, genre-, and image-extractors (#162) 2019-03-18 01:11:30 +01:00
Mike Fährmann
fb14f80d62
[tumblr] fix avatar URLs for non-OAuth1.0 calls (closes #193) 2019-03-17 11:07:22 +01:00
Mike Fährmann
973a720a7a
[weibo] fix unit test URL patterns 2019-03-15 15:19:39 +01:00
Mike Fährmann
a2af2d2965
adjust cache maxage values 2019-03-14 22:21:49 +01:00
Mike Fährmann
f612284d24
cache cfclearance cookies 2019-03-14 16:14:29 +01:00
Mike Fährmann
591a07f20c
small code changes and cleanups 2019-03-13 22:03:02 +01:00
Mike Fährmann
6f57d44ec2
[seaotterscans] remove extractor
http://seaotterscans.com/ now redirects to their MangaDex profile
2019-03-13 22:02:45 +01:00
Mike Fährmann
6dae6bee37
automatically detect and bypass cloudflare challenge pages
TODO: cache and re-apply cfclearance cookies
2019-03-10 15:31:33 +01:00
Mike Fährmann
25aaf55514
[smugmug] improve format selection (closes #183)
- use original image if available
- support video formats
- remove user info for ImageExtractor (it is no longer possible to get
  image owner information for a single image)
2019-03-10 15:20:35 +01:00
Mike Fährmann
7c1cb923a4
[myportfolio] replace unit test
the old gallery got removed
2019-03-10 15:06:16 +01:00
Mike Fährmann
fffbfd3dce
[imgspice] fix extraction 2019-03-09 20:29:23 +01:00
Mike Fährmann
4ca4631bad
simplify auto-disabling certificate verification
if no certificate bundle is found
2019-03-08 16:34:01 +01:00
Mike Fährmann
09d872a2b1
generalize extractor creation code 2019-03-07 22:55:26 +01:00
Mike Fährmann
8dc6be246b
[shopify] add custom retry logic for 430 status codes (#175) 2019-03-07 15:31:15 +01:00