1522 Commits

Author SHA1 Message Date
Mike Fährmann
e730fc9045
[twitter] add login support (#214) 2019-04-09 09:27:49 +02:00
Mike Fährmann
2c32dc76cb
[yaplog] update metadata structure (#190)
Put all blog-post-related fields in their own dict.

'image_id' -> 'id'
'post_id'  -> 'post[id]'
'title'    -> 'post[title]'
etc ...
2019-04-06 16:40:07 +02:00
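
The renaming described in the commit above can be pictured as turning a flat metadata dict into a nested one. This is only a sketch: the function name `restructure` and the sample values are hypothetical, and the extractor's real code is not shown in this log.

```python
def restructure(flat):
    """Group post-related fields of a flat metadata dict under 'post'."""
    return {
        "id": flat["image_id"],          # 'image_id' -> 'id'
        "post": {
            "id": flat["post_id"],       # 'post_id'  -> 'post[id]'
            "title": flat["title"],      # 'title'    -> 'post[title]'
        },
    }


old = {"image_id": 12, "post_id": 34, "title": "example"}
new = restructure(old)

# Nested fields can be addressed with str.format's index syntax.
name = "{post[id]}_{id}".format(**new)
```

The `post[id]` notation in the commit message matches how `str.format` reaches into a nested dict.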
Mike Fährmann
35919a9bb8
[livedoor] add blog- and post-extractors (#190) 2019-04-06 16:27:48 +02:00
Mike Fährmann
3f513f1056
[flickr] restore image quality
Flickr started serving images from live.staticflickr.com (see ec88ff1),
but the old farmN.staticflickr.com URLs still work - at least for the
time being.
File size (and most likely quality as well) for images from live.… is
severely reduced compared to images from farmN.… for non-original files,
so all live URLs are replaced to point to a randomly chosen farm server.
2019-04-06 11:26:10 +02:00
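
A rough sketch of the host replacement described above, assuming the farm number is picked uniformly from a fixed range. The range `FARM_IDS` and the function name `to_farm_url` are illustrative assumptions, not values taken from the commit.

```python
import random
import re

# Assumption: an illustrative set of farm numbers, not taken from the commit.
FARM_IDS = range(1, 10)


def to_farm_url(url, rng=random):
    """Point a live.staticflickr.com URL at a randomly chosen farm server."""
    farm = rng.choice(FARM_IDS)
    return re.sub(
        r"//live\.staticflickr\.com/",
        "//farm{}.staticflickr.com/".format(farm),
        url,
        count=1,
    )
```

URLs that do not use the live host are returned unchanged.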
Mike Fährmann
060859cc68
fix URL patterns
allow https:// as well as http://
2019-04-05 23:15:19 +02:00
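
For illustration, accepting both schemes in a URL pattern usually comes down to an optional `s`. The host `example.org` and the path layout below are placeholders, not one of the patterns touched by the commit.

```python
import re

# Hypothetical patterns; 'example.org' and the path layout are placeholders.
OLD = re.compile(r"http://example\.org/gallery/(\d+)")
NEW = re.compile(r"https?://example\.org/gallery/(\d+)")

https_url = "https://example.org/gallery/7"
http_url = "http://example.org/gallery/7"

old_matches_https = OLD.match(https_url) is not None
new_matches_both = all(NEW.match(u) for u in (https_url, http_url))
```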
Mike Fährmann
13526f3624
[yaplog] fix archive_id and posts with more than 24 images
- 'post_id' and 'image_id' are only unique per user
- /image/ pages only show a maximum of 24 images, but there can be more
  images than that in a blog post
- let extraction run in its own thread and maybe improve speed
- #190
2019-04-05 23:15:03 +02:00
Mike Fährmann
2ff043edfa
[yaplog] add user- and post-extractors (#190) 2019-04-04 17:56:56 +02:00
Mike Fährmann
790f15a56f
[photobucket] use HTTPS 2019-04-03 18:30:45 +02:00
Mike Fährmann
6da665f32e
[mangoxo] add album- and channel-extractors (closes #184) 2019-04-03 07:55:51 +02:00
Mike Fährmann
21e80d60ff
[wikiart] docstring fixes 2019-04-03 07:28:10 +02:00
Mike Fährmann
c70b21248d
[wikiart] add extractors (#179)
for
- artists:          https://www.wikiart.org/en/thomas-cole
- artist-listings:  https://www.wikiart.org/en/artists-by-century/12
- artwork-listings: https://www.wikiart.org/en/paintings-by-media/grisaille
2019-04-02 17:34:57 +02:00
Mike Fährmann
9ebd29fcc1
update cloudflare bypass (wip)
This commit adds support for the two new JS expressions embedded in the
overall challenge code.

It does compute the correct 'js_answer' value, but the HTTP request to
/cdn-cgi/l/chk_jschl to get the 'cf_clearance' cookie always results in
a 403 response with a CAPTCHA inside (hence 'wip')

All steps to make this HTTP request indistinguishable from a regular web
browser (which passes the test) show no effect. This includes:
- using the exact same HTTP headers as a web browser
- following the same query argument order
- different wait times
2019-04-01 15:14:59 +02:00
Mike Fährmann
0f02e85961
[reactor] use "/full/" URLs (closes #210)
Putting a "/full/" in image URLs potentially gives higher resolution
and better quality.
2019-03-30 22:14:57 +01:00
Mike Fährmann
17c11393f5
[weibo] allow user-ids in status URLs 2019-03-30 18:38:58 +01:00
Mike Fährmann
ec88ff1562
[flickr] relax unit test results
Images are now randomly served from the 'live.staticflickr.com' domain
instead of the "old" 'farmN.staticflickr.com' one, making it impossible
to use static 'url' and 'keyword' hashes as results.

Image quality doesn't appear to be affected by which image-server is
used. Files from 'farmN' and 'live' are the same.
2019-03-30 18:31:59 +01:00
Mike Fährmann
bc2020e86c
release version 1.8.1 2019-03-29 17:37:11 +01:00
Mike Fährmann
00d604cafb
[luscious] fix SearchExtractor URL-pattern 2019-03-29 15:58:08 +01:00
Mike Fährmann
1384ebf907
[luscious] fix metadata extraction
- remove 'artist', 'language', and 'lang' fields
- replace 'section' with 'genre'
- provide 'tags' as list
- use GalleryExtractor as base class
2019-03-29 13:06:02 +01:00
Mike Fährmann
5398bfbd69
[exhentai] fix search and favorite extraction
removes basically all metadata, but that can be compensated for with the
right search query. writing "parsers" for all 4 possible views that have
been introduced in the latest changes is too much of a hassle ...
2019-03-28 16:22:02 +01:00
Mike Fährmann
5476404a5c
update and fix Cloudflare bypass 2019-03-25 22:53:36 +01:00
Leonardo Taccari
790b1336a6
[instagram] Add support for hashtags
Add support for hashtags (TagPage-s), i.e. explore/tags/<tag> URLs.

This also introduces a get_metadata() method in order to append
possible further metadata per-(sub)extractor.

Refactor and generalize _extract_profilepage() to _extract_page()
in order to be reused by _extract_profilepage() and _extract_tagpage()
simply by passing the type of page (`ProfilePage' or `TagPage') and picking up
the respective fields in shared data.
2019-03-24 14:05:34 +01:00
Mike Fährmann
114b8eecc5
[downloader;ytdl] utilize '_ytdl_index' metadata fields 2019-03-24 11:27:20 +01:00
Mike Fährmann
a9bdd0f153
[instagram] fix syntax for Python 3.4
Python 3.4 doesn't like '**common' in dict literals.
This also makes '_ytdl_index' zero-based.
2019-03-24 11:25:42 +01:00
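
Dict unpacking inside a dict literal (`{**common, ...}`) only became valid syntax in Python 3.5, so code that must run on 3.4 has to copy and update instead. A minimal sketch, assuming a hypothetical helper `build_children` and made-up field values; only `_ytdl_index` is named in the log.

```python
def build_children(common, urls):
    """Return one metadata dict per URL, each extending 'common'."""
    result = []
    for index, url in enumerate(urls):  # enumerate() starts at 0
        # Python >= 3.5 could write: {**common, "url": url, "_ytdl_index": index}
        media = dict(common)            # shallow copy, works on Python 3.4
        media["url"] = url
        media["_ytdl_index"] = index
        result.append(media)
    return result


children = build_children({"owner": "someone"}, ["ytdl:a", "ytdl:b"])
```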
Mike Fährmann
eacebf41e4
fix typo in README 2019-03-24 11:03:02 +01:00
Leonardo Taccari
1e38f65996
[instagram] Add support for GraphSidecar media types (#201)
* [instagram] Add support for GraphSidecar media types

Refactor _extract_postpage() to always return a list of media.

Fetch common keywords and gracefully handle GraphSidecar media type
by extracting each single media and adding `sidecar_media_id' and
`sidecar_shortcode' keywords to indicate the parent of sidecar
children.

While here join the copyright comment lines in a single one.

Closes #178.

* [instagram] Use `yield from' instead of `for ... yield' (thanks @mikf)!

* [instagram] Adjust filename for GraphSidecar media

Add a possible leading `media_id' of the sidecar for GraphSidecar
media.

Thanks to @mikf for the suggestion!

* [instagram] Add extra metadata for youtube-dl in GraphSidecar children

When consumed by youtube-dl, the ytdl: URLs of GraphSidecar children
redirect to the URL of their parent.  In GraphSidecar-s with
multiple GraphVideo-s this leads to downloading the same video
multiple times.

Add a `_ytdl_index' field to indicate the index of the youtube-dl
playlist corresponding to the children of the sidecar.

This will be used by the `ytdl' downloader.
2019-03-24 11:02:32 +01:00
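
The `yield from` change mentioned in the commit above replaces an inner loop that re-yields every item of a sub-generator. A small sketch with hypothetical names (`expand`, `with_loop`, `with_yield_from`); the extractor's own methods are not reproduced here.

```python
def expand(node):
    """Hypothetical sub-generator: yield each media entry of one post."""
    for media in node["media"]:
        yield media


def with_loop(nodes):
    # 'for ... yield' form: forward every item by hand.
    for node in nodes:
        for media in expand(node):
            yield media


def with_yield_from(nodes):
    # 'yield from' form: delegate to the sub-generator directly.
    for node in nodes:
        yield from expand(node)


nodes = [{"media": ["a", "b"]}, {"media": ["c"]}]
```

Both generators produce the same items in the same order; the second form is simply shorter.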
Mike Fährmann
e7d0d98c88
improve FFmpeg arguments for --ugoira-conv 2019-03-23 09:50:39 +01:00
Mike Fährmann
6ba67b0537
[hypnohub] add extractors (closes #196) 2019-03-23 09:50:39 +01:00
Mike Fährmann
fe27154a10
[komikcast] fix extraction
... again
2019-03-23 09:50:39 +01:00
Mike Fährmann
5ec55ec4fc
[deviantart] improve URLs for non-downloadable deviations 2019-03-21 15:37:22 +01:00
Mike Fährmann
c7a6b0ed90
[deviantart] add 'metadata' option (#189) 2019-03-21 14:49:42 +01:00
Mike Fährmann
8d96a8ce4c
[500px] add user-, gallery-, and image-extractors (#185) 2019-03-20 17:32:36 +01:00
Mike Fährmann
d0f88c35be
[komikcast] fix extraction 2019-03-18 11:12:19 +01:00
Mike Fährmann
6277a739e4
[35photo] add user-, genre-, and image-extractors (#162) 2019-03-18 01:11:30 +01:00
Mike Fährmann
fb14f80d62
[tumblr] fix avatar URLs for non-OAuth1.0 calls (closes #193) 2019-03-17 11:07:22 +01:00
Mike Fährmann
8c20443839
release version 1.8.0 2019-03-15 15:27:11 +01:00
Mike Fährmann
973a720a7a
[weibo] fix unit test URL patterns 2019-03-15 15:19:39 +01:00
Mike Fährmann
a2af2d2965
adjust cache maxage values 2019-03-14 22:21:49 +01:00
Mike Fährmann
f612284d24
cache cfclearance cookies 2019-03-14 16:14:29 +01:00
Mike Fährmann
34ea0d6a10
rewrite cache module
less complexity, better performance,
but some duplicate code here and there
2019-03-14 15:55:48 +01:00
Mike Fährmann
12482553bd
update links to youtube-dl 2019-03-13 22:03:02 +01:00
Mike Fährmann
591a07f20c
small code changes and cleanups 2019-03-13 22:03:02 +01:00
Mike Fährmann
6f57d44ec2
[seaotterscans] remove extractor
http://seaotterscans.com/ now redirects to their MangaDex profile
2019-03-13 22:02:45 +01:00
Mike Fährmann
6dae6bee37
automatically detect and bypass cloudflare challenge pages
TODO: cache and re-apply cfclearance cookies
2019-03-10 15:31:33 +01:00
Mike Fährmann
25aaf55514
[smugmug] improve format selection (closes #183)
- use original image if available
- support video formats
- remove user info for ImageExtractor (it is no longer possible to get
  image owner information for a single image)
2019-03-10 15:20:35 +01:00
Mike Fährmann
7c1cb923a4
[myportfolio] replace unit test
the old gallery got removed
2019-03-10 15:06:16 +01:00
Mike Fährmann
fffbfd3dce
[imgspice] fix extraction 2019-03-09 20:29:23 +01:00
Mike Fährmann
4ca4631bad
simplify auto-disabling certificate verification
if no certificate bundle is found
2019-03-08 16:34:01 +01:00
Mike Fährmann
09d872a2b1
generalize extractor creation code 2019-03-07 22:55:26 +01:00
Mike Fährmann
8dc6be246b
[shopify] add custom retry logic for 430 status codes (#175) 2019-03-07 15:31:15 +01:00
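
Retrying on a specific status code can be sketched as a loop around the request. Everything below except the 430 status is an assumption: the helper name, the retry count, the delay, and the `fetch` callable are placeholders, and the extractor's actual logic is not shown in this log.

```python
import time


def request_with_retry(fetch, url, retries=4, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a status other than 430.

    'fetch' is any callable returning a (status, body) tuple; 'retries'
    and 'delay' are illustrative defaults.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status != 430:
            return status, body
        if attempt < retries:
            sleep(delay)  # wait before asking again
    return status, body  # still 430 after all retries
```

Passing `fetch` and `sleep` as parameters keeps the sketch independent of any HTTP library and easy to exercise without network access.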
Mike Fährmann
0887fb61f4
[komikcast] update test results 2019-03-07 14:55:52 +01:00