115 Commits

Author SHA1 Message Date
Mike Fährmann
3201fe3521
add global SENTINEL object 2020-05-19 22:32:53 +02:00
Mike Fährmann
c8787647ed
add global WINDOWS bool 2020-05-19 22:32:53 +02:00
Mike Fährmann
ece73b5b2a
make 'path' and 'keywords' available in logging messages
Wrap all loggers used by job, extractor, downloader, and postprocessor
objects into a (custom) LoggerAdapter that provides access to the
underlying job, extractor, pathfmt, and kwdict objects and their
properties.

__init__() signatures for all downloader and postprocessor classes have
been changed to take the current Job object as their first argument,
instead of the current extractor or pathfmt.

(#574, #575)
2020-05-18 19:04:51 +02:00
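A minimal sketch of the LoggerAdapter idea described in the commit above, using only the standard `logging` module. `FakeJob`, `JobLoggerAdapter`, and `ListHandler` are hypothetical names for illustration, not the project's actual classes.

```python
import logging


class FakeJob:
    """Hypothetical stand-in for a job object with a 'path' attribute."""

    def __init__(self, path):
        self.path = path


class JobLoggerAdapter(logging.LoggerAdapter):
    """Attach the current job to every log record as 'record.job'."""

    def __init__(self, logger, job):
        super().__init__(logger, {"job": job})

    def process(self, msg, kwargs):
        # Merge 'job' into any 'extra' dict the caller may have passed
        kwargs.setdefault("extra", {})["job"] = self.extra["job"]
        return msg, kwargs


class ListHandler(logging.Handler):
    """Collect formatted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
# '{job.path}' works because str.format() supports attribute access
handler.setFormatter(logging.Formatter("{job.path}: {message}", style="{"))

logger = logging.getLogger("adapter-sketch")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

log = JobLoggerAdapter(logger, FakeJob("/tmp/a.jpg"))
log.info("download complete")
```

With this setup, the logging format string can reach into the attached object, which is the effect the commit message describes for 'path' and 'keywords'.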
Mike Fährmann
abbd8fbbd9
reset filenames on empty file extensions (#733) 2020-05-18 19:04:50 +02:00
Mike Fährmann
38bc6430d3
[downloader:http] don't overwrite existing '_mtime' fields 2020-04-10 23:08:03 +02:00
Mike Fährmann
9159cb8fb3
remove trailing dots and spaces from directory names (#647) 2020-03-19 21:12:18 +01:00
Mike Fährmann
90e4c645ba
[formatter] allow multiple "special" format specifiers (#595)
It is now, for example, possible to specify multiple replacement
operations per format replacement field: {name:Ra/b/Rc/d/}
2020-02-16 21:47:08 +01:00
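As an illustration of chained replacement operations, the following sketch applies every leading `R<old>/<new>/` operation in a specifier such as `Ra/b/Rc/d/`. The function name `apply_replacements` is hypothetical, and this is not necessarily how the formatter is implemented.

```python
def apply_replacements(value, spec):
    """Apply each 'R<old>/<new>/' operation at the start of 'spec' in order."""
    while spec.startswith("R"):
        # 'Ra/b/Rc/d/' -> old='a', new='b', remaining spec='Rc/d/'
        old, new, spec = spec[1:].split("/", 2)
        value = value.replace(old, new)
    return value


result = apply_replacements("abcd", "Ra/b/Rc/d/")
```

A malformed specifier (missing a `/`) raises ValueError in this sketch; a real formatter may handle that case differently.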
Mike Fährmann
219c4cc78c
[formatter] allow for numeric list and string indices 2020-02-15 22:46:22 +01:00
Mike Fährmann
7d1da614d9
[formatter] implement field name alternatives (#525)
The format string '{a|b|c}' will now try to use the value from 'a' and
fall back to 'b' and 'c' if accessing a field raises an exception or
if its value is None.
2020-02-15 17:58:21 +01:00
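The fallback behavior of '{a|b|c}' can be sketched roughly as follows. The helper name `first_available` is hypothetical, and returning None when every alternative fails is an assumption made for this example.

```python
def first_available(kwdict, field_spec):
    """Return the first field value that exists and is not None."""
    for name in field_spec.split("|"):
        try:
            value = kwdict[name]
        except Exception:
            # Accessing the field failed: try the next alternative
            continue
        if value is not None:
            return value
    return None  # assumption: nothing usable was found


result = first_available({"a": None, "c": "fallback"}, "a|b|c")
```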
Mike Fährmann
56f1c96168
implement 'parent-directory' option (#551) 2020-01-29 18:32:37 +01:00
Mike Fährmann
2a9be48511
improve util.load/save_cookiestxt() and add tests
- take a file object as argument instead of a filename
- accept whitespace before comments ("   # comment")
- map expiration "0" to None and not the number 0
2020-01-25 23:02:15 +01:00
Mike Fährmann
c1a6862863
implement functions to load/save cookies.txt files (closes #586)
The methods of the standard library's MozillaCookieJar have
several shortcomings (#HttpOnly_ cookies, 0 expiration timestamps, etc.)
and require construction of an ultimately pointless CookieJar object.
2020-01-21 21:59:36 +01:00
Mike Fährmann
760b9b4db4
add remove_file() and remove_directory() helpers
these functions call os.unlink() or os.rmdir()
while catching and suppressing potential OSErrors
2020-01-18 00:21:26 +01:00
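A sketch of what such helpers might look like, based only on the description above (os.unlink() / os.rmdir() with OSErrors suppressed). The actual function bodies in the repository may differ.

```python
import os
import tempfile


def remove_file(path):
    """Delete a file, ignoring errors such as a missing file."""
    try:
        os.unlink(path)
    except OSError:
        pass


def remove_directory(path):
    """Delete an empty directory, ignoring errors such as non-empty dirs."""
    try:
        os.rmdir(path)
    except OSError:
        pass


tmpdir = tempfile.mkdtemp()
tmpfile = os.path.join(tmpdir, "file.txt")
open(tmpfile, "w").close()

remove_file(tmpfile)
remove_file(tmpfile)      # second call: file is gone, error is suppressed
remove_directory(tmpdir)
```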
Mike Fährmann
b2d542ad40
improve PathFormat._enum_file()
open only one try-except block for the whole loop,
instead of one for each iteration in os.path.exists()
2020-01-18 00:21:25 +01:00
Mike Fährmann
025f6e3398
add fallback for missing WITHOUT ROWID support (#553) 2020-01-03 22:58:28 +01:00
Mike Fährmann
58391d492d
cache archive keys generated in __contains__() (#524)
This avoids writing a different key to the archive than the one that
was checked before the file download.
2019-12-20 16:43:08 +01:00
Mike Fährmann
0f1538af78
split filename formatting into its own function 2019-11-29 22:32:07 +01:00
Mike Fährmann
3fc1e12949
[postprocessor:metadata] filter private entries
i.e. keys starting with an underscore
2019-11-21 16:58:44 +01:00
Mike Fährmann
d5e3910270
adjust 'util.raises()' 2019-10-28 15:06:17 +01:00
Mike Fährmann
c887493a80
overhaul exception stuff 2019-10-27 23:53:37 +01:00
Mike Fährmann
776e9e073f
close archive on job completion (#417) 2019-09-10 22:43:51 +02:00
Mike Fährmann
0ce98169b8
improve path generation
- fix 'abspath()' results for Python <3.7 (closes #402)
  - 'abspath()' in Python 3.7+ removes trailing path separators
  - in Python <3.7 it doesn't
- filter empty path segments
2019-08-28 23:25:18 +02:00
Mike Fährmann
3284c62f22
ensure PathFormat.directory ends with a path separator
... plus some other small optimizations
2019-08-20 00:25:13 +02:00
Mike Fährmann
e77a656437
optimize directory path generation
- use str.join() instead of os.path.join()
  (less "features", but 10x as fast)
- cache directory formatters
- detect and optimize field access for 1-element format strings
2019-08-19 15:56:20 +02:00
Mike Fährmann
454bf1ebf9
preserve enumeration index after 'set_extension()' (#306) 2019-08-16 23:12:33 +02:00
Mike Fährmann
f5039b897f
replace DownloadArchive.check() with __contains__()
Interestingly enough, 'a in obj' is slightly faster than
'obj.check(a)' and is also nicer to look at, I think.
2019-08-16 23:12:32 +02:00
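To illustrate the change from a check() method to the `in` operator, here is a simplified, in-memory stand-in. `KeyArchive` is a hypothetical class backed by a set; it is not the DownloadArchive implementation.

```python
class KeyArchive:
    """Hypothetical in-memory archive that supports 'key in archive'."""

    def __init__(self):
        self._keys = set()

    def __contains__(self, key):
        # Called by the 'in' operator
        return key in self._keys

    def add(self, key):
        self._keys.add(key)


archive = KeyArchive()
archive.add("site12345")
```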
Mike Fährmann
5a210991b6
Remove control characters from filesystem paths
- add 'path-remove' option to specify the set of characters that
  should be removed
- rename 'restrict-filenames' to 'path-restrict'
- #348, #380
2019-08-16 23:12:16 +02:00
Mike Fährmann
0bb873757a
update PathFormat class
- change 'has_extension' from a simple flag/bool to a field that
  contains the original filename extension
- rename 'keywords' to 'kwdict' and some other stuff as well
- inline 'adjust_path()'
- put enumeration index before filename extension (#306)
2019-08-12 21:40:37 +02:00
Mike Fährmann
8dc42bb178
implement 'enumerate' for 'extractor.skip' (#306)
[ci skip]
2019-08-08 18:37:54 +02:00
Mike Fährmann
b1bea8aaeb
add 'restrict-filenames' option (#348) 2019-07-23 17:41:24 +02:00
Mike Fährmann
7b77ecc35a
fix paths for files without extension (#220) 2019-07-15 16:39:03 +02:00
Mike Fährmann
16c582aaf9
implement 'mtime' post-processor (#332)
This can set a file's modification time according to a UNIX timestamp
or a datetime object from its metadata.
2019-07-14 22:39:17 +02:00
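Setting a modification time from either kind of value could look roughly like this. `set_mtime` is a hypothetical helper for illustration; note that `datetime.timestamp()` interprets naive datetime objects as local time.

```python
import os
import tempfile
from datetime import datetime, timezone


def set_mtime(path, mtime):
    """Set a file's modification time from a UNIX timestamp or datetime."""
    if isinstance(mtime, datetime):
        mtime = mtime.timestamp()
    # os.utime() takes (access time, modification time)
    os.utime(path, (mtime, mtime))


fd, path = tempfile.mkstemp()
os.close(fd)
set_mtime(path, datetime(2020, 1, 1, tzinfo=timezone.utc))
result = int(os.path.getmtime(path))
os.unlink(path)
```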
Mike Fährmann
40da44b17f
Merge branch 'v1.9.0' 2019-06-29 15:39:52 +02:00
Mike Fährmann
95b1e4c3c0
implement R<old>/<new>/ format option (#318) 2019-06-23 22:45:44 +02:00
Mike Fährmann
f4ba98771d
use Last-Modified header to set file modification time
(#236, #277)
2019-06-19 23:16:32 +02:00
Mike Fährmann
523ebc9b0b
Fix serialization of 'datetime' objects in '--write-metadata'
Simplified universal serialization support in json.dump() can be achieved
by passing 'default=str', which was already the case in DataJob.run()
for -j/--dump-json, but not for the 'metadata' post-processor.

This commit introduces util.dump_json() that (more or less) unifies the
JSON output procedure of both --write-metadata and --dump-json.

(#251, #252)
2019-05-09 16:49:22 +02:00
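The effect of 'default=str' can be seen in a small standalone example using only the standard library; the sample data is made up for illustration.

```python
import json
from datetime import datetime

data = {"id": 1, "date": datetime(2019, 5, 9, 16, 49, 22)}

# Without default=str, json.dumps() raises TypeError for datetime objects.
# With it, any unsupported value is passed through str() first.
text = json.dumps(data, default=str)
```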
Mike Fährmann
23baecb29e
fix 'CONVERSIONS' variable name 2019-03-05 22:50:56 +01:00
Mike Fährmann
105097ddcf
add 'S' conversion options for format string fields
Same as 's' (convert to string), but has a better, human-readable
conversion for lists.
2019-03-04 21:13:34 +01:00
Mike Fährmann
148b8f15d0
update tests for util.py 2019-02-14 11:15:19 +01:00
Mike Fährmann
ae353ed3b0
provide "extractor" and "job" keys for logging output
This allows for stuff like "{extractor.url}" and "{extractor.category}"
in logging format strings.
Accessing 'extractor' and 'job' in any way will return "None" if those
fields aren't defined, i.e. in general logging messages.
2019-02-14 11:09:58 +01:00
Mike Fährmann
79c01ec7ae
implement J<separator>/ format option
J joins list elements by calling <separator>.join(list):

Example:
{f:J - /} -> "a - b - c" (if "f" is ["a", "b", "c"])
2019-01-17 17:01:58 +01:00
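A rough sketch of how a `J<separator>/` specifier could be interpreted. The function name `apply_join` and the parsing approach are assumptions for this example.

```python
def apply_join(value, spec):
    """Join list elements using the separator from a 'J<separator>/' spec."""
    # 'J - /' -> separator ' - ' (everything between 'J' and the first '/')
    separator, _, _rest = spec[1:].partition("/")
    return separator.join(value)


result = apply_join(["a", "b", "c"], "J - /")
```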
Mike Fährmann
c5d4f558c9
allow missing field access keys in format strings (#136) 2018-12-22 13:54:14 +01:00
Mike Fährmann
d3d7f01543
add 'prepare()' step for post-processors
This allows post-processors to modify the destination path before
checking if a file already exists.
2018-10-18 22:32:03 +02:00
Mike Fährmann
6ed629f2b6
allow specifying number of skips before abort/exit (closes #115)
In addition to 'abort' and 'exit', it is now possible to specify
'abort:N' and 'exit:N' (where N is any integer) as value for 'skip'
to abort/exit after consecutively skipping N downloads.
2018-10-13 17:21:55 +02:00
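One way to model the 'abort:N' / 'exit:N' values is sketched below. `parse_skip` and `SkipCounter` are hypothetical names, and treating a plain 'abort' as N=1 is an assumption of this example.

```python
def parse_skip(value):
    """Split 'abort:3' into ('abort', 3); plain 'abort' is treated as N=1."""
    action, _, count = value.partition(":")
    return action, int(count) if count else 1


class SkipCounter:
    """Track consecutive skips and report when the limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def skipped(self):
        self.count += 1
        return self.count >= self.limit

    def downloaded(self):
        # A successful download interrupts the run of consecutive skips
        self.count = 0


action, limit = parse_skip("abort:3")
counter = SkipCounter(limit)
events = [counter.skipped(), counter.skipped()]
counter.downloaded()
events += [counter.skipped(), counter.skipped(), counter.skipped()]
```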
Mike Fährmann
48a8717a7c
add 'output.num-to-str' option
... to convert any numeric values to strings when outputting them as JSON
(during '--dump-json' or otherwise)
2018-10-08 20:28:54 +02:00
Mike Fährmann
0514d6a0ae
make --filter and --range config-file options
The functionality of --(chapter-)filter and --(chapter-)range are now
also exposed as the following config-file options:

- extractor.*.image-filter
- extractor.*.image-range
- extractor.*.chapter-filter
- extractor.*.chapter-range

TODO: update configuration.rst
2018-10-07 21:39:56 +02:00
Mike Fährmann
590c0b3ad5
re-implement and improve filename formatter
A format string now gets parsed only once instead of re-parsing it each
time it is applied to a set of data.

The initial parsing causes directory path creation to be about 2x
slower than before, since each format string there is used only once,
but building a filename, the more common operation, is at least 2x
faster. The "directory slowness" is offset after about 5 filenames, and
everything above that is significantly faster.
2018-08-25 10:45:14 +02:00
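The parse-once idea can be illustrated with the standard library's `string.Formatter().parse()`. This is a simplified sketch that ignores format specifiers and conversions; `compile_format` is a hypothetical name, not the project's formatter.

```python
import string


def compile_format(format_string):
    """Parse a format string once and return a function that applies it."""
    # Each tuple is (literal_text, field_name, format_spec, conversion)
    parts = [
        (literal, field)
        for literal, field, _spec, _conv in string.Formatter().parse(format_string)
    ]

    def apply(kwdict):
        out = []
        for literal, field in parts:
            out.append(literal)
            if field is not None:
                out.append(str(kwdict[field]))
        return "".join(out)

    return apply


fmt = compile_format("{category}_{id}.{ext}")
names = [fmt({"category": "x", "id": n, "ext": "jpg"}) for n in (1, 2)]
```

The parsing cost is paid once in `compile_format()`; each later call to `fmt()` only walks the prepared list, which is why reuse across many filenames pays off.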
Mike Fährmann
c83fc62abc
prioritize archive over disk access (#87) 2018-07-30 17:48:23 +02:00
Mike Fährmann
e0dd8dff5f
implement L<maxlen>/<replacement>/ format option
The L option allows for the contents of a format field to be replaced
with <replacement> if its length is greater than <maxlen>.

Example:
{f:L5/too long/} -> "foo"      (if "f" is "foo")
                 -> "too long" (if "f" is "foobar")

(#92) (#94)
2018-07-29 13:52:07 +02:00
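The example above can be reproduced with a short sketch. `apply_maxlen` is a hypothetical helper, and the way the specifier is split here is an assumption.

```python
def apply_maxlen(value, spec):
    """Return the replacement if 'value' is longer than <maxlen>."""
    # 'L5/too long/' -> maxlen='5', replacement='too long'
    maxlen, replacement, _rest = spec[1:].split("/", 2)
    return replacement if len(value) > int(maxlen) else value


short = apply_maxlen("foo", "L5/too long/")
long_ = apply_maxlen("foobar", "L5/too long/")
```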
Mike Fährmann
8fe9056b16
implement string slicing for format strings
It is now possible to slice string (or list) values of format string
replacement fields with the same syntax as in regular Python code.

"{digits}"       -> "0123456789"
"{digits[2:-2]}" -> "234567"
"{digits[:5]}"   -> "01234"

The optional third parameter (step) has been left out to simplify things.
2018-07-14 09:53:15 +02:00
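A minimal sketch of two-part slicing without a step, matching the examples above. `apply_slice` is a hypothetical name, and the input is the text between the brackets, such as "2:-2".

```python
def apply_slice(value, spec):
    """Slice a string or list using a '<start>:<stop>' specification."""
    start, _, stop = spec.partition(":")
    return value[
        (int(start) if start else None):(int(stop) if stop else None)
    ]


digits = "0123456789"
middle = apply_slice(digits, "2:-2")
head = apply_slice(digits, ":5")
```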