50 Commits

Author SHA1 Message Date
Mike Fährmann
bddcec49f1
implement 'text.root_from_url()'
use domain from input URL for kemono
2022-03-01 03:09:57 +01:00
Mike Fährmann
bc0e853d30
combine KeyError & IndexError to common base class LookupError 2022-02-11 00:42:49 +01:00
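A minimal sketch of the idea behind this commit, not the project's actual code: KeyError and IndexError both derive from the built-in LookupError, so one except clause handles a missing key and an out-of-range index alike. The helper name first_tag and the "tags" key are hypothetical.

```python
def first_tag(data):
    """Return data["tags"][0], or None if the key or the index is missing."""
    try:
        return data["tags"][0]
    except LookupError:
        # LookupError is the common base class of KeyError and IndexError,
        # so this single clause replaces 'except (KeyError, IndexError)'.
        return None


if __name__ == "__main__":
    print(first_tag({"tags": ["art"]}))
    print(first_tag({}))            # missing key
    print(first_tag({"tags": []}))  # empty list
```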
Mike Fährmann
bc868e7bb8
consider apparently long extensions as part of the filename
(#1516)
2021-05-02 21:15:50 +02:00
Mike Fährmann
387fe415d5
unescape items in text.split_html() 2021-03-29 02:12:29 +02:00
Mike Fährmann
78fd63b8f0
remove 'text.clean_xml()'
was not used anywhere
2021-03-28 04:05:16 +02:00
Mike Fährmann
8553b218d9
replace calls to 'os.path.splitext()' with 'str.rpartition()'
Makes the functions that used it more than twice as fast
and we can get rid of an import as well.
2021-03-28 04:01:27 +02:00
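The replacement described in this commit can be sketched as follows. This is an illustration under assumptions, not the code from the repository: split_ext is a hypothetical helper, and its handling of names without a dot is a choice made for this example.

```python
import os.path


def split_ext(filename):
    """Split 'filename' at its last '.' into (name, extension).

    Unlike os.path.splitext(), the extension is returned without the
    leading dot, and a leading dot (".bashrc") is treated like any other.
    """
    name, dot, ext = filename.rpartition(".")
    if not dot:
        # No '.' at all: rpartition() puts the whole string in 'ext'.
        return filename, ""
    return name, ext


if __name__ == "__main__":
    print(split_ext("photo.final.jpg"))
    print(os.path.splitext("photo.final.jpg"))
```

The two approaches are not drop-in equivalents: os.path.splitext() keeps the dot in the extension and special-cases names that start with a dot, which is part of why the plain string method is cheaper.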
Mike Fährmann
a09f42f6b3
improve filename_from_url() performance
Manually extracting the part between the last '/' and '?' instead of
relying on the standard library's 'urllib.parse.urlsplit()' makes the
function roughly 4 times as fast.

urlsplit() : 3.64 secs per 1,000,000 iterations
partition(): 0.87 secs per 1,000,000 iterations
2020-10-23 00:14:06 +02:00
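A rough sketch of the two approaches compared in this commit. The function names are hypothetical and the bodies are reconstructed from the description only ("the part between the last '/' and '?'"), so they may differ from the actual implementation.

```python
from urllib.parse import urlsplit


def filename_urlsplit(url):
    """Parse the URL fully, then take the last path segment."""
    return urlsplit(url).path.rpartition("/")[2]


def filename_partition(url):
    """Cut off the query string, then take everything after the last '/'."""
    return url.partition("?")[0].rpartition("/")[2]


if __name__ == "__main__":
    url = "https://example.org/path/filename.ext?size=large"
    print(filename_urlsplit(url))
    print(filename_partition(url))
```

The string-only version does less work because it never separates scheme, host, or fragment. The trade-off is that a fragment without a query string ("b.png#x") stays attached to the result, which urlsplit() would have removed.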
Mike Fährmann
37d71f6e09
strip microseconds in text.parse_datetime() 2020-06-17 21:40:16 +02:00
Mike Fährmann
6294e2c540
add 'text.ensure_http_scheme()' 2020-05-19 22:32:53 +02:00
Mike Fährmann
a0f4c295c0
add optional 'utcoffset' argument to 'parse_datetime()' 2020-04-11 02:05:00 +02:00
Mike Fährmann
f6c5edb76b
pre-compile regex pattern for remove_html() and split_html() 2020-03-13 23:31:54 +01:00
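A small sketch of what pre-compiling a pattern looks like in general. The pattern and the helper strip_tags are assumptions for illustration and are not taken from the repository.

```python
import re

# Compiled once at import time and reused on every call, instead of
# passing the pattern string to re.sub() each time.
_TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text, repl=" "):
    """Replace HTML tags with 'repl' and collapse the resulting whitespace."""
    return " ".join(_TAG_PATTERN.sub(repl, text).split())


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>World</b></p>"))
```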
Mike Fährmann
b1bea8aaeb
add 'restrict-filenames' option (#348) 2019-07-23 17:41:24 +02:00
Mike Fährmann
1740086d8a
add 'repl' and 'sep' arguments to text.replace_html() 2019-07-17 14:48:24 +02:00
Mike Fährmann
b171befa87
implement 'parse_unicode_escapes()' 2019-06-16 21:47:24 +02:00
Mike Fährmann
2b1999476e
implement 'text.rextract()' 2019-05-28 21:03:41 +02:00
Mike Fährmann
2316e0ed3d
fix strptime workaround from b0e85a4
Don't return a modified version of 'date_time' if strptime fails.
2019-05-25 23:22:26 +02:00
Mike Fährmann
b0e85a42e3
apply workaround from 4736912 in parse_datetime() itself 2019-05-09 21:53:17 +02:00
Mike Fährmann
d09864b581
implement text.parse_datetime() 2019-05-08 15:43:59 +02:00
Mike Fährmann
6264a46212
use 'utcfromtimestamp()'
'fromtimestamp()' converts its results to the local timezone and causes
problems when running tests on a different machine.
2019-04-21 16:22:53 +02:00
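The timezone issue named in this commit can be sketched as below. The helper parse_ts_utc is hypothetical. Since 'utcfromtimestamp()' is deprecated in recent Python versions (3.12 and later), the sketch uses the timezone-aware equivalent and then drops the tzinfo, which should give the same naive UTC value.

```python
from datetime import datetime, timezone


def parse_ts_utc(ts):
    """Convert a Unix timestamp to a naive datetime in UTC.

    The result does not depend on the timezone of the machine running it,
    unlike datetime.fromtimestamp(ts) without a tz argument.
    """
    return datetime.fromtimestamp(ts, timezone.utc).replace(tzinfo=None)


if __name__ == "__main__":
    print(parse_ts_utc(0))                # same on every machine
    print(datetime.fromtimestamp(0))      # varies with the local timezone
```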
Mike Fährmann
d670de0344
implement 'text.parse_timestamp()' 2019-04-21 15:28:27 +02:00
Mike Fährmann
21a7e395a7
implement convenience wrapper for text.extract functionality 2019-04-19 22:30:11 +02:00
Mike Fährmann
8f249f1d54
improve text.extract_iter() performance
by roughly 40% through
- inlining code
- pre-calculating reused values
- entering a try-except block only once
2019-04-18 23:37:17 +02:00
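The three measures listed in this commit can be illustrated with a generator like the following. It is a sketch written from the commit description alone; the name iter_between and its signature are assumptions, not the function from the repository.

```python
def iter_between(text, begin, end, pos=0):
    """Yield every substring of 'text' enclosed by 'begin' and 'end'."""
    # Pre-calculate values that are reused in every iteration.
    index = text.index
    len_begin = len(begin)
    len_end = len(end)
    # Enter the try block only once; str.index() raises ValueError when
    # a marker is no longer found, which ends the loop.
    try:
        while True:
            first = index(begin, pos) + len_begin
            last = index(end, first)
            pos = last + len_end
            yield text[first:last]
    except ValueError:
        return


if __name__ == "__main__":
    for item in iter_between("<a>1</a><a>2</a>", "<a>", "</a>"):
        print(item)
```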
Mike Fährmann
5530871b5a
change results of text.nameext_from_url()
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)

Example: "https://example.org/path/filename.ext"

before:
- filename : filename.ext
- name     : filename
- extension: ext

now:
- filename : filename
- extension: ext
2019-02-14 16:07:17 +01:00
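The new result layout described in this commit can be sketched as follows. name_and_ext is a hypothetical helper that only mirrors the before/after example; lowercasing the extension is an assumption based on the older commit 7fd284a705.

```python
def name_and_ext(url):
    """Return the 'filename' (without extension) and 'extension' of a URL."""
    # Last path segment, with any query string removed.
    full = url.partition("?")[0].rpartition("/")[2]
    name, dot, ext = full.rpartition(".")
    if not dot:
        # No extension present: the whole segment is the filename.
        return {"filename": full, "extension": ""}
    return {"filename": name, "extension": ext.lower()}


if __name__ == "__main__":
    print(name_and_ext("https://example.org/path/filename.ext"))
```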
Mike Fährmann
e1d3e9a926
add 'ext_from_url' to text.py 2019-01-31 12:23:25 +01:00
Mike Fährmann
2d2953a5bf
add 'text.parse_float()' + cleanup in text.py 2019-01-29 16:46:21 +01:00
Mike Fährmann
ae9a37a528
implement text.split_html() 2018-05-27 15:00:41 +02:00
Mike Fährmann
cc36f88586
rename safe_int to parse_int; move parse_* to text module 2018-04-20 14:53:21 +02:00
Mike Fährmann
4ffa94f634
remove 'shorten_path()' and 'shorten_filename()' 2018-04-15 18:44:13 +02:00
Mike Fährmann
27eab4e467
rewrite text tests and improve functions
- test more edge cases
- consistently return an empty string for invalid arguments
- remove the ungreedy-flag in 'remove_html()'
2018-04-15 18:13:46 +02:00
Mike Fährmann
e3f2bd4087
add tests for 'text.clean_xml()' and improve it 2018-04-14 22:07:01 +02:00
Mike Fährmann
6d8b191ea7
improve 'parse_query()' and add tests
- another irrelevant micro-optimization!
- use urllib.parse.parse_qsl directly instead of parse_qs, which
  just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dict and lists are
  created
2018-04-13 19:21:32 +02:00
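A sketch of the difference between the two standard-library functions named in this commit. query_to_dict is a hypothetical helper; keeping the first value for a repeated key is an assumption, since the log only says that multiple values are ignored.

```python
from urllib.parse import parse_qs, parse_qsl


def query_to_dict(query):
    """Build a flat str -> str dict from a query string."""
    result = {}
    # parse_qsl() yields (key, value) pairs directly; parse_qs() would
    # first collect them into a dict of lists.
    for key, value in parse_qsl(query):
        if key not in result:  # assumption: the first value wins
            result[key] = value
    return result


if __name__ == "__main__":
    query = "id=10&tag=a&tag=b"
    print(parse_qs(query))
    print(query_to_dict(query))
```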
Mike Fährmann
731ffd4986
improve text.filename_from_url() performance
- urlsplit() is faster than urlparse()
- rpartition() is faster than rindex() + slicing
- new version is 2.3 times as fast
2018-02-18 16:50:07 +01:00
Mike Fährmann
f7cdfd4c25
add a simplified version of 'parse_qs'
This version only returns a dict of plain string to string key-value
pairs and ignores multiple values for the same query variable.
2017-08-24 20:55:58 +02:00
Mike Fährmann
e5f79ae839
[deviantart] add support for all media types
- this includes
  - images
  - videos
  - flash-animations
  - journals

- also renamed some of the extractors
  - User  -> Gallery
  - Image -> Deviation
2017-05-10 16:45:45 +02:00
Mike Fährmann
ed94d9b92d
fix/improve various things 2017-03-17 09:39:46 +01:00
Mike Fährmann
619c74159a
[seiga] fix file extension and xml parsing
- The file extension of the first image had been used for all further
  images
- API responses can contain invalid characters, which cause the XML
  parser to fail (http://seiga.nicovideo.jp/user/illust/26377934
  contains several \x08 characters)
2017-03-14 09:09:04 +01:00
Mike Fährmann
4f123b8513
code adjustments according to pep8 2017-01-30 19:40:15 +01:00
Mike Fährmann
8780abcc77
fix a small spelling error 2017-01-10 14:24:58 +01:00
Mike Fährmann
00074a71d7
several changes to make travis build work
- fixed html.unescape not being available on Python3.3
- removed inconsistent test result
- added username/password pairs for authenticating extractors
2017-01-10 13:41:00 +01:00
Mike Fährmann
91c446805b
replace platform.system() with os.name 2016-10-25 15:44:36 +02:00
Mike Fährmann
8a49a28d13
replace deprecated 'unescape' method 2016-02-18 15:54:58 +01:00
Mike Fährmann
99b4fbb081
implement text.extract_iter 2015-11-28 01:46:34 +01:00
Mike Fährmann
7fd284a705
always provide lowercase file extensions 2015-11-16 17:40:05 +01:00
Mike Fährmann
ca523b9f64
add helper method to text module 2015-11-16 03:46:43 +01:00
Mike Fährmann
d0bebd9ce3
allow adding values to existing dict 2015-11-03 00:05:18 +01:00
Mike Fährmann
629133a27a
document text.extract 2015-11-02 15:52:26 +01:00
Mike Fährmann
692d0c95cc
reimplement text.extract_all 2015-11-02 15:51:32 +01:00
Mike Fährmann
db479f881d implement text.shorten_path/filename methods 2015-10-31 00:21:02 +01:00
Mike Fährmann
89f938ee55 handle non string-like arguments for clean_path 2015-10-11 16:21:55 +02:00
Mike Fährmann
c5801c9770 combine text related functions in new module 2015-10-03 12:53:45 +02:00