Transforming scalars to non-adjacent form shrinks the number of
precomputations down to 8, while still processing 4 bits at a time.
However, real-world benchmarks show that the transform is only
really useful with large precomputation tables and for batch
signature verification. So, do it for batch verification only.
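For illustration, the recoding itself can be sketched as follows: every non-zero digit is odd and bounded by 15 (4 bits of magnitude), so only the 8 odd multiples P, 3P, ..., 15P need to be precomputed, and negated digits come for free since point negation is cheap on Edwards curves. This is a self-contained sketch over a 64-bit scalar, not the stdlib code, which recodes 256-bit scalars:

```zig
const std = @import("std");

/// Width-5 NAF recoding sketch: digits are 0 or odd values in [-15, 15].
fn naf(k_in: i64, out: *[68]i64) usize {
    var k = k_in;
    var len: usize = 0;
    while (k != 0) : (len += 1) {
        var d: i64 = 0;
        if (@rem(k, 2) != 0) {
            d = @rem(k, 32); // k mod 2^5 (k never goes negative here)
            if (d >= 16) d -= 32; // center the digit: odd, in [-15, 15]
            k -= d; // the next 4 digits are now guaranteed to be zero
        }
        out[len] = d;
        k >>= 1;
    }
    return len;
}

test "naf recoding evaluates back to the scalar" {
    var digits: [68]i64 = undefined;
    const len = naf(0x7eadbeef, &digits);
    var acc: i64 = 0;
    var i: usize = len;
    while (i > 0) {
        i -= 1;
        acc = acc * 2 + digits[i]; // Horner: sum of digits[i] * 2^i
    }
    std.debug.assert(acc == 0x7eadbeef);
}
```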
https://github.com/cfrg/draft-irtf-cfrg-hash-to-curve
This is quite an important feature to have, since many other standards
currently being worked on depend on this operation.
It also brings a couple of useful arithmetic operations on field elements
along the way.
This PR also adds comments to the functions we expose in 25519/field
so that they can appear in the generated documentation.
We currently have ciphers optimized for performance, for
compatibility, for size and for specific CPUs.
However, we lack a class of ciphers that is becoming increasingly
important as Zig gets used for embedded systems, and as hardware-level
side channels keep being found in (Intel) CPUs.
Here is ISAPv2, a construction specifically designed for resilience
against leakage and fault attacks.
ISAPv2 is obviously not optimized for performance, but can be an
option for highly sensitive data, when the runtime environment cannot
be trusted.
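It plugs into the same interface as the other AEADs. A minimal round-trip sketch, assuming the `IsapA128A` name and its location under `std.crypto.aead.isap` (the exact path may differ across versions):

```zig
const std = @import("std");
const IsapA128A = std.crypto.aead.isap.IsapA128A;

test "isap roundtrip" {
    const key = [_]u8{0x01} ** IsapA128A.key_length;
    const nonce = [_]u8{0x02} ** IsapA128A.nonce_length;
    const m = "highly sensitive data";
    var c: [m.len]u8 = undefined;
    var tag: [IsapA128A.tag_length]u8 = undefined;
    IsapA128A.encrypt(&c, &tag, m, "", nonce, key);
    var m2: [m.len]u8 = undefined;
    try IsapA128A.decrypt(&m2, &c, tag, "", nonce, key);
    std.debug.assert(std.mem.eql(u8, m, &m2));
}
```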
This is a trivial implementation that just does an or/xor loop over the
two inputs.
However, this pattern is used by virtually all crypto libraries and
in practice, even without assembly barriers, LLVM never turns it into
code with conditional jumps, even if one of the parameters is constant.
This has been verified to still be the case with LLVM 11.0.0.
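The pattern itself, as a sketch (the stdlib version differs in its exact signature):

```zig
const std = @import("std");

/// The or/xor pattern: accumulate the XOR of every byte pair with OR,
/// so the comparison touches every byte and takes the same time
/// regardless of where (or whether) the inputs differ.
fn timingSafeEql(a: []const u8, b: []const u8) bool {
    std.debug.assert(a.len == b.len);
    var acc: u8 = 0;
    var i: usize = 0;
    while (i < a.len) : (i += 1) {
        acc |= a[i] ^ b[i];
    }
    return acc == 0;
}
```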
As documented in the comment right above the finalization function,
Gimli can be used as a XOF, i.e. the output doesn't have a fixed
length.
So, allow it to be used that way, just like BLAKE3.
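A hedged sketch of XOF-style usage, assuming the hash type is reachable as `std.crypto.hash.Gimli` with the post-reorg empty-options `init`, and that `final` accepts an arbitrary-length output slice:

```zig
const std = @import("std");
const Gimli = std.crypto.hash.Gimli; // assumed path for the Gimli hash

test "gimli as a xof" {
    var h = Gimli.init(.{});
    h.update("arbitrary-length input");
    var out: [64]u8 = undefined; // longer than the default 32-byte digest
    h.final(&out); // assumed to accept any output length, like Blake3
}
```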
The simple rule is that whenever we have, or will have, two similar
functions, they should live in their own namespace.
Some of these new namespaces currently contain a single function.
This is to prepare for reduced-round versions that are likely to
be added later.
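For instance, with paths illustrative of the new layout:

```zig
const crypto = @import("std").crypto;

// Similar primitives share a namespace, even when a family currently
// has a single member and is only expected to grow later.
const Sha256 = crypto.hash.sha2.Sha256; // alongside Sha224, Sha384, Sha512
const Blake2s256 = crypto.hash.blake2.Blake2s256; // with room for reduced-round variants
```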
We read and write bytes directly from the state, but in the init
function, we potentially endian-swap them.
Initialize the bytes in native endianness instead, since we will also be
reading them in native endianness later.
Also use the public interface in the "permute" test rather than an
internal interface. The state itself is not meant to be accessed directly,
even in tests.
BLAKE2 includes the expected output length in the initial state.
This expected length is distinct from the actual output length
used at finalization.
A BLAKE2b-256 output truncated to 128 bits is thus not the same as a
BLAKE2b-128 output.
This behavior can be a little bit surprising, and has been "fixed"
in BLAKE3.
In order to support this, we may want to provide an option to set the
length used for domain separation.
In Zig, there is another reason to allow this: we assume that the
output length is defined at comptime.
But BLAKE2 doesn't have a fixed output length. For an output length that
is not known at comptime, we can't simply compute the full-size digest and
truncate it, for the reason above.
What we can do now is set that length as an option to get the correct
initial state, and truncate the output if necessary.
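As a sketch, here is how a digest whose length is only known at runtime could be derived, assuming the option is called `expected_out_bits`:

```zig
const std = @import("std");
const Blake2b256 = std.crypto.hash.blake2.Blake2b256;

/// Runtime-length BLAKE2b digest sketch: use a comptime type that is at
/// least as large, set the expected output length for correct domain
/// separation, then truncate.
fn blake2bRuntimeLength(msg: []const u8, out: []u8) void {
    std.debug.assert(out.len <= Blake2b256.digest_length);
    var full: [Blake2b256.digest_length]u8 = undefined;
    Blake2b256.hash(msg, &full, .{ .expected_out_bits = out.len * 8 });
    std.mem.copy(u8, out, full[0..out.len]);
}
```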
Leverage result location semantics for X25519 like we do everywhere
else in 25519/*.
Also add the edwards25519 -> curve25519 map along the way, since many
applications seem to use it to share the same key pair for encryption
and signatures.
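A sketch of that conversion, assuming the `fromEdwards25519` name on `std.crypto.ecc.Curve25519`:

```zig
const std = @import("std");
const Edwards25519 = std.crypto.ecc.Edwards25519;
const Curve25519 = std.crypto.ecc.Curve25519;

/// Sketch: reuse an Ed25519 (signature) public key as an X25519 (DH)
/// public key, so one key pair serves both purposes.
fn edwardsPkToX25519Pk(ed_pk: [32]u8) ![32]u8 {
    const a = try Edwards25519.fromBytes(ed_pk);
    const x = try Curve25519.fromEdwards25519(a);
    return x.toBytes();
}
```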
Intel keeps changing the latency & throughput of the aes* and clmul
instructions every time they release a new model.
Adjust `optimal_parallel_blocks` accordingly, keeping 8 as a safe
default for unknown data.
Gives a ~40% speedup on x86_64.
However, the generic code remains faster on aarch64.
This is still processing only one block at a time for now.
I'm pretty confident that processing more blocks per round
will eventually give a substantial performance improvement on
all platforms with vector units.
The bcrypt function intentionally requires quite a lot of CPU cycles
to complete.
In addition to that, not having its full state constantly in the
CPU L1 cache causes a massive performance drop.
These properties slow down brute-force attacks against low-entropy
inputs (typically passwords), and GPU-based attacks get little
to no advantage over CPUs.
The NaCl constructions are available in pretty much all programming
languages, making them a solid choice for applications that require
interoperability.
Go includes them in the standard library, JavaScript has the popular
tweetnacl.js module, and reimplementations and ports of TweetNaCl
have been made everywhere.
Zig has almost everything that NaCl has at this point, the main
missing component being the Salsa20 cipher, on top of which NaCl's
secretboxes, boxes, and sealedboxes can be implemented.
So, here they are!
And clean the X25519 API up a little bit by the way.
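A round-trip sketch of the secretbox API, with names hedged to the `std.crypto.nacl` namespace:

```zig
const std = @import("std");
const SecretBox = std.crypto.nacl.SecretBox;

test "secretbox roundtrip" {
    const key = [_]u8{0x42} ** SecretBox.key_length;
    const nonce = [_]u8{0x24} ** SecretBox.nonce_length;
    const m = "interoperable message";
    var c: [SecretBox.tag_length + m.len]u8 = undefined;
    SecretBox.seal(&c, m, nonce, key);
    var m2: [m.len]u8 = undefined;
    try SecretBox.open(&m2, &c, nonce, key);
    std.debug.assert(std.mem.eql(u8, m, &m2));
}
```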
- Use `PascalCase` for all types. So, AES256GCM is now Aes256Gcm.
- Consistently use `_length` instead of mixing `_size` and `_length` for the
constants we expose.
- Use `minimum_key_length` when it represents an actual minimum length.
Otherwise, use `key_length`.
- Require output buffers (for ciphertexts, MACs, hashes) to be exactly the
expected size, instead of at least that size in some functions and the exact
size in others.
- Use a `_bits` suffix instead of `_length` when a size is represented as a
number of bits, to avoid confusion.
- Pass comptime-sized outputs as pointers to fixed-size arrays instead of
slices guarded by a runtime assertion. This is the case for most hash functions.
- Use `camelCase` for all functions instead of `snake_case`.
No functional changes, but these are breaking API changes.
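To illustrate a few of the conventions above (the type path is hedged and may differ):

```zig
const std = @import("std");
const Aes256Gcm = std.crypto.aead.Aes256Gcm; // PascalCase type; was AES256GCM

test "naming conventions" {
    // Byte counts use a _length suffix; bit counts would use _bits.
    std.debug.assert(Aes256Gcm.key_length == 32);
    std.debug.assert(Aes256Gcm.nonce_length == 12);
    std.debug.assert(Aes256Gcm.tag_length == 16);
}
```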
`DefaultCsprng` is documented as a cryptographically secure RNG.
While `ISAAC` is a CSPRNG, the variant we have, `ISAAC64`, is not, and a
64-bit seed is too small to satisfy that claim anyway.
We have also seen it used with the current date as a seed, which
defeats the point of a CSPRNG entirely.
Set `DefaultCsprng` to `Gimli` instead of `ISAAC64`, rename
the parameter from `init_s` to `secret_seed`, and add a comment to
clarify what kind of seed is expected here.
Instead of directly touching the internals of the Gimli implementation
(which can change/be architecture-specific), add an `init()` function
to the state.
Our Gimli-based CSPRNG was also not backtracking resistant. Gimli
is a permutation; it can be reverted. So, if the state was ever leaked,
not only future secrets, but also all previously generated ones, could be
recovered. Clear the rate after a squeeze in order to prevent this.
Finally, a dumb test was added just to exercise `DefaultCsprng` since
we don't use it anywhere.
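Seeding it correctly looks like this sketch (the `secret_seed_length` constant, the `std.crypto.randomBytes` helper, and the field-style `random` accessor are assumptions based on this era's APIs):

```zig
const std = @import("std");

test "seeding DefaultCsprng" {
    // The seed must be secret and unpredictable: pull it from the OS,
    // never from something like the current date.
    var seed: [std.rand.DefaultCsprng.secret_seed_length]u8 = undefined;
    try std.crypto.randomBytes(&seed);
    var csprng = std.rand.DefaultCsprng.init(seed);
    const x = csprng.random.int(u64); // cryptographically safe output
    _ = x;
}
```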
HMAC is a generic construction, so we allow it to be instantiated
with any hash function.
In practice, HMAC is almost exclusively used with MD5, SHA1 and SHA2,
so it makes sense to define some shortcuts for them.
However, defining `HmacBlake2s256` is a bit weird (and why
specifically that one, and not other hash functions we also support?).
There would be nothing wrong with that construction, but it's not
used in any standard protocol and would be a curious choice.
BLAKE2 is a keyed hash function, so it doesn't need HMAC to be used as
a MAC; that doesn't make it a good showcase for HMAC either.
This commit doesn't remove the ability to use an `Hmac(Blake2s256)` type
if, for some reason, applications really need it, but it removes
`HmacBlake2s256` as a predefined constant.
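For reference, such an instantiation stays a one-liner (paths hedged to the `std.crypto.auth.hmac` layout):

```zig
const std = @import("std");
const Hmac = std.crypto.auth.hmac.Hmac;
const Blake2s256 = std.crypto.hash.blake2.Blake2s256;

test "hmac with a custom hash function" {
    // No longer predefined, but still trivial for code that really wants it:
    const HmacBlake2s256 = Hmac(Blake2s256);
    var mac: [HmacBlake2s256.mac_length]u8 = undefined;
    HmacBlake2s256.create(&mac, "message", "key");
}
```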
This is slightly slower, but makes our verification function compatible
with batch signature verification, which, in turn, makes blockchain people
happy. And we want to make our users happy.
Add convenience functions to subtract edwards25519 points and to
clear the cofactor.
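A quick sketch of what these enable (assuming the `sub`/`clearCofactor` names on `std.crypto.ecc.Edwards25519`):

```zig
const std = @import("std");
const Edwards25519 = std.crypto.ecc.Edwards25519;

test "subtract points and clear the cofactor" {
    const b = Edwards25519.basePoint;
    const p = b.add(b).sub(b); // 2B - B == B
    const pb = p.toBytes();
    const bb = b.toBytes();
    std.debug.assert(std.mem.eql(u8, &pb, &bb));
    _ = p.clearCofactor(); // multiplies by 8, into the prime-order subgroup
}
```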
Brings a 30% speed boost on x86_64 even though we still process only
one block at a time for now.
Only enabled on x86_64 since the non-vectorized implementation seems
to currently perform better on some architectures (at least on aarch64).
But the non-vectorized implementation still gets a little speed boost
as well (~17%) with these changes.
Performance increases from ~400 MiB/s to ~450 MiB/s at the expense of
extra code. Thus, aggregation is disabled on ReleaseSmall.
Since the multiplication cost is significant compared to the reduction,
aggregating more than 2 blocks is probably not worth it.
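For the record, two-block aggregation is just Horner's rule with a
precomputed squared key: instead of two multiply-then-reduce steps,
acc = reduce((acc + m1) * k) followed by acc = reduce((acc + m2) * k),
it computes acc = reduce((acc + m1) * k^2 + m2 * k) with a single
reduction. The per-block multiplications remain; only reductions are
amortized, which is why going past 2 blocks quickly stops paying off.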