Since GPR#247 (stack backtraces aware of inlining) was merged, frame tables contain two kinds of addresses of labels: code labels (as before) and data labels (new, pointing to sub-frames).
On ARM in Thumb mode, the two kinds of pointers must be distinguished, because pointers to Thumb code have the low bit set, and the assembler needs to know whether a label denotes code or data to set the low bit or not.
This commit fixes this problem by splitting the "efa_label" action of record Emitaux.emit_frame_actions into two actions, "efa_code_label" and "efa_data_label". On all ports except ARM, the two actions are identical. On ARM, the actions add the appropriate ".type" declaration.
Tested on ARM-32 and x86-64 only. CI will test the other platforms.
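The split described above can be sketched as follows. This is a minimal illustration, not the actual code: the record is cut down to the two new actions, and the buffer, the ".L<n>" label syntax and the exact directives are assumptions made for the example.

```ocaml
(* Hypothetical, cut-down version of the record: only the two label actions. *)
type emit_frame_actions = {
  efa_code_label : int -> unit;  (* emit the address of a code label *)
  efa_data_label : int -> unit;  (* emit the address of a data label *)
}

(* Output goes to a buffer here so that the result can be inspected. *)
let buf = Buffer.create 64

(* Ports other than ARM: both actions emit the same thing. *)
let generic_actions =
  let emit lbl = Buffer.add_string buf (Printf.sprintf "\t.word\t.L%d\n" lbl) in
  { efa_code_label = emit; efa_data_label = emit }

(* ARM: declare the kind of the label first, so that the assembler knows
   whether to set the Thumb low bit. Directive spelling is an assumption. *)
let arm_actions =
  let emit kind lbl =
    Buffer.add_string buf
      (Printf.sprintf "\t.type\t.L%d, %s\n\t.word\t.L%d\n" lbl kind lbl) in
  { efa_code_label = emit "%function"; efa_data_label = emit "%object" }
```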
The latencies are based on wild guesses for the z10. Since newer z processors are out-of-order, basic-block scheduling could also be turned off entirely.
The locgr instruction is not available in z10, the baseline for this port.
Instead, generate pedestrian code with a conditional branch.
Pass -march=z10 to the assembler to enforce z10 compliance.
Without the special reloading implemented here, a 2-address instruction such as 'x := x + y' could be reloaded as 'x1 := x2 + y' with two different temporaries x1, x2 for x.
In PIC mode, Itailcall_imm should jump to the PLT of the called function.
Also: use %r7 rather than %r1 to pass the function pointer argument to caml_c_call. It can be that caml_c_call is in a different shared object than the caller. In this case, %r0 and %r1 can be destroyed by PLT stub code, according to the ELF ABI.
Use la/lay when possible for add immediate and sub immediate,
because these instructions support the case result <> argument.
Use 'and/or/xor immediate over low 32 bits' instructions.
Do this only if the top 32 bits of the constant are 0 (or/xor) or -1 (and).
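The condition on the constant can be written as two small predicates. This is a sketch only; the helper names are hypothetical and not taken from the port.

```ocaml
(* Top 32 bits of a 64-bit constant, sign-extended (arithmetic shift). *)
let top32 (n : int64) = Int64.shift_right n 32

(* or/xor with n leaves the high half unchanged iff the top 32 bits are 0. *)
let fits_low32_or_xor n = top32 n = 0L

(* and with n leaves the high half unchanged iff the top 32 bits are all ones. *)
let fits_low32_and n = top32 n = (-1L)
```

For instance, 0xFFL qualifies for or/xor, -256L qualifies for and, and 0x1_0000_0000L qualifies for neither.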
Move the cold path (the one that calls the GC when alloc_ptr < alloc_limit)
as much as possible to the end of the function.
Use la and lay to produce shorter code.
New function emit_stack_adjust, which chooses the shortest instruction
that performs the required adjustment.
Later, this will be a good place to put cfi_adjust directives.
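A possible shape for such a function, as a sketch: it returns the instruction text instead of emitting it, and the choice of la, lay and agfi with the ranges shown is an assumption based on the instruction formats, not a copy of emit.mlp.

```ocaml
(* n is the signed number of bytes to add to the stack pointer %r15. *)
let stack_adjust n =
  if n = 0 then
    ""                                              (* nothing to do *)
  else if n > 0 && n < 4096 then
    (* la: 4-byte instruction, 12-bit unsigned displacement *)
    Printf.sprintf "\tla\t%%r15, %d(%%r15)\n" n
  else if n >= -0x80000 && n < 0x80000 then
    (* lay: 6-byte instruction, 20-bit signed displacement *)
    Printf.sprintf "\tlay\t%%r15, %d(%%r15)\n" n
  else
    (* agfi: 6-byte instruction, 32-bit signed immediate *)
    Printf.sprintf "\tagfi\t%%r15, %d\n" n
```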
In emit.mlp, write %rN and %fN directly in `...` strings, instead of going through emit_gpr and emit_fpr.
Justification: for other ports like Power, several concrete asm syntaxes for register names exist, so it makes sense to abstract over them. This is not the case for z systems under Linux. Plus, using the concrete syntax directly makes it easier to review emit.mlp.
Following the previous commit, %r12 becomes usable as a normal register.
However it must be saved in caml_call_gc.
Independently: change Proc.loc_external_arguments to account for the
160 reserved bytes at the bottom of the stack. Then, caml_c_call and
emission of code for Iextcall(false) no longer need to account for
those reserved bytes.
Taking a leaf from recent versions of GCC, in PIC mode, we use a PC-relative load with GOTENT relocation to access the global offset table. This way, we don't have to save, set up and reload %r12 as GOT pointer in every function.
- Ibased addressing is removed. The code generated for an Ibased load/store is no better than the code we generate for an Iindexed load/store preceded by a Iconst_symbol instruction that loads the address of the global variable. Plus, we now get opportunities for CSE of the Iconst_symbol.
- Iindexed2 addressing is extended with a constant displacement, to take full advantage of the ofs(%r1, %r2) addressing mode of the processor.
- During instruction selection, make sure that the constant displacement of Iindexed and Iindexed2 is within range (20 bit signed).
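The range check on displacements can be sketched as below. The addressing-mode type is a cut-down stand-in for the one in the port, and the helper names are assumptions made for the example.

```ocaml
(* Cut-down addressing modes, each carrying a constant displacement. *)
type addressing_mode =
  | Iindexed of int    (* reg + displacement *)
  | Iindexed2 of int   (* reg + reg + displacement *)

(* A 20-bit signed field holds -2^19 .. 2^19 - 1. *)
let is_offset n = n >= -0x80000 && n <= 0x7FFFF

(* True if the mode can be encoded directly; otherwise the selector has to
   compute part of the address separately. *)
let in_range = function
  | Iindexed d | Iindexed2 d -> is_offset d
```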
Using the low bit of return addresses to mark already-scanned stack frames improves GC time on architectures that ignore this bit in 'return' instructions, like Power. On architectures that do not ignore it, as is the case for zSystem, clearing this bit before every 'return' instruction costs too much in running time.
asmrun/stack.h: turn off the marking of return addresses for z
asmcomp/s390x/emit.mlp: suppress clearing of low bit of return addresses