c1) lazily call lm_init*()
c2) the hybrid mode, which prefers calling mmap(2) to calling the user-mode lm_mmap().
c3) the user mode, which exclusively calls lm_mmap()
o. Fix a bug in c1) mentioned above.
target, so $@ is working fine for specifying the target.
I added another target to the recipe but forgot to change $@
accordingly.
Thanks to Yichun for figuring out this problem.
The bug is triggered when ljmm_mremap() tries to shrink an
mmapped block. The problem is a miscalculation of
the starting address of the stretch of memory that is going
to be disposed of.
The unit test suite does not have a case that exercises
memory shrinking. I will add a test case tomorrow.
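The address arithmetic involved in shrinking a block can be sketched as
below. This is only an illustration, not the code in ljmm_mremap():
the helper shrink_tail() and the type tail_range are hypothetical names,
and the sketch assumes the block starts on a page boundary and that both
sizes are rounded up to whole pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not ljmm's actual code: describes the tail of a
 * block that becomes unused when the block is shrunk in place. */
typedef struct {
    uintptr_t start; /* first byte to dispose of */
    size_t    len;   /* number of bytes to dispose of */
} tail_range;

static tail_range shrink_tail(uintptr_t block_start, size_t old_size,
                              size_t new_size, size_t page_size)
{
    /* Assumptions of this sketch: shrinking only, power-of-two page
     * size, page-aligned block start. */
    assert(new_size <= old_size);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(block_start % page_size == 0);

    /* Round both sizes up to whole pages. */
    size_t keep    = (new_size + page_size - 1) & ~(page_size - 1);
    size_t old_end = (old_size + page_size - 1) & ~(page_size - 1);

    tail_range t;
    /* The disposed stretch begins right after the pages that are kept,
     * i.e. at an offset derived from the NEW size, and runs to the old
     * end of the block. */
    t.start = block_start + keep;
    t.len   = old_end - keep;
    return t;
}
```

For an 8-page block shrunk to 3 pages, the stretch to dispose of starts
3 pages past the block start and is 5 pages long.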
As of this revision, we are able to run CPU2000int with the reference input
successfully! These benchmarks are linked against ptmalloc3 (not part of
this project) and libadaptor.so.
Running CPU2000int is just a way to stress-test this work.
and page initialization via zero-filling.
Suppose an allocated block B1, whose virtual address range is [ad1, ad2], is
going to be deallocated. On Linux, it seems the only way to deallocate the
pages associated with the block is to call madvise(..MADV_DONTNEED...)
(hereinafter called madvise() for short unless otherwise noted).
madvise() *immediately* removes all the pages involved and invalidates the
related TLB entries. So, if we later allocate a block that overlaps with
B1 in virtual address space, accessing the overlapping space will result in
re-establishing TLB entries and zero-filling pages, which is a bit expensive.
This cost can be reduced by keeping a few blocks in memory and reusing the
memory-resident pages over and over again. This is the rationale behind the
"block cache". The "cache" here may be a misnomer; it doesn't cache any data,
it just provides a way to keep a small number of idle pages in memory to avoid
the cost of TLB manipulation and page initialization via zero-filling.
give a 2.5% speedup to bzip2 in the SPEC2000int benchmark. 2.5% is a fairly
big improvement for bzip2. As I write this commit log, I have no idea
where the speedup is coming from. This revision might expose some
opportunity in the malloc implementation.
I ran bzip2 with only one ref input, "input.source".
o. If I run bzip2 with the following command:
time -p (./bzip2 input.source 58 >run.source.out 2>run.source.err)
the user time is 15.39s.
o. If I run the same command line but with ptmalloc3 on and ljmm off,
time -p (ENABLE_LJMM=0 LD_PRELOAD="/home/syang/develop/luajit/ljmm.gh/libptmalloc3.so /home/syang/develop/luajit/ljmm.gh/libadaptor.so" ./bzip2 input.source 58 >run.source.out 2>run.source.err)
the user time is about the same (if not exactly the same).
(This implies that ptmalloc3 and the malloc implementation in GNU libc 2.19
have about the same performance.)
o. However, if I change the above command line by flipping
ENABLE_LJMM=0 to ENABLE_LJMM=1, the user time reported by the time command
drops to 14.9s.
(Yes, the output of the command line *is* correct.)
The performance improvement deserves a closer look.