hardened_malloc/README.md

# Hardened malloc

This is a security-focused general purpose memory allocator providing the
malloc API along with various extensions. It provides substantial hardening
against heap corruption vulnerabilities. The security-focused design also leads
to much less metadata overhead and memory waste from fragmentation than a more
traditional allocator design. It aims to provide decent overall performance
with a focus on long-term performance and memory usage rather than allocator
micro-benchmarks. It offers scalability via a configurable number of entirely
independently arenas, with the internal locking within arenas further divided
up per size class.

This project currently supports Bionic (Android), musl and glibc. It may
support other non-Linux operating systems in the future. For Android, there's
custom integration and other hardening features which is also planned for musl
in the future. The glibc support will be limited to replacing the malloc
implementation because musl is a much more robust and cleaner base to build on
and can cover the same use cases.

This allocator is intended as a successor to a previous implementation based on
extending OpenBSD malloc with various additional security features. It's still
heavily based on the OpenBSD malloc design, albeit not on the existing code
other than reusing the hash table implementation for the time being. The main
differences in the design are that it is solely focused on hardening rather
than finding bugs, uses finer-grained size classes along with slab sizes going
beyond 4k to reduce internal fragmentation, doesn't rely on the kernel having
fine-grained mmap randomization and only targets 64-bit to make aggressive use
of the large address space.  There are lots of smaller differences in the
implementation approach. It incorporates the previous extensions made to
OpenBSD malloc including adding padding to allocations for canaries (distinct
from the current OpenBSD malloc canaries), write-after-free detection tied to
the existing clearing on free, queues alongside the existing randomized arrays
for quarantining allocations and proper double-free detection for quarantined
allocations. The per-size-class memory regions with their own random bases were
loosely inspired by the size and type-based partitioning in PartitionAlloc. The
planned changes to OpenBSD malloc ended up being too extensive and invasive so
this project was started as a fresh implementation better able to accomplish
the goals. For 32-bit, a port of OpenBSD malloc with small extensions can be
used instead as this allocator fundamentally doesn't support that environment.

## Dependencies

Debian stable determines the most ancient set of supported dependencies:

* glibc 2.24
* Linux 4.9
* Clang 3.8 or GCC 6.3

However, using more recent releases is highly recommended. Older versions of
the dependencies may be compatible at the moment but are not tested and will
explicitly not be supported.

For external malloc replacement with musl, musl 1.1.20 is required. However,
there will be custom integration offering better performance in the future
along with other hardening for the C standard library implementation.

For Android, only current generation Android Open Source Project branches will
be supported, which currently means pie-qpr2-release.

## Testing

The `preload.sh` script can be used for testing with dynamically linked
executables using glibc or musl:

    ./preload.sh krita --new-image RGBA,U8,500,500

It can be necessary to substantially increase the `vm.max_map_count` sysctl to
accomodate the large number of mappings caused by guard slabs and large
allocation guard regions. The number of mappings can also be drastically
reduced via a significant increase to `CONFIG_GUARD_SLABS_INTERVAL` but the
feature has a low performance and memory usage cost so that isn't recommended.

It can offer slightly better performance when integrated into the C standard
library and there are other opportunities for similar hardening within C
standard library and dynamic linker implementations. For example, a library
region can be implemented to offer similar isolation for dynamic libraries as
this allocator offers across different size classes. The intention is that this
will be offered as part of hardened variants of the Bionic and musl C standard
libraries.

## Configuration

You can set some configuration options at compile-time via arguments to the
make command as follows:

    make CONFIG_EXAMPLE=false

Configuration options are provided when there are significant compromises
between portability, performance, memory usage or security. The core design
choices are not configurable and the allocator remains very security-focused
even with all the optional features disabled.

The following boolean configuration options are available:

* `CONFIG_NATIVE`: `true` (default) or `false` to control whether the code is
  optimized for the detected CPU on the host. If this is disabled, setting up a
  custom `-march` higher than the baseline architecture is highly recommended
  due to substantial performance benefits for this code.
* `CONFIG_CXX_ALLOCATOR`: `true` (default) or `false` to control whether the
  C++ allocator is replaced for slightly improved performance and detection of
  mismatched sizes for sized deallocation (often type confusion bugs). This
  will result in linking against the C++ standard library.
* `CONFIG_ZERO_ON_FREE`: `true` (default) or `false` to control whether small
  allocations are zeroed on free, to mitigate use-after-free and uninitialized
  use vulnerabilities along with purging lots of potentially sensitive data
  from the process as soon as possible. This has a performance cost scaling to
  the size of the allocation, which is usually acceptable.
* `CONFIG_WRITE_AFTER_FREE_CHECK`: `true` (default) or `false` to control
  sanity checking that new allocations contain zeroed memory. This can detect
  writes caused by a write-after-free vulnerability and mixes well with the
  features for making memory reuse randomized / delayed. This has a performance
  cost scaling to the size of the allocation, which is usually acceptable.
* `CONFIG_SLOT_RANDOMIZE`: `true` (default) or `false` to randomize selection
  of free slots within slabs. This has a measurable performance cost and isn't
  one of the important security features, but the cost has been deemed more
  than acceptable to be enabled by default.
* `CONFIG_SLAB_CANARY`: `true` (default) or `false` to enable support for
  adding 8 byte canaries to the end of memory allocations. The primary purpose
  of the canaries is to render small fixed size buffer overflows harmless by
  absorbing them. The first byte of the canary is always zero, containing
  overflows caused by a missing C string NUL terminator. The other 7 bytes are
  a per-slab random value. On free, integrity of the canary is checked to
  detect attacks like linear overflows or other forms of heap corruption caused
  by imprecise exploit primitives. However, checking on free will often be too
  late to prevent exploitation so it's not the main purpose of the canaries.
* `CONFIG_SEAL_METADATA`: `true` or `false` (default) to control whether Memory
  Protection Keys are used to disable access to all writable allocator state
  outside of the memory allocator code. It's currently disabled by default due
  to being extremely experimental and a significant performance cost for this
  use case on current generation hardware, which may become drastically lower
  in the future. Whether or not this feature is enabled, the metadata is all
  contained within an isolated memory region with high entropy random guard
  regions around it.

The following integer configuration options are available. Proper sanity checks
for the chosen values are not written yet, so use them at your own peril:

* `CONFIG_SLAB_QUARANTINE_RANDOM_LENGTH`: `1` (default) to control the number
  of slots in the random array used to randomize reuse for small memory
  allocations. This sets the length for the largest size class (currently
  16384) and the quarantine length for smaller size classes is scaled to match
  the total memory of the quarantined allocations (1 becomes 1024 for 16 byte
  allocations).
* `CONFIG_SLAB_QUARANTINE_QUEUE_LENGTH`: `1` (default) to control the number of
  slots in the queue used to delay reuse for small memory allocations. This
  sets the length for the largest size class (currently 16384) and the
  quarantine length for smaller size classes is scaled to match the total
  memory of the quarantined allocations (1 becomes 1024 for 16 byte
  allocations).
* `CONFIG_GUARD_SLABS_INTERVAL`: `1` (default) to control the number of slabs
  before a slab is skipped and left as an unused memory protected guard slab
* `CONFIG_GUARD_SIZE_DIVISOR`: `2` (default) to control the maximum size of the
  guard regions placed on both sides of large memory allocations, relative to
  the usable size of the memory allocation
* `CONFIG_REGION_QUARANTINE_RANDOM_LENGTH`: `128` (default) to control the
  number of slots in the random array used to randomize region reuse for large
  memory allocations
* `CONFIG_REGION_QUARANTINE_QUEUE_LENGTH`: `1024` (default) to control the
  number of slots in the queue used to delay region reuse for large memory
  allocations
* `CONFIG_REGION_QUARANTINE_SKIP_THRESHOLD`: `33554432` (default) to control
  the size threshold where large allocations will not be quarantined
* `CONFIG_FREE_SLABS_QUARANTINE_RANDOM_LENGTH`: `32` (default) to control the
  number of slots in the random array used to randomize free slab reuse
* `CONFIG_CLASS_REGION_SIZE`: `34359738368` (default) to control the size of
  the size class regions
* `CONFIG_N_ARENA`: `1` (default) to control the number of arenas
* `CONFIG_STATS`: `false` (default) to control whether stats on allocation /
  deallocation count and active allocations are tracked. This is currently only
  exposed via the mallinfo APIs on Android.
* `CONFIG_EXTENDED_SIZE_CLASSES`: `true` (default) to control whether small
  size class go up to 64k instead of the minimum requirement for avoiding
  memory waste of 16k. The option to extend it even further will be offered in
  the future when better support for larger slab allocations is added.
* `CONFIG_LARGE_SIZE_CLASSES`: `true` (default) to control whether large
  allocations use the slab allocation size class scheme instead of page size
  granularity (see the section on size classes below)

There will be more control over enabled features in the future along with
control over fairly arbitrarily chosen values like the size of empty slab
caches (making them smaller improves security and reduces memory usage while
larger caches can substantially improves performance).

## Basic design

The current design is very simple and will become a bit more sophisticated as
the basic features are completed and the implementation is hardened and
optimized. The allocator is exclusive to 64-bit platforms in order to take full
advantage of the abundant address space without being constrained by needing to
keep the design compatible with 32-bit.

Small allocations are always located in a large memory region reserved for slab
allocations. It can be determined that an allocation is one of the small size
classes from the address range. Each small size class has a separate reserved
region within the larger region, and the size of a small allocation can simply
be determined from the range. Each small size class has a separate out-of-line
metadata array outside of the overall allocation region, with the index of the
metadata struct within the array mapping to the index of the slab within the
dedicated size class region. Slabs are a multiple of the page size and are
page aligned. The entire small size class region starts out memory protected
and becomes readable / writable as it gets allocated, with idle slabs beyond
the cache limit having their pages dropped and the memory protected again.

Large allocations are tracked via a global hash table mapping their address to
their size and guard size. They're simply memory mappings and get mapped on
allocation and then unmapped on free.

This allocator is aimed at production usage, not aiding with finding and fixing
memory corruption bugs for software development. It does find many latent bugs
but won't include features like the option of generating and storing stack
traces for each allocation to include the allocation site in related error
messages. The design choices are based around minimizing overhead and
maximizing security which often leads to different decisions than a tool
attempting to find bugs. For example, it uses zero-based sanitization on free
and doesn't minimize slack space from size class rounding between the end of an
allocation and the canary / guard region. Zero-based filling has the least
chance of uncovering latent bugs, but also the best chance of mitigating
vulnerabilities. The canary feature is primarily meant to act as padding
absorbing small overflows to render them harmless, so slack space is helpful
rather than harmful despite not detecting the corruption on free. The canary
needs detection on free in order to have any hope of stopping other kinds of
issues like a sequential overflow, which is why it's included.  It's assumed
that an attacker can figure out the allocator is in use so the focus is
explicitly not on detecting bugs that are impossible to exploit with it in use
like an 8 byte overflow. The design choices would be different if performance
was a bit less important and if a core goal was finding latent bugs.

## Security properties

* Fully out-of-line metadata/state with protection from corruption
    * Address space for allocator state is entirely reserved during
      initialization and never reused for allocations or anything else
    * State within global variables is entirely read-only after initialization
      with pointers to the isolated allocator state so leaking the address of
      the library doesn't leak the address of writable state
    * Allocator state is located within a dedicated region with high entropy
      randomly sized guard regions around it
    * Protection via Memory Protection Keys (MPK) on x86\_64 (disabled by
      default due to low benefit-cost ratio on top of baseline protections)
    * [future] Protection via MTE on ARMv8.5+
* Deterministic detection of any invalid free (unallocated, unaligned, etc.)
    * Validation of the size passed for C++14 sized deallocation by `delete`
      even for code compiled with earlier standards (detects type confusion if
      the size is different) and by various containers using the allocator API
      directly
* Isolated memory region for slab allocations
    * Divided up into isolated inner regions for each size class
        * High entropy random base for each size class region
        * No deterministic / low entropy offsets between allocations with
          different size classes
    * Metadata is completely outside the slab allocation region
        * No references to metadata within the slab allocation region
        * No deterministic / low entropy offsets to metadata
    * Entire slab region starts out non-readable and non-writable
    * Slabs beyond the cache limit are purged and become non-readable and
      non-writable memory again
        * Placed into a queue for reuse in FIFO order to maximize the time
          spent memory protected
        * Randomized array is used to add a random delay for reuse
* Fine-grained randomization within memory regions
    * Randomly sized guard regions for large allocations
    * Random slot selection within slabs
    * Randomized delayed free for slab allocations
    * [in-progress] Randomized choice of slabs
    * [in-progress] Randomized allocation of slabs
* Slab allocations are zeroed on free
* Detection of write-after-free for slab allocations by verifying zero filling
  is intact at allocation time
* Large allocations are purged and memory protected on free with the memory
  mapping kept reserved in a quarantine to detect use-after-free
    * The quarantine is primarily based on a FIFO ring buffer, with the oldest
      mapping in the quarantine being unmapped to make room for the most
      recently freed mapping
    * Another layer of the quarantine swaps with a random slot in an array to
      randomize the number of large deallocations required to push mappings out
      of the quarantine
* Memory in fresh allocations is consistently zeroed due to it either being
  fresh pages or zeroed on free after previous usage
* Delayed free via a combination of FIFO and randomization for slab allocations
* Random canaries placed after each slab allocation to *absorb*
  and then later detect overflows/underflows
    * High entropy per-slab random values
    * Leading byte is zeroed to contain C string overflows
* Possible slab locations are skipped and remain memory protected, leaving slab
  size class regions interspersed with guard pages
* Zero size allocations are a dedicated size class with the entire region
  remaining non-readable and non-writable
* Extension for retrieving the size of allocations with fallback
  to a sentinel for pointers not managed by the allocator
    * Can also return accurate values for pointers *within* small allocations
    * The same applies to pointers within the first page of large allocations,
      otherwise it currently has to return a sentinel
* No alignment tricks interfering with ASLR like jemalloc, PartitionAlloc, etc.
* No usage of the legacy brk heap
* Aggressive sanity checks
    * Errors other than ENOMEM from mmap, munmap, mprotect and mremap treated
      as fatal, which can help to detect memory management gone wrong elsewhere
      in the process.
* [future] Memory tagging for slab allocations via MTE on ARMv8.5+
    * random memory tags as the baseline, providing probabilistic protection
      against various forms of memory corruption
    * dedicated tag for free slots, set on free, for deterministic protection
      against accessing freed memory
    * store previous random tag within freed slab allocations, and increment it
      to get the next tag for that slot to provide deterministic use-after-free
      detection through multiple cycles of memory reuse
    * guarantee distinct tags for adjacent memory allocations by incrementing
      past matching values for deterministic detection of linear overflows

## Randomness

The current implementation of random number generation for randomization-based
mitigations is based on generating a keystream from a stream cipher (ChaCha8)
in small chunks. A separate CSPRNG is used for each small size class, large
allocations, etc. in order to fit into the existing fine-grained locking model
without needing to waste memory per thread by having the CSPRNG state in Thread
Local Storage. Similarly, it's protected via the same approach taken for the
rest of the metadata. The stream cipher is regularly reseeded from the OS to
provide backtracking and prediction resistance with a negligible cost. The
reseed interval simply needs to be adjusted to the point that it stops
registering as having any significant performance impact. The performance
impact on recent Linux kernels is primarily from the high cost of system calls
and locking since the implementation is quite efficient (ChaCha20), especially
for just generating the key and nonce for another stream cipher (ChaCha8).

ChaCha8 is a great fit because it's extremely fast across platforms without
relying on hardware support or complex platform-specific code. The security
margins of ChaCha20 would be completely overkill for the use case. Using
ChaCha8 avoids needing to resort to a non-cryptographically secure PRNG or
something without a lot of scrunity. The current implementation is simply the
reference implementation of ChaCha8 converted into a pure keystream by ripping
out the XOR of the message into the keystream.

The random range generation functions are a highly optimized implementation
too. Traditional uniform random number generation within a range is very high
overhead and can easily dwarf the cost of an efficient CSPRNG.

## Size classes

The zero byte size class is a special case of the smallest regular size class.
It's allocated in a dedicated region like other size classes but with the slabs
never being made readable and writable so the only memory usage is for the slab
metadata.

The choice of size classes for slab allocation is the same as jemalloc, which
is a careful balance between minimizing internal and external fragmentation. If
there are more size classes, more memory is wasted on free slots available only
to allocation requests of those sizes (external fragmentation). If there are
fewer size classes, the spacing between them is larger and more memory is
wasted due to rounding up to the size classes (internal fragmentation). There
are 4 special size classes for the smallest sizes (16, 32, 48, 64) that are
simply spaced out by the minimum spacing (16). Afterwards, there are four size
classes for every power of two spacing which results in bounding the internal
fragmentation below 20% for each size class. This also means there are 4 size
classes for each doubling in size.

The slot counts tied to the size classes are specific to this allocator rather
than being taken from jemalloc. Slabs are always a span of pages so the slot
count needs to be tuned to minimize waste due to rounding to the page size. For
now, this allocator is set up only for 4096 byte pages as a small page size is
desirable for finer-grained memory protection and randomization. It could be
ported to larger page sizes in the future. The current slot counts are only a
preliminary set of values.

| size class | worst case internal fragmentation | slab slots | slab size | internal fragmentation for slabs |
| - | - | - | - | - |
| 16 | 93.75% | 256 | 4096 | 0.0% |
| 32 | 46.875% | 128 | 4096 | 0.0% |
| 48 | 31.25% | 85 | 4096 | 0.390625% |
| 64 | 23.4375% | 64 | 4096 | 0.0% |
| 80 | 18.75% | 51 | 4096 | 0.390625% |
| 96 | 15.625% | 42 | 4096 | 1.5625% |
| 112 | 13.392857142857139% | 36 | 4096 | 1.5625% |
| 128 | 11.71875% | 64 | 8192 | 0.0% |
| 160 | 19.375% | 51 | 8192 | 0.390625% |
| 192 | 16.145833333333343% | 64 | 12288 | 0.0% |
| 224 | 13.839285714285708% | 54 | 12288 | 1.5625% |
| 256 | 12.109375% | 64 | 16384 | 0.0% |
| 320 | 19.6875% | 64 | 20480 | 0.0% |
| 384 | 16.40625% | 64 | 24576 | 0.0% |
| 448 | 14.0625% | 64 | 28672 | 0.0% |
| 512 | 12.3046875% | 64 | 32768 | 0.0% |
| 640 | 19.84375% | 64 | 40960 | 0.0% |
| 768 | 16.536458333333343% | 64 | 49152 | 0.0% |
| 896 | 14.174107142857139% | 64 | 57344 | 0.0% |
| 1024 | 12.40234375% | 64 | 65536 | 0.0% |
| 1280 | 19.921875% | 16 | 20480 | 0.0% |
| 1536 | 16.6015625% | 16 | 24576 | 0.0% |
| 1792 | 14.229910714285708% | 16 | 28672 | 0.0% |
| 2048 | 12.451171875% | 16 | 32768 | 0.0% |
| 2560 | 19.9609375% | 8 | 20480 | 0.0% |
| 3072 | 16.634114583333343% | 8 | 24576 | 0.0% |
| 3584 | 14.2578125% | 8 | 28672 | 0.0% |
| 4096 | 12.4755859375% | 8 | 32768 | 0.0% |
| 5120 | 19.98046875% | 8 | 40960 | 0.0% |
| 6144 | 16.650390625% | 8 | 49152 | 0.0% |
| 7168 | 14.271763392857139% | 8 | 57344 | 0.0% |
| 8192 | 12.48779296875% | 8 | 65536 | 0.0% |
| 10240 | 19.990234375% | 6 | 61440 | 0.0% |
| 12288 | 16.658528645833343% | 5 | 61440 | 0.0% |
| 14336 | 14.278738839285708% | 4 | 57344 | 0.0% |
| 16384 | 12.493896484375% | 4 | 65536 | 0.0% |

The slab allocation size classes end at 16384 since that's the final size for
2048 byte spacing and the next spacing class matches the page size of 4096
bytes on the target platforms. This is the minimum set of small size classes
required to avoid substantial waste from rounding.

The `CONFIG_EXTENDED_SIZE_CLASSES` option extends the size classes up to
131072, with a final spacing class of 16384. This offers improved performance
compared to the minimum set of size classes. The security story is complicated,
since the slab allocation has both advantages like size class isolation
completely avoiding reuse of any of the address space for any other size
classes or other data. It also has disadvantages like caching a small number of
empty slabs and deterministic guard sizes. The cache will be configurable in
the future, making it possible to disable slab caching for the largest slab
allocation sizes, to force unmapping them immediately and putting them in the
slab quarantine, which eliminates most of the security disadvantage at the
expense of also giving up most of the performance advantage, but while
retaining the isolation.

| size class | worst case internal fragmentation | slab slots | slab size | internal fragmentation for slabs |
| - | - | - | - | - |
| 20480 | 19.9951171875% | 2 | 40960 | 0.0% |
| 24576 | 16.66259765625% | 2 | 49152 | 0.0% |
| 28672 | 14.2822265625% | 2 | 57344 | 0.0% |
| 32768 | 12.4969482421875% | 2 | 65536 | 0.0% |
| 40960 | 19.99755859375% | 1 | 40960 | 0.0% |
| 49152 | 16.664632161458343% | 1 | 49152 | 0.0% |
| 57344 | 14.283970424107139% | 1 | 57344 | 0.0% |
| 65536 | 12.49847412109375% | 1 | 65536 | 0.0% |
| 81920 | 19.998779296875% | 1 | 81920 | 0.0% |
| 98304 | 16.6656494140625% | 1 | 98304 | 0.0% |
| 114688 | 14.284842354910708% | 1 | 114688 | 0.0% |
| 131072 | 12.499237060546875% | 1 | 131072 | 0.0% |

The `CONFIG_LARGE_SIZE_CLASSES` option controls whether large allocations use
the same size class scheme providing 4 size classes for every doubling of size.
It increases virtual memory consumption but drastically improves performance
where realloc is used without proper growth factors, which is fairly common and
destroys performance in some commonly used programs. If large size classes are
disabled, the granularity is instead the page size, which is currently always
4096 bytes on supported platforms.

## Scalability

### Small (slab) allocations

As a baseline form of fine-grained locking, the slab allocator has entirely
separate allocators for each size class. Each size class has a dedicated lock,
CSPRNG and other state.

The slab allocator's scalability primarily comes from dividing up the slab
allocation region into independent arenas assigned to threads. The arenas are
just entirely separate slab allocators with their own sub-regions for each size
class. Using 4 arenas reserves a region 4 times as large and the relevant slab
allocator metadata is determined based on address, as part of the same approach
to finding the per-size-class metadata. The part that's still open to different
design choices is how arenas are assigned to threads. One approach is
statically assigning arenas via round-robin like the standard jemalloc
implementation, or statically assigning to a random arena which is essentially
the current implementation.  Another option is dynamic load balancing via a
heuristic like `sched_getcpu` for per-CPU arenas, which would offer better
performance than randomly choosing an arena each time while being more
predictable for an attacker. There are actually some security benefits from
this assignment being completely static, since it isolates threads from each
other. Static assignment can also reduce memory usage since threads may have
varying usage of size classes.

When there's substantial allocation or deallocation pressure, the allocator
does end up calling into the kernel to purge / protect unused slabs by
replacing them with fresh `PROT_NONE` regions along with unprotecting slabs
when partially filled and cached empty slabs are depleted.  There will be
configuration over the amount of cached empty slabs, but it's not entirely a
performance vs. memory trade-off since memory protecting unused slabs is a nice
opportunistic boost to security. However, it's not really part of the core
security model or features so it's quite reasonable to use much larger empty
slab caches when the memory usage is acceptable. It would also be reasonable to
attempt to use heuristics for dynamically tuning the size, but there's not a
great one size fits all approach so it isn't currently part of this allocator
implementation.

#### Thread caching (or lack thereof)

Thread caches are a commonly implemented optimization in modern allocators but
aren't very suitable for a hardened allocator even when implemented via arrays
like jemalloc rather than free lists. They would prevent the allocator from
having perfect knowledge about which memory is free in a way that's both race
free and works with fully out-of-line metadata. It would also interfere with
the quality of fine-grained randomization even with randomization support in
the thread caches. The caches would also end up with much weaker protection
than the dedicated metadata region. Potentially worst of all, it's inherently
incompatible with the important quarantine feature.

The primary benefit from a thread cache is performing batches of allocations
and batches of deallocations to amortize the cost of the synchronization used
by locking. The issue is not contention but rather the cost of synchronization
itself. Performing operations in large batches isn't necessarily a good thing
in terms of reducing contention to improve scalability. Large thread caches
like TCMalloc are a legacy design choice and aren't a good approach for a
modern allocator. In jemalloc, thread caches are fairly small and have a form
of garbage collection to clear them out when they aren't being heavily used.
Since this is a hardened allocator with a bunch of small costs for the security
features, the synchronization is already a smaller percentage of the overall
time compared to a much leaner performance-oriented allocator. These benefits
could be obtained via allocation queues and deallocation queues which would
avoid bypassing the quarantine and wouldn't have as much of an impact on
randomization. However, deallocation queues would also interfere with having
global knowledge about what is free. An allocation queue alone wouldn't have
many drawbacks, but it isn't currently planned even as an optional feature
since it probably wouldn't be enabled by default and isn't worth the added
complexity.

The secondary benefit of thread caches is being able to avoid the underlying
allocator implementation entirely for some allocations and deallocations when
they're mixed together rather than many allocations being done together or many
frees being done together. The value of this depends a lot on the application
and it's entirely unsuitable / incompatible with a hardened allocator since it
bypasses all of the underlying security and would destroy much of the security
value.

### Large allocations

The expectation is that the allocator does not need to perform well for large
allocations, especially in terms of scalability. When the performance for large
allocations isn't good enough, the approach will be to enable more slab
allocation size classes. Doubling the maximum size of slab allocations only
requires adding 4 size classes while keeping internal waste bounded below 20%.

Large allocations are implemented as a wrapper on top of the kernel memory
mapping API. The addresses and sizes are tracked in a global data structure
with a global lock. The current implementation is a hash table and could easily
use fine-grained locking, but it would have little benefit since most of the
locking is in the kernel. Most of the contention will be on the `mmap_sem` lock
for the process in the kernel. Ideally, it could simply map memory when
allocating and unmap memory when freeing. However, this is a hardened allocator
and the security features require extra system calls due to lack of direct
support for this kind of hardening in the kernel. Randomly sized guard regions
are placed around each allocation which requires mapping a `PROT_NONE` region
including the guard regions and then unprotecting the usable area between them.
The quarantine implementation requires clobbering the mapping with a fresh
`PROT_NONE` mapping using `MAP_FIXED` on free to hold onto the region while
it's in the quarantine, until it's eventually unmapped when it's pushed out of
the quarantine. This means there are 2x as many system calls for allocating and
freeing as there would be if the kernel supported these features directly.

## Memory tagging

Integrating extensive support for ARMv8.5 memory tagging is planned and this
section will be expanded cover the details on the chosen design. The approach
for slab allocations is currently covered, but it can also be used for the
allocator metadata region and large allocations.

Memory allocations are already always multiples of naturally aligned 16 byte
units, so memory tags are a natural fit into a malloc implementation due to the
16 byte alignment requirement. The only extra memory consumption will come from
the hardware supported storage for the tag values (4 bits per 16 bytes).

The baseline policy will be to generate random tags for each slab allocation
slot on first use. The highest value will be reserved for marking freed memory
allocations to detect any accesses to freed memory so it won't be part of the
generated range. Adjacent slots will be guaranteed to have distinct memory tags
in order to guarantee that linear overflows are detected. There are a few ways
of implementing this and it will end up depending on the performance costs of
different approaches. If there's an efficient way to fetch the adjacent tag
values without wasting extra memory, it will be possible to check for them and
skip them either by generating a new random value in a loop or incrementing
past them since the tiny bit of bias wouldn't matter. Another approach would be
alternating odd and even tag values but that would substantially reduce the
overall randomness of the tags and there's very little entropy from the start.

Once a slab allocation has been freed, the tag will be set to the reserved
value for free memory and the previous tag value will be stored inside the
allocation itself. The next time the slot is allocated, the chosen tag value
will be the previous value incremented by one to provide use-after-free
detection between generations of allocations. The stored tag will be wiped
before retagging the memory, to avoid leaking it and as part of preserving the
security property of newly allocated memory being zeroed due to zero-on-free.
It will eventually wrap all the way around, but this ends up providing a strong
guarantee for many allocation cycles due to the combination of 4 bit tags with
the FIFO quarantine feature providing delayed free. It also benefits from
random slot allocation and the randomized portion of delayed free, which result
in a further delay along with preventing a deterministic bypass by forcing a
reuse after a certain number of allocation cycles. Similarly to the initial tag
generation, tag values for adjacent allocations will be skipped by incrementing
past them.

For example, consider this slab of allocations that are not yet used with 16
representing the tag for free memory. For the sake of simplicity, there will be
no quarantine or other slabs for this example:

    | 16 | 16 | 16 | 16 | 16 | 16 |

Three slots are randomly chosen for allocations, with random tags assigned (2,
15, 7) since these slots haven't ever been used and don't have saved values:

    | 16 | 2  | 16 | 15 |  7 | 16 |

The 2nd allocation slot is freed, and is set back to the tag for free memory
(16), but with the previous tag value stored in the freed space:

    | 16 | 16 | 16 | 7  | 15 | 16 |

The first slot is allocated for the first time, receiving the random value 3:

    | 3  | 16 | 16 | 7  | 15 | 16 |

The 2nd slot is randomly chosen again, so the previous tag (2) is retrieved and
incremented to 3 as part of the use-after-free mitigation. An adjacent
allocation already uses the tag 3, so the tag is further incremented to 4 (it
would be incremented to 5 if one of the adjacent tags was 4):

    | 3  | 4  | 16 | 7  | 15 | 16 |

The last slot is randomly chosen for the next alocation, and is assigned the
random value 15. However, it's placed next to an allocation with the tag 15 so
the tag is incremented and wraps around to 0:

    | 3  | 4  | 16 | 7  | 15 | 0 |

## API extensions

The `void free_sized(void *ptr, size_t expected_size)` function exposes the
sized deallocation sanity checks for C. A performance-oriented allocator could
use the same API as an optimization to avoid a potential cache miss from
reading the size from metadata.

The `size_t malloc_object_size(void *ptr)` function returns an *upper bound* on
the accessible size of the relevant object (if any) by querying the malloc
implementation. It's similar to the `__builtin_object_size` intrinsic used by
`_FORTIFY_SOURCE` but via dynamically querying the malloc implementation rather
than determining constant sizes at compile-time. The current implementation is
just a naive placeholder returning much looser upper bounds than the intended
implementation. It's a valid implementation of the API already, but it will
become fully accurate once it's finished. This function is **not** currently
safe to call from signal handlers, but another API will be provided to make
that possible with a compile-time configuration option to avoid the necessary
overhead if the functionality isn't being used (in a way that doesn't change
break API compatibility based on the configuration).

The `size_t malloc_object_size_fast(void *ptr)` is comparable, but avoids
expensive operations like locking or even atomics. It provides significantly
less useful results falling back to higher upper bounds, but is very fast. In
this implementation, it retrieves an upper bound on the size for small memory
allocations based on calculating the size class region. This function is safe
to use from signal handlers already.

## System calls

This is intended to aid with creating system call whitelists via seccomp-bpf
and will change over time.

System calls used by all build configurations:

* `futex(uaddr, FUTEX_WAIT_PRIVATE, val, NULL)` (via `pthread_mutex_lock`)
* `futex(uaddr, FUTEX_WAKE_PRIVATE, val)` (via `pthread_mutex_unlock`)
* `getrandom(buf, buflen, 0)` (to seed and regularly reseed the CSPRNG)
* `mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0)`
* `mmap(ptr, size, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0)`
* `mprotect(ptr, size, PROT_READ)`
* `mprotect(ptr, size, PROT_READ|PROT_WRITE)`
* `mremap(old, old_size, new_size, 0)`
* `mremap(old, old_size, new_size, MREMAP_MAYMOVE|MREMAP_FIXED, new)`
* `munmap`
* `write(STDERR_FILENO, buf, len)` (before aborting due to memory corruption)

Additional system calls when `CONFIG_SEAL_METADATA=true` is set:

* `pkey_alloc`
* `pkey_mprotect` instead of `mprotect` with an additional `pkey` parameter,
  but otherwise the same (regular `mprotect` is never called)
* `uname` (to detect old buggy kernel versions)

Additional system calls for Android builds with `LABEL_MEMORY`:

* `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr, size, name)`