darling-dyld/IMPCaches.md
2023-04-29 11:24:58 -07:00

IMP caches generation

Principle

When you call an Objective-C method, Objective-C looks for the actual IMP (function pointer) given the class and selector. It then stores, in malloced memory, this [selector, IMP] pair in a hash table.

These hash tables are:

  • imperfect: the hash function is just a mask applied to the selector's address, so if two selectors have the same low bits, you will get a collision. Collisions are resolved through linear probing.
  • expensive: the footprint of these hash tables is often around 70MB total on a carry device.
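
As a rough illustration of the dynamic tables described above, here is a toy model in Python. The class name `DynamicCache`, its capacity and the sample addresses are made up for this sketch; the real runtime layout and probing order are not described in this document and may differ.

```python
class DynamicCache:
    """Toy model of a per-class runtime method cache (not the real objc layout)."""

    def __init__(self, capacity=8):
        # Capacity must be a power of two so that `capacity - 1` is a bit mask.
        assert capacity & (capacity - 1) == 0
        self.mask = capacity - 1
        self.slots = [None] * capacity  # each slot holds (selector_address, imp)

    def insert(self, sel, imp):
        index = sel & self.mask  # the "hash" is only the low bits of the address
        for _ in range(len(self.slots)):
            if self.slots[index] is None or self.slots[index][0] == sel:
                self.slots[index] = (sel, imp)
                return index
            index = (index + 1) & self.mask  # collision: linear probing
        raise RuntimeError("cache full")

    def lookup(self, sel):
        index = sel & self.mask
        for _ in range(len(self.slots)):
            entry = self.slots[index]
            if entry is None:
                return None
            if entry[0] == sel:
                return entry[1]
            index = (index + 1) & self.mask
        return None


cache = DynamicCache(capacity=8)
first = cache.insert(0x1008, "imp_a")   # 0x1008 & 7 == 0
second = cache.insert(0x2010, "imp_b")  # same low bits, so it probes to the next slot
```

The two sample selectors share their low three bits, so the second one cannot land in its home slot. A perfect hash removes exactly this probing.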

We want to replace this with static hash tables, inside the shared cache, where you have:

  • a perfect hash function
  • tables that live in clean memory

Hash function

Because the hash function is executed as part of objc_msgSend, it has to be blazingly fast. The algorithm below explains how we can make a hash function of the form h(x) = (x >> shift) & mask work, where shift and mask are two class-specific parameters.

Note that we cannot use traditional perfect hash table techniques here, as they generally rely on hash functions more expensive than a shift and a mask.

The idea is that because the input of the hash function is a selector's address, we have some control over the input... because the dyld shared cache builder is the one placing all the selectors.

So now the problem becomes: given a set of classes and the selectors they implement, how do you place the selectors, and for each class, how do you find a shift and mask so that the hash table generated by (selector >> shift) & mask is perfect?

(Note that the shift + mask idiom lets us use various "bit stripes" of the selector's address for various hash tables).
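
To make the "perfect" criterion concrete, the following sketch checks whether a given shift and mask map a class's selector addresses to distinct slots. The helper names (`slot`, `is_perfect`, `find_shift`) and the sample addresses are illustrative only and do not come from the builder.

```python
def slot(address, shift, mask):
    """The hash from the text: h(x) = (x >> shift) & mask."""
    return (address >> shift) & mask


def is_perfect(addresses, shift, mask):
    """True when every address lands in its own slot."""
    slots = [slot(a, shift, mask) for a in addresses]
    return len(set(slots)) == len(slots)


def find_shift(addresses, mask, max_shift=32):
    """Smallest shift for which the table is perfect, or None."""
    for shift in range(max_shift):
        if is_perfect(addresses, shift, mask):
            return shift
    return None


addresses = [0x100, 0x200, 0x300]
best = find_shift(addresses, mask=0x3)
```

Here the addresses only differ in bits 8 and 9, so a plain mask (shift 0) collides and the search has to move the two-bit stripe up to those bits. In the real builder the addresses are not given in advance: they are chosen so that such a stripe exists for every class.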

Algorithm

There are basically two steps to the main algorithm:

Finding shifts and masks

Assign some of the high bits of the selector's address, and find a shift and a mask for each class. This is a backtracking algorithm which goes through each class, one after another. As it goes through classes, it finds a shift and a mask compatible with the bit ranges that are already set on each selector's address, and assigns the corresponding bits of the selector's address. Note that because addresses are partial at this point (some of the bits are unset), it is very difficult to check for address collisions, so several selectors may end up at the same address. These collisions are dealt with in the second step.

At each step of the algorithm, we go through a set of possible shifts and masks until we find one which works. If none work, we let the hash table grow to one more bit to make our job slightly easier. In practice few hash tables grow one more bit (82 out of 18k). If we cannot find a suitable shift and mask after backtracking a few times, we'll also allow ourselves to drop a class from the set and not generate an IMP cache for that one. This happens for a dozen classes or so with the current data set.
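
The per-class search can be sketched as follows. This is a simplified, greedy version written for illustration: it does not backtrack across classes and never drops a class, and the names `Selector`, `try_shift` and `place_class`, the candidate shift range and the sample classes are all assumptions of this sketch rather than the code in IMPCaches.cpp.

```python
from dataclasses import dataclass


@dataclass
class Selector:
    name: str
    bits: int = 0    # partial address: values of the bits decided so far
    fixed: int = 0   # bitmask of the address bits that are already decided


def try_shift(selectors, shift, mask):
    """Return {selector name: slot} if every selector can get its own slot, else None."""
    stripe = mask << shift
    used, chosen = set(), {}
    # Most constrained first: more fixed bits in the stripe means fewer options.
    order = sorted(selectors, key=lambda s: -bin(s.fixed & stripe).count("1"))
    for sel in order:
        for slot in range(mask + 1):
            # The slot must agree with every stripe bit that is already fixed.
            compatible = ((slot << shift) ^ sel.bits) & sel.fixed & stripe == 0
            if compatible and slot not in used:
                used.add(slot)
                chosen[sel.name] = slot
                break
        else:
            return None
    return chosen


def place_class(selectors, shifts=range(7, 24), max_extra_bits=1):
    """Find (shift, mask) for one class and commit the chosen bits to its selectors."""
    n_bits = max(1, (len(selectors) - 1).bit_length())
    for extra in range(max_extra_bits + 1):  # let the table grow by one bit if needed
        mask = (1 << (n_bits + extra)) - 1
        for shift in shifts:
            chosen = try_shift(selectors, shift, mask)
            if chosen is not None:
                for sel in selectors:
                    sel.bits |= chosen[sel.name] << shift
                    sel.fixed |= mask << shift
                return shift, mask
    return None


foo, bar, baz, qux = (Selector(n) for n in ("foo", "bar", "baz", "qux"))
result_a = place_class([foo, bar, baz])
result_b = place_class([bar, baz, qux])
result_c = place_class([foo, qux])
```

After the first two classes, `foo` and `qux` have identical partial addresses, which is harmless until a third class contains both of them. That class cannot reuse the bits already fixed and has to pick a stripe further up.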

Next we have to deal with collisions, and constraints on the addresses themselves because each selector is... a char*, which has a length. You cannot have an overlap between two selectors. So the idea here is to try to get rid of this constraint by assigning in step 1 just the address of a "bucket" which is 128 bytes long (7 bits). So step 1 will assign the high bits of the address (which will be the address of the bucket) and we can then place the selectors linearly in the buckets.

Shuffling selectors around

Then you have to deal with address collisions. If the lengths of the selectors in a given bucket add up to more than 128, you have to move some selectors out. So step 2 goes through all the selectors, checks whether each one fits within the bucket it is supposed to be in, and if it does not, finds another suitable bucket.

To do so, it iterates through all the classes the selector is in, and builds a "constraint set" applying to that selector's bucket's address (by basically looking at which slots in each hash table are free and combining all these constraints, which impact a different "stripe" of the address due to the shifts and masks). Once we have the constraint set, we can find a different bucket for the selector without changing the shifts and masks. (If there is no other possible bucket, we allow ourselves to drop the classes involving this selector, too).
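
A brute-force sketch of the constraint set idea: each class contributes one constraint on its own stripe of the address, and a bucket is acceptable when its base address satisfies all of them. The `Constraint` fields mirror the description of IMPCaches::Constraint further down, but the enumeration strategy, the helper names and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass

BUCKET_BITS = 7  # buckets are 128 bytes long


@dataclass(frozen=True)
class Constraint:
    shift: int          # the class's shift (assumed >= BUCKET_BITS in this sketch)
    mask: int           # the class's mask
    allowed: frozenset  # slot values still free in that class's table


def satisfies(address, constraints):
    return all(((address >> c.shift) & c.mask) in c.allowed for c in constraints)


def candidate_buckets(constraints, bucket_count):
    """Yield bucket indices whose base address satisfies every constraint."""
    for bucket in range(bucket_count):
        if satisfies(bucket << BUCKET_BITS, constraints):
            yield bucket


constraints = [
    Constraint(shift=7, mask=0x3, allowed=frozenset({0, 3})),  # class A: slots 0 and 3 free
    Constraint(shift=9, mask=0x1, allowed=frozenset({1})),     # class B: only slot 1 free
]
buckets = list(candidate_buckets(constraints, bucket_count=16))
```

Because every shift is at least 7 in this sketch, the low 7 bits never influence a slot, so checking the bucket's base address is enough.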

Once each selector has a valid bucket, you can simply assign the low 7 bits of each address by looking at which selectors are in each bucket and looking at their lengths.
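
Assigning the low 7 bits can be sketched as a linear packing of each bucket. The function name and the sample buckets are illustrative, and counting one extra byte per selector for the NUL terminator is an assumption of this sketch.

```python
BUCKET_BITS = 7
BUCKET_SIZE = 1 << BUCKET_BITS  # 128 bytes


def assign_low_bits(buckets):
    """buckets maps a bucket index to the selector names placed in it.

    Returns {name: offset from the start of the selectors section}.
    """
    offsets = {}
    for index, names in buckets.items():
        cursor = 0
        for name in names:
            size = len(name.encode("utf-8")) + 1  # assumed: C string plus NUL terminator
            if cursor + size > BUCKET_SIZE:
                raise ValueError(f"bucket {index} overflows at {name!r}")
            # High bits come from the bucket, low 7 bits from the position inside it.
            offsets[name] = (index << BUCKET_BITS) | cursor
            cursor += size
    return offsets


offsets = assign_low_bits({0: ["init", "dealloc"], 3: ["count"]})
```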

The problem is hard but not that hard given the number of classes we are targeting with this optimization: we are roughly targeting ~ 20k classes out of 120k and ~ 200k selectors out of 900k, so we have lots of "free space".

Note that any holes we leave in the bucketization of the selectors can be filled later by placing all the selectors not targeted by the optimization and any selectors from OS executables inside them.
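
One possible way to track and reuse those holes is sketched below, under the assumption that each bucket is filled from its start so that the free space is its tail. The first-fit policy and the names `find_holes` and `fill_holes` are choices made for this example, not a description of IMPCaches::HoleMap.

```python
BUCKET_SIZE = 128


def find_holes(used_bytes_per_bucket, bucket_count):
    """Return [(offset, length)] for the free tail of each bucket, merging neighbours."""
    holes = []
    for bucket in range(bucket_count):
        used = used_bytes_per_bucket.get(bucket, 0)
        if used >= BUCKET_SIZE:
            continue
        start = bucket * BUCKET_SIZE + used
        length = BUCKET_SIZE - used
        if holes and holes[-1][0] + holes[-1][1] == start:
            # The previous hole runs right up to this one: extend it.
            holes[-1] = (holes[-1][0], holes[-1][1] + length)
        else:
            holes.append((start, length))
    return holes


def fill_holes(holes, names):
    """First-fit placement of extra selectors; returns ({name: offset}, unplaced names)."""
    holes = [list(h) for h in holes]
    placed, unplaced = {}, []
    for name in names:
        size = len(name.encode("utf-8")) + 1  # assumed: NUL-terminated C string
        for hole in holes:
            if hole[1] >= size:
                placed[name] = hole[0]
                hole[0] += size
                hole[1] -= size
                break
        else:
            unplaced.append(name)
    return placed, unplaced


holes = find_holes({0: 100, 1: 128}, bucket_count=4)
placed, unplaced = fill_holes(holes, ["description", "a" * 40 + ":"])
```

Selectors that are not part of the optimization have no bucket constraint, which is why holes spanning several empty buckets can be merged here.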

Shared cache builder setup

The optimization is guided by a file dropped by the OrderFiles project (/AppleInternal/OrderFiles/shared-cache-objc-optimizations.json). The performance team is responsible for updating this file by looking at the caches on live or StressCycler devices with objcdt dump-imp-caches.

This file specifies a list of classes and metaclasses the algorithm targets, and a list of flattening roots.

We haven't explained flattening yet. When a class D(aughter) inherits from a class M(other), we should in theory add all of M's methods to D's IMP cache. However:

  • This constrains the problem too much. The solver's job will be too difficult if the same selectors are in thousands of classes.
  • This makes the IMP cache sizes blow up.

So our scheme is to target only classes which are leaves in the inheritance tree with this optimization. For any selector that comes from parent classes, the cache lookup will fail and fall back to looking for the method in super. Because super is not a leaf class, it will have a dynamically allocated IMP cache and can cache there any selector that comes from the parents.

However, this means that some very interesting classes from a memory standpoint (because we find their caches in many processes) get excluded because they have child classes. A solution to this is to turn on selector inheritance (add Mother's selectors to Daughter's cache) starting at some flattening root. Then the IMP cache will have a "fallback class" that is the superclass of the flattening root, and objc_msgSend will fall back to that class if it cannot find the selector in the child cache, skipping over all the classes up to the flattening root (because we know all the selectors in that chain will be present in the child cache).
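
The effect of a flattening root on one class can be sketched like this. The function name, the dictionary-based class model and the sample hierarchy are hypothetical; only the rule itself (inherit selectors up to the flattening root, then fall back to the root's superclass) comes from the text above.

```python
def flattened_cache(cls, superclass, methods, flattening_roots):
    """Return (selector names for cls's cache, fallback class).

    superclass: {class name: parent name or None}
    methods:    {class name: set of selector names the class implements}
    """
    chain = []
    current = cls
    while current is not None:
        chain.append(current)
        if current in flattening_roots:
            # Inherit every selector from cls up to and including the root.
            selectors = set().union(*(methods[c] for c in chain))
            return selectors, superclass[current]
        current = superclass[current]
    # No flattening root above cls: only its own methods, fall back to its superclass.
    return set(methods[cls]), superclass[cls]


superclass = {"NSObject": None, "Mother": "NSObject", "Daughter": "Mother"}
methods = {"NSObject": {"init"}, "Mother": {"cook"}, "Daughter": {"play"}}

flat = flattened_cache("Daughter", superclass, methods, {"Mother"})
plain = flattened_cache("Daughter", superclass, methods, set())
```

With `Mother` as a flattening root, a miss in `Daughter`'s cache skips `Mother` entirely and goes straight to `NSObject`.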

Very early into the shared cache builder's life, the algorithm described above runs. To do so, it parses all of the source dylibs to find out which methods will end up in which class's cache (see section "What ends up in each cache?" below). The output of the algorithm is:

  • for each class:
    • a shift and a mask
    • a list of methods that will end up in the cache with (source install name, source class name, source category name)
  • for each selector: an offset at which it needs to be placed relative to the beginning of the selectors section
  • a list of holes in the selector address space (or more accurately offset space) that can be used to add additional selectors not targeted by the algorithm

Then, we do all the selector deduping work, and when we get to the ObjC optimizer:

  • we go through all the classes again to build a map from (source install name, source class name, source category name) to the actual IMP's address (which is not known before the shared cache is laid out)
  • we go through all the IMP caches returned by the algorithm, find the corresponding IMP, and emit the actual IMP cache in the objc optimizer's RO region.
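
The emission step can be pictured as below. This is only a sketch: the in-memory table shape, the function name `emit_cache`, the install name and the addresses are invented for the example, and selector offsets are treated as addresses for simplicity. The actual on-disk format is not described in this document.

```python
def emit_cache(shift, mask, methods, selector_offsets, imp_addresses):
    """Build one class's table.

    methods:          list of (selector name, method key), where the key is
                      (source install name, source class name, source category name)
    selector_offsets: {selector name: offset chosen by the placement algorithm}
    imp_addresses:    {method key: IMP address, known once the cache is laid out}
    """
    table = [None] * (mask + 1)
    for sel_name, key in methods:
        offset = selector_offsets[sel_name]
        index = (offset >> shift) & mask
        if table[index] is not None:
            raise ValueError("hash is not perfect for this class")
        table[index] = (offset, imp_addresses[key])
    return table


init_key = ("/usr/lib/libfoo.dylib", "Foo", None)        # hypothetical install name
count_key = ("/usr/lib/libfoo.dylib", "Foo", "Extras")   # hypothetical category
table = emit_cache(
    shift=7,
    mask=0x3,
    methods=[("init", init_key), ("count", count_key)],
    selector_offsets={"init": 0x80, "count": 0x185},
    imp_addresses={init_key: 0x1000, count_key: 0x2000},
)
```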

Some code pointers

Most of the logic lives in the single C++ file IMPCaches.cpp and the two headers IMPCaches.hpp and IMPCachesBuilder.hpp. A tour:

Types

IMPCaches::Selector : A unique selector. Has a name, a list of classes it's used in, and an address (which is actually an offset from the beginning of the selectors section). Due to how the placement algorithm works, it also has a current partial address and the corresponding bitmask of fixed bits. The algorithm adds bits to the partial address as it makes progress and updates the mask accordingly.

IMPCaches::AddressSpace : a representation of the address space available in the selectors section, so that step 2 of the algorithm can check if selector buckets are overflowing.

IMPCaches::HoleMap : represents the holes left by the selector placement algorithm, to be filled later with other selectors we did not target.

IMPCaches::Constraint : represents a constraint on some of the bits of an address. It stores a set of allowed values for a given range of bits (shift and mask)

IMPCachesBuilder : what actually computes the caches (part of the algorithm; what actually lays down the caches lives in the ObjC optimizer)

Entry points

  • IMPCachesBuilder::parseDylibs : parses the source dylibs to figure out which method ends up in which cache.

  • IMPCachesBuilder::findShiftsAndMasks : finds the shift and mask for each class (see "Finding shifts and masks" above).

  • IMPCachesBuilder::solveGivenShiftsAndMasks : moves selectors around to make sure each bucket contains at most 128 bytes' worth of selectors.

Then in OptimizerObjC:

  • IMPMapBuilder : Builds a map from (install name, class name, method name) to actual IMPs

  • IMPCachesEmitter : emits the actual IMP caches in the shared cache.

What ends up in each cache?

IMPCachesBuilder::parseDylibs goes through all the classes and categories to decide, for each (class,selector) pair, which implementation we will actually call at runtime. The sequence is:

  • Get all the methods from the method lists
  • Attach any methods from non-cross-image categories (when we have cross-image categories, we prevent the class from getting an IMP cache). If there is any collision at this point we'll use the implementation from the main class (or from the first category we found).
  • Then we may inline some of the implementations from superclasses, see the explanation on flattening above.
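
The first two steps of that sequence can be sketched as follows. The data model (plain dictionaries, image names as strings) and the function name are assumptions of this example; the precedence rule is the one stated above, with the main class winning and then the first category found.

```python
def resolve_methods(class_image, class_methods, categories):
    """Decide which implementation each selector of a class resolves to.

    class_methods: {selector name: implementation} from the class's method lists
    categories:    list of (image, {selector name: implementation}) in discovery order
    Returns {selector name: implementation}, or None when the class cannot
    get an IMP cache because one of its categories lives in another image.
    """
    if any(image != class_image for image, _ in categories):
        return None  # cross-image category: no IMP cache for this class
    resolved = dict(class_methods)
    for _, methods in categories:
        for sel, imp in methods.items():
            # Keep the earlier entry on collision: main class, then first category.
            resolved.setdefault(sel, imp)
    return resolved


same_image = resolve_methods(
    "libA",
    {"foo": "C.foo"},
    [("libA", {"foo": "Cat1.foo", "bar": "Cat1.bar"}), ("libA", {"bar": "Cat2.bar"})],
)
cross_image = resolve_methods("libA", {"foo": "C.foo"}, [("libB", {"bar": "Cat.bar"})])
```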