mirror of
https://gitee.com/openharmony/kernel_linux
synced 2025-01-18 22:41:14 +00:00
Daniel Borkmann
3f60875683
bpf: Fix leakage due to insufficient speculative store bypass mitigation
mainline inclusion from mainline-v5.14-rc5 commit 2039f26f3aca5b0e419b98f65dd36481337b86ee category: bugfix issue: #I458AV CVE: CVE-2021-35477 --------------------------- Spectre v4 gadgets make use of memory disambiguation, which is a set of techniques that execute memory access instructions, that is, loads and stores, out of program order; Intel's optimization manual, section 2.4.4.5: A load instruction micro-op may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1 data cache. Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed. af86ca4e3088 ("bpf: Prevent memory disambiguation attack") tried to mitigate this attack by sanitizing the memory locations through preemptive "fast" (low latency) stores of zero prior to the actual "slow" (high latency) store of a pointer value such that upon dependency misprediction the CPU then speculatively executes the load of the pointer value and retrieves the zero value instead of the attacker controlled scalar value previously stored at that location, meaning, subsequent access in the speculative domain is then redirected to the "zero page". The sanitized preemptive store of zero prior to the actual "slow" store is done through a simple ST instruction based on r10 (frame pointer) with relative offset to the stack location that the verifier has been tracking on the original used register for STX, which does not have to be r10. Thus, there are no memory dependencies for this store, since it's only using r10 and immediate constant of zero; hence af86ca4e3088 /assumed/ a low latency operation. However, a recent attack demonstrated that this mitigation is not sufficient since the preemptive store of zero could also be turned into a "slow" store and is thus bypassed as well: [...] // r2 = oob address (e.g. scalar) // r7 = pointer to map value 31: (7b) *(u64 *)(r10 -16) = r2 // r9 will remain "fast" register, r10 will become "slow" register below 32: (bf) r9 = r10 // JIT maps BPF reg to x86 reg: // r9 -> r15 (callee saved) // r10 -> rbp // train store forward prediction to break dependency link between both r9 // and r10 by evicting them from the predictor's LRU table. 33: (61) r0 = *(u32 *)(r7 +24576) 34: (63) *(u32 *)(r7 +29696) = r0 35: (61) r0 = *(u32 *)(r7 +24580) 36: (63) *(u32 *)(r7 +29700) = r0 37: (61) r0 = *(u32 *)(r7 +24584) 38: (63) *(u32 *)(r7 +29704) = r0 39: (61) r0 = *(u32 *)(r7 +24588) 40: (63) *(u32 *)(r7 +29708) = r0 [...] 543: (61) r0 = *(u32 *)(r7 +25596) 544: (63) *(u32 *)(r7 +30716) = r0 // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain // in hardware registers. rbp becomes slow due to push/pop latency. below is // disasm of bpf_ringbuf_output() helper for better visual context: // // ffffffff8117ee20: 41 54 push r12 // ffffffff8117ee22: 55 push rbp // ffffffff8117ee23: 53 push rbx // ffffffff8117ee24: 48 f7 c1 fc ff ff ff test rcx,0xfffffffffffffffc // ffffffff8117ee2b: 0f 85 af 00 00 00 jne ffffffff8117eee0 <-- jump taken // [...] // ffffffff8117eee0: 49 c7 c4 ea ff ff ff mov r12,0xffffffffffffffea // ffffffff8117eee7: 5b pop rbx // ffffffff8117eee8: 5d pop rbp // ffffffff8117eee9: 4c 89 e0 mov rax,r12 // ffffffff8117eeec: 41 5c pop r12 // ffffffff8117eeee: c3 ret 545: (18) r1 = map[id:4] 547: (bf) r2 = r7 548: (b7) r3 = 0 549: (b7) r4 = 4 550: (85) call bpf_ringbuf_output#194288 // instruction 551 inserted by verifier \ 551: (7a) *(u64 *)(r10 -16) = 0 | /both/ are now slow stores here // storing map value pointer r7 at fp-16 | since value of r10 is "slow". 552: (7b) *(u64 *)(r10 -16) = r7 / // following "fast" read to the same memory location, but due to dependency // misprediction it will speculatively execute before insn 551/552 completes. 553: (79) r2 = *(u64 *)(r9 -16) // in speculative domain contains attacker controlled r2. in non-speculative // domain this contains r7, and thus accesses r7 +0 below. 554: (71) r3 = *(u8 *)(r2 +0) // leak r3 As can be seen, the current speculative store bypass mitigation which the verifier inserts at line 551 is insufficient since /both/, the write of the zero sanitation as well as the map value pointer are a high latency instruction due to prior memory access via push/pop of r10 (rbp) in contrast to the low latency read in line 553 as r9 (r15) which stays in hardware registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally, fp-16 can still be r2. Initial thoughts to address this issue was to track spilled pointer loads from stack and enforce their load via LDX through r10 as well so that /both/ the preemptive store of zero /as well as/ the load use the /same/ register such that a dependency is created between the store and load. However, this option is not sufficient either since it can be bypassed as well under speculation. An updated attack with pointer spill/fills now _all_ based on r10 would look as follows: [...] // r2 = oob address (e.g. scalar) // r7 = pointer to map value [...] // longer store forward prediction training sequence than before. 2062: (61) r0 = *(u32 *)(r7 +25588) 2063: (63) *(u32 *)(r7 +30708) = r0 2064: (61) r0 = *(u32 *)(r7 +25592) 2065: (63) *(u32 *)(r7 +30712) = r0 2066: (61) r0 = *(u32 *)(r7 +25596) 2067: (63) *(u32 *)(r7 +30716) = r0 // store the speculative load address (scalar) this time after the store // forward prediction training. 2068: (7b) *(u64 *)(r10 -16) = r2 // preoccupy the CPU store port by running sequence of dummy stores. 2069: (63) *(u32 *)(r7 +29696) = r0 2070: (63) *(u32 *)(r7 +29700) = r0 2071: (63) *(u32 *)(r7 +29704) = r0 2072: (63) *(u32 *)(r7 +29708) = r0 2073: (63) *(u32 *)(r7 +29712) = r0 2074: (63) *(u32 *)(r7 +29716) = r0 2075: (63) *(u32 *)(r7 +29720) = r0 2076: (63) *(u32 *)(r7 +29724) = r0 2077: (63) *(u32 *)(r7 +29728) = r0 2078: (63) *(u32 *)(r7 +29732) = r0 2079: (63) *(u32 *)(r7 +29736) = r0 2080: (63) *(u32 *)(r7 +29740) = r0 2081: (63) *(u32 *)(r7 +29744) = r0 2082: (63) *(u32 *)(r7 +29748) = r0 2083: (63) *(u32 *)(r7 +29752) = r0 2084: (63) *(u32 *)(r7 +29756) = r0 2085: (63) *(u32 *)(r7 +29760) = r0 2086: (63) *(u32 *)(r7 +29764) = r0 2087: (63) *(u32 *)(r7 +29768) = r0 2088: (63) *(u32 *)(r7 +29772) = r0 2089: (63) *(u32 *)(r7 +29776) = r0 2090: (63) *(u32 *)(r7 +29780) = r0 2091: (63) *(u32 *)(r7 +29784) = r0 2092: (63) *(u32 *)(r7 +29788) = r0 2093: (63) *(u32 *)(r7 +29792) = r0 2094: (63) *(u32 *)(r7 +29796) = r0 2095: (63) *(u32 *)(r7 +29800) = r0 2096: (63) *(u32 *)(r7 +29804) = r0 2097: (63) *(u32 *)(r7 +29808) = r0 2098: (63) *(u32 *)(r7 +29812) = r0 // overwrite scalar with dummy pointer; same as before, also including the // sanitation store with 0 from the current mitigation by the verifier. 2099: (7a) *(u64 *)(r10 -16) = 0 | /both/ are now slow stores here 2100: (7b) *(u64 *)(r10 -16) = r7 | since store unit is still busy. // load from stack intended to bypass stores. 2101: (79) r2 = *(u64 *)(r10 -16) 2102: (71) r3 = *(u8 *)(r2 +0) // leak r3 [...] Looking at the CPU microarchitecture, the scheduler might issue loads (such as seen in line 2101) before stores (line 2099,2100) because the load execution units become available while the store execution unit is still busy with the sequence of dummy stores (line 2069-2098). And so the load may use the prior stored scalar from r2 at address r10 -16 for speculation. The updated attack may work less reliable on CPU microarchitectures where loads and stores share execution resources. This concludes that the sanitizing with zero stores from af86ca4e3088 ("bpf: Prevent memory disambiguation attack") is insufficient. Moreover, the detection of stack reuse from af86ca4e3088 where previously data (STACK_MISC) has been written to a given stack slot where a pointer value is now to be stored does not have sufficient coverage as precondition for the mitigation either; for several reasons outlined as follows: 1) Stack content from prior program runs could still be preserved and is therefore not "random", best example is to split a speculative store bypass attack between tail calls, program A would prepare and store the oob address at a given stack slot and then tail call into program B which does the "slow" store of a pointer to the stack with subsequent "fast" read. From program B PoV such stack slot type is STACK_INVALID, and therefore also must be subject to mitigation. 2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr) condition, for example, the previous content of that memory location could also be a pointer to map or map value. Without the fix, a speculative store bypass is not mitigated in such precondition and can then lead to a type confusion in the speculative domain leaking kernel memory near these pointer types. While brainstorming on various alternative mitigation possibilities, we also stumbled upon a retrospective from Chrome developers [0]: [...] For variant 4, we implemented a mitigation to zero the unused memory of the heap prior to allocation, which cost about 1% when done concurrently and 4% for scavenging. Variant 4 defeats everything we could think of. We explored more mitigations for variant 4 but the threat proved to be more pervasive and dangerous than we anticipated. For example, stack slots used by the register allocator in the optimizing compiler could be subject to type confusion, leading to pointer crafting. Mitigating type confusion for stack slots alone would have required a complete redesign of the backend of the optimizing compiler, perhaps man years of work, without a guarantee of completeness. [...] >From BPF side, the problem space is reduced, however, options are rather limited. One idea that has been explored was to xor-obfuscate pointer spills to the BPF stack: [...] // preoccupy the CPU store port by running sequence of dummy stores. [...] 2106: (63) *(u32 *)(r7 +29796) = r0 2107: (63) *(u32 *)(r7 +29800) = r0 2108: (63) *(u32 *)(r7 +29804) = r0 2109: (63) *(u32 *)(r7 +29808) = r0 2110: (63) *(u32 *)(r7 +29812) = r0 // overwrite scalar with dummy pointer; xored with random 'secret' value // of 943576462 before store ... 2111: (b4) w11 = 943576462 2112: (af) r11 ^= r7 2113: (7b) *(u64 *)(r10 -16) = r11 2114: (79) r11 = *(u64 *)(r10 -16) 2115: (b4) w2 = 943576462 2116: (af) r2 ^= r11 // ... and restored with the same 'secret' value with the help of AX reg. 2117: (71) r3 = *(u8 *)(r2 +0) [...] While the above would not prevent speculation, it would make data leakage infeasible by directing it to random locations. In order to be effective and prevent type confusion under speculation, such random secret would have to be regenerated for each store. The additional complexity involved for a tracking mechanism that prevents jumps such that restoring spilled pointers would not get corrupted is not worth the gain for unprivileged. Hence, the fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC instruction which the x86 JIT translates into a lfence opcode. Inserting the latter in between the store and load instruction is one of the mitigations options [1]. The x86 instruction manual notes: [...] An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible. [...] The latter meaning that the preceding store instruction finished execution and the store is at minimum guaranteed to be in the CPU's store queue, but it's not guaranteed to be in that CPU's L1 cache at that point (globally visible). The latter would only be guaranteed via sfence. So the load which is guaranteed to execute after the lfence for that local CPU would have to rely on store-to-load forwarding. [2], in section 2.3 on store buffers says: [...] For every store operation that is added to the ROB, an entry is allocated in the store buffer. This entry requires both the virtual and physical address of the target. Only if there is no free entry in the store buffer, the frontend stalls until there is an empty slot available in the store buffer again. Otherwise, the CPU can immediately continue adding subsequent instructions to the ROB and execute them out of order. On Intel CPUs, the store buffer has up to 56 entries. [...] One small upside on the fix is that it lifts constraints from af86ca4e3088 where the sanitize_stack_off relative to r10 must be the same when coming from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX or BPF_ST instruction. This happens either when we store a pointer or data value to the BPF stack for the first time, or upon later pointer spills. The former needs to be enforced since otherwise stale stack data could be leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST | BPF_NOSPEC mapping is currently optimized away, but others could emit a speculation barrier as well if necessary. For real-world unprivileged programs e.g. generated by LLVM, pointer spill/fill is only generated upon register pressure and LLVM only tries to do that for pointers which are not used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC sanitation for the STACK_INVALID case when the first write to a stack slot occurs e.g. upon map lookup. In future we might refine ways to mitigate the latter cost. [0] https://arxiv.org/pdf/1902.05178.pdf [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/ [2] https://arxiv.org/pdf/1905.05725.pdf Fixes: af86ca4e3088 ("bpf: Prevent memory disambiguation attack") Fixes: f7cf25b2026d ("bpf: track spill/fill of constants") Co-developed-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de> Acked-by: Alexei Starovoitov <ast@kernel.org> Conflicts: kernel/bpf/verifier.c Signed-off-by: He Fengqing<hefengqing@huawei.com> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Yu Changchun <yuchangchun1@huawei.com>
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
Contributions to OpenHarmony linux kernel project ========================================== Sign DCO -------- Before submitting any Contributions to OpenHarmony kernel, you have to sign DCO. See: https://dco.openharmony.io/sign-dco Steps of submitting patches --------------------------- 1. Compile and test your patches successfully. You should test your patch in OpenHamrony supported boards, hi3516dv300, etc. 2. Generate patches Your patches should be based on top of latest OpenHarmony branch, and should use git-format-patch to generate patches, and if it's a patchset, it's better to use --cover-letter option to describe what the patchset does. Using scripts/checkpatch.pl to make sure there's no coding style issue. And make sure your patch follow unified OpenHarmony patch format describe below. *Tips*: Learn more about linux kernel coding style en: https://gitee.com/openharmony/kernel_linux/blob/master/Documentation/process/coding-style.rst zh: https://gitee.com/openharmony/kernel_linux/blob/master/Documentation/translations/zh_CN/coding-style.rst 3. Send patch to OpenHarmony mailing list Use this command to send patches to OpenHarmony kernel mailing list: git send-email *.patch -to="kernel@openharmony.io" --suppress-cc=all *NOTE*: that you must add --suppress-cc=all if you use git send-email, otherwise the email will be cced to the people in upstream community and mailing lists. *Tips*: Subscribe the mailing list https://lists.openatom.io/postorius/lists/kernel.openharmony.io/ *See*: How to send patches using git-send-email https://git-scm.com/docs/git-send-email 4. Mark "v1, v2, v3 ..." in your patch subject if you have multiple versions to send out. Use --subject-prefix="PATCH v2" option to add v2 tag for patchset. git format-patch --subject-prefix="PATCH v2" -1 Subject examples: Subject: [PATCH v2 01/27] fork: fix some -Wmissing-prototypes warnings Subject: [PATCH v3] ext2: improve scalability of bitmap searching 5. Upstream your kernel patch to kernel community is strongly recommended. OpenHarmony will sync up with kernel master timely. 6. Sign your work - the Developer's Certificate of Origin As the same of upstream kernel community, you also need to sign your patch. See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html The sign-off is a simple line at the end of the explanation for the patch, which certifies that you wrote it or otherwise have the right to pass it on as an open-source patch. The rules are pretty simple: if you can certify the below: Developer's Certificate of Origin 1.1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ By making a contribution to this project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or (b The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved. then you just add a line saying: Signed-off-by: Random J Developer <random@developer.example.org> using your real name (sorry, no pseudonyms or anonymous contributions.) Use unified patch format ------------------------ Reasons: 1. long term maintainability OpenHarmony will merge massive patches. If all patches are merged by casual changelog format without a unified format, the git log will be messy, and then it's hard to figure out the original patch. 2. kernel upgrade We definitely will upgrade our OpenHarmony kernel in someday, using strict patch management will alleviate the pain to migrate patches during big upgrade. 3. easy for script parsing Keyword highlighting is necessary for script parsing. Patch format definition ----------------------- [M] stands for "mandatory" [O] stands for "option" $category can be: bug preparation, bugfix, perf, feature, doc, other... If category is feature, then we also need to add feature name like below: category: feature feature: YYY (the feature name) If the patch is related to CVE or issue/bugzilla, then we need add the corresponding tag like below (In general, it should include at least one of the following): CVE: $cve-id or NA issue: $issue-id or NA bugzilla: $bug-id or NA issue: https://gitee.com/openharmony/kernel_linux/issues Additional changelog should include at least one of the flollwing: 1) Why we should apply this patch 2) What real problem in product does this patch resolved 3) How could we reproduce this bug or how to test 4) Other useful information for help to understand this patch or problem The detail information is very useful for porting patch to another kenrel branch. 1. stable patch stable inclusion [M] from $stable-version [M] commit $id [M] bugzilla: $bug-id [O] issue: $issue-id [O] CVE: $cve-id [O] additional changelog [O] -------------------------------- original changelog Signed-off-by: $your_name <$your_mail> [M] ($stable-version would be stable-4.19.156, stable-4.19.157, etc... $id would be stable commit) 2. mainline patch: mainline inclusion [M] from $mainline-version [M] commit $id [M] category: $category [M] bugzilla: $bug-id [O] issue: $issue-id [O] CVE: $cve-id [O] additional changelog [O] -------------------------------- original changelog Signed-off-by: $your_name <$your_mail> [M] ($mainline-version could be mainline-5.6, mainline-5.10, etc... $id would be mainline commit) 3. ohos patch ohos inclusion [M] category: $category [M] bugzilla: $bug-id or NA [O] issue: $issue-id or NA [O] CVE: $cve-id or NA [O] -------------------------------- changelog Signed-off-by: $your_name <$your_mail> [M] Examples -------- mainline inclusion from mainline-4.10 commit 0becc0ae5b42828785b589f686725ff5bc3b9b25 category: bugfix issue: 1000 CVE: NA The patch fixes a BUG_ON in the product: injecting single bit ECC error to memory before system boot use hardware inject tools, which cause a large amount of CMCI during system booting . [ 1.146580] mce: [Hardware Error]: Machine check events logged [ 1.152908] ------------[ cut here ]------------ [ 1.157751] kernel BUG at kernel/timer.c:951! [ 1.162321] invalid opcode: 0000 [#1] SMP ... ------------------------------------------------- original changelog <original S-O-B> Signed-off-by: Zhang San <zhangsan@163.com> Tested-by: Li Si <lisi@163.com> Email Client - Thunderbird Settings ----------------------------------- If you are newly developer in the kernel community, it is highly recommended to use thunderbird mail client. 1. Thunderbird Installation Get English version Thunderbird from http://www.mozilla.org/ and install it on your system. Download url: https://www.thunderbird.net/en-US/thunderbird/all/ 2. Settings 2.1 Use plain text format instead of HTML format Options -> Account Settings -> Composition & Addressing, do *NOT* select "Compose message in HTML format". 2.2 Editor Settings Tools->Options->Advanced->Config editor. - To bring up the thunderbird's registry editor, and set: "mailnews.send_plaintext_flowed" to "false". - Disable HTML Format: Set "mail.identity.id1.compose_html" to "false". - Enable UTF8: Set "prefs.converted-to-utf8" to "true". - View message in UTF-8: Set "mailnews.view_default_charset" to "UTF-8". - Set mailnews.wraplength to 9999 for avoiding auto-wrap Linux kernel ============ There are several guides for kernel developers and users. These guides can be rendered in a number of formats, like HTML and PDF. Please read Documentation/admin-guide/README.rst first. In order to build the documentation, use ``make htmldocs`` or ``make pdfdocs``. The formatted documentation can also be read online at: https://www.kernel.org/doc/html/latest/ There are various text files in the Documentation/ subdirectory, several of them using the Restructured Text markup notation. See Documentation/00-INDEX for a list of what is contained in each file. Please read the Documentation/process/changes.rst file, as it contains the requirements for building and running the kernel, and information about the problems which may result by upgrading your kernel.
Description
Languages
C
97.7%
Assembly
1.5%
Makefile
0.3%
Shell
0.2%
Perl
0.1%
Other
0.1%