Daniel Borkmann 3f60875683 bpf: Fix leakage due to insufficient speculative store bypass mitigation
mainline inclusion
from mainline-v5.14-rc5
commit 2039f26f3aca5b0e419b98f65dd36481337b86ee
category: bugfix
issue: #I458AV
CVE: CVE-2021-35477

---------------------------

Spectre v4 gadgets make use of memory disambiguation, which is a set of
techniques that execute memory access instructions, that is, loads and
stores, out of program order; Intel's optimization manual, section 2.4.4.5:

  A load instruction micro-op may depend on a preceding store. Many
  microarchitectures block loads until all preceding store addresses are
  known. The memory disambiguator predicts which loads will not depend on
  any previous stores. When the disambiguator predicts that a load does
  not have such a dependency, the load takes its data from the L1 data
  cache. Eventually, the prediction is verified. If an actual conflict is
  detected, the load and all succeeding instructions are re-executed.

af86ca4e3088 ("bpf: Prevent memory disambiguation attack") tried to mitigate
this attack by sanitizing the memory locations through preemptive "fast"
(low latency) stores of zero prior to the actual "slow" (high latency) store
of a pointer value such that upon dependency misprediction the CPU then
speculatively executes the load of the pointer value and retrieves the zero
value instead of the attacker controlled scalar value previously stored at
that location, meaning, subsequent access in the speculative domain is then
redirected to the "zero page".

The sanitized preemptive store of zero prior to the actual "slow" store is
done through a simple ST instruction based on r10 (frame pointer) with
relative offset to the stack location that the verifier has been tracking
on the original used register for STX, which does not have to be r10. Thus,
there are no memory dependencies for this store, since it's only using r10
and immediate constant of zero; hence af86ca4e3088 /assumed/ a low latency
operation.

However, a recent attack demonstrated that this mitigation is not sufficient
since the preemptive store of zero could also be turned into a "slow" store
and is thus bypassed as well:

  [...]
  // r2 = oob address (e.g. scalar)
  // r7 = pointer to map value
  31: (7b) *(u64 *)(r10 -16) = r2
  // r9 will remain "fast" register, r10 will become "slow" register below
  32: (bf) r9 = r10
  // JIT maps BPF reg to x86 reg:
  //  r9  -> r15 (callee saved)
  //  r10 -> rbp
  // train store forward prediction to break dependency link between both r9
  // and r10 by evicting them from the predictor's LRU table.
  33: (61) r0 = *(u32 *)(r7 +24576)
  34: (63) *(u32 *)(r7 +29696) = r0
  35: (61) r0 = *(u32 *)(r7 +24580)
  36: (63) *(u32 *)(r7 +29700) = r0
  37: (61) r0 = *(u32 *)(r7 +24584)
  38: (63) *(u32 *)(r7 +29704) = r0
  39: (61) r0 = *(u32 *)(r7 +24588)
  40: (63) *(u32 *)(r7 +29708) = r0
  [...]
  543: (61) r0 = *(u32 *)(r7 +25596)
  544: (63) *(u32 *)(r7 +30716) = r0
  // prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
  // to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
  // in hardware registers. rbp becomes slow due to push/pop latency. below is
  // disasm of bpf_ringbuf_output() helper for better visual context:
  //
  // ffffffff8117ee20: 41 54                 push   r12
  // ffffffff8117ee22: 55                    push   rbp
  // ffffffff8117ee23: 53                    push   rbx
  // ffffffff8117ee24: 48 f7 c1 fc ff ff ff  test   rcx,0xfffffffffffffffc
  // ffffffff8117ee2b: 0f 85 af 00 00 00     jne    ffffffff8117eee0 <-- jump taken
  // [...]
  // ffffffff8117eee0: 49 c7 c4 ea ff ff ff  mov    r12,0xffffffffffffffea
  // ffffffff8117eee7: 5b                    pop    rbx
  // ffffffff8117eee8: 5d                    pop    rbp
  // ffffffff8117eee9: 4c 89 e0              mov    rax,r12
  // ffffffff8117eeec: 41 5c                 pop    r12
  // ffffffff8117eeee: c3                    ret
  545: (18) r1 = map[id:4]
  547: (bf) r2 = r7
  548: (b7) r3 = 0
  549: (b7) r4 = 4
  550: (85) call bpf_ringbuf_output#194288
  // instruction 551 inserted by verifier    \
  551: (7a) *(u64 *)(r10 -16) = 0            | /both/ are now slow stores here
  // storing map value pointer r7 at fp-16   | since value of r10 is "slow".
  552: (7b) *(u64 *)(r10 -16) = r7           /
  // following "fast" read to the same memory location, but due to dependency
  // misprediction it will speculatively execute before insn 551/552 completes.
  553: (79) r2 = *(u64 *)(r9 -16)
  // in speculative domain contains attacker controlled r2. in non-speculative
  // domain this contains r7, and thus accesses r7 +0 below.
  554: (71) r3 = *(u8 *)(r2 +0)
  // leak r3

As can be seen, the current speculative store bypass mitigation which the
verifier inserts at line 551 is insufficient since /both/, the write of
the zero sanitation as well as the map value pointer are a high latency
instruction due to prior memory access via push/pop of r10 (rbp) in contrast
to the low latency read in line 553 as r9 (r15) which stays in hardware
registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
fp-16 can still be r2.

Initial thoughts to address this issue was to track spilled pointer loads
from stack and enforce their load via LDX through r10 as well so that /both/
the preemptive store of zero /as well as/ the load use the /same/ register
such that a dependency is created between the store and load. However, this
option is not sufficient either since it can be bypassed as well under
speculation. An updated attack with pointer spill/fills now _all_ based on
r10 would look as follows:

  [...]
  // r2 = oob address (e.g. scalar)
  // r7 = pointer to map value
  [...]
  // longer store forward prediction training sequence than before.
  2062: (61) r0 = *(u32 *)(r7 +25588)
  2063: (63) *(u32 *)(r7 +30708) = r0
  2064: (61) r0 = *(u32 *)(r7 +25592)
  2065: (63) *(u32 *)(r7 +30712) = r0
  2066: (61) r0 = *(u32 *)(r7 +25596)
  2067: (63) *(u32 *)(r7 +30716) = r0
  // store the speculative load address (scalar) this time after the store
  // forward prediction training.
  2068: (7b) *(u64 *)(r10 -16) = r2
  // preoccupy the CPU store port by running sequence of dummy stores.
  2069: (63) *(u32 *)(r7 +29696) = r0
  2070: (63) *(u32 *)(r7 +29700) = r0
  2071: (63) *(u32 *)(r7 +29704) = r0
  2072: (63) *(u32 *)(r7 +29708) = r0
  2073: (63) *(u32 *)(r7 +29712) = r0
  2074: (63) *(u32 *)(r7 +29716) = r0
  2075: (63) *(u32 *)(r7 +29720) = r0
  2076: (63) *(u32 *)(r7 +29724) = r0
  2077: (63) *(u32 *)(r7 +29728) = r0
  2078: (63) *(u32 *)(r7 +29732) = r0
  2079: (63) *(u32 *)(r7 +29736) = r0
  2080: (63) *(u32 *)(r7 +29740) = r0
  2081: (63) *(u32 *)(r7 +29744) = r0
  2082: (63) *(u32 *)(r7 +29748) = r0
  2083: (63) *(u32 *)(r7 +29752) = r0
  2084: (63) *(u32 *)(r7 +29756) = r0
  2085: (63) *(u32 *)(r7 +29760) = r0
  2086: (63) *(u32 *)(r7 +29764) = r0
  2087: (63) *(u32 *)(r7 +29768) = r0
  2088: (63) *(u32 *)(r7 +29772) = r0
  2089: (63) *(u32 *)(r7 +29776) = r0
  2090: (63) *(u32 *)(r7 +29780) = r0
  2091: (63) *(u32 *)(r7 +29784) = r0
  2092: (63) *(u32 *)(r7 +29788) = r0
  2093: (63) *(u32 *)(r7 +29792) = r0
  2094: (63) *(u32 *)(r7 +29796) = r0
  2095: (63) *(u32 *)(r7 +29800) = r0
  2096: (63) *(u32 *)(r7 +29804) = r0
  2097: (63) *(u32 *)(r7 +29808) = r0
  2098: (63) *(u32 *)(r7 +29812) = r0
  // overwrite scalar with dummy pointer; same as before, also including the
  // sanitation store with 0 from the current mitigation by the verifier.
  2099: (7a) *(u64 *)(r10 -16) = 0         | /both/ are now slow stores here
  2100: (7b) *(u64 *)(r10 -16) = r7        | since store unit is still busy.
  // load from stack intended to bypass stores.
  2101: (79) r2 = *(u64 *)(r10 -16)
  2102: (71) r3 = *(u8 *)(r2 +0)
  // leak r3
  [...]

Looking at the CPU microarchitecture, the scheduler might issue loads (such
as seen in line 2101) before stores (line 2099,2100) because the load execution
units become available while the store execution unit is still busy with the
sequence of dummy stores (line 2069-2098). And so the load may use the prior
stored scalar from r2 at address r10 -16 for speculation. The updated attack
may work less reliable on CPU microarchitectures where loads and stores share
execution resources.

This concludes that the sanitizing with zero stores from af86ca4e3088 ("bpf:
Prevent memory disambiguation attack") is insufficient. Moreover, the detection
of stack reuse from af86ca4e3088 where previously data (STACK_MISC) has been
written to a given stack slot where a pointer value is now to be stored does
not have sufficient coverage as precondition for the mitigation either; for
several reasons outlined as follows:

 1) Stack content from prior program runs could still be preserved and is
    therefore not "random", best example is to split a speculative store
    bypass attack between tail calls, program A would prepare and store the
    oob address at a given stack slot and then tail call into program B which
    does the "slow" store of a pointer to the stack with subsequent "fast"
    read. From program B PoV such stack slot type is STACK_INVALID, and
    therefore also must be subject to mitigation.

 2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
    condition, for example, the previous content of that memory location could
    also be a pointer to map or map value. Without the fix, a speculative
    store bypass is not mitigated in such precondition and can then lead to
    a type confusion in the speculative domain leaking kernel memory near
    these pointer types.

While brainstorming on various alternative mitigation possibilities, we also
stumbled upon a retrospective from Chrome developers [0]:

  [...] For variant 4, we implemented a mitigation to zero the unused memory
  of the heap prior to allocation, which cost about 1% when done concurrently
  and 4% for scavenging. Variant 4 defeats everything we could think of. We
  explored more mitigations for variant 4 but the threat proved to be more
  pervasive and dangerous than we anticipated. For example, stack slots used
  by the register allocator in the optimizing compiler could be subject to
  type confusion, leading to pointer crafting. Mitigating type confusion for
  stack slots alone would have required a complete redesign of the backend of
  the optimizing compiler, perhaps man years of work, without a guarantee of
  completeness. [...]

>From BPF side, the problem space is reduced, however, options are rather
limited. One idea that has been explored was to xor-obfuscate pointer spills
to the BPF stack:

  [...]
  // preoccupy the CPU store port by running sequence of dummy stores.
  [...]
  2106: (63) *(u32 *)(r7 +29796) = r0
  2107: (63) *(u32 *)(r7 +29800) = r0
  2108: (63) *(u32 *)(r7 +29804) = r0
  2109: (63) *(u32 *)(r7 +29808) = r0
  2110: (63) *(u32 *)(r7 +29812) = r0
  // overwrite scalar with dummy pointer; xored with random 'secret' value
  // of 943576462 before store ...
  2111: (b4) w11 = 943576462
  2112: (af) r11 ^= r7
  2113: (7b) *(u64 *)(r10 -16) = r11
  2114: (79) r11 = *(u64 *)(r10 -16)
  2115: (b4) w2 = 943576462
  2116: (af) r2 ^= r11
  // ... and restored with the same 'secret' value with the help of AX reg.
  2117: (71) r3 = *(u8 *)(r2 +0)
  [...]

While the above would not prevent speculation, it would make data leakage
infeasible by directing it to random locations. In order to be effective
and prevent type confusion under speculation, such random secret would have
to be regenerated for each store. The additional complexity involved for a
tracking mechanism that prevents jumps such that restoring spilled pointers
would not get corrupted is not worth the gain for unprivileged. Hence, the
fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
instruction which the x86 JIT translates into a lfence opcode. Inserting the
latter in between the store and load instruction is one of the mitigations
options [1]. The x86 instruction manual notes:

  [...] An LFENCE that follows an instruction that stores to memory might
  complete before the data being stored have become globally visible. [...]

The latter meaning that the preceding store instruction finished execution
and the store is at minimum guaranteed to be in the CPU's store queue, but
it's not guaranteed to be in that CPU's L1 cache at that point (globally
visible). The latter would only be guaranteed via sfence. So the load which
is guaranteed to execute after the lfence for that local CPU would have to
rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:

  [...] For every store operation that is added to the ROB, an entry is
  allocated in the store buffer. This entry requires both the virtual and
  physical address of the target. Only if there is no free entry in the store
  buffer, the frontend stalls until there is an empty slot available in the
  store buffer again. Otherwise, the CPU can immediately continue adding
  subsequent instructions to the ROB and execute them out of order. On Intel
  CPUs, the store buffer has up to 56 entries. [...]

One small upside on the fix is that it lifts constraints from af86ca4e3088
where the sanitize_stack_off relative to r10 must be the same when coming
from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
or BPF_ST instruction. This happens either when we store a pointer or data
value to the BPF stack for the first time, or upon later pointer spills.
The former needs to be enforced since otherwise stale stack data could be
leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
BPF_NOSPEC mapping is currently optimized away, but others could emit a
speculation barrier as well if necessary. For real-world unprivileged
programs e.g. generated by LLVM, pointer spill/fill is only generated upon
register pressure and LLVM only tries to do that for pointers which are not
used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
sanitation for the STACK_INVALID case when the first write to a stack slot
occurs e.g. upon map lookup. In future we might refine ways to mitigate
the latter cost.

  [0] https://arxiv.org/pdf/1902.05178.pdf
  [1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
  [2] https://arxiv.org/pdf/1905.05725.pdf

Fixes: af86ca4e3088 ("bpf: Prevent memory disambiguation attack")
Fixes: f7cf25b2026d ("bpf: track spill/fill of constants")
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Conflicts:
	kernel/bpf/verifier.c
Signed-off-by: He Fengqing<hefengqing@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Yu Changchun <yuchangchun1@huawei.com>
2021-08-28 11:01:30 +08:00

Contributions to OpenHarmony linux kernel project
==========================================

Sign DCO
--------

Before submitting any Contributions to OpenHarmony kernel, you have to sign
DCO.

See:
     https://dco.openharmony.io/sign-dco

Steps of submitting patches
---------------------------

1. Compile and test your patches successfully.
   You should test your patch in OpenHamrony supported boards, hi3516dv300,
   etc.

2. Generate patches
   Your patches should be based on top of latest OpenHarmony branch, and
   should use git-format-patch to generate patches, and if it's a patchset,
   it's better to use --cover-letter option to describe what the patchset
   does.

   Using scripts/checkpatch.pl to make sure there's no coding style issue.

   And make sure your patch follow unified OpenHarmony patch format describe
   below.

   *Tips*: Learn more about linux kernel coding style
    en:
    https://gitee.com/openharmony/kernel_linux/blob/master/Documentation/process/coding-style.rst
    zh:
    https://gitee.com/openharmony/kernel_linux/blob/master/Documentation/translations/zh_CN/coding-style.rst

3. Send patch to OpenHarmony mailing list
   Use this command to send patches to OpenHarmony kernel mailing list:

     git send-email *.patch -to="kernel@openharmony.io" --suppress-cc=all

   *NOTE*: that you must add --suppress-cc=all if you use git send-email,
   otherwise the email will be cced to the people in upstream community and
   mailing lists.

   *Tips*: Subscribe the mailing list
    https://lists.openatom.io/postorius/lists/kernel.openharmony.io/

   *See*: How to send patches using git-send-email
    https://git-scm.com/docs/git-send-email

4. Mark "v1, v2, v3 ..." in your patch subject if you have multiple versions
   to send out.

   Use --subject-prefix="PATCH v2" option to add v2 tag for patchset.
     git format-patch --subject-prefix="PATCH v2" -1

   Subject examples:
     Subject: [PATCH v2 01/27] fork: fix some -Wmissing-prototypes warnings
     Subject: [PATCH v3] ext2: improve scalability of bitmap searching

5. Upstream your kernel patch to kernel community is strongly recommended.
   OpenHarmony will sync up with kernel master timely.

6. Sign your work - the Developer's Certificate of Origin
   As the same of upstream kernel community, you also need to sign your
   patch.

   See:
   https://www.kernel.org/doc/html/latest/process/submitting-patches.html

   The sign-off is a simple line at the end of the explanation for the
   patch, which certifies that you wrote it or otherwise have the right to
   pass it on as an open-source patch. The rules are pretty simple: if you
   can certify the below:

   Developer's Certificate of Origin 1.1
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

     By making a contribution to this project, I certify that:

     (a) The contribution was created in whole or in part by me and I have
         the right to submit it under the open source license indicated in
         the file; or

     (b  The contribution is based upon previous work that, to the best of
         my knowledge, is covered under an appropriate open source license
         and I have the right under that license to submit that work with
         modifications, whether created in whole or in part by me, under
         the same open source license (unless I am permitted to submit under
         a different license), as indicated in the file; or

     (c) The contribution was provided directly to me by some other person
         who certified (a), (b) or (c) and I have not modified it.

     (d) I understand and agree that this project and the contribution are
         public and that a record of the contribution (including all
         personal information I submit with it, including my sign-off) is
         maintained indefinitely and may be redistributed consistent with
         this project or the open source license(s) involved.

   then you just add a line saying:

     Signed-off-by: Random J Developer <random@developer.example.org>

   using your real name (sorry, no pseudonyms or anonymous contributions.)

Use unified patch format
------------------------

Reasons:

1. long term maintainability
   OpenHarmony will merge massive patches. If all patches are merged by
   casual changelog format without a unified format, the git log will be
   messy, and then it's hard to figure out the original patch.

2. kernel upgrade
   We definitely will upgrade our OpenHarmony kernel in someday, using
   strict patch management will alleviate the pain to migrate patches
   during big upgrade.

3. easy for script parsing
   Keyword highlighting is necessary for script parsing.

Patch format definition
-----------------------

[M] stands for "mandatory"
[O] stands for "option"
$category can be: bug preparation, bugfix, perf, feature, doc, other...

If category is feature, then we also need to add feature name like below:
	category: feature
	feature: YYY (the feature name)

If the patch is related to CVE or issue/bugzilla, then we need add the
corresponding tag like below (In general, it should include at least one
of the following):
	CVE: $cve-id or NA
	issue: $issue-id or NA
	bugzilla: $bug-id or NA

issue: https://gitee.com/openharmony/kernel_linux/issues

Additional changelog should include at least one of the flollwing:
	1) Why we should apply this patch
	2) What real problem in product does this patch resolved
	3) How could we reproduce this bug or how to test
	4) Other useful information for help to understand this patch or problem

The detail information is very useful for porting patch to another kenrel
branch.

1. stable patch
	stable inclusion	[M]
	from $stable-version	[M]
	commit $id		[M]
	bugzilla: $bug-id       [O]
	issue: $issue-id	[O]
	CVE: $cve-id            [O]

	additional changelog	[O]

	--------------------------------

	original changelog

	Signed-off-by: $your_name <$your_mail>	[M]

	($stable-version would be stable-4.19.156, stable-4.19.157, etc...
	 $id would be stable commit)

2. mainline patch:

	mainline inclusion      [M]
	from $mainline-version  [M]
	commit $id              [M]
	category: $category     [M]
	bugzilla: $bug-id       [O]
	issue: $issue-id	[O]
	CVE: $cve-id            [O]

	additional changelog    [O]

	--------------------------------

	original changelog

	Signed-off-by: $your_name <$your_mail>   [M]

	($mainline-version could be mainline-5.6, mainline-5.10, etc...
	 $id would be mainline commit)

3. ohos patch
	ohos inclusion			[M]
	category: $category		[M]
	bugzilla: $bug-id or NA		[O]
	issue: $issue-id or NA		[O]
	CVE: $cve-id or NA		[O]

	--------------------------------

	changelog

	Signed-off-by: $your_name <$your_mail>	[M]

Examples
--------

mainline inclusion
from mainline-4.10
commit 0becc0ae5b42828785b589f686725ff5bc3b9b25
category: bugfix
issue: 1000
CVE: NA

The patch fixes a BUG_ON in the product: injecting single bit ECC error
to memory before system boot use hardware inject tools, which cause a
large amount of CMCI during system booting .

[    1.146580] mce: [Hardware Error]: Machine check events logged
[    1.152908] ------------[ cut here ]------------
[    1.157751] kernel BUG at kernel/timer.c:951!
[    1.162321] invalid opcode: 0000 [#1] SMP
...

-------------------------------------------------

original changelog

<original S-O-B>
Signed-off-by: Zhang San <zhangsan@163.com>
Tested-by: Li Si <lisi@163.com>

Email Client - Thunderbird Settings
-----------------------------------

If you are newly developer in the kernel community, it is highly recommended
to use thunderbird mail client.

1. Thunderbird Installation
   Get English version Thunderbird from http://www.mozilla.org/ and install
   it on your system.

   Download url: https://www.thunderbird.net/en-US/thunderbird/all/

2. Settings
   2.1 Use plain text format instead of HTML format
       Options -> Account Settings -> Composition & Addressing, do *NOT*
       select "Compose message in HTML format".

   2.2 Editor Settings
       Tools->Options->Advanced->Config editor.

       - To bring up the thunderbird's registry editor, and set:
         "mailnews.send_plaintext_flowed" to "false".
       - Disable HTML Format: Set "mail.identity.id1.compose_html" to
         "false".
       - Enable UTF8: Set "prefs.converted-to-utf8" to "true".
       - View message in UTF-8: Set "mailnews.view_default_charset" to
         "UTF-8".
       - Set mailnews.wraplength to 9999 for avoiding auto-wrap

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.
See Documentation/00-INDEX for a list of what is contained in each file.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.
Description
No description provided
Readme 2.1 GiB
Languages
C 97.7%
Assembly 1.5%
Makefile 0.3%
Shell 0.2%
Perl 0.1%
Other 0.1%