Impact: future-proof the split_large_page() function
Linus noticed that split_large_page() is not safe wrt. the
PAT bit: it is bit 12 on the 1GB and 2MB page table level
(_PAGE_BIT_PAT_LARGE), and it is bit 7 on the 4K page
table level (_PAGE_BIT_PAT).
Currently it is not a problem because we never set
_PAGE_BIT_PAT_LARGE on any of the large-page mappings - but
should this happen in the future the split_large_page() would
silently lift bit 12 into the lowlevel 4K pte and would start
corrupting the physical page frame offset. Not fun.
So add a debug warning, to make sure if something ever sets
the PAT bit then this function gets updated too.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to prevent hard lockup on bad PMD permissions
If the PMD does not have the correct permissions for a page access,
but the PTE does, the spurious fault handler will mistake the fault
as a lazy TLB transaction. This will result in an infinite loop of:
fault -> spurious_fault check (pass) -> return to code -> fault
This patch adds a check and a warn on if the PTE passes the permissions
but the PMD does not.
[ Updated: Ingo Molnar suggested using WARN_ONCE with some text ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt found a bug in where in his modified kernel
ftrace was unable to modify the kernel text, due to the PMD
itself having been marked read-only as well in
split_large_page().
The fix, suggested by Linus, is to not try to 'clone' the
reference protection of a huge-page, but to use the standard
(and permissive) page protection bits of KERNPG_TABLE.
The 'cloning' makes sense for the ptes but it's a confused and
incorrect concept at the page table level - because the
pagetable entry is a set of all ptes and hence cannot
'clone' any single protection attribute - the ptes can be any
mixture of protections.
With the permissive KERNPG_TABLE, even if the pte protections
get changed after this point (due to ftrace doing code-patching
or other similar activities like kprobes), the resulting combined
protections will still be correct and the pte's restrictive
(or permissive) protections will control it.
Also update the comment.
This bug was there for a long time but has not caused visible
problems before as it needs a rather large read-only area to
trigger. Steve possibly hacked his kernel with some really
large arrays or so. Anyway, the bug is definitely worth fixing.
[ Huang Ying also experienced problems in this area when writing
the EFI code, but the real bug in split_large_page() was not
realized back then. ]
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new dynamic allocator, unified access to static/dynamic
percpu memory
Convert to the new dynamic percpu allocator.
* implement populate_extra_pte() for both 32 and 64
* update setup_per_cpu_areas() to use pcpu_setup_static()
* define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
* define config HAVE_DYNAMIC_PER_CPU_AREA
Signed-off-by: Tejun Heo <tj@kernel.org>
What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:
BUG_ON(page_zone(start_page) != page_zone(end_page));
Once I knew this is what was happening, I added some annotations:
if (unlikely(page_zone(start_page) != page_zone(end_page))) {
printk(KERN_ERR "move_freepages: Bogus zones: "
"start_page[%p] end_page[%p] zone[%p]\n",
start_page, end_page, zone);
printk(KERN_ERR "move_freepages: "
"start_zone[%p] end_zone[%p]\n",
page_zone(start_page), page_zone(end_page));
printk(KERN_ERR "move_freepages: "
"start_pfn[0x%lx] end_pfn[0x%lx]\n",
page_to_pfn(start_page), page_to_pfn(end_page));
printk(KERN_ERR "move_freepages: "
"start_nid[%d] end_nid[%d]\n",
page_to_nid(start_page), page_to_nid(end_page));
...
And here's what I got:
move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
move_freepages: start_nid[1] end_nid[0]
My memory layout on this box is:
[ 0.000000] Zone PFN ranges:
[ 0.000000] Normal 0x00000000 -> 0x0081ff5d
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[8] active PFN ranges
[ 0.000000] 0: 0x00000000 -> 0x00020000
[ 0.000000] 1: 0x00800000 -> 0x0081f7ff
[ 0.000000] 1: 0x0081f800 -> 0x0081fe50
[ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
[ 0.000000] 1: 0x0081feda -> 0x0081fedb
[ 0.000000] 1: 0x0081fedd -> 0x0081fee5
[ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
[ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d
So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.
This patch:
Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
I think it makes fix-for-memmap-init not easy.
This patch moves all declaration to include/linux/mm.h
After this,
if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use static definition in include/linux/mm.h
else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use generic definition in mm/page_alloc.c
else
-> per-arch back end function will be called.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, vm86: fix preemption bug
x86, olpc: fix model detection without OFW
x86, hpet: fix for LS21 + HPET = boot hang
x86: CPA avoid repeated lazy mmu flush
x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
x86/cpa: make sure cpa is safe to call in lazy mmu mode
x86, ptrace, mm: fix double-free on race
Impact: Flush the lazy MMU only once
Pending mmu updates only need to be flushed once to bring the
in-memory pagetable state up to date.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: cleanup
Make the max_low_pfn logic a bit more standard between
lowmem_pfn_init() and highmem_pfn_init().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Split find_low_pfn_range() into two functions:
- lowmem_pfn_init()
- highmem_pfn_init()
The former gets called if all of RAM fits into lowmem,
otherwise we call highmem_pfn_init().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeff Mahoney reported:
> With Suse's hwinfo tool, on -tip:
> WARNING: at arch/x86/mm/pat.c:637 reserve_pfn_range+0x5b/0x26d()
reserve_pfn_range() is not tracking the memory range below 1MB
as non-RAM and as such is inconsistent with similar checks in
reserve_memtype() and free_memtype()
Rename the pagerange_is_ram() to pat_pagerange_is_ram() and add the
"track legacy 1MB region as non RAM" condition.
And also, fix reserve_pfn_range() to return -EINVAL, when the pfn
range is RAM. This is to be consistent with this API design.
Reported-and-tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix race leading to crash under KVM and Xen
The CPA code may be called while we're in lazy mmu update mode - for
example, when using DEBUG_PAGE_ALLOC and doing a slab allocation
in an interrupt handler which interrupted a lazy mmu update. In this
case, the in-memory pagetable state may be out of date due to pending
queued updates. We need to flush any pending updates before inspecting
the page table. Similarly, we must explicitly flush any modifications
CPA may have made (which comes down to flushing queued operations when
flushing the TLB).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/mm/init_32.c: In function ‘find_low_pfn_range’:
arch/x86/mm/init_32.c:696: warning: format ‘%u’ expects type ‘unsigned int’, but
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* commit 'remotes/tip/x86/paravirt': (175 commits)
xen: use direct ops on 64-bit
xen: make direct versions of irq_enable/disable/save/restore to common code
xen: setup percpu data pointers
xen: fix 32-bit build resulting from mmu move
x86/paravirt: return full 64-bit result
x86, percpu: fix kexec with vmlinux
x86/vmi: fix interrupt enable/disable/save/restore calling convention.
x86/paravirt: don't restore second return reg
xen: setup percpu data pointers
x86: split loading percpu segments from loading gdt
x86: pass in cpu number to switch_to_new_gdt()
x86: UV fix uv_flush_send_and_wait()
x86/paravirt: fix missing callee-save call on pud_val
x86/paravirt: use callee-saved convention for pte_val/make_pte/etc
x86/paravirt: implement PVOP_CALL macros for callee-save functions
x86/paravirt: add register-saving thunks to reduce caller register pressure
x86/paravirt: selectively save/restore regs around pvops calls
x86: fix paravirt clobber in entry_64.S
x86/pvops: add a paravirt_ident functions to allow special patching
xen: move remaining mmu-related stuff into mmu.c
...
Conflicts:
arch/x86/mach-voyager/voyager_smp.c
arch/x86/mm/fault.c
Impact: bug fix
Don't use per_cpu_offset() to determine if it valid to access a
per-cpu variable for a given cpu number. It is not a valid assumption
on x86-64 anymore. Use cpu_possible() instead.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent kprobes from catching spurious faults which will cause infinite
recursive page-fault and memory corruption by stack overflow.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
Introduce helper function fault_in_kernel_address() to make editors happy.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: widen debug checks
VirtualBox calls do_page_fault() from an atomic context but runs into a
might_sleep() way pas this point, cure that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86/Voyager subarch used to have this distinction between
'x86 SMP support' and 'Voyager SMP support':
config X86_SMP
bool
depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
This is a pointless distinction - Voyager can (and already does) use
smp_ops to implement various SMP quirks it has - and it can be extended
more to cover all the specialities of Voyager.
So remove this complication in the Kconfig space.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Our send_IPI_*() methods and definitions are a twisted mess: the same
symbol is defined to different things depending on .config details,
in a non-transparent way.
- spread out the quirks into separately named per apic driver methods
- prefix the standard PC methods with default_
- get rid of wrapper macro obfuscation
- clean up various details
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Code movement, no functional change.
Move the 64-bit NUMA code from setup_percpu.c to numa_64.c
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
xen: unitialised return value in xenbus_write_transaction
x86: fix section mismatch warning
x86: unmask CPUID levels on Intel CPUs, fix
x86: work around PAGE_KERNEL_WC not getting WC in iomap_atomic_prot_pfn.
x86: use standard PIT frequency
xen: handle highmem pages correctly when shrinking a domain
x86, mm: fix pte_free()
xen: actually release memory when shrinking domain
x86: unmask CPUID levels on Intel CPUs
x86: add MSR_IA32_MISC_ENABLE bits to <asm/msr-index.h>
x86: fix PTE corruption issue while mapping RAM using /dev/mem
x86: mtrr fix debug boot parameter
x86: fix page attribute corruption with cpa()
Revert "x86: signal: change type of paramter for sys_rt_sigreturn()"
x86: use early clobbers in usercopy*.c
x86: remove kernel_physical_mapping_init() from init section
fix: crash: IP: __bitmap_intersects+0x48/0x73
cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
work_on_cpu: Use our own workqueue.
work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
...
In the absence of PAT, PAGE_KERNEL_WC ends up mapping to a memory type that
gets UC behavior even in the presence of a WC MTRR covering the area in
question. By swapping to PAGE_KERNEL_UC_MINUS, we can get the actual
behavior the caller wanted (WC if you can manage it, UC otherwise).
This recovers the 40% performance improvement of using WC in the DRM
to upload vertex data.
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup
When PAT was originally introduced, it was handled specially for a few
reasons:
- PAT bugs are hard to track down, so we wanted to maintain a
whitelist of CPUs.
- The i386 and x86-64 CPUID code was not yet unified.
Both of these are now obsolete, so handle PAT like any other features,
including ordinary feature blacklisting due to known bugs.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: introduce new uaccess exception handling framework
Introduce {get|put}_user_try and {get|put}_user_catch as new uaccess exception
handling framework.
{get|put}_user_try begins exception block and {get|put}_user_catch(err) ends
the block and gets err if an exception occured in {get|put}_user_ex() in the
block. The exception is stored thread_info->uaccess_err.
The example usage of this framework is below;
int func()
{
int err = 0;
get_user_try {
get_user_ex(...);
get_user_ex(...);
:
} get_user_catch(err);
return err;
}
Note: get_user_ex() is not clear the value when an exception occurs, it's
different from the behavior of __get_user(), but I think it doesn't matter.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: fix/extend ioremap_wc() beyond 4GB aperture on 32-bit
ioremap_wc() was taking in unsigned long parameter, where as it should take
64-bit resource_size_t parameter like other ioremap variants.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tsk is already assigned to current, drop the redundant second
assignment.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Beschorner Daniel reported:
> hwinfo problem since 2.6.28, showing this in the oops:
> Corrupted page table at address 7fd04de3ec00
Also, PaX Team reported a regression with this commit:
> commit 9542ada803198e6eba29d3289abb39ea82047b92
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
> Date: Wed Sep 24 08:53:33 2008 -0700
>
> x86: track memtype for RAM in page struct
This commit breaks mapping any RAM page through /dev/mem, as the
reserve_memtype() was not initializing the return attribute type and as such
corrupting the PTE entry that was setup with the return attribute type.
Because of this bug, application mapping this RAM page through /dev/mem
will die with "Corrupted page table at address xxxx" message in the kernel
log and also the kernel identity mapping which maps the underlying RAM
page gets converted to UC.
Fix this by initializing the return attribute type before calling
reserve_ram_pages_type()
Reported-by: PaX Team <pageexec@freemail.hu>
Reported-and-tested-by: Beschorner Daniel <Daniel.Beschorner@facton.com>
Tested-and-Acked-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sporadic slowdowns and warning messages
This patch fixes a performance issue reported by Linus on his
Nehalem system. While Linus reverted the PAT patch (commit
58dab916dfb57328d50deb0aa9b3fc92efa248ff) which exposed the issue,
existing cpa() code can potentially still cause wrong(page attribute
corruption) behavior.
This patch also fixes the "WARNING: at arch/x86/mm/pageattr.c:560" that
various people reported.
In 64bit kernel, kernel identity mapping might have holes depending
on the available memory and how e820 reports the address range
covering the RAM, ACPI, PCI reserved regions. If there is a 2MB/1GB hole
in the address range that is not listed by e820 entries, kernel identity
mapping will have a corresponding hole in its 1-1 identity mapping.
If cpa() happens on the kernel identity mapping which falls into these holes,
existing code fails like this:
__change_page_attr_set_clr()
__change_page_attr()
returns 0 because of if (!kpte). But doesn't
set cpa->numpages and cpa->pfn.
cpa_process_alias()
uses uninitialized cpa->pfn (random value)
which can potentially lead to changing the page
attribute of kernel text/data, kernel identity
mapping of RAM pages etc. oops!
This bug was easily exposed by another PAT patch which was doing
cpa() more often on kernel identity mapping holes (physical range between
max_low_pfn_mapped and 4GB), where in here it was setting the
cache disable attribute(PCD) for kernel identity mappings aswell.
Fix cpa() to handle the kernel identity mapping holes. Retain
the WARN() for cpa() calls to other not present address ranges
(kernel-text/data, ioremap() addresses)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix:
arch/x86/mm/srat_64.c: In function ‘acpi_numa_processor_affinity_init’:
arch/x86/mm/srat_64.c:141: error: implicit declaration of function ‘get_uv_system_type’
arch/x86/mm/srat_64.c:141: error: ‘UV_X2APIC’ undeclared (first use in this function)
arch/x86/mm/srat_64.c:141: error: (Each undeclared identifier is reported only once
arch/x86/mm/srat_64.c:141: error: for each function it appears in.)
A couple of UV definitions were moved to asm/uv/uv.h, but srat_64.c did
not include that header. Add it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Now that it's unified, move the (SMP) TLB flushing code from arch/x86/kernel/
to arch/x86/mm/, where it belongs logically.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, restructure code to improve assembly
gcc isn't _all_ that smart about spilling registers to stack or reusing
stack slots, even with branch annotations. do_page_fault contained a lot
of functionality, so split unlikely paths into their own functions, and
mark them as noinline just to be sure. I consider this actually to be
somewhat of a cleanup too: the main function now contains about half
the number of lines so the normal path is easier to read, while the error
cases are also nicely split away.
Also, ensure the order of arguments to functions is always the same: regs,
addr, error_code. This can reduce code size a tiny bit, and just looks neater
too.
And add a couple of branch annotations.
Before:
do_page_fault:
subq $360, %rsp #,
After:
do_page_fault:
subq $56, %rsp #,
bloat-o-meter:
add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
function old new delta
__bad_area_nosemaphore - 506 +506
no_context - 474 +474
vmalloc_fault - 424 +424
spurious_fault - 358 +358
mm_fault_error - 272 +272
bad_area_access_error - 89 +89
bad_area - 89 +89
bad_area_nosemaphore - 10 +10
do_page_fault 2464 784 -1680
Yes, the total size increases by 542 bytes, due to the extra function calls.
But these will very rarely be called (except for vmalloc_fault) in a normal
workload. Importantly, do_page_fault is less than 1/3rd it's original size,
and touches far less stack.
Existing gotos and branch hints did move a lot of the infrequently used text
out of the fastpath, but that's even further improved after this patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>