Currently, during CPU offlining, after all pending work items are
drained, the trustee butchers all workers. Also, on CPU onlining
failure, workqueue_cpu_callback() ensures that the first idle worker
is destroyed. Combined, these guarantee that an offline CPU doesn't
have any worker for it once all the lingering work items are finished.
This guarantee isn't really necessary and makes CPU on/offlining more
expensive than needs to be, especially for platforms which use CPU
hotplug for powersaving.
This patch lets offline CPUs removes idle worker butchering from the
trustee and let a CPU which failed onlining keep the created first
worker. The first worker is created if the CPU doesn't have any
during CPU_DOWN_PREPARE and started right away. If onlining succeeds,
the rebind_workers() call in CPU_ONLINE will rebind it like any other
workers. If onlining fails, the worker is left alone till the next
try.
This makes CPU hotplugs cheaper by allowing global_cwqs to keep
workers across them and simplifies code.
Note that trustee doesn't re-arm idle timer when it's done and thus
the disassociated global_cwq will keep all workers until it comes back
online. This will be improved by further patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, if there are left workers when a CPU is being brough back
online, the trustee kills all idle workers and scheduled rebind_work
so that they re-bind to the CPU after the currently executing work is
finished. This works for busy workers because concurrency management
doesn't try to wake up them from scheduler callbacks, which require
the target task to be on the local run queue. The busy worker bumps
concurrency counter appropriately as it clears WORKER_UNBOUND from the
rebind work item and it's bound to the CPU before returning to the
idle state.
To reduce CPU on/offlining overhead (as many embedded systems use it
for powersaving) and simplify the code path, workqueue is planned to
be modified to retain idle workers across CPU on/offlining. This
patch reimplements CPU online rebinding such that it can also handle
idle workers.
As noted earlier, due to the local wakeup requirement, rebinding idle
workers is tricky. All idle workers must be re-bound before scheduler
callbacks are enabled. This is achieved by interlocking idle
re-binding. Idle workers are requested to re-bind and then hold until
all idle re-binding is complete so that no bound worker starts
executing work item. Only after all idle workers are re-bound and
parked, CPU_ONLINE proceeds to release them and queue rebind work item
to busy workers thus guaranteeing scheduler callbacks aren't invoked
until all idle workers are ready.
worker_rebind_fn() is renamed to busy_worker_rebind_fn() and
idle_worker_rebind() for idle workers is added. Rebinding logic is
moved to rebind_workers() and now called from CPU_ONLINE after
flushing trustee. While at it, add CPU sanity check in
worker_thread().
Note that now a worker may become idle or the manager between trustee
release and rebinding during CPU_ONLINE. As the previous patch
updated create_worker() so that it can be used by regular manager
while unbound and this patch implements idle re-binding, this is safe.
This prepares for removal of trustee and keeping idle workers across
CPU hotplugs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, create_worker()'s callers are responsible for deciding
whether the newly created worker should be bound to the associated CPU
and create_worker() sets WORKER_UNBOUND only for the workers for the
unbound global_cwq. Creation during normal operation is always via
maybe_create_worker() and @bind is true. For workers created during
hotplug, @bind is false.
Normal operation path is planned to be used even while the CPU is
going through hotplug operations or offline and this static decision
won't work.
Drop @bind from create_worker() and decide whether to bind by looking
at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND
autmatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED
while create_worker() is in progress, the flag is now allowed to be
changed only while holding all manager_mutexes on the global_cwq.
This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's
back. CPU_ONLINE no longer clears DISASSOCIATED before flushing
trustee, which clears DISASSOCIATED before rebinding remaining workers
if asked to release. For cases where trustee isn't around, CPU_ONLINE
clears DISASSOCIATED after flushing trustee. Also, now, first_idle
has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE
while binding it. These convolutions will soon be removed by further
simplification of CPU hotplug path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
POOL_MANAGING_WORKERS is used to ensure that at most one worker takes
the manager role at any given time on a given global_cwq. Trustee
later hitched on it to assume manager adding blocking wait for the
bit. As trustee already needed a custom wait mechanism, waiting for
MANAGING_WORKERS was rolled into the same mechanism.
Trustee is scheduled to be removed. This patch separates out
MANAGING_WORKERS wait into per-pool mutex. Workers use
mutex_trylock() to test for manager role and trustee uses mutex_lock()
to claim manager roles.
gcwq_claim/release_management() helpers are added to grab and release
manager roles of all pools on a global_cwq. gcwq_claim_management()
always grabs pool manager mutexes in ascending pool index order and
uses pool index as lockdep subclass.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, WORKER_UNBOUND is used to mark workers for the unbound
global_cwq and WORKER_ROGUE is used to mark workers for disassociated
per-cpu global_cwqs. Both are used to make the marked worker skip
concurrency management and the only place they make any difference is
in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling
idle timer, which can easily be replaced with trustee state testing.
This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops
WORKER_ROGUE. This is to prepare for removing trustee and handling
disassociated global_cwqs as unbound.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED.
This was necessary because workqueue's CPU_DOWN_PREPARE happened
before other DOWN_PREPARE notifiers and workqueue needed to stay
associated across the rest of DOWN_PREPARE.
After the previous patch, workqueue's DOWN_PREPARE happens after
others and can set GCWQ_DISASSOCIATED directly. Drop CPU_DYING and
let the trustee set GCWQ_DISASSOCIATED after disabling concurrency
management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.
Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.
However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.
While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.
Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.
Workqueue cpu hotplug operations will soon go through further cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
WQ_HIGHPRI was implemented by queueing highpri work items at the head
of the global worklist. Other than queueing at the head, they weren't
handled differently; unfortunately, this could lead to execution
latency of a few seconds on heavily loaded systems.
Now that workqueue code has been updated to deal with multiple
worker_pools per global_cwq, this patch reimplements WQ_HIGHPRI using
a separate worker_pool. NR_WORKER_POOLS is bumped to two and
gcwq->pools[0] is used for normal pri work items and ->pools[1] for
highpri. Highpri workers get -20 nice level and has 'H' suffix in
their names. Note that this change increases the number of kworkers
per cpu.
POOL_HIGHPRI_PENDING, pool_determine_ins_pos() and highpri chain
wakeup code in process_one_work() are no longer used and removed.
This allows proper prioritization of highpri work items and removes
high execution latency of highpri work items.
v2: nr_running indexing bug in get_pool_nr_running() fixed.
v3: Refreshed for the get_pool_nr_running() update in the previous
patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Hunt <joshhunt00@gmail.com>
LKML-Reference: <CAKA=qzaHqwZ8eqpLNFjxnO2fX-tgAOjmpvxgBFjv6dJeQaOW1w@mail.gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Introduce NR_WORKER_POOLS and for_each_worker_pool() and convert code
paths which need to manipulate all pools in a gcwq to use them.
NR_WORKER_POOLS is currently one and for_each_worker_pool() iterates
over only @gcwq->pool.
Note that nr_running is per-pool property and converted to an array
with NR_WORKER_POOLS elements and renamed to pool_nr_running. Note
that get_pool_nr_running() currently assumes 0 index. The next patch
will make use of non-zero index.
The changes in this patch are mechanical and don't caues any
functional difference. This is to prepare for multiple pools per
gcwq.
v2: nr_running indexing bug in get_pool_nr_running() fixed.
v3: Pointer to array is stupid. Don't use it in get_pool_nr_running()
as suggested by Linus.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
GCWQ_MANAGE_WORKERS, GCWQ_MANAGING_WORKERS and GCWQ_HIGHPRI_PENDING
are per-pool properties. Add worker_pool->flags and make the above
three flags per-pool flags.
The changes in this patch are mechanical and don't caues any
functional difference. This is to prepare for multiple pools per
gcwq.
Signed-off-by: Tejun Heo <tj@kernel.org>
Modify all functions which deal with per-pool properties to pass
around @pool instead of @gcwq or @cpu.
The changes in this patch are mechanical and don't caues any
functional difference. This is to prepare for multiple pools per
gcwq.
Signed-off-by: Tejun Heo <tj@kernel.org>
Move worklist and all worker management fields from global_cwq into
the new struct worker_pool. worker_pool points back to the containing
gcwq. worker and cpu_workqueue_struct are updated to point to
worker_pool instead of gcwq too.
This change is mechanical and doesn't introduce any functional
difference other than rearranging of fields and an added level of
indirection in some places. This is to prepare for multiple pools per
gcwq.
v2: Comment typo fixes as suggested by Namhyung.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Unbound wqs aren't concurrency-managed and try to execute work items
as soon as possible. This is currently achieved by implicitly setting
%WQ_HIGHPRI on all unbound workqueues; however, WQ_HIGHPRI
implementation is about to be restructured and this usage won't be
valid anymore.
Add an explicit chain-wakeup path for unbound workqueues in
process_one_work() instead of piggy backing on %WQ_HIGHPRI.
Signed-off-by: Tejun Heo <tj@kernel.org>
Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.
Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.
Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.
Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].
Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().
* Patch orginally from Peter. Hugh modified it a bit and wrote the
description.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>
worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running
isn't zero when every worker is idle. This can trigger spuriously
while a cpu is going down due to the way trustee sets %WORKER_ROGUE
and zaps nr_running.
It first sets %WORKER_ROGUE on all workers without updating
nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and
then zaps nr_running. If the last running worker enters idle
inbetween, it would see stale nr_running which hasn't been zapped yet
and trigger the WARN_ON_ONCE().
Fix it by performing the sanity check iff the trustee is idle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
If a workqueue is flushed with flush_work() lockdep checking can
be circumvented. For example:
static DEFINE_MUTEX(mutex);
static void my_work(struct work_struct *w)
{
mutex_lock(&mutex);
mutex_unlock(&mutex);
}
static DECLARE_WORK(work, my_work);
static int __init start_test_module(void)
{
schedule_work(&work);
return 0;
}
module_init(start_test_module);
static void __exit stop_test_module(void)
{
mutex_lock(&mutex);
flush_work(&work);
mutex_unlock(&mutex);
}
module_exit(stop_test_module);
would not always print a warning when flush_work() was called.
In this trivial example nothing could go wrong since we are
guaranteed module_init() and module_exit() don't run concurrently,
but if the work item is schedule asynchronously we could have a
scenario where the work item is running just at the time flush_work()
is called resulting in a classic ABBA locking problem.
Add a lockdep hint by acquiring and releasing the work item
lockdep_map in flush_work() so that we always catch this
potential deadlock scenario.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This BUG_ON() can be triggered if you call schedule_work() before
calling INIT_WORK(). It is a bug definitely, but it's nicer to just
print a stack trace and return.
Reported-by: Matt Renzelmann <mjr@cs.wisc.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull workqueue changes from Tejun Heo:
"This contains only one commit which cleans up UP allocation path a
bit."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: use percpu allocator for cwq on UP
I notice that the commit bbddff makes percpu allocator can work on UP,
So we don't need the magic way for UP.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
alloc_workqueue() currently expects the passed in @name pointer to remain
accessible. This is inconvenient and a bit silly given that the whole wq
is being dynamically allocated. This patch updates alloc_workqueue() and
friends to take printf format string instead of opaque string and matching
varargs at the end. The name is allocated together with the wq and
formatted.
alloc_ordered_workqueue() is converted to a macro to unify varargs
handling with alloc_workqueue(), and, while at it, add comment to
alloc_workqueue().
None of the current in-kernel users pass in string with '%' as constant
name and this change shouldn't cause any problem.
[akpm@linux-foundation.org: use __printf]
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.
Nothing to see here but a whole lot of instances of:
-#include <linux/module.h>
+#include <linux/export.h>
This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Take cwq->gcwq->lock to avoid racing between drain_workqueue checking to
make sure the workqueues are empty and cwq_dec_nr_in_flight decrementing
and then incrementing nr_active when it activates a delayed work.
We discovered this when a corner case in one of our drivers resulted in
us trying to destroy a workqueue in which the remaining work would
always requeue itself again in the same workqueue. We would hit this
race condition and trip the BUG_ON on workqueue.c:3080.
Signed-off-by: Thomas Tuttle <ttuttle@chromium.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: separate out drain_workqueue() from destroy_workqueue()
workqueue: remove cancel_rearming_delayed_work[queue]()
* 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Unify input section names
percpu: Avoid extra NOP in percpu_cmpxchg16b_double
percpu: Cast away printk format warning
percpu: Always align percpu output section to PAGE_SIZE
Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
There are users which want to drain workqueues without destroying it.
Separate out drain functionality from destroy_workqueue() into
drain_workqueue() and make it accessible to workqueue users.
To guarantee forward-progress, only chain queueing is allowed while
drain is in progress. If a new work item which isn't chained from the
running or pending work items is queued while draining is in progress,
WARN_ON_ONCE() is triggered.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
If a rescuer and stop_machine() bringing down a CPU race with each
other, they may deadlock on non-preemptive kernel. The CPU won't
accept a new task, so the rescuer can't migrate to the target CPU,
while stop_machine() can't proceed because the rescuer is holding one
of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared
and worker_maybe_bind_and_lock() retries indefinitely.
This problem can be reproduced semi reliably while the system is
entering suspend.
http://thread.gmane.org/gmane.linux.kernel/1122051
A lot of kudos to Thilo-Alexander for reporting this tricky issue and
painstaking testing.
stable: This affects all kernels with cmwq, so all kernels since and
including v2.6.36 need this fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Cc: stable@kernel.org
Percpu allocator honors alignment request upto PAGE_SIZE and both the
percpu addresses in the percpu address space and the translated kernel
addresses should be aligned accordingly. The calculation of the
former depends on the alignment of percpu output section in the kernel
image.
The linker script macros PERCPU_VADDR() and PERCPU() are used to
define this output section and the latter takes @align parameter.
Several architectures are using @align smaller than PAGE_SIZE breaking
percpu memory alignment.
This patch removes @align parameter from PERCPU(), renames it to
PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it,
add PCPU_SETUP_BUG_ON() checks such that alignment problems are
reliably detected and remove percpu alignment comment recently added
in workqueue.c as the condition would trigger BUG way before reaching
there.
For um, this patch raises the alignment of percpu area. As the area
is in .init, there shouldn't be any noticeable difference.
This problem was discovered by David Howells while debugging boot
failure on mn10300.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
ksoftirqd, kworker, migration, and pktgend kthreads can be created with
kthread_create_on_node(), to get proper NUMA affinities for their stack and
task_struct.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix build failure introduced by s/freezeable/freezable/
workqueue: add system_freezeable_wq
rds/ib: use system_wq instead of rds_ib_fmr_wq
net/9p: replace p9_poll_task with a work
net/9p: use system_wq instead of p9_mux_wq
xfs: convert to alloc_workqueue()
reiserfs: make commit_wq use the default concurrency level
ocfs2: use system_wq instead of ocfs2_quota_wq
ext4: convert to alloc_workqueue()
scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
misc/iwmc3200top: use system_wq instead of dedicated workqueues
i2o: use alloc_workqueue() instead of create_workqueue()
acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
fs/aio: aio_wq isn't used in memory reclaim path
input/tps6507x-ts: use system_wq instead of dedicated workqueue
cpufreq: use system_wq instead of dedicated workqueues
wireless/ipw2x00: use system_wq instead of dedicated workqueues
arm/omap: use system_wq in mailbox
workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
In complex subsystems like mac80211 structures can contain several
timers and work structs, so identifying a specific instance from the
call trace and object type output of debugobjects can be hard.
Allow the subsystems which support debugobjects to provide a hint
function. This function returns a pointer to a kernel address
(preferrably the objects callback function) which is printed along
with the debugobjects type.
Add hint methods for timer_list, work_struct and hrtimer.
[ tglx: Massaged changelog, made it compile ]
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <20110307085809.GA9334@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
wq:fixes-2.6.38 does s/WQ_FREEZEABLE/WQ_FREEZABLE and wq:for-2.6.39
adds new usage of the flag. The combination of the two creates a
build failure after merge. Fix it by renaming all freezeables to
freezables.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
MAYDAY_INITIAL_TIMEOUT is defined as HZ / 100 and depending on
configuration may end up 0 or 1. Even when it's 1, depending on when
the mayday timer is added in the current jiffy interval, it may expire
way before a jiffy has passed.
Make sure MAYDAY_INITIAL_TIMEOUT is at least two to guarantee that at
least a full jiffy has passed before calling rescuers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ray Jui <rjui@broadcom.com>
Cc: stable@kernel.org
There are two spellings in use for 'freeze' + 'able' - 'freezable' and
'freezeable'. The former is the more prominent one. The latter is
mostly used by workqueue and in a few other odd places. Unify the
spelling to 'freezable'.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Steven Whitehouse <swhiteho@redhat.com>
After executing the matching works, a rescuer leaves the gcwq whether
there are more pending works or not. This may decrease the
concurrency level to zero and stall execution until a new work item is
queued on the gcwq.
Make rescuer wake up a regular worker when it leaves a gcwq if there
are more works to execute, so that execution isn't stalled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ray Jui <rjui@broadcom.com>
Cc: stable@kernel.org
The nested NOT_RUNNING test in worker_clr_flags() is slightly
misleading in that if NOT_RUNNING were a single flag the nested test
would be always %true and thus noop. Add a comment noting that the
test isn't a noop.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Currently, the lockdep annotation in flush_work() requires exclusive
access on the workqueue the target work is queued on and triggers
warning if a work is trying to flush another work on the same
workqueue; however, this is no longer true as workqueues can now
execute multiple works concurrently.
This patch adds lock_map_acquire_read() and make process_one_work()
hold read access to the workqueue while executing a work and
start_flush_work() check for write access if concurrnecy level is one
or the workqueue has a rescuer (as only one execution resource - the
rescuer - is guaranteed to be available under memory pressure), and
read access if higher.
This better represents what's going on and removes spurious lockdep
warnings which are triggered by fake dependency chain created through
flush_work().
* Peter pointed out that flushing another work from a WQ_MEM_RECLAIM
wq breaks forward progress guarantee under memory pressure.
Condition check accordingly updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
Currently, destroy_workqueue() makes the workqueue deny all new
queueing by setting WQ_DYING and flushes the workqueue once before
proceeding with destruction; however, there are cases where work items
queue more related work items. Currently, such users need to
explicitly flush the workqueue multiple times depending on the
possible depth of such chained queueing.
This patch updates the queueing path such that a work item can queue
further work items on the same workqueue even when WQ_DYING is set.
The flush on destruction is automatically retried until the workqueue
is empty. This guarantees that the workqueue is empty on destruction
while allowing chained queueing.
The flush retry logic whines if it takes too many retries to drain the
workqueue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Running the annotate branch profiler on three boxes, including my
main box that runs firefox, evolution, xchat, and is part of the distcc farm,
showed this with the likelys in the workqueue code:
correct incorrect % Function File Line
------- --------- - -------- ---- ----
96 996253 99 wq_worker_sleeping workqueue.c 703
96 996247 99 wq_worker_waking_up workqueue.c 677
The likely()s in this case were assuming that WORKER_NOT_RUNNING will
most likely be false. But this is not the case. The reason is
(and shown by adding trace_printks and testing it) that most of the time
WORKER_PREP is set.
In worker_thread() we have:
worker_clr_flags(worker, WORKER_PREP);
[ do work stuff ]
worker_set_flags(worker, WORKER_PREP, false);
(that 'false' means not to wake up an idle worker)
The wq_worker_sleeping() is called from schedule when a worker thread
is putting itself to sleep. Which happens most of the time outside
of that [ do work stuff ].
The wq_worker_waking_up is called by the wakeup worker code, which
is also callod outside that [ do work stuff ].
Thus, the likely and unlikely used by those two functions are actually
backwards.
Remove the annotation and let gcc figure it out.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
I found a trivial bug on initialization of workqueue.
Current init_workqueues doesn't check the result of
allocation of system_unbound_wq, this should be checked
like other queues.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Silly though it is, completions and wait_queue_heads use foo_ONSTACK
(COMPLETION_INITIALIZER_ONSTACK, DECLARE_COMPLETION_ONSTACK,
__WAIT_QUEUE_HEAD_INIT_ONSTACK and DECLARE_WAIT_QUEUE_HEAD_ONSTACK) so I
guess workqueues should do the same thing.
s/INIT_WORK_ON_STACK/INIT_WORK_ONSTACK/
s/INIT_DELAYED_WORK_ON_STACK/INIT_DELAYED_WORK_ONSTACK/
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the MN10300 arch, we occasionally see an assertion being tripped in
alloc_cwqs() at the following line:
/* just in case, make sure it's actually aligned */
---> BUG_ON(!IS_ALIGNED(wq->cpu_wq.v, align));
return wq->cpu_wq.v ? 0 : -ENOMEM;
The values are:
wa->cpu_wq.v => 0x902776e0
align => 0x100
and align is calculated by the following:
const size_t align = max_t(size_t, 1 << WORK_STRUCT_FLAG_BITS,
__alignof__(unsigned long long));
This is because the pointer in question (wq->cpu_wq.v) loses some of its
lower bits to control flags, and so the object it points to must be
sufficiently aligned to avoid the need to use those bits for pointing to
things.
Currently, 4 control bits and 4 colour bits are used in normal
circumstances, plus a debugging bit if debugging is set. This requires
the cpu_workqueue_struct struct to be at least 256 bytes aligned (or 512
bytes aligned with debugging).
PERCPU() alignment on MN13000, however, is only 32 bytes as set in
vmlinux.lds.S. So we set this to PAGE_SIZE (4096) to match most other
arches and stick a comment in alloc_cwqs() for anyone else who triggers
the assertion.
Reported-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mark Salter <msalter@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a25909a4 (lockdep: Add an in_workqueue_context() lockdep-based
test function) added in_workqueue_context() but there hasn't been any
in-kernel user and the lockdep annotation in workqueue is scheduled to
change. Remove the unused function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The documentation for schedule_on_each_cpu() states that it calls a
function on each online CPU from keventd. This can easily be
interpreted as an asyncronous call because the description does not
mention that flush_work is called. Clarify that it is synchronous.
tj: rephrased a bit
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add WQ_MEM_RECLAIM flag which currently maps to WQ_RESCUER, mark
WQ_RESCUER as internal and replace all external WQ_RESCUER usages to
WQ_MEM_RECLAIM.
This makes the API users express the intent of the workqueue instead
of indicating the internal mechanism used to guarantee forward
progress. This is also to make it cleaner to add more semantics to
WQ_MEM_RECLAIM. For example, if deemed necessary, memory reclaim
workqueues can be made highpri.
This patch doesn't introduce any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING
and could return %false with highpri work pending. This could lead to
late execution of a highpri work which was delayed due to @max_active
throttling if other works are actively consuming CPU cycles.
For example, the following could happen.
1. Work W0 which burns CPU cycles.
2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1.
3. W1 starts executing and W2 is put to delayed queue. W0 and W1 are
both runnable.
4. W1 finishes which puts W2 to pending queue but keep_working()
incorrectly returns %false and the worker goes to sleep.
5. W0 finishes and W2 starts execution.
With this patch applied, W2 starts execution as soon as W1 finishes.
Signed-off-by: Tejun Heo <tj@kernel.org>
These two tracepoints allow tracking when and how a work is queued and
activated. This patch is based on Frederic's patch to add queue_work
trace point.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Define workqueue_work event class and use it for workqueue_execute_end
trace point. Also, move trace/events/workqueue.h include downwards
such that all struct definitions are visible to it. This is to
prepare for more tracepoints and doesn't cause any functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>