A work is linked to the next one by having WORK_STRUCT_LINKED bit set
and these links can be chained. When a linked work is dispatched to a
worker, all linked works are dispatched to the worker's newly added
->scheduled queue and processed back-to-back.
Currently, as there's only single worker per cwq, having linked works
doesn't make any visible behavior difference. This change is to
prepare for multiple shared workers per cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reimplement workqueue flushing using color coded works. wq has the
current work color which is painted on the works being issued via
cwqs. Flushing a workqueue is achieved by advancing the current work
colors of cwqs and waiting for all the works which have any of the
previous colors to drain.
Currently there are 16 possible colors, one is reserved for no color
and 15 colors are useable allowing 14 concurrent flushes. When color
space gets full, flush attempts are batched up and processed together
when color frees up, so even with many concurrent flushers, the new
implementation won't build up huge queue of flushers which has to be
processed one after another.
Only works which are queued via __queue_work() are colored. Works
which are directly put on queue using insert_work() use NO_COLOR and
don't participate in workqueue flushing. Currently only works used
for work-specific flush fall in this category.
This new implementation leaves only cleanup_workqueue_thread() as the
user of flush_cpu_workqueue(). Just make its users use
flush_workqueue() and kthread_stop() directly and kill
cleanup_workqueue_thread(). As workqueue flushing doesn't use barrier
request anymore, the comment describing the complex synchronization
around it in cleanup_workqueue_thread() is removed together with the
function.
This new implementation is to allow having and sharing multiple
workers per cpu.
Please note that one more bit is reserved for a future work flag by
this patch. This is to avoid shifting bits and updating comments
later.
Signed-off-by: Tejun Heo <tj@kernel.org>
work->data field is used for two purposes. It points to cwq it's
queued on and the lower bits are used for flags. Currently, two bits
are reserved which is always safe as 4 byte alignment is guaranteed on
every architecture. However, future changes will need more flag bits.
On SMP, the percpu allocator is capable of honoring larger alignment
(there are other users which depend on it) and larger alignment works
just fine. On UP, percpu allocator is a thin wrapper around
kzalloc/kfree() and don't honor alignment request.
This patch introduces WORK_STRUCT_FLAG_BITS and implements
alloc/free_cwqs() which guarantees max(1 << WORK_STRUCT_FLAG_BITS,
__alignof__(unsigned long long) alignment both on SMP and UP. On SMP,
simply wrapping percpu allocator is enough. On UP, extra space is
allocated so that cwq can be aligned and the original pointer can be
stored after it which is used in the free path.
* Alignment problem on UP is reported by Michal Simek.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Michal Simek <michal.simek@petalogix.com>
Work flags are about to see more traditional mask handling. Define
WORK_STRUCT_*_BIT as the bit position constant and redefine
WORK_STRUCT_* as bit masks. Also, make WORK_STRUCT_STATIC_* flags
conditional
While at it, re-define these constants as enums and use
WORK_STRUCT_STATIC instead of hard-coding 2 in
WORK_DATA_STATIC_INIT().
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, __create_workqueue_key() takes @singlethread and
@freezeable paramters and store them separately in workqueue_struct.
Merge them into a single flags parameter and field and use
WQ_FREEZEABLE and WQ_SINGLE_THREAD.
Signed-off-by: Tejun Heo <tj@kernel.org>
Make the following updates in preparation of concurrency managed
workqueue. None of these changes causes any visible behavior
difference.
* Add comments and adjust indentations to data structures and several
functions.
* Rename wq_per_cpu() to get_cwq() and swap the position of two
parameters for consistency. Convert a direct per_cpu_ptr() access
to wq->cpu_wq to get_cwq().
* Add work_static() and Update set_wq_data() such that it sets the
flags part to WORK_STRUCT_PENDING | WORK_STRUCT_STATIC if static |
@extra_flags.
* Move santiy check on work->entry emptiness from queue_work_on() to
__queue_work() which all queueing paths share.
* Make __queue_work() take @cpu and @wq instead of @cwq.
* Restructure flush_work() and __create_workqueue_key() to make them
easier to modify.
Signed-off-by: Tejun Heo <tj@kernel.org>
With stop_machine() converted to use cpu_stop, RT workqueue doesn't
have any user left. Kill RT workqueue support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Implement kthread_data() which takes @task pointing to a kthread and
returns @data specified when creating the kthread. The caller is
responsible for ensuring the validity of @task when calling this
function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Implement simple work processor for kthread. This is to ease using
kthread. Single thread workqueue used to be used for things like this
but workqueue won't guarantee fixed kthread association anymore to
enable worker sharing.
This can be used in cases where specific kthread association is
necessary, for example, when it should have RT priority or be assigned
to certain cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
wimax/i2400m: fix missing endian correction read in fw loader
net8139: fix a race at the end of NAPI
pktgen: Fix accuracy of inter-packet delay.
pkt_sched: gen_estimator: add a new lock
net: deliver skbs on inactive slaves to exact matches
ipv6: fix ICMP6_MIB_OUTERRORS
r8169: fix mdio_read and update mdio_write according to hw specs
gianfar: Revive the driver for eTSEC devices (disable timestamping)
caif: fix a couple range checks
phylib: Add support for the LXT973 phy.
net: Print num_rx_queues imbalance warning only when there are allocated queues
Currently, the accelerated receive path for VLAN's will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop(). This behavior is different then
the non-accelerated path and for pkts over a bonded vlan.
For example,
vlanx -> bond0 -> ethx
will be dropped in the vlan path and not delivered to any
packet handlers at all. However,
bond0 -> vlanx -> ethx
and
bond0 -> ethx
will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.
This patch adds a sk_buff flag which is used for tagging
skbs that would previously been dropped and allows the
skb to continue to skb_netif_recv(). Here we add
logic to check for the deliver_no_wcard flag and if it
is set only deliver to handlers that match exactly. This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped out right in the vlan path.
I have tested the following 4 configurations in failover modes
and load balancing modes.
# bond0 -> ethx
# vlanx -> bond0 -> ethx
# bond0 -> vlanx -> ethx
# bond0 -> ethx
|
vlanx -> --
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
10, 233 is allocated officially to /dev/kmview which is shipping in
Ubuntu and Debian distributions. vhost_net seem to have borrowed it
without making a proper request and this causes regressions in the other
distributions.
vhost_net can use a dynamic minor so use that instead. Also update the
file with a comment to try and avoid future misunderstandings.
cc: stable@kernel.org
Signed-off-by: Alan Cox <device@lanana.org>
[ We should have caught this before 2.6.34 got released. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a filesystem writes more than one page in ->writepage, write_cache_pages
fails to notice this and continues to attempt writeback when wbc->nr_to_write
has gone negative - this trace was captured from XFS:
wbc_writeback_start: towrt=1024
wbc_writepage: towrt=1024
wbc_writepage: towrt=0
wbc_writepage: towrt=-1
wbc_writepage: towrt=-5
wbc_writepage: towrt=-21
wbc_writepage: towrt=-85
This has adverse effects on filesystem writeback behaviour. write_cache_pages()
needs to terminate after a certain number of pages are written, not after a
certain number of calls to ->writepage are made. This is a regression
introduced by 17bc6c30cf ("vfs: Add
no_nrwrite_index_update writeback control flag"), but cannot be reverted
directly due to subsequent bug fixes that have gone in on top of it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Concurrency managed workqueue needs to know when workers are going to
sleep and waking up. Using these two hooks, cmwq keeps track of the
current concurrency level and throttles execution of new works if it's
too high and wakes up another worker from the sleep hook if it becomes
too low.
This patch introduces PF_WQ_WORKER to identify workqueue workers and
adds the following two hooks.
* wq_worker_waking_up(): called when a worker is woken up.
* wq_worker_sleeping(): called when a worker is going to sleep and may
return a pointer to a local task which should be woken up. The
returned task is woken up using try_to_wake_up_local() which is
simplified ttwu which is called under rq lock and can only wake up
local tasks.
Both hooks are currently defined as noop in kernel/workqueue_sched.h.
Later cmwq implementation will replace them with proper
implementation.
These hooks are hard coded as they'll always be enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Currently, when a cpu goes down, cpu_active is cleared before
CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
default priority cpu notifier. When a cpu is coming up, it's set
before CPU_ONLINE but cpuset configuration again is updated from the
same cpu notifier.
For cpu notifiers, this presents an inconsistent state. Threads which
a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
migrated to other cpus because the cpu is no more inactive.
Fix it by updating cpu_active in the highest priority cpu notifier and
cpuset configuration in the second highest when a cpu is coming up.
Down path is updated similarly. This guarantees that all other cpu
notifiers see consistent cpu_active and cpuset configuration.
cpuset_track_online_cpus() notifier is converted to
cpuset_update_active_cpus() which just updates the configuration and
now called from cpuset_cpu_[in]active() notifiers registered from
sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus()
degenerates into partition_sched_domains() making separate notifier
for !CONFIG_CPUSETS unnecessary.
This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug
callback creates a kthread and kthread_bind()s it to the target cpu,
and the thread is expected to run on that cpu.
* Ingo's test discovered __cpuinit/exit markups were incorrect.
Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Menage <menage@google.com>
JMB362 is a new variant of jmicron controller which is similar to
JMB360 but has two SATA ports instead of one. As there is no PATA
port, single function AHCI mode can be used as in JMB360. Add pci
quirk for JMB362.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Aries Lee <arieslee@jmicron.com>
Cc: stable@kernel.org
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Minix: Clean up left over label
fix truncate inode time modification breakage
fix setattr error handling in sysfs, configfs
fcntl: return -EFAULT if copy_to_user fails
wrong type for 'magic' argument in simple_fill_super()
fix the deadlock in qib_fs
mqueue doesn't need make_bad_inode()
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
module: fix bne2 "gave up waiting for init of module libcrc32c"
module: verify_export_symbols under the lock
module: move find_module check to end
module: make locking more fine-grained.
module: Make module sysfs functions private.
module: move sysfs exposure to end of load_module
module: fix kdb's illicit use of struct module_use.
module: Make the 'usage' lists be two-way
These were placed in the header in ef665c1a06 to get the various
SYSFS/MODULE config combintations to compile.
That may have been necessary then, but it's not now. These functions
are all local to module.c.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
When adding a module that depends on another one, we used to create a
one-way list of "modules_which_use_me", so that module unloading could
see who needs a module.
It's actually quite simple to make that list go both ways: so that we
not only can see "who uses me", but also see a list of modules that are
"used by me".
In fact, we always wanted that list in "module_unload_free()": when we
unload a module, we want to also release all the other modules that are
used by that module. But because we didn't have that list, we used to
first iterate over all modules, and then iterate over each "used by me"
list of that module.
By making the list two-way, we simplify module_unload_free(), and it
allows for some trivial fixes later too.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned & rebased)
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (23 commits)
sh: Make intc messages consistent via pr_fmt.
sh: make sure static declaration on ms7724se
sh: make sure static declaration on mach-migor
sh: make sure static declaration on mach-ecovec24
sh: make sure static declaration on mach-ap325rxa
clocksource: sh_cmt: compute mult and shift before registration
clocksource: sh_tmu: compute mult and shift before registration
sh: PIO disabling for x3proto and urquell.
sh: mach-sdk7786: conditionally disable PIO support.
sh: support for platforms without PIO.
usb: r8a66597-hcd pio to mmio accessor conversion.
usb: gadget: r8a66597-udc pio to mmio accessor conversion.
usb: gadget: m66592-udc pio to mmio accessor conversion.
sh: add romImage MMCIF boot for sh7724 and Ecovec V2
sh: add boot code to MMCIF driver header
sh: prepare MMCIF driver header file
sh: allow romImage data between head.S and the zero page
sh: Add support MMCIF for ecovec
sh: remove duplicated #include
input: serio: disable i8042 for non-cayman sh platforms.
...
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core: (83 commits)
i7core_edac: Better describe the supported devices
Add support for Westmere to i7core_edac driver
i7core_edac: don't free on success
i7core_edac: Add support for X5670
Always call i7core_[ur]dimm_check_mc_ecc_err
i7core_edac: fix memory leak of i7core_dev
EDAC: add __init to i7core_xeon_pci_fixup
i7core_edac: Fix wrong device id for channel 1 devices
i7core: add support for Lynnfield alternate address
i7core_edac: Add initial support for Lynnfield
i7core_edac: do not export static functions
edac: fix i7core build
edac: i7core_edac produces undefined behaviour on 32bit
i7core_edac: Use a more generic approach for probing PCI devices
i7core_edac: PCI device is called NONCORE, instead of NOCORE
i7core_edac: Fix ringbuffer maxsize
i7core_edac: First store, then increment
i7core_edac: Better parse "any" addrmask
i7core_edac: Use a lockless ringbuffer
edac: Create an unique instance for each kobj
...