Commit Graph

742 Commits

Author SHA1 Message Date
Mark Brown
f3124ba519 Merge tag 'v3.10.35' into linux-linaro-lsk
This is the 3.10.35 stable release
2014-03-31 23:51:22 +01:00
Vaibhav Nagarnaik
a1c10a94ff tracing: Fix array size mismatch in format string
commit 87291347c4 upstream.

In event format strings, the array size is reported in two locations.
One in array subscript and then via the "size:" attribute. The values
reported there have a mismatch.

For e.g., in sched:sched_switch the prev_comm and next_comm character
arrays have subscript values as [32] where as the actual field size is
16.

name: sched_switch
ID: 301
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1;signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char prev_comm[32];       offset:8;       size:16;        signed:1;
        field:pid_t prev_pid;   offset:24;      size:4; signed:1;
        field:int prev_prio;    offset:28;      size:4; signed:1;
        field:long prev_state;  offset:32;      size:8; signed:1;
        field:char next_comm[32];       offset:40;      size:16;        signed:1;
        field:pid_t next_pid;   offset:56;      size:4; signed:1;
        field:int next_prio;    offset:60;      size:4; signed:1;

After bisection, the following commit was blamed:
92edca0 tracing: Use direct field, type and system names

This commit removes the duplication of strings for field->name and
field->type assuming that all the strings passed in
__trace_define_field() are immutable. This is not true for arrays, where
the type string is created in event_storage variable and field->type for
all array fields points to event_storage.

Use __stringify() to create a string constant for the type string.

Also, get rid of event_storage and event_storage_mutex that are not
needed anymore.

also, an added benefit is that this reduces the overhead of events a bit more:

   text    data     bss     dec     hex filename
8424787 2036472 1302528 11763787         b3804b vmlinux
8420814 2036408 1302528 11759750         b37086 vmlinux.patched

Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com

Cc: Laurent Chavey <chavey@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-31 09:58:12 -07:00
Mark Brown
dffe2a3eed Merge tag 'v3.10.22' into linux-linaro-lsk
This is the 3.10.22 stable release
2013-12-05 10:15:16 +00:00
Steven Rostedt (Red Hat)
1bfdf02fc0 tracing: Allow events to have NULL strings
commit 4e58e54754 upstream.

If an TRACE_EVENT() uses __assign_str() or __get_str on a NULL pointer
then the following oops will happen:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<c127a17b>] strlen+0x10/0x1a
*pde = 00000000 ^M
Oops: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.13.0-rc1-test+ #2
Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006^M
task: f5cde9f0 ti: f5e5e000 task.ti: f5e5e000
EIP: 0060:[<c127a17b>] EFLAGS: 00210046 CPU: 1
EIP is at strlen+0x10/0x1a
EAX: 00000000 EBX: c2472da8 ECX: ffffffff EDX: c2472da8
ESI: c1c5e5fc EDI: 00000000 EBP: f5e5fe84 ESP: f5e5fe80
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000000 CR3: 01f32000 CR4: 000007d0
Stack:
 f5f18b90 f5e5feb8 c10687a8 0759004f 00000005 00000005 00000005 00200046
 00000002 00000000 c1082a93 f56c7e28 c2472da8 c1082a93 f5e5fee4 c106bc61^M
 00000000 c1082a93 00000000 00000000 00000001 00200046 00200082 00000000
Call Trace:
 [<c10687a8>] ftrace_raw_event_lock+0x39/0xc0
 [<c1082a93>] ? ktime_get+0x29/0x69
 [<c1082a93>] ? ktime_get+0x29/0x69
 [<c106bc61>] lock_release+0x57/0x1a5
 [<c1082a93>] ? ktime_get+0x29/0x69
 [<c10824dd>] read_seqcount_begin.constprop.7+0x4d/0x75
 [<c1082a93>] ? ktime_get+0x29/0x69^M
 [<c1082a93>] ktime_get+0x29/0x69
 [<c108a46a>] __tick_nohz_idle_enter+0x1e/0x426
 [<c10690e8>] ? lock_release_holdtime.part.19+0x48/0x4d
 [<c10bc184>] ? time_hardirqs_off+0xe/0x28
 [<c1068c82>] ? trace_hardirqs_off_caller+0x3f/0xaf
 [<c108a8cb>] tick_nohz_idle_enter+0x59/0x62
 [<c1079242>] cpu_startup_entry+0x64/0x192
 [<c102299c>] start_secondary+0x277/0x27c
Code: 90 89 c6 89 d0 88 c4 ac 38 e0 74 09 84 c0 75 f7 be 01 00 00 00 89 f0 48 5e 5d c3 55 89 e5 57 66 66 66 66 90 83 c9 ff 89 c7 31 c0 <f2> ae f7 d1 8d 41 ff 5f 5d c3 55 89 e5 57 66 66 66 66 90 31 ff
EIP: [<c127a17b>] strlen+0x10/0x1a SS:ESP 0068:f5e5fe80
CR2: 0000000000000000
---[ end trace 01bc47bf519ec1b2 ]---

New tracepoints have been added that have allowed for NULL pointers
being assigned to strings. To fix this, change the TRACE_EVENT() code
to check for NULL and if it is, it will assign "(null)" to it instead
(similar to what glibc printf does).

Reported-by: Shuah Khan <shuah.kh@samsung.com>
Reported-by: Jovi Zhangwei <jovi.zhangwei@gmail.com>
Link: http://lkml.kernel.org/r/CAGdX0WFeEuy+DtpsJzyzn0343qEEjLX97+o1VREFkUEhndC+5Q@mail.gmail.com
Link: http://lkml.kernel.org/r/528D6972.9010702@samsung.com
Fixes: 9cbf117662 ("tracing/events: provide string with undefined size support")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 10:57:17 -08:00
Mark Brown
bcd63b1df3 Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lsk 2013-11-21 11:59:02 +00:00
Mathieu Poirier
2a68d1e912 HMP: Avoid using the cpu stopper to stop runnable tasks
When migrating a runnable task, we use the CPU stopper on
the source CPU to ensure that the task to be moved is not
currently running. Before this patch, all forced migrations
(up, offload, idle pull) use the stopper for every migration.

Using the CPU stopper is mandatory only when a task is currently
running on a CPU.  Otherwise tasks can be moved by locking the
source and destination run queues.

This patch checks to see if the task to be moved are currently
running.  If not the task is moved directly without using the
stopper thread.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-11-21 11:26:08 +00:00
Mark Brown
b9e7900a8a smp: Don't use typedef to work around compiler issue with tracepoints
Having the typedef in place for the tracepoints causes compiler crashes
in some situations.  Just using void * directly avoids triggering the
issue and should have no effect on the trace.

Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Liviu Dudau <Liviu.Dudau@arm.com>
2013-10-30 09:56:11 -07:00
Mark Brown
31d264741e Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lsk 2013-10-14 17:13:46 +01:00
Mark Brown
6adb80d9ec smp: Don't use typedef to work around compiler issue with tracepoints
Having the typedef in place for the tracepoints causes compiler crashes
in some situations.  Just using void * directly avoids triggering the
issue and should have no effect on the trace.

Signed-off-by: Mark Brown <broonie@linaro.org>
Acked-by: Liviu Dudau <Liviu.Dudau@arm.com>
2013-10-14 14:04:39 +01:00
Mark Brown
fa4b900fca Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lsk 2013-10-11 19:26:24 +01:00
Chris Redpath
5ecaba3d9f smp: smp_cross_call function pointer tracing
generic tracing for smp_cross_call function calls

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11 15:07:18 +01:00
Chris Redpath
2353c1f800 arm: ipi raise/start/end tracing
Add tracepoints for IPI raise events, and start and end of the
ipi handler.

Used to inspect the source of CPU wake-ups which are not already
traced - all other reasons for a CPU to wake-up are already
covered.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11 15:07:18 +01:00
Chris Redpath
7b8e0b3f2a sched: HMP: Additional trace points for debugging HMP behaviour
1. Replace magic numbers in code for migration trace.
   Trace points still emit a number as force=<n> field:
     force=0 : wakeup migration
     force=1 : forced migration
     force=2 : offload migration
     force=3 : idle pull migration

2. Add trace to expose offload decision-making.
   Also adds tracing rq->nr_running so that you can
   look back to see what state the RQ was in at the time.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
2013-10-11 15:07:17 +01:00
Mark Brown
38b5268356 Merge remote-tracking branch 'lsk/v3.10/topic/big.LITTLE' into linux-linaro-lsk 2013-07-18 16:42:46 +01:00
Morten Rasmussen
0d811e649a sched: Add HMP task migration ftrace event
Adds ftrace event for tracing task migrations using HMP
optimized scheduling.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:25 +01:00
Morten Rasmussen
b9d3d56128 sched: Add ftrace events for entity load-tracking
Adds ftrace events for key variables related to the entity
load-tracking to help debugging scheduler behaviour. Allows tracing
of load contribution and runqueue residency ratio for both entities
and runqueues as well as entity CPU usage ratio.

Signed-off-by: Morten Rasmussen <Morten.Rasmussen@arm.com>
2013-07-17 11:12:25 +01:00
Dave Martin
b87db1b9e3 ARM: bL_switcher/trace: Add trace trigger for trace bootstrapping
When tracing switching, an external tracer needs a way to bootstrap
its knowledge of the logical<->physical CPU mapping.

This patch adds a sysfs attribute trace_trigger.  A write to this
attribute will generate a power:cpu_migrate_current event for each
online CPU, indicating the current physical CPU for each logical
CPU.

Activating or deactivating the switcher also generates these
events, so that the tracer knows about the resulting remapping of
affected CPUs.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
2013-06-20 00:45:29 -04:00
Dave Martin
d4d434a824 ARM: bL_switcher: Basic trace events support
This patch adds simple trace events to the b.L switcher code
to allow tracing of CPU migration events.

To make use of the trace events, you will need:

CONFIG_FTRACE=y
CONFIG_ENABLE_DEFAULT_TRACERS=y

The following events are added:
  * power:cpu_migrate_begin
  * power:cpu_migrate_finish

each with the following data:
    u64     timestamp;
    u32     cpu_hwid;

power:cpu_migrate_begin occurs immediately before the
switcher-specific migration operations start.
power:cpu_migrate_finish occurs immediately when migration is
completed.

The cpu_hwid field contains the ID fields of the MPIDR.

* For power:cpu_migrate_begin, cpu_hwid is the ID of the outbound
  physical CPU (equivalent to (from_phys_cpu,from_phys_cluster)).

* For power:cpu_migrate_finish, cpu_hwid is the ID of the inbound
  physical CPU (equivalent to (to_phys_cpu,to_phys_cluster)).

By design, the cpu_hwid field is masked in the same way as the
device tree cpu node reg property, allowing direct correlation to
the DT description of the hardware.

The timestamp is added in order to minimise timing noise.  An
accurate system-wide clock should be used for generating this
(hopefully getnstimeofday is appropriate, but it could be changed).
It could be any monotonic shared clock, since the aim is to allow
accurate deltas to be computed.  We don't necessarily care about
accurate synchronisation with wall clock time.

In practice, each switch takes place on a single logical CPU,
and the trace infrastructure should guarantee that events are
well-ordered with respect to a single logical CPU.

Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
2013-06-20 00:45:28 -04:00
Linus Torvalds
b973425cbb Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
 "Fixed regressions (two stability regressions and a performance
  regression) introduced during the 3.10-rc1 merge window.

  Also included is a bug fix relating to allocating blocks after
  resizing an ext3 file system when using the ext4 file system driver"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
  ext4: revert "ext4: use io_end for multiple bios"
  ext4: limit group search loop for non-extent files
  ext4: fix fio regression
2013-05-14 09:30:54 -07:00
Linus Torvalds
942d33da99 Merge tag 'f2fs-for-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes the following major enhancement patches.
   - introduce a new gloabl lock scheme
   - add tracepoints on several major functions
   - fix the overall cleaning process focused on victim selection
   - apply the block plugging to merge IOs as much as possible
   - enhance management of free nids and its list
   - enhance the readahead mode for node pages
   - address several cretical deadlock conditions
   - reduce lock_page calls

  The other minor bug fixes and enhancements are as follows.
   - calculation mistakes: overflow
   - bio types: READ, READA, and READ_SYNC
   - fix the recovery flow, data races, and null pointer errors"

* tag 'f2fs-for-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
  f2fs: cover free_nid management with spin_lock
  f2fs: optimize scan_nat_page()
  f2fs: code cleanup for scan_nat_page() and build_free_nids()
  f2fs: bugfix for alloc_nid_failed()
  f2fs: recover when journal contains deleted files
  f2fs: continue to mount after failing recovery
  f2fs: avoid deadlock during evict after f2fs_gc
  f2fs: modify the number of issued pages to merge IOs
  f2fs: remove useless #include <linux/proc_fs.h> as we're now using sysfs as debug entry.
  f2fs: fix inconsistent using of NM_WOUT_THRESHOLD
  f2fs: check truncation of mapping after lock_page
  f2fs: enhance alloc_nid and build_free_nids flows
  f2fs: add a tracepoint on f2fs_new_inode
  f2fs: check nid == 0 in add_free_nid
  f2fs: add REQ_META about metadata requests for submit
  f2fs: give a chance to merge IOs by IO scheduler
  f2fs: avoid frequent background GC
  f2fs: add tracepoints to debug checkpoint request
  f2fs: add tracepoints for write page operations
  f2fs: add tracepoints to debug the block allocation
  ...
2013-05-08 15:11:48 -07:00
Linus Torvalds
ebb3727779 Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
2013-05-08 11:51:05 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kents prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejuns changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework, SCSI patches exists on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Linus Torvalds
01227a889e Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Gleb Natapov:
 "Highlights of the updates are:

  general:
   - new emulated device API
   - legacy device assignment is now optional
   - irqfd interface is more generic and can be shared between arches

  x86:
   - VMCS shadow support and other nested VMX improvements
   - APIC virtualization and Posted Interrupt hardware support
   - Optimize mmio spte zapping

  ppc:
    - BookE: in-kernel MPIC emulation with irqfd support
    - Book3S: in-kernel XICS emulation (incomplete)
    - Book3S: HV: migration fixes
    - BookE: more debug support preparation
    - BookE: e6500 support

  ARM:
   - reworking of Hyp idmaps

  s390:
   - ioeventfd for virtio-ccw

  And many other bug fixes, cleanups and improvements"

* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  kvm: Add compat_ioctl for device control API
  KVM: x86: Account for failing enable_irq_window for NMI window request
  KVM: PPC: Book3S: Add API for in-kernel XICS emulation
  kvm/ppc/mpic: fix missing unlock in set_base_addr()
  kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
  kvm/ppc/mpic: remove users
  kvm/ppc/mpic: fix mmio region lists when multiple guests used
  kvm/ppc/mpic: remove default routes from documentation
  kvm: KVM_CAP_IOMMU only available with device assignment
  ARM: KVM: iterate over all CPUs for CPU compatibility check
  KVM: ARM: Fix spelling in error message
  ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
  KVM: ARM: Fix API documentation for ONE_REG encoding
  ARM: KVM: promote vfp_host pointer to generic host cpu context
  ARM: KVM: add architecture specific hook for capabilities
  ARM: KVM: perform HYP initilization for hotplugged CPUs
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: rework HYP page table freeing
  ARM: KVM: enforce maximum size for identity mapped code
  ARM: KVM: move to a KVM provided HYP idmap
  ...
2013-05-05 14:47:31 -07:00
Linus Torvalds
534c97b095 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'full dynticks' support from Ingo Molnar:
 "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
  kernel feature to the timer and scheduler subsystems: 'full dynticks',
  or CONFIG_NO_HZ_FULL=y.

  This feature extends the nohz variable-size timer tick feature from
  idle to busy CPUs (running at most one task) as well, potentially
  reducing the number of timer interrupts significantly.

  This feature got motivated by real-time folks and the -rt tree, but
  the general utility and motivation of full-dynticks runs wider than
  that:

   - HPC workloads get faster: CPUs running a single task should be able
     to utilize a maximum amount of CPU power.  A periodic timer tick at
     HZ=1000 can cause a constant overhead of up to 1.0%.  This feature
     removes that overhead - and speeds up the system by 0.5%-1.0% on
     typical distro configs even on modern systems.

   - Real-time workload latency reduction: CPUs running critical tasks
     should experience as little jitter as possible.  The last remaining
     source of kernel-related jitter was the periodic timer tick.

   - A single task executing on a CPU is a pretty common situation,
     especially with an increasing number of cores/CPUs, so this feature
     helps desktop and mobile workloads as well.

  The cost of the feature is mainly related to increased timer
  reprogramming overhead when a CPU switches its tick period, and thus
  slightly longer to-idle and from-idle latency.

  Configuration-wise a third mode of operation is added to the existing
  two NOHZ kconfig modes:

   - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
     as a config option.  This is the traditional Linux periodic tick
     design: there's a HZ tick going on all the time, regardless of
     whether a CPU is idle or not.

   - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
     periodic tick when a CPU enters idle mode.

   - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
     tick when a CPU is idle, also slows the tick down to 1 Hz (one
     timer interrupt per second) when only a single task is running on a
     CPU.

  The .config behavior is compatible: existing !CONFIG_NO_HZ and
  CONFIG_NO_HZ=y settings get translated to the new values, without the
  user having to configure anything.  CONFIG_NO_HZ_FULL is turned off by
  default.

  This feature is based on a lot of infrastructure work that has been
  steadily going upstream in the last 2-3 cycles: related RCU support
  and non-periodic cputime support in particular is upstream already.

  This tree adds the final pieces and activates the feature.  The pull
  request is marked RFC because:

   - it's marked 64-bit only at the moment - the 32-bit support patch is
     small but did not get ready in time.

   - it has a number of fresh commits that came in after the merge
     window.  The overwhelming majority of commits are from before the
     merge window, but still some aspects of the tree are fresh and so I
     marked it RFC.

   - it's a pretty wide-reaching feature with lots of effects - and
     while the components have been in testing for some time, the full
     combination is still not very widely used.  That it's default-off
     should reduce its regression abilities and obviously there are no
     known regressions with CONFIG_NO_HZ_FULL=y enabled either.

   - the feature is not completely idempotent: there is no 100%
     equivalent replacement for a periodic scheduler/timer tick.  In
     particular there's ongoing work to map out and reduce its effects
     on scheduler load-balancing and statistics.  This should not impact
     correctness though, there are no known regressions related to this
     feature at this point.

   - it's a pretty ambitious feature that with time will likely be
     enabled by most Linux distros, and we'd like you to make input on
     its design/implementation, if you dislike some aspect we missed.
     Without flaming us to crisp! :-)

  Future plans:

   - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
     the periodic tick altogether when there's a single busy task on a
     CPU.  We'd first like 1 Hz to be exposed more widely before we go
     for the 0 Hz target though.

   - once we reach 0 Hz we can remove the periodic tick assumption from
     nr_running>=2 as well, by essentially interrupting busy tasks only
     as frequently as the sched_latency constraints require us to do -
     once every 4-40 msecs, depending on nr_running.

  I am personally leaning towards biting the bullet and doing this in
  v3.10, like the -rt tree this effort has been going on for too long -
  but the final word is up to you as usual.

  More technical details can be found in Documentation/timers/NO_HZ.txt"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  sched: Keep at least 1 tick per second for active dynticks tasks
  rcu: Fix full dynticks' dependency on wide RCU nocb mode
  nohz: Protect smp_processor_id() in tick_nohz_task_switch()
  nohz_full: Add documentation.
  cputime_nsecs: use math64.h for nsec resolution conversion helpers
  nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
  nohz: Reduce overhead under high-freq idling patterns
  nohz: Remove full dynticks' superfluous dependency on RCU tree
  nohz: Fix unavailable tick_stop tracepoint in dynticks idle
  nohz: Add basic tracing
  nohz: Select wide RCU nocb for full dynticks
  nohz: Disable the tick when irq resume in full dynticks CPU
  nohz: Re-evaluate the tick for the new task after a context switch
  nohz: Prepare to stop the tick on irq exit
  nohz: Implement full dynticks kick
  nohz: Re-evaluate the tick from the scheduler IPI
  sched: New helper to prevent from stopping the tick in full dynticks
  sched: Kick full dynticks CPU that have more than one task enqueued.
  perf: New helper to prevent full dynticks CPUs from stopping tick
  perf: Kick full dynticks CPU if events rotation is needed
  ...
2013-05-05 13:23:27 -07:00
Yan, Zheng
e30b5dca15 ext4: fix fio regression
We (Linux Kernel Performance project) found a regression introduced
by commit:

  f7fec032aa ext4: track all extent status in extent status tree

The commit causes about 20% performance decrease in fio random write
test. Profiler shows that rb_next() uses a lot of CPU time. The call
stack is:

  rb_next
  ext4_es_find_delayed_extent
  ext4_map_blocks
  _ext4_get_block
  ext4_get_block_write
  __blockdev_direct_IO
  ext4_direct_IO
  generic_file_direct_write
  __generic_file_aio_write
  ext4_file_write
  aio_rw_vect_retry
  aio_run_iocb
  do_io_submit
  sys_io_submit
  system_call_fastpath
  io_submit
  td_io_getevents
  io_u_queued_complete
  thread_main
  main
  __libc_start_main

The cause is that ext4_es_find_delayed_extent() doesn't have an
upper bound, it keeps searching until a delayed extent is found.
When there are a lots of non-delayed entries in the extent state
tree, ext4_es_find_delayed_extent() may uses a lot of CPU time.

Reported-by: LKP project <lkp@linux.intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
2013-05-03 02:15:52 -04:00