Pull irq updates from Thomas Gleixner:
"This updated pull request does not contain the last few GIC related
patches which were reported to cause a regression. There is a fix
available, but I let it breed for a couple of days first.
The irq departement provides:
- new infrastructure to support non PCI based MSI interrupts
- a couple of new irq chip drivers
- the usual pile of fixlets and updates to irq chip drivers
- preparatory changes for removal of the irq argument from interrupt
flow handlers
- preparatory changes to remove IRQF_VALID"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
irqchip/imx-gpcv2: IMX GPCv2 driver for wakeup sources
irqchip: Add bcm2836 interrupt controller for Raspberry Pi 2
irqchip: Add documentation for the bcm2836 interrupt controller
irqchip/bcm2835: Add support for being used as a second level controller
irqchip/bcm2835: Refactor handle_IRQ() calls out of MAKE_HWIRQ
PCI: xilinx: Fix typo in function name
irqchip/gic: Ensure gic_cpu_if_up/down() programs correct GIC instance
irqchip/gic: Only allow the primary GIC to set the CPU map
PCI/MSI: pci-xgene-msi: Consolidate chained IRQ handler install/remove
unicore32/irq: Prepare puv3_gpio_handler for irq argument removal
tile/pci_gx: Prepare trio_handle_level_irq for irq argument removal
m68k/irq: Prepare irq handlers for irq argument removal
C6X/megamode-pic: Prepare megamod_irq_cascade for irq argument removal
blackfin: Prepare irq handlers for irq argument removal
arc/irq: Prepare idu_cascade_isr for irq argument removal
sparc/irq: Use access helper irq_data_get_affinity_mask()
sparc/irq: Use helper irq_data_get_irq_handler_data()
parisc/irq: Use access helper irq_data_get_affinity_mask()
mn10300/irq: Use access helper irq_data_get_affinity_mask()
irqchip/i8259: Prepare i8259_irq_dispatch for irq argument removal
...
In the days of ARC700, performance counters had no interrupt support,
so on ARC only non-sampling events were available.
Put simply, only "perf stat" was functional.
Now with ARC HS performance counters do support interrupts, and this
change introduces support for them.
ARC performance counters behave as follows with regard to interrupt
generation:
[1] A counter counts up from the value set in the PCT_COUNT register
pair
[2] Once the counter reaches the value set in PCT_INT_CNT, an
interrupt is raised
The basic setup looks like this (a code sketch follows the steps):
[1] PCT_COUNT = 0;
[2] PCT_INT_CNT = __limit_value__;
[3] Enable interrupts for that counter and let it run
[4] Let the counter reach its limit
[5] Handle the interrupt when it fires
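A minimal sketch of those steps in C, assuming the usual ARC
write_aux_reg() accessor; the PCT register index macros and the
counter-select step are illustrative, not the exact driver code:
---------->8-----------
/* select the counter to program (illustrative macro names) */
write_aux_reg(ARC_REG_PCT_INDEX, idx);

/* [1] PCT_COUNT = 0: count up from zero (low/high register pair) */
write_aux_reg(ARC_REG_PCT_COUNTL, 0);
write_aux_reg(ARC_REG_PCT_COUNTH, 0);

/* [2] PCT_INT_CNT = limit: IRQ is raised once the count hits this */
write_aux_reg(ARC_REG_PCT_INT_CNTL, lower_32_bits(limit));
write_aux_reg(ARC_REG_PCT_INT_CNTH, upper_32_bits(limit));

/* [3] enable the interrupt for this counter and let it run */
write_aux_reg(ARC_REG_PCT_INT_CTRL, 1U << idx);
---------->8-----------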
Note that the PCT hardware block is built into the CPU core, so its
interrupt line (which is basically an OR of all counters' IRQs) is
wired directly to the top-level IRQC. That means that to de-assert the
PCT interrupt, IRQs from all counters that have reached their limit
values must be reset.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This generalization prepares for support of overflow interrupts.
Hardware event counters on ARC work this way:
Each counter counts up from a programmed start value (set in
ARC_REG_PCT_COUNT) to a limit value (set in ARC_REG_PCT_INT_CNT), and
once the limit value is reached the counter generates an interrupt.
Even though this hardware implementation allows for more flexibility,
in the Linux kernel we decided to mimic the behaviour of other
architectures this way:
[1] Set the limit value to half of the counter's max value (to allow
the counter to keep running after reaching its limit; see below for
more explanation):
---------->8-----------
arc_pmu->max_period = (1ULL << counter_size) / 2 - 1ULL;
---------->8-----------
[2] Set the start value to "arc_pmu->max_period - sample_period" and
then count up to the limit
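A sketch of that programming step, assuming hypothetical
write_pct_count()/write_pct_int_cnt() helpers for the register pairs:
---------->8-----------
static void arc_pmu_set_period(int idx, u64 sample_period)
{
	u64 limit = arc_pmu->max_period;	/* half the counter range */
	u64 start = limit - sample_period;	/* IRQ after sample_period events */

	write_pct_count(idx, start);		/* counter counts up from here */
	write_pct_int_cnt(idx, limit);		/* interrupt fires at the limit */
}
---------->8-----------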
Our event counters don't stop on reaching the max value (the one we
set in ARC_REG_PCT_INT_CNT) but continue to count until the kernel
explicitly stops each of them.
Setting the limit to half of the counter's capacity is done to allow
capturing the additional events that occur between the moment the
interrupt is triggered and the moment we actually process it in the
PMU interrupt handler. That way we're trying to be more precise.
For example, if we count CPU cycles, we keep track of the cycles spent
running through the generic IRQ handling code:
[1] We set the counter period to, say, 100_000 events of type "crun"
[2] The counter reaches that limit and raises its interrupt
[3] Once we get into the PMU IRQ handler we read the current counter
value from ARC_REG_PCT_SNAP and see something like 105_000.
If counters stopped on reaching the limit value, we would miss the
additional 5_000 cycles.
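A sketch of that read in the handler, assuming a hypothetical
read_pct_snap() accessor for the ARC_REG_PCT_SNAP pair:
---------->8-----------
/* in the PMU IRQ handler: account everything counted so far,
 * including events that landed after the interrupt was raised */
u64 snap = read_pct_snap(idx);		/* e.g. 105_000 */
u64 delta = snap - prev_count;		/* all events, not just the period */

local64_add(delta, &event->count);	/* nothing lost past the limit */
---------->8-----------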
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The number of counters in PCT can never be more than 32 (while
countable conditions could be 100+) for both ARCompact and ARCv2.
And while at it, update the copyright dates.
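An illustrative consequence of that cap (the field names here are
assumptions, not the exact driver struct):
---------->8-----------
u32          used_counters;	/* <= 32 counters: one bit each fits in u32 */
unsigned int n_conditions;	/* countable conditions: may exceed 100 */
---------->8-----------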
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
When the kernel binary becomes large enough (32M and more), errors
may occur during the final link stage. This happens because
the build system uses short relocations for ARC by default.
The problem can easily be resolved by passing the -mlong-calls
option to GCC, which uses long absolute jumps (j) instead of short
relative branches (b).
But there are fragments of pure assembly code which use
branches in inappropriate places and cause a link error because
of relocation overflow.
The first of these fragments is the .fixup insertion in futex.h and
unaligned.c. It places code in a separate section (.fixup)
with a branch instruction, which leads to a link error when the
kernel becomes large.
The second of these fragments is the calling of scheduler functions
(common kernel code) from ARC's entry.S. When the kernel binary
becomes large, this may lead to a link error because the
scheduler may end up far enough away from ARC's code in the final
binary.
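The shape of the problematic .fixup pattern, as a hedged illustration
(exception-table plumbing omitted; not the exact kernel asm):
---------->8-----------
__asm__ __volatile__(
"1:	ld	%0, [%1]		\n"	/* access that may fault */
"2:				\n"
"	.section .fixup, \"ax\"	\n"
"3:	mov	%0, 0		\n"
"	b	2b		\n"	/* short relative branch: its relocation
					 * overflows once .fixup lands far from
					 * .text; 'j 2b' always reaches */
"	.previous		\n"
: "=r" (val)
: "r" (addr));
---------->8-----------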
Signed-off-by: Yuriy Kolerov <yuriy.kolerov@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Callers of cmpxchg_futex_value_locked() in the futex code expect a
bimodal return value:
!0 (essentially -EFAULT on failure)
0 (success)
Before this patch, the success return value was the old value of the
futex, which could very well be non-zero, causing the caller to
erroneously take the failure path.
Fix that by returning 0 for success.
(This fix was done back in 2011 for all upstream arches, which ARC
obviously missed)
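What the generic futex code expects, as a simplified sketch (variable
names are illustrative):
---------->8-----------
u32 curval;
int ret;

ret = cmpxchg_futex_value_locked(&curval, uaddr, uval, newval);
if (ret)
	goto out_efault;	/* !0: fault on the user access */

/* success: ret == 0, while curval holds the previous futex value,
 * which may itself be non-zero */
---------->8-----------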
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The atomic ops on futexes need to provide the full barrier just like
regular atomics in the kernel.
Also remove pagefault_enable()/disable() in
futex_atomic_cmpxchg_inatomic(), as core code already does that.
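A minimal sketch of the barrier placement, mirroring regular kernel
atomics; __futex_rmw_op() is a hypothetical stand-in for the
LLOCK/SCOND read-modify-write loop:
---------->8-----------
static inline int futex_atomic_op(int op, int oparg, u32 __user *uaddr)
{
	int oldval;

	smp_mb();	/* full barrier before the RMW */
	oldval = __futex_rmw_op(op, oparg, uaddr);
	smp_mb();	/* full barrier after the RMW */

	return oldval;
}
---------->8-----------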
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
On ARCv2 CPUs, the following configurations can affect cache handling
for data exchanged with peripherals via DMA:
[1] Only an L1 cache exists
[2] Both L1 and L2 exist, but no IO coherency unit
[3] L1, L2 caches and an IO coherency unit exist
The current implementation takes care of [1] and [2].
Moreover, support for [2] is implemented with a run-time check
for SLC existence, which is not optimal.
This patch introduces support for [3] and reworks the usage of the
DMA ops. Instead of doing a run-time check every time a particular
DMA op is executed, we'll have 3 different implementations of the
DMA ops and select the appropriate one during init.
As for IOC support, we need to:
[a] Implement empty DMA ops, because the IOC takes care of cache
coherency with DMAed data
[b] Route dma_alloc_coherent() via dma_alloc_noncoherent().
This is required to make the IOC work in the first place, and it also
serves as an optimization, as LD/ST to coherent buffers can be
serviced from the caches without going all the way to memory.
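A sketch of the init-time selection, with illustrative names for the
three DMA-ops variants (the actual structures may be named
differently):
---------->8-----------
/* pick the cache-maintenance strategy once, at init time */
if (ioc_exists)
	dma_ops = &arc_dma_ops_ioc;	/* [3] empty ops: IOC snoops DMA */
else if (slc_exists)
	dma_ops = &arc_dma_ops_slc;	/* [2] maintain both L1 and SLC */
else
	dma_ops = &arc_dma_ops_l1;	/* [1] L1-only maintenance */
---------->8-----------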
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
[vgupta:
-Added some comments about IOC gains
-Marked dma ops as static,
-Massaged changelog a bit]
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
KGDB fails to build after f51e2f1911 ("ARC: make sure
instruction_pointer() returns unsigned value").
The hack to force one specific reg to unsigned backfired. There's no
reason to keep the regs signed after all.
| CC arch/arc/kernel/kgdb.o
|../arch/arc/kernel/kgdb.c: In function 'kgdb_trap':
| ../arch/arc/kernel/kgdb.c:180:29: error: lvalue required as left operand of assignment
| instruction_pointer(regs) -= BREAK_INSTR_SIZE;
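The failure mode in a nutshell, as an illustrative sketch (the actual
ARC macro operates on a pt_regs field):
---------->8-----------
/* before (broke KGDB): the cast yields an rvalue, so
 *	instruction_pointer(regs) -= BREAK_INSTR_SIZE;
 * cannot compile:
 *
 *	#define instruction_pointer(regs)  ((unsigned long)(regs)->ret)
 *
 * after: keep the pt_regs fields unsigned instead, so the plain
 * lvalue form works again */
#define instruction_pointer(regs)	((regs)->ret)
---------->8-----------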
Reported-by: Yuriy Kolerov <yuriy.kolerov@synopsys.com>
Fixes: f51e2f1911 ("ARC: make sure instruction_pointer() returns unsigned value")
Cc: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
The previous commit for delayed retry of SCOND needs some fine tuning
for spin locks.
The backoff from delayed retry, in conjunction with the spin looping
on the lock itself, can potentially cause the delay counter to reach
high values.
So to provide fairness to any lock operation, after a lock "seems"
available (i.e. just before the first SCOND try), reset the delay
counter back to the starting value of 1.
Essentially, reset the delay to 1 for each new
spin-wait-loop-acquire cycle.
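A sketch of the resulting acquire loop, with hypothetical
llock()/scond()/delay_loops() helpers standing in for the inline asm:
---------->8-----------
int delay = 1;

for (;;) {
	unsigned int val = llock(&lock->slock);	/* line stays shared */

	if (val == LOCKED_VAL) {
		delay = 1;	/* lock held: new spin-wait cycle, so
				 * reset the backoff for fairness */
		continue;
	}
	if (scond(&lock->slock, LOCKED_VAL))
		break;		/* exclusive store succeeded */

	delay_loops(delay);	/* SCOND failed: back off... */
	delay <<= 1;		/* ...exponentially */
}
---------->8-----------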
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
This works around the LLOCK/SCOND livelock.
HS38x4 could get into an LLOCK/SCOND livelock in the case of multiple
overlapping coherency transactions in the SCU. The exclusive line
state keeps rotating among the contending cores, leading to a
never-ending cycle. So break the cycle by deferring the retry of a
failed exclusive access (SCOND). The actual delay needed is a function
of the number of contending cores as well as the unrelated coherency
traffic from other cores. To keep the code simple, start off with a
small delay of 1, which should suffice in most cases, and in case of
contention double the delay. Eventually the delay is sufficient for
the coherency pipeline to drain, so a subsequent exclusive access will
succeed.
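The deferred-retry idea as a minimal sketch (a hypothetical
try_exclusive_store() stands in for the LLOCK/SCOND attempt; the real
patch implements this in inline asm):
---------->8-----------
int delay = 1;

while (!try_exclusive_store()) {
	delay_loops(delay);	/* let the coherency pipeline drain */
	delay *= 2;		/* still contended: double the backoff */
}
---------->8-----------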
Link: http://lkml.kernel.org/r/1438612568-28265-1-git-send-email-vgupta@synopsys.com
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
With LLOCK/SCOND, the rwlock counter can be atomically updated
without the need for a guarding spin lock.
This in turn elides the EXchange-instruction-based spinning, which
forces the cacheline into the exclusive state; concurrent spinning
across cores causes the line to keep bouncing around.
The LLOCK/SCOND based implementation is superior, as spinning on LLOCK
keeps the cacheline in the shared state.
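A C-level sketch of the read-lock fast path, with the same
hypothetical llock()/scond() helpers (the counter convention, positive
meaning reader slots available, is an assumption):
---------->8-----------
for (;;) {
	int val = llock(&rw->counter);	/* load-locked: line stays shared */

	if (val <= 0)
		continue;	/* writer active: keep spinning on LLOCK */

	if (scond(&rw->counter, val - 1))
		break;		/* took a reader slot atomically */
}
---------->8-----------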
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>