This patch attempts to do a few things. The highlights are: 1) enable
SPARSE_IRQ unconditionally, 2) kills off !SPARSE_IRQ code 3) allocates
ivector_table at boot time and 4) default to cookie only VIRQ mechanism
for supported firmware. The first firmware with cookie only support for
me appears on T5. You can optionally force the HV firmware to not cookie
only mode which is the sysino support.
The sysino is a deprecated HV mechanism according to the most recent
SPARC Virtual Machine Specification. HV_GRP_INTR is what controls the
cookie/sysino firmware versioning.
The history of this interface is:
1) Major version 1.0 only supported sysino based interrupt interfaces.
2) Major version 2.0 added cookie based VIRQs, however due to the fact
that OSs were using the VIRQs without negoatiating major version
2.0 (Linux and Solaris are both guilty), the VIRQs calls were
allowed even with major version 1.0
To complicate things even further, the VIRQ interfaces were only
actually hooked up in the hypervisor for LDC interrupt sources.
VIRQ calls on other device types would result in HV_EINVAL errors.
So effectively, major version 2.0 is unusable.
3) Major version 3.0 was created to signal use of VIRQs and the fact
that the hypervisor has these calls hooked up for all interrupt
sources, not just those for LDC devices.
A new boot option is provided should cookie only HV support have issues.
hvirq - this is the version for HV_GRP_INTR. This is related to HV API
versioning. The code attempts major=3 first by default. The option can
be used to override this default.
I've tested with SPARSE_IRQ on T5-8, M7-4 and T4-X and Jalap?no.
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to accomodate embedded per-cpu allocation with large numbers
of cpus and numa nodes, we have to use as much virtual address space
as possible for the vmalloc region. Otherwise we can get things like:
PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
So, once we select a value for PAGE_OFFSET, derive the size of the
vmalloc region based upon that.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
Make sure, at compile time, that the kernel can properly support
whatever MAX_PHYS_ADDRESS_BITS is defined to.
On M7 chips, use a max_phys_bits value of 49.
Based upon a patch by Bob Picco.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
For sparse memory configurations, the vmemmap array behaves terribly
and it takes up an inordinate amount of space in the BSS section of
the kernel image unconditionally.
Just build huge PMDs and look them up just like we do for TLB misses
in the vmalloc area.
Kernel BSS shrinks by about 2MB.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
If max_phys_bits needs to be > 43 (f.e. for T4 chips), things like
DEBUG_PAGEALLOC stop working because the 3-level page tables only
can cover up to 43 bits.
Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to
47, several statically allocated tables became enormous.
Compounding this is that we will need to support up to 49 bits of
physical addressing for M7 chips.
The two tables in question are sparc64_valid_addr_bitmap and
kpte_linear_bitmap.
The first holds a bitmap, with 1 bit for each 4MB chunk of physical
memory, indicating whether that chunk actually exists in the machine
and is valid.
The second table is a set of 2-bit values which tell how large of a
mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB
chunk of ram in the system.
These tables are huge and take up an enormous amount of the BSS
section of the sparc64 kernel image. Specifically, the
sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K.
So let's solve the space wastage and the DEBUG_PAGEALLOC problem
at the same time, by using the kernel page tables (as designed) to
manage this information.
We have to keep using large mappings when DEBUG_PAGEALLOC is disabled,
and we do this by encoding huge PMDs and PUDs.
On a T4-2 with 256GB of ram the kernel page table takes up 16K with
DEBUG_PAGEALLOC disabled and 256MB with it enabled. Furthermore, this
memory is dynamically allocated at run time rather than coded
statically into the kernel image.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
As currently coded the KTSB accesses in the kernel only support up to
47 bits of physical addressing.
Adjust the instruction and patching sequence in order to support
arbitrary 64 bits addresses.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
Now that we use 4-level page tables, we can provide up to 53-bits of
virtual address space to the user.
Adjust the VA hole based upon the capabilities of the cpu type probed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
This has become necessary with chips that support more than 43-bits
of physical addressing.
Based almost entirely upon a patch by Bob Picco.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
The vio_set_intr() API should be used by VIO consumers to enable/disable
Rx interrupts to facilitate deferred processing in softirq/bottom-half
context.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vio_dring_avail() will allow use of every dring entry, but when the last
entry is allocated then dr->prod == dr->cons which is indistinguishable from
the ring empty condition. This causes the next allocation to reuse an entry.
When this happens in sunvdc, the server side vds driver begins nack'ing the
messages and ends up resetting the ldc channel. This problem does not effect
sunvnet since it checks for < 2.
The fix here is to just never allocate the very last dring slot so that full
and empty are not the same condition. The request start path was changed to
check for the ring being full a bit earlier, and to stop the blk_queue if
there is no space left. The blk_queue will be restarted once the ring is
only half full again. The number of ring entries was increased to 512 which
matches the sunvnet and Solaris vdc drivers, and greatly reduces the
frequency of hitting the ring full condition and the associated blk_queue
stop/starting. The checks in sunvent were adjusted to account for
vio_dring_avail() returning 1 less.
Orabug: 19441666
OraBZ: 14983
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ldc_map_sg() could fail its check that the number of pages referred to
by the sg scatterlist was <= the number of cookies.
This fixes the issue by doing a similar thing to the xen-blkfront driver,
ensuring that the scatterlist will only ever contain a segment count <=
port->ring_cookies, and each segment will be page aligned, and <= page
size. This ensures that the scatterlist is always mappable.
Orabug: 19347817
OraBZ: 15945
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The LDom diskserver doesn't return reliable geometry data. In addition,
the types for all fields in the vio_disk_geom are u16, which were being
truncated in the cast into the u8's of the Linux struct hd_geometry.
Modify vdc_getgeo() to compute the geometry from the disk's capacity in a
manner consistent with xen-blkfront::blkif_getgeo().
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Interpret the media type from v1.1 protocol to support CDROM/DVD.
For v1.0 protocol, a disk's size continues to be calculated from the
geometry returned by the vdisk server. The geometry returned by the server
can be less than the actual number of sectors available in the backing
image/device due to the rounding in the division used to compute the
geometry in the vdisk server.
In v1.1 protocol a disk's actual size in sectors is returned during the
handshake. Use this size when v1.1 protocol is negotiated. Since this size
will always be larger than the former geometry computed size, disks created
under v1.0 will be forwards compatible to v1.1, but not vice versa.
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We changed PAGE_OFFSET to be a variable rather than a constant,
but this reference here in the hibernate assembler got missed.
Signed-off-by: David S. Miller <davem@davemloft.net>
The T5 (niagara5) has different PCR related HV fast trap values and a new
HV API Group. This patch utilizes these and shares when possible with niagara4.
We use the same sparc_pmu niagara4_pmu. Should there be new effort to
obtain the MCU perf statistics then this would have to be changed.
Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "mem" boot option can result in many unexpected consequences. This patch
attempts to prevent boot hangs which have been experienced on T4-4 and T5-8.
Basically the boot loader allocates vmlinuz and initrd higher in available
OBP physical memory. For example, on a 2Tb T5-8 it isn't possible to boot
with mem=20G.
The patch utilizes memblock to avoid reserved regions and trim memory which
is only free. Other improvements are possible for a multi-node machine.
This is a snippet of the boot log with mem=20G on T5-8 with the patch applied:
MEMBLOCK configuration: <- before memory reduction
memory size = 0x1ffad6ce000 reserved size = 0xa1adf44
memory.cnt = 0xb
memory[0x0] [0x00000030400000-0x00003fdde47fff], 0x3fada48000 bytes
memory[0x1] [0x00003fdde4e000-0x00003fdde4ffff], 0x2000 bytes
memory[0x2] [0x00080000000000-0x00083fffffffff], 0x4000000000 bytes
memory[0x3] [0x00100000000000-0x00103fffffffff], 0x4000000000 bytes
memory[0x4] [0x00180000000000-0x00183fffffffff], 0x4000000000 bytes
memory[0x5] [0x00200000000000-0x00203fffffffff], 0x4000000000 bytes
memory[0x6] [0x00280000000000-0x00283fffffffff], 0x4000000000 bytes
memory[0x7] [0x00300000000000-0x00303fffffffff], 0x4000000000 bytes
memory[0x8] [0x00380000000000-0x00383fffc71fff], 0x3fffc72000 bytes
memory[0x9] [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes
memory[0xa] [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes
reserved.cnt = 0x2
reserved[0x0] [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
reserved[0x1] [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes
...
MEMBLOCK configuration: <- after reduction of memory
memory size = 0x50a1adf44 reserved size = 0xa1adf44
memory.cnt = 0x4
memory[0x0] [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
memory[0x1] [0x00380004000000-0x0038050d01d74a], 0x50901d74b bytes
memory[0x2] [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes
memory[0x3] [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes
reserved.cnt = 0x2
reserved[0x0] [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
reserved[0x1] [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes
...
Early memory node ranges
node 7: [mem 0x380000000000-0x38000117dfff]
node 7: [mem 0x380004000000-0x380f0d01bfff]
node 7: [mem 0x383fffc92000-0x383fffca1fff]
node 7: [mem 0x383fffcb4000-0x383fffcb5fff]
Could not find start_pfn for node 0
Could not find start_pfn for node 1
Could not find start_pfn for node 2
Could not find start_pfn for node 3
Could not find start_pfn for node 4
Could not find start_pfn for node 5
Could not find start_pfn for node 6
.
The patch was tested on T4-1, T5-8 and Jalap?no.
Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have seen an issue with guest boot into LDOM that causes early boot failures
because of no matching rules for node identitity of the memory. I analyzed this
on my T4 and concluded there might not be a solution. I saw the issue in
mainline too when booting into the control/primary domain - with guests
configured. Note, this could be a firmware bug on some older machines.
I'll provide a full explanation of the issues below. Should we not find a
matching BEST latency group for a real address (RA) then we will assume node 0.
On the T4-2 here with the information provided I can't see an alternative.
Technically the LDOM shown below should match the MBLOCK to the
favorable latency group. However other factors must be considered too. Were
the memory controllers configured "fine" grained interleave or "coarse"
grain interleaved - T4. Also should a "group" MD node be considered a NUMA
node?
There has to be at least one Machine Description (MD) "group" and hence one
NUMA node. The group can have one or more latency groups (lg) - more than one
memory controller. The current code chooses the smallest latency as the most
favorable per group. The latency and lg information is in MLGROUP below.
MBLOCK is the base and size of the RAs for the machine as fetched from OBP
/memory "available" property. My machine has one MBLOCK but more would be
possible - with holes?
For a T4-2 the following information has been gathered:
with LDOM guest
MEMBLOCK configuration:
memory size = 0x27f870000
memory.cnt = 0x3
memory[0x0] [0x00000020400000-0x0000029fc67fff], 0x27f868000 bytes
memory[0x1] [0x0000029fd8a000-0x0000029fd8bfff], 0x2000 bytes
memory[0x2] [0x0000029fd92000-0x0000029fd97fff], 0x6000 bytes
reserved.cnt = 0x2
reserved[0x0] [0x00000020800000-0x000000216c15c0], 0xec15c1 bytes
reserved[0x1] [0x00000024800000-0x0000002c180c1e], 0x7980c1f bytes
MBLOCK[0]: base[20000000] size[280000000] offset[0]
(note: "base" and "size" reported in "MBLOCK" encompass the "memory[X]" values)
(note: (RA + offset) & mask = val is the formula to detect a match for the
memory controller. should there be no match for find_node node, a return
value of -1 resulted for the node - BAD)
There is one group. It has these forward links
MLGROUP[1]: node[545] latency[1f7e8] match[200000000] mask[200000000]
MLGROUP[2]: node[54d] latency[2de60] match[0] mask[200000000]
NUMA NODE[0]: node[545] mask[200000000] val[200000000] (latency[1f7e8])
(note: "val" is the best lg's (smallest latency) "match")
no LDOM guest - bare metal
MEMBLOCK configuration:
memory size = 0xfdf2d0000
memory.cnt = 0x3
memory[0x0] [0x00000020400000-0x00000fff6adfff], 0xfdf2ae000 bytes
memory[0x1] [0x00000fff6d2000-0x00000fff6e7fff], 0x16000 bytes
memory[0x2] [0x00000fff766000-0x00000fff771fff], 0xc000 bytes
reserved.cnt = 0x2
reserved[0x0] [0x00000020800000-0x00000021a04580], 0x1204581 bytes
reserved[0x1] [0x00000024800000-0x0000002c7d29fc], 0x7fd29fd bytes
MBLOCK[0]: base[20000000] size[fe0000000] offset[0]
there are two groups
group node[16d5]
MLGROUP[0]: node[1765] latency[1f7e8] match[0] mask[200000000]
MLGROUP[3]: node[177d] latency[2de60] match[200000000] mask[200000000]
NUMA NODE[0]: node[1765] mask[200000000] val[0] (latency[1f7e8])
group node[171d]
MLGROUP[2]: node[1775] latency[2de60] match[0] mask[200000000]
MLGROUP[1]: node[176d] latency[1f7e8] match[200000000] mask[200000000]
NUMA NODE[1]: node[176d] mask[200000000] val[200000000] (latency[1f7e8])
(note: for this two "group" bare metal machine, 1/2 memory is in group one's
lg and 1/2 memory is in group two's lg).
Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We've witnessed a few TLB events causing the machine to power off because
of prom_halt. In one case it was some nfs related area during rmmod. Another
was an mmapper of /dev/mem. A more recent one is an ITLB issue with
a bad pagesize which could be a hardware bug. Bugs happen but we should
attempt to not power off the machine and/or hang it when possible.
This is a DTLB error from an mmapper of /dev/mem:
[root@sparcie ~]# SUN4V-DTLB: Error at TPC[fffff80100903e6c], tl 1
SUN4V-DTLB: TPC<0xfffff80100903e6c>
SUN4V-DTLB: O7[fffff801081979d0]
SUN4V-DTLB: O7<0xfffff801081979d0>
SUN4V-DTLB: vaddr[fffff80100000000] ctx[1250] pte[98000000000f0610] error[2]
.
This is recent mainline for ITLB:
[ 3708.179864] SUN4V-ITLB: TPC<0xfffffc010071cefc>
[ 3708.188866] SUN4V-ITLB: O7[fffffc010071cee8]
[ 3708.197377] SUN4V-ITLB: O7<0xfffffc010071cee8>
[ 3708.206539] SUN4V-ITLB: vaddr[e0003] ctx[1a3c] pte[2900000dcc800eeb] error[4]
.
Normally sun4v_itlb_error_report() and sun4v_dtlb_error_report() would call
prom_halt() and drop us to OF command prompt "ok". This isn't the case for
LDOMs and the machine powers off.
For the HV reported error of HV_ENORADDR for HV HV_MMU_MAP_ADDR_TRAP we cause
a SIGBUS error by qualifying it within do_sparc64_fault() for fault code mask
of FAULT_CODE_BAD_RA. This is done when trap level (%tl) is less or equal
one("1"). Otherwise, for %tl > 1, we proceed eventually to die_if_kernel().
The logic of this patch was partially inspired by David Miller's feedback.
Power off of large sparc64 machines is painful. Plus die_if_kernel provides
more context. A reset sequence isn't a brief period on large sparc64 but
better than power-off/power-on sequence.
Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dma_zalloc_coherent() calls dma_alloc_coherent(__GFP_ZERO)
but the sparc32 implementations sbus_alloc_coherent() and
pci32_alloc_coherent() doesn't take the gfp flags into
account.
Tested on the SPARC32/LEON GRETH Ethernet driver which fails
due to dma_alloc_coherent(__GFP_ZERO) returns non zeroed
pages.
Signed-off-by: Daniel Hellstrom <daniel@gaisler.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The leon_dma_ops struct is needed for leon regardless of PCI configuration.
Signed-off-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes sure that leon_cycles_offset takes the pending bit into
account and that leon_clear_clock_irq clears the pending bit. Otherwise,
if leon_cycles_offset is executed after the timer has wrapped but before
timer_interrupt has increased timer_cs_internal_counter, time can be
perceived to go backwards.
Signed-off-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes memset follow the standard (instead of returning 0 on success). This
is needed when certain versions of gcc optimizes around memset calls and assume
that the address argument is preserved in %o0.
Signed-off-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: David S. Miller <davem@davemloft.net>