Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes. Preferred addresses
will still take precedence over optimistic addresses, subject
to other ranking in the source address selection algorithm.
This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).
The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s), but not in the multiple distinct
networks case.
For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.
Also, because optimistic nodes can receive frames, issue
an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMSTIC
flag appropriately set). A second RTM_NEWADDR is sent if DAD
completes (the address flags have changed), otherwise an
RTM_DELADDR is sent.
Also: add an entry in ip-sysctl.txt for optimistic_dad.
[cherry-pick of net-next 7fd2561e4e]
Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug: 17769720
Change-Id: Ic7e50781c607e1f3a492d9ce7395946efb95c533
When using mark-based routing, sockets returned from accept()
may need to be marked differently depending on the incoming
connection request.
This is the case, for example, if different socket marks identify
different networks: a listening socket may want to accept
connections from all networks, but each connection should be
marked with the network that the request came in on, so that
subsequent packets are sent on the correct network.
This patch adds a sysctl to mark TCP sockets based on the fwmark
of the incoming SYN packet. If enabled, and an unmarked socket
receives a SYN, then the SYN packet's fwmark is written to the
connection's inet_request_sock, and later written back to the
accepted socket when the connection is established. If the
socket already has a nonzero mark, then the behaviour is the same
as it is today, i.e., the listening socket's fwmark is used.
Black-box tested using user-mode linux:
- IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
mark of the incoming SYN packet.
- The socket returned by accept() is marked with the mark of the
incoming SYN packet.
- Tested with syncookies=1 and syncookies=2.
Change-Id: I26bc1eceefd2c588d73b921865ab70e4645ade57
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Currently, routing lookups used for Path PMTU Discovery in
absence of a socket or on unmarked sockets use a mark of 0.
This causes PMTUD not to work when using routing based on
netfilter fwmark mangling and fwmark ip rules, such as:
iptables -j MARK --set-mark 17
ip rule add fwmark 17 lookup 100
This patch causes these route lookups to use the fwmark from the
received ICMP error when the fwmark_reflect sysctl is enabled.
This allows the administrator to make PMTUD work by configuring
appropriate fwmark rules to mark the inbound ICMP packets.
Black-box tested using user-mode linux by pointing different
fwmarks at routing tables egressing on different interfaces, and
using iptables mangling to mark packets inbound on each interface
with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery
work as expected when mark reflection is enabled and fail when
it is disabled.
Change-Id: Id7fefb7ec1ff7f5142fba43db1960b050e0dfaec
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.
This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.
Tested using user-mode linux:
- ICMP/ICMPv6 echo replies and errors.
- TCP RST packets (IPv4 and IPv6).
Change-Id: I6873d973196797bcf32e2e91976df647c7e8b85a
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Commit d50240a5f6 ("arm64: mm: permit use of tagged pointers at EL0")
added support for tagged pointers in userspace, but the corresponding
update to Documentation/ contained some imprecise statements.
This patch fixes up some minor ambiguities in the text, hopefully making
it more clear about exactly what the kernel expects from user virtual
addresses.
Change-Id: I7df342e01d5253ccacb3847449940892768d7e07
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
TCR.TBI0 can be used to cause hardware address translation to ignore the
top byte of userspace virtual addresses. Whilst not especially useful in
standard C programs, this can be used by JITs to `tag' pointers with
various pieces of metadata.
This patch enables this bit for AArch64 Linux, and adds a new file to
Documentation/arm64/ which describes some potential caveats when using
tagged virtual addresses.
Change-Id: I4c025d026144c69a2259b6562e46176f95b4e110
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092
Signed-off-by: Colin Cross <ccross@android.com>
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.
This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period. In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.
It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.
[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.
Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
Update Documentation/android.txt to reference PSTORE_CONSOLE
and PSTORE_RAM instead of ANDROID_RAM_CONSOLE
Change-Id: I2c56e73f8c65c3ddbe6ddbf1faadfacb42a09575
Reported-by: Jon Medhurst (Tixy) <tixy@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Accept a string of delays and speeds at which to apply the delay before
raising each step above hispeed. For example, "80000 1300000:200000
1500000:40000" means that the delay at or above 1GHz, until 1.3GHz is 80 msecs,
the delay until 1.5GHz is 200 msecs and the delay at or above 1.5GHz is 40
msecs when hispeed_freq is 1GHz.
[toddpoynor@google.com: add documentation]
Change-Id: Ifeebede8b1acbdd0a53e5c6916bccbf764dc854f
Signed-off-by: Minsung Kim <ms925.kim@samsung.com>
Add the 'funcgraph-flat' option to the function_graph tracer to use the default
trace printing format rather than the hierarchical formatting normally used.
Change-Id: If2900bfb86e6f8f51379f56da4f6fabafa630909
Signed-off-by: Jamie Gennis <jgennis@google.com>
Update default go_hispeed_load from 85% to 99%. Recent changes to the
governor now use a default target_load of 90%. go_hispeed_load should
not be lower than the target load for hispeed_freq, which could lead
to oscillating speed decisions. Other recent changes reduce the need
to dampen speed jumps on load spikes, while input event boosts from
userspace are the preferred method for anticipating load spikes with
UI impacts.
General update to the documentation to reflect recent changes.
Change-Id: I1b92f3091f42c04b10503cd1169a943b5dfd6faf
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Accept a string of target loads and speeds at which to apply the
target loads, per the documentation update in this patch. For example,
"85 1000000:90 1700000:99" targets CPU load 85% below speed 1GHz, 90%
at or above 1GHz, until 1.7GHz and above, at which load 99% is targeted.
Attempt to avoid oscillations by evaluating the current speed
weighted by current load against each new choice of speed, choosing a
higher speed if the current load requires a higher speed.
Change-Id: Ie3300206047c84eca5a26b0b63ea512e5207550e
Signed-off-by: Todd Poynor <toddpoynor@google.com>
This governor is designed for latency-sensitive workloads, such as
interactive user interfaces. The interactive governor aims to be
significantly more responsive to ramp CPU quickly up when CPU-intensive
activity begins.
Existing governors sample CPU load at a particular rate, typically
every X ms. This can lead to under-powering UI threads for the period of
time during which the user begins interacting with a previously-idle system
until the next sample period happens.
The 'interactive' governor uses a different approach. Instead of sampling
the CPU at a specified rate, the governor will check whether to scale the
CPU frequency up soon after coming out of idle. When the CPU comes out of
idle, a timer is configured to fire within 1-2 ticks. If the CPU is very
busy from exiting idle to when the timer fires then we assume the CPU is
underpowered and ramp to MAX speed.
If the CPU was not sufficiently busy to immediately ramp to MAX speed, then
the governor evaluates the CPU load since the last speed adjustment,
choosing the highest value between that longer-term load or the short-term
load since idle exit to determine the CPU speed to ramp to.
A realtime thread is used for scaling up, giving the remaining tasks the
CPU performance benefit, unlike existing governors which are more likely to
schedule rampup work to occur after your performance starved tasks have
completed.
The tuneables for this governor are:
/sys/devices/system/cpu/cpufreq/interactive/min_sample_time:
The minimum amount of time to spend at the current frequency before
ramping down. This is to ensure that the governor has seen enough
historic CPU load data to determine the appropriate workload.
Default is 80000 uS.
/sys/devices/system/cpu/cpufreq/interactive/go_maxspeed_load
The CPU load at which to ramp to max speed. Default is 85.
Change-Id: Ib2b362607c62f7c56d35f44a9ef3280f98c17585
Signed-off-by: Mike Chan <mike@android.com>
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Bug: 3152864
Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup via CFS, move permission checks into each
specific cgroup subsystem. If a subsystem does not specify a
'allow_attach' handler, then we fall back to doing our checks
the old way.
Use the 'allow_attach' handler for the 'cpu' cgroup to allow
non-root processes to add arbitrary processes to a 'cpu' cgroup
if it has the CAP_SYS_NICE capability set.
This version of the patch adds a 'allow_attach' handler instead
of reusing the 'can_attach' handler. If the 'can_attach' handler
is reused, a new cgroup that implements 'can_attach' but not
the permission checks could end up with no permission checks
at all.
Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c
Original-Author: San Mehat <san@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
Pull networking fixes from David Miller:
1) Found via trinity:
If you connect up an ipv6 socket to an ipv4 mapped address then an
ipv6 one, sendmsg() can croak because ip6_sk_dst_check() assumes the
route cached in the socket is an ipv6 one. In this case there is an
ipv4 route attached, so it gets stomped on.
Reported by Dave Jones and Hannes Frederic Sowa, fixed by Eric
Dumazet.
2) AF_KEY notifications leak some kernel memory to userspace, fix from
Mathias Krause.
3) DLCI calls __dev_get_by_name() without proper locking, and dlci_del
doesn't validate that the device being deleted is actually a DLCI
one. Fixes from Li Zefan.
4) Length check on bluetooth l2cap information responses is wrong, each
response type has a different lenth, so we should make sure it's in
a given range rather than enforce one single valid length. From
Jaganath Kanakkassery.
5) Receive FIFO overflow is really easy to trigger in stress scenerios
in the sh_eth driver, but the event isn't being handled properly at
all. Specifically, the mask of error interrupts doesn't include the
event so we never clear it, resulting in the driver becomming wedged
processing an interrupt that never gets cleared.
Fix from Sergei Shtylyov.
6) qlcnic sleeps while holding a spinlock, use mdelay() instead of
msleep(). From Shahed Shaikh.
7) Missing curly braces causes SIP netfilter NAT module to always drop
packets. Fix from Balazs Peter Odor.
8) ipt_ULOG in netfilter passes the wrong value to timer setup, causing
the timer to dereference crap when it fires. Fix from Gao Feng.
9) Missing RCU protection around txq->axq_acq traversal in
ath_txq_schedule(). Fix from Felix Fietkau.
10) Idle state transition test in ath9k_htc_config() is reversed, fix
from Sujith Manoharan.
11) IPV6 forwarding handles unicast Router Alert packets incorrectly.
It tests the wrong option state. Previously opt->ra being non-zero
indicated a router alert marking in the SKB, but now it's indicated
by a bit in opt->flags. Fix from YOSHIFUJI Hideaki.
12) SKB leak in GRE tunnel GSO handling, from Eric Dumazet.
13) get_user_pages_fast() error handling in TUN and MACVTAP use the same
local variable for the base index and the loop iterator for page
traversal, oops! Fix from Michael S Tsirkin.
14) ipv6_get_lladdr() can fail, and we must therefore check it's return
value in inet6_set_iftoken(). For from Hannes Frederic Sowa.
15) If you change an interface name and meanwhile can sneak in something
that looks up the name (like SO_BINDTODEVICE or SIOCGIFNAME) we can
deadlock with CONFIG_PREEMPT=n. Fix this by providing a helper
function that properly uses raw_seqcount_begin(). From Nicolas
Schichan.
16) Chain noise calibration test is inverted in iwlwifi, fix from
Nikolay Martynov.
17) Properly set TX iwlwifi descriptor flags for back requests. Fix
from Emmanuel Grumbach.
18) We can't assume skb_transport_header() is set in xt_TCPOPTSTRAP
module, fix from Pablo Neira Ayuso.
19) Some crummy APs don't provide the proper High Throughput info in
association response frames. Add a workaround by assume we'll use
whatever is in the beacon/probe. Fix from Johannes Berg.
20) mac80211 call to rate_idx_match_mask() swaps two arguments (mask and
channel width). Fix from Simon Wunderlich.
21) xt_TCPMSS (like xt_TCPOPTSTRAP) must not try to handle fragmented
frames. Fix from Phil Oester.
22) Fix rate control regression causing iwlwifi/iwlegacy chips to use
1Mbit/s on pre-11n networks. From Moshe Benji and Stanslaw Gruszka.
23) Disable brcmsmac power-save functions, they cause regressions. From
Arend van Spriel.
24) Enforce a sane minimum MTU in l2cap_build_cmd() otherwise we can
easily crash. Fix from Anderson Lizardo.
25) If a learning packet arrives during vxlan_stop() we crash, easily
fixed by checking netif_running(). From Stephen Hemminger.
26) Static vxlan FDB entries should not be migrated, also from Stephen.
27) skb_clone() failures not handled in vxlan_xmit(), oops. Also from
Stephen.
28) Add minimal driver for AR816x/AR817x ethernet chips, from Johannes
Berg.
29) Fix regression in userspace VLAN acceleration control, added by the
802.1ad support changes. Fix from Fernando Luis Vazquez Cao.
30) Interval selection for MLD queries in the bridging code was
reversed. Fix from Linus Lüssing.
31) ipv6's ndisc_send_redirect() erroneously writes to the packet we
received not the packet we are building to send out. Fix from
Matthias Schiffer.
32) Don't free netdev before unregistering it, in usb_8dev can driver.
From Marc Kleine-Budde.
33) Fix nl80211 attribute buffer races, from Johannes Berg.
34) Although netlink_diag.h is under uapi/ it isn't present in Kbuild.
From Stephen Hemminger.
35) Wrong address and family passed to MD5 key lookups in TCP, from
Aydin Arik.
36) phy_type attribute created by SFC driver should not be writable.
From Ben Hutchings.
37) Receive/Transmit queue allocations in pxa168_eth and mv643xx_eth
should use kzalloc(). Otherwise if setup fails half-way, we'll
dereference garbage when trying to teardown the rings. From Lubomir
Rintel.
38) Fix double-allocation of dst (resulting in unfreeable net device) in
ipv6's init_loopback(). From Gao Feng.
39) Fix fragmentation handling SKB leak in netfilter conntrack, we were
freeing the wrong skb pointer. From Phil Oester.
40) Don't report "-1" (SPEED_UNKNOWN) in bond_miimon_commit(), from
Nikolay Aleksandrov.
41) davinci_cpdma doesn't check for DMA mapping errors, letting the
device scribble to random addresses. From Sebastian Siewior.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
dlci: validate the net device in dlci_del()
dlci: acquire rtnl_lock before calling __dev_get_by_name()
af_key: fix info leaks in notify messages
ipv6: ip6_sk_dst_check() must not assume ipv6 dst
net: fix kernel deadlock with interface rename and netdev name retrieval.
net/tg3: Avoid delay during MMIO access
ipv6: check return value of ipv6_get_lladdr
macvtap: fix recovery from gup errors
tun: fix recovery from gup errors
gre: fix a possible skb leak
ipv6: Process unicast packet with Router Alert by checking flag in skb.
ath9k_htc: Handle IDLE state transition properly
ath9k: fix an RCU issue in calling ieee80211_get_tx_rates
netfilter: ipt_ULOG: fix incorrect setting of ulog timer
netfilter: ctnetlink: send event when conntrack label was modified
netfilter: nf_nat_sip: fix mangling
qlcnic: Do not sleep while holding spinlock
drivers: net: cpsw: fix compilation error with cpsw driver
tcp: doc : fix the syncookies default value
sh_eth: fix misreporting of transmit abort
...
syncookies is on for default since in commit e994b7c901
(tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTL).
And fix a typo of CONFIG_SYN_COOKIES.
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull sound fixes from Takashi Iwai:
"Only driver/device-specific small fixes that are pretty safe to apply:
- USB-audio Android and Logitech webcam fixes
- HD-audio MacBook Air 4,2 quirk
- Complete Dell headset quirk entries that were introduced in 3.10"
* tag 'sound-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Add models for Dell headset jacks
ALSA: usb-audio: Fix invalid volume resolution for Logitech HD Webcam c310
ALSA: hda - Fix pin configurations for MacBook Air 4,2
ALSA: usb-audio: work around Android accessory firmware bug
ALSA: hda - Headset mic support for three more machines
Pull media fixes from Mauro Carvalho Chehab:
"Series of fixes for 3.10. There are some usual driver fixes (mostly
on s5p/exynos playform drivers), plus some fixes at V4L2 core"
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (40 commits)
[media] soc_camera: error dev remove and v4l2 call
[media] sh_veu: fix the buffer size calculation
[media] sh_veu: keep power supply until the m2m context is released
[media] sh_veu: invoke v4l2_m2m_job_finish() even if a job has been aborted
[media] v4l2-ioctl: don't print the clips list
[media] v4l2-ctrls: V4L2_CTRL_CLASS_FM_RX controls are also valid radio controls
[media] cx88: fix NULL pointer dereference
[media] DocBook/media/v4l: update version number
[media] exynos4-is: Remove "sysreg" clock handling
[media] exynos4-is: Fix reported colorspace at FIMC-IS-ISP subdev
[media] exynos4-is: Ensure fimc-is clocks are not enabled until properly configured
[media] exynos4-is: Prevent NULL pointer dereference when firmware isn't loaded
[media] s5p-mfc: Add NULL check for allocated buffer
[media] s5p-mfc: added missing end-of-lines in debug messages
[media] s5p-mfc: v4l2 controls setup routine moved to initialization code
[media] s5p-mfc: separate encoder parameters for h264 and mpeg4
[media] s5p-mfc: Remove special clock usage in driver
[media] s5p-mfc: Remove unused s5p_mfc_get_decoded_status_v6() function
[media] v4l2: mem2mem: save irq flags correctly
[media] coda: v4l2-compliance fix: add VIDIOC_CREATE_BUFS support
...