linux-apfs

mirror of https://github.com/linux-apfs/linux-apfs.git synced 2026-05-01 15:00:59 -07:00

Author	SHA1	Message	Date
Linus Torvalds	191a712090	Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Fixes and a lot of cleanups. Locking cleanup is finally complete. cgroup_mutex is no longer exposed to individual controlelrs which used to cause nasty deadlock issues. Li fixed and cleaned up quite a bit including long standing ones like racy cgroup_path(). - device cgroup now supports proper hierarchy thanks to Aristeu. - perf_event cgroup now supports proper hierarchy. - A new mount option "__DEVEL__sane_behavior" is added. As indicated by the name, this option is to be used for development only at this point and generates a warning message when used. Unfortunately, cgroup interface currently has too many brekages and inconsistencies to implement a consistent and unified hierarchy on top. The new flag is used to collect the behavior changes which are necessary to implement consistent unified hierarchy. It's likely that this flag won't be used verbatim when it becomes ready but will be enabled implicitly along with unified hierarchy. The option currently disables some of broken behaviors in cgroup core and also .use_hierarchy switch in memcg (will be routed through -mm), which can be used to make very unusual hierarchy where nesting is partially honored. It will also be used to implement hierarchy support for blk-throttle which would be impossible otherwise without introducing a full separate set of control knobs. This is essentially versioning of interface which isn't very nice but at this point I can't see any other options which would allow keeping the interface the same while moving towards hierarchy behavior which is at least somewhat sane. The planned unified hierarchy is likely to require some level of adaptation from userland anyway, so I think it'd be best to take the chance and update the interface such that it's supportable in the long term. Maintaining the existing interface does complicate cgroup core but shouldn't put too much strain on individual controllers and I think it'd be manageable for the foreseeable future. Maybe we'll be able to drop it in a decade. Fix up conflicts (including a semantic one adding a new #include to ppc that was uncovered by header the file changes) as per Tejun. * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits) cpuset: fix compile warning when CONFIG_SMP=n cpuset: fix cpu hotplug vs rebuild_sched_domains() race cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() cgroup: restore the call to eventfd->poll() cgroup: fix use-after-free when umounting cgroupfs cgroup: fix broken file xattrs devcg: remove parent_cgroup. memcg: force use_hierarchy if sane_behavior cgroup: remove cgrp->top_cgroup cgroup: introduce sane_behavior mount option move cgroupfs_root to include/linux/cgroup.h cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix cgroup: make cgroup_path() not print double slashes Revert "cgroup: remove bind() method from cgroup_subsys." perf: make perf_event cgroup hierarchical cgroup: implement cgroup_is_descendant() cgroup: make sure parent won't be destroyed before its children cgroup: remove bind() method from cgroup_subsys. devcg: remove broken_hierarchy tag cgroup: remove cgroup_lock_is_held() ...	2013-04-29 19:14:20 -07:00
Linus Torvalds	46d9be3e5e	Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue updates from Tejun Heo: "A lot of activities on workqueue side this time. The changes achieve the followings. - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are updated to be able to interface with multiple backend worker pools. This involved a lot of churning but the end result seems actually neater as unbound workqueues are now a lot closer to per-cpu ones. - The ability to interface with multiple backend worker pools are used to implement unbound workqueues with custom attributes. Currently the supported attributes are the nice level and CPU affinity. It may be expanded to include cgroup association in future. The attributes can be specified either by calling apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if the workqueue in question is exported through sysfs. The backend worker pools are keyed by the actual attributes and shared by any workqueues which share the same attributes. When attributes of a workqueue are changed, the workqueue binds to the worker pool with the specified attributes while leaving the work items which are already executing in its previous worker pools alone. This allows converting custom worker pool implementations which want worker attribute tuning to use workqueues. The writeback pool is already converted in block tree and there are a couple others are likely to follow including btrfs io workers. - WQ_UNBOUND's ability to bind to multiple worker pools is also used to make it NUMA-aware. Because there's no association between work item issuer and the specific worker assigned to execute it, before this change, using unbound workqueue led to unnecessary cross-node bouncing and it couldn't be helped by autonuma as it requires tasks to have implicit node affinity and workers are assigned randomly. After these changes, an unbound workqueue now binds to multiple NUMA-affine worker pools so that queued work items are executed in the same node. This is turned on by default but can be disabled system-wide or for individual workqueues. Crypto was requesting NUMA affinity as encrypting data across different nodes can contribute noticeable overhead and doing it per-cpu was too limiting for certain cases and IO throughput could be bottlenecked by one CPU being fully occupied while others have idle cycles. While the new features required a lot of changes including restructuring locking, it didn't complicate the execution paths much. The unbound workqueue handling is now closer to per-cpu ones and the new features are implemented by simply associating a workqueue with different sets of backend worker pools without changing queue, execution or flush paths. As such, even though the amount of change is very high, I feel relatively safe in that it isn't likely to cause subtle issues with basic correctness of work item execution and handling. If something is wrong, it's likely to show up as being associated with worker pools with the wrong attributes or OOPS while workqueue attributes are being changed or during CPU hotplug. While this creates more backend worker pools, it doesn't add too many more workers unless, of course, there are many workqueues with unique combinations of attributes. Assuming everything else is the same, NUMA awareness costs an extra worker pool per NUMA node with online CPUs. There are also a couple things which are being routed outside the workqueue tree. - block tree pulled in workqueue for-3.10 so that writeback worker pool can be converted to unbound workqueue with sysfs control exposed. This simplifies the code, makes writeback workers NUMA-aware and allows tuning nice level and CPU affinity via sysfs. - The conversion to workqueue means that there's no 1:1 association between a specific worker, which makes writeback folks unhappy as they want to be able to tell which filesystem caused a problem from backtrace on systems with many filesystems mounted. This is resolved by allowing work items to set debug info string which is printed when the task is dumped. As this change involves unifying implementations of dump_stack() and friends in arch codes, it's being routed through Andrew's -mm tree." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits) workqueue: use kmem_cache_free() instead of kfree() workqueue: avoid false negative WARN_ON() in destroy_workqueue() workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity workqueue: implement NUMA affinity for unbound workqueues workqueue: introduce put_pwq_unlocked() workqueue: introduce numa_pwq_tbl_install() workqueue: use NUMA-aware allocation for pool_workqueues workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() workqueue: map an unbound workqueues to multiple per-node pool_workqueues workqueue: move hot fields of workqueue_struct to the end workqueue: make workqueue->name[] fixed len workqueue: add workqueue->unbound_attrs workqueue: determine NUMA node of workers accourding to the allowed cpumask workqueue: drop 'H' from kworker names of unbound worker pools workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] workqueue: move pwq_pool_locking outside of get/put_unbound_pool() workqueue: fix memory leak in apply_workqueue_attrs() workqueue: fix unbound workqueue attrs hashing / comparison workqueue: fix race condition in unbound workqueue free path workqueue: remove pwq_lock which is no longer used ...	2013-04-29 19:07:40 -07:00
Linus Torvalds	ce8aa48929	Merge branch 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull async update from Tejun Heo: "This contains three cleanup patches for async from Lai. All three patches are essentially cosmetic." * 'for-3.10-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: async: rename and redefine async_func_ptr async: remove unused @node from struct async_domain async: simplify lowest_in_progress()	2013-04-29 19:06:59 -07:00
Linus Torvalds	b97db07510	Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu patch from Tejun Heo: "A puny pull request for percpu. We were expecting more cleanup patches but didn't happen this time, so just a single patch adding documentation from Christoph." * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add documentation on this_cpu operations	2013-04-29 19:06:16 -07:00
Linus Torvalds	73154383f0	Merge branch 'akpm' (incoming from Andrew) Merge first batch of fixes from Andrew Morton: - A couple of kthread changes - A few minor audit patches - A number of fbdev patches. Florian remains AWOL so I'm picking up some of these. - A few kbuild things - ocfs2 updates - Almost all of the MM queue (And in the meantime, I already have the second big batch from Andrew pending in my mailbox ;^) * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (149 commits) memcg: take reference before releasing rcu_read_lock mem hotunplug: fix kfree() of bootmem memory mmKconfig: add an option to disable bounce mm, nobootmem: do memset() after memblock_reserve() mm, nobootmem: clean-up of free_low_memory_core_early() fs/buffer.c: remove unnecessary init operation after allocating buffer_head. numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU mm: fix memory_hotplug.c printk format warning mm: swap: mark swap pages writeback before queueing for direct IO swap: redirty page if page write fails on swap file mm, memcg: give exiting processes access to memory reserves thp: fix huge zero page logic for page with pfn == 0 memcg: avoid accessing memcg after releasing reference fs: fix fsync() error reporting memblock: fix missing comment of memblock_insert_region() mm: Remove unused parameter of pages_correctly_reserved() firmware, memmap: fix firmware_map_entry leak mm/vmstat: add note on safety of drain_zonestat mm: thp: add split tail pages to shrink page list in page reclaim mm: allow for outstanding swap writeback accounting ...	2013-04-29 17:29:08 -07:00
Linus Torvalds	362ed48dee	Merge tag 'clk-for-linus-3.10' of git://git.linaro.org/people/mturquette/linux Pull clock framework update from Michael Turquette: "The common clock framework changes for 3.10 include many fixes for existing platforms, as well as adoption of the framework by new platforms and devices. Some long-needed fixes to the core framework are here as well as new features such as improved initialization of clocks from DT as well as framework reentrancy for nested clock operations." * tag 'clk-for-linus-3.10' of git://git.linaro.org/people/mturquette/linux: (44 commits) clk: add clk_ignore_unused option to keep boot clocks on clk: ux500: fix mismatched types clk: vexpress: Add separate SP810 driver clk: si5351: make clk-si5351 depend on CONFIG_OF clk: export __clk_get_flags for modular clock providers clk: vt8500: Missing breaks in vtwm_pll_round_rate/_set_rate. clk: sunxi: Unify oscillator clock clk: composite: allow fixed rates & fixed dividers clk: composite: rename 'div' references to 'rate' clk: add si5351 i2c common clock driver clk: add device tree fixed-factor-clock binding support clk: Properly handle notifier return values clk: ux500: abx500: Define clock tree for ab850x clk: ux500: Add support for sysctrl clocks clk: mvebu: Fix valid value range checking for cpu_freq_select clk: Fixup locking issues for clk_set_parent clk: Fixup errorhandling for clk_set_parent clk: Restructure code for __clk_reparent clk: sunxi: drop an unnecesary kmalloc clk: sunxi: drop CLK_IGNORE_UNUSED ...	2013-04-29 16:43:54 -07:00
Linus Torvalds	61f3d0a988	Merge tag 'spi-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi updates from Mark Brown: "A fairly quiet release for SPI, mainly driver work. A few highlights: - Supports bits per word compatibility checking in the core. - Allow use of the IP used in Freescale SPI controllers outside Freescale SoCs. - DMA support for the Atmel SPI driver. - New drivers for the BCM2835 and Tegra114" * tag 'spi-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (68 commits) spi-topcliff-pch: fix to use list_for_each_entry_safe() when delete list items spi-topcliff-pch: missing platform_driver_unregister() on error in pch_spi_init() ARM: dts: add pinctrl property for spi node for atmel SoC ARM: dts: add spi nodes for the atmel boards ARM: dts: add spi nodes for atmel SoC ARM: at91: add clocks for spi dt entries spi/spi-atmel: add dmaengine support spi/spi-atmel: add flag to controller data for lock operations spi/spi-atmel: add physical base address spi/sirf: fix MODULE_DEVICE_TABLE MAINTAINERS: Add git repository and update my address spi/s3c64xx: Check for errors in dmaengine prepare_transfer() spi/s3c64xx: Fix non-dmaengine usage spi: omap2-mcspi: fix error return code in omap2_mcspi_probe() spi/s3c64xx: let device core setup the default pin configuration MAINTAINERS: Update Grant's email address and maintainership spi: omap2-mcspi: Fix transfers if DMADEVICES is not set spi: s3c64xx: move to generic dmaengine API spi-gpio: init CS before spi_bitbang_setup() spi: spi-mpc512x-psc: let transmiter/receiver enabled when in xfer loop ...	2013-04-29 16:38:41 -07:00
Linus Torvalds	8ded8d4e4f	Merge tag 'regulator-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator updates from Mark Brown: "The diffstat and changelog here is dominated by Lee Jones' heroic efforts to sync the ab8500 driver that's been maintained out of tree with mainline (plus Axel's cleanup work on the results) but there's a few other things here: - Axel Lin added regulator_map_voltage_ascend() optimising a common pattern for drivers using the core code. - Milo Kim tought the regulator core to handle regulators sharing an enable GPIO, avoiding the need to do hacks to support such systems. - Andrew Bresticker added code to handle missing supplies for regulators more sensibly for device tree systems, reducing the need for stubbing there. plus the usual batch of driver specific updates and fixes" * tag 'regulator-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (152 commits) regulator: mc13892: Fix MC13892_SWITCHERS0_SWxHI bit in set_voltage_sel regulator: Remove NULL test before calling regulator_unregister() regulator: mc13783: Add device tree probe support regulator: mc13xxx: Add warning of incorrect names of regulators regulator: max77686: Don't update max77686->opmode if update register fails regulator: max8952: Add missing config.of_node setting for regulator register regulator: ab3100: Fix regulator register error handling regulator: tps6524x: Use regulator_map_voltage_ascend regulator: lp8788-buck: Use regulator_map_voltage_ascend regulator: lp872x: Use regulator_map_voltage_ascend regulator: mc13892: Use regulator_map_voltage_ascend for mc13892_sw_regulator_ops regulator: tps65023: Use regulator_map_voltage_ascend regulator: tps65023: Merge tps65020 ldo1 and ldo2 vsel table regulator: tps6507x: Use regulator_map_voltage_ascend regulator: mc13892: Fix MC13892_SWITCHERS0_SWxHI bit in set_voltage_sel regulator: ab3100: device tree support regulator: ab3100: refactor probe to use IDs regulator: max8973: Don't override control1 variable when set ramp delay bits regulator: tps80031: Convert tps80031_dcdc_ops to [get\|set]_voltage_sel_regmap regulator: tps80031: Fix LDO2 track mode for TPS80031 or TPS80032-ES1.0 ...	2013-04-29 16:32:25 -07:00
Linus Torvalds	7b053842b9	Merge tag 'regmap-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap updates from Mark Brown: "In user visible terms just a couple of enhancements here, though there was a moderate amount of refactoring required in order to support the register cache sync performance improvements. - Support for block and asynchronous I/O during register cache syncing; this provides a use case dependant performance improvement. - Additional debugfs information on the memory consuption and register set" * tag 'regmap-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (23 commits) regmap: don't corrupt work buffer in _regmap_raw_write() regmap: cache: Fix format specifier in dev_dbg regmap: cache: Make regcache_sync_block_raw static regmap: cache: Write consecutive registers in a single block write regmap: cache: Split raw and non-raw syncs regmap: cache: Factor out block sync regmap: cache: Factor out reg_present support from rbtree cache regmap: cache: Use raw I/O to sync rbtrees if we can regmap: core: Provide regmap_can_raw_write() operation regmap: cache: Provide a get address of value operation regmap: Cut down on the average # of nodes in the rbtree cache regmap: core: Make raw write available to regcache regmap: core: Warn on invalid operation combinations regmap: irq: Clarify error message when we fail to request primary IRQ regmap: rbtree Expose total memory consumption in the rbtree debugfs entry regmap: debugfs: Add a registers `range' file regmap: debugfs: Simplify calculation of `c->max_reg' regmap: cache: Store caches in native register format where possible regmap: core: Split out in place value parsing regmap: cache: Use regcache_get_value() to check if we updated ...	2013-04-29 16:31:26 -07:00
Li Zefan	ca0dde9717	memcg: take reference before releasing rcu_read_lock The memcg is not referenced, so it can be destroyed at anytime right after we exit rcu read section, so it's not safe to access it. To fix this, we call css_tryget() to get a reference while we're still in rcu read section. This also removes a bogus comment above __memcg_create_cache_enqueue(). Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:40 -07:00
Yasuaki Ishimatsu	ebff7d8f27	mem hotunplug: fix kfree() of bootmem memory When hot removing memory presented at boot time, following messages are shown: kernel BUG at mm/slub.c:3409! invalid opcode: 0000 [#1] SMP Modules linked in: ebtable_nat ebtables xt_CHECKSUM iptable_mangle bridge stp llc ipmi_devintf ipmi_msghandler sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc vfat fat dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core e1000e ptp pps_core tpm_infineon ioatdma dca sr_mod cdrom sd_mod crc_t10dif usb_storage megaraid_sas lpfc scsi_transport_fc scsi_tgt scsi_mod CPU 0 Pid: 5091, comm: kworker/0:2 Tainted: G W 3.9.0-rc6+ #15 RIP: kfree+0x232/0x240 Process kworker/0:2 (pid: 5091, threadinfo ffff88084678c000, task ffff88083928ca80) Call Trace: __release_region+0xd4/0xe0 __remove_pages+0x52/0x110 arch_remove_memory+0x89/0xd0 remove_memory+0xc4/0x100 acpi_memory_device_remove+0x6d/0xb1 acpi_device_remove+0x89/0xab __device_release_driver+0x7c/0xf0 device_release_driver+0x2f/0x50 acpi_bus_device_detach+0x6c/0x70 acpi_ns_walk_namespace+0x11a/0x250 acpi_walk_namespace+0xee/0x137 acpi_bus_trim+0x33/0x7a acpi_bus_hot_remove_device+0xc4/0x1a1 acpi_os_execute_deferred+0x27/0x34 process_one_work+0x1f7/0x590 worker_thread+0x11a/0x370 kthread+0xee/0x100 ret_from_fork+0x7c/0xb0 RIP [<ffffffff811c41d2>] kfree+0x232/0x240 RSP <ffff88084678d968> The reason why the messages are shown is to release a resource structure, allocated by bootmem, by kfree(). So when we release a resource structure, we should check whether it is allocated by bootmem or not. But even if we know a resource structure is allocated by bootmem, we cannot release it since SLxB cannot treat it. So for reusing a resource structure, this patch remembers it by using bootmem_resource as follows: When releasing a resource structure by free_resource(), free_resource() checks whether the resource structure is allocated by bootmem or not. If it is allocated by bootmem, free_resource() adds it to bootmem_resource. If it is not allocated by bootmem, free_resource() release it by kfree(). And when getting a new resource structure by get_resource(), get_resource() checks whether bootmem_resource has released resource structures or not. If there is a released resource structure, get_resource() returns it. If there is not a releaed resource structure, get_resource() returns new resource structure allocated by kzalloc(). [akpm@linux-foundation.org: s/get_resource/alloc_resource/] Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Toshi Kani <toshi.kani@hp.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ram Pai <linuxram@us.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:40 -07:00
Vinayak Menon	9ca24e2e19	mmKconfig: add an option to disable bounce There are times when HIGHMEM is enabled, but we don't prefer CONFIG_BOUNCE to be enabled. CONFIG_BOUNCE can reduce the block device throughput, and this is not ideal for machines where we don't gain much by enabling it. So provide an option to deselect CONFIG_BOUNCE. The observation was made while measuring eMMC throughput using iozone on an ARM device with 1GB RAM. Signed-off-by: Vinayak Menon <vinayakm.list@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:40 -07:00
Joonsoo Kim	b476e2951f	mm, nobootmem: do memset() after memblock_reserve() Currently, we do memset() before reserving the area. This may not cause any problem, but it is somewhat weird. So change execution order. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Joonsoo Kim	b4def3509d	mm, nobootmem: clean-up of free_low_memory_core_early() Remove unused argument and make function static, because there is no user outside of nobootmem.c Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jiang Liu <liuj97@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
majianpeng	e76004093d	fs/buffer.c: remove unnecessary init operation after allocating buffer_head. bh allocation uses kmem_cache_zalloc() so we needn't call 'init_buffer(bh, NULL, NULL)' and perform other set-zero-operations. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Yasuaki Ishimatsu	3464046824	numa, cpu hotplug: change links of CPU and node when changing node number by onlining CPU When booting x86 system contains memoryless node, node numbers of CPUs on memoryless node were changed to nearest online node number by init_cpu_to_node() because the node is not online. In my system, node numbers of cpu#30-44 and 75-89 were changed from 2 to 0 as follows: $ numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 node 0 size: 32394 MB node 0 free: 27898 MB node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 node 1 size: 32768 MB node 1 free: 30335 MB If we hot add memory to memoryless node and offine/online all CPUs on the node, node numbers of these CPUs are changed to correct node numbers by srat_detect_node() because the node become online. In this case, node numbers of cpu#30-44 and 75-89 were changed from 0 to 2 in my system as follows: $ numactl --hardware available: 3 nodes (0-2) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 node 0 size: 32394 MB node 0 free: 27218 MB node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 node 1 size: 32768 MB node 1 free: 30014 MB node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 node 2 size: 16384 MB node 2 free: 16384 MB But "cpu to node" and "node to cpu" links were not changed as follows: $ ls /sys/devices/system/cpu/cpu30/\|grep node node0 $ ls /sys/devices/system/node/node0/\|grep cpu30 cpu30 "numactl --hardware" shows that cpu30 belongs to node 2. But sysfs links does not change. This patch changes "cpu to node" and "node to cpu" links when node number changed by onlining CPU. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Randy Dunlap	349daa0f93	mm: fix memory_hotplug.c printk format warning PFN_PHYS() is a phys_addr_t, which can be u32 or u64. Fix the build warning when phys_addr_t is u32. mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 2 has type 'unsigned int' [-Wformat]: => 1685:3 mm/memory_hotplug.c: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]: => 1685:3 Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Mel Gorman	0cdc444a67	mm: swap: mark swap pages writeback before queueing for direct IO As pointed out by Andrew Morton, the swap-over-NFS writeback is not setting PageWriteback before it is queued for direct IO. While swap pages do not participate in BDI or process dirty accounting and the IO is synchronous, the writeback bit is still required and not setting it in this case was an oversight. swapoff depends on the page writeback to synchronoise all pending writes on a swap page before it is reused. Swapcache freeing and reuse depend on checking the PageWriteback under lock to ensure the page is safe to reuse. Direct IO handlers and the direct IO handler for NFS do not deal with PageWriteback as they are synchronous writes. In the case of NFS, it schedules pages (or a page in the case of swap) for IO and then waits synchronously for IO to complete in nfs_direct_write(). It is recognised that this is a slowdown from normal swap handling which is asynchronous and uses a completion handler. Shoving PageWriteback handling down into direct IO handlers looks like a bad fit to handle the swap case although it may have to be dealt with some day if swap is converted to use direct IO in general and bmap is finally done away with. At that point it will be necessary to refit asynchronous direct IO with completion handlers onto the swap subsystem. As swapcache currently depends on PageWriteback to protect against races, this patch sets PageWriteback under the page lock before queueing it for direct IO. It is cleared when the direct IO handler returns. IO errors are treated similarly to the direct-to-bio case except PageError is not set as in the case of swap-over-NFS, it is likely to be a transient error. It was asked what prevents such a page being reclaimed in parallel. With this patch applied, such a page will now be skipped (most of the time) or blocked until the writeback completes. Reclaim checks PageWriteback under the page lock before calling try_to_free_swap and the page lock should prevent the page being requeued for IO before it is freed. This and Jerome's related patch should considered for -stable as far back as 3.6 when swap-over-NFS was introduced. [akpm@linux-foundation.org: use pr_err_ratelimited()] [akpm@linux-foundation.org: remove hopefully-unneeded cast in printk] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Jerome Marchand	2d30d31ea3	swap: redirty page if page write fails on swap file Since commit `62c230bc17` ("mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages"), swap_writepage() calls direct_IO on swap files. However, in that case the page isn't redirtied if I/O fails, and is therefore handled afterwards as if it has been successfully written to the swap file, leading to memory corruption when the page is eventually swapped back in. This patch sets the page dirty when direct_IO() fails. It fixes a memory corruption that happened while using swap-over-NFS. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
David Rientjes	465adcf1ea	mm, memcg: give exiting processes access to memory reserves A memcg may livelock when oom if the process that grabs the hierarchy's oom lock is never the first process with PF_EXITING set in the memcg's task iteration. The oom killer, both global and memcg, will defer if it finds an eligible process that is in the process of exiting and it is not being ptraced. The idea is to allow it to exit without using memory reserves before needlessly killing another process. This normally works fine except in the memcg case with a large number of threads attached to the oom memcg. In this case, the memcg oom killer only gets called for the process that grabs the hierarchy's oom lock; all others end up blocked on the memcg's oom waitqueue. Thus, if the process that grabs the hierarchy's oom lock is never the first PF_EXITING process in the memcg's task iteration, the oom killer is constantly deferred without anything making progress. The fix is to give PF_EXITING processes access to memory reserves so that we've marked them as oom killed without any iteration. This allows __mem_cgroup_try_charge() to succeed so that the process may exit. This makes the memcg oom killer exemption for TIF_MEMDIE tasks, now immediately granted for processes with pending SIGKILLs and those in the exit path, to be equivalent to what is done for the global oom killer. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Kirill A. Shutemov	5918d10a4b	thp: fix huge zero page logic for page with pfn == 0 Current implementation of huge zero page uses pfn value 0 to indicate that the page hasn't allocated yet. It assumes that buddy page allocator can't return page with pfn == 0. Let's rework the code to store 'struct page *' of huge zero page, not its pfn. This way we can avoid the weak assumption. [akpm@linux-foundation.org: fix sparse warning] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Li Zefan	fd0ccaf2bd	memcg: avoid accessing memcg after releasing reference This might cause a use-after-free bug. Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:39 -07:00
Dmitry Monakhov	865ffef379	fs: fix fsync() error reporting There are two convenient ways to report errors to userspace 1) retun error to original syscall for example write(2) 2) mark mapping with error flag and return it on later fsync(2) Second one is broken if (mapping->nrpages == 0) This is real-life situation because after error pages are likey to be truncated or invalidated. We have to return an error regardless to number of pages in the mapping. #Original testcase: git@github.com:dmonakhov/xfstests.git MOUNT_OPTIONS="-b1024" ./check shared/305 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:38 -07:00
Tang Chen	209ff86d61	memblock: fix missing comment of memblock_insert_region() There is no comment for parameter nid of memblock_insert_region(). This patch adds comment for it. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:38 -07:00
Tang Chen	6056d619a8	mm: Remove unused parameter of pages_correctly_reserved() nr_pages is not used in pages_correctly_reserved(). So remove it. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-29 15:54:38 -07:00

1 2 3 4 5 ...

365605 Commits