mirror of
https://github.com/armbian/linux-rockchip.git
synced 2026-01-06 11:08:10 -08:00
ce80098db2439ee44403ec6fccd3a10be21c7aff
32 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
1507f51255 |
mm: introduce memfd_secret system call to create "secret" memory areas
Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.
The secretmem feature is off by default and the user must explicitly
enable it at the boot time.
Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.
Secretmem is designed to provide the following protections:
* Enhanced protection (in conjunction with all the other in-kernel
attack prevention systems) against ROP attacks. Seceretmem makes
"simple" ROP insufficient to perform exfiltration, which increases the
required complexity of the attack. Along with other protections like
the kernel stack size limit and address space layout randomization which
make finding gadgets is really hard, absence of any in-kernel primitive
for accessing secret memory means the one gadget ROP attack can't work.
Since the only way to access secret memory is to reconstruct the missing
mapping entry, the attacker has to recover the physical page and insert
a PTE pointing to it in the kernel and then retrieve the contents. That
takes at least three gadgets which is a level of difficulty beyond most
standard attacks.
* Prevent cross-process secret userspace memory exposures. Once the
secret memory is allocated, the user can't accidentally pass it into the
kernel to be transmitted somewhere. The secreremem pages cannot be
accessed via the direct map and they are disallowed in GUP.
* Harden against exploited kernel flaws. In order to access secretmem,
a kernel-side attack would need to either walk the page tables and
create new ones, or spawn a new privileged uiserspace process to perform
secrets exfiltration using ptrace.
The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). File
descriptor approach allows explicit and controlled sharing of the memory
areas, it allows to seal the operations. Besides, file descriptor based
memory paves the way for VMMs to remove the secret memory range from the
userspace hipervisor process, for instance QEMU. Andy Lutomirski says:
"Getting fd-backed memory into a guest will take some possibly major
work in the kernel, but getting vma-backed memory into a guest without
mapping it in the host user address space seems much, much worse."
memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because it's purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to
the memory. Nowadays a new system call cost is negligible while it is way
simpler for userspace to deal with a clear-cut system calls than with a
multiplexer or an overloaded syscall. Moreover, the initial
implementation of memfd_secret() is completely distinct from
memfd_create() so there is no much sense in overloading memfd_create() to
begin with. If there will be a need for code sharing between these
implementation it can be easily achieved without a need to adjust user
visible APIs.
The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.
Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.
Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit
|
||
|
|
3234ac664a |
/dev/mem: Revoke mappings when a driver claims the region
Close the hole of holding a mapping over kernel driver takeover event of a given address range. Commit |
||
|
|
8dcc1a9d90 |
fs: New zonefs file system
zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices to the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).
As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reducing the
amount of changes needed in the application as well as introducing
support for different application programming languages.
Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.
The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
under a common sub-directory:
* For conventional zones, the sub-directory "cnv" is used.
* For sequential write zones, the sub-directory "seq" is used.
These two directories are the only directories that exist in zonefs.
Users cannot create other directories and cannot rename nor delete
the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
Conventional zone files cannot be truncated.
4) The size of sequential zone files represent the file's zone write
pointer position relative to the zone start sector. Truncating these
files is allowed only down to 0, in which case, the zone is reset to
rewind the zone write pointer position to the start of the zone, or
up to the zone size, in which case the file's zone is transitioned
to the FULL state (finish zone operation).
5) All read and write operations to files are not allowed beyond the
file zone size. Any access exceeding the zone size is failed with
the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
that can be issued to conventional zone files. Buffered, direct and
mmap read & write operations are accepted. For sequential zone files,
there are no restrictions on read operations, but all write
operations must be direct IO append writes. mmap write of sequential
files is not allowed.
Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
zones can be aggregated into a single larger file instead of the
default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0
(root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
changed.
The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on Github at:
git@github.com:damien-lemoal/zonefs-tools.git.
zonefs-tools also includes a test suite which can be run against any
zoned block device, including null_blk block device created with zoned
mode.
Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.
$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
The size of the zone files sub-directories indicate the number of files
existing for each type of zones. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).
$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0
This aggregated conventional zone file can be used as a regular file.
$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data
The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.
$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355
For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.
$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
The written file can be truncated to the zone size, preventing any
further write operation.
$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
Truncation to 0 size allows freeing the file zone storage space and
restart append-writes to the file.
$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.
$ stat /mnt/seq/0
File: /mnt/seq/0
Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
Device: 870h/2160d Inode: 50431 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
Birth: -
The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.
This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
||
|
|
fe030c9b85 |
powerpc/pseries/cmm: Implement balloon compaction
We can now get rid of the cmm_lock and completely rely on the balloon compaction internals, which now also manage the page list and the lock. Inflated/"loaned" pages are now movable. Memory blocks that contain such pages can get offlined. Also, all such pages will be marked PageOffline() and can therefore be excluded in memory dumps using recent versions of makedumpfile. Don't switch to balloon_page_alloc() yet (due to the GFP_NOIO). Will do that separately to discuss this change in detail. Signed-off-by: David Hildenbrand <david@redhat.com> [mpe: Add isolated_pages-- in cmm_migratepage() as suggested by David] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191031142933.10779-9-david@redhat.com |
||
|
|
47e4937a4a |
erofs: move erofs out of staging
EROFS filesystem has been merged into linux-staging for a year. EROFS is designed to be a better solution of saving extra storage space with guaranteed end-to-end performance for read-only files with the help of reduced metadata, fixed-sized output compression and decompression inplace technologies. In the past year, EROFS was greatly improved by many people as a staging driver, self-tested, betaed by a large number of our internal users, successfully applied to almost all in-service HUAWEI smartphones as the part of EMUI 9.1 and proven to be stable enough to be moved out of staging. EROFS is a self-contained filesystem driver. Although there are still some TODOs to be more generic, we have a dedicated team actively keeping on working on EROFS in order to make it better with the evolution of Linux kernel as the other in-kernel filesystems. As Pavel suggested, it's better to do as one commit since git can do moves and all histories will be saved in this way. Let's promote it from staging and enhance it more actively as a "real" part of kernel for more wider scenarios! Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Pavel Machek <pavel@denx.de> Cc: David Sterba <dsterba@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J . Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Richard Weinberger <richard@nod.at> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Chao Yu <yuchao0@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: Li Guifu <bluce.liguifu@huawei.com> Cc: Fang Wei <fangwei1@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Link: https://lore.kernel.org/r/20190822213659.5501-1-hsiangkao@aol.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
933a90bf4f |
Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount updates from Al Viro: "The first part of mount updates. Convert filesystems to use the new mount API" * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) mnt_init(): call shmem_init() unconditionally constify ksys_mount() string arguments don't bother with registering rootfs init_rootfs(): don't bother with init_ramfs_fs() vfs: Convert smackfs to use the new mount API vfs: Convert selinuxfs to use the new mount API vfs: Convert securityfs to use the new mount API vfs: Convert apparmorfs to use the new mount API vfs: Convert openpromfs to use the new mount API vfs: Convert xenfs to use the new mount API vfs: Convert gadgetfs to use the new mount API vfs: Convert oprofilefs to use the new mount API vfs: Convert ibmasmfs to use the new mount API vfs: Convert qib_fs/ipathfs to use the new mount API vfs: Convert efivarfs to use the new mount API vfs: Convert configfs to use the new mount API vfs: Convert binfmt_misc to use the new mount API convenience helper: get_tree_single() convenience helper get_tree_nodev() vfs: Kill sget_userns() ... |
||
|
|
ed63bb1d1f |
dma-buf: give each buffer a full-fledged inode
By traversing /proc/*/fd and /proc/*/map_files, processes with CAP_ADMIN can get a lot of fine-grained data about how shmem buffers are shared among processes. stat(2) on each entry gives the caller a unique ID (st_ino), the buffer's size (st_size), and even the number of pages currently charged to the buffer (st_blocks / 512). In contrast, all dma-bufs share the same anonymous inode. So while we can count how many dma-buf fds or mappings a process has, we can't get the size of the backing buffers or tell if two entries point to the same dma-buf. On systems with debugfs, we can get a per-buffer breakdown of size and reference count, but can't tell which processes are actually holding the references to each buffer. Replace the singleton inode with full-fledged inodes allocated by alloc_anon_inode(). This involves creating and mounting a mini-pseudo-filesystem for dma-buf, following the example in fs/aio.c. Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20190613223408.139221-2-fengc@google.com |
||
|
|
ea8157ab2a |
zsfold: Convert zsfold to use the new mount API
Convert the zsfold filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Signed-off-by: David Howells <dhowells@redhat.com> |
||
|
|
3ad20fe393 |
binder: implement binderfs
As discussed at Linux Plumbers Conference 2018 in Vancouver [1] this is the
implementation of binderfs.
/* Abstract */
binderfs is a backwards-compatible filesystem for Android's binder ipc
mechanism. Each ipc namespace will mount a new binderfs instance. Mounting
binderfs multiple times at different locations in the same ipc namespace
will not cause a new super block to be allocated and hence it will be the
same filesystem instance.
Each new binderfs mount will have its own set of binder devices only
visible in the ipc namespace it has been mounted in. All devices in a new
binderfs mount will follow the scheme binder%d and numbering will always
start at 0.
/* Backwards compatibility */
Devices requested in the Kconfig via CONFIG_ANDROID_BINDER_DEVICES for the
initial ipc namespace will work as before. They will be registered via
misc_register() and appear in the devtmpfs mount. Specifically, the
standard devices binder, hwbinder, and vndbinder will all appear in their
standard locations in /dev. Mounting or unmounting the binderfs mount in
the initial ipc namespace will have no effect on these devices, i.e. they
will neither show up in the binderfs mount nor will they disappear when the
binderfs mount is gone.
/* binder-control */
Each new binderfs instance comes with a binder-control device. No other
devices will be present at first. The binder-control device can be used to
dynamically allocate binder devices. All requests operate on the binderfs
mount the binder-control device resides in.
Assuming a new instance of binderfs has been mounted at /dev/binderfs
via mount -t binderfs binderfs /dev/binderfs. Then a request to create a
new binder device can be made as illustrated in [2].
Binderfs devices can simply be removed via unlink().
/* Implementation details */
- dynamic major number allocation:
When binderfs is registered as a new filesystem it will dynamically
allocate a new major number. The allocated major number will be returned
in struct binderfs_device when a new binder device is allocated.
- global minor number tracking:
Minor are tracked in a global idr struct that is capped at
BINDERFS_MAX_MINOR. The minor number tracker is protected by a global
mutex. This is the only point of contention between binderfs mounts.
- struct binderfs_info:
Each binderfs super block has its own struct binderfs_info that tracks
specific details about a binderfs instance:
- ipc namespace
- dentry of the binder-control device
- root uid and root gid of the user namespace the binderfs instance
was mounted in
- mountable by user namespace root:
binderfs can be mounted by user namespace root in a non-initial user
namespace. The devices will be owned by user namespace root.
- binderfs binder devices without misc infrastructure:
New binder devices associated with a binderfs mount do not use the
full misc_register() infrastructure.
The misc_register() infrastructure can only create new devices in the
host's devtmpfs mount. binderfs does however only make devices appear
under its own mountpoint and thus allocates new character device nodes
from the inode of the root dentry of the super block. This will have
the side-effect that binderfs specific device nodes do not appear in
sysfs. This behavior is similar to devpts allocated pts devices and
has no effect on the functionality of the ipc mechanism itself.
[1]: https://goo.gl/JL2tfX
[2]: program to allocate a new binderfs binder device:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/android/binder_ctl.h>
int main(int argc, char *argv[])
{
int fd, ret, saved_errno;
size_t len;
struct binderfs_device device = { 0 };
if (argc < 2)
exit(EXIT_FAILURE);
len = strlen(argv[1]);
if (len > BINDERFS_MAX_NAME)
exit(EXIT_FAILURE);
memcpy(device.name, argv[1], len);
fd = open("/dev/binderfs/binder-control", O_RDONLY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open binder-control device\n",
strerror(errno));
exit(EXIT_FAILURE);
}
ret = ioctl(fd, BINDER_CTL_ADD, &device);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to allocate new binder device\n",
strerror(errno));
exit(EXIT_FAILURE);
}
printf("Allocated new binder device with major %d, minor %d, and "
"name %s\n", device.major, device.minor,
device.name);
exit(EXIT_SUCCESS);
}
Cc: Martijn Coenen <maco@android.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
dddde68b8f |
xfs: add a define for statfs magic to uapi
Needed by userspace programs that call fstatfs(). It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two have identical values, they have different semantic meaning: one is an enum cookie meant for statfs, the other a signature of the on-disk format. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com> |
||
|
|
f044c8847b |
afs: Lay the groundwork for supporting network namespaces
Lay the groundwork for supporting network namespaces (netns) to the AFS
filesystem by moving various global features to a network-namespace struct
(afs_net) and providing an instance of this as a temporary global variable
that everything uses via accessor functions for the moment.
The following changes have been made:
(1) Store the netns in the superblock info. This will be obtained from
the mounter's nsproxy on a manual mount and inherited from the parent
superblock on an automount.
(2) The cell list is made per-netns. It can be viewed through
/proc/net/afs/cells and also be modified by writing commands to that
file.
(3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
This is unset by default.
(4) The 'rootcell' module parameter, which sets a cell and VL server list
modifies the init net namespace, thereby allowing an AFS root fs to be
theoretically used.
(5) The volume location lists and the file lock manager are made
per-netns.
(6) The AF_RXRPC socket and associated I/O bits are made per-ns.
The various workqueues remain global for the moment.
Changes still to be made:
(1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
from the old name.
(2) A per-netns subsys needs to be registered for AFS into which it can
store its per-netns data.
(3) Rather than the AF_RXRPC socket being opened on module init, it needs
to be opened on the creation of a superblock in that netns.
(4) The socket needs to be closed when the last superblock using it is
destroyed and all outstanding client calls on it have been completed.
This prevents a reference loop on the namespace.
(5) It is possible that several namespaces will want to use AFS, in which
case each one will need its own UDP port. These can either be set
through /proc/net/afs/cm_port or the kernel can pick one at random.
The init_ns gets 7001 by default.
Other issues that need resolving:
(1) The DNS keyring needs net-namespacing.
(2) Where do upcalls go (eg. DNS request-key upcall)?
(3) Need something like open_socket_in_file_ns() syscall so that AFS
command line tools attempting to operate on an AFS file/volume have
their RPC calls go to the right place.
Signed-off-by: David Howells <dhowells@redhat.com>
|
||
|
|
6f52b16c5b |
License cleanup: add SPDX license identifier to uapi header files with no license
Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license. By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". otherwise syscall usage would not be possible. Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
62aa81d7c4 |
ocfs2: use magic.h
Filesystems generally use SUPER_MAGIC values from magic.h instead of a local definition. Link: http://lkml.kernel.org/r/20170521154217.27917-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a481f4d917 |
apparmor: add custom apparmorfs that will be used by policy namespace files
AppArmor policy needs to be able to be resolved based on the policy namespace a task is confined by. Add a base apparmorfs filesystem that (like nsfs) will exist as a kern mount and be accessed via jump_link through a securityfs file. Setup the base apparmorfs fns and data, but don't use it yet. Signed-off-by: John Johansen <john.johansen@canonical.com> Reviewed-by: Seth Arnold <seth.arnold@canonical.com> Reviewed-by: Kees Cook <keescook@chromium.org> |
||
|
|
5ff193fbde |
x86/intel_rdt: Add basic resctrl filesystem support
Use kernfs as basis for our user interface filesystem. This patch supports mount/umount, and one mount parameter "cdp" to enable code/data prioritization (though all we do at this point is ensure that the system can support CDP). The file system is not populated yet in this patch. [ tglx: Fixed up a few nits and added cdp handling in case of error ] Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com> Cc: "Tony Luck" <tony.luck@intel.com> Cc: "Shaohua Li" <shli@fb.com> Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com> Cc: "Peter Zijlstra" <peterz@infradead.org> Cc: "Stephane Eranian" <eranian@google.com> Cc: "Dave Hansen" <dave.hansen@intel.com> Cc: "David Carrillo-Cisneros" <davidcc@google.com> Cc: "Nilay Vaish" <nilayvaish@gmail.com> Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com> Cc: "Ingo Molnar" <mingo@elte.hu> Cc: "Borislav Petkov" <bp@suse.de> Cc: "H. Peter Anvin" <h.peter.anvin@intel.com> Link: http://lkml.kernel.org/r/1477692289-37412-4-git-send-email-fenghua.yu@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
|
|
3bc52c45ba |
dax: define a unified inode/address_space for device-dax mappings
In support of enabling resize / truncate of device-dax instances, define a pseudo-fs to provide a unified inode/address space for vm operations. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
||
|
|
48b4800a1c |
zsmalloc: page migration support
This patch introduces run-time migration feature for zspage. For migration, VM uses page.lru field so it would be better to not use page.next field which is unified with page.lru for own purpose. For that, firstly, we can get first object offset of the page via runtime calculation instead of using page.index so we can use page.index as link for page chaining instead of page.next. In case of huge object, it stores handle to page.index instead of next link of page chaining because huge object doesn't need to next link for page chaining. So get_next_page need to identify huge object to return NULL. For it, this patch uses PG_owner_priv_1 flag of the page flag. For migration, it supports three functions * zs_page_isolate It isolates a zspage which includes a subpage VM want to migrate from class so anyone cannot allocate new object from the zspage. We could try to isolate a zspage by the number of subpage so subsequent isolation trial of other subpage of the zpsage shouldn't fail. For that, we introduce zspage.isolated count. With that, zs_page_isolate can know whether zspage is already isolated or not for migration so if it is isolated for migration, subsequent isolation trial can be successful without trying further isolation. * zs_page_migrate First of all, it holds write-side zspage->lock to prevent migrate other subpage in zspage. Then, lock all objects in the page VM want to migrate. The reason we should lock all objects in the page is due to race between zs_map_object and zs_page_migrate. zs_map_object zs_page_migrate pin_tag(handle) obj = handle_to_obj(handle) obj_to_location(obj, &page, &obj_idx); write_lock(&zspage->lock) if (!trypin_tag(handle)) goto unpin_object zspage = get_zspage(page); read_lock(&zspage->lock); If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can be stale by migration so it goes crash. If it locks all of objects successfully, it copies content from old page to new one, finally, create new zspage chain with new page. And if it's last isolated subpage in the zspage, put the zspage back to class. * zs_page_putback It returns isolated zspage to right fullness_group list if it fails to migrate a page. If it find a zspage is ZS_EMPTY, it queues zspage freeing to workqueue. See below about async zspage freeing. This patch introduces asynchronous zspage free. The reason to need it is we need page_lock to clear PG_movable but unfortunately, zs_free path should be atomic so the apporach is try to grab page_lock. If it got page_lock of all of pages successfully, it can free zspage immediately. Otherwise, it queues free request and free zspage via workqueue in process context. If zs_free finds the zspage is isolated when it try to free zspage, it delays the freeing until zs_page_putback finds it so it will free free the zspage finally. In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL. First of all, it will use ZS_EMPTY list for delay freeing. And with adding ZS_FULL list, it makes to identify whether zspage is isolated or not via list_empty(&zspage->list) test. [minchan@kernel.org: zsmalloc: keep first object offset in struct page] Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org [minchan@kernel.org: zsmalloc: zspage sanity check] Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b1123ea6d3 |
mm: balloon: use general non-lru movable page feature
Now, VM has a feature to migrate non-lru movable pages so balloon
doesn't need custom migration hooks in migrate.c and compaction.c.
Instead, this patch implements the page->mapping->a_ops->
{isolate|migrate|putback} functions.
With that, we could remove hooks for ballooning in general migration
functions and make balloon compaction simple.
[akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
2a28900be2 |
udf: Export superblock magic to userspace
Currently UDF superblock magic doesn't appear in any userspace header files and thus userspace apps have hard time checking for this fs. Let's export the magic to userspace as with any other filesystem. Signed-off-by: Jan Kara <jack@suse.cz> |
||
|
|
e9f57ebcba |
Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi: "This contains several bug fixes and a new mount option 'default_permissions' that allows read-only exported NFS filesystems to be used as lower layer" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: check dentry positiveness in ovl_cleanup_whiteouts() ovl: setattr: check permissions before copy-up ovl: root: copy attr ovl: move super block magic number to magic.h ovl: use a minimal buffer in ovl_copy_xattr ovl: allow zero size xattr ovl: default permissions |
||
|
|
67e9c74b8a |
cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
With major controllers - cpu, memory and io - shaping up for the
unified hierarchy, cgroup2 is about ready to be, gradually, released
into the wild. Replace __DEVEL__sane_behavior flag which was used to
select the unified hierarchy with a separate filesystem type "cgroup2"
so that unified hierarchy can be mounted as follows.
mount -t cgroup2 none $MOUNT_POINT
The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").
v2: Assign a different magic number to cgroup2 fs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
|
||
|
|
257f871993 |
ovl: move super block magic number to magic.h
The overlayfs file system is not recognized by programs like tail because the magic number is not in standard header location. Move it so that the value will propagate on for the GNU library and utilities. Needs to go in the fstatfs manual page as well. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> |
||
|
|
b2197755b2 |
bpf: add support for persistent maps/progs
This work adds support for "persistent" eBPF maps/programs. The term "persistent" is to be understood that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users. Just to name one example: tc classifier/action. Whenever tc parses the ELF object, extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits. So a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, f.e. between the ingress and egress networking data path. The current workaround is that Unix domain sockets (UDS) need to be instrumented in order to pass the created eBPF map/program file descriptors to a third party management daemon through UDS' socket passing facility. This makes it a bit complicated to deploy shared eBPF maps or programs (programs f.e. for tail calls) among various processes. We've been brainstorming on how we could tackle this issue and various approches have been tried out so far, which can be read up further in the below reference. The architecture we eventually ended up with is a minimal file system that can hold map/prog objects. The file system is a per mount namespace singleton, and the default mount point is /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) with two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor along with a pathname is being passed to bpf(2) that in turn creates (we call it eBPF object pinning) the file system nodes. Only the pathname is being passed to bpf(2) for getting a new BPF file descriptor to an existing node. The user can use that to access maps and progs later on, through bpf(2). Removal of file system nodes is being managed through normal VFS functions such as unlink(2), etc. The file system code is kept to a very minimum and can be further extended later on. The next step I'm working on is to add dump eBPF map/prog commands to bpf(2), so that a specification from a given file descriptor can be retrieved. This can be used by things like CRIU but also applications can inspect the meta data after calling BPF_OBJ_GET. Big thanks also to Alexei and Hannes who significantly contributed in the design discussion that eventually let us end up with this architecture here. Reference: https://lkml.org/lkml/2015/10/15/925 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
|
4282d60689 |
tracefs: Add new tracefs file system
Add a separate file system to handle the tracing directory. Currently it is part of debugfs, but that is starting to show its limits. One thing is that in order to access the tracing infrastructure, you need to mount debugfs. As that includes debugging from all sorts of sub systems in the kernel, it is not considered advisable to mount such an all encompassing debugging system. Having the tracing system in its own file systems gives access to the tracing sub system without needing to include all other systems. Another problem with tracing using the debugfs system is that the instances use mkdir to create sub buffers. debugfs does not support mkdir from userspace so to implement it, special hacks were used. By controlling the file system that the tracing infrastructure uses, this can be properly done without hacks. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> |
||
|
|
e149ed2b80 |
take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now. It's not mountable (not even registered, so it's not in /proc/filesystems, etc.). Files on it *are* bindable - we explicitly permit that in do_loopback(). This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well. get_proc_ns() is a macro now (it's simply returning ->i_private; would have been an inline, if not for header ordering headache). proc_ns_inode() is an ex-parrot. The interface used in procfs is ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops). Dentries and inodes are never hashed; a non-counting reference to dentry is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path() if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details of that mechanism. As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt; it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets from ns_get_path(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |