Commit Graph

1326352 Commits

Author SHA1 Message Date
Jens Axboe
a23ad06bfe io_uring/register: use atomic_read/write for sq_flags migration
A previous commit changed all of the migration from the old to the new
ring for resizing to use READ/WRITE_ONCE. However, ->sq_flags is an
atomic_t, and while most archs won't complain on this, some will indeed
flag this:

io_uring/register.c:554:9: sparse: sparse: cast to non-scalar
io_uring/register.c:554:9: sparse: sparse: cast from non-scalar

Just use atomic_set/atomic_read for handling this case.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501242000.A2sKqaCL-lkp@intel.com/
Fixes: 2c5aae129f ("io_uring/register: document io_register_resize_rings() shared mem usage")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-24 14:36:43 -07:00
Jens Axboe
ff74954e4e io_uring/alloc_cache: get rid of _nocache() helper
Just allow passing in NULL for the cache, if the type in question
doesn't have a cache associated with it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23 11:32:34 -07:00
Jens Axboe
fa3595523d io_uring: get rid of alloc cache init_once handling
init_once is called when an object doesn't come from the cache, and
hence needs initial clearing of certain members. While the whole
struct could get cleared by memset() in that case, a few of the cache
members are large enough that this may cause unnecessary overhead if
the caches used aren't large enough to satisfy the workload. For those
cases, some churn of kmalloc+kfree is to be expected.

Ensure that the 3 users that need clearing put the members they need
cleared at the start of the struct, and wrap the rest of the struct in
a struct group so the offset is known.

While at it, improve the interaction with KASAN such that when/if
KASAN writes to members inside the struct that should be retained over
caching, it won't trip over itself. For rw and net, the retaining of
the iovec over caching is disabled if KASAN is enabled. A helper will
free and clear those members in that case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23 11:32:28 -07:00
Jens Axboe
eaf72f7b41 io_uring/uring_cmd: cleanup struct io_uring_cmd_data layout
A few spots in uring_cmd assume that the SQEs copied are always at the
start of the structure, and hence mix req->async_data and the struct
itself.

Clean that up and use the proper indices.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23 11:29:00 -07:00
Jens Axboe
d58d82bd0e io_uring/uring_cmd: use cached cmd_op in io_uring_cmd_sock()
io_uring_cmd_sock() does a normal read of cmd->sqe->cmd_op, where it
really should be using a READ_ONCE() as ->sqe may still be pointing to
the original SQE. Since the prep side already does this READ_ONCE() and
stores it locally, use that value rather than re-read it.

Fixes: 8e9fad0e70 ("io_uring: Add io_uring command support for sockets")
Link: https://lore.kernel.org/r/20250121-uring-sockcmd-fix-v1-1-add742802a29@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23 11:27:52 -07:00
Jens Axboe
69a62e03f8 io_uring/msg_ring: don't leave potentially dangling ->tctx pointer
For remote posting of messages, req->tctx is assigned even though it
is never used. Rather than leave a dangling pointer, just clear it to
NULL and use the previous check for a valid submitter_task to gate on
whether or not the request should be terminated.

Reported-by: Jann Horn <jannh@google.com>
Fixes: b6f58a3f4a ("io_uring: move struct io_kiocb from task_struct to io_uring_task")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-22 17:10:45 -07:00
Jann Horn
2839ab71ac io_uring/rsrc: Move lockdep assert from io_free_rsrc_node() to caller
Checking for lockdep_assert_held(&ctx->uring_lock) in io_free_rsrc_node()
means that the assertion is only checked when the resource drops to zero
references.
Move the lockdep assertion up into the caller io_put_rsrc_node() so that it
instead happens on every reference count decrement.

Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250120-uring-lockdep-assert-earlier-v1-1-68d8e071a4bb@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21 07:07:26 -07:00
Sidong Yang
b73de0da50 io_uring/rsrc: remove unused parameter ctx for io_rsrc_node_alloc()
io_uring_ctx parameter for io_rsrc_node_alloc() is unused for now.
This patch removes the parameter and fixes the callers accordingly.

Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Link: https://lore.kernel.org/r/20250115142033.658599-1-sidong.yang@furiosa.ai
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21 07:07:21 -07:00
Pavel Begunkov
bb2d76344b io_uring: clean up io_uring_register_get_file()
Make it always reference the returned file. It's safer, especially with
unregistrations happening under it. And it makes the api cleaner with no
conditional clean ups by the caller.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0d0b13a63e8edd6b5d360fc821dcdb035cb6b7e0.1736995897.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21 07:07:17 -07:00
Jann Horn
5719e28235 io_uring/rsrc: Simplify buffer cloning by locking both rings
The locking in the buffer cloning code is somewhat complex because it goes
back and forth between locking the source ring and the destination ring.

Make it easier to reason about by locking both rings at the same time.
To avoid ABBA deadlocks, lock the rings in ascending kernel address order,
just like in lock_two_nondirectories().

Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20250115-uring-clone-refactor-v2-1-7289ba50776d@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-21 07:07:10 -07:00
Linus Torvalds
95ec54a420 Merge tag 'powerpc-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan:

 - Add preempt lazy support

 - Deprecate cxl and cxl flash driver

 - Fix a possible IOMMU related OOPS at boot on pSeries

 - Optimize sched_clock() in ppc32 by replacing mulhdu() by
   mul_u64_u64_shr()

Thanks to Andrew Donnellan, Andy Shevchenko, Ankur Arora, Christophe
Leroy, Frederic Barrat, Gaurav Batra, Luis Felipe Hernandez, Michael
Ellerman, Nilay Shroff, Ricardo B.  Marliere, Ritesh Harjani (IBM),
Sebastian Andrzej Siewior, Shrikanth Hegde, Sourabh Jain, Thorsten Blum,
and Zhu Jun.

* tag 'powerpc-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  selftests/powerpc: Fix argument order to timer_sub()
  powerpc/prom_init: Use IS_ENABLED()
  powerpc/pseries/iommu: IOMMU incorrectly marks MMIO range in DDW
  powerpc: Use str_on_off() helper in check_cache_coherency()
  powerpc: Large user copy aware of full:rt:lazy preemption
  powerpc: Add preempt lazy support
  powerpc/book3s64/hugetlb: Fix disabling hugetlb when fadump is active
  powerpc/vdso: Mark the vDSO code read-only after init
  powerpc/64: Use get_user() in start_thread()
  macintosh: declare ctl_table as const
  selftest/powerpc/ptrace: Cleanup duplicate macro definitions
  selftest/powerpc/ptrace/ptrace-pkey: Remove duplicate macros
  selftest/powerpc/ptrace/core-pkey: Remove duplicate macros
  powerpc/8xx: Drop legacy-of-mm-gpiochip.h header
  scsi/cxlflash: Deprecate driver
  cxl: Deprecate driver
  selftests/powerpc: Fix typo in test-vphn.c
  powerpc/xmon: Use str_yes_no() helper in dump_one_paca()
  powerpc/32: Replace mulhdu() by mul_u64_u64_shr()
2025-01-20 21:40:19 -08:00
Linus Torvalds
9ad09c4f28 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "We've got a little less than normal thanks to the holidays in
  December, but there's the usual summary below. The highlight is
  probably the 52-bit physical addressing (LPA2) clean-up from Ard.

  Confidential Computing:

   - Register a platform device when running in CCA realm mode to enable
     automatic loading of dependent modules

  CPU Features:

   - Update a bunch of system register definitions to pick up new field
     encodings from the architectural documentation

   - Add hwcaps and selftests for the new (2024) dpISA extensions

  Documentation:

   - Update EL3 (firmware) requirements for booting Linux on modern
     arm64 designs

   - Remove stale information about the kernel virtual memory map

  Miscellaneous:

   - Minor cleanups and typo fixes

  Memory management:

   - Fix vmemmap_check_pmd() to look at the PMD type bits

   - LPA2 (52-bit physical addressing) cleanups and minor fixes

   - Adjust physical address space depending upon whether or not LPA2 is
     enabled

  Perf and PMUs:

   - Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU

   - Extend AXI filtering support for the DDR PMU on NXP IMX SoCs

   - Fix Designware PCIe PMU event numbering

   - Add generic branch events for the Apple M1 CPU PMU

   - Add support for Marvell Odyssey DDR and LLC-TAD PMUs

   - Cleanups to the Hisilicon DDRC and Uncore PMU code

   - Advertise discard mode for the SPE PMU

   - Add the perf users mailing list to our MAINTAINERS entry"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
  Documentation: arm64: Remove stale and redundant virtual memory diagrams
  perf docs: arm_spe: Document new discard mode
  perf: arm_spe: Add format option for discard mode
  MAINTAINERS: Add perf list for drivers/perf/
  arm64: Remove duplicate included header
  drivers/perf: apple_m1: Map generic branch events
  arm64: rsi: Add automatic arm-cca-guest module loading
  kselftest/arm64: Add 2024 dpISA extensions to hwcap test
  KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
  arm64/hwcap: Describe 2024 dpISA extensions to userspace
  arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
  arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
  drivers/perf: hisi: Set correct IRQ affinity for PMUs with no association
  arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
  arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
  arm64/mm: Replace open encodings with PXD_TABLE_BIT
  arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
  arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
  ...
2025-01-20 21:21:49 -08:00
Linus Torvalds
e7244cc382 Merge tag 'm68k-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:

 - Use the generic muldi3 libgcc function

 - Miscellaneous fixes and improvements

* tag 'm68k-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: libgcc: Fix lvalue abuse in umul_ppmm()
  m68k: vga: Fix I/O defines
  zorro: Constify 'struct bin_attribute'
  m68k: atari: Use str_on_off() helper in atari_nvram_proc_read()
  m68k: Use kernel's generic muldi3 libgcc function
2025-01-20 21:18:36 -08:00
Linus Torvalds
4f42d0bf72 Merge tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:

 - Select config option KASAN_VMALLOC if KASAN is enabled

 - Select config option VMAP_STACK unconditionally

 - Implement arch_atomic_inc() / arch_atomic_dec() functions which
   result in a single instruction if compiled for z196 or newer
   architectures

 - Make layering between atomic.h and atomic_ops.h consistent

 - Comment s390 preempt_count implementation

 - Remove pre MARCH_HAS_Z196_FEATURES preempt count implementation

 - GCC uses the number of lines of an inline assembly to calculate
   number of instructions and decide on inlining. Therefore remove
   superfluous new lines from a couple of inline assemblies.

 - Provide arch_atomic_*_and_test() implementations that allow the
   compiler to generate slightly better code.

 - Optimize __preempt_count_dec_and_test()

 - Remove __bootdata annotations from declarations in header files

 - Add missing include of <linux/smp.h> in abs_lowcore.h to provide
   declarations for get_cpu() and put_cpu() used in the code

 - Fix suboptimal kernel image base when running make kasan.config

 - Remove huge_pte_none() and huge_pte_none_mostly() as are identical to
   the generic variants

 - Remove unused PAGE_KERNEL_EXEC, SEGMENT_KERNEL_EXEC, and
   REGION3_KERNEL_EXEC defines

 - Simplify noexec page protection handling and change the page, segment
   and region3 protection definitions automatically if the instruction
   execution-protection facility is not available

 - Save one instruction and prefer EXRL instruction over EX in string,
   xor_*(), amode31 and other functions

 - Create /dev/diag misc device to fetch diagnose specific information
   from the kernel and provide it to userspace

 - Retrieve electrical power readings using DIAGNOSE 0x324 ioctl

 - Make ccw_device_get_ciw() consistent and use array indices instead of
   pointer arithmetic

 - s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into
   one place

 - The sysfs core now allows instances of 'struct bin_attribute' to be
   moved into read-only memory. Make use of that in s390 code

 - Add missing TLB range adjustment in pud_free_tlb()

 - Improve topology setup by adding early polarization detection

 - Fix length checks in codepage_convert() function

 - The generic bitops implementation is nearly identical to the s390
   one. Switch to the generic variant and decrease a bit the kernel
   image size

 - Provide an optimized arch_test_bit() implementation which makes use
   of flag output constraint. This generates slightly better code

 - Provide memory topology information obtanied with DIAGNOSE 0x310
   using ioctl.

 - Various other small improvements, fixes, and cleanups

Also, some changes came in through a merge of 'pci-device-recovery'
branch:

 - Add PCI error recovery status mechanism

 - Simplify and document debug_next_entry() logic

 - Split private data allocation and freeing out of debug file open()
   and close() operations

 - Add debug_dump() function that gets a textual representation of a
   debug info (e.g. PCI recovery hardware error logs)

 - Add formatted content of pci_debug_msg_id to the PCI report

* tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
  s390/futex: Fix FUTEX_OP_ANDN implementation
  s390/diag: Add memory topology information via diag310
  s390/bitops: Provide optimized arch_test_bit()
  s390/bitops: Switch to generic bitops
  s390/ebcdic: Fix length decrement in codepage_convert()
  s390/ebcdic: Fix length check in codepage_convert()
  s390/ebcdic: Use exrl instead of ex
  s390/amode31: Use exrl instead of ex
  s390/stackleak: Use exrl instead of ex in __stackleak_poison()
  s390/lib: Use exrl instead of ex in xor functions
  s390/topology: Improve topology detection
  s390/tlb: Add missing TLB range adjustment
  s390/pkey: Constify 'struct bin_attribute'
  s390/sclp: Constify 'struct bin_attribute'
  s390/pci: Constify 'struct bin_attribute'
  s390/ipl: Constify 'struct bin_attribute'
  s390/crypto/cpacf: Constify 'struct bin_attribute'
  s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into one place
  s390/cio: Use array indices instead of pointer arithmetic
  s390/qdio: Rename feature flag aif_osa to aif_qdio
  ...
2025-01-20 21:14:49 -08:00
Linus Torvalds
a312e1706c Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
 "Not a lot in terms of features this time around, mostly just cleanups
  and code consolidation:

   - Support for PI meta data read/write via io_uring, with NVMe and
     SCSI covered

   - Cleanup the per-op structure caching, making it consistent across
     various command types

   - Consolidate the various user mapped features into a concept called
     regions, making the various users of that consistent

   - Various cleanups and fixes"

* tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits)
  io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname
  io_uring: reuse io_should_terminate_tw() for cmds
  io_uring: Factor out a function to parse restrictions
  io_uring/rsrc: require cloned buffers to share accounting contexts
  io_uring: simplify the SQPOLL thread check when cancelling requests
  io_uring: expose read/write attribute capability
  io_uring/rw: don't gate retry on completion context
  io_uring/rw: handle -EAGAIN retry at IO completion time
  io_uring/rw: use io_rw_recycle() from cleanup path
  io_uring/rsrc: simplify the bvec iter count calculation
  io_uring: ensure io_queue_deferred() is out-of-line
  io_uring/rw: always clear ->bytes_done on io_async_rw setup
  io_uring/rw: use NULL for rw->free_iovec assigment
  io_uring/rw: don't mask in f_iocb_flags
  io_uring/msg_ring: Drop custom destructor
  io_uring: Move old async data allocation helper to header
  io_uring/rw: Allocate async data through helper
  io_uring/net: Allocate msghdr async data through helper
  io_uring/uring_cmd: Allocate async data through generic helper
  io_uring/poll: Allocate apoll with generic alloc_cache helper
  ...
2025-01-20 20:27:33 -08:00
Linus Torvalds
1cbfb828e0 Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
Linus Torvalds
3d3a9c8b89 Merge tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:

 - Fix a case where the new scanning code missed removing an unused rsb

 - Fix the error when removing a configfs entry for an invalid node id

* tag 'dlm-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: return -ENOENT if no comm was found
  dlm: fix srcu_read_lock() return type to int
  dlm: fix removal of rsb struct that is master and dir record
2025-01-20 14:26:59 -08:00
Linus Torvalds
2622f29041 Merge tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:
 "Lots of scalability work, another big on-disk format change. On-disk
  format version goes from 1.13 to 1.20.

  Like 6.11, this is another big and expensive automatic/required on
  disk format upgrade. This is planned to be the last big on disk format
  upgrade before the experimental label comes off. There will be one
  more minor on disk format update for a few things that couldn't make
  this release.

  Headline improvements:

   - Self healing work:

     Allocator and reflink now run the exact same check/repair code that
     fsck does at runtime, where applicable.

     The long term goal here is to remove inconsistent() errors (that
     cause us to go emergency read only) by lifting fsck code up to
     normal runtime paths; we should only go emergency read-only if we
     detect an inconsistency that was due to a runtime bug - or truly
     catastrophic damage (corrupted btree roots/interior nodes).

   - Reflink repair no longer deletes reflink pointers:

     Instead we flip an error bit and log the error, and they can still
     be deleted by file deletion. This means a temporary failure to find
     an indirect extent (perhaps repaired later by btree node scan)
     won't result in unnecessary data loss

   - Improvements to rebalance data path option handling:

     We can now correctly apply changed filesystem-level io path options
     to pending rebalance work, and soon we'll be able to apply
     file-level io path option changes to indirect extents

   - Fix mount time regression that some users encountered post the 6.11
     disk accounting rewrite.

     Accounting keys were encoded little endian (typetag in the low
     bits) - which didn't anticipate adding accounting keys for every
     inode, which aren't stored in memory and we don't want to scan at
     mount time.

   - fsck time on large filesystems is improved by multiple orders of
     magnitude. Previously, 100TB was about the practical max filesystem
     size, where users were reporting fsck times of a day+. With the new
     changes (which nearly eliminate backpointers fsck overhead), we
     fsck'd a filesystem with 10PB of data in 1.5 hours.

     The problematic fsck passes were walking every extent and checking
     for missing backpointers, and walking every backpointer to check
     for dangling backpointers. As we've been adding more and more
     runtime self healing there was no reason to keep around the
     backpointers -> extents pass; dangling backpointers are just
     deleted, and we can do that when using them - thus, backpointers ->
     extents is now only run in debug mode.

     extents -> backpointers does need to exist, since missing
     backpointers would mean we can't find data to move it (for e.g.
     copygc, device evacuate, scrub). But the new on disk format version
     makes possible a new strategy where we sum up backpointers within a
     bucket and check it against the bucket sector counts, and then only
     scan for missing backpointers if the counts are off (and then, only
     for specific buckets).

  Full list of on disk format changes:

   - 1.14: backpointer_bucket_gen

     Backpointers now have a field for the bucket generation number,
     replacing the obsolete bucket_offset field. This is needed for the
     new "sum up backpointers within a bucket" code, since backpointers
     use the btree write buffer - meaning we will see stale reads, and
     this runs online, with the filesystem in full rw mode.

   - 1.15: disk_accounting_big_endian

     As previously described, fix the endianness of accounting keys so
     that accounting keys with the same typetag sort together, and
     accounting read can skip types it's not interested in.

   - 1.16: reflink_p_may_update_opts:

     This version indicates that a new reflink pointer field is
     understood and may be used; the field indicates whether the reflink
     pointer has permissions to update IO path options (e.g.
     compression, replicas) may be updated on the indirect extent it
     points to.

     This completes the rebalance/reflink data path option handling from
     the 6.13 pull request.

   - 1.17: inode_depth

     Add a new inode field, bi_depth, to accelerate the
     check_directory_structure fsck path, which checks for loops in the
     filesystem heirarchy.

     check_inodes and check_dirents check connectivity, so
     check_directory_structure only has to check for loops - by walking
     back up to the root from every directory.

     But a path can't be a loop if it has a counter that increases
     monotonically from root to leaf - adding a depth counter means that
     we can check for loops with only local (parent -> child) checks. We
     might need to occasionally renumber the depth field in fsck if
     directories have been moved around, but then future fsck runs will
     be much faster.

   - 1.18: persistent_inode_cursors

     Previously, the cursor used for inode allocation was only kept in
     memory, which meant that users with large filesystems and lots of
     files were reporting that the first create after mounting would
     take awhile - since it had to scan from the start.

     Inode allocation cursors are now persistent, and also include a
     generation field (incremented on wraparound, which will only happen
     if inode allocation is restricted to 32 bit inodes), so that we
     don't have to leave inode_generation keys around after a delete.

     The option for 32 bit inode numbers may now also be set on
     individual directories, and non-32 bit inode allocations are
     disallowed from allocating from the 32 bit part of the inode number
     space.

   - 1.19: autofix_errors

     Runtime self healing is now the default.o

   - 1.20: directory size (from Hongbo)

     directory i_size is now meaningful, and not 0"

* tag 'bcachefs-2025-01-20.2' of git://evilpiepirate.org/bcachefs: (268 commits)
  bcachefs: Fix check_inode_hash_info_matches_root()
  bcachefs: Document issue with bch_stripe layout
  bcachefs: Fix self healing on read error
  bcachefs: Pop all the transactions from the abort one
  bcachefs: Only abort the transactions in the cycle
  bcachefs: Introduce lock_graph_pop_from
  bcachefs: Convert open-coded lock_graph_pop_all to helper
  bcachefs: Do not allow no fail lock request to fail
  bcachefs: Merge the condition to avoid additional invocation
  Revert "bcachefs: Fix bch2_btree_node_upgrade()"
  bcachefs: bcachefs_metadata_version_directory_size
  bcachefs: make directory i_size meaningful
  bcachefs: check_unreachable_inodes is not actually PASS_ONLINE yet
  bcachefs: Don't use BTREE_ITER_cached when walking alloc btree during fsck
  bcachefs: Check for dirents to overwritten inodes
  bcachefs: bch2_btree_iter_peek_slot() handles navigating to nonexistent depth
  bcachefs: Don't set btree_path to updtodate if we don't fill
  bcachefs: __bch2_btree_pos_to_text()
  bcachefs: printbuf_reset() handles tabstops
  bcachefs: Silence read-only errors when deleting snapshots
  ...
2025-01-20 13:55:19 -08:00
Linus Torvalds
5d8a4bd6b2 Merge tag 'pstore-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:

 - pstore/blk: trivial typo fixes (Eugen Hristev)

 - pstore/zone: reject zero-sized allocations (Eugen Hristev)

* tag 'pstore-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/zone: avoid dereferencing zero sized ptr after init zones
  pstore/blk: trivial typo fixes
2025-01-20 13:37:14 -08:00
Linus Torvalds
fadc3ed9ce Merge tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:

 - fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case (Tycho
   Andersen, Kees Cook)

 - binfmt_misc: Fix comment typos (Christophe JAILLET)

 - move empty argv[0] warning closer to actual logic (Nir Lichtman)

 - remove legacy custom binfmt modules autoloading (Nir Lichtman)

 - Make sure set_task_comm() always NUL-terminates

 - binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan
   Carpenter)

 - coredump: Do not lock when copying "comm"

 - MAINTAINERS: add auxvec.h and set myself as maintainer

* tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_flat: Fix integer overflow bug on 32 bit systems
  selftests/exec: add a test for execveat()'s comm
  exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
  exec: Make sure task->comm is always NUL-terminated
  exec: remove legacy custom binfmt modules autoloading
  exec: move warning of null argv to be next to the relevant code
  fs: binfmt: Fix a typo
  MAINTAINERS: exec: Mark Kees as maintainer
  MAINTAINERS: exec: Add auxvec.h UAPI
  coredump: Do not lock during 'comm' reporting
2025-01-20 13:27:58 -08:00
Linus Torvalds
0eb4aaa230 Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "User visible changes, features:

   - rebuilding of the free space tree at mount time is done in more
     transactions, fix potential hangs when the transaction thread is
     blocked due to large amount of block groups

   - more read IO balancing strategies (experimental config), add two
     new ways how to select a device for read if the profiles allow that
     (all RAID1*), the current default selects the device by pid which
     is good on average but less performant for single reader workloads

       - select preferred device for all reads (namely for testing)
       - round-robin, balance reads across devices relevant for the
         requested IO range

   - add encoded write ioctl support to io_uring (read was added in
     6.12), basis for writing send stream using that instead of
     syscalls, non-blocking mode is not yet implemented

   - support FS_IOC_READ_VERITY_METADATA, applications can use the
     metadata to do their own verification

   - pass inode's i_write_hint to bios, for parity with other
     filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT

  Core:

   - in zoned mode: allow to directly reclaim a block group by simply
     resetting it, then it can be reused and another block group does
     not need to be allocated

   - super block validation now also does more comprehensive sys array
     validation, adding it to the points where superblock is validated
     (post-read, pre-write)

   - subpage mode fixes:
      - fix double accounting of blocks due to some races
      - improved or fixed error handling in a few cases (compression,
        delalloc)

   - raid stripe tree:
      - fix various cases with extent range splitting or deleting
      - implement hole punching to extent range
      - reduce number of stripe tree lookups during bio submission
      - more self-tests

   - updated self-tests (delayed refs)

   - error handling improvements

   - cleanups, refactoring
      - remove rest of backref caching infrastructure from relocation,
        not needed anymore
      - error message updates
      - remove unnecessary calls when extent buffer was marked dirty
      - unused parameter removal
      - code moved to new files

  Other code changes: add rb_find_add_cached() to the rb-tree API"

* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
  btrfs: selftests: add a selftest for deleting two out of three extents
  btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
  btrfs: selftests: add selftest for punching holes into the RAID stripe extents
  btrfs: selftests: test RAID stripe-tree deletion spanning two items
  btrfs: selftests: don't split RAID extents in half
  btrfs: selftests: check for correct return value of failed lookup
  btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
  btrfs: implement hole punching for RAID stripe extents
  btrfs: fix deletion of a range spanning parts two RAID stripe extents
  btrfs: fix tail delete of RAID stripe-extents
  btrfs: fix front delete range calculation for RAID stripe extents
  btrfs: assert RAID stripe-extent length is always greater than 0
  btrfs: don't try to delete RAID stripe-extents if we don't need to
  btrfs: selftests: correct RAID stripe-tree feature flag setting
  btrfs: add io_uring interface for encoded writes
  btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
  btrfs: add extra error messages for delalloc range related errors
  btrfs: subpage: dump the involved bitmap when ASSERT() failed
  btrfs: subpage: fix the bitmap dump of the locked flags
  btrfs: do proper folio cleanup when run_delalloc_nocow() failed
  ...
2025-01-20 13:09:30 -08:00
Linus Torvalds
1851bccf60 Merge tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - In the quota code, to avoid spurious audit messages, don't call
   capable() when quotas are off

 - When changing the 'j' flag of an inode, truncate the inode address
   space to avoid mixing "buffer head" and "iomap" pages

* tag 'gfs2-for-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Truncate address space when flipping GFS2_DIF_JDATA flag
  gfs2: reorder capability check last
2025-01-20 13:06:28 -08:00
Linus Torvalds
b971424b6e Merge tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull afs updates from Christian Brauner:
 "Dynamic root improvements:

   - Create an /afs/.<cell> mountpoint to match the /afs/<cell>
     mountpoint when a cell is created

   - Add some more checks on cell names proposed by the user to prevent
     dodgy symlink bodies from being created. Also prevent rootcell from
     being altered once set to simplify the locking

   - Change the handling of /afs/@cell from being a dentry name
     substitution at lookup time to making it a symlink to the current
     cell name and also provide a /afs/.@cell symlink to point to the
     dotted cell mountpoint

  Fixes:

   - Fix the abort code check in the fallback handling for the
     YFS.RemoveFile2 RPC call

   - Use call->op->server() for oridnary filesystem RPC calls that have
     an operation descriptor instead of call->server()"

* tag 'vfs-6.14-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Fix the fallback handling for the YFS.RemoveFile2 RPC call
  afs: Make /afs/@cell and /afs/.@cell symlinks
  afs: Add rootcell checks
  afs: Make /afs/.<cell> as well as /afs/<cell> mountpoints
2025-01-20 11:40:48 -08:00
Linus Torvalds
47c9f2b3c8 Merge tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs direct-io updates from Christian Brauner:
 "File systems that write out of place usually require different
  alignment for direct I/O writes than what they can do for reads.

  Add a separate dio read align field to statx, as many out of place
  write file systems can easily do reads aligned to the device sector
  size, but require bigger alignment for writes.

  This is usually papered over by falling back to buffered I/O for
  smaller writes and doing read-modify-write cycles, but performance for
  this sucks, so applications benefit from knowing the actual write
  alignment"

* tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: report larger dio alignment for COW inodes
  xfs: report the correct read/write dio alignment for reflinked inodes
  xfs: cleanup xfs_vn_getattr
  fs: add STATX_DIO_READ_ALIGN
  fs: reformat the statx definition
2025-01-20 11:16:50 -08:00
Linus Torvalds
7e587c20ad Merge tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs libfs updates from Christian Brauner:
 "This improves the stable directory offset behavior in various ways.

  Stable offsets are needed so that NFS can reliably read directories on
  filesystems such as tmpfs:

   - Improve the end-of-directory detection

     According to getdents(3), the d_off field in each returned
     directory entry points to the next entry in the directory. The
     d_off field in the last returned entry in the readdir buffer must
     contain a valid offset value, but if it points to an actual
     directory entry, then readdir/getdents can loop.

     Introduce a specific fixed offset value that is placed in the d_off
     field of the last entry in a directory. Some user space
     applications assume that the EOD offset value is larger than the
     offsets of real directory entries, so the largest valid offset
     value is reserved for this purpose. This new value is never
     allocated by simple_offset_add().

     When ->iterate_dir() returns, getdents{64} inserts the ctx->pos
     value into the d_off field of the last valid entry in the readdir
     buffer. When it hits EOD, offset_readdir() sets ctx->pos to the EOD
     offset value so the last entry is updated to point to the EOD
     marker.

     When trying to read the entry at the EOD offset, offset_readdir()
     terminates immediately.

   - Rely on d_children to iterate stable offset directories

     Instead of using the mtree to emit entries in the order of their
     offset values, use it only to map incoming ctx->pos to a starting
     entry. Then use the directory's d_children list, which is already
     maintained properly by the dcache, to find the next child to emit.

   - Narrow the range of directory offset values returned by
     simple_offset_add() to 3 .. (S32_MAX - 1) on all platforms. This
     means the allocation behavior is identical on 32-bit systems,
     64-bit systems, and 32-bit user space on 64-bit kernels. The new
     range still permits over 2 billion concurrent entries per
     directory.

   - Return ENOSPC when the directory offset range is exhausted. Hitting
     this error is almost impossible though.

   - Remove the simple_offset_empty() helper"

* tag 'vfs-6.14-rc1.libfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  libfs: Use d_children list to iterate simple_offset directories
  libfs: Replace simple_offset end-of-directory detection
  Revert "libfs: fix infinite directory reads for offset dir"
  Revert "libfs: Add simple_offset_empty()"
  libfs: Return ENOSPC when the directory offset range is exhausted
2025-01-20 11:00:53 -08:00