kernel

mirror of https://github.com/ukui/kernel.git synced 2026-03-09 10:07:04 -07:00

Author	SHA1	Message	Date
Dani Liberman	ea6eb91c09	habanalabs: fix race condition in multi CS completion Race condition occurs when CS fence completes and multi CS did not completed yet, while waiting for multi CS ends and returns indication to user that the CS completed. Next wait for multi CS may be triggered by previous multi CS completion without any current CS completed, causing an error. Example scenario : 1. User do multi CS wait for CSs 1 and 2 on master QID 0 2. CS 1 and 2 reached the "cs release" code. The thread of CS 1 completed both the CS and multi CS handling but the completion thread of CS 2 completed the CS but still did not executed complete_multi_cs (note that in CS completion the sequence is to first do complete all for the CS and then another complete all to signal the multi_cs) 3. User received indication that CS 1 and 2 completed (since we check the CS fence and both indicated as completed) and immediately waits on CS 3 and 4, also on master QID 0. 4. Completion thread of CS2 executed complete_multi_cs before completion of CS 3 and 4 and so will trigger the multi CS wait of CSs 3 and 4 as they wait on master QID 0. This will trigger multi CS completion although none of its current CS has been completed. Fixed by adding multi CS complete handling indication for each CS. CS will be marked to the user as completed only if its fence completed and multi CS handling is done. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Oded Gabbay	1d16a46b1a	habanalabs: use only u32 In the kernel it is common to use u32 and not uint32_t. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Oded Gabbay	efc6b04b86	habanalabs: update firmware files Update the firmware headers to the latest version Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Bharat Jauhari	10cab81d1c	habanalabs: bypass reset for continuous h/w error event There may be a situation where drivers receives continuous fatal H/W error events from FW immediately post reset cycle. This may be due to some fault on the silicon itself. In such case its better to bypass reset cycle so we won't be stuck in endless loop of resets. This commit bypasses reset request in case driver received two back to back FW fatal error before first occurrence of heartbeat event. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:47 +03:00
Yuri Nudelman	f05d17b226	habanalabs: take timestamp on wait for interrupt Taking an accurate timestamp in a close proximity of the interrupt is required for user side statistics management. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	c1904127ce	habanalabs: prevent race between fd close/open The driver allows only a single process to open a device's FD at any single time. This is done by checking "hdev->compute_ctx" under mutex. Therefore, to prevent a race between the moment a user closes it's FD and when another user tries to open the device, we need to make sure that clearing this variable is the very last thing that is done in the code of the FD's release. I'm moving the idle check before clearing this variable and the "reset on device release". btw, if the reset happens it will prevent any other user from opening the device until the reset is finished. An important thing to note is that we need to remove the user process that is closing the device from the process list BEFORE calling the reset function. That is to prevent a case where the reset code will try to kill that user process and it is unnecessary as the process doesn't hold any device/driver resources anymore. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	1282dbbd29	habanalabs: refactor reset log message Reset to the device is not necessarily due to an error, so print it as info instead of error. In addition, print the type of reset we are doing: - reset of the entire device (aka hard reset) - reset of the device after user have released it (less than hard reset) - lighter reset of an inference device (aka soft reset) Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	a00f1f571e	habanalabs: define soft-reset as inference op Soft-reset is the procedure where we reset only the compute/DMA engines of the device, without requiring the current user-space process to release the device. This type of reset can happen if TDR event occurred (a workload got stuck) or by a root request through sysfs. This is only relevant for inference ASICs, as there is no real-world use-case to do that in training, because training runs on multiple devices. In addition, we also do (in certain ASICs) a reset upon device release. That reset uses the same code as the soft-reset. Therefore, to better differentiate between the two resets, it is better to rename the soft-reset support as "inference soft-reset", to make the code more self-explanatory. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Yuri Nudelman	dd08335fb9	habanalabs: fix debugfs device memory MMU VA translation The translation in debugfs of device memory MMU virtual addresses was wrong as it did not take into consideration the fact that the page sizes there can be not power of 2. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Ofir Bitton	d62b9a6976	habanalabs: add support for a long interrupt target value In order to avoid user target value wraparound, we modify the current interface so user will be able to wait for an 8-byte target value rather than a 4-byte value. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Ofir Bitton	027d53b03c	habanalabs: remove redundant cs validity checks During TDR handling, we check multiple times if CS is valid. No need to perform this check as CS must be valid at all time during the TDR handling. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Rajaravi Krishna Katta	2b28485d0a	habanalabs: enable power info via HWMON framework Add support to retrieve following power info via HWMON: - instantaneous power value - highest value since last reset - reset the highest place holder Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Alon Mizrahi	2ee58fee3f	habanalabs: generalize COMMS message sending procedure Instead of having dedicated function per message that we want to send to the firmware in COMMS protocol, have a generic function that we can call to from other parts of the driver Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Rajaravi Krishna Katta	7457269136	habanalabs: create static map of f/w hwmon enums Instead of using the Linux kernel HWMON enums definition when communicating with the firmware, use proprietary HWMON based enums i.e. map hwmon.h header enum to cpucp_if.h based enum while. This is needed because the HWMON enums are not forcing backward compatibility and therefore changes can break compatibility between newer driver and older firmware. The driver will check for CPU_BOOT_DEV_STS0_MAP_HWMON_EN bit to validate if f/w supports cpucp->hwmon enum mapping to support older firmware where this mapping won't be available. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Ofir Bitton	4be9fb5303	habanalabs: add debugfs node for configuring CS timeout Command submission timeout is currently determined during driver loading time. As some environments requires this timeout to be modified in runtime, we introduce a new debugfs node that controls the timeout value without the need to reload the driver. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Oded Gabbay	511c1957de	habanalabs: add kernel-doc style comments Modify some comments in the uapi file to be in kernel-doc style. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>	2021-10-18 12:05:46 +03:00
Greg Kroah-Hartman	22d4f9beaf	Merge 5.15-rc6 into char-misc-next We need the char/misc fixes in here for merging and testing. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-10-18 09:29:27 +02:00
Linus Torvalds	519d81956e	Linux 5.15-rc6	2021-10-17 20:00:13 -10:00
Linus Torvalds	cd079b1f87	Merge tag 'libata-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata Pull libata fixes from Damien Le Moal: "Two fixes for this cycle: - Fix a null pointer dereference in ahci-platform driver (from Hai) - Fix uninitialized variables in pata_legacy driver (from Dan)" * tag 'libata-5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: ata: ahci_platform: fix null-ptr-deref in ahci_platform_enable_regulators() pata_legacy: fix a couple uninitialized variable bugs	2021-10-17 19:39:22 -10:00
Linus Torvalds	f2b3420b92	Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "Bigger than usual for this point in time, the majority is fixing some issues around BDI lifetimes with the move from the request_queue to the disk in this release. In detail: - Series on draining fs IO for del_gendisk() (Christoph) - NVMe pull request via Christoph: - fix the abort command id (Keith Busch) - nvme: fix per-namespace chardev deletion (Adam Manzanares) - brd locking scope fix (Tetsuo) - BFQ fix (Paolo)" * tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block: block, bfq: reset last_bfqq_created on group change block: warn when putting the final reference on a registered disk brd: reduce the brd_devices_mutex scope kyber: avoid q->disk dereferences in trace points block: keep q_usage_counter in atomic mode after del_gendisk block: drain file system I/O on del_gendisk block: split bio_queue_enter from blk_queue_enter block: factor out a blk_try_enter_queue helper block: call submit_bio_checks under q_usage_counter nvme: fix per-namespace chardev deletion block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs nvme-pci: Fix abort command id	2021-10-17 19:25:20 -10:00
Linus Torvalds	cc0af0a951	Merge tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block Pull io_uring fix from Jens Axboe: "Just a single fix for a wrong condition for grabbing a lock, a regression in this merge window" * tag 'io_uring-5.15-2021-10-17' of git://git.kernel.dk/linux-block: io_uring: fix wrong condition to grab uring lock	2021-10-17 19:20:13 -10:00
Linus Torvalds	3bb50f8530	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio fixes from Michael Tsirkin: "Fixes up some issues in rc5" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost-vdpa: Fix the wrong input in config_cb VDUSE: fix documentation underline warning Revert "virtio-blk: Add validation for block size in config space" vhost_vdpa: unset vq irq before freeing irq virtio: write back F_VERSION_1 before validate	2021-10-17 18:17:19 -10:00
Linus Torvalds	be9eb2f00f	Merge tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix a bug where guests on P9 with interrupts passed through could get stuck in synchronize_irq(). - Fix a bug in KVM on P8 where secondary threads entering a guest would write outside their allocated stack. - Fix a bug in KVM on P8 where secondary threads could confuse the host offline code and cause the guest or host to crash. Thanks to Cédric Le Goater. * tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest() powerpc/xive: Discard disabled interrupts in get_irqchip_state()	2021-10-17 18:01:32 -10:00
Linus Torvalds	6890acacde	Merge tag 'objtool_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Borislav Petkov: - Update section headers before the respective relocations to not trigger a safety check in elftoolchain's implementation of libelf - Do not add garbage data to the .rela.orc_unwind_ip section * tag 'objtool_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Update section header before relocations objtool: Check for gelf_update_rel[a] failures	2021-10-17 17:41:39 -10:00
Linus Torvalds	f644750ccc	Merge tag 'edac_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC fix from Borislav Petkov: - Log the "correct" uncorrectable error count in the armada_xp driver * tag 'edac_urgent_for_v5.15_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/armada-xp: Fix output of uncorrectable error counter	2021-10-17 17:36:39 -10:00

1 2 3 4 5 ...

1044475 Commits