136 Commits

Author SHA1 Message Date
Sathishkumar S
de258d06fd drm/amdgpu: Add amdgpu_vcn_sched_mask debugfs
Add debugfs entry to enable or disable job submission to
specific vcn instances. The entry is created only when
there is more than an instance and is unified queue type.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10 10:26:50 -05:00
Alex Deucher
f5d873f582 drm/amdgpu: add missing size check in amdgpu_debugfs_gprwave_read()
Avoid a possible buffer overflow if size is larger than 4K.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-05 10:35:59 -05:00
Alex Deucher
7ba9395430 drm/amdgpu: Adjust debugfs eviction and IB access permissions
Users should not be able to run these.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-05 10:35:56 -05:00
Alex Deucher
c0cfd2e652 drm/amdgpu: Adjust debugfs register access permissions
Regular users shouldn't have read access.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-05 10:35:49 -05:00
Jesse Zhang
d2e3961ae3 drm/amdgpu: add amdgpu_sdma_sched_mask debugfs
Userspace wants to run jobs on a specific sdma ring for verification purposes.
This debugfs entry helps to disable or enable submitting jobs to a specific ring.
This entry is populated only if there are at least two or more cores in the sdma ip.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tim Huang <tim.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-04 12:06:15 -05:00
Jesse Zhang
c5c63d9cb5 drm/amdgpu: add amdgpu_gfx_sched_mask and amdgpu_compute_sched_mask debugfs
compute/gfx may have multiple rings on some hardware.
In some cases, userspace wants to run jobs on a specific ring for validation purposes.
This debugfs entry helps to disable or enable submitting jobs to a specific ring.
This entry is populated only if there are at least two or more cores in the gfx/compute ip.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tim Huang <tim.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-04 12:05:53 -05:00
Sathishkumar S
f0b19b84d3 drm/amdgpu: add amdgpu_jpeg_sched_mask debugfs
JPEG_4_0_3 has up to 32 jpeg cores and a single mjpeg video decode
will use all available cores on the hardware. This debugfs entry
helps to disable or enable job submission to a cluster of cores or
one specific core in the ip for debugging. The entry is populated
only if there is at least two or more cores in the jpeg ip.

Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-26 17:06:17 -04:00
Sunil Khatri
17277da266 drm/amdgpu: Remove debugfs amdgpu_reset_dump_register_list
There are some problem with existing amdgpu_reset_dump_register_list
debugfs node. It is supposed to read a list of registers but there
could be cases when the IP is not in correct power state. Register
read in such cases could lead to more problems.

We are taking care of all such power states in devcoredump and
dumping the registers of need for debugging. So cleaning this code
and we dont need this functionality via debugfs anymore.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06 10:42:09 -04:00
Jesse Zhang
806e8c5579 drm/amdgpu: fix invadate operation for pg_flags
Since the type of pg_flags is u32, adev->pg_flags >> 16 >> 16 is 0
regardless of the values of its operands.

So removing the operations upper_32_bits and lower_32_bits.

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Suggested-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23 15:09:50 -04:00
Saleemkhan Jamadar
98a2e3a0d1 drm/amdgpu/umsch: add support to capture fw debug log
Added support to capture unsch fw debug logs in debugfs.
To enable set amdgpu_umschfw_log =1 in boot args.

v1 - rename variable to umsch_mm_fwlog (Veera)

Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Veerabadhran Gopalakrishnan <Veerabadhran.Gopalakrishnan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13 15:45:41 -04:00
Jesse Zhang
ea686fef54 drm/amdgpu: fix the warning about the expression (int)size - len
Converting size from size_t to int may overflow.
v2: keep reverse xmas tree order (Christian)

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:45 -04:00
Shashank Sharma
b8f67b9ddf drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
  reference counted.
- introducing two new helper funcs for task_info lifecycle management
    - amdgpu_vm_get_task_info: reference counts up task_info before
      returning this info
    - amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)

V3: Fixed review comments from Felix
   - Fix wrong indentation
   - No debug message for -ENOMEM
   - Add NULL check for task_info
   - Do not duplicate the debug messages (ti vs no ti)
   - Get first reference of task_info in vm_init(), put last
     in vm_fini()

V4: Fixed review comments from Felix
   - fix double reference increment in create_task_info
   - change amdgpu_vm_get_task_info_pasid
   - additional changes in amdgpu_gem.c while porting

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04 15:59:08 -05:00
Ma Jun
9749c86843 drm/amdgpu: Fix the warning info in mode1 reset
Fix the warning info below during mode1 reset.
[  +0.000004] Call Trace:
[  +0.000004]  <TASK>
[  +0.000006]  ? show_regs+0x6e/0x80
[  +0.000011]  ? __flush_work.isra.0+0x2e8/0x390
[  +0.000005]  ? __warn+0x91/0x150
[  +0.000009]  ? __flush_work.isra.0+0x2e8/0x390
[  +0.000006]  ? report_bug+0x19d/0x1b0
[  +0.000013]  ? handle_bug+0x46/0x80
[  +0.000012]  ? exc_invalid_op+0x1d/0x80
[  +0.000011]  ? asm_exc_invalid_op+0x1f/0x30
[  +0.000014]  ? __flush_work.isra.0+0x2e8/0x390
[  +0.000007]  ? __flush_work.isra.0+0x208/0x390
[  +0.000007]  ? _prb_read_valid+0x216/0x290
[  +0.000008]  __cancel_work_timer+0x11d/0x1a0
[  +0.000007]  ? try_to_grab_pending+0xe8/0x190
[  +0.000012]  cancel_work_sync+0x14/0x20
[  +0.000008]  amddrm_sched_stop+0x3c/0x1d0 [amd_sched]
[  +0.000032]  amdgpu_device_gpu_recover+0x29a/0xe90 [amdgpu]

This warning info was printed after applying the patch
"drm/sched: Convert drm scheduler to use a work queue rather than kthread".
The root cause is that amdgpu driver tries to use the uninitialized
work_struct in the struct drm_gpu_scheduler

v2:
 - Rename the function to amdgpu_ring_sched_ready and move it to
amdgpu_ring.c (Alex)
v3:
- Fix a few more checks based on Vitaly's patch (Alex)
v4:
- squash in fix noticed by Bert in
https://gitlab.freedesktop.org/drm/amd/-/issues/3139

Fixes: 11b3b9f461 ("drm/sched: Check scheduler ready before calling timeout handling")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 09:40:42 -05:00
Mangesh Gadre
5eb8094a9b drm/amdgpu: Add register read/write debugfs support for AID's
SMN address is larger than 32 bits for registers on different AID's
Updating existing interface to support access to such registers.

Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 10:30:19 -05:00
Alex Deucher
afe58346d5 drm/amdgpu/debugfs: fix error code when smc register accessors are NULL
Should be -EOPNOTSUPP.

Fixes: 5104fdf50d ("drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-14 15:27:42 -05:00
shaoyunl
b2662d4cc4 drm/amdgpu: SW part of MES event log enablement
This is the generic SW part, prepare the event log buffer and dump it through debugfs

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-07 17:43:13 -05:00
Dave Airlie
5edfd7d94b Merge tag 'amd-drm-next-6.8-2023-12-01' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.8-2023-12-01:

amdgpu:
- Add new 64 bit sequence number infrastructure.
  This will ultimately be used for user queue synchronization.
- GPUVM updates
- Misc code cleanups
- RAS updates
- DCN 3.5 updates
- Rework PCIe link speed handling
- Document GPU reset types
- DMUB fixes
- eDP fixes
- NBIO 7.9 updates
- NBIO 7.11 updates
- SubVP updates
- DCN 3.1.4 fixes
- ABM fixes
- AGP aperture fix
- DCN 3.1.5 fix
- Fix some potential error path memory leaks
- Enable PCIe PMEs
- Add XGMI, PCIe state dumping for aqua vanjaram
- GFX11 golden register updates
- Misc display fixes

amdkfd:
- Migrate TLB flushing logic to amdgpu
- Trap handler fixes
- Fix restore workers handling on suspend and reset
- Fix possible memory leak in pqm_uninit()

radeon:
- Fix some possible overflows in command buffer checking
- Check for errors in ring_lock

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231201181743.5313-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2023-12-05 12:11:41 +10:00
Lu Yao
7b194fdccb drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
For 'AMDGPU_FAMILY_SI' family cards, in 'si_common_early_init' func, init
'didt_rreg' and 'didt_wreg' to 'NULL'. But in func
'amdgpu_debugfs_regs_didt_read/write', using 'RREG32_DIDT' 'WREG32_DIDT'
lacks of relevant judgment. And other 'amdgpu_ip_block_version' that use
these two definitions won't be added for 'AMDGPU_FAMILY_SI'.

So, add null pointer judgment before calling.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:49:23 -05:00
Maxime Ripard
3bf3e21c15 Merge drm/drm-next into drm-misc-next
Let's kickstart the v6.8 release cycle.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2023-11-15 10:56:44 +01:00
Matthew Brost
35963cf2cd drm/sched: Add drm_sched_wqueue_* helpers
Add scheduler wqueue ready, stop, and start helpers to hide the
implementation details of the scheduler from the drivers.

v2:
  - s/sched_wqueue/sched_wqueue (Luben)
  - Remove the extra white line after the return-statement (Luben)
  - update drm_sched_wqueue_ready comment (Luben)

Cc: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20231031032439.1558703-2-matthew.brost@intel.com
Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
2023-11-01 17:29:20 -04:00
Qu Huang
5104fdf50d drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL
In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file could result in an abnormal null pointer access when the smc_rreg pointer is NULL. Below are the steps to reproduce this issue and the corresponding exception log:

1. Navigate to the directory: /sys/kernel/debug/dri/0
2. Execute command: cat amdgpu_regs_smc
3. Exception Log::
[4005007.702554] BUG: kernel NULL pointer dereference, address: 0000000000000000
[4005007.702562] #PF: supervisor instruction fetch in kernel mode
[4005007.702567] #PF: error_code(0x0010) - not-present page
[4005007.702570] PGD 0 P4D 0
[4005007.702576] Oops: 0010 [#1] SMP NOPTI
[4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G           OE     5.15.0-43-generic #46-Ubunt       u
[4005007.702590] RIP: 0010:0x0
[4005007.702598] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[4005007.702600] RSP: 0018:ffffa82b46d27da0 EFLAGS: 00010206
[4005007.702605] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffa82b46d27e68
[4005007.702609] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9940656e0000
[4005007.702612] RBP: ffffa82b46d27dd8 R08: 0000000000000000 R09: ffff994060c07980
[4005007.702615] R10: 0000000000020000 R11: 0000000000000000 R12: 00007f5e06753000
[4005007.702618] R13: ffff9940656e0000 R14: ffffa82b46d27e68 R15: 00007f5e06753000
[4005007.702622] FS:  00007f5e0755b740(0000) GS:ffff99479d300000(0000) knlGS:0000000000000000
[4005007.702626] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4005007.702629] CR2: ffffffffffffffd6 CR3: 00000003253fc000 CR4: 00000000003506e0
[4005007.702633] Call Trace:
[4005007.702636]  <TASK>
[4005007.702640]  amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu]
[4005007.703002]  full_proxy_read+0x5c/0x80
[4005007.703011]  vfs_read+0x9f/0x1a0
[4005007.703019]  ksys_read+0x67/0xe0
[4005007.703023]  __x64_sys_read+0x19/0x20
[4005007.703028]  do_syscall_64+0x5c/0xc0
[4005007.703034]  ? do_user_addr_fault+0x1e3/0x670
[4005007.703040]  ? exit_to_user_mode_prepare+0x37/0xb0
[4005007.703047]  ? irqentry_exit_to_user_mode+0x9/0x20
[4005007.703052]  ? irqentry_exit+0x19/0x30
[4005007.703057]  ? exc_page_fault+0x89/0x160
[4005007.703062]  ? asm_exc_page_fault+0x8/0x30
[4005007.703068]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[4005007.703075] RIP: 0033:0x7f5e07672992
[4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 00 0f 1f 44 00 00 f3 0f        1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 e       c 28 48 89 54 24
[4005007.703083] RSP: 002b:00007ffe03097898 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[4005007.703088] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f5e07672992
[4005007.703091] RDX: 0000000000020000 RSI: 00007f5e06753000 RDI: 0000000000000003
[4005007.703094] RBP: 00007f5e06753000 R08: 00007f5e06752010 R09: 00007f5e06752010
[4005007.703096] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000
[4005007.703099] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[4005007.703105]  </TASK>
[4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash af_alg binfmt_misc nls_       iso8859_1 ipmi_ssif ast intel_rapl_msr intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac t       tm edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl sch_fq_codel ipmi_devintf ipm       i_msghandler msr parport_pc ppdev lp parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solo       mon ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) iommu_v       2 amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core        drm igb ahci xhci_pci libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca
[4005007.703184] CR2: 0000000000000000
[4005007.703188] ---[ end trace ac65a538d240da39 ]---
[4005007.800865] RIP: 0010:0x0
[4005007.800871] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[4005007.800874] RSP: 0018:ffffa82b46d27da0 EFLAGS: 00010206
[4005007.800878] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffa82b46d27e68
[4005007.800881] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9940656e0000
[4005007.800883] RBP: ffffa82b46d27dd8 R08: 0000000000000000 R09: ffff994060c07980
[4005007.800886] R10: 0000000000020000 R11: 0000000000000000 R12: 00007f5e06753000
[4005007.800888] R13: ffff9940656e0000 R14: ffffa82b46d27e68 R15: 00007f5e06753000
[4005007.800891] FS:  00007f5e0755b740(0000) GS:ffff99479d300000(0000) knlGS:0000000000000000
[4005007.800895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[4005007.800898] CR2: ffffffffffffffd6 CR3: 00000003253fc000 CR4: 00000000003506e0

Signed-off-by: Qu Huang <qu.huang@linux.dev>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:22 -04:00
André Almeida
2d6a2a28cd drm/amdgpu: Encapsulate all device reset info
To better organize struct amdgpu_device, keep all reset information
related fields together in a separated struct.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Reviewed-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Praful Swarnakar
ad19c200b1 drm/amdgpu: Fix style issues in amdgpu_debugfs.c
Fixes the following to align to linux coding style:

WARNING: Missing a blank line after declarations
WARNING: sizeof *rd should be sizeof(*rd)

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Praful Swarnakar <Praful.Swarnakar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:48 -04:00
Victor Lu
8ed49dd1d3 drm/amdgpu: Add RLCG interface driver implementation for gfx v9.4.3 (v3)
Add RLCG interface support for gfx v9.4.3 and multiple XCCs.
Do not enable it yet.

v2: Fix amdgpu_rlcg_reg_access_ctrl init, add support for multiple XCCs
    in amdgpu_mm_wreg_mmio_rlc

v3: Use GET_INST() when indexing amdgpu_rlcg_reg_access_ctrl

Signed-off-by: Victor Lu <victorchengchi.lu@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:16:41 -04:00
Dan Carpenter
9c9d501b28 drm/amd/amdgpu: Fix up locking etc in amdgpu_debugfs_gprwave_ioctl()
There are two bugs here.
1) Drop the lock if copy_from_user() fails.
2) If the copy fails then the correct error code is -EFAULT instead of
   -EINVAL.

I also broke up the long line and changed "sizeof rd->id" to
"sizeof(rd->id)".

Fixes: 553f973a0d ("drm/amd/amdgpu: Update debugfs for XCC support (v3)")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 11:10:04 -04:00