-
Notifications
You must be signed in to change notification settings - Fork 136
ASoC: hdac_hda: fix race in device removal #1301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASoC: hdac_hda: fix race in device removal #1301
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit-pick on the comment.
Thanks for following up on this @kv2019i, much appreciated.
4ab26d7
to
2648c74
Compare
@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called. |
@plbossart wouldn't that be a bug in that driver? |
well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly. |
@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me. |
it we allow a codec driver to runtime-resume on .remove, then we implicitly require all other drivers which don't need it to add a pm_runtime_disable(). |
@plbossart aren't there such codec drivers now? Take hdac_hda.c - it calls snd_hdac_ext_bus_link_put() in its .remove() method, which accesses the hardware. hdac_hdmi_dev_remove() calls snd_hdac_display_power(), which then probably calls i915_audio_component_codec_wake_override(), which writes to the hardware too. But I admit, most codec drivers, that I looked at (and I've looked through a few for this discussion) don't access their hardware in .remove(). Some do that in their struct snd_soc_component_driver .remove() methods, but that would probably stay unaffected? |
@plbossart and @lyakh I think this is a valid discussion, but I don't think this actually applies to the patch under review. In the scenario this patch addresses, the resume is triggered in by upper level logic (full parent-children tree is resumed for removal). As we have dependencies between children (e.g. HDA controller must be functional to communicate with the codec), it is not safe to call platform_device_unregister() with some children suspended or in middle of resume. E.g. in this case the problem is failing resume() for a codec, because some of its dependencies (like the controller side) disappear mid-flight. Right? I'll address Ranjani's comment and resubmit a version. |
2648c74
to
a22df3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kv2019i, see comments below, a mix of nit-picking and complete confusion on my side...
sound/soc/sof/core.c
Outdated
* we may have races between child device resume flows | ||
* and platform device removal. | ||
*/ | ||
device_for_each_child(sdev->dev, NULL, __force_rpm_active); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not fully clear on the race condition.
The commit message seems to mean that the topology unload will result in the DSP being turned off.
But we have another step below with an snd_sof_remove() which as I understand it turns the DSP off?
In other words, is the race between resuming children/removing machine device, or resuming children/calling snd_sof_remove()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart Hmm, ok, I see what you mean. The race is between resuming children/removing machine device. But yeah, the commit message is confusing now. I used "DSP turned off" in commit message, as the topology unload will effectively do this (unloading the scheduler will actuall shut down core 0). So when snd_sof_remove() is finally called, DSP is no longer powered.
Makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart Whole ADSP is not powered off, but the cores are. But, but, it's true this shouldn't block codec drivers (or any other machine driver components). I debugged more and it seems this goes to:
[ 64.352561] RIP: 0010:hdac_hda_codec_remove+0x9/0x80 [snd_soc_hdac_hda]
....
[ 64.352571] soc_remove_link_components+0x6c/0xb0 [snd_soc_core]
[ 64.352573] snd_soc_unbind_card+0x9b/0xe0 [snd_soc_core]
[ 64.352575] snd_soc_unregister_card+0x1d/0x60 [snd_soc_core]
[ 64.352577] release_nodes+0x121/0x220
[ 64.352579] device_release_driver_internal+0xf0/0x1c0
[ 64.352580] bus_remove_device+0xd6/0x140
[ 64.352580] device_del+0x15f/0x380
[ 64.352582] platform_device_del.part.14+0xe/0x60
[ 64.352583] platform_device_unregister+0x17/0x30
... and there we go to codec driver remove. So maybe we need to fix this in the codec driver after all. I'll continue debug and report back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@plbossart OK, I uploaded a new version. I'll continue to polish tomorrow but wanted to get CI results for the new version already today. It survives my CFL test bench and is a much simpler fix.
@lyakh I think you'll like this version better :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a major simplification indeed... wow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to go now. Thanks for the review comments @plbossart and @lyakh , good that the original version did not go in.
When ASoC card instance is removed containing a HDA codec, hdac_hda_codec_remove() may run in parallel with codec resume. This will cause problems if the HDA link is freed with snd_hdac_ext_bus_link_put() while the codec is still in middle of its resume process. To fix this, change the order such that pm_runtime_disable() is called before the link is freed. This will ensure any pending runtime PM action is completed before proceeding to free the link. This issue can be easily hit with e.g. SOF driver by loading and unloading the drivers. Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
a22df3a
to
47c428a
Compare
Travis build seems to fail to an unrelated issue in upstream:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @kv2019i
@plbossart I tried the script with this PR and it worked quite well. once keyon updted PR 1290 to allow edits, please do give it a try |
oops spoke too early. Its only added in the PR's branch but doesnt get added to the commit in sof-dev. I dont understand what I am missing between the force push and the merge |
I think I might know why. I added a 5s delay between when we force push and merge the PR. I'll test my theory with keyon's PR |
Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order and prevent the following deadlock from happening ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc3-build2+ thesofproject#1301 Tainted: G S W ------------------------------------------------------ cifsd/6055 is trying to acquire lock: ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200 but task is already holding lock: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&ret_buf->chan_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_setup_session+0x81/0x4b0 cifs_get_smb_ses+0x771/0x900 cifs_mount_get_session+0x7e/0x170 cifs_mount+0x92/0x2d0 cifs_smb3_do_mount+0x161/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&ret_buf->ses_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_match_super+0x101/0x320 sget+0xab/0x270 cifs_smb3_do_mount+0x1e0/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}: check_noncircular+0x95/0xc0 check_prev_add+0x115/0x2f0 validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_signal_cifsd_for_reconnect+0x134/0x200 __cifs_reconnect+0x8f/0x500 cifs_handle_standard+0x112/0x280 cifs_demultiplex_thread+0x64d/0xbc0 kthread+0x2f7/0x310 ret_from_fork+0x2a/0x230 ret_from_fork_asm+0x1a/0x30 other info that might help us debug this: Chain exists of: &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ret_buf->chan_lock); lock(&ret_buf->ses_lock); lock(&ret_buf->chan_lock); lock(&tcp_ses->srv_lock); *** DEADLOCK *** 3 locks held by cifsd/6055: #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: d7d7a66 ("cifs: avoid use of global locks for high contention data") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Fixes: #1293
ASoC: hdac_hda: fix race in device removal
When ASoC card instance is removed containing a HDA codec,
hdac_hda_codec_remove() may run in parallel with codec resume.
This will cause problems if the HDA link is freed with
snd_hdac_ext_bus_link_put() while the codec is still in
middle of its resume process.
To fix this, change the order such that pm_runtime_disable()
is called before the link is freed. This will ensure any
pending runtime PM action is completed before proceeding
to free the link.
This issue can be easily hit with e.g. SOF driver by loading and
unloading the drivers.