ASoC: hdac_hda: fix race in device removal #1301

kv2019i · 2019-10-10T09:55:54Z

ASoC: hdac_hda: fix race in device removal

When ASoC card instance is removed containing a HDA codec,
hdac_hda_codec_remove() may run in parallel with codec resume.
This will cause problems if the HDA link is freed with
snd_hdac_ext_bus_link_put() while the codec is still in
middle of its resume process.

To fix this, change the order such that pm_runtime_disable()
is called before the link is freed. This will ensure any
pending runtime PM action is completed before proceeding
to free the link.

This issue can be easily hit with e.g. SOF driver by loading and
unloading the drivers.

plbossart

nit-pick on the comment.
Thanks for following up on this @kv2019i, much appreciated.

sound/soc/sof/core.c

plbossart · 2019-10-11T14:30:42Z

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

sound/soc/sof/core.c

lyakh · 2019-10-14T07:05:08Z

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

plbossart · 2019-10-14T20:49:53Z

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly.

lyakh · 2019-10-15T07:10:22Z

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly.

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me.

plbossart · 2019-10-15T14:15:29Z

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly.

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me.

it we allow a codec driver to runtime-resume on .remove, then we implicitly require all other drivers which don't need it to add a pm_runtime_disable().
If we do this at the SoundWire core level, then we simplify most codec drivers, but we also preclude the first scenario from happening.
I am having a hard time deciding if making things more complex for a hypothetical scenario is worth it.

lyakh · 2019-10-17T07:19:15Z

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me.

it we allow a codec driver to runtime-resume on .remove, then we implicitly require all other drivers which don't need it to add a pm_runtime_disable().
If we do this at the SoundWire core level, then we simplify most codec drivers, but we also preclude the first scenario from happening.
I am having a hard time deciding if making things more complex for a hypothetical scenario is worth it.

@plbossart aren't there such codec drivers now? Take hdac_hda.c - it calls snd_hdac_ext_bus_link_put() in its .remove() method, which accesses the hardware. hdac_hdmi_dev_remove() calls snd_hdac_display_power(), which then probably calls i915_audio_component_codec_wake_override(), which writes to the hardware too. But I admit, most codec drivers, that I looked at (and I've looked through a few for this discussion) don't access their hardware in .remove(). Some do that in their struct snd_soc_component_driver .remove() methods, but that would probably stay unaffected?

kv2019i · 2019-10-22T12:46:53Z

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal?
[...]
@plbossart aren't there such codec drivers now? Take hdac_hda.c - it calls snd_hdac_ext_bus_link_put() in its .remove() method, which accesses the hardware.

@plbossart and @lyakh I think this is a valid discussion, but I don't think this actually applies to the patch under review. In the scenario this patch addresses, the resume is triggered in by upper level logic (full parent-children tree is resumed for removal). As we have dependencies between children (e.g. HDA controller must be functional to communicate with the codec), it is not safe to call platform_device_unregister() with some children suspended or in middle of resume.

E.g. in this case the problem is failing resume() for a codec, because some of its dependencies (like the controller side) disappear mid-flight. Right?

I'll address Ranjani's comment and resubmit a version.

plbossart

Thanks @kv2019i, see comments below, a mix of nit-picking and complete confusion on my side...

sound/soc/sof/core.c

plbossart · 2019-10-25T16:00:18Z

sound/soc/sof/core.c

+	 * we may have races between child device resume flows
+	 * and platform device removal.
+	 */
+	device_for_each_child(sdev->dev, NULL, __force_rpm_active);


I am not fully clear on the race condition.

The commit message seems to mean that the topology unload will result in the DSP being turned off.
But we have another step below with an snd_sof_remove() which as I understand it turns the DSP off?

In other words, is the race between resuming children/removing machine device, or resuming children/calling snd_sof_remove()?

@plbossart Hmm, ok, I see what you mean. The race is between resuming children/removing machine device. But yeah, the commit message is confusing now. I used "DSP turned off" in commit message, as the topology unload will effectively do this (unloading the scheduler will actuall shut down core 0). So when snd_sof_remove() is finally called, DSP is no longer powered.

Makes sense?

@kv2019i I am not sure that on topology unload the DSP is actually completely powered-off. Maybe core0 is powered off but the rest of the controller isn't
@ranj063 would need to comment, if they get power in the East Bay?

@plbossart Whole ADSP is not powered off, but the cores are. But, but, it's true this shouldn't block codec drivers (or any other machine driver components). I debugged more and it seems this goes to:

[ 64.352561] RIP: 0010:hdac_hda_codec_remove+0x9/0x80 [snd_soc_hdac_hda]
....
[ 64.352571] soc_remove_link_components+0x6c/0xb0 [snd_soc_core]
[ 64.352573] snd_soc_unbind_card+0x9b/0xe0 [snd_soc_core]
[ 64.352575] snd_soc_unregister_card+0x1d/0x60 [snd_soc_core]
[ 64.352577] release_nodes+0x121/0x220
[ 64.352579] device_release_driver_internal+0xf0/0x1c0
[ 64.352580] bus_remove_device+0xd6/0x140
[ 64.352580] device_del+0x15f/0x380
[ 64.352582] platform_device_del.part.14+0xe/0x60
[ 64.352583] platform_device_unregister+0x17/0x30

... and there we go to codec driver remove. So maybe we need to fix this in the codec driver after all. I'll continue debug and report back.

@plbossart OK, I uploaded a new version. I'll continue to polish tomorrow but wanted to get CI results for the new version already today. It survives my CFL test bench and is a much simpler fix.

@lyakh I think you'll like this version better :)

that's a major simplification indeed... wow.

This looks good to go now. Thanks for the review comments @plbossart and @lyakh , good that the original version did not go in.

When ASoC card instance is removed containing a HDA codec, hdac_hda_codec_remove() may run in parallel with codec resume. This will cause problems if the HDA link is freed with snd_hdac_ext_bus_link_put() while the codec is still in middle of its resume process. To fix this, change the order such that pm_runtime_disable() is called before the link is freed. This will ensure any pending runtime PM action is completed before proceeding to free the link. This issue can be easily hit with e.g. SOF driver by loading and unloading the drivers. Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>

kv2019i · 2019-10-29T15:14:32Z

Travis build seems to fail to an unrelated issue in upstream:
https://travis-ci.org/thesofproject/linux/jobs/604052043?utm_medium=notification&utm_source=github_status

sound/soc/codecs/rt5677-spi.c: In function ‘rt5677_spi_pcm_close’:

sound/soc/codecs/rt5677-spi.c:114:30: error: unused variable ‘rtd’ [-Werror=unused-variable]

  struct snd_soc_pcm_runtime *rtd = substream->private_data;

plbossart

thanks @kv2019i

ranj063 · 2019-10-29T16:06:52Z

@plbossart I tried the script with this PR and it worked quite well. once keyon updted PR 1290 to allow edits, please do give it a try

ranj063 · 2019-10-29T16:10:51Z

@plbossart I tried the script with this PR and it worked quite well. once keyon updted PR 1290 to allow edits, please do give it a try

oops spoke too early. Its only added in the PR's branch but doesnt get added to the commit in sof-dev. I dont understand what I am missing between the force push and the merge

ranj063 · 2019-10-29T16:22:55Z

oops spoke too early. Its only added in the PR's branch but doesnt get added to the commit in sof-dev. I dont understand what I am missing between the force push and the merge

I think I might know why. I added a 5s delay between when we force push and merge the PR. I'll test my theory with keyon's PR

Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order and prevent the following deadlock from happening ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc3-build2+ thesofproject#1301 Tainted: G S W ------------------------------------------------------ cifsd/6055 is trying to acquire lock: ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200 but task is already holding lock: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&ret_buf->chan_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_setup_session+0x81/0x4b0 cifs_get_smb_ses+0x771/0x900 cifs_mount_get_session+0x7e/0x170 cifs_mount+0x92/0x2d0 cifs_smb3_do_mount+0x161/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&ret_buf->ses_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_match_super+0x101/0x320 sget+0xab/0x270 cifs_smb3_do_mount+0x1e0/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}: check_noncircular+0x95/0xc0 check_prev_add+0x115/0x2f0 validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_signal_cifsd_for_reconnect+0x134/0x200 __cifs_reconnect+0x8f/0x500 cifs_handle_standard+0x112/0x280 cifs_demultiplex_thread+0x64d/0xbc0 kthread+0x2f7/0x310 ret_from_fork+0x2a/0x230 ret_from_fork_asm+0x1a/0x30 other info that might help us debug this: Chain exists of: &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ret_buf->chan_lock); lock(&ret_buf->ses_lock); lock(&ret_buf->chan_lock); lock(&tcp_ses->srv_lock); *** DEADLOCK *** 3 locks held by cifsd/6055: #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: d7d7a66 ("cifs: avoid use of global locks for high contention data") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>

kv2019i requested review from dbaluta and plbossart as code owners October 10, 2019 09:55

kv2019i mentioned this pull request Oct 10, 2019

snd-sof-pci module unload gets stuck due to hda_jackpoll_work #1293

Closed

plbossart requested changes Oct 10, 2019

View reviewed changes

sound/soc/sof/core.c Outdated Show resolved Hide resolved

lyakh requested changes Oct 11, 2019

View reviewed changes

sound/soc/sof/core.c Outdated Show resolved Hide resolved

kv2019i force-pushed the topic/fix-dev-remove branch from 4ab26d7 to 2648c74 Compare October 11, 2019 13:25

kv2019i changed the title ~~ASoC: SOF: core: disable runtime pm for device removal~~ ASoC: SOF: core: fix race in device removal Oct 11, 2019

kv2019i requested review from ranj063, lyakh and plbossart October 11, 2019 14:19

ranj063 reviewed Oct 12, 2019

View reviewed changes

sound/soc/sof/core.c Outdated Show resolved Hide resolved

kv2019i force-pushed the topic/fix-dev-remove branch from 2648c74 to a22df3a Compare October 25, 2019 11:05

kv2019i requested a review from ranj063 October 25, 2019 11:08

ranj063 previously approved these changes Oct 25, 2019

View reviewed changes

plbossart reviewed Oct 25, 2019

View reviewed changes

kv2019i dismissed ranj063’s stale review via 47c428a October 28, 2019 18:23

kv2019i force-pushed the topic/fix-dev-remove branch from a22df3a to 47c428a Compare October 28, 2019 18:23

kv2019i requested a review from lgirdwood as a code owner October 28, 2019 18:23

kv2019i changed the title ~~ASoC: SOF: core: fix race in device removal~~ ASoC: hdac_hda: fix race in device removal Oct 28, 2019

kv2019i requested review from plbossart and ranj063 October 29, 2019 15:18

plbossart approved these changes Oct 29, 2019

View reviewed changes

ranj063 approved these changes Oct 29, 2019

View reviewed changes

ranj063 merged commit f5e70ce into thesofproject:topic/sof-dev Oct 29, 2019

ASoC: hdac_hda: fix race in device removal #1301

ASoC: hdac_hda: fix race in device removal #1301

Uh oh!

Conversation

kv2019i commented Oct 10, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

plbossart left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

plbossart commented Oct 11, 2019

Uh oh!

Uh oh!

lyakh commented Oct 14, 2019

Uh oh!

plbossart commented Oct 14, 2019

Uh oh!

lyakh commented Oct 15, 2019

Uh oh!

plbossart commented Oct 15, 2019

Uh oh!

lyakh commented Oct 17, 2019

Uh oh!

kv2019i commented Oct 22, 2019

Uh oh!

plbossart left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

plbossart Oct 25, 2019

Choose a reason for hiding this comment

Uh oh!

kv2019i Oct 28, 2019

Choose a reason for hiding this comment

Uh oh!

plbossart Oct 28, 2019

Choose a reason for hiding this comment

Uh oh!

kv2019i Oct 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kv2019i Oct 28, 2019

Choose a reason for hiding this comment

Uh oh!

plbossart Oct 28, 2019

Choose a reason for hiding this comment

Uh oh!

kv2019i Oct 29, 2019

Choose a reason for hiding this comment

Uh oh!

kv2019i commented Oct 29, 2019

Uh oh!

plbossart left a comment

Choose a reason for hiding this comment

Uh oh!

ranj063 commented Oct 29, 2019

Uh oh!

ranj063 commented Oct 29, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ranj063 commented Oct 29, 2019

Uh oh!

Uh oh!

kv2019i commented Oct 10, 2019 •

edited

Loading

kv2019i Oct 28, 2019 •

edited

Loading

ranj063 commented Oct 29, 2019 •

edited

Loading