Skip to content

Conversation

kv2019i
Copy link
Collaborator

@kv2019i kv2019i commented Oct 10, 2019

Fixes: #1293

ASoC: hdac_hda: fix race in device removal

When ASoC card instance is removed containing a HDA codec,
hdac_hda_codec_remove() may run in parallel with codec resume.
This will cause problems if the HDA link is freed with
snd_hdac_ext_bus_link_put() while the codec is still in
middle of its resume process.

To fix this, change the order such that pm_runtime_disable()
is called before the link is freed. This will ensure any
pending runtime PM action is completed before proceeding
to free the link.

This issue can be easily hit with e.g. SOF driver by loading and
unloading the drivers.

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit-pick on the comment.
Thanks for following up on this @kv2019i, much appreciated.

@kv2019i kv2019i force-pushed the topic/fix-dev-remove branch from 4ab26d7 to 2648c74 Compare October 11, 2019 13:25
@kv2019i kv2019i changed the title ASoC: SOF: core: disable runtime pm for device removal ASoC: SOF: core: fix race in device removal Oct 11, 2019
@plbossart
Copy link
Member

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@lyakh
Copy link
Collaborator

lyakh commented Oct 14, 2019

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

@plbossart
Copy link
Member

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly.

@lyakh
Copy link
Collaborator

lyakh commented Oct 15, 2019

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly.

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me.

@plbossart
Copy link
Member

@lyakh the problem is that there doesn't seem to be any sort of time constraint between .remove and .resume, so it's possible that the child invoked its resume after its own remove was called.

@plbossart wouldn't that be a bug in that driver?

well, if you require the .remove to disable pm_runtime, that would work I guess. it's not clear to me there is any benefit in asking every codec driver to do the same thing, disable pm_runtime as the first action during its .remove operation, or let the framework do it directly.

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me.

it we allow a codec driver to runtime-resume on .remove, then we implicitly require all other drivers which don't need it to add a pm_runtime_disable().
If we do this at the SoundWire core level, then we simplify most codec drivers, but we also preclude the first scenario from happening.
I am having a hard time deciding if making things more complex for a hypothetical scenario is worth it.

@lyakh
Copy link
Collaborator

lyakh commented Oct 17, 2019

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal? For that it might need to runtime-resume the device. Whereas other drivers might only need to perform software clean up and don't need to access the hardware at all, both should be possible, it seems to me.

it we allow a codec driver to runtime-resume on .remove, then we implicitly require all other drivers which don't need it to add a pm_runtime_disable().
If we do this at the SoundWire core level, then we simplify most codec drivers, but we also preclude the first scenario from happening.
I am having a hard time deciding if making things more complex for a hypothetical scenario is worth it.

@plbossart aren't there such codec drivers now? Take hdac_hda.c - it calls snd_hdac_ext_bus_link_put() in its .remove() method, which accesses the hardware. hdac_hdmi_dev_remove() calls snd_hdac_display_power(), which then probably calls i915_audio_component_codec_wake_override(), which writes to the hardware too. But I admit, most codec drivers, that I looked at (and I've looked through a few for this discussion) don't access their hardware in .remove(). Some do that in their struct snd_soc_component_driver .remove() methods, but that would probably stay unaffected?

@kv2019i
Copy link
Collaborator Author

kv2019i commented Oct 22, 2019

@plbossart isn't it possible, that an audio (codec) driver needs to access the hardware during driver removal?
[...]
@plbossart aren't there such codec drivers now? Take hdac_hda.c - it calls snd_hdac_ext_bus_link_put() in its .remove() method, which accesses the hardware.

@plbossart and @lyakh I think this is a valid discussion, but I don't think this actually applies to the patch under review. In the scenario this patch addresses, the resume is triggered in by upper level logic (full parent-children tree is resumed for removal). As we have dependencies between children (e.g. HDA controller must be functional to communicate with the codec), it is not safe to call platform_device_unregister() with some children suspended or in middle of resume.

E.g. in this case the problem is failing resume() for a codec, because some of its dependencies (like the controller side) disappear mid-flight. Right?

I'll address Ranjani's comment and resubmit a version.

@kv2019i kv2019i force-pushed the topic/fix-dev-remove branch from 2648c74 to a22df3a Compare October 25, 2019 11:05
@kv2019i kv2019i requested a review from ranj063 October 25, 2019 11:08
ranj063
ranj063 previously approved these changes Oct 25, 2019
Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kv2019i, see comments below, a mix of nit-picking and complete confusion on my side...

* we may have races between child device resume flows
* and platform device removal.
*/
device_for_each_child(sdev->dev, NULL, __force_rpm_active);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not fully clear on the race condition.

The commit message seems to mean that the topology unload will result in the DSP being turned off.
But we have another step below with an snd_sof_remove() which as I understand it turns the DSP off?

In other words, is the race between resuming children/removing machine device, or resuming children/calling snd_sof_remove()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart Hmm, ok, I see what you mean. The race is between resuming children/removing machine device. But yeah, the commit message is confusing now. I used "DSP turned off" in commit message, as the topology unload will effectively do this (unloading the scheduler will actuall shut down core 0). So when snd_sof_remove() is finally called, DSP is no longer powered.

Makes sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i I am not sure that on topology unload the DSP is actually completely powered-off. Maybe core0 is powered off but the rest of the controller isn't
@ranj063 would need to comment, if they get power in the East Bay?

Copy link
Collaborator Author

@kv2019i kv2019i Oct 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart Whole ADSP is not powered off, but the cores are. But, but, it's true this shouldn't block codec drivers (or any other machine driver components). I debugged more and it seems this goes to:

[ 64.352561] RIP: 0010:hdac_hda_codec_remove+0x9/0x80 [snd_soc_hdac_hda]
....
[ 64.352571] soc_remove_link_components+0x6c/0xb0 [snd_soc_core]
[ 64.352573] snd_soc_unbind_card+0x9b/0xe0 [snd_soc_core]
[ 64.352575] snd_soc_unregister_card+0x1d/0x60 [snd_soc_core]
[ 64.352577] release_nodes+0x121/0x220
[ 64.352579] device_release_driver_internal+0xf0/0x1c0
[ 64.352580] bus_remove_device+0xd6/0x140
[ 64.352580] device_del+0x15f/0x380
[ 64.352582] platform_device_del.part.14+0xe/0x60
[ 64.352583] platform_device_unregister+0x17/0x30

... and there we go to codec driver remove. So maybe we need to fix this in the codec driver after all. I'll continue debug and report back.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plbossart OK, I uploaded a new version. I'll continue to polish tomorrow but wanted to get CI results for the new version already today. It survives my CFL test bench and is a much simpler fix.

@lyakh I think you'll like this version better :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a major simplification indeed... wow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to go now. Thanks for the review comments @plbossart and @lyakh , good that the original version did not go in.

When ASoC card instance is removed containing a HDA codec,
hdac_hda_codec_remove() may run in parallel with codec resume.
This will cause problems if the HDA link is freed with
snd_hdac_ext_bus_link_put() while the codec is still in
middle of its resume process.

To fix this, change the order such that pm_runtime_disable()
is called before the link is freed. This will ensure any
pending runtime PM action is completed before proceeding
to free the link.

This issue can be easily hit with e.g. SOF driver by loading and
unloading the drivers.

Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i kv2019i force-pushed the topic/fix-dev-remove branch from a22df3a to 47c428a Compare October 28, 2019 18:23
@kv2019i kv2019i requested a review from lgirdwood as a code owner October 28, 2019 18:23
@kv2019i kv2019i changed the title ASoC: SOF: core: fix race in device removal ASoC: hdac_hda: fix race in device removal Oct 28, 2019
@kv2019i
Copy link
Collaborator Author

kv2019i commented Oct 29, 2019

Travis build seems to fail to an unrelated issue in upstream:
https://travis-ci.org/thesofproject/linux/jobs/604052043?utm_medium=notification&utm_source=github_status

sound/soc/codecs/rt5677-spi.c: In function ‘rt5677_spi_pcm_close’:

sound/soc/codecs/rt5677-spi.c:114:30: error: unused variable ‘rtd’ [-Werror=unused-variable]

  struct snd_soc_pcm_runtime *rtd = substream->private_data;

@kv2019i kv2019i requested review from plbossart and ranj063 October 29, 2019 15:18
Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @kv2019i

@ranj063 ranj063 merged commit f5e70ce into thesofproject:topic/sof-dev Oct 29, 2019
@ranj063
Copy link
Collaborator

ranj063 commented Oct 29, 2019

@plbossart I tried the script with this PR and it worked quite well. once keyon updted PR 1290 to allow edits, please do give it a try

@ranj063
Copy link
Collaborator

ranj063 commented Oct 29, 2019

@plbossart I tried the script with this PR and it worked quite well. once keyon updted PR 1290 to allow edits, please do give it a try

oops spoke too early. Its only added in the PR's branch but doesnt get added to the commit in sof-dev. I dont understand what I am missing between the force push and the merge

@ranj063
Copy link
Collaborator

ranj063 commented Oct 29, 2019

oops spoke too early. Its only added in the PR's branch but doesnt get added to the commit in sof-dev. I dont understand what I am missing between the force push and the merge

I think I might know why. I added a 5s delay between when we force push and merge the PR. I'll test my theory with keyon's PR

bardliao pushed a commit to bardliao/linux that referenced this pull request Jul 8, 2025
Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order
and prevent the following deadlock from happening

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc3-build2+ thesofproject#1301 Tainted: G S      W
------------------------------------------------------
cifsd/6055 is trying to acquire lock:
ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200

but task is already holding lock:
ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ret_buf->chan_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_setup_session+0x81/0x4b0
       cifs_get_smb_ses+0x771/0x900
       cifs_mount_get_session+0x7e/0x170
       cifs_mount+0x92/0x2d0
       cifs_smb3_do_mount+0x161/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&ret_buf->ses_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_match_super+0x101/0x320
       sget+0xab/0x270
       cifs_smb3_do_mount+0x1e0/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}:
       check_noncircular+0x95/0xc0
       check_prev_add+0x115/0x2f0
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_signal_cifsd_for_reconnect+0x134/0x200
       __cifs_reconnect+0x8f/0x500
       cifs_handle_standard+0x112/0x280
       cifs_demultiplex_thread+0x64d/0xbc0
       kthread+0x2f7/0x310
       ret_from_fork+0x2a/0x230
       ret_from_fork_asm+0x1a/0x30

other info that might help us debug this:

Chain exists of:
  &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ret_buf->chan_lock);
                               lock(&ret_buf->ses_lock);
                               lock(&ret_buf->chan_lock);
  lock(&tcp_ses->srv_lock);

 *** DEADLOCK ***

3 locks held by cifsd/6055:
 #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200
 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200
 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

Cc: linux-cifs@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Fixes: d7d7a66 ("cifs: avoid use of global locks for high contention data")
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

snd-sof-pci module unload gets stuck due to hda_jackpoll_work
4 participants