forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 5
[pull] master from torvalds:master #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code. Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Link: https://lore.kernel.org/r/20181201155838.8619-1-tiny.windzz@gmail.com Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Should be 'a' rather than 'an'. Signed-off-by: WANG Wenhu <wenhu.wang@vivo.com> Link: https://lore.kernel.org/r/20200313165049.62907-1-wenhu.wang@vivo.com Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
If ida_simple_get() returns an error when called in rproc_alloc(), put_device() is called to clean things up. By this time the rproc device type has been assigned, with rproc_type_release() as the release function. The first thing rproc_type_release() does is call: idr_destroy(&rproc->notifyids); But at the time the ida_simple_get() call is made, the notifyids field in the remoteproc structure has not been initialized. I'm not actually sure this case causes an observable problem, but it's incorrect. Fix this by initializing the notifyids field before calling ida_simple_get() in rproc_alloc(). Fixes: b5ab5e2 ("remoteproc: maintain a generic child device for each rproc") Signed-off-by: Alex Elder <elder@linaro.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Suman Anna <s-anna@ti.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20200415204858.2448-2-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Make the firmware name allocation a function on its own in an effort to cleanup function rproc_alloc(). Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200415204858.2448-3-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
In an effort to cleanup firmware name allocation, replace the cumbersome mechanic used to allocate a default firmware name with function kasprintf(). Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Suggested-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200415204858.2448-4-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
This function allows drivers to correctly setup the coredump output elf information. Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Clement Leger <cleger@kalray.eu> Link: https://lore.kernel.org/r/20200410102433.2672-2-cleger@kalray.eu Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Modify drivers which are using remoteproc coredump functionality to use rproc_coredump_set_elf_info in order to create correct elf coredump format. Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Clement Leger <cleger@kalray.eu> Link: https://lore.kernel.org/r/20200410102433.2672-3-cleger@kalray.eu Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Current implementation of the sysmon driver does not support adding notifications for other remoteproc events - prepare, start, unprepare. Clients on the remoteproc side might be interested in knowing when a remoteproc boots up. This change adds the ability to send the notification type along with the name. For example, audio DSP is interested in knowing when modem has crashed so that it can perform cleanup and wait for modem to boot up before it starts processing data again. Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Siddharth Gupta <sidgup@codeaurora.org> Link: https://lore.kernel.org/r/1586389003-26675-2-git-send-email-sidgup@codeaurora.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Add notification for other stages of remoteproc boot and shutdown. This includes adding callback functions for the prepare and unprepare events, and fleshing out the callback function for start. Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Siddharth Gupta <sidgup@codeaurora.org> Link: https://lore.kernel.org/r/1586389003-26675-3-git-send-email-sidgup@codeaurora.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Clients/services running on a remoteproc that booted up might need to be aware of the state of already running remoteprocs. When a remoteproc boots up (fresh or after recovery) it is not aware of the remoteprocs that booted before it, i.e., the system state is incomplete. So to keep track of it we send sysmon on behalf of all 'ONLINE' remoteprocs. Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Siddharth Gupta <sidgup@codeaurora.org> Link: https://lore.kernel.org/r/1586389003-26675-4-git-send-email-sidgup@codeaurora.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
For cases where @firmware is declared "const char *", use function kstrdup_const() to avoid needlessly creating another copy on the heap. Suggested-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20200420231601.16781-2-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Improve the readability of function rproc_alloc_firmware() by using a non-negated condition and moving the comment out of the conditional block Suggested-by: Alex Elder <elder@linaro.org> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20200420231601.16781-3-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Make the rproc_ops allocation a function on its own in an effort to clean up function rproc_alloc(). Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Link: https://lore.kernel.org/r/20200420231601.16781-4-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Get rid of tedious error management by moving firmware and operation allocation after calling device_initialize(). That way we take advantage of the automatic call to rproc_type_release() to cleanup after ourselves when put_device() is called. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Suman Anna <s-anna@ti.com> Link: https://lore.kernel.org/r/20200420231601.16781-5-mathieu.poirier@linaro.org Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
The current name field used in the remoteproc structure is simply a pointer to a name field supplied during the rproc_alloc() call. The pointer passed in by remoteproc drivers during registration is typically a dev_name pointer, but it is possible that the pointer will no longer remain valid if the devices themselves were created at runtime like in the case of of_platform_populate(), and were deleted upon any failures within the respective remoteproc driver probe function. So, allocate and maintain a local copy for this name field to keep it agnostic of the logic used in the remoteproc drivers. Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Suman Anna <s-anna@ti.com> Link: https://lore.kernel.org/r/20200417002036.24359-3-s-anna@ti.com Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Add API functions devm_rproc_alloc() and devm_rproc_add(), which behave like rproc_alloc() and rproc_add() respectively, but register their respective cleanup function to be called on driver detach. Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Paul Cercueil <paul@crapouillou.net> Link: https://lore.kernel.org/r/20200417170040.174319-2-paul@crapouillou.net Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Since checks are present in the remoteproc elf loader before calling da_to_va, loading a elf64 will work on 32bits flavors of kernel. Indeed, if a segment size is larger than what size_t can hold, the loader will return an error so the functionality is equivalent to what exists today. Acked-by: Suman Anna <s-anna@ti.com> Signed-off-by: Clement Leger <cleger@kalray.eu> Link: https://lore.kernel.org/r/20200422093017.10985-1-cleger@kalray.eu Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
On some SoC architecture, it is needed to enable HW like clock, bus, regulator, memory region... before loading co-processor firmware. This patch introduces prepare and unprepare ops to execute platform specific function before firmware loading and after stop execution. Signed-off-by: Loic Pallardy <loic.pallardy@st.com> Signed-off-by: Suman Anna <s-anna@ti.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/20200417002036.24359-2-s-anna@ti.com Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Message logged by 'dev_xxx()' or 'pr_xxx()' should end with a '\n'. Fixes: 791c13b ("remoteproc: Fix NULL pointer dereference in rproc_virtio_notify") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/20200411160750.32573-1-christophe.jaillet@wanadoo.fr Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Add SysFS attribute that provides the port number for PCI functions representing a single port of a multi-port device. Signed-off-by: Alexander Schmidt <alexs@linux.ibm.com> Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In the future the bus sysdata may not directly point to the zpci_dev. In preparation of upcoming patches let us abstract the access to the zpci_dev from the device inside the pci device. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Using PCI multifunctions in S390 is a new feature we may want to ignore to continue provide the same topology as in the past to userland even if the configuration supports exposing the topology of a multi-Function device. A new boolean parameters allows to overwrite the kernel pci configuration: - pci=norid when on, disallow the use a new firmware field, RID, which provides the PCI <bus>:<device>.<function> part of the PCI address. To be used in the following patches and satisfy the checkpatch.pl the variable is exposed in pci.h Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Firmware provides the bus/devfn part of the PCI addresses of a zPCI function inside the new field RID of the CLP query PCI function with a bit to know if this field is available to use. Let's add these fields to the clp_rsp_query_pci structure, add corresponding fields to zdev and initialize them. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The zPCI bus is in charge to handle common zPCI resources for zPCI devices. Creating the zPCI bus, the PCI bus, the zPCI devices and the PCI devices and hotplug slots done in a specific order: - PCI hotplug slot creation needs a PCI bus - PCI bus needs a PCI domain which is reported by the pci_domain_nr() when setting up the host bridge - PCI domain is set from the zPCI with devfn 0 this is necessary to have a reproducible enumeration Therefore we can not create devices or hotplug slots for any PCI device associated with a zPCI device before having discovered the function zero of the bus. The discovery and initialization of devices can be done at several points in the code: - On Events, serialized in a thread context - On initialization, in the kernel init thread context - When powering on the hotplug slot, in a user thread context The removal of devices and their parent bus may also be done on events or for devices when powering down the slot. To guarantee the existence of the bus and devices until they are no more needed we use kref in zPCI bus and introduce a reference count in the zPCI devices. In this patch the zPCI bus still only accept a device with a devfn 0. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Simplify the event handling. Set the zpci state explicitly. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current PCI implementation do not provide a bus resource. This leads to a notice being print at boot. Let's do it more nicely and provide the bus resource. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
We allow multiple functions on a single bus. We suppress the ZPCI_DEVFN definition and replace its occurences with zpci->devfn. We verify the number of device during the registration. There can never be more domains in use than existing devices, so we do not need to verify the count of domain after having verified the count of devices. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The Physical function should not be disabled until no virtual functions depends on it. Let's force the user to first use echo 0 > sriov_numfs before allowing to disable the PF with echo 0 > power. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There are changes in the usage of PCI for the user: - new kernel parameter - modification of the way functions are enumerated Let's document these. Signed-off-by: Pierre Morel <pmorel@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
For rolling back after an error, qdio_establish() calls qdio_shutdown(). If the error occurs early enough, then the qdio_irq's state still is QDIO_IRQ_STATE_INACTIVE and qdio_shutdown() does nothing. But at _any_ point where qdio_establish() bails out in this way, qdio_setup_irq() will have already replaced the IRQ handler. This then won't be restored after an early error, and the device can end up being returned to the device driver with qdio's IRQ handler still installed. Slightly reorder qdio_setup_irq() so we can be 100% sure that the IRQ handler was replaced. Then fix the bug in qdio_establish() by calling a helper that rolls back only the IRQ handler modification. Also use the new helper in qdio_shutdown() to keep things in sync, and slightly clean up the locking while doing so. This makes minor semantical changes, but holding setup_mutex gives us sufficient leeway to eg. pull qdio_shutdown_thinint() outside of the ccwdev lock's scope. Fixes: 779e6e1 ("[S390] qdio: new qdio driver.") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This requires flushing delayed work items in gfs2_make_fs_ro (which is called before unmounting a filesystem). When inodes are deleted and then recreated, pending gl_delete work items would have no effect because the inode generations will have changed, so we can cancel any pending gl_delete works before reusing iopen glocks. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When there's contention on the iopen glock, it means that the link count of the corresponding inode has dropped to zero on a remote node which is now trying to delete the inode. In that case, try to evict the inode so that the iopen glock will be released, which will allow the remote node to do its job. When the inode is still open locally, the inode's reference count won't drop to zero and so we'll keep holding the inode and its iopen glock. The remote node will time out its request to grab the iopen glock, and when the inode is finally closed locally, we'll try to delete it ourself. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When an inode's link count drops to zero and the inode is cached on other nodes, the current behavior of gfs2 is to immediately give up and to rely on the other node(s) to delete the inode if there is iopen glock contention. This leads to resource group glock bouncing and the loss of caching. With the previous patches in place, we can fix that by not giving up immediately. When the inode is still open on other nodes, those nodes won't be able to evict the inode and give up the iopen glock. In that case, our lock conversion request will time out. The unlink system call will block for the duration of the iopen lock conversion request. We're also holding the inode glock in EX mode for an extended duration, so other nodes won't be able to make progress on the inode, either. This is worse than what we had before, but we can prevent other nodes from getting stuck by aborting our iopen locking request if there is contention on the inode glock. This will the the subject of a future patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use a zero no_formal_ino instead of a NULL pointer to indicate that any inode generation number will qualify: a valid inode never has a zero no_formal_ino. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Move the inode generation number check from gfs2_lookup_by_inum into gfs2_inode_lookup: gfs2_inode_lookup may be able to decide that an inode with the given inode generation number cannot exist without having to verify the block type or reading the inode from disk. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In delete_work_func, if the iopen glock still has an inode attached, limit the inode lookup to that specific generation number: in the likely case that the inode was deleted on the node on which the inode's link count dropped to zero, we can skip verifying the on-disk block type and reading in the inode. The same applies if another node that had the inode open managed to delete the inode before us. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Wake up the sdp->sd_async_glock_wait wait queue when setting the GLF_DEMOTE flag. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When trying to upgrade the iopen glock from a shared to an exclusive lock in gfs2_evict_inode, abort the wait if there is contention on the corresponding inode glock: in that case, the inode must still be in active use on another node, and we're not guaranteed to get the iopen glock anytime soon. To make this work even better, when we notice contention on the iopen glock and we can't evict the corresponsing inode and release the iopen glock immediately, poke the inode glock. The other node(s) trying to acquire the lock can then abort instead of timing out. Thanks to Heinz Mauelshagen for pointing out a locking bug in a previous version of this patch. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since transactions may be freed shortly after they're created, before a log_flush occurs, we need to initialize their ail1 and ail2 lists earlier. Before this patch, the ail1 list was initialized in gfs2_log_flush(). This moves the initialization to the point when the transaction is first created. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch adds a new slab for gfs2 transactions. That allows us to reduce kernel memory fragmentation, have better organization of data for analysis of vmcore dumps. A new centralized function is added to free the slab objects, and it exposes use-after-free by giving warnings if a transaction is freed while it still has bd elements attached to its buffers or ail lists. We make sure to initialize those transaction ail lists so we can check their integrity when freeing. At a later time, we should add a slab initialization function to make it more efficient, but for this initial patch I wanted to minimize the impact. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, transactions could be merged into the system transaction by function gfs2_merge_trans(), but the transaction ail lists were never merged. Because the ail flushing mechanism can run separately, bd elements can be attached to the transaction's buffer list during the transaction (trans_add_meta, etc) but quickly moved to its ail lists. Later, in function gfs2_trans_end, the transaction can be freed (by gfs2_trans_end) while it still has bd elements queued to its ail lists, which can cause it to either lose track of the bd elements altogether (memory leak) or worse, reference the bd elements after the parent transaction has been freed. Although I've not seen any serious consequences, the problem becomes apparent with the previous patch's addition of: gfs2_assert_warn(sdp, list_empty(&tr->tr_ail1_list)); to function gfs2_trans_free(). This patch adds logic into gfs2_merge_trans() to move the merged transaction's ail lists to the sdp transaction. This prevents the use-after-free. To do this properly, we need to hold the ail lock, so we pass sdp into the function instead of the transaction itself. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
…it/s390/linux Pull s390 updates from Vasily Gorbik: - Add support for multi-function devices in pci code. - Enable PF-VF linking for architectures using the pdev->no_vf_scan flag (currently just s390). - Add reipl from NVMe support. - Get rid of critical section cleanup in entry.S. - Refactor PNSO CHSC (perform network subchannel operation) in cio and qeth. - QDIO interrupts and error handling fixes and improvements, more refactoring changes. - Align ioremap() with generic code. - Accept requests without the prefetch bit set in vfio-ccw. - Enable path handling via two new regions in vfio-ccw. - Other small fixes and improvements all over the code. * tag 's390-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (52 commits) vfio-ccw: make vfio_ccw_regops variables declarations static vfio-ccw: Add trace for CRW event vfio-ccw: Wire up the CRW irq and CRW region vfio-ccw: Introduce a new CRW region vfio-ccw: Refactor IRQ handlers vfio-ccw: Introduce a new schib region vfio-ccw: Refactor the unregister of the async regions vfio-ccw: Register a chp_event callback for vfio-ccw vfio-ccw: Introduce new helper functions to free/destroy regions vfio-ccw: document possible errors vfio-ccw: Enable transparent CCW IPL from DASD s390/pci: Log new handle in clp_disable_fh() s390/cio, s390/qeth: cleanup PNSO CHSC s390/qdio: remove q->first_to_kick s390/qdio: fix up qdio_start_irq() kerneldoc s390: remove critical section cleanup from entry.S s390: add machine check SIGP s390/pci: ioremap() align with generic code s390/ap: introduce new ap function ap_get_qdev() Documentation/s390: Update / remove developerWorks web links ...
…/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - An iopen glock locking scheme rework that speeds up deletes of inodes accessed from multiple nodes - Various bug fixes and debugging improvements - Convert gfs2-glocks.txt to ReST * tag 'gfs2-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: fix use-after-free on transaction ail lists gfs2: new slab for transactions gfs2: initialize transaction tr_ailX_lists earlier gfs2: Smarter iopen glock waiting gfs2: Wake up when setting GLF_DEMOTE gfs2: Check inode generation number in delete_work_func gfs2: Move inode generation number check into gfs2_inode_lookup gfs2: Minor gfs2_lookup_by_inum cleanup gfs2: Try harder to delete inodes locally gfs2: Give up the iopen glock on contention gfs2: Turn gl_delete into a delayed work gfs2: Keep track of deleted inode generations in LVBs gfs2: Allow ASPACE glocks to also have an lvb gfs2: instrumentation wrt log_flush stuck gfs2: introduce new gfs2_glock_assert_withdraw gfs2: print mapping->nrpages in glock dump for address space glocks gfs2: Only do glock put in gfs2_create_inode for free inodes gfs2: Allow lock_nolock mount to specify jid=X gfs2: Don't ignore inode write errors during inode_go_sync docs: filesystems: convert gfs2-glocks.txt to ReST
Pull ceph updates from Ilya Dryomov: "The highlights are: - OSD/MDS latency and caps cache metrics infrastructure for the filesytem (Xiubo Li). Currently available through debugfs and will be periodically sent to the MDS in the future. - support for replica reads (balanced and localized reads) for rbd and the filesystem (myself). The default remains to always read from primary, users can opt-in with the new crush_location and read_from_replica options. Note that reading from replica is safe for general use only since Octopus. - support for RADOS allocation hint flags (myself). Currently used by rbd to propagate the compressible/incompressible hint given with the new compression_hint map option and ready for passing on more advanced hints, e.g. based on fadvise() from the filesystem. - support for efficient cross-quota-realm renames (Luis Henriques) - assorted cap handling improvements and cleanups, particularly untangling some of the locking (Jeff Layton)" * tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits) rbd: compression_hint option libceph: support for alloc hint flags libceph: read_from_replica option libceph: support for balanced and localized reads libceph: crush_location infrastructure libceph: decode CRUSH device/bucket types and names libceph: add non-asserting rbtree insertion helper ceph: skip checking caps when session reconnecting and releasing reqs ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock ceph: don't return -ESTALE if there's still an open file libceph, rbd: replace zero-length array with flexible-array ceph: allow rename operation under different quota realms ceph: normalize 'delta' parameter usage in check_quota_exceeded ceph: ceph_kick_flushing_caps needs the s_mutex ceph: request expedited service on session's last cap flush ceph: convert mdsc->cap_dirty to a per-session list ceph: reset i_requested_max_size if file write is not wanted ceph: throw a warning if we destroy session with mutex still locked ceph: fix potential race in ceph_check_caps ceph: document what protects i_dirty_item and i_flushing_item ...
…it/andersson/remoteproc Pull rpmsg updates from Bjorn Andersson: "This replaces a zero-length array with flexible-array and fixes a typo in a comment in the rpmsg core" * tag 'rpmsg-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: rpmsg: Replace zero-length array with flexible-array rpmsg: fix a comment typo for rpmsg_device_match()
…it/andersson/remoteproc Pull remoteproc updates from Bjorn Andersson: "This introduces device managed versions of functions used to register remoteproc devices, add support for remoteproc driver specific resource control, enables remoteproc drivers to specify ELF class and machine for coredumps. It integrates pm_runtime in the core for keeping resources active while the remote is booted and holds a wake source while recoverying a remote processor after a firmware crash. It refactors the remoteproc device's allocation path to simplify the logic, fix a few cleanup bugs and to not clone const strings onto the heap. Debugfs code is simplifies using the DEFINE_SHOW_ATTRIBUTE and a zero-length array is replaced with flexible-array. A new remoteproc driver for the JZ47xx VPU is introduced, the Qualcomm SM8250 gains support for audio, compute and sensor remoteprocs and the Qualcomm SC7180 modem support is cleaned up and improved. The Qualcomm glink subsystem-restart driver is merged into the main glink driver, the Qualcomm sysmon driver is extended to properly notify remote processors about all other remote processors' state transitions" * tag 'rproc-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc: (43 commits) remoteproc: Fix an error code in devm_rproc_alloc() MAINTAINERS: Add myself as reviewer for Ingenic rproc driver remoteproc: ingenic: Added remoteproc driver remoteproc: Add support for runtime PM dt-bindings: Document JZ47xx VPU auxiliary processor remoteproc: wcss: Fix arguments passed to qcom_add_glink_subdev() remoteproc: Fix and restore the parenting hierarchy for vdev remoteproc: Fall back to using parent memory pool if no dedicated available remoteproc: Replace zero-length array with flexible-array remoteproc: wcss: add support for rpmsg communication remoteproc: core: Prevent system suspend during remoteproc recovery remoteproc: qcom_q6v5_mss: Remove unused q6v5_da_to_va function remoteproc: qcom_q6v5_mss: map/unmap mpss segments before/after use remoteproc: qcom_q6v5_mss: Drop accesses to MPSS PERPH register space dt-bindings: remoteproc: qcom: Replace halt-nav with spare-regs remoteproc: qcom: pas: Add SM8250 PAS remoteprocs dt-bindings: remoteproc: qcom: pas: Add SM8250 remoteprocs remoteproc: qcom_q6v5_mss: Extract mba/mpss from memory-region dt-bindings: remoteproc: qcom: Use memory-region to reference memory remoteproc: qcom: pas: Add SC7180 Modem support ...
pull bot
pushed a commit
that referenced
this pull request
Aug 7, 2020
Fixes the following checkpatch.pl warning(s): CHECK: Please use a blank line after function/struct/union/enum declarations #46: FILE: drivers/iio/dummy/iio_simple_dummy.c:690: } +/* total: 0 errors, 0 warnings, 1 checks, 22 lines checked Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
pull bot
pushed a commit
that referenced
this pull request
Aug 7, 2020
Inspired by the commit 42d038c ("arm64: Add support for function error injection"), this patch supports function error injection for csky. This patch mainly support two functions: one is regs_set_return_value() which is used to overwrite the return value; the another function is override_function_with_return() which is to override the probed function returning and jump to its caller. Test log: cd /sys/kernel/debug/fail_function/ echo sys_clone > inject echo 100 > probability echo 1 > interval ls / [ 108.644163] FAULT_INJECTION: forcing a failure. [ 108.644163] name fail_function, interval 1, probability 100, space 0, times 1 [ 108.647799] CPU: 0 PID: 104 Comm: sh Not tainted 5.8.0-rc5+ #46 [ 108.648384] Call Trace: [ 108.649339] [<8005eed4>] walk_stackframe+0x0/0xf0 [ 108.649679] [<8005f16a>] show_stack+0x32/0x5c [ 108.649927] [<8040f9d2>] dump_stack+0x6e/0x9c [ 108.650271] [<80406f7e>] should_fail+0x15e/0x1ac [ 108.650720] [<80118ba8>] fei_kprobe_handler+0x28/0x5c [ 108.651519] [<80754110>] kprobe_breakpoint_handler+0x144/0x1cc [ 108.652289] [<8005d6da>] trap_c+0x8e/0x110 [ 108.652816] [<8005ce8c>] csky_trap+0x5c/0x70 -sh: can't fork: Invalid argument Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Cc: Arnd Bergmann <arnd@arndb.de>
pull bot
pushed a commit
that referenced
this pull request
Sep 18, 2020
The crash log as the below: [Thu Aug 20 23:18:14 2020] general protection fault: 0000 [#1] SMP NOPTI [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G OE 5.4.0-42-generic #46~18.04.1-Ubuntu [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020 [Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu] [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00 [Thu Aug 20 23:18:14 2020] RSP: 0018:ffffb29b52f6fc90 EFLAGS: 00010213 [Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: ffff8a0d45ff3c00 RCX: ffff8a2d83e41038 [Thu Aug 20 23:18:14 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8a0e2e4178c0 [Thu Aug 20 23:18:14 2020] RBP: ffffb29b52f6fcb0 R08: 0000000000001b64 R09: 0000000000000004 [Thu Aug 20 23:18:14 2020] R10: ffffb29b52f6fb78 R11: 0000000000000001 R12: ffff8a0d45ff3d28 [Thu Aug 20 23:18:14 2020] R13: ffff8a2d83e41028 R14: 0000000000000000 R15: 0000000000000000 [Thu Aug 20 23:18:14 2020] FS: 0000000000000000(0000) GS:ffff8a0e2e400000(0000) knlGS:0000000000000000 [Thu Aug 20 23:18:14 2020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Thu Aug 20 23:18:14 2020] CR2: 000055c783c0e6a8 CR3: 00000034a1284000 CR4: 0000000000340ee0 [Thu Aug 20 23:18:14 2020] Call Trace: [Thu Aug 20 23:18:14 2020] kfd_process_evict_queues+0x43/0xd0 [amdgpu] [Thu Aug 20 23:18:14 2020] kfd_suspend_all_processes+0x60/0xf0 [amdgpu] [Thu Aug 20 23:18:14 2020] kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu] [Thu Aug 20 23:18:14 2020] kgd2kfd_pre_reset+0x46/0x60 [amdgpu] [Thu Aug 20 23:18:14 2020] amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu] [Thu Aug 20 23:18:14 2020] amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu] [Thu Aug 20 23:18:14 2020] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [Thu Aug 20 23:18:14 2020] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [Thu Aug 20 23:18:14 2020] process_one_work+0x20f/0x400 [Thu Aug 20 23:18:14 2020] worker_thread+0x34/0x410 When GPU hang, user process will fail to create a compute queue whose struct object will be freed later, but driver wrongly add this queue to queue list of the proccess. And then kfd_process_evict_queues will access a freed memory, which cause a system crash. v2: The failure to execute_queues should probably not be reported to the caller of create_queue, because the queue was already created. Therefore change to ignore the return value from execute_queues. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
pull bot
pushed a commit
that referenced
this pull request
Nov 20, 2020
When removing the driver we would hit BUG_ON(!list_empty(&dev->ptype_specific)) in net/core/dev.c due to still having the NC-SI packet handler registered. # echo 1e660000.ethernet > /sys/bus/platform/drivers/ftgmac100/unbind ------------[ cut here ]------------ kernel BUG at net/core/dev.c:10254! Internal error: Oops - BUG: 0 [#1] SMP ARM CPU: 0 PID: 115 Comm: sh Not tainted 5.10.0-rc3-next-20201111-00007-g02e0365710c4 #46 Hardware name: Generic DT based system PC is at netdev_run_todo+0x314/0x394 LR is at cpumask_next+0x20/0x24 pc : [<806f5830>] lr : [<80863cb0>] psr: 80000153 sp : 855bbd58 ip : 00000001 fp : 855bbdac r10: 80c03d00 r9 : 80c06228 r8 : 81158c54 r7 : 00000000 r6 : 80c05dec r5 : 80c05d18 r4 : 813b9280 r3 : 813b9054 r2 : 8122c470 r1 : 00000002 r0 : 00000002 Flags: Nzcv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none Control: 00c5387d Table: 85514008 DAC: 00000051 Process sh (pid: 115, stack limit = 0x7cb5703d) ... Backtrace: [<806f551c>] (netdev_run_todo) from [<80707eec>] (rtnl_unlock+0x18/0x1c) r10:00000051 r9:854ed710 r8:81158c54 r7:80c76bb0 r6:81158c10 r5:8115b410 r4:813b9000 [<80707ed4>] (rtnl_unlock) from [<806f5db8>] (unregister_netdev+0x2c/0x30) [<806f5d8c>] (unregister_netdev) from [<805a8180>] (ftgmac100_remove+0x20/0xa8) r5:8115b410 r4:813b9000 [<805a8160>] (ftgmac100_remove) from [<805355e4>] (platform_drv_remove+0x34/0x4c) Fixes: bd466c3 ("net/faraday: Support NCSI mode") Signed-off-by: Joel Stanley <joel@jms.id.au> Link: https://lore.kernel.org/r/20201117024448.1170761-1-joel@jms.id.au Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pull bot
pushed a commit
that referenced
this pull request
May 5, 2021
Ritesh reported a bug [1] against UML, noting that it crashed on startup. The backtrace shows the following (heavily redacted): (gdb) bt ... #26 0x0000000060015b5d in sem_init () at ipc/sem.c:268 #27 0x00007f89906d92f7 in ?? () from /lib/x86_64-linux-gnu/libcom_err.so.2 #28 0x00007f8990ab8fb2 in call_init (...) at dl-init.c:72 ... #40 0x00007f89909bf3a6 in nss_load_library (...) at nsswitch.c:359 ... #44 0x00007f8990895e35 in _nss_compat_getgrnam_r (...) at nss_compat/compat-grp.c:486 #45 0x00007f8990968b85 in __getgrnam_r [...] #46 0x00007f89909d6b77 in grantpt [...] #47 0x00007f8990a9394e in __GI_openpty [...] #48 0x00000000604a1f65 in openpty_cb (...) at arch/um/os-Linux/sigio.c:407 #49 0x00000000604a58d0 in start_idle_thread (...) at arch/um/os-Linux/skas/process.c:598 #50 0x0000000060004a3d in start_uml () at arch/um/kernel/skas/process.c:45 #51 0x00000000600047b2 in linux_main (...) at arch/um/kernel/um_arch.c:334 #52 0x000000006000574f in main (...) at arch/um/os-Linux/main.c:144 indicating that the UML function openpty_cb() calls openpty(), which internally calls __getgrnam_r(), which causes the nsswitch machinery to get started. This loads, through lots of indirection that I snipped, the libcom_err.so.2 library, which (in an unknown function, "??") calls sem_init(). Now, of course it wants to get libpthread's sem_init(), since it's linked against libpthread. However, the dynamic linker looks up that symbol against the binary first, and gets the kernel's sem_init(). Hajime Tazaki noted that "objcopy -L" can localize a symbol, so the dynamic linker wouldn't do the lookup this way. I tried, but for some reason that didn't seem to work. Doing the same thing in the linker script instead does seem to work, though I cannot entirely explain - it *also* works if I just add "VERSION { { global: *; }; }" instead, indicating that something else is happening that I don't really understand. It may be that explicitly doing that marks them with some kind of empty version, and that's different from the default. Explicitly marking them with a version breaks kallsyms, so that doesn't seem to be possible. Marking all the symbols as local seems correct, and does seem to address the issue, so do that. Also do it for static link, nsswitch libraries could still be loaded there. [1] https://bugs.debian.org/983379 Reported-by: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com> Tested-By: Ritesh Raj Sarraf <rrs@debian.org> Signed-off-by: Richard Weinberger <richard@nod.at>
pull bot
pushed a commit
that referenced
this pull request
Jul 3, 2021
In raise_backtrace_ipi() we iterate through the cpumask of CPUs, sending each an IPI asking them to do a backtrace, but we don't wait for the backtrace to happen. We then iterate through the CPU mask again, and if any CPU hasn't done the backtrace and cleared itself from the mask, we print a trace on its behalf, noting that the trace may be "stale". This works well enough when a CPU is not responding, because in that case it doesn't receive the IPI and the sending CPU is left to print the trace. But when all CPUs are responding we are left with a race between the sending and receiving CPUs, if the sending CPU wins the race then it will erroneously print a trace. This leads to spurious "stale" traces from the sending CPU, which can then be interleaved messily with the receiving CPU, note the CPU numbers, eg: [ 1658.929157][ C7] rcu: Stack dump where RCU GP kthread last ran: [ 1658.929223][ C7] Sending NMI from CPU 7 to CPUs 1: [ 1658.929303][ C1] NMI backtrace for cpu 1 [ 1658.929303][ C7] CPU 1 didn't respond to backtrace IPI, inspecting paca. [ 1658.929362][ C1] CPU: 1 PID: 325 Comm: kworker/1:1H Tainted: G W E 5.13.0-rc2+ #46 [ 1658.929405][ C7] irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 325 (kworker/1:1H) [ 1658.929465][ C1] Workqueue: events_highpri test_work_fn [test_lockup] [ 1658.929549][ C7] Back trace of paca->saved_r1 (0xc0000000057fb400) (possibly stale): [ 1658.929592][ C1] NIP: c00000000002cf50 LR: c008000000820178 CTR: c00000000002cfa0 To fix it, change the logic so that the sending CPU waits 5s for the receiving CPU to print its trace. If the receiving CPU prints its trace successfully then the sending CPU just continues, avoiding any spurious "stale" trace. This has the added benefit of allowing all CPUs to print their traces in order and avoids any interleaving of their output. Fixes: 5cc0591 ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()") Cc: stable@vger.kernel.org # v4.18+ Reported-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210625140408.3351173-1-mpe@ellerman.id.au
pull bot
pushed a commit
that referenced
this pull request
Dec 22, 2021
Hi, When testing install and uninstall of ipmi_si.ko and ipmi_msghandler.ko, the system crashed. The log as follows: [ 141.087026] BUG: unable to handle kernel paging request at ffffffffc09b3a5a [ 141.087241] PGD 8fe4c0d067 P4D 8fe4c0d067 PUD 8fe4c0f067 PMD 103ad89067 PTE 0 [ 141.087464] Oops: 0010 [#1] SMP NOPTI [ 141.087580] CPU: 67 PID: 668 Comm: kworker/67:1 Kdump: loaded Not tainted 4.18.0.x86_64 #47 [ 141.088009] Workqueue: events 0xffffffffc09b3a40 [ 141.088009] RIP: 0010:0xffffffffc09b3a5a [ 141.088009] Code: Bad RIP value. [ 141.088009] RSP: 0018:ffffb9094e2c3e88 EFLAGS: 00010246 [ 141.088009] RAX: 0000000000000000 RBX: ffff9abfdb1f04a0 RCX: 0000000000000000 [ 141.088009] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 [ 141.088009] RBP: 0000000000000000 R08: ffff9abfffee3cb8 R09: 00000000000002e1 [ 141.088009] R10: ffffb9094cb73d90 R11: 00000000000f4240 R12: ffff9abfffee8700 [ 141.088009] R13: 0000000000000000 R14: ffff9abfdb1f04a0 R15: ffff9abfdb1f04a8 [ 141.088009] FS: 0000000000000000(0000) GS:ffff9abfffec0000(0000) knlGS:0000000000000000 [ 141.088009] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 141.088009] CR2: ffffffffc09b3a30 CR3: 0000008fe4c0a001 CR4: 00000000007606e0 [ 141.088009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 141.088009] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 141.088009] PKRU: 55555554 [ 141.088009] Call Trace: [ 141.088009] ? process_one_work+0x195/0x390 [ 141.088009] ? worker_thread+0x30/0x390 [ 141.088009] ? process_one_work+0x390/0x390 [ 141.088009] ? kthread+0x10d/0x130 [ 141.088009] ? kthread_flush_work_fn+0x10/0x10 [ 141.088009] ? ret_from_fork+0x35/0x40] BUG: unable to handle kernel paging request at ffffffffc0b28a5a [ 200.223240] PGD 97fe00d067 P4D 97fe00d067 PUD 97fe00f067 PMD a580cbf067 PTE 0 [ 200.223464] Oops: 0010 [#1] SMP NOPTI [ 200.223579] CPU: 63 PID: 664 Comm: kworker/63:1 Kdump: loaded Not tainted 4.18.0.x86_64 #46 [ 200.224008] Workqueue: events 0xffffffffc0b28a40 [ 200.224008] RIP: 0010:0xffffffffc0b28a5a [ 200.224008] Code: Bad RIP value. [ 200.224008] RSP: 0018:ffffbf3c8e2a3e88 EFLAGS: 00010246 [ 200.224008] RAX: 0000000000000000 RBX: ffffa0799ad6bca0 RCX: 0000000000000000 [ 200.224008] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 [ 200.224008] RBP: 0000000000000000 R08: ffff9fe43fde3cb8 R09: 00000000000000d5 [ 200.224008] R10: ffffbf3c8cb53d90 R11: 00000000000f4240 R12: ffff9fe43fde8700 [ 200.224008] R13: 0000000000000000 R14: ffffa0799ad6bca0 R15: ffffa0799ad6bca8 [ 200.224008] FS: 0000000000000000(0000) GS:ffff9fe43fdc0000(0000) knlGS:0000000000000000 [ 200.224008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.224008] CR2: ffffffffc0b28a30 CR3: 00000097fe00a002 CR4: 00000000007606e0 [ 200.224008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 200.224008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 200.224008] PKRU: 55555554 [ 200.224008] Call Trace: [ 200.224008] ? process_one_work+0x195/0x390 [ 200.224008] ? worker_thread+0x30/0x390 [ 200.224008] ? process_one_work+0x390/0x390 [ 200.224008] ? kthread+0x10d/0x130 [ 200.224008] ? kthread_flush_work_fn+0x10/0x10 [ 200.224008] ? ret_from_fork+0x35/0x40 [ 200.224008] kernel fault(0x1) notification starting on CPU 63 [ 200.224008] kernel fault(0x1) notification finished on CPU 63 [ 200.224008] CR2: ffffffffc0b28a5a [ 200.224008] ---[ end trace c82a412d93f57412 ]--- The reason is as follows: T1: rmmod ipmi_si. ->ipmi_unregister_smi() -> ipmi_bmc_unregister() -> __ipmi_bmc_unregister() -> kref_put(&bmc->usecount, cleanup_bmc_device); -> schedule_work(&bmc->remove_work); T2: rmmod ipmi_msghandler. ipmi_msghander module uninstalled, and the module space will be freed. T3: bmc->remove_work doing cleanup the bmc resource. -> cleanup_bmc_work() -> platform_device_unregister(&bmc->pdev); -> platform_device_del(pdev); -> device_del(&pdev->dev); -> kobject_uevent(&dev->kobj, KOBJ_REMOVE); -> kobject_uevent_env() -> dev_uevent() -> if (dev->type && dev->type->name) 'dev->type'(bmc_device_type) pointer space has freed when uninstall ipmi_msghander module, 'dev->type->name' cause the system crash. drivers/char/ipmi/ipmi_msghandler.c: 2820 static const struct device_type bmc_device_type = { 2821 .groups = bmc_dev_attr_groups, 2822 }; Steps to reproduce: Add a time delay in cleanup_bmc_work() function, and uninstall ipmi_si and ipmi_msghandler module. 2910 static void cleanup_bmc_work(struct work_struct *work) 2911 { 2912 struct bmc_device *bmc = container_of(work, struct bmc_device, 2913 remove_work); 2914 int id = bmc->pdev.id; /* Unregister overwrites id */ 2915 2916 msleep(3000); <--- 2917 platform_device_unregister(&bmc->pdev); 2918 ida_simple_remove(&ipmi_bmc_ida, id); 2919 } Use 'remove_work_wq' instead of 'system_wq' to solve this issues. Fixes: b2cfd8a ("ipmi: Rework device id and guid handling to catch changing BMCs") Signed-off-by: Wu Bo <wubo40@huawei.com> Message-Id: <1640070034-56671-1-git-send-email-wubo40@huawei.com> Signed-off-by: Corey Minyard <cminyard@mvista.com>
pull bot
pushed a commit
that referenced
this pull request
Oct 4, 2022
The dump_cpu_task() function does not print registers on architectures that do not support NMIs. However, registers can be useful for debugging. Fortunately, in the case where dump_cpu_task() is invoked from an interrupt handler and is dumping the current CPU's stack, the get_irq_regs() function can be used to get the registers. Therefore, this commit makes dump_cpu_task() check to see if it is being asked to dump the current CPU's stack from within an interrupt handler, and, if so, it uses the get_irq_regs() function to obtain the registers. On systems that do support NMIs, this commit has the further advantage of avoiding a self-NMI in this case. This is an example of rcu self-detected stall on arm64, which does not support NMIs: [ 27.501721] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.502238] rcu: 0-....: (1250 ticks this GP) idle=4f7/1/0x4000000000000000 softirq=2594/2594 fqs=619 [ 27.502632] (t=1251 jiffies g=2989 q=29 ncpus=4) [ 27.503845] CPU: 0 PID: 306 Comm: test0 Not tainted 5.19.0-rc7-00009-g1c1a6c29ff99-dirty #46 [ 27.504732] Hardware name: linux,dummy-virt (DT) [ 27.504947] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 27.504998] pc : arch_counter_read+0x18/0x24 [ 27.505301] lr : arch_counter_read+0x18/0x24 [ 27.505328] sp : ffff80000b29bdf0 [ 27.505345] x29: ffff80000b29bdf0 x28: 0000000000000000 x27: 0000000000000000 [ 27.505475] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 27.505553] x23: 0000000000001f40 x22: ffff800009849c48 x21: 000000065f871ae0 [ 27.505627] x20: 00000000000025ec x19: ffff80000a6eb300 x18: ffffffffffffffff [ 27.505654] x17: 0000000000000001 x16: 0000000000000000 x15: ffff80000a6d0296 [ 27.505681] x14: ffffffffffffffff x13: ffff80000a29bc18 x12: 0000000000000426 [ 27.505709] x11: 0000000000000162 x10: ffff80000a2f3c18 x9 : ffff80000a29bc18 [ 27.505736] x8 : 00000000ffffefff x7 : ffff80000a2f3c18 x6 : 00000000759bd013 [ 27.505761] x5 : 01ffffffffffffff x4 : 0002dc6c00000000 x3 : 0000000000000017 [ 27.505787] x2 : 00000000000025ec x1 : ffff80000b29bdf0 x0 : 0000000075a30653 [ 27.505937] Call trace: [ 27.506002] arch_counter_read+0x18/0x24 [ 27.506171] ktime_get+0x48/0xa0 [ 27.506207] test_task+0x70/0xf0 [ 27.506227] kthread+0x10c/0x110 [ 27.506243] ret_from_fork+0x10/0x20 This is a marked improvement over the old output: [ 27.944550] rcu: INFO: rcu_preempt self-detected stall on CPU [ 27.944980] rcu: 0-....: (1249 ticks this GP) idle=cbb/1/0x4000000000000000 softirq=2610/2610 fqs=614 [ 27.945407] (t=1251 jiffies g=2681 q=28 ncpus=4) [ 27.945731] Task dump for CPU 0: [ 27.945844] task:test0 state:R running task stack: 0 pid: 306 ppid: 2 flags:0x0000000a [ 27.946073] Call trace: [ 27.946151] dump_backtrace.part.0+0xc8/0xd4 [ 27.946378] show_stack+0x18/0x70 [ 27.946405] sched_show_task+0x150/0x180 [ 27.946427] dump_cpu_task+0x44/0x54 [ 27.947193] rcu_dump_cpu_stacks+0xec/0x130 [ 27.947212] rcu_sched_clock_irq+0xb18/0xef0 [ 27.947231] update_process_times+0x68/0xac [ 27.947248] tick_sched_handle+0x34/0x60 [ 27.947266] tick_sched_timer+0x4c/0xa4 [ 27.947281] __hrtimer_run_queues+0x178/0x360 [ 27.947295] hrtimer_interrupt+0xe8/0x244 [ 27.947309] arch_timer_handler_virt+0x38/0x4c [ 27.947326] handle_percpu_devid_irq+0x88/0x230 [ 27.947342] generic_handle_domain_irq+0x2c/0x44 [ 27.947357] gic_handle_irq+0x44/0xc4 [ 27.947376] call_on_irq_stack+0x2c/0x54 [ 27.947415] do_interrupt_handler+0x80/0x94 [ 27.947431] el1_interrupt+0x34/0x70 [ 27.947447] el1h_64_irq_handler+0x18/0x24 [ 27.947462] el1h_64_irq+0x64/0x68 <--- the above backtrace is worthless [ 27.947474] arch_counter_read+0x18/0x24 [ 27.947487] ktime_get+0x48/0xa0 [ 27.947501] test_task+0x70/0xf0 [ 27.947520] kthread+0x10c/0x110 [ 27.947538] ret_from_fork+0x10/0x20 Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com>
pull bot
pushed a commit
that referenced
this pull request
Oct 10, 2022
Local testing revealed that we can trigger a use-after-free during rhashtable lookup as follows: | BUG: KASAN: use-after-free in memcmp lib/string.c:757 | Read of size 8 at addr ffff888107544dc0 by task perf-rhltable-n/1293 | | CPU: 0 PID: 1293 Comm: perf-rhltable-n Not tainted 6.0.0-rc3-00014-g85260862789c #46 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 | Call Trace: | <TASK> | memcmp lib/string.c:757 | rhashtable_compare include/linux/rhashtable.h:577 [inline] | __rhashtable_lookup include/linux/rhashtable.h:602 [inline] | rhltable_lookup include/linux/rhashtable.h:688 [inline] | task_bp_pinned kernel/events/hw_breakpoint.c:324 | toggle_bp_slot kernel/events/hw_breakpoint.c:462 | __release_bp_slot kernel/events/hw_breakpoint.c:631 [inline] | release_bp_slot kernel/events/hw_breakpoint.c:639 | register_perf_hw_breakpoint kernel/events/hw_breakpoint.c:742 | hw_breakpoint_event_init kernel/events/hw_breakpoint.c:976 | perf_try_init_event kernel/events/core.c:11261 | perf_init_event kernel/events/core.c:11325 [inline] | perf_event_alloc kernel/events/core.c:11619 | __do_sys_perf_event_open kernel/events/core.c:12157 | do_syscall_x64 arch/x86/entry/common.c:50 [inline] | do_syscall_64 arch/x86/entry/common.c:80 | entry_SYSCALL_64_after_hwframe | </TASK> | | Allocated by task 1292: | perf_event_alloc kernel/events/core.c:11505 | __do_sys_perf_event_open kernel/events/core.c:12157 | do_syscall_x64 arch/x86/entry/common.c:50 [inline] | do_syscall_64 arch/x86/entry/common.c:80 | entry_SYSCALL_64_after_hwframe | | Freed by task 1292: | perf_event_alloc kernel/events/core.c:11716 | __do_sys_perf_event_open kernel/events/core.c:12157 | do_syscall_x64 arch/x86/entry/common.c:50 [inline] | do_syscall_64 arch/x86/entry/common.c:80 | entry_SYSCALL_64_after_hwframe | | The buggy address belongs to the object at ffff888107544c00 | which belongs to the cache perf_event of size 1352 | The buggy address is located 448 bytes inside of | 1352-byte region [ffff888107544c00, ffff888107545148) This happens because the first perf_event_open() managed to reserve a HW breakpoint slot, however, later fails for other reasons and returns. The second perf_event_open() runs concurrently, and during rhltable_lookup() looks up an entry which is being freed: since rhltable_lookup() may run concurrently (under the RCU read lock) with rhltable_remove(), we may end up with a stale entry, for which memory may also have already been freed when being accessed. To fix, only free the failed perf_event after an RCU grace period. This allows subsystems that store references to an event to always access it concurrently under the RCU read lock, even if initialization will fail. Given failure is unlikely and a slow-path, turning the immediate free into a call_rcu()-wrapped free does not affect performance elsewhere. Fixes: 0370dc3 ("perf/hw_breakpoint: Optimize list of per-task breakpoints") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220927172025.1636995-1-elver@google.com
pull bot
pushed a commit
that referenced
this pull request
Feb 24, 2023
Patch series "mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare", v4. Problem ======= huge_pte_offset() is a major helper used by hugetlb code paths to walk a hugetlb pgtable. It's used mostly everywhere since that's needed even before taking the pgtable lock. huge_pte_offset() is always called with mmap lock held with either read or write. It was assumed to be safe but it's actually not. One race condition can easily trigger by: (1) firstly trigger pmd share on a memory range, (2) do huge_pte_offset() on the range, then at the meantime, (3) another thread unshare the pmd range, and the pgtable page is prone to lost if the other shared process wants to free it completely (by either munmap or exit mm). The recent work from Mike on vma lock can resolve most of this already. It's achieved by forbidden pmd unsharing during the lock being taken, so no further risk of the pgtable page being freed. It means if we can take the vma lock around all huge_pte_offset() callers it'll be safe. There're already a bunch of them that we did as per the latest mm-unstable, but also quite a few others that we didn't for various reasons especially on huge_pte_offset() usage. One more thing to mention is that besides the vma lock, i_mmap_rwsem can also be used to protect the pgtable page (along with its pgtable lock) from being freed from under us. IOW, huge_pte_offset() callers need to either hold the vma lock or i_mmap_rwsem to safely walk the pgtables. A reproducer of such problem, based on hugetlb GUP (NOTE: since the race is very hard to trigger, one needs to apply another kernel delay patch too, see below): ======8<======= #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <errno.h> #include <unistd.h> #include <sys/mman.h> #include <fcntl.h> #include <linux/memfd.h> #include <assert.h> #include <pthread.h> #define MSIZE (1UL << 30) /* 1GB */ #define PSIZE (2UL << 20) /* 2MB */ #define HOLD_SEC (1) int pipefd[2]; void *buf; void *do_map(int fd) { unsigned char *tmpbuf, *p; int ret; ret = posix_memalign((void **)&tmpbuf, MSIZE, MSIZE); if (ret) { perror("posix_memalign() failed"); return NULL; } tmpbuf = mmap(tmpbuf, MSIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0); if (tmpbuf == MAP_FAILED) { perror("mmap() failed"); return NULL; } printf("mmap() -> %p\n", tmpbuf); for (p = tmpbuf; p < tmpbuf + MSIZE; p += PSIZE) { *p = 1; } return tmpbuf; } void do_unmap(void *buf) { munmap(buf, MSIZE); } void proc2(int fd) { unsigned char c; buf = do_map(fd); if (!buf) return; read(pipefd[0], &c, 1); /* * This frees the shared pgtable page, causing use-after-free in * proc1_thread1 when soft walking hugetlb pgtable. */ do_unmap(buf); printf("Proc2 quitting\n"); } void *proc1_thread1(void *data) { /* * Trigger follow-page on 1st 2m page. Kernel hack patch needed to * withhold this procedure for easier reproduce. */ madvise(buf, PSIZE, MADV_POPULATE_WRITE); printf("Proc1-thread1 quitting\n"); return NULL; } void *proc1_thread2(void *data) { unsigned char c; /* Wait a while until proc1_thread1() start to wait */ sleep(0.5); /* Trigger pmd unshare */ madvise(buf, PSIZE, MADV_DONTNEED); /* Kick off proc2 to release the pgtable */ write(pipefd[1], &c, 1); printf("Proc1-thread2 quitting\n"); return NULL; } void proc1(int fd) { pthread_t tid1, tid2; int ret; buf = do_map(fd); if (!buf) return; ret = pthread_create(&tid1, NULL, proc1_thread1, NULL); assert(ret == 0); ret = pthread_create(&tid2, NULL, proc1_thread2, NULL); assert(ret == 0); /* Kick the child to share the PUD entry */ pthread_join(tid1, NULL); pthread_join(tid2, NULL); do_unmap(buf); } int main(void) { int fd, ret; fd = memfd_create("test-huge", MFD_HUGETLB | MFD_HUGE_2MB); if (fd < 0) { perror("open failed"); return -1; } ret = ftruncate(fd, MSIZE); if (ret) { perror("ftruncate() failed"); return -1; } ret = pipe(pipefd); if (ret) { perror("pipe() failed"); return -1; } if (fork()) { proc1(fd); } else { proc2(fd); } close(pipefd[0]); close(pipefd[1]); close(fd); return 0; } ======8<======= The kernel patch needed to present such a race so it'll trigger 100%: ======8<======= : diff --git a/mm/hugetlb.c b/mm/hugetlb.c : index 9d97c9a..f8d99dad5004 100644 : --- a/mm/hugetlb.c : +++ b/mm/hugetlb.c : @@ -38,6 +38,7 @@ : #include <asm/page.h> : #include <asm/pgalloc.h> : #include <asm/tlb.h> : +#include <asm/delay.h> : : #include <linux/io.h> : #include <linux/hugetlb.h> : @@ -6290,6 +6291,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, : bool unshare = false; : int absent; : struct page *page; : + unsigned long c = 0; : : /* : * If we have a pending SIGKILL, don't keep faulting pages and : @@ -6309,6 +6311,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, : */ : pte = huge_pte_offset(mm, vaddr & huge_page_mask(h), : huge_page_size(h)); : + : + pr_info("%s: withhold 1 sec...\n", __func__); : + for (c = 0; c < 100; c++) { : + udelay(10000); : + } : + pr_info("%s: withhold 1 sec...done\n", __func__); : + : if (pte) : ptl = huge_pte_lock(h, mm, pte); : absent = !pte || huge_pte_none(huge_ptep_get(pte)); : ======8<======= It'll trigger use-after-free of the pgtable spinlock: ======8<======= [ 16.959907] follow_hugetlb_page: withhold 1 sec... [ 17.960315] follow_hugetlb_page: withhold 1 sec...done [ 17.960550] ------------[ cut here ]------------ [ 17.960742] DEBUG_LOCKS_WARN_ON(1) [ 17.960756] WARNING: CPU: 3 PID: 542 at kernel/locking/lockdep.c:231 __lock_acquire+0x955/0x1fa0 [ 17.961264] Modules linked in: [ 17.961394] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Not tainted 6.1.0-rc4-peterx+ #46 [ 17.961704] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 17.962266] RIP: 0010:__lock_acquire+0x955/0x1fa0 [ 17.962516] Code: c0 0f 84 5f fe ff ff 44 8b 1d 0f 9a 29 02 45 85 db 0f 85 4f fe ff ff 48 c7 c6 75 50 83 82 48 c7 c7 1b 4b 7d 82 e8 d3 22 d8 00 <0f> 0b 31 c0 4c 8b 54 24 08 4c 8b 04 24 e9 [ 17.963494] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010096 [ 17.963704] RAX: 0000000000000016 RBX: fffffffffd3925a8 RCX: 0000000000000000 [ 17.963989] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff [ 17.964276] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffc90000e4fa58 [ 17.964557] R10: 0000000000000003 R11: ffffffff83162688 R12: 0000000000000000 [ 17.964839] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001 [ 17.965123] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 [ 17.965443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.965672] CR2: 00007f17c09ffef8 CR3: 000000010c87a005 CR4: 0000000000770ee0 [ 17.965956] PKRU: 55555554 [ 17.966068] Call Trace: [ 17.966172] <TASK> [ 17.966268] ? tick_nohz_tick_stopped+0x12/0x30 [ 17.966455] lock_acquire+0xbf/0x2b0 [ 17.966603] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.966799] ? _printk+0x48/0x4e [ 17.966934] _raw_spin_lock+0x2f/0x40 [ 17.967087] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.967285] follow_hugetlb_page.cold+0x75/0x5c4 [ 17.967473] __get_user_pages+0xbb/0x620 [ 17.967635] faultin_vma_page_range+0x9a/0x100 [ 17.967817] madvise_vma_behavior+0x3c0/0xbd0 [ 17.967998] ? mas_prev+0x11/0x290 [ 17.968141] ? find_vma_prev+0x5e/0xa0 [ 17.968304] ? madvise_vma_anon_name+0x70/0x70 [ 17.968486] madvise_walk_vmas+0xa9/0x120 [ 17.968650] do_madvise.part.0+0xfa/0x270 [ 17.968813] __x64_sys_madvise+0x5a/0x70 [ 17.968974] do_syscall_64+0x37/0x90 [ 17.969123] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 17.969329] RIP: 0033:0x7f1840f0efdb [ 17.969477] Code: c3 66 0f 1f 44 00 00 48 8b 15 39 6e 0e 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0d 68 [ 17.970205] RSP: 002b:00007f17c09ffe38 EFLAGS: 00000202 ORIG_RAX: 000000000000001c [ 17.970504] RAX: ffffffffffffffda RBX: 00007f17c0a00640 RCX: 00007f1840f0efdb [ 17.970786] RDX: 0000000000000017 RSI: 0000000000200000 RDI: 00007f1800000000 [ 17.971068] RBP: 00007f17c09ffe50 R08: 0000000000000000 R09: 00007ffd3954164f [ 17.971353] R10: 00007f1840e10348 R11: 0000000000000202 R12: ffffffffffffff80 [ 17.971709] R13: 0000000000000000 R14: 00007ffd39541550 R15: 00007f17c0200000 [ 17.972083] </TASK> [ 17.972199] irq event stamp: 2353 [ 17.972372] hardirqs last enabled at (2353): [<ffffffff8117fe4e>] __up_console_sem+0x5e/0x70 [ 17.972869] hardirqs last disabled at (2352): [<ffffffff8117fe33>] __up_console_sem+0x43/0x70 [ 17.973365] softirqs last enabled at (2330): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160 [ 17.973857] softirqs last disabled at (2323): [<ffffffff810f763d>] __irq_exit_rcu+0xed/0x160 [ 17.974341] ---[ end trace 0000000000000000 ]--- [ 17.974614] BUG: kernel NULL pointer dereference, address: 00000000000000b8 [ 17.975012] #PF: supervisor read access in kernel mode [ 17.975314] #PF: error_code(0x0000) - not-present page [ 17.975615] PGD 103f7b067 P4D 103f7b067 PUD 106cd7067 PMD 0 [ 17.975943] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 17.976197] CPU: 3 PID: 542 Comm: hugetlb-pmd-sha Tainted: G W 6.1.0-rc4-peterx+ #46 [ 17.976712] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 17.977370] RIP: 0010:__lock_acquire+0x190/0x1fa0 [ 17.977655] Code: 98 00 00 00 41 89 46 24 81 e2 ff 1f 00 00 48 0f a3 15 e4 ba dd 02 0f 83 ff 05 00 00 48 8d 04 52 48 c1 e0 06 48 05 c0 d2 f4 83 <44> 0f b6 a0 b8 00 00 00 41 0f b7 46 20 6f [ 17.979170] RSP: 0018:ffffc90000e4fba8 EFLAGS: 00010046 [ 17.979787] RAX: 0000000000000000 RBX: fffffffffd3925a8 RCX: 0000000000000000 [ 17.980838] RDX: 0000000000000002 RSI: ffffffff82863ccf RDI: 00000000ffffffff [ 17.982048] RBP: 0000000000000000 R08: ffff888105eac720 R09: ffffc90000e4fa58 [ 17.982892] R10: ffff888105eab900 R11: ffffffff83162688 R12: 0000000000000000 [ 17.983771] R13: 0000000000000001 R14: ffff888105eac748 R15: 0000000000000001 [ 17.984815] FS: 00007f17c0a00640(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 [ 17.985924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.986265] CR2: 00000000000000b8 CR3: 000000010c87a005 CR4: 0000000000770ee0 [ 17.986674] PKRU: 55555554 [ 17.986832] Call Trace: [ 17.987012] <TASK> [ 17.987266] ? tick_nohz_tick_stopped+0x12/0x30 [ 17.987770] lock_acquire+0xbf/0x2b0 [ 17.988118] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.988575] ? _printk+0x48/0x4e [ 17.988889] _raw_spin_lock+0x2f/0x40 [ 17.989243] ? follow_hugetlb_page.cold+0x75/0x5c4 [ 17.989687] follow_hugetlb_page.cold+0x75/0x5c4 [ 17.990119] __get_user_pages+0xbb/0x620 [ 17.990500] faultin_vma_page_range+0x9a/0x100 [ 17.990928] madvise_vma_behavior+0x3c0/0xbd0 [ 17.991354] ? mas_prev+0x11/0x290 [ 17.991678] ? find_vma_prev+0x5e/0xa0 [ 17.992024] ? madvise_vma_anon_name+0x70/0x70 [ 17.992421] madvise_walk_vmas+0xa9/0x120 [ 17.992793] do_madvise.part.0+0xfa/0x270 [ 17.993166] __x64_sys_madvise+0x5a/0x70 [ 17.993539] do_syscall_64+0x37/0x90 [ 17.993879] entry_SYSCALL_64_after_hwframe+0x63/0xcd ======8<======= Resolution ========== This patchset protects all the huge_pte_offset() callers to also take the vma lock properly. Patch Layout ============ Patch 1-2: cleanup, or dependency of the follow up patches Patch 3: before fixing, document huge_pte_offset() on lock required Patch 4-8: each patch resolves one possible race condition Patch 9: introduce hugetlb_walk() to replace huge_pte_offset() Tests ===== The series is verified with the above reproducer so the race cannot trigger anymore. It also passes all hugetlb kselftests. This patch (of 9): Even though vma_offset_start() is named like that, it's not returning "the start address of the range" but rather the offset we should use to offset the vma->vm_start address. Make it return the real value of the start vaddr, and it also helps for all the callers because whenever the retval is used, it'll be ultimately added into the vma->vm_start anyway, so it's better. Link: https://lkml.kernel.org/r/20221216155100.2043537-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221216155100.2043537-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jann Horn <jannh@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by
pull[bot]. Want to support this open source service? Please star it : )