Skip to content

Conversation

rahrawat
Copy link
Owner

No description provided.

Alexey Dobriyan and others added 30 commits April 5, 2018 21:36
size_index_elem() always works with small sizes (kmalloc caches are
32-bit) and returns small indexes.

Link: http://lkml.kernel.org/r/20180305200730.15812-8-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->remote_node_defrag_ratio is in range 0..1000.

This also adds a check and modifies the behavior to return an error
code.  Before this patch invalid values were ignored.

Link: http://lkml.kernel.org/r/20180305200730.15812-9-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->max_attr_size is maximum length of every SLAB memcg attribute
ever written. VFS limits those to INT_MAX.

Link: http://lkml.kernel.org/r/20180305200730.15812-10-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Padding length can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-11-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->reserved is either 0 or sizeof(struct rcu_head), can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-12-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kmem cache alignment can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-13-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->inuse is "the number of bytes in actual use by the object",
can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-14-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
	/*
	 * cpu_partial determined the maximum number of objects
	 * kept in the per cpu partial lists of a processor.
	 */

Can't be negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->offset is free pointer offset from the start of the object, can't be
negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-16-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux doesn't support negative length objects.

Link: http://lkml.kernel.org/r/20180305200730.15812-17-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux doesn't support negative length objects (including meta data).

Link: http://lkml.kernel.org/r/20180305200730.15812-18-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that all sizes are properly typed, propagate "unsigned int" down the
callgraph.

Link: http://lkml.kernel.org/r/20180305200730.15812-19-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If SLAB doesn't support 4GB+ kmem caches (it never did), KASAN should
not do it as well.

Link: http://lkml.kernel.org/r/20180305200730.15812-20-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kmem case sizes are 32-bit, then usecopy region should be too.

Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab_index() returns index of an object within a slab which is at most
u15 (or u16?).

Iterators additionally guarantee that "p >= addr".

Link: http://lkml.kernel.org/r/20180305200730.15812-22-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct kmem_cache_order_objects is for mixing order and number of
objects, and orders aren't big enough to warrant 64-bit width.

Propagate unsignedness down so that everything fits.

!!! Patch assumes that "PAGE_SIZE << order" doesn't overflow. !!!

Link: http://lkml.kernel.org/r/20180305200730.15812-23-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function returns size of the object without red zone which can't be
negative.

Link: http://lkml.kernel.org/r/20180305200730.15812-24-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SLAB doesn't support 4GB+ of objects per slab, therefore randomization
doesn't need size_t.

Link: http://lkml.kernel.org/r/20180305200730.15812-25-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I have noticed on debug kernel with SLAB, the size of some non-root
slabs were larger than their corresponding root slabs.

e.g. for radix_tree_node:
  $cat /proc/slabinfo | grep radix
  name     <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
  radix_tree_node 15052    15075      4096         1             1 ...

  $cat /cgroup/memory/temp/memory.kmem.slabinfo | grep radix
  name     <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
  radix_tree_node 1581      158       4120         1             2 ...

However for SLUB in debug kernel, the sizes were same.  On further
inspection it is found that SLUB always use kmem_cache.object_size to
measure the kmem_cache.size while SLAB use the given kmem_cache.size.
In the debug kernel the slab's size can be larger than its object_size.
Thus in the creation of non-root slab, the SLAB uses the root's size as
base to calculate the non-root slab's size and thus non-root slab's size
can be larger than the root slab's size.  For SLUB, the non-root slab's
size is measured based on the root's object_size and thus the size will
remain same for root and non-root slab.

This patch makes slab's object_size the default base to measure the
slab's size.

Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com
Fixes: 794b124 ("memcg, slab: separate memcg vs root cache creation paths")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit db265ec ("mm/sl[aou]b: Move duping of slab name to
slab_common.c"), the kernel always duplicates the slab cache name when
creating a slab cache, so the test if the slab name is accessible is
useless.

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kasan quarantine is designed to delay freeing slab objects to catch
use-after-free.  The quarantine can be large (several percent of machine
memory size).  When kmem_caches are deleted related objects are flushed
from the quarantine but this requires scanning the entire quarantine
which can be very slow.  We have seen the kernel busily working on this
while holding slab_mutex and badly affecting cache_reaper, slabinfo
readers and memcg kmem cache creations.

It can easily reproduced by following script:

	yes . | head -1000000 | xargs stat > /dev/null
	for i in `seq 1 10`; do
		seq 500 | (cd /cg/memory && xargs mkdir)
		seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
			/cg/memory/{}/tasks && exec stat .' > /dev/null
		seq 500 | (cd /cg/memory && xargs rmdir)
	done

The busy stack:
    kasan_cache_shutdown
    shutdown_cache
    memcg_destroy_kmem_caches
    mem_cgroup_css_free
    css_free_rwork_fn
    process_one_work
    worker_thread
    kthread
    ret_from_fork

This patch is based on the observation that if the kmem_cache to be
destroyed is empty then there should not be any objects of this cache in
the quarantine.

Without the patch the script got stuck for couple of hours.  With the
patch the script completed within a second.

Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
stable_node_dup() is local to the source and does not need to be in
global scope, so make it static.

Cleans up sparse warning:

  mm/ksm.c:1321:13: warning: symbol 'stable_node_dup' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180206221005.12642-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation for ignore_rlimit_data says that it will print a
warning at first misuse.  Yet it doesn't seem to do that.

Fix the code to print the warning even when we allow the process to
continue.

Link: http://lkml.kernel.org/r/1517935505-9321-1-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_contig_range() initiates compaction and eventual migration for the
purpose of either CMA or HugeTLB allocations.  At present, the reason
code remains the same MR_CMA for either of these cases.  Let's make it
MR_CONTIG_RANGE which will appropriately reflect the reason code in both
these cases.

Link: http://lkml.kernel.org/r/20180202091518.18798-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For mm/swap_slots.c, use the traditional Linux method of conditional
compilation and linking instead of always compiling it by using #ifdef
CONFIG_SWAP and #endif for the entire source file (excluding header
files).

Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka reported about a window issue during which when deferred
pages are initialized, and the current version of on-demand
initialization is finished, allocations may fail.  While this is highly
unlikely scenario, since this kind of allocation request must be large,
and must come from interrupt handler, we still want to cover it.

We solve this by initializing deferred pages with interrupts disabled,
and holding node_size_lock spin lock while pages in the node are being
initialized.  The on-demand deferred page initialization that comes
later will use the same lock, and thus synchronize with
deferred_init_memmap().

It is unlikely for threads that initialize deferred pages to be
interrupted.  They run soon after smp_init(), but before modules are
initialized, and long before user space programs.  This is why there is
no adverse effect of having these threads running with interrupts
disabled.

[pasha.tatashin@oracle.com: v6]
  Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the
rest later on.

It is, however, problematic to know how many pages the kernel needs
during boot.  Different modules and kernel parameters may change the
requirement, so the boot cpu either initializes too many pages or runs
out of memory.

To fix that, initialize early pages on demand.  This ensures the kernel
does the minimum amount of work to initialize pages during boot and
leaves the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).

The on-demand code is permanently disabled using static branching once
deferred pages are initialized.  After the static branch is changed to
false, the overhead is up-to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.

Sergey Senozhatsky noticed that while deferred pages currently make
sense only on NUMA machines (we start one thread per latency node),
CONFIG_NUMA is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT,
so that is also must be addressed in the patch.

[akpm@linux-foundation.org: fix typo in comment, make deferred_pages static]
[pasha.tatashin@oracle.com: fix min() type mismatch warning]
  Link: http://lkml.kernel.org/r/20180212164543.26592-1-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: use zone_to_nid() in deferred_grow_zone()]
  Link: http://lkml.kernel.org/r/20180214163343.21234-2-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: might_sleep warning]
  Link: http://lkml.kernel.org/r/20180306192022.28289-1-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: s/spin_lock/spin_lock_irq/ in page_alloc_init_late()]
[pasha.tatashin@oracle.com: v5]
  Link: http://lkml.kernel.org/r/20180309220807.24961-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: tweak comments]
[pasha.tatashin@oracle.com: v6]
  Link: http://lkml.kernel.org/r/20180313182355.17669-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180209192216.20509-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
…_pte_refs_one()

For PTE-mapped THP, the compound THP has not been split to normal 4K
pages yet, the whole THP is considered referenced if any one of sub page
is referenced.

When walking PTE-mapped THP by pvmw, all relevant PTEs will be checked
to retrieve referenced bit.  But, the current code just returns the
result of the last PTE.  If the last PTE has not referenced, the
referenced flag will be cleared.

Just set referenced when ptep{pmdp}_clear_young_notify() returns true.

Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "optimize memory hotplug", v3.

This patchset:

 - Improves hotplug performance by eliminating a number of struct page
   traverses during memory hotplug.

 - Fixes some issues with hotplugging, where boundaries were not
   properly checked. And on x86 block size was not properly aligned with
   end of memory

 - Also, potentially improves boot performance by eliminating condition
   from __init_single_page().

 - Adds robustness by verifying that that struct pages are correctly
   poisoned when flags are accessed.

The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
2.60GHz with 1T RAM:

booting in qemu with 960G of memory, time to initialize struct pages:

no-kvm:
	TRY1		TRY2
BEFORE:	39.433668	39.39705
AFTER:	36.903781	36.989329

with-kvm:
BEFORE:	10.977447	11.103164
AFTER:	10.929072	10.751885

Hotplug 896G memory:
no-kvm:
	TRY1		TRY2
BEFORE: 848.740000	846.910000
AFTER:  783.070000	786.560000

with-kvm:
	TRY1		TRY2
BEFORE: 34.410000	33.57
AFTER:	29.810000	29.580000

This patch (of 6):

Start qemu with the following arguments:

  -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

Which: boots machine with 64G, and adds a device mem1 with 2G which can
be hotplugged later.

Also make sure that config has the following turned on:
  CONFIG_MEMORY_HOTPLUG
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
  CONFIG_ACPI_HOTPLUG_MEMORY

Using the qemu monitor hotplug the memory (make sure config has (qemu)
device_add pc-dimm,id=dimm1,memdev=mem1

The operation will fail with the following trace:

    WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
    pages_correctly_reserved+0xe6/0x110
    Modules linked in:
    CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
    BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:pages_correctly_reserved+0xe6/0x110
    Call Trace:
     memory_subsys_online+0x44/0xa0
     device_online+0x51/0x80
     store_mem_state+0x5e/0xe0
     kernfs_fop_write+0xfa/0x170
     __vfs_write+0x2e/0x150
     vfs_write+0xa8/0x1a0
     SyS_write+0x4d/0xb0
     do_syscall_64+0x5d/0x110
     entry_SYSCALL_64_after_hwframe+0x21/0x86
    ---[ end trace 6203bc4f1a5d30e8 ]---

The problem is detected in: drivers/base/memory.c

   static bool pages_correctly_reserved(unsigned long start_pfn)
   205                 if (WARN_ON_ONCE(!pfn_valid(pfn)))

This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.

The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000

or

   $ dmesg | grep "block size"
   [    0.086469] x86/mm: Memory block size: 2048MB

During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations.  See:
Documentation/memory-hotplug.txt

So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.

In our case the start_pfn starts from the last_pfn (end of physical
memory).

   $ dmesg | grep last_pfn
   [    0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

0x1040000 == 65G, and so is not 2G aligned!

The fix is to enforce that memory that is hotplugged and hotremoved is
block size aligned.

With this fix, running the above sequence yield to the following result:

   (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
   Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
   							size 0x80000000
   acpi PNP0C80:00: add_memory failed
   acpi PNP0C80:00: acpi_memory_enable_device() error
   acpi PNP0C80:00: Enumeration failure

Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
…memory

Memory sections are combined into "memory block" chunks.  These chunks
are the units upon which memory can be added and removed.

On x86, the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hot-plugging/hot-removing can be broken.

Memory sections are combined into "memory block" chunks.  These chunks
are the units upon which memory can be added and removed.

On x86 the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hotplugging/hotremoving can be broken.

Currently, whenever machine is booted with more than 64G the block size
is unconditionally increased to 2G from the base 128M.  This is done in
order to reduce number of memory device files in sysfs:

	/sys/devices/system/memory/memoryXXX

We must use the largest allowed block size that aligns to the next
address to be able to hotplug the next block of memory.

So, when memory is larger or equal to 64G, we check the end address and
find the largest block size that is still power of two but smaller or
equal to 2G.

Before, the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
							size 0x80000000
acpi PNP0C80:00: add_memory failed
acpi PNP0C80:00: acpi_memory_enable_device() error
acpi PNP0C80:00: Enumeration failure

With the fix memory is added successfully as the block size is set to
1G, and therefore aligns with start address 0x1040000000.

[pasha.tatashin@oracle.com: v4]
  Link: http://lkml.kernel.org/r/20180215165920.8570-3-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180213193159.14606-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vinod Koul and others added 17 commits April 10, 2018 08:55
KTHREAD_SIZE has never been used since it has been defined for c6x arch.
Let's remove this useless definition.

Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Signed-off-by: Mark Salter <msalter@redhat.com>
Fix build error reported by the 0day bot by including the header
file for that macro.

Fixes this build error: (should fix; not tested)
arch/c6x/platforms/plldata.c: In function 'c6472_setup_clocks':
arch/c6x/platforms/plldata.c:279:33: error: implicit declaration of function 'get_coreid'; did you mean 'get_order'? [-Werror=implicit-function-declaration]
      c6x_core_clk.parent = &sysclks[get_coreid() + 1];

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: linux-c6x-dev@linux-c6x.org
Cc: Ingo Molnar <mingo@kernel.org>

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mark Salter <msalter@redhat.com>
c6x depends on the macro '_BIG_ENDIAN' being defined or not
to correctly select or define endian-specific macros, structures
or pieces of code.

This macro is predefined by the compiler but sparse knows nothing
about it and thus may pre-process files differently from what
gcc would.

Fix this by passing '-D_BIG_ENDIAN' when compiling a big-endian
kernel, like GCC would have done.

To: Mark Salter <msalter@redhat.com>
To: Aurelien Jacquiot <a-jacquiot@ti.com>
CC: linux-c6x-dev@linux-c6x.org
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Mark Salter <msalter@redhat.com>
…l/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
 "A series of media updates/fixes for 4.17.

  There are two important core fix patches in this series:

   - A regression fix on Kernel 4.16 with causes it to not work with
     some input devices that depend on media core

   - A fix at compat32 bits with causes it to OOPS on overlay, and
     affects the Kernels where the CVE-2017-13166 was backported

  The remaining ones are other random fixes at the documentation and on
  drivers.

  The biggest part of this series is a set of 18 patches for the Intel
  atomisp driver. Currently, it produces hundreds of warnings/errors on
  sparse/smatch, causing me to sometimes ignore new warnings on other
  drivers that are not so broken. This driver is on really poor state,
  even for staging standards: it has several layers of abstraction on
  it, and it supports two different hardware. Selecting between them
  require to add a define (there isn't even a Kconfig option for such
  purpose). Just on this smatch cleanup, I could easily get rid of 8
  "do-nothing" files. So, I'm seriously considering its removal from
  upstream, if I don't see any real work on addressing the problems
  there along this year"

* tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (48 commits)
  media: v4l2-core: fix size of devnode_nums[] bitarray
  media: v4l2-compat-ioctl32: don't oops on overlay
  media: i2c: adv748x: afe: fix sparse warning
  media: extended-controls.rst: transmitter -> receiver
  media: staging: atomisp: stop duplicating input format types
  media: staging: atomisp: get rid of an unused var
  media: staging: atomisp: stop mixing enum types
  media: staging: atomisp: get rid of some static warnings
  media: staging: atomisp: use %p to print pointers
  media: staging: atomisp: remove an useless check
  media: staging: atomisp: avoid a warning if 32 bits build
  media: staging: atomisp: don't access a NULL var
  media: staging: atomisp: Get rid of *default.host.[ch]
  media: staging: atomisp: get rid of an unused function
  media: staging: atomisp: remove unused set_pd_base()
  media: staging: atomisp: fix endianess issues
  media: staging: atomisp: add a missing include
  media: staging: atomisp: get rid of stupid statements
  media: staging: atomisp: declare static vars as such
  media: staging: atomisp: ia_css_output.host: don't use var before check
  ...
…kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "The main purpose of this pull request is a fix for a regression in the
  recent PCM OSS emulation code that may lead to RCU stall. Since
  syzkaller hits this too often, I send the pull request now with a
  minimal collection. Possibly another pull request may follow before
  RC1.

  The other fixes here are for USB-audio class 2 and 3 to improve the
  parser for the clock descriptors. These are rather cleanups but good
  for security, too.

  Last but not least, another included fix is the trivial one to remove
  superfluous WARN_ON() that annoyed syzbot"

* tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: pcm: Remove WARN_ON() at snd_pcm_hw_params() error
  ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
  ALSA: usb-audio: Add sanity checks in UAC3 clock parsers
  ALSA: usb-audio: More strict sanity checks for clock parsers
  ALSA: usb-audio: Refactor clock finder helpers
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
 "There is nothing really major here, just a couple of small bugfixes,
  improvements and cleanups:

   - make it possible to load radeonfb driver when offb driver is loaded
     first (Mathieu Malaterre)

   - fix memory leak in offb driver (Mathieu Malaterre)

   - fix unaligned access in udlfb driver (Ladislav Michl)

   - convert atmel_lcdfb driver to use GPIO descriptors (Ludovic
     Desroches)

   - avoid mismatched prototypes in sisfb driver (Arnd Bergmann)

   - remove VLA usage from viafb driver (Gustavo A. R. Silva)

   - add missing help text to FB_I810_I2 config option (Ulf Magnusson)

   - misc fixes (Gustavo A. R. Silva, Colin Ian King, Markus Elfring)

   - remove dead code from s3c-fb driver for Exynos and S5PV210
     platforms

   - misc cleanups (Corentin Labbe, Ladislav Michl, Ulf Magnusson,
     Vladimir Zapolskiy, Markus Elfring)"

* tag 'fbdev-v4.17' of git://github.com/bzolnier/linux: (32 commits)
  video: fbdev: s3c-fb: remove dead platform code for Exynos and S5PV210 platforms
  video: au1100fb: Delete an unnecessary variable initialisation in au1100fb_drv_probe()
  video: au1100fb: Improve a size determination in au1100fb_drv_probe()
  video: au1100fb: Delete an error message for a failed memory allocation in au1100fb_drv_probe()
  video/console/sticore: Delete an error message for a failed memory allocation in sti_try_rom_generic()
  video: ARM CLCD: Improve a size determination in clcdfb_probe()
  video: ARM CLCD: Delete an error message for a failed memory allocation in clcdfb_probe()
  video: matroxfb: Delete an error message for a failed memory allocation in matroxfb_crtc2_probe()
  video: s3c-fb: Improve a size determination in s3c_fb_probe()
  video: s3c-fb: Delete an error message for a failed memory allocation in s3c_fb_probe()
  video: fsl-diu-fb: Delete an error message for a failed memory allocation in fsl_diu_init()
  video: ssd1307fb: Improve a size determination in ssd1307fb_probe()
  video: smscufx: Delete an error message for a failed memory allocation in ufx_realloc_framebuffer()
  video: smscufx: Return an error code only as a constant in ufx_realloc_framebuffer()
  video: smscufx: Less checks in ufx_usb_probe() after error detection
  video: udlfb: Return an error code only as a constant in dlfb_realloc_framebuffer()
  video/fbdev/stifb: Delete an error message for a failed memory allocation in stifb_init_fb()
  video/fbdev/stifb: Return -ENOMEM after a failed kzalloc() in stifb_init_fb()
  video: fbdev: aty128fb: use true and false for boolean values
  fbdev: aty: fix missing indentation in if statement
  ...
…/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "This contains a few series that have been in preparation for a while
  and that will help systems with RTCs that will fail in 2038, 2069 or
  2100.

  Subsystem:
   - Add tracepoints
   - Rework of the RTC/nvmem API to allow drivers to discard struct
     nvmem_config after registration
   - New range API, drivers can now expose the useful range of the RTC
   - New offset API the core is now able to add an offset to the RTC
     time, modifying the supported range.
   - Multiple rtc_time64_to_tm fixes
   - Handle time_t overflow on 32 bit platforms in the core instead of
     letting drivers do crazy things.
   - remove rtc_control API

  New driver:
   - Intersil ISL12026

  Drivers:
   - Drivers exposing the RTC non volatile memory have been converted to
     use nvmem
   - Removed useless time and date validation
   - Removed an indirection pattern that was a cargo cult from ancient
     drivers
   - Removed VLA usage
   - Fixed a possible race condition in probe functions
   - AB8540 support is dropped from ab8500
   - pcf85363 now has alarm support"

* tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (128 commits)
  rtc: snvs: Fix usage of snvs_rtc_enable
  rtc: mt7622: fix module autoloading for OF platform drivers
  rtc: isl12022: use true and false for boolean values
  rtc: ab8500: Drop AB8540 support
  rtc: remove a warning during scripts/kernel-doc step
  rtc: 88pm860x: remove artificial limitation
  rtc: 88pm80x: remove artificial limitation
  rtc: st-lpc: remove artificial limitation
  rtc: mrst: remove artificial limitation
  rtc: mv: remove artificial limitation
  rtc: hctosys: Ensure system time doesn't overflow time_t
  parisc: time: stop validating rtc_time in .read_time
  rtc: pcf85063: fix clearing bits in pcf85063_start_clock
  rtc: at91sam9: Set name of regmap_config
  rtc: s5m: Remove VLA usage
  rtc: s5m: Move enum from rtc.h to rtc-s5m.c
  rtc: remove VLA usage
  rtc: Add useful timestamp definitions
  rtc: Add one offset seconds to expand RTC range
  rtc: Factor out the RTC range validation into rtc_valid_range()
  ...
…kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This cycle was was not something I ever want to repeat as there were
  several late changes that have only now just settled.

  Half of the branch up to commit d2c997c ("fs, dax: use
  page->mapping to warn...") have been in -next for several releases.
  The of_pmem driver and the address range scrub rework were late
  arrivals, and the dax work was scaled back at the last moment.

  The of_pmem driver missed a previous merge window due to an oversight.
  A sense of obligation to rectify that miss is why it is included for
  4.17. It has acks from PowerPC folks. Stephen reported a build failure
  that only occurs when merging it with your latest tree, for now I have
  fixed that up by disabling modular builds of of_pmem. A test merge
  with your tree has received a build success report from the 0day robot
  over 156 configs.

  An initial version of the ARS rework was submitted before the merge
  window. It is self contained to libnvdimm, a net code reduction, and
  passing all unit tests.

  The filesystem-dax changes are based on the wait_var_event()
  functionality from tip/sched/core. However, late review feedback
  showed that those changes regressed truncate performance to a large
  degree. The branch was rewound to drop the truncate behavior change
  and now only includes preparation patches and cleanups (with full acks
  and reviews). The finalization of this dax-dma-vs-trnucate work will
  need to wait for 4.18.

  Summary:

   - A rework of the filesytem-dax implementation provides for detection
     of unmap operations (truncate / hole punch) colliding with
     in-progress device-DMA. A fix for these collisions remains a
     work-in-progress pending resolution of truncate latency and
     starvation regressions.

   - The of_pmem driver expands the users of libnvdimm outside of x86
     and ACPI to describe an implementation of persistent memory on
     PowerPC with Open Firmware / Device tree.

   - Address Range Scrub (ARS) handling is completely rewritten to
     account for the fact that ARS may run for 100s of seconds and there
     is no platform defined way to cancel it. ARS will now no longer
     block namespace initialization.

   - The NVDIMM Namespace Label implementation is updated to handle
     label areas as small as 1K, down from 128K.

   - Miscellaneous cleanups and updates to unit test infrastructure"

* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
  libnvdimm, of_pmem: workaround OF_NUMA=n build error
  nfit, address-range-scrub: add module option to skip initial ars
  nfit, address-range-scrub: rework and simplify ARS state machine
  nfit, address-range-scrub: determine one platform max_ars value
  powerpc/powernv: Create platform devs for nvdimm buses
  doc/devicetree: Persistent memory region bindings
  libnvdimm: Add device-tree based driver
  libnvdimm: Add of_node to region and bus descriptors
  libnvdimm, region: quiet region probe
  libnvdimm, namespace: use a safe lookup for dimm device name
  libnvdimm, dimm: fix dpa reservation vs uninitialized label area
  libnvdimm, testing: update the default smart ctrl_temperature
  libnvdimm, testing: Add emulation for smart injection commands
  nfit, address-range-scrub: introduce nfit_spa->ars_state
  libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
  nfit, address-range-scrub: fix scrub in-progress reporting
  dax, dm: allow device-mapper to operate without dax support
  dax: introduce CONFIG_DAX_DRIVER
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  ext2, dax: introduce ext2_dax_aops
  ...
…git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "New features:

   - Tom Zanussi's extended histogram work.

     This adds the synthetic events to have histograms from multiple
     event data Adds triggers "onmatch" and "onmax" to call the
     synthetic events Several updates to the histogram code from this

   - Allow way to nest ring buffer calls in the same context

   - Allow absolute time stamps in ring buffer

   - Rewrite of filter code parsing based on Al Viro's suggestions

   - Setting of trace_clock to global if TSC is unstable (on boot)

   - Better OOM handling when allocating large ring buffers

   - Added initcall tracepoints (consolidated initcall_debug code with
     them)

  And other various fixes and clean ups"

* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
  init: Have initcall_debug still work without CONFIG_TRACEPOINTS
  init, tracing: Have printk come through the trace events for initcall_debug
  init, tracing: instrument security and console initcall trace events
  init, tracing: Add initcall trace events
  tracing: Add rcu dereference annotation for test func that touches filter->prog
  tracing: Add rcu dereference annotation for filter->prog
  tracing: Fixup logic inversion on setting trace_global_clock defaults
  tracing: Hide global trace clock from lockdep
  ring-buffer: Add set/clear_current_oom_origin() during allocations
  ring-buffer: Check if memory is available before allocation
  lockdep: Add print_irqtrace_events() to __warn
  vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
  tracing: Uninitialized variable in create_tracing_map_fields()
  tracing: Make sure variable string fields are NULL-terminated
  tracing: Add action comparisons when testing matching hist triggers
  tracing: Don't add flag strings when displaying variable references
  tracing: Fix display of hist trigger expressions containing timestamps
  ftrace: Drop a VLA in module_exists()
  tracing: Mention trace_clock=global when warning about unstable clocks
  tracing: Default to using trace_global_clock if sched_clock is unstable
  ...
…t/jhogan/mips

Pull MIPS updates from James Hogan:
 "These are the main MIPS changes for 4.17. Rough overview:

   (1) generic platform: Add support for Microsemi Ocelot SoCs

   (2) crypto: Add CRC32 and CRC32C HW acceleration module

   (3) Various cleanups and misc improvements

  More detailed summary:

  Miscellaneous:
   - hang more efficiently on halt/powerdown/restart
   - pm-cps: Block system suspend when a JTAG probe is present
   - expand make help text for generic defconfigs
   - refactor handling of legacy defconfigs
   - determine the entry point from the ELF file header to fix microMIPS
     for certain toolchains
   - introduce isa-rev.h for MIPS_ISA_REV and use to simplify other code

  Minor cleanups:
   - DTS: boston/ci20: Unit name cleanups and correction
   - kdump: Make the default for PHYSICAL_START always 64-bit
   - constify gpio_led in Alchemy, AR7, and TXX9
   - silence a couple of W=1 warnings
   - remove duplicate includes

  Platform support:
  Generic platform:
   - add support for Microsemi Ocelot
   - dt-bindings: Add vendor prefix for Microsemi Corporation
   - dt-bindings: Add bindings for Microsemi SoCs
   - add ocelot SoC & PCB123 board DTS files
   - MAINTAINERS: Add entry for Microsemi MIPS SoCs
   - enable crc32-mips on r6 configs

  ath79:
   - fix AR724X_PLL_REG_PCIE_CONFIG offset

  BCM47xx:
   - firmware: Use mac_pton() for MAC address parsing
   - add Luxul XAP1500/XWR1750 WiFi LEDs
   - use standard reset button for Luxul XWR-1750

  BMIPS:
   - enable CONFIG_BRCMSTB_PM in bmips_stb_defconfig for build coverage
   - add STB PM, wake-up timer, watchdog DT nodes

  Octeon:
   - drop '.' after newlines in printk calls

  ralink:
   - pci-mt7621: Enable PCIe on MT7688"

* tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: (37 commits)
  MIPS: BCM47XX: Use standard reset button for Luxul XWR-1750
  MIPS: BCM47XX: Add Luxul XAP1500/XWR1750 WiFi LEDs
  MIPS: Make the default for PHYSICAL_START always 64-bit
  MIPS: Use the entry point from the ELF file header
  MAINTAINERS: Add entry for Microsemi MIPS SoCs
  MIPS: generic: Add support for Microsemi Ocelot
  MIPS: mscc: Add ocelot PCB123 device tree
  MIPS: mscc: Add ocelot dtsi
  dt-bindings: mips: Add bindings for Microsemi SoCs
  dt-bindings: Add vendor prefix for Microsemi Corporation
  MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
  MIPS: pci-mt7620: Enable PCIe on MT7688
  MIPS: pm-cps: Block system suspend when a JTAG probe is present
  MIPS: VDSO: Replace __mips_isa_rev with MIPS_ISA_REV
  MIPS: BPF: Replace __mips_isa_rev with MIPS_ISA_REV
  MIPS: cpu-features.h: Replace __mips_isa_rev with MIPS_ISA_REV
  MIPS: Introduce isa-rev.h to define MIPS_ISA_REV
  MIPS: Hang more efficiently on halt/powerdown/restart
  FIRMWARE: bcm47xx_nvram: Replace mac address parsing
  MIPS: BMIPS: Add Broadcom STB watchdog nodes
  ...
…pstreaming

Pull c6x updates from Mark Salter.

* tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
  c6x: pass endianness info to sparse
  c6x: fix platforms/plldata.c get_coreid build error
  c6x: remove unused KTHREAD_SIZE definition
Pull rpmsg updates from Bjorn Andersson:

 - transition the rpmsg_trysend() code paths of SMD and GLINK to use
   non-sleeping locks

 - revert the overly optimistic handling of discovered SMD channels

 - fix an issue in SMD where incoming messages race with the probing of
   a client driver

* tag 'rpmsg-v4.17' of git://github.com/andersson/remoteproc:
  rpmsg: smd: Use announce_create to process any receive work
  rpmsg: Only invoke announce_create for rpdev with endpoints
  rpmsg: smd: Fix container_of macros
  Revert "rpmsg: smd: Create device for all channels"
  rpmsg: glink: Use spinlock in tx path
  rpmsg: smd: Use spinlock in tx path
  rpmsg: smd: use put_device() if device_register fail
  rpmsg: glink: use put_device() if device_register fail
Pull remoteproc updates from Bjorn Andersson:

 - add support for generating coredumps for remoteprocs using
   devcoredump

 - add the Qualcomm sysmon driver for intra-remoteproc crash handling

 - a number of fixes in Qualcomm and IMX drivers

* tag 'rproc-v4.17' of git://github.com/andersson/remoteproc:
  remoteproc: fix null pointer dereference on glink only platforms
  soc: qcom: qmi: add CONFIG_NET dependency
  remoteproc: imx_rproc: Slightly simplify code in 'imx_rproc_probe()'
  remoteproc: imx_rproc: Re-use existing error handling path in 'imx_rproc_probe()'
  remoteproc: imx_rproc: Fix an error handling path in 'imx_rproc_probe()'
  samples: Introduce Qualcomm QMI sample client
  remoteproc: qcom: Introduce sysmon
  remoteproc: Pass type of shutdown to subdev remove
  remoteproc: qcom: Register segments for core dump
  soc: qcom: mdt-loader: Return relocation base
  remoteproc: Rename "load_rsc_table" to "parse_fw"
  remoteproc: Add remote processor coredump support
  remoteproc: Remove null character write of shared mem
…/slave-dma

Pull dmaengine updates from Vinod Koul:
 "This time we have couple of new drivers along with updates to drivers:

   - new drivers for the DesignWare AXI DMAC and MediaTek High-Speed DMA
     controllers

   - stm32 dma and qcom bam dma driver updates

   - norandom test option for dmatest"

* tag 'dmaengine-4.17-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (30 commits)
  dmaengine: stm32-dma: properly mask irq bits
  dmaengine: stm32-dma: fix max items per transfer
  dmaengine: stm32-dma: fix DMA IRQ status handling
  dmaengine: stm32-dma: Improve memory burst management
  dmaengine: stm32-dma: fix typo and reported checkpatch warnings
  dmaengine: stm32-dma: fix incomplete configuration in cyclic mode
  dmaengine: stm32-dma: threshold manages with bitfield feature
  dt-bindings: stm32-dma: introduce DMA features bitfield
  dt-bindings: rcar-dmac: Document r8a77470 support
  dmaengine: rcar-dmac: Fix too early/late system suspend/resume callbacks
  dmaengine: dw-axi-dmac: fix spelling mistake: "catched" -> "caught"
  dmaengine: edma: Check the memory allocation for the memcpy dma device
  dmaengine: at_xdmac: fix rare residue corruption
  dmaengine: mediatek: update MAINTAINERS entry with MediaTek DMA driver
  dmaengine: mediatek: Add MediaTek High-Speed DMA controller for MT7622 and MT7623 SoC
  dt-bindings: dmaengine: Add MediaTek High-Speed DMA controller bindings
  dt-bindings: Document the Synopsys DW AXI DMA bindings
  dmaengine: Introduce DW AXI DMAC driver
  dmaengine: pl330: fix a race condition in case of threaded irqs
  dmaengine: imx-sdma: fix pagefault when channel is disabled during interrupt
  ...
…inux-platform-drivers-x86

Pull x86 platform driver updates from Andy Shevchenko:

 - Dell SMBIOS driver fixed against memory leaks.

 - The fujitsu-laptop driver is cleaned up and now supports hotkeys for
   Lifebook U7x7 models. Besides that the typo introduced by one of
   previous clean up series has been fixed.

 - Specific to x86-based laptops HID device now supports
   KEY_ROTATE_LOCK_TOGGLE event which is emitted, for example, by Wacom
   MobileStudio Pro 13.

 - Turbo MAX 3 technology is enabled for the rest of platforms that
   support Hardware-P-States feature which have core priority described
   by ACPI CPPC table.

 - Mellanox on x86 gets better support of I2C bus in use including
   support of hotpluggable ones.

 - Silead touchscreen is enabled on two tablet models, i.e Yours Y8W81
   and I.T.Works TW701.

 - From now on the second fan on Thinkpad P50 is supported.

 - The topstar-laptop driver is reworked to support new models, in
   particular Topstar U931.

* tag 'platform-drivers-x86-v4.17-1' of git://git.infradead.org/linux-platform-drivers-x86: (41 commits)
  platform/x86: thinkpad_acpi: Add 2nd Fan Support for Thinkpad P50
  platform/x86: dell-smbios: Fix memory leaks in build_tokens_sysfs()
  intel-hid: support KEY_ROTATE_LOCK_TOGGLE
  intel-hid: clean up and sort header files
  platform/x86: silead_dmi: Add entry for the Yours Y8W81 tablet
  platform/x86: fujitsu-laptop: Support Lifebook U7x7 hotkeys
  platform/x86: mlx-platform: Add physical bus number auto detection
  platform/mellanox: mlxreg-hotplug: Change input for device create routine
  platform/x86: mlx-platform: Add deffered bus functionality
  platform/x86: mlx-platform: Use define for the channel numbers
  platform/x86: fujitsu-laptop: Revert UNSUPPORTED_CMD back to an int
  platform/x86: Fix dell driver init order
  platform/x86: dell-smbios: Resolve dependency error on ACPI_WMI
  platform/x86: dell-smbios: Resolve dependency error on DCDBAS
  platform/x86: Allow for SMBIOS backend defaults
  platform/x86: dell-smbios: Link all dell-smbios-* modules together
  platform/x86: dell-smbios: Rename dell-smbios source to dell-smbios-base
  platform/x86: dell-smbios: Correct some style warnings
  platform/x86: wmi: Fix misuse of vsprintf extension %pULL
  platform/x86: intel-hid: Reset wakeup capable flag on removal
  ...
Pull ceph updates from Ilya Dryomov:
 "The big ticket items are:

   - support for rbd "fancy" striping (myself).

     The striping feature bit is now fully implemented, allowing mapping
     v2 images with non-default striping patterns. This completes
     support for --image-format 2.

   - CephFS quota support (Luis Henriques and Zheng Yan).

     This set is based on the new SnapRealm code in the upcoming v13.y.z
     ("Mimic") release. Quota handling will be rejected on older
     filesystems.

   - memory usage improvements in CephFS (Chengguang Xu).

     Directory specific bits have been split out of ceph_file_info and
     some effort went into improving cap reservation code to avoid OOM
     crashes.

  Also included a bunch of assorted fixes all over the place from
  Chengguang and others"

* tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client: (67 commits)
  ceph: quota: report root dir quota usage in statfs
  ceph: quota: add counter for snaprealms with quota
  ceph: quota: cache inode pointer in ceph_snap_realm
  ceph: fix root quota realm check
  ceph: don't check quota for snap inode
  ceph: quota: update MDS when max_bytes is approaching
  ceph: quota: support for ceph.quota.max_bytes
  ceph: quota: don't allow cross-quota renames
  ceph: quota: support for ceph.quota.max_files
  ceph: quota: add initial infrastructure to support cephfs quotas
  rbd: remove VLA usage
  rbd: fix spelling mistake: "reregisteration" -> "reregistration"
  ceph: rename function drop_leases() to a more descriptive name
  ceph: fix invalid point dereference for error case in mdsc destroy
  ceph: return proper bool type to caller instead of pointer
  ceph: optimize memory usage
  ceph: optimize mds session register
  libceph, ceph: add __init attribution to init funcitons
  ceph: filter out used flags when printing unused open flags
  ceph: don't wait on writeback when there is no more dirty pages
  ...
@rahrawat rahrawat merged commit 827fece into rahrawat:master Apr 11, 2018
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
We have canceled dma map for ppgtt entries. Also we need to do it for
ggtt entries when them are invalidated.

This can fix task hung issue as:
[13517.791767] INFO: task gvt_service_thr:1081 blocked for more than 120 seconds.
[13517.792584] Not tainted 4.14.15+ #3
[13517.793417] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13517.794267] gvt_service_thr D 0 1081 2 0x80000000
[13517.795132] Call Trace:
[13517.795996] ? __schedule+0x493/0x77b
[13517.796859] schedule+0x79/0x82
[13517.797740] schedule_preempt_disabled+0x5/0x6
[13517.798614] __mutex_lock.isra.0+0x2b5/0x445
[13517.799504] ? __switch_to_asm+0x24/0x60
[13517.800381] ? intel_gvt_cleanup+0x10/0x10
[13517.801261] ? intel_gvt_schedule+0x19/0x2b9
[13517.802107] intel_gvt_schedule+0x19/0x2b9
[13517.802954] ? intel_gvt_cleanup+0x10/0x10
[13517.803824] gvt_service_thread+0xe3/0x10d
[13517.804704] ? wait_woken+0x68/0x68
[13517.805588] kthread+0x118/0x120
[13517.806478] ? kthread_create_on_node+0x3a/0x3a
[13517.807381] ? call_usermodehelper_exec_async+0x113/0x11a
[13517.808307] ret_from_fork+0x35/0x40

v3: split out ggtt reset case.
v2: also unmap ggtt during reset.

Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
syzkaller reports for wrong rtnl_lock usage in sync code [1] and [2]

We have 2 problems in start_sync_thread if error path is
taken, eg. on memory allocation error or failure to configure
sockets for mcast group or addr/port binding:

1. recursive locking: holding rtnl_lock while calling sock_release
which in turn calls again rtnl_lock in ip_mc_drop_socket to leave
the mcast group, as noticed by Florian Westphal. Additionally,
sock_release can not be called while holding sync_mutex (ABBA
deadlock).

2. task hung: holding rtnl_lock while calling kthread_stop to
stop the running kthreads. As the kthreads do the same to leave
the mcast group (sock_release -> ip_mc_drop_socket -> rtnl_lock)
they hang.

Fix the problems by calling rtnl_unlock early in the error path,
now sock_release is called after unlocking both mutexes.

Problem 3 (task hung reported by syzkaller [2]) is variant of
problem 2: use _trylock to prevent one user to call rtnl_lock and
then while waiting for sync_mutex to block kthreads that execute
sock_release when they are stopped by stop_sync_thread.

[1]
IPVS: stopping backup sync thread 4500 ...
WARNING: possible recursive locking detected
4.16.0-rc7+ #3 Not tainted
--------------------------------------------
syzkaller688027/4497 is trying to acquire lock:
  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

but task is already holding lock:
IPVS: stopping backup sync thread 4495 ...
  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(rtnl_mutex);
   lock(rtnl_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

2 locks held by syzkaller688027/4497:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
  #1:  (ipvs->sync_mutex){+.+.}, at: [<00000000703f78e3>]
do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388

stack backtrace:
CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  print_deadlock_bug kernel/locking/lockdep.c:1761 [inline]
  check_deadlock kernel/locking/lockdep.c:1805 [inline]
  validate_chain kernel/locking/lockdep.c:2401 [inline]
  __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431
  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
  __mutex_lock_common kernel/locking/mutex.c:756 [inline]
  __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
  ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
  inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
  sock_release+0x8d/0x1e0 net/socket.c:595
  start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
  do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261
  udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406
  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975
  SYSC_setsockopt net/socket.c:1849 [inline]
  SyS_setsockopt+0x189/0x360 net/socket.c:1828
  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x446a69
RSP: 002b:00007fa1c3a64da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000446a69
RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000006e29fc R08: 0000000000000018 R09: 0000000000000000
R10: 00000000200000c0 R11: 0000000000000246 R12: 00000000006e29f8
R13: 00676e697279656b R14: 00007fa1c3a659c0 R15: 00000000006e2b60

[2]
IPVS: sync thread started: state = BACKUP, mcast_ifn = syz_tun, syncid = 4,
id = 0
IPVS: stopping backup sync thread 25415 ...
INFO: task syz-executor7:25421 blocked for more than 120 seconds.
       Not tainted 4.16.0-rc6+ torvalds#284
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor7   D23688 25421   4408 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2862 [inline]
  __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440
  schedule+0xf5/0x430 kernel/sched/core.c:3499
  schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777
  do_wait_for_common kernel/sched/completion.c:86 [inline]
  __wait_for_common kernel/sched/completion.c:107 [inline]
  wait_for_common kernel/sched/completion.c:118 [inline]
  wait_for_completion+0x415/0x770 kernel/sched/completion.c:139
  kthread_stop+0x14a/0x7a0 kernel/kthread.c:530
  stop_sync_thread+0x3d9/0x740 net/netfilter/ipvs/ip_vs_sync.c:1996
  do_ip_vs_set_ctl+0x2b1/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2394
  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
  ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1253
  sctp_setsockopt+0x2ca/0x63e0 net/sctp/socket.c:4154
  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:3039
  SYSC_setsockopt net/socket.c:1850 [inline]
  SyS_setsockopt+0x189/0x360 net/socket.c:1829
  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x454889
RSP: 002b:00007fc927626c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007fc9276276d4 RCX: 0000000000454889
RDX: 000000000000048c RSI: 0000000000000000 RDI: 0000000000000017
RBP: 000000000072bf58 R08: 0000000000000018 R09: 0000000000000000
R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000051c R14: 00000000006f9b40 R15: 0000000000000001

Showing all locks held in the system:
2 locks held by khungtaskd/868:
  #0:  (rcu_read_lock){....}, at: [<00000000a1a8f002>]
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
  #0:  (rcu_read_lock){....}, at: [<00000000a1a8f002>] watchdog+0x1c5/0xd60
kernel/hung_task.c:249
  #1:  (tasklist_lock){.+.+}, at: [<0000000037c2f8f9>]
debug_show_all_locks+0xd3/0x3d0 kernel/locking/lockdep.c:4470
1 lock held by rsyslogd/4247:
  #0:  (&f->f_pos_lock){+.+.}, at: [<000000000d8d6983>]
__fdget_pos+0x12b/0x190 fs/file.c:765
2 locks held by getty/4338:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4339:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4340:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4341:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4342:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4343:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
2 locks held by getty/4344:
  #0:  (&tty->ldisc_sem){++++}, at: [<00000000bee98654>]
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>]
n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
3 locks held by kworker/0:5/6494:
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] work_static include/linux/workqueue.h:198 [inline]
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] set_work_data kernel/workqueue.c:619 [inline]
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] set_work_pool_and_clear_pending kernel/workqueue.c:646
[inline]
  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
[<00000000a062b18e>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000278427d5>]
process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
  #2:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
1 lock held by syz-executor7/25421:
  #0:  (ipvs->sync_mutex){+.+.}, at: [<00000000d414a689>]
do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393
2 locks held by syz-executor7/25427:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
  #1:  (ipvs->sync_mutex){+.+.}, at: [<00000000e6d48489>]
do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388
1 lock held by syz-executor7/25435:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74
1 lock held by ipvs-b:2:0/25415:
  #0:  (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20
net/core/rtnetlink.c:74

Reported-and-tested-by: syzbot+a46d6abf9d56b1365a72@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+5fe074c01b2032ce9618@syzkaller.appspotmail.com
Fixes: e0b26cc ("ipvs: call rtnl_lock early")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
  ======================================================
  WARNING: possible circular locking dependency detected
  4.16.0-rc7+ #3 Not tainted
  ------------------------------------------------------
  syz-executor7/24531 is trying to acquire lock:
   (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854

  but task is already holding lock:
   (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&mm->mmap_sem){++++}:
       __might_fault+0x13a/0x1d0 mm/memory.c:4571
       _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
       copy_to_user include/linux/uaccess.h:155 [inline]
       bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
       perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
       _perf_ioctl kernel/events/core.c:4750 [inline]
       perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
       vfs_ioctl fs/ioctl.c:46 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
       SYSC_ioctl fs/ioctl.c:701 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

  -> #0 (bpf_event_mutex){+.+.}:
       lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
       __mutex_lock_common kernel/locking/mutex.c:756 [inline]
       __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
       mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
       perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
       perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
       _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
       put_event+0x24/0x30 kernel/events/core.c:4204
       perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
       remove_vma+0xb4/0x1b0 mm/mmap.c:172
       remove_vma_list mm/mmap.c:2490 [inline]
       do_munmap+0x82a/0xdf0 mm/mmap.c:2731
       mmap_region+0x59e/0x15a0 mm/mmap.c:1646
       do_mmap+0x6c0/0xe00 mm/mmap.c:1483
       do_mmap_pgoff include/linux/mm.h:2223 [inline]
       vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
       SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
       SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
       SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
       SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
       do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7

  other info that might help us debug this:

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&mm->mmap_sem);
                                 lock(bpf_event_mutex);
                                 lock(&mm->mmap_sem);
    lock(bpf_event_mutex);

   *** DEADLOCK ***
  ======================================================

The bug is introduced by Commit f371b30 ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.

As suggested by Daniel, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.

Fixes: f371b30 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
Patch series "kexec_file, x86, powerpc: refactoring for other
architecutres", v2.

This is a preparatory patchset for adding kexec_file support on arm64.

It was originally included in a arm64 patch set[1], but Philipp is also
working on their kexec_file support on s390[2] and some changes are now
conflicting.

So these common parts were extracted and put into a separate patch set
for better integration.  What's more, my original patch#4 was split into
a few small chunks for easier review after Dave's comment.

As such, the resulting code is basically identical with my original, and
the only *visible* differences are:

 - renaming of _kexec_kernel_image_probe() and  _kimage_file_post_load_cleanup()

 - change one of types of arguments at prepare_elf64_headers()

Those, unfortunately, require a couple of trivial changes on the rest
(#1, #6 to torvalds#13) of my arm64 kexec_file patch set[1].

Patch #1 allows making a use of purgatory optional, particularly useful
for arm64.

Patch #2 commonalizes arch_kexec_kernel_{image_probe, image_load,
verify_sig}() and arch_kimage_file_post_load_cleanup() across
architectures.

Patches #3-#7 are also intended to generalize parse_elf64_headers(),
along with exclude_mem_range(), to be made best re-use of.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/561182.html
[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.1/02596.html

This patch (of 7):

On arm64, crash dump kernel's usable memory is protected by *unmapping*
it from kernel virtual space unlike other architectures where the region
is just made read-only.  It is highly unlikely that the region is
accidentally corrupted and this observation rationalizes that digest
check code can also be dropped from purgatory.  The resulting code is so
simple as it doesn't require a bit ugly re-linking/relocation stuff,
i.e.  arch_kexec_apply_relocations_add().

Please see:

   http://lists.infradead.org/pipermail/linux-arm-kernel/2017-December/545428.html

All that the purgatory does is to shuffle arguments and jump into a new
kernel, while we still need to have some space for a hash value
(purgatory_sha256_digest) which is never checked against.

As such, it doesn't make sense to have trampline code between old kernel
and new kernel on arm64.

This patch introduces a new configuration, ARCH_HAS_KEXEC_PURGATORY, and
allows related code to be compiled in only if necessary.

[takahiro.akashi@linaro.org: fix trivial screwup]
  Link: http://lkml.kernel.org/r/20180309093346.GF25863@linaro.org
Link: http://lkml.kernel.org/r/20180306102303.9063-2-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Acked-by: Dave Young <dyoung@redhat.com>
Tested-by: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
Patch series "kexec_file: Clean up purgatory load", v2.

Following the discussion with Dave and AKASHI, here are the common code
patches extracted from my recent patch set (Add kexec_file_load support
to s390) [1].  The patches were extracted to allow upstream integration
together with AKASHI's common code patches before the arch code gets
adjusted to the new base.

The reason for this series is to prepare common code for adding
kexec_file_load to s390 as well as cleaning up the mis-use of the
sh_offset field during purgatory load.  In detail this series contains:

Patch #1&2: Minor cleanups/fixes.

Patch #3-9: Clean up the purgatory load/relocation code.  Especially
remove the mis-use of the purgatory_info->sechdrs->sh_offset field,
currently holding a pointer into either kexec_purgatory (ro) or
purgatory_buf (rw) depending on the section.  With these patches the
section address will be calculated verbosely and sh_offset will contain
the offset of the section in the stripped purgatory binary
(purgatory_buf).

Patch torvalds#10: Allows architectures to set the purgatory load address.  This
patch is important for s390 as the kernel and purgatory have to be
loaded to fixed addresses.  In current code this is impossible as the
purgatory load is opaque to the architecture.

Patch torvalds#11: Moves x86 purgatories sha implementation to common lib/
directory to allow reuse in other architectures.

This patch (of 11)

When building the kernel with CONFIG_KEXEC_FILE enabled gcc prints a
compile warning multiple times.

  In file included from <path>/linux/init/initramfs.c:526:0:
  <path>/include/linux/kexec.h:120:9: warning: `struct kimage' declared inside parameter list [enabled by default]
           unsigned long cmdline_len);
           ^

This is because the typedefs for kexec_file_load uses struct kimage
before it is declared.  Fix this by simply forward declaring struct
kimage.

Link: http://lkml.kernel.org/r/20180321112751.22196-2-prudo@linux.vnet.ibm.com
Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
Edward Cree says:

====================
sfc: ARFS fixes

Three issues introduced by my recent asynchronous filter handling changes:
1. The old filter_rfs_insert would replace a matching filter of equal
   priority; we need to pass the appropriate argument to filter_insert to
   make it do the same.
2. We're lying to the kernel with our return value from ndo_rx_flow_steer,
   so we need to lie consistently when calling rps_may_expire_flow.  This
   is only a partial fix, as the lie still prevents us from steering
   multiple flows with the same ID to different queues; a proper fix that
   stops us lying at all will hopefully follow later.
3. It's possible to cause the kernel to hammer ndo_rx_flow_steer very
   hard, so make sure we don't build up too huge a backlog of workitems.

Possibly it would be better to fix #3 on the kernel side; I have a patch
 which I think does that but it's not a regression in 4.17 so isn't 'net'
 material.
There's also the issue that we come up in the bad configuration that
 triggers #3 by default, but that too is a problem for another time.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
rahrawat pushed a commit that referenced this pull request Apr 26, 2018
Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC. Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.

The current code emits the following warning:

 sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.

To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: brice.goglin@gmail.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
rahrawat pushed a commit that referenced this pull request Dec 20, 2018
This reverts commit:

c54c737 ("drm/dp_mst: Skip validating ports during destruction, just ref")

ugh.

In drm_dp_destroy_connector_work(), we have a pretty good chance of
freeing the actual struct drm_dp_mst_port. However, after destroying
things we send a hotplug through (*mgr->cbs->hotplug)(mgr) which is
where the problems start.

For i915, this calls all the way down to the fbcon probing helpers,
which start trying to access the port in a modeset.

[   45.062001] ==================================================================
[   45.062112] BUG: KASAN: use-after-free in ex_handler_refcount+0x146/0x180
[   45.062196] Write of size 4 at addr ffff8882b4b70968 by task kworker/3:1/53

[   45.062325] CPU: 3 PID: 53 Comm: kworker/3:1 Kdump: loaded Tainted: G           O      4.20.0-rc4Lyude-Test+ #3
[   45.062442] Hardware name: LENOVO 20BWS1KY00/20BWS1KY00, BIOS JBET71WW (1.35 ) 09/14/2018
[   45.062554] Workqueue: events drm_dp_destroy_connector_work [drm_kms_helper]
[   45.062641] Call Trace:
[   45.062685]  dump_stack+0xbd/0x15a
[   45.062735]  ? dump_stack_print_info.cold.0+0x1b/0x1b
[   45.062801]  ? printk+0x9f/0xc5
[   45.062847]  ? kmsg_dump_rewind_nolock+0xe4/0xe4
[   45.062909]  ? ex_handler_refcount+0x146/0x180
[   45.062970]  print_address_description+0x71/0x239
[   45.063036]  ? ex_handler_refcount+0x146/0x180
[   45.063095]  kasan_report.cold.5+0x242/0x30b
[   45.063155]  __asan_report_store4_noabort+0x1c/0x20
[   45.063313]  ex_handler_refcount+0x146/0x180
[   45.063371]  ? ex_handler_clear_fs+0xb0/0xb0
[   45.063428]  fixup_exception+0x98/0xd7
[   45.063484]  ? raw_notifier_call_chain+0x20/0x20
[   45.063548]  do_trap+0x6d/0x210
[   45.063605]  ? _GLOBAL__sub_I_65535_1_drm_dp_aux_unregister_devnode+0x2f/0x1c6 [drm_kms_helper]
[   45.063732]  do_error_trap+0xc0/0x170
[   45.063802]  ? _GLOBAL__sub_I_65535_1_drm_dp_aux_unregister_devnode+0x2f/0x1c6 [drm_kms_helper]
[   45.063929]  do_invalid_op+0x3b/0x50
[   45.063997]  ? _GLOBAL__sub_I_65535_1_drm_dp_aux_unregister_devnode+0x2f/0x1c6 [drm_kms_helper]
[   45.064103]  invalid_op+0x14/0x20
[   45.064162] RIP: 0010:_GLOBAL__sub_I_65535_1_drm_dp_aux_unregister_devnode+0x2f/0x1c6 [drm_kms_helper]
[   45.064274] Code: 00 48 c7 c7 80 fe 53 a0 48 89 e5 e8 5b 6f 26 e1 5d c3 48 8d 0e 0f 0b 48 8d 0b 0f 0b 48 8d 0f 0f 0b 48 8d 0f 0f 0b 49 8d 4d 00 <0f> 0b 49 8d 0e 0f 0b 48 8d 08 0f 0b 49 8d 4d 00 0f 0b 48 8d 0b 0f
[   45.064569] RSP: 0018:ffff8882b789ee10 EFLAGS: 00010282
[   45.064637] RAX: ffff8882af47ae70 RBX: ffff8882af47aa60 RCX: ffff8882b4b70968
[   45.064723] RDX: ffff8882af47ae70 RSI: 0000000000000008 RDI: ffff8882b788bdb8
[   45.064808] RBP: ffff8882b789ee28 R08: ffffed1056f13db4 R09: ffffed1056f13db3
[   45.064894] R10: ffffed1056f13db3 R11: ffff8882b789ed9f R12: ffff8882af47ad28
[   45.064980] R13: ffff8882b4b70968 R14: ffff8882acd86728 R15: ffff8882b4b75dc8
[   45.065084]  drm_dp_mst_reset_vcpi_slots+0x12/0x80 [drm_kms_helper]
[   45.065225]  intel_mst_disable_dp+0xda/0x180 [i915]
[   45.065361]  intel_encoders_disable.isra.107+0x197/0x310 [i915]
[   45.065498]  haswell_crtc_disable+0xbe/0x400 [i915]
[   45.065622]  ? i9xx_disable_plane+0x1c0/0x3e0 [i915]
[   45.065750]  intel_atomic_commit_tail+0x74e/0x3e60 [i915]
[   45.065884]  ? intel_pre_plane_update+0xbc0/0xbc0 [i915]
[   45.065968]  ? drm_atomic_helper_swap_state+0x88b/0x1d90 [drm_kms_helper]
[   45.066054]  ? kasan_check_write+0x14/0x20
[   45.066165]  ? i915_gem_track_fb+0x13a/0x330 [i915]
[   45.066277]  ? i915_sw_fence_complete+0xe9/0x140 [i915]
[   45.066406]  ? __i915_sw_fence_complete+0xc50/0xc50 [i915]
[   45.066540]  intel_atomic_commit+0x72e/0xef0 [i915]
[   45.066635]  ? drm_dev_dbg+0x200/0x200 [drm]
[   45.066764]  ? intel_atomic_commit_tail+0x3e60/0x3e60 [i915]
[   45.066898]  ? intel_atomic_commit_tail+0x3e60/0x3e60 [i915]
[   45.067001]  drm_atomic_commit+0xc4/0xf0 [drm]
[   45.067074]  restore_fbdev_mode_atomic+0x562/0x780 [drm_kms_helper]
[   45.067166]  ? drm_fb_helper_debug_leave+0x690/0x690 [drm_kms_helper]
[   45.067249]  ? kasan_check_read+0x11/0x20
[   45.067324]  restore_fbdev_mode+0x127/0x4b0 [drm_kms_helper]
[   45.067364]  ? kasan_check_read+0x11/0x20
[   45.067406]  drm_fb_helper_restore_fbdev_mode_unlocked+0x164/0x200 [drm_kms_helper]
[   45.067462]  ? drm_fb_helper_hotplug_event+0x30/0x30 [drm_kms_helper]
[   45.067508]  ? kasan_check_write+0x14/0x20
[   45.070360]  ? mutex_unlock+0x22/0x40
[   45.073748]  drm_fb_helper_set_par+0xb2/0xf0 [drm_kms_helper]
[   45.075846]  drm_fb_helper_hotplug_event.part.33+0x1cd/0x290 [drm_kms_helper]
[   45.078088]  drm_fb_helper_hotplug_event+0x1c/0x30 [drm_kms_helper]
[   45.082614]  intel_fbdev_output_poll_changed+0x9f/0x140 [i915]
[   45.087069]  drm_kms_helper_hotplug_event+0x67/0x90 [drm_kms_helper]
[   45.089319]  intel_dp_mst_hotplug+0x37/0x50 [i915]
[   45.091496]  drm_dp_destroy_connector_work+0x510/0x6f0 [drm_kms_helper]
[   45.093675]  ? drm_dp_update_payload_part1+0x1220/0x1220 [drm_kms_helper]
[   45.095851]  ? kasan_check_write+0x14/0x20
[   45.098473]  ? kasan_check_read+0x11/0x20
[   45.101155]  ? strscpy+0x17c/0x530
[   45.103808]  ? __switch_to_asm+0x34/0x70
[   45.106456]  ? syscall_return_via_sysret+0xf/0x7f
[   45.109711]  ? read_word_at_a_time+0x20/0x20
[   45.113138]  ? __switch_to_asm+0x40/0x70
[   45.116529]  ? __switch_to_asm+0x34/0x70
[   45.119891]  ? __switch_to_asm+0x40/0x70
[   45.123224]  ? __switch_to_asm+0x34/0x70
[   45.126540]  ? __switch_to_asm+0x34/0x70
[   45.129824]  process_one_work+0x88d/0x15d0
[   45.133172]  ? pool_mayday_timeout+0x850/0x850
[   45.136459]  ? pci_mmcfg_check_reserved+0x110/0x128
[   45.139739]  ? wake_q_add+0xb0/0xb0
[   45.143010]  ? check_preempt_wakeup+0x652/0x1050
[   45.146304]  ? worker_enter_idle+0x29e/0x740
[   45.149589]  ? __schedule+0x1ec0/0x1ec0
[   45.152937]  ? kasan_check_read+0x11/0x20
[   45.156179]  ? _raw_spin_lock_irq+0xa3/0x130
[   45.159382]  ? _raw_read_unlock_irqrestore+0x30/0x30
[   45.162542]  ? kasan_check_write+0x14/0x20
[   45.165657]  worker_thread+0x1a5/0x1470
[   45.168725]  ? set_load_weight+0x2e0/0x2e0
[   45.171755]  ? process_one_work+0x15d0/0x15d0
[   45.174806]  ? __switch_to_asm+0x34/0x70
[   45.177645]  ? __switch_to_asm+0x40/0x70
[   45.180323]  ? __switch_to_asm+0x34/0x70
[   45.182936]  ? __switch_to_asm+0x40/0x70
[   45.185539]  ? __switch_to_asm+0x34/0x70
[   45.188100]  ? __switch_to_asm+0x40/0x70
[   45.190628]  ? __schedule+0x7d4/0x1ec0
[   45.193143]  ? save_stack+0xa9/0xd0
[   45.195632]  ? kasan_check_write+0x10/0x20
[   45.198162]  ? kasan_kmalloc+0xc4/0xe0
[   45.200609]  ? kmem_cache_alloc_trace+0xdd/0x190
[   45.203046]  ? kthread+0x9f/0x3b0
[   45.205470]  ? ret_from_fork+0x35/0x40
[   45.207876]  ? unwind_next_frame+0x43/0x50
[   45.210273]  ? __save_stack_trace+0x82/0x100
[   45.212658]  ? deactivate_slab.isra.67+0x3d4/0x580
[   45.215026]  ? default_wake_function+0x35/0x50
[   45.217399]  ? kasan_check_read+0x11/0x20
[   45.219825]  ? _raw_spin_lock_irqsave+0xae/0x140
[   45.222174]  ? __lock_text_start+0x8/0x8
[   45.224521]  ? replenish_dl_entity.cold.62+0x4f/0x4f
[   45.226868]  ? __kthread_parkme+0x87/0xf0
[   45.229200]  kthread+0x2f7/0x3b0
[   45.231557]  ? process_one_work+0x15d0/0x15d0
[   45.233923]  ? kthread_park+0x120/0x120
[   45.236249]  ret_from_fork+0x35/0x40

[   45.240875] Allocated by task 242:
[   45.243136]  save_stack+0x43/0xd0
[   45.245385]  kasan_kmalloc+0xc4/0xe0
[   45.247597]  kmem_cache_alloc_trace+0xdd/0x190
[   45.249793]  drm_dp_add_port+0x1e0/0x2170 [drm_kms_helper]
[   45.252000]  drm_dp_send_link_address+0x4a7/0x740 [drm_kms_helper]
[   45.254389]  drm_dp_check_and_send_link_address+0x1a7/0x210 [drm_kms_helper]
[   45.256803]  drm_dp_mst_link_probe_work+0x6f/0xb0 [drm_kms_helper]
[   45.259200]  process_one_work+0x88d/0x15d0
[   45.261597]  worker_thread+0x1a5/0x1470
[   45.264038]  kthread+0x2f7/0x3b0
[   45.266371]  ret_from_fork+0x35/0x40

[   45.270937] Freed by task 53:
[   45.273170]  save_stack+0x43/0xd0
[   45.275382]  __kasan_slab_free+0x139/0x190
[   45.277604]  kasan_slab_free+0xe/0x10
[   45.279826]  kfree+0x99/0x1b0
[   45.282044]  drm_dp_free_mst_port+0x4a/0x60 [drm_kms_helper]
[   45.284330]  drm_dp_destroy_connector_work+0x43e/0x6f0 [drm_kms_helper]
[   45.286660]  process_one_work+0x88d/0x15d0
[   45.288934]  worker_thread+0x1a5/0x1470
[   45.291231]  kthread+0x2f7/0x3b0
[   45.293547]  ret_from_fork+0x35/0x40

[   45.298206] The buggy address belongs to the object at ffff8882b4b70968
                which belongs to the cache kmalloc-2k of size 2048
[   45.303047] The buggy address is located 0 bytes inside of
                2048-byte region [ffff8882b4b70968, ffff8882b4b71168)
[   45.308010] The buggy address belongs to the page:
[   45.310477] page:ffffea000ad2dc00 count:1 mapcount:0 mapping:ffff8882c080cf40 index:0x0 compound_mapcount: 0
[   45.313051] flags: 0x8000000000010200(slab|head)
[   45.315635] raw: 8000000000010200 ffffea000aac2808 ffffea000abe8608 ffff8882c080cf40
[   45.318300] raw: 0000000000000000 00000000000d000d 00000001ffffffff 0000000000000000
[   45.320966] page dumped because: kasan: bad access detected

[   45.326312] Memory state around the buggy address:
[   45.329085]  ffff8882b4b70800: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   45.331845]  ffff8882b4b70880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   45.334584] >ffff8882b4b70900: fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb
[   45.337302]                                                           ^
[   45.340061]  ffff8882b4b70980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   45.342910]  ffff8882b4b70a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   45.345748] ==================================================================

So, this definitely isn't a fix that we want. This being said; there's
no real easy fix for this problem because of some of the catch-22's of
the MST helpers current design. For starters; we always need to validate
a port with drm_dp_get_validated_port_ref(), but validation relies on
the lifetime of the port in the actual topology. So once the port is
gone, it can't be validated again.

If we were to try to make the payload helpers not use port validation,
then we'd cause another problem: if the port isn't validated, it could
be freed and we'd just start causing more KASAN issues. There are
already hacks that attempt to workaround this in
drm_dp_mst_destroy_connector_work() by re-initializing the kref so that
it can be used again and it's memory can be freed once the VCPI helpers
finish removing the port's respective payloads. But none of these really
do anything helpful since the port still can't be validated since it's
gone from the topology. Also, that workaround is immensely confusing to
read through.

What really needs to be done in order to fix this is to teach DRM how to
track the lifetime of the structs for MST ports and branch devices
separately from their lifetime in the actual topology. Simply put; this
means having two different krefs-one that removes the port/branch device
from the topology, and one that finally calls kfree(). This would let us
simplify things, since we'd now be able to keep ports around without
having to keep them in the topology at the same time, which is exactly
what we need in order to teach our VCPI helpers to only validate ports
when it's actually necessary without running the risk of trying to use
unallocated memory.

Such a fix is on it's way, but for now let's play it safe and just
revert this. If this bug has been around for well over a year, we can
wait a little while to get an actual proper fix here.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: c54c737 ("drm/dp_mst: Skip validating ports during destruction, just ref")
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sean Paul <sean@poorly.run>
Cc: Jerry Zuo <Jerry.Zuo@amd.com>
Cc: Harry Wentland <Harry.Wentland@amd.com>
Cc: stable@vger.kernel.org # v4.6+
Acked-by: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20181128210005.24434-1-lyude@redhat.com
rahrawat pushed a commit that referenced this pull request Dec 20, 2018
Yonghong Song says:

====================
This patch set added name checking for PTR, ARRAY, VOLATILE, TYPEDEF,
CONST, RESTRICT, STRUCT, UNION, ENUM and FWD types. Such a strict
name checking makes BTF more sound in the kernel and future
BTF-to-header-file converesion ([1]) less fragile.

Patch #1 implemented btf_name_valid_identifier() for name checking
which will be used in Patch #2.
Patch #2 checked name validity for the above mentioned types.
Patch #3 fixed two existing test_btf unit tests exposed by the strict
name checking.
Patch #4 added additional test cases.

This patch set is against bpf tree.

Patch #1 has been implemented in bpf-next commit
Commit 2667a26 ("bpf: btf: Add BTF_KIND_FUNC
and BTF_KIND_FUNC_PROTO"), so there is no need to apply this
patch to bpf-next. In case this patch is applied to bpf-next,
there will be a minor conflict like
  diff --cc kernel/bpf/btf.c
  index a09b2f9,93c233ab2db6..000000000000
  --- a/kernel/bpf/btf.c
  +++ b/kernel/bpf/btf.c
  @@@ -474,7 -451,7 +474,11 @@@ static bool btf_name_valid_identifier(c
          return !*src;
    }

  ++<<<<<<< HEAD
   +const char *btf_name_by_offset(const struct btf *btf, u32 offset)
  ++=======
  + static const char *btf_name_by_offset(const struct btf *btf, u32 offset)
  ++>>>>>>> fa9566b0847d... bpf: btf: implement btf_name_valid_identifier()
    {
          if (!offset)
                  return "(anon)";
Just resolve the conflict by taking the "const char ..." line.

Patches #2, #3 and #4 can be applied to bpf-next without conflict.

[1]: http://vger.kernel.org/lpc-bpf2018.html#session-2
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
rahrawat pushed a commit that referenced this pull request Dec 20, 2018
It was observed that a process blocked indefintely in
__fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
to be cleared via fscache_wait_for_deferred_lookup().

At this time, ->backing_objects was empty, which would normaly prevent
__fscache_read_or_alloc_page() from getting to the point of waiting.
This implies that ->backing_objects was cleared *after*
__fscache_read_or_alloc_page was was entered.

When an object is "killed" and then "dropped",
FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
->backing_objects cleared.  This leaves a window where
something else can set FSCACHE_COOKIE_LOOKING_UP and
__fscache_read_or_alloc_page() can start waiting, before
->backing_objects is cleared

There is some uncertainty in this analysis, but it seems to be fit the
observations.  Adding the wake in this patch will be handled correctly
by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
is empty again, after waiting.

Customer which reported the hang, also report that the hang cannot be
reproduced with this fix.

The backtrace for the blocked process looked like:

PID: 29360  TASK: ffff881ff2ac0f80  CPU: 3   COMMAND: "zsh"
 #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
 #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
 #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
 #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
 #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
 #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
 #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
 #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
 #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
 #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
torvalds#10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
torvalds#11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
torvalds#12 [ffff881ff43eff18] sys_read at ffffffff811fda62
torvalds#13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David Howells <dhowells@redhat.com>
rahrawat pushed a commit that referenced this pull request Dec 20, 2018
Function graph tracing recurses into itself when stackleak is enabled,
causing the ftrace graph selftest to run for up to 90 seconds and
trigger the softlockup watchdog.

Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:200
200             mcount_get_lr_addr        x0    //     pointer to function's saved lr
(gdb) bt
\#0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:200
\#1  0xffffff80081d5280 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:153
\#2  0xffffff8008555484 in stackleak_track_stack () at ../kernel/stackleak.c:106
\#3  0xffffff8008421ff8 in ftrace_ops_test (ops=0xffffff8009eaa840 <graph_ops>, ip=18446743524091297036, regs=<optimized out>) at ../kernel/trace/ftrace.c:1507
\#4  0xffffff8008428770 in __ftrace_ops_list_func (regs=<optimized out>, ignored=<optimized out>, parent_ip=<optimized out>, ip=<optimized out>) at ../kernel/trace/ftrace.c:6286
\#5  ftrace_ops_no_ops (ip=18446743524091297036, parent_ip=18446743524091242824) at ../kernel/trace/ftrace.c:6321
\#6  0xffffff80081d5280 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:153
\#7  0xffffff800832fd10 in irq_find_mapping (domain=0xffffffc03fc4bc80, hwirq=27) at ../kernel/irq/irqdomain.c:876
\#8  0xffffff800832294c in __handle_domain_irq (domain=0xffffffc03fc4bc80, hwirq=27, lookup=true, regs=0xffffff800814b840) at ../kernel/irq/irqdesc.c:650
\#9  0xffffff80081d52b4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:205

Rework so we mark stackleak_track_stack as notrace

Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
rahrawat pushed a commit that referenced this pull request Dec 20, 2018
Ido Schimmel says:

====================
mlxsw: Various fixes

Patches #1 and #2 fix two VxLAN related issues. The first patch removes
warnings that can currently be triggered from user space. Second patch
avoids leaking a FID in an error path.

Patch #3 fixes a too strict check that causes certain host routes not to
be promoted to perform GRE decapsulation in hardware.

Last patch avoids a use-after-free when deleting a VLAN device via an
ioctl when it is enslaved to a bridge. I have a patchset for net-next
that reworks this code and makes the driver more robust.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.