
proc.vpgid* fields behaving differently in host and container #2076

@Andreagit97

Description

Repro

Using OSS sysdig, for instance:

    sudo sysdig --modern-bpf "proc.vpgid.name=ps and proc.name=grep and proc.cmdline contains sleep and evt.type=execve"

Then run `ps aux | grep sleep` in another terminal: you'll get different results on the host and inside a container.

TL;DR: this happens because we rely on the vpgid, which is relative to the process's pid namespace rather than to the host one.
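
As a side note, the kernel already exposes both views: since Linux 4.1, `/proc/<pid>/status` contains an `NSpgid` line listing, when read from the host's /proc, the host-level pgid first and the innermost-namespace value last. A minimal sketch (a hypothetical standalone helper, not part of sinsp) to inspect it:

    // Print the pgid of a process as seen from every pid namespace it belongs
    // to, by parsing the NSpgid line of /proc/<pid>/status (Linux >= 4.1).
    // The first value is the host-level pgid; the last one is the vpgid that
    // the proc.vpgid* fields are based on.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char** argv)
    {
        if(argc != 2)
        {
            std::cerr << "usage: " << argv[0] << " <pid>\n";
            return 1;
        }
        std::ifstream status("/proc/" + std::string(argv[1]) + "/status");
        std::string line;
        while(std::getline(status, line))
        {
            if(line.rfind("NSpgid:", 0) == 0) // e.g. "NSpgid:  164440  12"
            {
                std::cout << line << "\n";
            }
        }
        return 0;
    }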

Explanation

Using this sysdig command

    sudo ./usr/bin/sysdig --modern-bpf "(proc.name=grep or proc.name=ps or proc.name=zsh) and (evt.type in (clone,clone3,fork,execve,execveat))"

we can see the following events:

159146 15:17:55.175862813 3 zsh (134447.134447) > clone 
159319 15:17:55.176206881 3 zsh (134447.134447) < clone res=164440(zsh) exe=/usr/bin/zsh args=NULL tid=134447(zsh) pid=134447(zsh) ptid=126257(terminator) cwd=<NA> fdlimit=1024 pgft_maj=0 pgft_min=66733 vm_size=19140 vm_rss=9740 vm_swap=0 comm=zsh cgroups=cpuset=/user.slice.cpu=/user.slice.cpuacct=/.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1000 gid=1000 vtid=134447(zsh) vpid=134447(zsh) pidns_init_start_ts=0 
159321 15:17:55.176213356 2 zsh (164440.164440) < clone res=0 exe=/usr/bin/zsh args=NULL tid=164440(zsh) pid=164440(zsh) ptid=134447(zsh) cwd=<NA> fdlimit=1024 pgft_maj=0 pgft_min=1 vm_size=19140 vm_rss=4904 vm_swap=0 comm=zsh cgroups=cpuset=/user.slice.cpu=/user.slice.cpuacct=/.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1000 gid=1000 vtid=164440(zsh) vpid=164440(zsh) pidns_init_start_ts=147475851 
159415 15:17:55.176351074 2 zsh (164440.164440) > execve filename=/usr/bin/ps 
159420 15:17:55.176359518 3 zsh (134447.134447) > clone 
159449 15:17:55.176562605 3 zsh (134447.134447) < clone res=164441(zsh) exe=/usr/bin/zsh args=NULL tid=134447(zsh) pid=134447(zsh) ptid=126257(terminator) cwd=<NA> fdlimit=1024 pgft_maj=0 pgft_min=66749 vm_size=19140 vm_rss=9740 vm_swap=0 comm=zsh cgroups=cpuset=/user.slice.cpu=/user.slice.cpuacct=/.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1000 gid=1000 vtid=134447(zsh) vpid=134447(zsh) pidns_init_start_ts=0 
159460 15:17:55.176612572 2 ps (164440.164440) < execve res=0 exe=ps args=aux. tid=164440(ps) pid=164440(ps) ptid=134447(zsh) cwd=<NA> fdlimit=1024 pgft_maj=0 pgft_min=62 vm_size=668 vm_rss=0 vm_swap=0 comm=ps cgroups=cpuset=/user.slice.cpu=/user.slice.cpuacct=/.io=/user.slice.memory=/user.slic... env=GJS_DEBUG_TOPICS=JS ERROR;JS LOG.SYSTEMD_EXEC_PID=1965.SESSION_MANAGER=local/... tty=34818 pgid=164440(ps) loginuid=1000(andrea) flags=0 cap_inheritable=0 cap_permitted=0 cap_effective=0 exe_ino=17433682 exe_ino_ctime=2023-11-15 10:26:13.621836742 exe_ino_mtime=2023-10-31 12:36:04.000000000 uid=1000(andrea) trusted_exepath=/usr/bin/ps 
159470 15:17:55.176650662 2 zsh (164441.164441) < clone res=0 exe=/usr/bin/zsh args=NULL tid=164441(zsh) pid=164441(zsh) ptid=134447(zsh) cwd=<NA> fdlimit=1024 pgft_maj=0 pgft_min=1 vm_size=19140 vm_rss=4904 vm_swap=0 comm=zsh cgroups=cpuset=/user.slice.cpu=/user.slice.cpuacct=/.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1000 gid=1000 vtid=164441(zsh) vpid=164441(zsh) pidns_init_start_ts=147475851 
159626 15:17:55.176799910 2 zsh (164441.164441) > execve filename=/usr/bin/grep 
159804 15:17:55.177021111 2 grep (164441.164441) < execve res=0 exe=grep args=--color=auto.--exclude-dir=.bzr.--exclude-dir=CVS.--exclude-dir=.git.--exclud... tid=164441(grep) pid=164441(grep) ptid=134447(zsh) cwd=<NA> fdlimit=1024 pgft_maj=0 pgft_min=63 vm_size=636 vm_rss=0 vm_swap=0 comm=grep cgroups=cpuset=/user.slice.cpu=/user.slice.cpuacct=/.io=/user.slice.memory=/user.slic... env=GJS_DEBUG_TOPICS=JS ERROR;JS LOG.SYSTEMD_EXEC_PID=1965.SESSION_MANAGER=local/... tty=34818 pgid=164440(ps) loginuid=1000(andrea) flags=0 cap_inheritable=0 cap_permitted=0 cap_effective=0 exe_ino=17433021 exe_ino_ctime=2022-12-24 22:47:30.224405240 exe_ino_mtime=2022-03-23 14:56:13.000000000 uid=1000(andrea) trusted_exepath=/usr/bin/grep 

Please note that the zsh shell creates two new children and each child then calls execve, so grep and ps are not in a parent-child relationship: the shell is the parent of both processes.
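
For context, this is how shells normally build a pipeline. Below is a rough sketch of what zsh does for `ps aux | grep sleep` (pipe wiring and error handling omitted; illustrative, not zsh's actual source): both children are placed in one new process group whose leader is the first child, `ps`, which is why grep's pgid is ps's pid in the capture above.

    // Simplified pipeline setup: two forks from the same parent, both children
    // moved into a single new process group led by the first child (ps).
    #include <sys/wait.h>
    #include <unistd.h>

    int main()
    {
        pid_t first = fork();
        if(first == 0)
        {
            setpgid(0, 0); // ps becomes the leader: pgid == its own pid
            execlp("ps", "ps", "aux", (char*)nullptr);
            _exit(127);
        }
        setpgid(first, first); // the parent repeats it to avoid a race

        pid_t second = fork();
        if(second == 0)
        {
            setpgid(0, first); // grep joins ps's group as a plain member
            execlp("grep", "grep", "sleep", (char*)nullptr);
            _exit(127);
        }
        setpgid(second, first);

        while(wait(nullptr) > 0) {} // reap both children
        return 0;
    }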

Now if we look at the sinsp code, we notice that it runs under the assumption that being in the same process group implies a parent-child relationship, but this is not true:

   case TYPE_VPGID_NAME:
   	{
   		int64_t vpgid = tinfo->m_vpgid;

   		if(!tinfo->is_in_pid_namespace())
   		{
   			// Relying on the convention that a process group id is the process id of the process group leader.
   			// `threadinfo` lookup only applies when the process is running on the host and not in a pid
   			// namespace. However, if the process is running in a pid namespace, we instead traverse the process
   			// lineage until we find a match.
   			sinsp_threadinfo* vpgidinfo = m_inspector->get_thread_ref(vpgid, false, true).get();
   			if(vpgidinfo != NULL)
   			{
   				m_tstr = vpgidinfo->get_comm();
   				RETURN_EXTRACT_STRING(m_tstr);
   			}
   		}
   		// This can occur when the process group leader process has exited or if the process
   		// is running in a pid namespace and we only have the virtual process group id, as
   		// seen from its pid namespace.
   		// Find the highest ancestor process that has the same process group id and
   		// declare it to be the process group leader.
   		sinsp_threadinfo* group_leader = tinfo;

   		sinsp_threadinfo::visitor_func_t visitor = [vpgid, &group_leader](sinsp_threadinfo* pt)
   		{
   			if(pt->m_vpgid != vpgid)
   			{
   				return false;
   			}
   			group_leader = pt;
   			return true;
   		};

   		tinfo->traverse_parent_state(visitor);

   		// group_leader has been updated to the highest process that has the same process group id.
   		// group_leader's comm is considered the process group leader.
   		m_tstr = group_leader->get_comm();
   		RETURN_EXTRACT_STRING(m_tstr);
   	} 

When we type `ps aux | grep sleep` inside a container, we fall into the `tinfo->is_in_pid_namespace()` case and so we traverse the lineage looking for the process group leader. But starting from grep we will never find an ancestor with `vpgid == grep_vpgid`, because the leader (ps) is a sibling, not an ancestor. As a result, `sinsp_threadinfo* group_leader = tinfo;` never changes and we erroneously report grep as the process group leader name.
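
To make the failure mode concrete, here is a hypothetical, self-contained model of that lineage walk (simplified stand-ins for `sinsp_threadinfo` and `traverse_parent_state`, not the real types), using the pids from the capture above:

    // grep's ancestor chain is zsh -> init: none of them shares grep's vpgid,
    // because the group leader (ps) is a sibling, not an ancestor. As in the
    // sinsp code, group_leader therefore never moves away from the starting
    // thread and we report "grep".
    #include <cstdint>
    #include <iostream>
    #include <string>

    struct thread_model
    {
        std::string comm;
        int64_t vpgid;
        thread_model* parent;
    };

    int main()
    {
        thread_model init{"systemd", 1, nullptr};
        thread_model zsh{"zsh", 134447, &init};
        thread_model ps{"ps", 164440, &zsh}; // leader, but not on grep's chain
        thread_model grep{"grep", 164440, &zsh};
        (void)ps;

        thread_model* group_leader = &grep; // same starting point as sinsp
        for(thread_model* pt = grep.parent; pt != nullptr; pt = pt->parent)
        {
            if(pt->vpgid != grep.vpgid)
            {
                break; // this is where the visitor returns false
            }
            group_leader = pt;
        }
        std::cout << group_leader->comm << "\n"; // prints "grep"
        return 0;
    }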

Since we only have the virtual pgid, I'm not sure we can recover the right pgid name when we are in a container and the process group leader is not in the hierarchy. The proc.vpgid* fields provide a best-effort detection, but sometimes they can be misleading, as in this case. I know we have the is_vpgid_leader filter check to verify that we are really looking at the leader, but at this point I'm asking myself whether it wouldn't be better to add the pgid field directly from the kernel and deprecate the filter checks based on vpgid.
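
For comparison, the host-level value that a kernel-provided pgid field would carry is what `/proc/<pid>/stat` already reports when read from the host: field 5 is `pgrp`. A minimal sketch (a hypothetical standalone helper, not sinsp code):

    // Read the host-level pgid of a process from /proc/<pid>/stat (field 5,
    // "pgrp"). Read from the host, this is the value a kernel-supplied pgid
    // field would carry, regardless of the target's pid namespace.
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>

    int main(int argc, char** argv)
    {
        if(argc != 2)
        {
            std::cerr << "usage: " << argv[0] << " <pid>\n";
            return 1;
        }
        std::ifstream stat_file("/proc/" + std::string(argv[1]) + "/stat");
        std::string contents((std::istreambuf_iterator<char>(stat_file)),
                             std::istreambuf_iterator<char>());

        // comm (field 2) is parenthesized and may contain spaces, so skip
        // past the last ')' before splitting the remaining fields.
        std::istringstream rest(contents.substr(contents.rfind(')') + 2));
        std::string state;
        long ppid = 0, pgrp = 0;
        rest >> state >> ppid >> pgrp; // fields 3 (state), 4 (ppid), 5 (pgrp)
        std::cout << "host pgid: " << pgrp << "\n";
        return 0;
    }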
