CVE-2026-46223 - Vulnerability Details

- cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

Description

In the Linux kernel, the following vulnerability has been resolved:

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
fixed the wait's condition; [5] made nr_dying_subsys_* visible
synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the
rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.
host PID 1 systemd reaping orphan pids that were re-parented to it during
the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those
pids to free, the pids can't free because PID 1 is the reaper and it's stuck
in rmdir, and the system A-A deadlocks. No internal lock ordering breaks
this; the wait itself is the bug.
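
The circular dependency described above can be pictured as a tiny wait-for graph. The following is only an illustrative userspace sketch, not kernel code: the node labels and the `find_wait_cycle` helper are hypothetical names chosen for this example, and the graph encodes nothing beyond what the paragraph states (rmdir waits for the pids to free, and the pids wait for PID 1 to reap them).

```python
def find_wait_cycle(waits_for):
    """Return the nodes forming a wait cycle, or [] if every chain ends.

    `waits_for` maps each waiter to the single thing it is waiting on.
    """
    for start in waits_for:
        path = []
        node = start
        # Follow the chain until it ends or revisits a node.
        while node in waits_for and node not in path:
            path.append(node)
            node = waits_for[node]
        if node in path:
            return path[path.index(node):]
    return []


# Hypothetical labels for the two parties in the description.
REAPER = "PID 1 (rmdir caller and reaper)"
PIDS = "orphan pids being freed"

# With the cgroup_drain_dying() wait: rmdir waits on the pids,
# and the pids wait on the reaper that is stuck in rmdir.
with_wait = {REAPER: PIDS, PIDS: REAPER}

# Without the wait: the pids still need the reaper, but the reaper
# is no longer blocked on anything.
without_wait = {PIDS: REAPER}

if __name__ == "__main__":
    print(find_wait_cycle(with_wait))
    print(find_wait_cycle(without_wait))
```

In this model the cycle disappears only when the reaper's outgoing edge is removed, which matches the description's point that no lock reordering helps while the wait exists.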

The css killing side that drove the original reorder, however, can be made
cleanly asynchronous: ->css_offline() is already async, run from
css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to
make that chain start only after all tasks have left the cgroup. rmdir's
user-visible side then returns as soon as cgroup.procs and friends are
empty, while ->css_offline() still runs only after the cgroup is fully
drained.
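
As a rough illustration of the ordering the fix aims for, here is a toy userspace model. It is a sketch under simplifying assumptions, not the kernel implementation: `CgroupModel` and all of its attributes and methods are hypothetical names, the percpu_ref and workqueue machinery is reduced to a single flag, and only tasks that are already exiting are modelled.

```python
class CgroupModel:
    """Toy model: offline work is deferred until the last dying task unlinks."""

    def __init__(self):
        self.dying_tasks = set()   # tasks past exit_signals() but still linked
        self.removed = False       # rmdir has returned to userspace
        self.kill_pending = False  # css kill requested but not yet started
        self.offlined = False      # stands in for ->css_offline() having run
        self.log = []

    def add_dying_task(self, name):
        self.dying_tasks.add(name)

    def rmdir(self):
        # The caller never sleeps here; it only records the request.
        self.removed = True
        self.kill_pending = True
        self.log.append("rmdir returned")
        self._start_kill_if_drained()

    def task_dead(self, name):
        # Models the final context switch of an exiting task.
        self.dying_tasks.discard(name)
        self.log.append(f"{name} unlinked")
        self._start_kill_if_drained()

    def _start_kill_if_drained(self):
        # The kill chain starts only once no task is linked any more.
        if self.kill_pending and not self.dying_tasks:
            self.kill_pending = False
            self.offlined = True
            self.log.append("css_offline ran")


if __name__ == "__main__":
    cg = CgroupModel()
    cg.add_dying_task("zombie")
    cg.rmdir()
    cg.task_dead("zombie")
    print(cg.log)  # ['rmdir returned', 'zombie unlinked', 'css_offline ran']
```

The point of the model is the order of the log entries: the rmdir caller is released first, so it is free to reap the zombie, and the offline step follows only after the last task has unlinked.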

Verified by the original reproducer (pidns teardown + zombie reaper, runs
under vng) which hangs vanilla and succeeds here, and by per-commit
deterministic repros for [2], [3], [4], [5] with a boot parameter that
widens the post-exit_signals() window so each state is reliably reachable.
Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race:
when a controller is disabled via subtree_control, kill_css() ran
synchronously while tasks past exit_signals() could still be linked to
the cgroup's csets, and ->css_offline() could fire before they drained.
This patch preserves the existing synchronous behavior at that call site
(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch
will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The
changes are somewhat invasive but not excessively so, so backporting to
-stable should be okay. If something does turn out to be wrong, the fallback
is to revert the entire chain ([1]-[5]) and rework in the development branch
instead.

v2: Pin cgrp across the deferred destroy work with explicit
cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1
wasn't actually broken (ordered cgroup_offline_wq + queue_work order
in cgroup_task_dead() saved it) but the explicit ref removes the
dependency on those non-obvious invariants. Also note the
pre-existing cgroup_apply_control_disable() race in the description;
a follow-up will defer kill_css_finish() there.

Published: 2026-05-28

Score: 5.5 Medium

EPSS: < 1% Very Low

KEV: No

The default status is the baseline for the product; each version can override it (e.g. patched versions marked unaffected).

Vendor Product Default status Versions

Linux

unaffected

Version	Status	Constraints
`1b164b876c36c3eb5561dd9b37702b04401b0166`	affected	< 33fa2e6b1507a0a377a151a8826438bedad1d0b0
`1b164b876c36c3eb5561dd9b37702b04401b0166`	affected	< 93618edf753838a727dbff63c7c291dee22d656b
`78c72bce4a87819126211c0d24e18350010604fb`	affected	—
`6.19.12`	affected	< 6.20

Linux

affected

Version	Status	Constraints
`7.0`	affected	—
`0`	unaffected	< 7.0
`7.0.9`	unaffected	≤ 7.0.*
`7.1`	unaffected	≤ *

Configuration 1

OR	cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*
	cpe:2.3:o:linux:linux_kernel:7.0:-:*:*:*:*:*:*
	cpe:2.3:o:linux:linux_kernel:7.0:rc7:*:*:*:*:*:*
	cpe:2.3:o:linux:linux_kernel:7.1:rc1:*:*:*:*:*:*
	cpe:2.3:o:linux:linux_kernel:7.1:rc2:*:*:*:*:*:*

No data.

Vendor Product Confidence Versions

Linux

Linux Kernel

99%

Version	Status	Scheme	Platform
`[1b164b876c36c3eb5561dd9b37702b04401b0166,33fa2e6b1507a0a377a151a8826438bedad1d0b0)`	affected	code_commit	—
`[1b164b876c36c3eb5561dd9b37702b04401b0166,93618edf753838a727dbff63c7c291dee22d656b)`	affected	code_commit	—
`78c72bce4a87819126211c0d24e18350010604fb`	affected	code_commit	—
`[6.19.12,6.20)`	affected	semver	—

Linux

Linux Kernel

99%

Version	Status	Scheme	Platform
`7.0`	affected	generic	—
`[0,7.0)`	unaffected	semver	—
`[7.0.9,7.1.0)`	unaffected	semver	—
`[7.1-rc3,*]`	unaffected	generic	—

Remediation

No vendor fix or workaround currently provided.

Advisories

No advisories yet.

No CVSS v4.0

CVSS v3.1 vector: CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Attack Vector: Local

Attack Complexity: Low

Privileges Required: Low

Scope: Unchanged

Confidentiality Impact: None

Integrity Impact: None

Availability Impact: High

User Interaction: None
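
The 5.5 score reported for this CVE can be reproduced from the metrics above with the CVSS v3.1 base score equations. The sketch below is a minimal calculator, assuming Scope is Unchanged as it is here; the function names are this example's own, and the weights are the published v3.1 constants for the Unchanged-scope case.

```python
import math

# CVSS v3.1 metric weights (Privileges Required values are for Scope: Unchanged).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Compute the base score for a CVSS:3.1 vector with Scope: Unchanged."""
    fields = dict(item.split(":") for item in vector.split("/")[1:])
    if fields["S"] != "U":
        raise ValueError("this sketch only covers Scope: Unchanged")
    w = {name: WEIGHTS[name][fields[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


if __name__ == "__main__":
    print(base_score("CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H"))  # 5.5
```

For this vector the impact term is 6.42 × 0.56 ≈ 3.60 and the exploitability term is 8.22 × 0.55 × 0.77 × 0.62 × 0.85 ≈ 1.83, so the sum of about 5.43 rounds up to 5.5.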

No CVSS v3.0

No CVSS v2

This CVE is not in the KEV list.

The EPSS score is 0.00083.

Key SSVC decision points have not yet been added.

References

Link	Providers
https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0
https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b
https://lore.kernel.org/linux-cve-announce/2026052837-CVE-2026-46223-3e37@gregkh/T
https://nvd.nist.gov/vuln/detail/CVE-2026-46223
https://www.cve.org/CVERecord?id=CVE-2026-46223

History

Thu, 11 Jun 2026 18:45:00 +0000

Type	Values Removed	Values Added
Weaknesses		CWE-667
CPEs		cpe:2.3:o:linux:linux_kernel:7.0:-:*:*:*:*:*:* cpe:2.3:o:linux:linux_kernel:7.0:rc7:*:*:*:*:*:* cpe:2.3:o:linux:linux_kernel:7.1:rc1:*:*:*:*:*:* cpe:2.3:o:linux:linux_kernel:7.1:rc2:*:*:*:*:*:*
Metrics		cvssV3_1 `{'score': 5.5, 'vector': 'CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H'}`

Fri, 29 May 2026 04:00:00 +0000

Type	Values Removed	Values Added
Weaknesses	CWE-362

Fri, 29 May 2026 00:15:00 +0000

Type	Values Removed	Values Added
Weaknesses		CWE-833
References		https://lore.kernel.org/linux-cve-announce/2026052837-CVE-2026-46223-3e37@gregkh/T https://nvd.nist.gov/vuln/detail/CVE-2026-46223 https://www.cve.org/CVERecord?id=CVE-2026-46223

Thu, 28 May 2026 13:00:00 +0000

Type	Values Removed	Values Added
Weaknesses		CWE-362

Thu, 28 May 2026 10:15:00 +0000

Type	Values Removed	Values Added
Description		In the Linux kernel, the following vulnerability has been resolved: cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated (full text as given in the Description section above)
Title		cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
First Time appeared		Linux Linux Kernel
CPEs		cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*
Vendors & Products		Linux Linux Kernel
References		https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0 https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b

MITRE

Status: PUBLISHED

Assigner: Linux

Published: 2026-05-28T09:40:40.791Z

Updated: 2026-06-14T18:03:52.610Z

Reserved: 2026-05-13T15:03:33.106Z

Link: CVE-2026-46223

Vulnrichment

No data.

NVD

Status: Analyzed

Published: 2026-05-28T10:16:37.913

Modified: 2026-06-11T18:30:56.360

Link: CVE-2026-46223

Redhat

Severity:

Public Date: 2026-05-28T00:00:00Z

Links: CVE-2026-46223 - Bugzilla

OpenCVE Enrichment

Updated: 2026-06-11T21:30:05Z
