ghsa-4f2j-5xcx-5g9w
Vulnerability from github
In the Linux kernel, the following vulnerability has been resolved:
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()).
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases.
{ "affected": [], "aliases": [ "CVE-2024-53079" ], "database_specific": { "cwe_ids": [ "CWE-667" ], "github_reviewed": false, "github_reviewed_at": null, "nvd_published_at": "2024-11-19T18:15:27Z", "severity": "MODERATE" }, "details": "In the Linux kernel, the following vulnerability has been resolved:\n\nmm/thp: fix deferred split unqueue naming and locking\n\nRecent changes are putting more pressure on THP deferred split queues:\nunder load revealing long-standing races, causing list_del corruptions,\n\"Bad page state\"s and worse (I keep BUGs in both of those, so usually\ndon\u0027t get to see how badly they end up without). The relevant recent\nchanges being 6.8\u0027s mTHP, 6.10\u0027s mTHP swapout, and 6.12\u0027s mTHP swapin,\nimproved swap allocation, and underused THP splitting.\n\nBefore fixing locking: rename misleading folio_undo_large_rmappable(),\nwhich does not undo large_rmappable, to folio_unqueue_deferred_split(),\nwhich is what it does. But that and its out-of-line __callee are mm\ninternals of very limited usability: add comment and WARN_ON_ONCEs to\ncheck usage; and return a bool to say if a deferred split was unqueued,\nwhich can then be used in WARN_ON_ONCEs around safety checks (sparing\ncallers the arcane conditionals in __folio_unqueue_deferred_split()).\n\nJust omit the folio_unqueue_deferred_split() from free_unref_folios(), all\nof whose callers now call it beforehand (and if any forget then bad_page()\nwill tell) - except for its caller put_pages_list(), which itself no\nlonger has any callers (and will be deleted separately).\n\nSwapout: mem_cgroup_swapout() has been resetting folio-\u003ememcg_data 0\nwithout checking and unqueueing a THP folio from deferred split list;\nwhich is unfortunate, since the split_queue_lock depends on the memcg\n(when memcg is enabled); so swapout has been unqueueing such THPs later,\nwhen freeing the folio, using the pgdat\u0027s lock instead: potentially\ncorrupting the memcg\u0027s list. __remove_mapping() has frozen refcount to 0\nhere, so no problem with calling folio_unqueue_deferred_split() before\nresetting memcg_data.\n\nThat goes back to 5.4 commit 87eaceb3faa5 (\"mm: thp: make deferred split\nshrinker memcg aware\"): which included a check on swapcache before adding\nto deferred queue, but no check on deferred queue before adding THP to\nswapcache. That worked fine with the usual sequence of events in reclaim\n(though there were a couple of rare ways in which a THP on deferred queue\ncould have been swapped out), but 6.12 commit dafff3f4c850 (\"mm: split\nunderused THPs\") avoids splitting underused THPs in reclaim, which makes\nswapcache THPs on deferred queue commonplace.\n\nKeep the check on swapcache before adding to deferred queue? Yes: it is\nno longer essential, but preserves the existing behaviour, and is likely\nto be a worthwhile optimization (vmstat showed much more traffic on the\nqueue under swapping load if the check was removed); update its comment.\n\nMemcg-v1 move (deprecated): mem_cgroup_move_account() has been changing\nfolio-\u003ememcg_data without checking and unqueueing a THP folio from the\ndeferred list, sometimes corrupting \"from\" memcg\u0027s list, like swapout. \nRefcount is non-zero here, so folio_unqueue_deferred_split() can only be\nused in a WARN_ON_ONCE to validate the fix, which must be done earlier:\nmem_cgroup_move_charge_pte_range() first try to split the THP (splitting\nof course unqueues), or skip it if that fails. Not ideal, but moving\ncharge has been requested, and khugepaged should repair the THP later:\nnobody wants new custom unqueueing code just for this deprecated case.\n\nThe 87eaceb3faa5 commit did have the code to move from one deferred list\nto another (but was not conscious of its unsafety while refcount non-0);\nbut that was removed by 5.6 commit fac0516b5534 (\"mm: thp: don\u0027t need care\ndeferred split queue in memcg charge move path\"), which argued that the\nexistence of a PMD mapping guarantees that the THP cannot be on a deferred\nlist. As above, false in rare cases, and now commonly false.\n\nBackport to 6.11 should be straightforward. Earlier backports must take\ncare that other _deferred_list fixes and dependencies are included. There\nis not a strong case for backports, but they can fix cornercases.", "id": "GHSA-4f2j-5xcx-5g9w", "modified": "2024-11-27T18:34:01Z", "published": "2024-11-19T18:31:07Z", "references": [ { "type": "ADVISORY", "url": "https://nvd.nist.gov/vuln/detail/CVE-2024-53079" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/afb1352d06b1b6b2cfd1f901c766a430c87078b3" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/f8f931bba0f92052cf842b7e30917b1afcc77d5a" }, { "type": "WEB", "url": "https://git.kernel.org/stable/c/fc4951c3e3358dd82ea508e893695b916c813f17" } ], "schema_version": "1.4.0", "severity": [ { "score": "CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H", "type": "CVSS_V3" } ] }
Sightings
Author | Source | Type | Date |
---|
Nomenclature
- Seen: The vulnerability was mentioned, discussed, or seen somewhere by the user.
- Confirmed: The vulnerability is confirmed from an analyst perspective.
- Exploited: This vulnerability was exploited and seen by the user reporting the sighting.
- Patched: This vulnerability was successfully patched by the user reporting the sighting.
- Not exploited: This vulnerability was not exploited or seen by the user reporting the sighting.
- Not confirmed: The user expresses doubt about the veracity of the vulnerability.
- Not patched: This vulnerability was not successfully patched by the user reporting the sighting.