Skip to content

[BUG] VMM/PMM State Corruption #342

@jcjgraf

Description

@jcjgraf

Describe the bug
There is a bug in the VMM/PMM where it corrupts its state by invalidating a frame that is still mapped somewhere else.

To Reproduce

  1. The testcase test_kernel_task_func2 demonstrates the issue (Unittest That Stress-Tests the VMM and PMM  #341)
  2. Set KT2_ROUNDS to 1 and KT2_CHUNKS to 1000, to rule out the that the issue is caused by the unmapping/deallocation
  3. The testcase will either result in an assertion in the PMM getting triggered or by locking-up
  4. (optional) Fix the srand seed to one where you repeatedly observe the assertions

Expected behavior
The testcase just passes. Throwing more memory at the issue, by increasing .qemu_config:QEMU_RAM, has seemingly no impact.

Additional context
After a certain number of allocations, mfn_invalid suddenly starts to return false. This is caused by the underlying mbi_get_memory_range call. If you print out the _start and _end variable, you will notice that shortly before the assertions gets triggered, these ranges change. However, the elements of multiboot_mmap are never actively modified once initialized.

Patch:
diff --git a/arch/x86/boot/multiboot.c b/arch/x86/boot/multiboot.c
--- a/arch/x86/boot/multiboot.c
+++ b/arch/x86/boot/multiboot.c
@@ -234,12 +234,15 @@ int mbi_get_avail_memory_range(unsigned index, addr_range_t *r) {
 
 int mbi_get_memory_range(paddr_t pa, addr_range_t *r) {
     paddr_t _start, _end;
+    printk("%s()\n", __func__);
 
     for (unsigned int i = 0; i < multiboot_mmap_num; i++) {
         multiboot2_memory_map_t *entry = &multiboot_mmap[i];
 
         _start = _paddr(entry->addr);
         _end = _paddr(_start + entry->len);
+        printk("start: 0x%lx\n", _start);
+        printk("end:   0x%lx\n", _end);
 
         if (pa >= _start && pa < _end)
             goto found;
Results:
mbi_get_memory_range()
start: 0x0
end:   0x9fc00
start: 0x9fc00
end:   0xa0000
start: 0xf0000
end:   0x100000
start: 0x100000
end:   0xffe0000
mbi_get_memory_range()
start: 0xffffffffff000
end:   0x1fffffffffe000
start: 0xffffffffff000
end:   0x1fffffffffe000
start: 0xffffffffff000
end:   0x1fffffffffe000
start: 0xffffffffff000
end:   0x1fffffffffe000
start: 0xffffffffff000
end:   0x1fffffffffe000
start: 0xffffffffff000
end:   0x1fffffffffe000
start: 0xffffffffff000
end:   0x1fffffffffe000
************** PANIC **************
CPU[1]: BUG in tmp_map_mfn() at line 45
***********************************

To better localize the issue, we put many BUG_ON(mfn_invalid(mfn)) into the pagetable code. The two interesting checks are following, as the first one passes while the second one gets triggers the assertion:

diff --git a/arch/x86/pagetables.c b/arch/x86/pagetables.c
--- a/arch/x86/pagetables.c
+++ b/arch/x86/pagetables.c
@@ -250,7 +250,9 @@ static mfn_t get_pgentry_mfn(mfn_t tab_mfn, pt_index_t index, unsigned long flag
         mfn = frame->mfn;
         set_pgentry(entry, mfn, flags);
         tab = tmp_map_mfn(mfn);
+        BUG_ON(mfn_invalid(tab_mfn));
         clean_pagetable(tab);
+        BUG_ON(mfn_invalid(tab_mfn));
     }
     else {
         /* Page table already exists but its flags may conflict with our. Maybe fixup */

Based on this and the values that _start and _end take, it seems that clean_pagetable cleans a frame that is used by the memory manager itself. So a frame that is already used seems to be given out by the pmm. We have not found the exact cause of this yet.

Props to Sandro Rüegge (@sparchatus) for helping me to debugging this.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions