Booting recent Ubuntu machines with custom-built Linux kernels

One frustration I had after moving to Ubuntu 20.04 (LTS) was that, at some point, I could no longer install a Linux kernel I had built and boot into it. My usual workflow was:

make -jXX
sudo make modules_install
sudo make install

However, the boot would stall while loading the ramdisk, much like in this Ask Ubuntu question. The cause seems to be that the initramfs is too large (why does this matter?). The answer to that question recommends stripping the modules:

sudo make modules_install INSTALL_MOD_STRIP=1

This strips the modules after they are installed (with the value 1, the default --strip-debug option is used). I assume this is what reduces the module size in the initramfs.

Another method that Alireza found is to change how the initramfs is generated. Setting MODULES=dep (instead of the default MODULES=most) in /etc/initramfs-tools/initramfs.conf should reduce the initramfs size, because the tool then tries to include only the modules this machine needs. After changing this, my guess is that the initramfs has to be regenerated with either

$ sudo make install
$ #or
$ sudo update-initramfs -c -k KERNEL_VERSION

Then try booting again 🙂

Writing a (Super Simple) Linux Module

Okay, so today I need the CR3 value for a given PID. The Linux kernel does not expose this information to userspace, so I'll build my own module to take care of this. The code is available on my GitLab.

Two awesome references I’ll be using for this task:

Making the basic module

First we need the actual module that can be loaded into the kernel. This can be done by following the tutorial from the first reference. The main module source needs the proper headers (for the module macros), the macro calls that describe the module (description, author, license), and finally the declarations of which functions are the module's entry and exit points.


#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_DESCRIPTION("Kernel module to translate given PID to CR3 physical address");
MODULE_AUTHOR("Chang Hyun Park");
MODULE_LICENSE("GPL");

static int pid2cr3_init(void)
{
  printk("%s\n", __func__);
  return 0;
}

static void pid2cr3_exit(void)
{
  printk("%s\n", __func__);
}

module_init(pid2cr3_init);
module_exit(pid2cr3_exit);

The Makefile and the Kbuild files are identical to those in the first reference, except for the filename of the code above (the source is pid2cr3.c, so the object is added as pid2cr3.o).

The full code can be found at GitLab here

Actually doing the intended work

Now, given a PID, we want to retrieve the CR3 value! There are a few ways to pass the PID to the module.

  1. Register the module to listen on a system call [here]
  2. Pass on parameters when loading the module (and dynamically change the parameter at runtime) [here]
  3. Open a pseudo-file to communicate with the module (sysfs, procfs).

We take the last approach: we create a new sysfs directory and file, and program the file to act differently for reads and writes.

On a write to the file, we receive the PID from userspace, and on a read, we return the CR3 for the previously received PID.

The code is available here.
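From userspace, the write-then-read protocol described above could be driven by something like the following. This is only a sketch under assumptions: the function name pid2cr3_query is made up for illustration, the path of the sysfs file is passed in by the caller because the post does not name it, and I assume the module answers a read with a single number (hex with a 0x prefix, or decimal).

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Write `pid` to the module's pseudo-file, then read the same file back
 * and parse the reply as one integer.
 * ASSUMPTIONS: the path and the reply format are hypothetical; the real
 * module may use different names or a different format.
 * Returns 0 on success, -1 on failure.
 */
int pid2cr3_query(const char *path, pid_t pid, unsigned long long *cr3)
{
    char buf[64];
    char *end;
    FILE *fp;

    /* Step 1: the write hands the PID to the module. */
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    if (fprintf(fp, "%ld\n", (long)pid) < 0) {
        fclose(fp);
        return -1;
    }
    /* fclose() flushes the buffer, so a rejected write shows up here. */
    if (fclose(fp) != 0)
        return -1;

    /* Step 2: the read returns the CR3 for the PID written above. */
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (!fgets(buf, sizeof(buf), fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    errno = 0;
    *cr3 = strtoull(buf, &end, 0); /* base 0 accepts "0x..." and decimal */
    if (errno != 0 || end == buf)
        return -1;
    return 0;
}
```

A caller would pass the real location of the file, for example something like pid2cr3_query("/sys/kernel/pid2cr3/pid", getpid(), &cr3), where that path is again a made-up placeholder. Writing to files under /sys usually needs root.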

That’s it!

Linux NUMA physical memory layout

Just looked through the pseudo files in the procfs.

/proc/zoneinfo has some interesting information.

There are three zones in my current system: DMA, DMA32, and normal. We also have 2 nodes, so the file contains these sections:

  • Node0 DMA
  • Node0 DMA32
  • Node0 normal
  • Node1 normal

Each section lists how many free pages and present pages the zone has. At the end of the section there is the start_pfn, the page frame number of the first page in the zone.

Thus we can approximate the physical address space of our system: start_pfn * PAGE_SIZE gives the physical address where a zone begins, and the number of present pages * PAGE_SIZE gives the amount of memory in it.
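The calculation above can be sketched as a small parser. This is a rough sketch, not a robust tool: the struct and function names are my own, and it relies on the "Node N, zone NAME", "present" and "start_pfn:" lines looking the way they do in /proc/zoneinfo on my system.

```c
#include <stdio.h>

/* One "Node N, zone NAME" section of /proc/zoneinfo (names are illustrative). */
struct zone_range {
    int node;
    char name[16];
    unsigned long start_pfn; /* first page frame number of the zone */
    unsigned long present;   /* number of present pages */
};

/* Parse zoneinfo-formatted text from fp; returns the number of zones stored. */
int parse_zoneinfo(FILE *fp, struct zone_range *out, int max)
{
    char line[256];
    char name[16];
    struct zone_range *cur = NULL;
    unsigned long val;
    int node;
    int n = 0;

    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "Node %d, zone %15s", &node, name) == 2) {
            if (n == max)
                break;
            /* A new section starts: open a fresh record. */
            cur = &out[n++];
            cur->node = node;
            snprintf(cur->name, sizeof(cur->name), "%s", name);
            cur->start_pfn = 0;
            cur->present = 0;
        } else if (cur && sscanf(line, " present %lu", &val) == 1) {
            cur->present = val;
        } else if (cur && sscanf(line, " start_pfn: %lu", &val) == 1) {
            cur->start_pfn = val;
        }
    }
    return n;
}

/* Physical address at which the zone begins. */
unsigned long long zone_start_addr(const struct zone_range *z,
                                   unsigned long page_size)
{
    return (unsigned long long)z->start_pfn * page_size;
}

/* Bytes of memory actually present in the zone. */
unsigned long long zone_present_bytes(const struct zone_range *z,
                                      unsigned long page_size)
{
    return (unsigned long long)z->present * page_size;
}
```

To run it on a live system, open /proc/zoneinfo with fopen() and pass sysconf(_SC_PAGESIZE) as the page size. Note that present pages do not include holes, so start address plus present bytes is only an estimate of where a zone ends.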

NUMA Linux Memory Allocation

Scope of Memory Policies

System Default Policy

Hard-coded into the kernel. During system boot it interleaves allocations across nodes. Once the system is up, it uses local allocation.

Task/Process Policy

Per-task policy. Task policies are inherited by child processes. Applications like numactl use this property to propagate the task policy to the child process.

In a multi-threaded process, only the thread that calls the memory policy API (e.g. set_mempolicy()) changes its memory policy. All other existing threads retain their prior policy.

The policy only affects memory allocation after the time the policy is set. All allocations before the change are not affected.

VMA Policy

Only applies to anonymous pages. File-mapped VMAs ignore the VMA policy if they are mapped with MAP_SHARED. If they are mapped with MAP_PRIVATE, the VMA policy is only enforced on a write to the mapping (CoW).

VMA policies are shared by all threads of the same address space. VMA policies do not persist across exec() calls (as the address space is wiped).

Shared Policy

Similar to VMA policy but it is shared among processes. Some more details, but skipped due to irrelevance to my work.

Memory Policies

Default Mode

Specifies that the current scope does not follow a policy of its own and falls back to the policy of the next larger scope. At the root, it follows the system default policy.

Bind Mode

Memory is allocated only from the nodes specified by the policy. Proximity is considered first: if the memory node closest to the allocation requester has enough free space, the allocation is granted from that node.

Preferred Mode

Allocation will be attempted from the preferred (single) node. If it fails, the other nodes will be searched for free space in nearest-first order.
Local allocation is a preferred mode where the node that initiates the page fault is the preferred node.
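As a toy model of the preferred-then-nearest search described above (not the kernel's actual code), the choice could look like this. The function name, the MAX_NODES limit, the distance table, and the per-node free page counts are all illustrative assumptions.

```c
#define MAX_NODES 4 /* arbitrary limit for this sketch */

/*
 * Toy model of Preferred mode: take the preferred node if it has free
 * pages, otherwise the nearest other node that has some.
 * distance[a][b] is a hypothetical NUMA distance table.
 * Returns the chosen node, or -1 if no node has free pages.
 */
int pick_node_preferred(int preferred,
                        int distance[MAX_NODES][MAX_NODES],
                        const unsigned long *free_pages,
                        int nr_nodes)
{
    int best = -1;

    if (free_pages[preferred] > 0)
        return preferred;

    /* Fallback: scan the remaining nodes, keeping the closest candidate. */
    for (int n = 0; n < nr_nodes; n++) {
        if (n == preferred || free_pages[n] == 0)
            continue;
        if (best < 0 || distance[preferred][n] < distance[preferred][best])
            best = n;
    }
    return best;
}
```

With local allocation, `preferred` would simply be the node of the CPU that took the page fault.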

Interleaved Mode

For anonymous/shared pages: the node set is indexed using the page offset into the VMA (page offset % number of nodes). A page is then requested from the indexed node as in Preferred mode, and if that fails, the fallback follows Preferred mode as well.

For page cache pages: a node counter that wraps around is used to spread pages out among the specified nodes.
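The two ways of picking the interleave node can be modelled in a few lines. This is a simplified illustration of the idea, not kernel code, and all of the names are my own.

```c
/*
 * Anonymous/shared pages: index the node set with the page offset into
 * the VMA. nodes[] holds the node IDs allowed by the policy.
 */
int interleave_anon_node(const int *nodes, int nr_nodes,
                         unsigned long page_offset)
{
    return nodes[page_offset % (unsigned long)nr_nodes];
}

/* Page cache pages: a counter that wraps around the node set. */
struct il_counter {
    int next; /* index into nodes[] to use for the next allocation */
};

int interleave_pagecache_node(const int *nodes, int nr_nodes,
                              struct il_counter *c)
{
    int node = nodes[c->next];

    c->next = (c->next + 1) % nr_nodes;
    return node;
}
```

In both cases the returned node only acts as the preferred node for that page. If it has no free memory, the allocation falls back as in Preferred mode.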

 

Reference: What is Linux Memory Policy