Linux NUMA physical memory layout

Just looked through the pseudo files in the procfs.

/proc/zoneinfo has some interesting information.

There are two zones in my current system: DMA, normal. We also have 2 nodes, and thus the file is shown as:

  • Node0 DMA
  • Node0 DMA32
  • Node0 normal
  • Node1 normal

Each section has how many free pages, present pages there are. At the end of the section there is the start_pfn that shows us the physical address of the beginning of the zone.

Thus we can approximate the physical address space of our system by using the start_pfn * PAGE_SIZE and also using the number of present pages.


NUMA Linux Memory Allocation

Scope of Memory Policies

System Default Policy

Hard coded into the kernel. On system boot it uses interleaved. After bootup it uses local allocation.

Task/Process Policy

Per task policy. Task policies are inherited to child processes. Thus applications like numactl uses this property to propogate the task policy to the child process.

In multi-threaded situation where other threads exist only the thread that calls the MEMORY_POLICY_APIS will set its memory policy. All other exisitng threads will retain the prior policy.

The policy only affects memory allocation after the time the policy is set. All allocations before the change are not affected.

VMA Policy

Only applies to anonymous pages. File mapped VMAs will ignore the VMA policy if it is set to MAP_SHARED. If it is MAP_PRIVATE VMA policy will only be enforced on a write to the mapping (CoW).

VMA policies are shared by threads of the same address space. VMA policies do not persist accross exec() calls (as the address pace is wiped)

Shared Policy

Similar to VMA policy but it is shared among processes. Some more details, but skipped due to irrelevance to my work.

Memory Policies

Default Mode

Specifies the current scope does not follow a policy, fall back to larger scope’s policy. At the root it’ll follow the system policy.

Bind Mode

Memory allocated from the nodes specified by the policy. Proximity is considered first, and if enough free space exists for the closest memory node (to the allocation requestor) it’ll be granted

Preferred Mode

Allocation will be attempted from the preferred (single) node. If it fails, the nodes will be searched for free space in nearest first fashion.
Local allocation is a preferred mode where the node that initiates a page fault is the preferred node.

Interleaved Mode

For Anonymous/shared pages: Node set is indexed using the page offset to the VMA. (Address % node_nums). The indexed node is requested for a page as in Preferered mode. And if it fails follows preferred mode style.

Page cache pages: A node counter is used to wrap around and try to spread out pages among the nodes (that are specified).


Reference: What is Linux Memory Policy