Linux Memory Management
The memory management subsystem is one of the most complex and, at the same time, most important parts of any kernel. In this article we will discuss the basics of memory management.
Describing Physical Memory
To better understand memory management in Linux, we should know a little about the NUMA and UMA architectures. In this section we will discuss how memory is organized in a NUMA-based system, from both the hardware and the software point of view.
NUMA – Nodes and Zones
In brief, a NUMA system is a computer platform that comprises multiple components or assemblies, each of which may contain zero or more CPUs, local memory, and/or IO buses. Each such assembly is referred to as a ‘cell’. The cells of a NUMA system are connected together with some sort of system interconnect. In a NUMA system, all memory is visible to and accessible from any CPU attached to any cell.
In Linux, the system’s hardware resources are divided into multiple software abstractions called “nodes”. Linux maps the nodes onto the physical cells of the hardware platform. As with physical cells, software nodes may contain zero or more CPUs, memory and/or IO buses. And, again, accesses to memory on “closer” nodes will generally experience faster access times.
For each node with memory, Linux constructs an independent memory management subsystem, complete with its own free page lists, in-use page lists, usage statistics and locks to mediate access. In addition, within each node Linux divides memory into zones — a DMA zone, a NORMAL zone and a HIGHMEM zone — each suitable for a different type of usage.
Keep in mind that memory can be requested from any one of these zones; for this, the Linux kernel provides various flags which a kernel developer should pass when calling kernel interfaces.
Page and Page Frame
Page Frame
The system’s memory is broken up into fixed-size chunks called page frames. The page frame is the basic unit of the memory management unit (MMU, the hardware that manages memory and performs virtual-to-physical address translations). The kernel represents every page frame on the system with a struct page structure. Please note that each architecture can define its own page size.
The important point to understand is that the page structure is associated with physical pages, not virtual pages. The page structure’s goal is to describe physical memory, not the data contained in it.
Page
The virtual address space is divided into pages, a contiguous span of addresses of a particular size. The pages are page aligned in that the starting address of a page is a multiple of the page size.
Allocating physical pages
Now we will discuss the interfaces provided by the Linux kernel to allocate and free physical pages. Note that the interfaces discussed below allocate memory with page-sized granularity.
The core function is:
struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)
This allocates 2^order (that is, 1 << order) contiguous physical pages and returns a pointer to the first page’s page structure; on error it returns NULL.
If you require only one page, the API below can be used; it behaves the same as alloc_pages with order 0:
struct page * alloc_page(gfp_t gfp_mask)
If struct page is not required, you can directly call the function below, which returns the logical address of the first requested page.
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
The above function can return more than one page. If a single page is required, you can use:
unsigned long __get_free_page(gfp_t gfp_mask)
Getting zeroed pages: If you need the page filled with zeros, use the function
unsigned long get_zeroed_page(gfp_t gfp_mask)
These allocation requests can fail, so the kernel code must check the return value after every allocation attempt.
You will notice that all the above functions expect a gfp_mask as an argument. This flag determines how the allocator will behave. The gfp_mask is required because different users will require memory from different zones, depending on the context in which the calling code runs. For example, ZONE_DMA is specified when writing certain device drivers, while ZONE_NORMAL is used for disk buffers, and callers should not have to be aware of which node is being used. The gfp_mask thus helps the allocator examine the selected zone and check whether it is suitable for the allocation based on the number of available pages. If the zone is not suitable, the allocator may fall back to other zones.
There is one more function which, given a page, returns its logical address:
void * page_address(struct page *page)
Freeing Pages
void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
void free_page(unsigned long addr)
As a rule, free only pages you allocated; otherwise the kernel can panic or memory corruption can occur.
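Putting the allocation and freeing interfaces together, here is a minimal kernel-context sketch (the function name is hypothetical; error handling follows the rules above):

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static int demo_page_alloc(void)
{
	struct page *page;
	void *addr;

	/* Allocate 2^2 = 4 contiguous physical pages. */
	page = alloc_pages(GFP_KERNEL, 2);
	if (!page)
		return -ENOMEM;		/* allocation can fail: always check */

	addr = page_address(page);	/* logical address of the first page */
	memset(addr, 0, 4 * PAGE_SIZE);

	/* Free exactly what was allocated, with the same order. */
	__free_pages(page, 2);
	return 0;
}
```

Note that the order passed to __free_pages must match the order used at allocation time; freeing with a different order is exactly the kind of mistake that leads to the corruption described above.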
kmalloc
kmalloc is a kernel function used to obtain kernel memory in byte-sized chunks. If we need whole pages, we should use the functions mentioned in the previous section.
void * kmalloc(size_t size, gfp_t flags)
Return values:
On success: It returns a pointer to a region of memory that is at least size bytes in length.
On failure: Returns NULL.
So we must check the return value after every call to kmalloc.
kfree
void kfree(const void *ptr)
This function frees a block of memory previously allocated by kmalloc.
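A short sketch of a typical kmalloc/kfree pairing (the struct and function names are hypothetical, for illustration only):

```c
#include <linux/slab.h>

struct foo {
	int id;
	char name[32];
};

static struct foo *foo_create(void)
{
	/* GFP_KERNEL: normal allocation from process context, may sleep. */
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;	/* kmalloc can fail: always check */

	f->id = 0;
	return f;
}

static void foo_destroy(struct foo *f)
{
	kfree(f);		/* free only what kmalloc returned */
}
```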
gfp_mask Flags
We have seen that the low-level page allocators and kmalloc expect gfp flags as a parameter. We will now discuss these flags and conclude this article.
Here gfp stands for “get free pages”, and these gfp flags are built up from various other flags. This is where our knowledge of zones comes into the picture.
The gfp flags are broken up into three categories: action modifiers, zone modifiers, and types.
Action Modifiers:
Action modifiers specify how the kernel is supposed to allocate the requested memory. Different contexts require different action modifiers:
Table-1 Low Level GFP Flags
Flag Description
__GFP_WAIT Indicates that the caller is not high priority and can sleep or reschedule.
__GFP_HIGH The allocator can access emergency pools; used by high-priority or kernel processes.
__GFP_IO Indicates that the caller can perform low-level I/O.
__GFP_FS The allocator can start filesystem I/O.
__GFP_HIGHIO Indicates that I/O can be performed on pages mapped in high memory.
Zone Modifiers
Zone modifiers specify from which memory zone the allocation should originate. Normally, allocations can be fulfilled from any zone.
Table – 2 Zone Modifiers
Flag Description
__GFP_DMA Allocates only from ZONE_DMA
__GFP_DMA32 Allocates only from ZONE_DMA32
__GFP_HIGHMEM Allocates from ZONE_HIGHMEM or ZONE_NORMAL
Type Flags
The type flags specify the required action and zone modifiers to fulfill a particular type of transaction.
Table-3 Low Level GFP Flag Combinations for High Level Flags
Flag Modifier Flags
GFP_ATOMIC __GFP_HIGH
GFP_NOWAIT 0
GFP_NOIO __GFP_WAIT
GFP_NOFS (__GFP_WAIT | __GFP_IO)
GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
GFP_DMA __GFP_DMA
The kernel provides various high-level flags, used according to the context from which you are calling kmalloc:
GFP_ATOMIC
Used to allocate memory in interrupt handlers and other code outside of process context; it never sleeps.
Interrupt handlers do not execute in process context and cannot call the ‘schedule’ function. Interrupt handlers should also execute for only a short period of time, so they must not sleep.
GFP_KERNEL
A normal allocation of kernel memory; it may sleep (block). This can be used, for example, when executing a system call in the kernel on behalf of a process.
GFP_USER
Used to allocate memory for user-space pages; it may sleep.
GFP_HIGHUSER
Like GFP_USER, but allocates from high memory, if any.
GFP_NOIO
GFP_NOFS
A GFP_NOFS allocation is not allowed to perform any file system calls, while GFP_NOIO disallows the initiation of any I/O at all. They are used primarily in the file system and virtual memory code where an allocation may be allowed to sleep.
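To illustrate how the calling context drives the choice of flag, here is a hedged sketch contrasting GFP_KERNEL and GFP_ATOMIC (function names are hypothetical):

```c
#include <linux/slab.h>
#include <linux/interrupt.h>

/* Process context (e.g. servicing a system call): sleeping is fine. */
static void *buf_for_syscall(size_t len)
{
	return kmalloc(len, GFP_KERNEL);
}

/* Interrupt handler: must never sleep, so use GFP_ATOMIC. */
static irqreturn_t demo_irq(int irq, void *dev)
{
	void *scratch = kmalloc(64, GFP_ATOMIC);

	if (!scratch)
		return IRQ_NONE;	/* atomic allocations fail more often: handle it */

	/* ... use scratch ... */
	kfree(scratch);
	return IRQ_HANDLED;
}
```

Because GFP_ATOMIC cannot wait for memory to be reclaimed, such allocations are more likely to fail under memory pressure, so the failure path matters even more than usual.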
We will discuss the usage of these flags further in other articles. For now, just be aware that they exist.
vmalloc
vmalloc allocates memory which is virtually contiguous but not necessarily physically contiguous. Remember, by contrast, that kmalloc returns memory which is both physically and virtually contiguous.
So if memory is virtually contiguous but not physically contiguous, how is the virtual memory mapped to physical memory? The kernel does this by using page tables; we will discuss this in later articles.
You may also wonder which allocator is faster. The answer lies in the above discussion: since vmalloc must set up page table entries to map its virtual addresses to physical pages, it carries additional overhead compared to kmalloc. So vmalloc should be used only where it is really necessary.
void * vmalloc(unsigned long size)
On success: Returns a pointer to at least size bytes of virtually contiguous memory.
On failure: the function returns NULL.
To free an allocation obtained via vmalloc(), use
void vfree(const void *addr)
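A minimal sketch of a typical vmalloc use case — a large buffer that does not need to be physically contiguous (the variable and function names are hypothetical):

```c
#include <linux/vmalloc.h>

static void *big_table;

static int demo_init(void)
{
	/* 4 MB need not be physically contiguous, so vmalloc is appropriate. */
	big_table = vmalloc(4 * 1024 * 1024);
	if (!big_table)
		return -ENOMEM;	/* vmalloc can also fail: always check */
	return 0;
}

static void demo_exit(void)
{
	vfree(big_table);	/* pair vfree with vmalloc, never kfree */
}
```

Mixing the allocator pairs — kfree on vmalloc’ed memory or vfree on kmalloc’ed memory — is a bug, for the same reason given in the freeing-pages rule above.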