User space memory access from the Linux kernel
An introduction to Linux memory and user space APIs
By M. Jones
10 August 2010
Archive date: 2023-08-31
Although the byte may be the lowest addressable unit of memory within Linux, it's the page that serves as the managed abstraction of memory. This article begins with a discussion of memory management within Linux, and then explores the methods for manipulation of user address space from the kernel.
Linux memory
In Linux, user memory and kernel memory are independent and implemented in separate address spaces. The address spaces are virtualized, meaning that the addresses are abstracted from physical memory (through a process detailed shortly). Because the address spaces are virtualized, many can exist. In fact, the kernel itself resides in one address space, and each process resides in its own address space. These address spaces consist of virtual memory addresses, permitting many processes with independent address spaces to refer to a considerably smaller physical address space (the physical memory in the machine). Not only is this convenient, but it's also secure, because each address space is independent and isolated from the others.
But there's a cost associated with this security. Because each process (and the kernel) can have identical addresses that refer to different regions of physical memory, it's not immediately possible to share memory. Luckily, a few solutions exist. User processes can share memory through the Portable Operating System Interface for UNIX® (POSIX) shared memory mechanism (shmem), with the caveat that each process may have a different virtual address that refers to the same region of physical memory.
The mapping of virtual memory to physical memory occurs through page tables, which are implemented in the underlying hardware (see Figure 1). The hardware itself provides the mapping, but the kernel manages the tables and their configuration. Note that as shown here, a process may have a large address space, but it is sparse, meaning that small regions (pages) of the address space refer to physical memory through the page tables. This permits a process to have a massive address space that is defined only for the pages that are needed at any given time.
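The translation described above can be modeled in a few lines of user-space C. This is a minimal sketch, not how the MMU or the kernel implements it: it assumes 4 KiB pages and a single-level table for a toy 16-page address space (real page tables are multilevel), and every name in it is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                       /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)
#define TOY_NUM_VPAGES 16                       /* toy address space: 16 pages */
#define TOY_NOT_MAPPED UINT32_MAX

/* One entry per virtual page: a physical frame number, or TOY_NOT_MAPPED. */
static uint32_t page_table[TOY_NUM_VPAGES];

/* Start with a completely sparse address space: nothing is mapped. */
static void page_table_init(void)
{
    for (size_t i = 0; i < TOY_NUM_VPAGES; i++)
        page_table[i] = TOY_NOT_MAPPED;
}

/* Map one virtual page onto one physical frame. */
static void map_page(uint32_t vpage, uint32_t frame)
{
    page_table[vpage] = frame;
}

/* Translate a virtual address. Returns 0 on success, or -1 when the page
 * has no mapping (the case that would raise a page fault). */
static int translate(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpage  = vaddr >> TOY_PAGE_SHIFT;
    uint64_t offset = vaddr & (TOY_PAGE_SIZE - 1);

    if (vpage >= TOY_NUM_VPAGES || page_table[vpage] == TOY_NOT_MAPPED)
        return -1;
    *paddr = ((uint64_t)page_table[vpage] << TOY_PAGE_SHIFT) | offset;
    return 0;
}
```

After `page_table_init()` and `map_page(3, 7)`, translating virtual address 0x3abc yields physical address 0x7abc, because only the page number changes while the offset within the page is preserved.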
Figure 1. Page tables provide the mapping from virtual addresses to physical addresses
Having the ability to sparsely define memory for processes means that the underlying physical memory can be overcommitted. Through a process called paging (though in Linux, it's typically called swap), less-used pages are dynamically moved to a slower storage device (such as a disk) to accommodate other pages that need to be accessed (see Figure 2). This behavior allows the physical memory within the computer to serve pages that an application more readily needs while migrating less-needed pages to disk for improved utilization of the physical memory. Note that some pages can refer to files, in which case, the data can be flushed if dirty (through the page cache) or, if the page is clean, simply discarded.
Figure 2. Swap permits better use of the physical memory space by migrating less-used pages to slower and less expensive storage
MMU-less architectures
Not all processors have MMUs. Therefore, the uClinux distribution (microcontroller Linux) supports a single address space of operation. This architecture lacks the protection offered by an MMU but permits Linux to run on another class of processor. See the resources section for information on uClinux.
The process by which a page is selected to swap to storage is called a page-replacement algorithm and can be implemented using a number of algorithms (such as least recently used). This process can occur when a memory location is requested whose page is not in memory (no mapping is present in the memory management unit [MMU]). This event is called a page fault; it is detected by hardware (the MMU) and then handled by the kernel's page-fault handler once the page-fault exception is raised. See Figure 3 for an illustration of this stack.
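To make least-recently-used replacement concrete, here is a small user-space simulation. It is only a sketch of the textbook policy with three hypothetical frames; the Linux kernel itself uses approximations of LRU and not a strict timestamp scan like this one.

```c
#include <stddef.h>

#define NUM_FRAMES 3                      /* hypothetical physical frames */

struct frame {
    int page;                             /* resident page, or -1 if free */
    unsigned long last_used;              /* logical time of last access */
};

static struct frame frames[NUM_FRAMES];
static unsigned long lru_clock;

/* Mark every frame free and reset the logical clock. */
static void lru_init(void)
{
    for (size_t i = 0; i < NUM_FRAMES; i++) {
        frames[i].page = -1;
        frames[i].last_used = 0;
    }
    lru_clock = 0;
}

/* Access a page (page numbers are assumed non-negative). Returns the page
 * that was evicted, or -1 if the access hit or a free frame was used. */
static int lru_access(int page)
{
    size_t victim = 0;
    int evicted;

    lru_clock++;
    for (size_t i = 0; i < NUM_FRAMES; i++) {
        if (frames[i].page == page) {     /* hit: refresh the timestamp */
            frames[i].last_used = lru_clock;
            return -1;
        }
    }
    /* Miss: take a free frame if one exists, else the oldest frame. */
    for (size_t i = 0; i < NUM_FRAMES; i++) {
        if (frames[i].page == -1) {
            victim = i;
            break;
        }
        if (frames[i].last_used < frames[victim].last_used)
            victim = i;
    }
    evicted = frames[victim].page;
    frames[victim].page = page;
    frames[victim].last_used = lru_clock;
    return evicted;
}
```

With the access sequence 1, 2, 3, 1, 4, the final access evicts page 2: page 1 was refreshed by its second access, so page 2 is the least recently used.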
Linux provides an interesting implementation of swap that offers some useful characteristics. The Linux swap system permits the creation and use of multiple swap partitions and priorities, which allows a hierarchy of swap over storage devices with different performance characteristics (for example, a first-level swap on a solid-state disk [SSD] and a larger, second-level swap space on a slower storage device). Attaching a higher priority to the SSD swap allows it to be used until exhausted; only then would pages be written to the lower-priority (slower) swap partition.
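From user space, swap priorities are requested through the flags argument of swapon(2). The sketch below shows one way the two-tier arrangement could be set up; the helper names and the priority values 10 and 5 are assumptions chosen for illustration, and the call needs root privileges to succeed.

```c
#include <sys/swap.h>

/* Build the swapon(2) flags that request an explicit priority. */
static int swap_flags_for_priority(int prio)
{
    return SWAP_FLAG_PREFER |
           ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}

/* Enable two swap areas so that the fast one is filled first.
 * Both paths are supplied by the caller (hypothetical devices here).
 * Returns 0 on success, or -1 with errno set by swapon(). */
static int enable_tiered_swap(const char *fast_dev, const char *slow_dev)
{
    if (swapon(fast_dev, swap_flags_for_priority(10)) != 0)
        return -1;
    return swapon(slow_dev, swap_flags_for_priority(5));
}
```

The same effect is usually achieved administratively with the `pri=` option in /etc/fstab or `swapon -p`; the C form simply shows what those tools pass to the kernel.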
Figure 3. Address spaces and elements of virtual-to-physical address mapping
Not all pages are candidates for swapping. Consider kernel code that responds to interrupts or code that manages the page tables and swap logic. Such pages obviously must never be swapped out and are therefore pinned, or permanently resident in memory. Although kernel pages are not candidates for swapping, user space pages are, unless they are pinned through the mlock (or mlockall) function. This is the purpose behind the user space memory access functions. If the kernel assumed that an address that a user passed was valid and accessible, a kernel panic would eventually occur (for example, because the user page was swapped out, resulting in a page fault in the kernel). This application programming interface (API) ensures that those corner cases are handled properly.
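As an illustration of pinning from user space, the following sketch locks a single page with mlock, touches it, and unlocks it again. The function name is hypothetical, and the lock can legitimately fail (for example, when the RLIMIT_MEMLOCK limit is too low), so the caller has to treat -1 as a normal outcome.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate one page, pin it with mlock(), use it, then unpin and free it.
 * Returns 0 on success, -1 if allocation or locking failed. */
static int use_pinned_page(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (page_size <= 0)
        return -1;
    if (posix_memalign(&buf, (size_t)page_size, (size_t)page_size) != 0)
        return -1;

    if (mlock(buf, (size_t)page_size) != 0) {   /* may fail: RLIMIT_MEMLOCK */
        free(buf);
        return -1;
    }
    memset(buf, 0xA5, (size_t)page_size);       /* page cannot be swapped here */
    munlock(buf, (size_t)page_size);
    free(buf);
    return 0;
}
```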
Kernel APIs
Now, let's explore the kernel APIs for manipulating user memory. Note that this covers the kernel and the user space interface, but the next section explores some of the other memory APIs. The user space memory access functions to be explored are listed in Table 1.
Table 1. The User Space Memory Access API
| Function | Description |
| --- | --- |
| access_ok | Checks the validity of the user space memory pointer |
| get_user | Gets a simple variable from user space |
| put_user | Puts a simple variable to user space |
| clear_user | Clears, or zeros, a block in user space |
| copy_to_user | Copies a block of data from the kernel to user space |
| copy_from_user | Copies a block of data from user space to the kernel |
| strnlen_user | Gets the size of a string buffer in user space |
| strncpy_from_user | Copies a string from user space into the kernel |
As you would expect, the implementation of these functions can be architecture dependent. For x86 architectures, you can find these functions and symbols defined in ./linux/arch/x86/include/asm/uaccess.h, with source in ./linux/arch/x86/lib/usercopy_32.c and usercopy_64.c.
The role of the data-movement functions is shown in Figure 4 as it relates to the types involved for copy (simple vs. aggregate).
Figure 4. Data movement using the User Space Memory Access API
The access_ok function
You use the access_ok function to check the validity of the pointer in user space that you intend to access. The caller provides the pointer, which refers to the start of the data block, the size of the block, and the type of access (whether the area is intended to be read or written). The function prototype is defined as:
access_ok( type, addr, size );
The type argument can be specified as VERIFY_READ or VERIFY_WRITE. The VERIFY_WRITE symbol also checks that the memory region is readable as well as writable. The function returns non-zero if the region is likely accessible (though access may still result in -EFAULT). This function simply checks that the address is likely in user space, not in the kernel.
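The range test itself can be modeled in user space. The sketch below is an assumption-laden model, not the kernel macro: the limit constant is a hypothetical stand-in for the architecture-specific top of user space, and the function name is invented. It omits the type argument, which kernels from 5.0 onward also dropped from access_ok.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of user space for this model; the real limit
 * depends on the architecture and kernel configuration. */
#define MODEL_USER_LIMIT 0x00007ffffffff000ULL

/* True if the block [addr, addr + size) lies entirely below the limit.
 * The comparison is arranged so that addr + size can never overflow. */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
    return size <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - size;
}
```

Note what the model does not check: whether the pages are mapped, present, or writable. That is why a successful check can still be followed by -EFAULT from the actual access.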
The get_user function
To read a simple variable from user space, you use the get_user function. This function is used for simple types such as char and int; larger data types such as structures must use the copy_from_user function instead. The prototype accepts a variable (to store the data) and an address in user space for the read operation:
get_user( x, ptr );
The get_user function maps to one of two internal functions. Internally, this function determines the size of the variable being accessed (based on the variable provided to store the result) and forms an internal call through __get_user_x. This function returns zero on success. In general, the get_user and put_user functions are faster than their block copy counterparts and should be used if small types are moved.
The put_user function
You use the put_user function to write a simple variable from the kernel into user space. Like get_user, it accepts a variable (which contains the value to write) and a user space address as the write target:
put_user( x, ptr );
Like get_user, the put_user function is internally mapped over the __put_user_x function and returns 0 on success or -EFAULT on error.
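A typical calling pattern for the two functions might look like the sketch below. The handler name is hypothetical. So that the fragment also builds outside the kernel, it carries user-space stand-ins for get_user and put_user; those stand-ins are assumptions that simply dereference the pointer and never fail, unlike the real macros.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/uaccess.h>
#else
/* User-space stand-ins (assumptions) so the sketch builds outside the
 * kernel. They treat every pointer as valid, which real user pointers
 * are not. */
#include <errno.h>
#define __user
#define get_user(x, ptr) ((x) = *(ptr), 0)
#define put_user(x, ptr) (*(ptr) = (x), 0)
#endif

/* Hypothetical handler: read an int from user space, double it, and
 * write the result back to the same user location. */
static long double_user_int(int __user *uptr)
{
    int val;

    if (get_user(val, uptr))        /* non-zero means the read faulted */
        return -EFAULT;
    val *= 2;
    if (put_user(val, uptr))        /* non-zero means the write faulted */
        return -EFAULT;
    return 0;
}
```

The point of the pattern is that every access is checked: the kernel never dereferences `uptr` directly, and a bad pointer becomes an error code instead of a crash.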
The clear_user function
The clear_user function is used to zero a block of memory in user space. This function takes a pointer in user space and a size to zero, which is defined in bytes:
clear_user( ptr, n );
Internally, the clear_user function first checks to see whether the user space pointer is writable (via access_ok), and then invokes an internal function (coded in inline assembly) to perform the Clear operation. This function is optimized as a very tight loop using string instructions with the repeat prefix. It returns the number of bytes that were not clearable or zero if the operation was successful.
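Because clear_user reports the number of bytes it could not clear, callers usually fold any non-zero result into -EFAULT. A minimal sketch of that convention follows; the helper name is hypothetical, and the memset-based stand-in exists only so the fragment builds in user space.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/uaccess.h>
#else
/* User-space stand-in (assumption): always clears everything. */
#include <errno.h>
#include <string.h>
#define __user
static unsigned long clear_user(void __user *to, unsigned long n)
{
    memset(to, 0, n);
    return 0;                       /* bytes NOT cleared */
}
#endif

/* Hypothetical helper: zero a user buffer, mapping a partial clear
 * onto the usual -EFAULT error. */
static long zero_user_buffer(void __user *ubuf, unsigned long len)
{
    if (clear_user(ubuf, len))      /* non-zero: some bytes were left */
        return -EFAULT;
    return 0;
}
```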
The copy_to_user function
The copy_to_user function copies a block of data from the kernel into user space. This function accepts a pointer to a user space buffer, a pointer to a kernel buffer, and a length defined in bytes. The function returns zero on success or non-zero to indicate the number of bytes that weren't transferred.
copy_to_user( to, from, n );
After checking the ability to write to the user buffer (through access_ok), the internal function __copy_to_user is invoked, which in turn calls __copy_to_user_inatomic (in ./linux/arch/x86/include/asm/uaccess_XX.h, where XX is 32 or 64 depending on architecture). This function (after determining whether to perform 1-, 2-, or 4-byte copies) finally calls __copy_to_user_ll, which is where the real work is done. In broken hardware (prior to the i486, where the WP bit was not honored from supervisory mode), the page tables could change at any time, requiring the desired pages to be pinned into memory so that they could not be swapped out while being addressed. Post-i486, the process is nothing more than an optimized copy.
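A read-style handler is the most common place to find copy_to_user. The sketch below assumes a hypothetical handler and message buffer; the memcpy-based stand-in is there only so the fragment builds in user space and, unlike the real function, never reports uncopied bytes.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/uaccess.h>
#else
/* User-space stand-in (assumption): a plain memcpy that never faults. */
#include <errno.h>
#include <string.h>
#define __user
static unsigned long copy_to_user(void __user *to, const void *from,
                                  unsigned long n)
{
    memcpy(to, from, n);
    return 0;                           /* bytes NOT copied */
}
#endif

static const char kernel_msg[] = "hello from the kernel\n";

/* Hypothetical read-style handler: copy up to len bytes of kernel_msg to
 * the user buffer, starting at *pos, and advance *pos. Returns the number
 * of bytes delivered, 0 at end of data, or -EFAULT. */
static long msg_read(char __user *ubuf, unsigned long len, unsigned long *pos)
{
    unsigned long avail = sizeof(kernel_msg) - 1;

    if (*pos >= avail)
        return 0;                       /* end of data */
    if (len > avail - *pos)
        len = avail - *pos;             /* clamp to what is left */
    if (copy_to_user(ubuf, kernel_msg + *pos, len))
        return -EFAULT;                 /* some bytes were not copied */
    *pos += len;
    return (long)len;
}
```

Clamping the length before the copy matters as much as checking the return value: the kernel must never read past the end of its own buffer on the user's say-so.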
The copy_from_user function
The copy_from_user function copies a block of data from user space into a kernel buffer. It accepts a destination buffer (in kernel space), a source buffer (from user space), and a length defined in bytes. As with copy_to_user, the function returns zero on success and non-zero to indicate a failure to copy some number of bytes.
copy_from_user( to, from, n );
The function begins by checking the ability to read from the source buffer in user space (via access_ok), and then calls __copy_from_user and eventually __copy_from_user_ll. From here, depending on architecture, a call is made to copy from the user buffer to a kernel buffer with zeroing (of unavailable bytes). The optimized assembly functions include the ability to manage.
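In practice, copy_from_user is often used to pull a whole request structure into the kernel before any field is examined. The following sketch assumes a hypothetical request layout and handler; the stand-in copy function is only there so the fragment builds in user space.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/uaccess.h>
#else
/* User-space stand-in (assumption): a plain memcpy that never faults. */
#include <errno.h>
#include <string.h>
#define __user
static unsigned long copy_from_user(void *to, const void __user *from,
                                    unsigned long n)
{
    memcpy(to, from, n);
    return 0;                       /* bytes NOT copied */
}
#endif

/* Hypothetical request layout passed in from user space. */
struct demo_req {
    int op;
    int arg;
};

/* Hypothetical handler: copy the request into kernel memory first, then
 * validate and act only on the kernel's private copy. */
static long handle_req(const void __user *uarg, int *result)
{
    struct demo_req req;

    if (copy_from_user(&req, uarg, sizeof(req)))
        return -EFAULT;
    if (req.op != 1)                /* only one operation is defined here */
        return -EINVAL;
    *result = req.arg + 1;
    return 0;
}
```

Validating the kernel's copy, not the user's original, avoids a race in which user space changes a field after it was checked but before it was used.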
The strnlen_user function
The strnlen_user function is used just like strnlen but assumes that the buffer is available in user space. The strnlen_user function takes two arguments: the user space buffer address and the maximum length to check.
strnlen_user( src, n );
The strnlen_user function first checks to see that the user buffer is readable through a call to access_ok. If accessible, the strlen function is called, and the max length argument is ignored.
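The sketch below shows how a caller might interpret the result of strnlen_user. It assumes the return convention documented for the generic implementation in later kernels (the size including the terminating NUL, 0 on a fault, and a value larger than n when no NUL was found within n bytes); the older x86 code described here may behave differently. The helper name, the limit, and the user-space stand-in are all hypothetical.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/uaccess.h>
#else
/* User-space stand-in (assumption) that mimics the assumed convention:
 * the size including the NUL, or n + 1 if no NUL lies within n bytes. */
#include <errno.h>
#define __user
static long strnlen_user(const char __user *s, long n)
{
    long i = 0;

    while (i < n && s[i] != '\0')
        i++;
    return i + 1;
}
#endif

#define DEMO_NAME_MAX 16            /* hypothetical limit, NUL included */

/* Returns the string size (with NUL) if it fits, else a negative errno. */
static long check_user_name(const char __user *uname)
{
    long size = strnlen_user(uname, DEMO_NAME_MAX);

    if (size == 0)
        return -EFAULT;             /* fault while scanning */
    if (size > DEMO_NAME_MAX)
        return -ENAMETOOLONG;       /* no NUL within the limit */
    return size;
}
```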
The strncpy_from_user function
The strncpy_from_user function copies a string from user space into a kernel buffer, given a user space source address and max length.
strncpy_from_user( dest, src, n );
As a copy from user space, this function first checks that the buffer is readable via access_ok. Similar to copy_from_user, this function is implemented as an optimized assembly function (within ./linux/arch/x86/lib/usercopy_XX.c).
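A caller of strncpy_from_user has to handle three outcomes: a fault, a string that fits, and a string that was truncated. The sketch below assumes the usual kernel convention (the string length without the NUL on success, a negative errno on a fault, and the full count when the string did not fit, in which case the destination is not NUL-terminated). The helper name, buffer size, and stand-in are hypothetical.

```c
#ifdef __KERNEL__
#include <linux/errno.h>
#include <linux/uaccess.h>
#else
/* User-space stand-in (assumption) that mimics the assumed convention. */
#include <errno.h>
#define __user
static long strncpy_from_user(char *dst, const char __user *src, long count)
{
    long i;

    for (i = 0; i < count; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return i;               /* length, not counting the NUL */
    }
    return count;                   /* truncated: dst has no NUL */
}
#endif

#define DEMO_BUF_SIZE 8             /* hypothetical kernel buffer size */

/* Copy a user string into kbuf (DEMO_BUF_SIZE bytes). Returns the string
 * length, or a negative errno if it faulted or did not fit. */
static long fetch_user_string(char *kbuf, const char __user *ustr)
{
    long len = strncpy_from_user(kbuf, ustr, DEMO_BUF_SIZE);

    if (len < 0)
        return len;                 /* fault while copying */
    if (len == DEMO_BUF_SIZE)
        return -E2BIG;              /* string did not fit with its NUL */
    return len;
}
```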
Other schemes for memory mapping
The previous section explored methods for moving data between the kernel and user space (with the kernel initiating the operation). Linux provides a number of other methods that you can use for data movement, both in the kernel and in user space. Although these methods may not necessarily provide identical functionality as described by the user space memory access functions, they are similar in their ability to map memory between address spaces.
In user space, note that because user processes appear in separate address spaces, moving data between them must occur through some form of inter-process communication mechanism. Linux provides a variety of schemes (such as message queues), but most notable is POSIX shared memory (shmem). This mechanism allows a process to create an area of memory, and then share that region with one or more processes. Note that each process can map the shared memory region to different addresses in their respective address spaces. Therefore, relative offset addressing is required.
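The need for relative offsets can be demonstrated inside a single process by mapping the same object twice. In this sketch a temporary file created with mkstemp stands in for a shm_open object (an assumption made so that no extra link flags are needed); the structure and function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096

/* Header stored at the start of the shared region. msg_off is an offset
 * from the region start, not a pointer, so it is valid in every mapping. */
struct region_hdr {
    size_t msg_off;
};

/* Map one object at two addresses, write through the first mapping and
 * read through the second. Returns 0 if the message round-trips. */
static int offset_addressing_demo(void)
{
    char path[] = "/tmp/shm-demo-XXXXXX";
    int fd = mkstemp(path);
    int rc = -1;
    char *a, *b;

    if (fd < 0)
        return -1;
    unlink(path);                               /* keep only the open fd */
    if (ftruncate(fd, REGION_SIZE) != 0) {
        close(fd);
        return -1;
    }

    a = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a != MAP_FAILED && b != MAP_FAILED) {
        struct region_hdr *writer = (struct region_hdr *)a;
        const struct region_hdr *reader = (const struct region_hdr *)b;

        writer->msg_off = sizeof(*writer);      /* message follows header */
        strcpy(a + writer->msg_off, "hello");

        /* Different base address, same offset, same bytes. */
        if (a != b && strcmp(b + reader->msg_off, "hello") == 0)
            rc = 0;
    }
    if (a != MAP_FAILED)
        munmap(a, REGION_SIZE);
    if (b != MAP_FAILED)
        munmap(b, REGION_SIZE);
    close(fd);
    return rc;
}
```

Had the header stored the pointer `a + msg_off` instead of the offset, the reader would have followed an address that is only meaningful in the writer's mapping.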
The mmap function allows a user space application to create a mapping in the virtual address space. This functionality is common in certain classes of device drivers (for performance), allowing physical device memory to be mapped into the virtual address space of the process. Within a driver, the mmap function is implemented through the remap_pfn_range kernel function, which provides a linear mapping of device memory into a user's address space.
Going further
This article explored the topic of memory management within Linux (to arrive at the point behind paging), and then explored the user space memory access functions that use those concepts. Moving data between the user space and kernel is not as simple as it seems, but Linux includes a simple set of APIs that manage the intricacies of this task across platforms for you.
References: