X86 Paging Mechanism
From The Linux Memory Wiki
After translating logical address into linear address, which basically has the same effect as doing nothing in Linux, we have to figure out how the addresses are mapped into the physical memory.
We have the following issues to consider during the physical address translation:
- The RAM usually has smaller size than the actual capacity a physical address can carry. For instance, most of our computer have 1GB of RAM while the 32-bit address can map up to 4GB of RAM. Hence, only a range of physical address is available.
- All processes have their own linear address. For instance, a two completely different program may have the same instruction jmp 0x30001500 but obviously they should jump to different places in the physical RAM.
- The processes may use more memory than the amount of actual memory. The kernel must load inactive memories of a process into harddisk and then load it out later when the process need it again.
Hence, a virtual memory system has to be designed to solve these problems. The paging system is one of the subsystem of Linux virtual memory manager. We will see how the paging system translate linear address into physical address and how it pass the work to other system when the virtual memory resides inside the harddisk.
Contents |
Page and Page Frame
It is important to understand the different thing a page and a page frame refers to.
A page frame is a range of physical address resides in the RAM.
A page is the actual content that should be loaded from the corresponding linear address.
A page and be inserted into any page frame while the page frame is static.
We may view the page and page frame in this way:
The RAM is like a big room that can be filled with many balls. We divide the big room into hundreds of boxes (the page frame), and each box can hold one ball (the page).
Each ball and box has a number to indentify theirselve. The paging system determine which ball is located at which box, or whether the box is inside the room or not. For instance, ball #3 may be placed in box #8 while ball #5 may be outside the room.
In x86 the size of the page frame can be either 4KB or 4MB. Linux chooses 4KB as default to improve performance on loading pages.
Hence, if we have a 512MB RAM, there will be 131072 page frames to be managed by the kernel. The first page frame is located from physical address 0x00000000 to 0x000001000, the 2nd page frame is located from 0x00002000, and so on.
Sharing of Pages
Each process has their own sets of linear address, hence each of them has their own set of pages which is independent of each other.
For instance, two completely different processes may have the same linear address 0x2038000A but one may point to the physical address 0x53901040 and the other points to physical address 0xAF394041.
Real Mode and Protected Mode
When we talking about how the CPU loads various descriptors from the RAM, it is important to understand that it doesn't goes through the segmentation and paging mechanism.
This means that when we load the segment descriptor mentioned in the previous section, it loads directly from the RAM according to the physical address stored inside the gdtr register. Hence it doesn't know anything about page frame and neither do it care which page is in this page frame.
The same situation applies when the CPU loads the page and page frame descriptors which we will describe later.
The situation above are for the internal memory loading of CPU, when we want to write assembly programs that loads directly from physical address, it is also possible. This is operated in a mode called the Real Mode. In real mode, a linear address is the same as a physical address.
Hence, the instruction lw %eax, 0x00001234 operated in real mode will bypass the segmentation and paging mechanism and will directly load data from the physical memory of address 0x00001234.
This explains how Linux actually load the descriptors into memory when it is booted up. During boot up, the kernel determines a range of physical address to store the segment and other descriptors. Note however that after boot up, the kernel will exit real mode. The kernel never operates in real mode after loading the descriptors into memory.
In following chapters we will discuss how the kernel manages the memory in normal situation.
Page Directory and Page Tables
So, the linear address is made up of whole lots of pages. For 32-bit architecture, there can be up to 4GB/4KB = one million of pages to be used by a process. These pages may reside in any page frames or may reside in the harddisk. Given a linear address, we know the page we have to find, but how we find that page?
The direct way of doing this is to use a way similar to what we did in segmentation. We can create a page table array of one million pages, and then use the offset (which obtained by taking the 29 most significant bits of the address because each page is 4KB = multiply of 0x1000) to locate the page descriptor, which then points us to the physical address that the page really is.
(revise what you learned from segmentation if you don't understand this. The only difference is the segments are 16-bit but the page tables are 29-bit, one is taken from segment register while paging takes from part of the linear address.)
But obviously this would be too waste of memory, because each process must have their own linear address space, they must have their own set of page tables. If we create a page table of one million entries for each process that would be horrible. Furthermore, not all process will eat up 4GB of memory.
In Intel x86 architecture, the paging unit provide 2 level paging mechanism. The procedure is the following:
We first take the 10 most significant bits (bits 22-31) and refer it to the corresponding entry in a page table called the Page Directory.
This entry will point us to another table, called the Page Table (the name of the page table is Page Table! watch out the case just as we do in Java).
We then take the middle 10 significant bits (bits 12-21) and use this to refer to the corresponding entry in the Page Table.
This entry will point us to the physical address of the corresponding page, we will then add the offset to the physical address to produce the physical address for the linear address.
The whole picture now becomes like this, for a set of linear address there exists one Page Directory with 2^10 = 1024 entries. Each of this entry points to another table named Page Table. (Or we may say there is one Page Directory which is a table of 1024 Page Tables if you wish to be simple)
Each of this Page Table has also 2^10 = 1024 entries, and each of this entry points to the physical address of the page.
Why do we want to split one big table into one table of tables? This is because the Page Tables of a linear address region are only created when a process truely uses the linear address.
For instance, if a process never uses the range of linear addess from
1111 1111 0100 0000 0000 0000 0000 = 0xFF400000 to 1111 1111 0111 1111 1111 1111 1111 = 0xFF7FFFFF
Then the Page Directory at entry 11 1111 1101 = 0x3FD (notice the uneven splitting of 10 bits in the binary representation) will not point to any Page Table because all entries in that Page Table will be never used. The kernel thus saved a space of 1024 * 4 = 4KB of memories for one Page Table!
Page Table Descriptor and cr3 Register
So we have some page tables which point to each other in a complex way, each entries in the table must hold some information, and the entry is called the page table descriptor. Each page table descriptor is of 4 bytes or 32-bit long, a page table has 1024 entries, and each entry size of 4 bytes give the total size of a page table 4KB, which fits nicely into one page frame. Notice that the descriptor of a Page Directory is the same as the descriptor of a Page Table, even though they points to different thing.
Each descriptor holds various fields that hold different information. We will talk about some of the important field here:
- base field: This 20-bit field points to either the physical address of the next page table or the page frame. The address of the page table and page frame are both multiple of 4KB hence the 12 least significant bits of the address can be omitted.
- present flag: This flag determines if the entry exists or not. If the flag is 0, the entry is not valid and the address field will not be referenced. Instead, a Page Fault will be issued by the Paging Unit. The Page Fault exception will be discussed later.
- user/supervisor flag: This flag determines if the page belongs to to the kernel or user. If it is set to 0, the page is protected and only the kernel can access that page.
We know that we can get the physical address of a linear address by first refering to the Page Directory, then the Page Table. But how do we know where is the Page Directory?
The problem is solved using the same way as segmentation again. We uses another register called the cr3 register. The cr3 register is a 20-bit register pointing to the physical address of the Page Directory.
Sharing of cr3 register
Each process has their own Page Directory, hence whenever a process switch happens, the kernel will change the value of the cr3 register to point to the Page Directory of the new running process. The kernel obtains the physical address if the process's Page Directory from the process's process descriptor. (another descriptor stored in somewhere else in the memory!)
By this, every process got their own linear address space by simply changing the cr3 register.
Memory Management Unit
An Example
An example explains everything. Suppose we have an instruction jmp 0x16C23150 (0001 0110 1100 0010 0011 0001 0101 0000), here is how the Paging Unit translate the linear address into physical address:
- Since the segmentation has no effect in Linux, the value 0x06C23150 remains the same in linear address.
- The Paging Unit loads the cr3 register. Suppose the cr3 register value is 0x05C71 (0000 0101 1100 0111 0001), we now know the Page Directory is located at 0x05C71 * 0x1000 = 0x05C71000.
- Extract the 10 MSB from the linear address, 0x5B (00 0101 1011), this is the 91th entry in the Page Directory. Multiply by 4 bytes to get 0x16C (0001 0110 1100) back and add this to 0x05C71000 to obtain 0x05C7116C. This is the physical address of the 91th descriptor in the Page Directory.
- Read the descriptor of the entry from physical address 0x05C7116C. Suppose the present flag is set and the base field has the 20-bit value 0x05E37 (0000 0101 1110 0011 0111), multiply this value by 0x1000 to get 0x05E37000. This is the physical address of the 91th Page Table in the Page Directory.
- Obtain the middle 10 bits value of the linear address, 0x23 (00 0010 0011), this is the 35th entry of the Page Table. Multiply by 4 to get 0x08C (0000 1000 1100). Add this to the physical address of Page Table 0x05E37000 to get 0x05E3708C. This is the physical address of the 35th entry of the Page Table.
- Read the descriptor at physical address 0x05E3708C. Suppose the present flag is set and the base field has the 20-bit value 0x23C69 (0010 0011 1100 0110 1001), multiply by 0x1000 and the address 0x23C69000 is the physical address of the page frame for the linear address.
- Extract the 12 LSB from the linear address, 0x150, this is the offset of the physical address from the page frame. Add to 0x23C69000 to get 0x23C69150.
So finally! The linear address 0x06C23150 has been translated to physical address 0x23C69150! This is located somewhere after the 512MB region of the RAM. If the computer has a pair of 512MB RAM, it is located some where at the beginning of the 2nd RAM.
This example illustrates the complex event involved inside the Paging Unit to translate a linear address into physical address. Thankfully this is done automatically by the Paging Unit, otherwise it would be tedious to write a kernel code to perform such translation!
Warning: I *strongly strongly* discourage of using this kind of example as exam question! The example is useful only for illustration purpose and I believe forcing students to do this kind of stupid, error prone, and mechanical calculation do not benifite the student at all and does not prove anything about the knowledge of the student about operating systems.
No exam question is allowed to imitate this example!!
Other Kinds of Paging Mechanism
Other than the default paging mechanism, called Regular Paging in x86 architecture, there is several more paging mechanism that is supported by Intel:
- Extended Paging - in this model the page frame size is 4MB instead of 4KB. There is only one Page Directory and no Page table to map the pages, as there is 1024 time less pages that needs to be mapped by the Paging Unit.
- Physical Address Extension (PAE) Paging - this model allows the kernel to map up to 64GB physical address instead of the traditional 4GB limit. This is done mainly by increasing the bit size of the cr3 register to 24-bit and increasing the base field of the page tables. However only the kernel is able to map all 64GB physical address and user processes are still limited to the 32-bit address 4GB boundry.
By recent introduction of the x64 64-bit architecture, we are now able to map virtually unlimited amount of physical and linear address. The huge amount of 64-bit linear address makes the 2-level regular paging mechanism inefficient. A new 4-level paging mechanism is introduced in the x64 architecture to map the 64-bit linear address. The only difference of this with 32-bit regular paging is that more level of page tables are used to map the linear address.
We won't discuss the details of these paging mechanism. In this book we will focus mainly on the regular paging mechanism in the x86 architecture.
