A logical address specified in an instruction is first translated to a linear address by the segmenting hardware. This linear address is then translated to a physical address by the paging unit.
The paging unit translates addresses through two levels of indirection. A page directory contains pointers to 1024 page tables, and each page table contains pointers to 1024 pages. The register CR3 holds the physical base address of the page directory. It is stored as part of the TSS in the task_struct and is therefore reloaded on each task switch.
A 32-bit Linear address is divided as follows:
31 ...... 22 | 21 ...... 12 | 11 ...... 0 |
---|---|---|
DIR | TABLE | OFFSET |
CR3 + DIR | points to the table_base. |
table_base + TABLE | points to the page_base. |
physical_address = | page_base + OFFSET |
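As a minimal sketch of the split described above, the following C helpers pull the DIR, TABLE, and OFFSET fields out of a 32-bit linear address and put them back together. The names `linear_parts`, `split_linear`, and `join_linear` are made up for this illustration and do not come from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout from the table above:
 * DIR = bits 31-22, TABLE = bits 21-12, OFFSET = bits 11-0. */
struct linear_parts {
    uint32_t dir;    /* index into the page directory (0..1023) */
    uint32_t table;  /* index into the page table (0..1023) */
    uint32_t offset; /* byte offset within the 4 KB page (0..4095) */
};

/* Break a linear address into its three fields. */
struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;
    p.dir    = (linear >> 22) & 0x3ffu;
    p.table  = (linear >> 12) & 0x3ffu;
    p.offset = linear & 0xfffu;
    assert(p.dir < 1024 && p.table < 1024 && p.offset < 4096);
    return p;
}

/* Inverse of split_linear(): rebuild the address from its fields. */
uint32_t join_linear(struct linear_parts p)
{
    return (p.dir << 22) | (p.table << 12) | p.offset;
}
```

For example, `split_linear(0xc0001234)` yields DIR 0x300, TABLE 0x001, and OFFSET 0x234.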
Page tables and pages are page aligned, so the lower 12 bits of each page directory (page table) entry are free to store useful information about the page table (page) that the entry points to.
Format for Page directory and Page table entries:
31 ...... 12 | 11 .. 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|
ADDRESS | OS | 0 | 0 | D | A | 0 | 0 | U/S | R/W | P |
D | 1 means page is dirty (undefined for page directory entry). |
R/W | 0 means readonly for user. |
U/S | 1 means user page. |
P | 1 means page is present in memory. |
A | 1 means page has been accessed (set to 0 by aging). |
OS | bits can be used for LRU etc, and are defined by the OS. |
When a page is swapped, bits 1-31 of the page table entry are used to mark where a page is stored in swap (bit 0 must be 0).
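The entry format above can be sketched as a handful of C masks and accessors. The macro and function names here are illustrative and are not the kernel's own. The swap accessor simply returns bits 31-1 as an opaque value, since the internal encoding of the swap location is not described here.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the entry-format table above (illustrative names). */
#define PTE_P    0x001u       /* bit 0: present in memory */
#define PTE_RW   0x002u       /* bit 1: writable (0 = read-only for user) */
#define PTE_US   0x004u       /* bit 2: user page */
#define PTE_A    0x020u       /* bit 5: accessed */
#define PTE_D    0x040u       /* bit 6: dirty */
#define PTE_OS   0xe00u       /* bits 11-9: defined by the OS */
#define PTE_ADDR 0xfffff000u  /* bits 31-12: page-aligned address */

int pte_present(uint32_t e) { return (e & PTE_P) != 0; }
int pte_dirty(uint32_t e)   { return (e & PTE_D) != 0; }

/* Physical address of the page (or page table) a present entry points to. */
uint32_t pte_frame(uint32_t e)
{
    assert(pte_present(e));
    return e & PTE_ADDR;
}

/* For a swapped-out entry (bit 0 clear), bits 31-1 say where the page is. */
uint32_t pte_swap_info(uint32_t e)
{
    assert(!pte_present(e));
    return e >> 1;
}

/* True if user code may write through this entry. */
int pte_user_writable(uint32_t e)
{
    const uint32_t need = PTE_P | PTE_US | PTE_RW;
    return (e & need) == need;
}
```

An entry of 0x00123067 is present, user, writable, accessed, and dirty, and points at physical address 0x00123000.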
Paging is enabled by setting the highest bit in CR0. [in head.S?] At each stage of the address translation, access permissions are verified. Pages not present in memory and protection violations both result in page faults. The fault handler (in memory.c) then either brings in a new page, removes write protection from a page, or does whatever else needs to be done. The page fault error code has the following bits:
bit | cleared | set |
0 | page not present | page level protection |
1 | fault due to read | fault due to write |
2 | supervisor mode | user mode |
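A fault handler would typically start by decoding these three bits. The following is only a sketch of that decoding. `fault_info`, `decode_fault`, and `is_write_protection_fault` are hypothetical names, not functions from memory.c.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of the three error-code bits in the table above. */
struct fault_info {
    int protection; /* bit 0: 1 = page-level protection, 0 = page not present */
    int write;      /* bit 1: 1 = fault due to write, 0 = due to read */
    int user;       /* bit 2: 1 = user mode, 0 = supervisor mode */
};

struct fault_info decode_fault(uint32_t error_code)
{
    struct fault_info f;
    f.protection = (int)(error_code & 1u);
    f.write      = (int)((error_code >> 1) & 1u);
    f.user       = (int)((error_code >> 2) & 1u);
    return f;
}

/* A write to a page that is present but write protected. */
int is_write_protection_fault(uint32_t error_code)
{
    return (error_code & 3u) == 3u;
}
```

An error code of 7, for instance, decodes as a user-mode write that hit a page-level protection violation, while 2 is a supervisor-mode write to a page that is not present.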
The Translation Lookaside Buffer (TLB) is a hardware cache for physical addresses of the most recently used virtual addresses. When a virtual address is translated the 386 first looks in the TLB to see if the information it needs is available. If not, it has to make a couple of memory references to get at the page directory and then the page table before it can actually get at the page. Three physical memory references for address translation for every logical memory reference would kill the system, hence the TLB.
The TLB is flushed whenever CR3 is loaded, either explicitly or by a task switch that changes CR3. In Linux it is flushed explicitly by calling invalidate(), which just reloads CR3.
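To make the cost argument concrete, here is a toy software model that only counts physical memory references: one on a TLB hit, three on a miss. It is a deliberate simplification and not a description of the real 386 TLB. The direct-mapped organization, the size `TLB_ENTRIES`, and all names are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 8u  /* hypothetical size, chosen only for the example */

struct tlb_entry {
    int      valid;
    uint32_t vpn;   /* virtual page number: linear address >> 12 */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
};

/* Invalidate every entry, as a CR3 reload would. */
void tlb_flush(struct tlb *t)
{
    assert(t != 0);
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        t->e[i].valid = 0;
}

/* Return the number of physical memory references this access costs:
 * 1 on a hit (the page itself), 3 on a miss (page directory, page table,
 * then the page). A miss also fills the slot for next time. */
unsigned tlb_access(struct tlb *t, uint32_t linear)
{
    uint32_t vpn = linear >> 12;
    struct tlb_entry *slot = &t->e[vpn % TLB_ENTRIES];

    if (slot->valid && slot->vpn == vpn)
        return 1;
    slot->valid = 1;
    slot->vpn = vpn;
    return 3;
}
```

In this model a second access to the same page costs one reference instead of three, and every access after `tlb_flush()` starts out as a miss again.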
Segment registers are used in address translation to generate a linear address from a logical (virtual) address.
linear_address = segment_base + logical_address
The linear address is then translated into a physical address by the paging hardware.
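The formula above can be sketched in C as follows. The `segment` structure and `seg_to_linear` are hypothetical, and the limit check is an added assumption: here `limit` is taken to be the highest valid byte offset within the segment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of a segment: just a base and a limit.
 * The limit is assumed to be the highest valid byte offset. */
struct segment {
    uint32_t base;
    uint32_t limit;
};

/* Compute linear_address = segment_base + logical_address.
 * Returns 0 on success, -1 if the offset lies beyond the segment limit. */
int seg_to_linear(const struct segment *s, uint32_t logical, uint32_t *linear)
{
    assert(s != 0 && linear != 0);
    if (logical > s->limit)
        return -1;
    *linear = s->base + logical;
    return 0;
}
```

With a segment based at 0xc0000000 and a limit of 0x3fffffff (1 gigabyte), logical address 0x1000 maps to linear address 0xc0001000, and logical address 0x40000000 is rejected.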
Each segment in the system is described by an 8-byte segment descriptor which contains all pertinent information (base, limit, type, privilege).
The segments are:
Characteristics of system segments
To keep track of all these segments, the 386 uses a global descriptor table (GDT) that is set up in memory by the system (located by the GDT register). The GDT contains a segment descriptor for each task state segment, each local descriptor table, and also regular segments. The Linux GDT contains just two normal segment entries:
LDT[n] != LDTn
LDT[n] | = the nth descriptor in the LDT of the current task. |
LDTn | = a descriptor in the GDT for the LDT of the nth task. |
The kernel segments have base 0xc0000000 which is where the kernel lives in the linear view. Before a segment can be used, the contents of the descriptor for that segment must be loaded into the segment register. The 386 has a complex set of criteria regarding access to segments so you can't simply load a descriptor into a segment register. Also these segment registers have programmer invisible portions. The visible portion is what is usually called a segment register: cs, ds, es, fs, gs, and ss.
The programmer loads one of these registers with a 16-bit value called a selector. The selector uniquely identifies a segment descriptor in one of the tables. Access is validated and the corresponding descriptor loaded by the hardware.
Currently Linux largely ignores the (overly?) complex segment level protection afforded by the 386. It is biased towards the paging hardware and the associated page level protection. The segment level rules that apply to user processes are
A segment selector is loaded into a segment register (cs, ds, etc.) to select one of the regular segments in the system as the one addressed via that segment register.
Segment selector Format:
15 ...... 3 | 2 | 1 0 |
---|---|---|
index | TI | RPL |
Examples:
Selectors used in Linux:
TI | index | RPL | selector | segment | descriptor |
---|---|---|---|---|---|
0 | 1 | 0 | 0x08 | kernel code | GDT[1] |
0 | 2 | 0 | 0x10 | kernel data/stack | GDT[2] |
0 | 3 | 0 | ??? | ??? | GDT[3] |
1 | 1 | 3 | 0x0F | user code | LDT[1] |
1 | 2 | 3 | 0x17 | user data/stack | LDT[2] |
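The selector values in this table follow directly from the field layout: index in bits 15-3, TI in bit 2, and RPL in bits 1-0. A small sketch, with helper names invented for the illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Build a selector from its fields:
 * index in bits 15-3, TI in bit 2 (0 = GDT, 1 = LDT), RPL in bits 1-0. */
uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* Field accessors for an existing selector. */
unsigned selector_index(uint16_t sel) { return sel >> 3; }
unsigned selector_ti(uint16_t sel)    { return (sel >> 2) & 1u; }
unsigned selector_rpl(uint16_t sel)   { return sel & 3u; }
```

For instance, `make_selector(2, 1, 3)` gives 0x17, the user data/stack selector, and `make_selector(1, 0, 0)` gives 0x08, the kernel code selector.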
On entry into syscall:
There is a segment descriptor used to describe each segment in the system. There are regular descriptors and system descriptors. Here's a descriptor in all its glory. The strange format is essentially to maintain compatibility with the 286. Note that it takes 8 bytes.
63-56 | 55 | 54 | 53 | 52 | 51-48 | 47 | 46-45 | 44 | 43-40 | 39-16 | 15-0 |
---|---|---|---|---|---|---|---|---|---|---|---|
Base 31-24 | G | D | R | U | Limit 19-16 | P | DPL | S | TYPE | Segment Base 23-0 | Segment Limit 15-0 |
Explanation:
R | reserved (0) |
DPL | 0 means kernel, 3 means user |
G | 1 means 4K granularity (Always set in Linux) |
D | 1 means default operand size 32bits |
U | programmer definable |
P | 1 means present in physical memory |
S | 0 means system segment, 1 means normal code or data segment. |
Type | There are many possibilities. Interpreted differently for system and normal descriptors. |
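To show how the scattered base and limit fields fit together, here is a sketch that packs the fields into the 8-byte descriptor and reads the base and limit back out. It assumes the standard 386 layout: limit 15-0 in bits 15-0, base 23-0 in bits 39-16, TYPE in bits 43-40, S in bit 44, DPL in bits 46-45, P in bit 47, limit 19-16 in bits 51-48, U in bit 52, R in bit 53, D in bit 54, G in bit 55, and base 31-24 in bits 63-56. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The logical fields of a segment descriptor (illustrative names). */
struct seg_fields {
    uint32_t base;   /* 32-bit segment base */
    uint32_t limit;  /* 20-bit segment limit */
    unsigned g, d, u, p, dpl, s, type;
};

/* Pack the fields into the 64-bit descriptor format. */
uint64_t pack_descriptor(const struct seg_fields *f)
{
    uint64_t desc = 0;

    assert(f->limit <= 0xfffffu);
    desc |= (uint64_t)(f->limit & 0xffffu);           /* limit 15-0  */
    desc |= (uint64_t)(f->base & 0xffffffu) << 16;    /* base 23-0   */
    desc |= (uint64_t)(f->type & 0xfu) << 40;         /* TYPE        */
    desc |= (uint64_t)(f->s & 1u) << 44;              /* S           */
    desc |= (uint64_t)(f->dpl & 3u) << 45;            /* DPL         */
    desc |= (uint64_t)(f->p & 1u) << 47;              /* P           */
    desc |= (uint64_t)((f->limit >> 16) & 0xfu) << 48;/* limit 19-16 */
    desc |= (uint64_t)(f->u & 1u) << 52;              /* U           */
    /* bit 53 (R) is reserved and stays 0 */
    desc |= (uint64_t)(f->d & 1u) << 54;              /* D           */
    desc |= (uint64_t)(f->g & 1u) << 55;              /* G           */
    desc |= (uint64_t)(f->base >> 24) << 56;          /* base 31-24  */
    return desc;
}

/* Reassemble the 32-bit base from its two pieces. */
uint32_t descriptor_base(uint64_t desc)
{
    uint32_t low  = (uint32_t)((desc >> 16) & 0xffffffu);
    uint32_t high = (uint32_t)((desc >> 56) & 0xffu);
    return low | (high << 24);
}

/* Reassemble the 20-bit limit from its two pieces. */
uint32_t descriptor_limit(uint64_t desc)
{
    uint32_t low  = (uint32_t)(desc & 0xffffu);
    uint32_t high = (uint32_t)((desc >> 48) & 0xfu);
    return low | (high << 16);
}
```

Packing P=1, DPL=0, S=1, G=1, D=1, type=0xa, base=0xc0000000, limit=0x3ffff produces 0xc0c39a000000ffff.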
Linux system descriptors:
TSS: P=1, DPL=0, S=0, type=9, limit=231 (room for 1 tss_struct).
LDT: P=1, DPL=0, S=0, type=2, limit=23 (room for 3 segment descriptors).
The base is set during fork(). There is a TSS and LDT for each task.
Linux regular kernel descriptors: (head.S)
code: P=1, DPL=0, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x3ffff
data: P=1, DPL=0, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x3ffff
The LDT for task[0] contains: (sched.h)
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x9f
data: P=1, DPL=3, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x9f
The default LDT for the remaining tasks: (exec())
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0, limit= 0xbffff
data: P=1, DPL=3, S=1, G=1, D=1, type=2, base=0, limit= 0xbffff
The size of the kernel segments is 0x40000 pages (4 KB pages, since G=1), which comes to 1 gigabyte. The type implies that the permissions on the code segment are read-exec and on the data segment are read-write.
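The size arithmetic can be checked with a short sketch. The helper name is made up, and it assumes that the limit is the index of the last valid unit, so the size is limit + 1 units of either 4 KB (G=1) or 1 byte (G=0). Treating the TSS and LDT limits quoted earlier as byte-granular is also an assumption, based on the stated room for 3 eight-byte descriptors.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a segment with the given 20-bit limit and G bit.
 * Assumes the limit names the last valid unit, so there are limit + 1
 * units, each 4 KB when g is set and 1 byte otherwise. */
uint64_t segment_size_bytes(uint32_t limit, int g)
{
    assert(limit <= 0xfffffu);
    uint64_t units = (uint64_t)limit + 1;
    return g ? units * 4096u : units;
}
```

Under these assumptions, limit=0x3ffff with G=1 gives 0x40000000 bytes (1 gigabyte), limit=0xbffff gives 3 gigabytes, limit=0x9f gives 640 KB, and a byte-granular limit of 23 gives 24 bytes.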
Registers associated with segmentation.
Format of segment register: (Only the selector is programmer visible)
16-bit | 32-bit | 32-bit | |
---|---|---|---|
selector | physical base addr | segment limit | attributes |
Format of GDTR (and IDTR):
32-bits | 16-bits |
---|---|
Linear base addr | table limit |
The TR and LDTR are loaded from the GDT and so have the format of the other segment registers. The task register (TR) contains the descriptor for the currently executing task's TSS. The execution of a jump to a TSS selector causes the state to be saved in the old TSS, the TR is loaded with the new descriptor and the registers are restored from the new TSS. This is the process used by schedule to switch to various user tasks. Note that the field tss_struct.ldt contains a selector for the LDT of that task. It is used to load the LDTR. (sched.h)
Some assembler macros are defined in sched.h and system.h to ease access and setting of descriptors. Each TSS entry and LDT entry takes 8 bytes.
Manipulating GDT system descriptors:
Copyright (C) 1992, 1993, 1996 Michael K. Johnson.