The HyperNews Linux KHG Discussion Pages

80386 Memory Management

A logical address specified in an instruction is first translated to a linear address by the segmenting hardware. This linear address is then translated to a physical address by the paging unit.

Paging on the 386

There are two levels of indirection in address translation by the paging unit. A page directory contains pointers to 1024 page tables. Each page table contains pointers to 1024 pages. The register CR3 contains the physical base address of the page directory and is stored as part of the TSS in the task_struct and is therefore loaded on each task switch.

A 32-bit Linear address is divided as follows:

31 ...... 22 21 ...... 12 11 ...... 0
DIR TABLE OFFSET

Physical address is then computed (in hardware) as:
CR3 + DIR points to the table_base.
table_base + TABLE points to the page_base.
physical_address = page_base + OFFSET

Page directories (page tables) are page aligned so the lower 12 bits are used to store useful information about the page table (page) pointed to by the entry.

Format for Page directory and Page table entries:

31 ...... 12 11 .. 9 8 7 6 5 4 3 2 1 0
ADDRESS OS 0 0 D A 0 0 U/S R/W P

D 1 means page is dirty (undefined for page directory entry).
R/W 0 means readonly for user.
U/S 1 means user page.
P 1 means page is present in memory.
A 1 means page has been accessed (set to 0 by aging).
OS bits can be used for LRU etc, and are defined by the OS.

The corresponding definitions for Linux are in .

When a page is swapped, bits 1-31 of the page table entry are used to mark where a page is stored in swap (bit 0 must be 0).

Paging is enabled by setting the highest bit in CR0. [in head.S?] At each stage of the address translation access permissions are verified and pages not present in memory and protection violations result in page faults. The fault handler (in memory.c) then either brings in a new page or unwriteprotects a page or does whatever needs to be done.

Page Fault handling Information

The Translation Lookaside Buffer (TLB) is a hardware cache for physical addresses of the most recently used virtual addresses. When a virtual address is translated the 386 first looks in the TLB to see if the information it needs is available. If not, it has to make a couple of memory references to get at the page directory and then the page table before it can actually get at the page. Three physical memory references for address translation for every logical memory reference would kill the system, hence the TLB.

The TLB is flushed if CR3 loaded or by task switch that changes CR0. It is explicitly flushed in Linux by calling invalidate() which just reloads CR3.

Segments in the 80386

Segment registers are used in address translation to generate a linear address from a logical (virtual) address.
linear_address = segment_base + logical_address
The linear address is then translated into a physical address by the paging hardware.

Each segment in the system is described by a 8 byte segment descriptor which contains all pertinent information (base, limit, type, privilege).

The segments are:

Regular segments
System segments

Characteristics of system segments

To keep track of all these segments, the 386 uses a global descriptor table (GDT) that is setup in memory by the system (located by the GDT register). The GDT contains a segment descriptors for each task state segment, each local descriptor tablet and also regular segments. The Linux GDT contains just two normal segment entries:

The rest of the GDT is filled with TSS and LDT system descriptors:

LDT[n] != LDTn

LDT[n] = the nth descriptor in the LDT of the current task.
LDTn = a descriptor in the GDT for the LDT of the nth task.

The kernel segments have base 0xc0000000 which is where the kernel lives in the linear view. Before a segment can be used, the contents of the descriptor for that segment must be loaded into the segment register. The 386 has a complex set of criteria regarding access to segments so you can't simply load a descriptor into a segment register. Also these segment registers have programmer invisible portions. The visible portion is what is usually called a segment register: cs, ds, es, fs, gs, and ss.

The programmer loads one of these registers with a 16-bit value called a selector. The selector uniquely identifies a segment descriptor in one of the tables. Access is validated and the corresponding descriptor loaded by the hardware.

Currently Linux largely ignores the (overly?) complex segment level protection afforded by the 386. It is biased towards the paging hardware and the associated page level protection. The segment level rules that apply to user processes are

  1. A process cannot directly access the kernel data or code segments
  2. There is always limit checking but given that every user segment goes from 0x00 to 0xc0000000 it is unlikely to apply. [This has changed, and needs updating, please.]

Selectors in the 80386

A segment selector is loaded into a segment register (cs, ds, etc.) to select one of the regular segments in the system as the one addressed via that segment register.

Segment selector Format:

15 ...... 3 2 1 0
index TI RPL
TI Table indicator:
0 means selector indexes into GDT
1 means selector indexes into LDT
RPL Privelege level. Linux uses only two privelege levels.
0 means kernel
3 means user

Examples:

Kernel code segment
TI=0, index=1, RPL=0, therefore selector = 0x08 (GDT[1])
User data segment
TI=1, index=2, RPL=3, therefore selector = 0x17 (LDT[2])

Selectors used in Linux:

TI index RPL selector segment
0 1 0 0x08 kernel code GDT[1]
0 2 0 0x10 kernel data/stack GDT[2]
0 3 0 ??? ??? GDT[3]
1 1 3 0x0F user code LDT[1]
1 2 3 0x17 user data/stack LDT[2]
Selectors for system segments are not to be loaded directly into segment registers. Instead one must load the TR or LDTR.

On entry into syscall:

Segment descriptors

There is a segment descriptor used to describe each segment in the system. There are regular descriptors and system descriptors. Here's a descriptor in all its glory. The strange format is essentially to maintain compatibility with the 286. Note that it takes 8 bytes.

63-54 55 54 53 52 51-48 47 46 45 44-40 39-16 15-0
Base
31-24
G D R U Limit
19-16
P DPL S TYPE Segment Base
23-0
Segment Limit
15-0

Explanation:

R reserved (0)
DPL 0 means kernel, 3 means user
G 1 means 4K granularity (Always set in Linux)
D 1 means default operand size 32bits
U programmer definable
P 1 means present in physical memory
S 0 means system segment, 1 means normal code or data segment.
Type There are many possibilities. Interpreted differently for system and normal descriptors.

Linux system descriptors:
TSS: P=1, DPL=0, S=0, type=9, limit = 231 room for 1 tss_struct.
LDT: P=1, DPL=0, S=0, type=2, limit = 23 room for 3 segment descriptors.
The base is set during fork(). There is a TSS and LDT for each task.

Linux regular kernel descriptors: (head.S)
code: P=1, DPL=0, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x3ffff
data: P=1, DPL=0, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x3ffff

The LDT for task[0] contains: (sched.h)
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x9f
data: P=1, DPL=3, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x9f

The default LDT for the remaining tasks: (exec())
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0, limit= 0xbffff
data: P=1, DPL=3, S=1, G=1, D=1, type=2, base=0, limit= 0xbffff

The size of the kernel segments is 0x40000 pages (4KB pages since G=1 = 1 Gigabyte). The type implies that the permissions on the code segment is read-exec and on the data segment is read-write.

Registers associated with segmentation.

Format of segment register: (Only the selector is programmer visible)

16-bit 32-bit 32-bit
selector physical base addr segment limit attributes
The invisible portion of the segment register is more conveniently viewed in terms of the format used in the descriptor table entries that the programmer sets up. The descriptor tables have registers associated with them that are used to locate them in memory. The GDTR (and IDTR) are initialized at startup once the tables are defined. The LDTR is loaded on each task switch.

Format of GDTR (and IDTR):

32-bits 16-bits
Linear base addr table limit

The TR and LDTR are loaded from the GDT and so have the format of the other segment registers. The task register (TR) contains the descriptor for the currently executing task's TSS. The execution of a jump to a TSS selector causes the state to be saved in the old TSS, the TR is loaded with the new descriptor and the registers are restored from the new TSS. This is the process used by schedule to switch to various user tasks. Note that the field tss_struct.ldt contains a selector for the LDT of that task. It is used to load the LDTR. (sched.h)

Macros used in setting up descriptors

Some assembler macros are defined in sched.h and system.h to ease access and setting of descriptors. Each TSS entry and LDT entry takes 8 bytes.

Manipulating GDT system descriptors:

_TSS(n), _LDT(n)
These provide the index into the GDT for the n'th task.
_LDT(n) is stored in the the ldt field of the tss_struct by fork.
_set_tssldt_desc(n, addr, limit, type)
ulong *n points to the GDT entry to set (see fork.c).
The segment base (TSS or LDT) is set to 0xc0000000 + addr.
Specific instances of the above are, where ltype refers to the byte containing P, DPL, S and type:
set_ldt_desc(n, addr) ltype = 0x82
P=1, DPL=0, S=0, type=2 means LDT entry.
limit = 23 => room for 3 segment descriptors.
set_tss_desc(n, addr) ltype = 0x89
P=1, DPL=0, S=0, type = 9, means available 80386 TSS limit = 231 room for 1 tss_struct.
load_TR(n),
load_ldt(n) load descriptors for task number n into the task register and ldt register.
ulong get_base (struct desc_struct ldt)
gets the base from a descriptor.
ulong get_limit (ulong segment)
gets the limit (size) from a segment selector.
Returns the size of the segment in bytes.
set_base(struct desc_struct ldt, ulong base),
set_limit(struct desc_struct ldt, ulong limit)
Will set the base and limit for descriptors (4K granular segments).
The limit here is actually the size in bytes of the segment.
_set_seg_desc(gate_addr, type, dpl, base, limit)
Default values 0x00408000 => D=1, P=1, G=0
Present, operation size is 32 bit and max size is 1M.
gate_addr must be a (ulong *)

Copyright (C) 1992, 1993, 1996 Michael K. Johnson, [email protected].

Copyright (C) 1992, 1993 Krishna Balasubramanian and Douglas Johnson


Messages

1. Feedback: paging initialization, doc update by [email protected]
2. Feedback: User Code and Data Segment no longer in LDT. by Lennart Benschop