"Thou hast to recompile thee kernel".
This antique curse has been thrown on every Linux newcomer since the birth of Linux. Unfortunately as long as kernel recompiling is deemed a necessary part of a Linux installation it will be impossible to spread Linux between non-nerds. In this article we will make a detailed analysis of the performance increases one can expect of kernel compiling.
To begin with we will see module compiling. Compiling a module in the kernel will save a little more than 2K per module: 2K due to page alignment and a small bit of code for the loading, unloading of the module. Now, despite being a module fanatic I never managed to be in a situation with more than ten modules loaded, but let's imagine you have 20 modules loaded and all of them are needed permanently so you recompile them in the kernel. You would save 40K of memory, that is 0.5% of the memory of an 8 Meg computer.
Now we will look at benefits of a lean kernel. When Matt Welsh wrote his books kernel recompiling was undoubtedly necessary. It was not uncommon to be able to save above 1.5 Megs of memory and your average computer had 8 Megs of RAM. Thus recompiling would increase memory available from 5.5 to 7 Megs that is a 27% increase.
But people failed to notice that Linux has gone modular and computers got more memory. Today most distributions ship modular kernels so recompiling will get benefits much smaller than in 1995. As an example I tested recompiling the kernel shipped in RedHat 5.2 with everything unneeded thrown out and modularizing everything else when it was possible. The boot messages (that is before loading of any module) showed I had saved a mere 400K. In addition today even low end computers have 32 Megs of RAM that means that recompiling your kernel will increase your available memory of only 1.25%
It is possible to write a specially designed program who will not do a single page fault with N Megs of memory and thrash horribly if you reduce it by a single page. However in normal situations a 1.25% increase in memory available will make little difference. There ARE still a couple distributions who ship kernels good for little else outside installation: huge kernels lacking essential features so recompiling is not a performance issue but a requirement. Now consider what happens if a small company without a full-time guru needs a firewall. Its expert is good for little else short of starting Word. If he stumbles upon a distribution with one of those broken kernels he will fail and will end recommending NT.
Most modern distribs (Caldera, Suse, RedHat and their clones) ship fully-featured kernels and in addition kernel recompiling will produce no appreciable speed increase due to memory savings: they are good enough out of the box. Only a couple of "hackeristic" distribs will force you to recompile the kernel. But for the good of Linux you should ask the maintainers to fix them instead of supplying for their deficiencies. YOU can recompile but your neighbour cannot and he will choose NT.
Again let's quantify this. Linux performs a number of optimizations for CPU type but most of them are performed at execution time and don't depend on compiling options. For one part we will quantify the influence due to alternative portions of code being compiled and we will also take a look at the influence of compilation options in the code generated by GCC.
Let's look now at the circumsatnces where selective TLB invalidation has a significant effect and let's quantify the slow down.
First of all if the kernel unmaps a page and then handles control to another process it will reload CR3 and that will cause a total TLB invalidation (different processes have entirely different mappings) so you get any benefit only if control is handled back to the same process either immediately or after some time in kernel mode. Also consider that time wasted due to entire TLB invalidation is some microseconds while disk IO takes 10 milliseconds in best case that is one thousand times more. That means in case there is disk IO following this unmapping (due to swap out) benefits would be unsignificant.
In fact about the only case where selective TLB will be meaningful would be in the following scenario: process frees memory so the kernel will invalidate TLB, it handles control to the same process and then the process scans a large array doing only a single access for every entry, then just when the TLB is fully reloaded, it unmaps memory again, new TLB invalidation, kernel gives back control again and then the process scans the same array entries. Highly theorical and don't forget that during the second pass page entries will be in cache so address translation will be much faster and this will reduce benefits got due to selective TLB invalidation.
Let's evaluate what happens in a normal process. We will arbitrarily assume this process runs for one tick (10 ms) after the unmapping. For everything else we will take the worst case. The slower the memory the more costly is translation so we will assume this computer uses 60 ms DRAM instead of SDRAM. The larger the TLB the bigger the benefits of selective invalidation so we will choose a CPU with a big TLB in our case it will be an AMD K6 model 7: it has a 64 entry TLB for code pages and a 128 entry TLB for data pages. We will also assume that we never find nor page table entries nor page directory entries in cache (the later is very irrealistic because a single directory entry is used every 4 Megs of address space) so every translation will need 2x60=120 ns so the complete refilling of the TLB needs 120 ns * 192 TMB entries = 23 microseconds. Because we assumed the process would be running for a whole tick that means the slowdown due to address translation is only 0.2 per cent.
If you are an expert and have a spare machine for experimenting then you could try recompilings using more agressive optimizations than the standard -O2 or using a better compiler than gcc 2.7 like egcs or pgcc. However be warned that all 2.0 kernels until 2.0.35 and possibly 2.0.36 have some bugs who will break the kernel with any other compiler than gcc 2.7 (they work due to gcc 2.7 bugs). Also be wary about some optimizations like loop-unrolling who according to egcs or pgcc doc were never thorougly tested be in gcc, egcs or pgcc and that egcs and pgcc are not as well tested as gcc (egcs 1.0 was notorious for its FP bugs). Given these warnings there is a 7% speed difference between the Byte benchmarks compiled with -O6 and loop-unrolling against plain -O2. So playing with compiler and compiler flags is an interesting possibility if you are an expert: it could help the kernel developpers to determine what are the more agresive optimizations who don't break the kernel. If you are not an expert then don't lose sleep about this. The problem is that only a small part of the time spent by your program will be spent executing those parts of kernel code affected
Kernel recompiling for your specific processor gives only a minimal CPU boost when the kernel version is 2.0 and the processor is a 1998 or earlier model of the i386 architecture. This could change in future versions of Linux or when using newer processors.
Kernel compiling has been seen as the panacea for Linux optimization. Unfortunately this doesn't resist serious analysis. It also has two serious drawbacks. First it is poor public relations for spreading Linux between normal people. Second this has sterilized investigation of more effective optimizations.