We will assume that you have decided that you do not wish to write a user-space device, and would rather implement your device in the kernel. You will probably be writing two files, a .c file and a .h file, and possibly modifying other files as well, as will be described below. We will refer to your files as foo.c and foo.h, and your driver will be the foo driver.
One of the first things you will need to do, before writing any code, is to name your device. This name should be a short (probably two or three character) string. For instance, the parallel device is the ``lp'' device, the floppies are the ``fd'' devices, and SCSI disks are the ``sd'' devices. As you write your driver, you will give your functions names prefixed with your chosen string to avoid any namespace confusion. We will call your prefix foo, and give your functions names like foo_read(), foo_write(), etc.
Memory allocation in the kernel is a little different from memory allocation in normal user-level programs. Instead of having a malloc() capable of delivering almost unlimited amounts of memory, there is a kmalloc() function that is a bit different:
To free memory allocated with kmalloc(), use one of two functions: kfree() or kfree_s(). These differ from free() in a few ways as well:
See Supporting Functions for more information on kmalloc(), kfree(), and other useful functions.
Be gentle when you use kmalloc(). Use only what you have to. Remember that kernel memory is unswappable, and thus allocating extra memory is a far worse thing to do in the kernel than in a user-level program. Take only what you need, and free it when you are done, unless you are going to use it again right away.
There are two main types of devices under all Unix systems, character and block devices. Character devices are those for which no buffering is performed, and block devices are those which are accessed through a cache. Block devices must be random access, but character devices are not required to be, though some are. Filesystems can only be mounted if they are on block devices.
Character devices are read from and written to with two functions: foo_read() and foo_write(). The read() and write() calls do not return until the operation is complete. By contrast, block devices do not even implement the read() and write() functions, and instead have a function which has historically been called the ``strategy routine.'' Reads and writes are done through the buffer cache mechanism by the generic functions bread(), breada(), and bwrite(). These functions go through the buffer cache, and so may or may not actually call the strategy routine, depending on whether or not the block requested is in the buffer cache (for reads) or on whether or not the buffer cache is full (for writes). A request may be asynchronous: breada() can request the strategy routine to schedule reads that have not been asked for, and to do it asynchronously, in the background, in the hopes that they will be needed later.
The sources for character devices are kept in drivers/char/, and the sources for block devices are kept in drivers/block/. They have similar interfaces, and are very much alike, except for reading and writing. Because of the difference in reading and writing, initialization is different, as block devices have to register a strategy routine, which is registered in a different way than the foo_read() and foo_write() routines of a character device driver. Specifics are dealt with in Character Device Initialization and Block Device Initialization.
Hardware is slow. That is, in the time it takes to get information from your average device, the CPU could be off doing something far more useful than waiting for a busy but slow device. So to keep from having to busy-wait all the time, interrupts are provided which can interrupt whatever is happening so that the operating system can do some task and return to what it was doing without losing information. In an ideal world, all devices would probably work by using interrupts. However, on a PC or clone, there are only a few interrupts available for use by your peripherals, so some drivers have to poll the hardware: ask the hardware if it is ready to transfer data yet. This unfortunately wastes time, but it sometimes needs to be done.
Some hardware (like memory-mapped displays) is as fast as the rest of the machine, and does not generate output asynchronously, so an interrupt-driven driver would be rather silly, even if interrupts were provided.
In Linux, many of the drivers are interrupt-driven, but some are not, and at least one can be either, and can be switched back and forth at runtime. For instance, the lp device (the parallel port driver) normally polls the printer to see if the printer is ready to accept output, and if the printer stays in a not ready phase for too long, the driver will sleep for a while, and try again later. This improves system performance. However, if you have a parallel card that supplies an interrupt, the driver will utilize that, which will usually make performance even better.
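The poll-then-back-off behavior described above can be sketched in ordinary user-space C. This is only a model, not the lp driver's code: struct foo_hw, foo_ready(), foo_poll_putc(), and the FOO_SPIN_LIMIT value are all hypothetical, and the point where a real driver would sleep is represented by a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: how many times to poll before giving up the CPU. */
#define FOO_SPIN_LIMIT 100

/* Simulated hardware, standing in for a real status port. */
struct foo_hw {
    int busy_polls;  /* polls remaining before the device reports ready */
    int naps;        /* how many times the driver backed off */
    char last;       /* last byte the device accepted */
};

/* Ask the (simulated) hardware whether it can accept a byte. */
static int foo_ready(struct foo_hw *hw)
{
    if (hw->busy_polls > 0) {
        hw->busy_polls--;
        return 0;
    }
    return 1;
}

/*
 * Write one byte by polling. After FOO_SPIN_LIMIT failed polls the
 * driver backs off (a real driver would sleep for a while here) and
 * then tries again. Returns 0 on success, or -1 once max_naps
 * back-offs have been used up.
 */
int foo_poll_putc(struct foo_hw *hw, char c, int max_naps)
{
    assert(hw != NULL);
    for (;;) {
        int spins;

        for (spins = 0; spins < FOO_SPIN_LIMIT; spins++) {
            if (foo_ready(hw)) {
                hw->last = c;
                return 0;
            }
        }
        if (hw->naps >= max_naps)
            return -1;
        hw->naps++;  /* stand-in for sleeping and retrying later */
    }
}
```

Bounding the busy loop is what keeps a slow device from monopolizing the CPU; the limit itself is a tuning choice.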
There are some important programming differences between interrupt-driven drivers and polling drivers. To understand this difference, you have to understand a little bit of how system calls work under Unix. The kernel is not a separate task under Unix. Rather, it is as if each process has a copy of the kernel. When a process executes a system call, it does not transfer control to another process, but rather, the process changes execution modes, and is said to be ``in kernel mode.'' In this mode, it executes kernel code which is trusted to be safe.
In kernel mode, the process can still access the user-space memory that it was previously executing in, which is done through a set of macros: get_fs_*() and memcpy_fromfs() read user-space memory, and put_fs_*() and memcpy_tofs() write to user-space memory. Because the process is still running, but in a different mode, there is no question of where in memory to put the data, or where to get it from. However, when an interrupt occurs, any process might currently be running, so these macros cannot be used--if they are, they will either write over random memory space of the running process or cause the kernel to panic.
Instead, when scheduling the interrupt, a driver must also provide temporary space in which to put the information, and then sleep. When the interrupt-driven part of the driver has filled up that temporary space, it wakes up the process, which copies the information from that temporary space into the process' user space and returns. In a block device driver, this temporary space is automatically provided by the buffer cache mechanism, but in a character device driver, the driver is responsible for allocating it itself.
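As a rough illustration of the temporary space described above, here is a user-space model of a character driver's two halves sharing a driver-owned ring buffer. The names foo_dev, foo_interrupt(), and FOO_BUF_SIZE are hypothetical, the plain assignment into user_buf stands in for a kernel copy such as memcpy_tofs(), and the sleeping and waking are left out.

```c
#include <assert.h>
#include <stddef.h>

#define FOO_BUF_SIZE 16  /* hypothetical size; holds FOO_BUF_SIZE - 1 bytes */

/* Temporary space owned by the driver, not by any process. */
struct foo_dev {
    char buf[FOO_BUF_SIZE];
    unsigned head;  /* next slot the interrupt half fills */
    unsigned tail;  /* next slot the process half drains */
};

/*
 * Interrupt half: it cannot know which process is running, so it
 * only touches the driver's own buffer. Returns 1 if the byte was
 * stored, 0 if the buffer was full and the byte was dropped.
 */
int foo_interrupt(struct foo_dev *dev, char byte)
{
    unsigned next = (dev->head + 1) % FOO_BUF_SIZE;

    if (next == dev->tail)
        return 0;
    dev->buf[dev->head] = byte;
    dev->head = next;
    return 1;
}

/*
 * Process half: runs on behalf of the caller, so it may copy into
 * the caller's buffer. Returns the number of bytes copied. A real
 * driver would sleep here when the buffer is empty.
 */
int foo_read(struct foo_dev *dev, char *user_buf, int count)
{
    int n = 0;

    assert(user_buf != NULL && count >= 0);
    while (n < count && dev->tail != dev->head) {
        user_buf[n++] = dev->buf[dev->tail];
        dev->tail = (dev->tail + 1) % FOO_BUF_SIZE;
    }
    return n;
}
```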
Perhaps the best way to try to understand the Linux sleep-wakeup mechanism is to read the source for the __sleep_on() function, used to implement both the sleep_on() and interruptible_sleep_on() calls.
    static inline void __sleep_on(struct wait_queue **p, int state)
    {
        unsigned long flags;
        struct wait_queue wait = { current, NULL };

        if (!p)
            return;
        if (current == task[0])
            panic("task[0] trying to sleep");
        current->state = state;
        add_wait_queue(p, &wait);
        save_flags(flags);
        sti();
        schedule();
        remove_wait_queue(p, &wait);
        restore_flags(flags);
    }
A wait_queue is a circular list of pointers to task structures, defined in <linux/wait.h> to be
    struct wait_queue {
        struct task_struct * task;
        struct wait_queue * next;
    };

state is either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether or not the sleep should be interruptible by such things as signals. In general, the sleep should be interruptible if the device is a slow one, that is, one which can block indefinitely, such as terminals and network devices or pseudodevices.
add_wait_queue() turns off interrupts, if they were enabled, and adds the new struct wait_queue declared at the beginning of the function to the list p. It then recovers the original interrupt state (enabled or disabled), and returns.
save_flags() is a macro which saves the process flags in its argument. This is done to preserve the previous state of the interrupt enable flag. This way, the restore_flags() later can restore the interrupt state, whether it was enabled or disabled. sti() then allows interrupts to occur, and schedule() finds a new process to run, and switches to it. Schedule will not choose this process to run again until the state is changed to TASK_RUNNING by wake_up() called on the same wait queue, p, or conceivably by something else.
The process then removes itself from the wait_queue, restores the original interrupt condition with restore_flags(), and returns.
Whenever contention for a resource might occur, there needs to be a pointer to a wait_queue associated with that resource. Then, whenever contention does occur, each process that finds itself locked out of access to the resource sleeps on that resource's wait_queue. When any process is finished using a resource for which there is a wait_queue, it should wake up any processes that might be sleeping on that wait_queue, probably by calling wake_up(), or possibly wake_up_interruptible().
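The bookkeeping behind this can be modeled in user space. The sketch below is not the kernel's implementation: every model_* name is hypothetical, the list is a simple NULL-terminated one rather than the circular list the kernel uses, and no scheduling takes place. It only shows sleepers being linked onto a resource's queue and marked runnable again by a wake-up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task states for the model. */
enum { MODEL_RUNNING, MODEL_INTERRUPTIBLE, MODEL_UNINTERRUPTIBLE };

struct model_task {
    int state;
};

/* One entry per sleeping task, linked from the resource's queue head. */
struct model_wait {
    struct model_task *task;
    struct model_wait *next;
};

/* Link a sleeper onto the front of the queue. */
void model_add_wait(struct model_wait **q, struct model_wait *w)
{
    w->next = *q;
    *q = w;
}

/* Unlink a sleeper; does nothing if it is not on the queue. */
void model_remove_wait(struct model_wait **q, struct model_wait *w)
{
    while (*q != NULL) {
        if (*q == w) {
            *q = w->next;
            w->next = NULL;
            return;
        }
        q = &(*q)->next;
    }
}

/*
 * Mark every sleeper on the queue runnable and return how many were
 * woken. Sleepers take themselves off the queue afterwards, as
 * __sleep_on() does after schedule() returns.
 */
int model_wake_up(struct model_wait **q)
{
    struct model_wait *w;
    int woken = 0;

    assert(q != NULL);
    for (w = *q; w != NULL; w = w->next) {
        if (w->task->state != MODEL_RUNNING) {
            w->task->state = MODEL_RUNNING;
            woken++;
        }
    }
    return woken;
}
```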
If you don't understand why a process might want to sleep, or want more details on when and how to structure this sleeping, I urge you to buy one of the operating systems textbooks listed in the Annotated Bibliography and look up mutual exclusion and deadlock.
If the sleep_on()/wake_up() mechanism in Linux does not satisfy your device driver needs, you can code your own versions of sleep_on() and wake_up() that fit your needs. For an example of this, look at the serial device driver (drivers/char/serial.c) in function block_til_ready(), where quite a bit has to be done between the add_wait_queue() and the schedule().
The Virtual Filesystem Switch, or VFS, is the mechanism which allows Linux to mount many different filesystems at the same time. In the first versions of Linux, all filesystem access went straight into routines which understood the minix filesystem. To make it possible for other filesystems to be written, filesystem calls had to pass through a layer of indirection which would switch the call to the routine for the correct filesystem. This was done with some generic code which handles generic cases and a structure of pointers to functions which handle specific cases. One structure is of interest to the device driver writer: the file_operations structure.
From /usr/include/linux/fs.h:
    struct file_operations {
        int (*lseek) (struct inode *, struct file *, off_t, int);
        int (*read) (struct inode *, struct file *, char *, int);
        int (*write) (struct inode *, struct file *, char *, int);
        int (*readdir) (struct inode *, struct file *, struct dirent *, int count);
        int (*select) (struct inode *, struct file *, int, select_table *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned int);
        int (*mmap) (struct inode *, struct file *, unsigned long, size_t, int, unsigned long);
        int (*open) (struct inode *, struct file *);
        void (*release) (struct inode *, struct file *);
    };

Essentially, this structure constitutes a partial list of the functions that you may have to write to create your driver.
This section details the actions and requirements of the functions in the file_operations structure. It documents all the arguments that these functions take. [It should also detail all the defaults, and cover more carefully the possible return values.]
This function is called when the system call lseek() is called on the device special file representing your device. An understanding of what the system call lseek() does should be sufficient to explain this function, which moves to the desired offset. It takes these four arguments:
If there is no lseek(), the kernel will take the default action, which is to modify the file->f_pos element. For an origin of 2, the default action is to return -EINVAL if file->f_inode is NULL, otherwise it sets file->f_pos to file->f_inode->i_size + offset. Because of this, if lseek() should return an error for your device, you must write an lseek() function which returns that error.
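The default action described above can be written out as a small user-space model. The model_file and model_inode types are hypothetical, cut-down stand-ins for the kernel's structures, and the rejection of unknown origins and negative positions is an assumption of this sketch rather than something stated here. foo_lseek() shows a driver overriding the default to report an error; the choice of -ESPIPE is only an example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the kernel structures. */
struct model_inode {
    long i_size;
};

struct model_file {
    long f_pos;
    struct model_inode *f_inode;
};

/* Model of the default action taken when a driver has no lseek(). */
long model_default_lseek(struct model_file *file, long offset, int origin)
{
    long pos;

    assert(file != NULL);
    switch (origin) {
    case 0:                      /* from the start of the file */
        pos = offset;
        break;
    case 1:                      /* from the current position */
        pos = file->f_pos + offset;
        break;
    case 2:                      /* from the end: needs the inode's size */
        if (file->f_inode == NULL)
            return -EINVAL;
        pos = file->f_inode->i_size + offset;
        break;
    default:                     /* assumption: unknown origin is an error */
        return -EINVAL;
    }
    if (pos < 0)                 /* assumption: negative positions rejected */
        return -EINVAL;
    file->f_pos = pos;
    return pos;
}

/* A device on which seeking makes no sense must say so itself. */
long foo_lseek(struct model_file *file, long offset, int origin)
{
    (void)file;
    (void)offset;
    (void)origin;
    return -ESPIPE;
}
```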
The read and write functions read and write a character string to the device. If there is no read() or write() function in the file_operations structure registered with the kernel, and the device is a character device, read() or write() system calls, respectively, will return -EINVAL. If the device is a block device, these functions should not be implemented, as the VFS will route requests through the buffer cache, which will call your strategy routine. The read and write functions take these arguments:
This function is another artifact of file_operations being used for implementing filesystems as well as device drivers. Do not implement it. The kernel will return -ENOTDIR if the system call readdir() is called on your device special file.
The select() function is generally most useful with character devices. It is usually used to multiplex reads without polling--the application calls the select() system call, giving it a list of file descriptors to watch, and the kernel reports back to the program on which file descriptor has woken it up. It is also used as a timer. However, the select() function in your device driver is not directly called by the system call select(), and so the file_operations select() only needs to do a few things. Its arguments are:
SEL_IN | read |
SEL_OUT | write |
SEL_EX | exception |
If the calling program wants to wait until one of the devices upon which it is selecting becomes available for the operation it is interested in, the process will have to be put to sleep until one of those operations becomes available. This does not require use of a sleep_on*() function, however. Instead the select_wait() function is used. (See Supporting Functions for the definition of the select_wait() function). The sleep state that select_wait() will cause is the same as that of interruptible_sleep_on(), and, in fact, wake_up_interruptible() is used to wake up the process.
However, select_wait() will not make the process go to sleep right away. It returns directly, and the select() function you wrote should then return. The process isn't put to sleep until the system call sys_select(), which originally called your select() function, uses the information given to it by the select_wait() function to put the process to sleep. select_wait() adds the process to the wait queue, but do_select() (called from sys_select()) actually puts the process to sleep by changing the process state to TASK_INTERRUPTIBLE and calling schedule().
The first argument to select_wait() is the same wait_queue that should be used for a sleep_on(), and the second is the select_table that was passed to your select() function.
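A driver's select() function therefore tends to have a simple shape: report readiness if the operation can proceed now, otherwise register the wait queue and return. The user-space sketch below models that shape only. Everything prefixed model_ is a hypothetical stand-in (model_select_wait() merely counts registrations, and the MODEL_SEL_* values are arbitrary), and foo_dev with its two fields is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Arbitrary stand-ins for SEL_IN, SEL_OUT and SEL_EX. */
enum { MODEL_SEL_IN = 1, MODEL_SEL_OUT = 2, MODEL_SEL_EX = 4 };

/* Stand-ins for select_table and a wait queue entry. */
struct model_select_table {
    int nr_waits;  /* how many queues the caller has been registered on */
};

struct model_wait_queue {
    int unused;
};

/* Model of select_wait(): records the registration and returns at once. */
static void model_select_wait(struct model_wait_queue **q,
                              struct model_select_table *table)
{
    (void)q;
    if (table != NULL)
        table->nr_waits++;
}

/* Hypothetical device state. */
struct foo_dev {
    int bytes_in;   /* bytes waiting to be read */
    int room_out;   /* free space for writes */
    struct model_wait_queue *read_wait;
    struct model_wait_queue *write_wait;
};

/*
 * Returns 1 if the requested operation would not block. Otherwise
 * registers the matching wait queue and returns 0; it never sleeps.
 */
int foo_select(struct foo_dev *dev, int sel_type,
               struct model_select_table *wait)
{
    assert(dev != NULL);
    switch (sel_type) {
    case MODEL_SEL_IN:
        if (dev->bytes_in > 0)
            return 1;
        model_select_wait(&dev->read_wait, wait);
        return 0;
    case MODEL_SEL_OUT:
        if (dev->room_out > 0)
            return 1;
        model_select_wait(&dev->write_wait, wait);
        return 0;
    default:
        return 0;  /* this model has no exceptional conditions */
    }
}
```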
After having explained all this in excruciating detail, here are two rules to follow:
If you provide a select() function, do not provide timeouts by setting current->timeout, as the select() mechanism uses current->timeout, and the two methods cannot co-exist, as there is only one timeout for each process. Instead, consider using a timer to provide timeouts. See the description of the add_timer() function in Supporting Functions for details.
The ioctl() function processes ioctl calls. The structure of your ioctl() function will be: first error checking, then one giant (possibly nested) switch statement to handle all possible ioctls. The ioctl number is passed as cmd, and the argument to the ioctl is passed as arg. It is good to have an understanding of how ioctls ought to work before making them up. If you are not sure about your ioctls, do not feel ashamed to ask someone knowledgeable about it, for a few reasons: you may not even need an ioctl for your purpose, and if you do need an ioctl, there may be a better way to do it than what you have thought of. Since ioctls are the least regular part of the device interface, it takes perhaps the most work to get this part right. Take the time and energy you need to get it right.
The first thing you need to do is look in Documentation/ioctl-number.txt, read it, and pick an unused number. Then go from there.
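The "error checking first, then a switch" structure might look like the following user-space sketch. The command numbers, FOO_MAX_RATE, and struct foo_state are all made up for illustration, and the signature is simplified: a real ioctl() receives the inode and file pointers shown in file_operations. Returning -EINVAL for an unrecognized command is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ioctl numbers; a real driver picks unused ones. */
#define FOO_IOC_RESET   0x4601
#define FOO_IOC_SETRATE 0x4602
#define FOO_IOC_GETRATE 0x4603

#define FOO_MAX_RATE 9600  /* hypothetical upper limit */

/* Hypothetical per-device state. */
struct foo_state {
    int present;        /* nonzero if the hardware was detected */
    unsigned int rate;
};

int foo_ioctl(struct foo_state *dev, unsigned int cmd, unsigned int arg)
{
    assert(dev != NULL);

    /* First, error checking that applies to every command. */
    if (!dev->present)
        return -ENODEV;

    /* Then one switch covering every command the driver understands. */
    switch (cmd) {
    case FOO_IOC_RESET:
        dev->rate = 0;
        return 0;
    case FOO_IOC_SETRATE:
        if (arg == 0 || arg > FOO_MAX_RATE)
            return -EINVAL;
        dev->rate = arg;
        return 0;
    case FOO_IOC_GETRATE:
        return (int)dev->rate;
    default:
        return -EINVAL;  /* assumption: unknown commands are rejected */
    }
}
```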
PROT_READ | region can be read. |
PROT_WRITE | region can be written. |
PROT_EXEC | region can be executed. |
PROT_NONE | region cannot be accessed. |
open() is called when a device special file is opened. It is the policy mechanism responsible for ensuring consistency. If only one process is allowed to open the device at once, open() should lock the device, using whatever locking mechanism is appropriate, usually setting a bit in some state variable to mark it as busy. If a process already is using the device (if the busy bit is already set) then open() should return -EBUSY. If more than one process may open the device, this function is responsible for setting up any necessary queues that would not be set up in write(). If no such device exists, open() should return -ENODEV to indicate this. Return 0 on success.
release() is called only when the process closes its last open file descriptor on the file. [I am not sure this is true; it might be called on every close.] If devices have been marked as busy, release() should unset the busy bits if appropriate. If you need to clean up kmalloc()'ed queues or reset devices to preserve their sanity, this is the place to do it. If no release() function is defined, none is called.
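For a device that only one process may open at a time, the busy-bit bookkeeping shared by open() and release() can be sketched as follows. This is a user-space model with simplified signatures: the minor number is passed directly instead of being taken from the inode, and foo_flags, FOO_NR_UNITS, FOO_EXISTS, and FOO_BUSY are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

#define FOO_NR_UNITS 2     /* hypothetical number of minor devices */
#define FOO_EXISTS   0x01  /* hardware for this unit was detected */
#define FOO_BUSY     0x02  /* some process has this unit open */

/* In this model only unit 0 exists. */
static unsigned char foo_flags[FOO_NR_UNITS] = { FOO_EXISTS, 0 };

int foo_open(int minor)
{
    if (minor < 0 || minor >= FOO_NR_UNITS ||
        !(foo_flags[minor] & FOO_EXISTS))
        return -ENODEV;           /* no such device */
    if (foo_flags[minor] & FOO_BUSY)
        return -EBUSY;            /* already opened by someone */
    foo_flags[minor] |= FOO_BUSY;
    return 0;
}

void foo_release(int minor)
{
    assert(minor >= 0 && minor < FOO_NR_UNITS);
    foo_flags[minor] &= (unsigned char)~FOO_BUSY;
}
```

In the kernel, the test and the setting of the busy bit would have to be safe against another process entering open() in between; the model ignores that.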
This function is not actually included in the file_operations structure, but you are required to implement it, because it is this function that registers the file_operations structure with the VFS in the first place--without this function, the VFS could not route any requests to the driver. This function is called when the kernel first boots and is configuring itself. The init function then detects all devices. You will have to call your init() function from the correct place: for a character device, this is chr_dev_init() in drivers/char/mem.c.
While the init() function runs, it registers your driver by calling the proper registration function. For character devices, this is register_chrdev(). (See Supporting Functions for more information on the registration functions.) register_chrdev() takes three arguments: the major device number (an int), the ``name'' of the device (a string), and the address of the device_fops file_operations structure.
When this is done, and a character or block special file is accessed, the VFS filesystem switch automagically routes the call, whatever it is, to the proper function, if a function exists. If the function does not exist, the VFS routines take some default action.
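Registration and routing can be pictured as a table of file_operations pointers indexed by major number. The sketch below is a user-space model, not kernel code: the model_* names, MODEL_MAX_MAJOR, FOO_MAJOR, and the two-entry model_fops structure are hypothetical, and the error values for a bad or already-claimed major and for an unregistered major are assumptions. The -EINVAL returned for a missing read() or write() follows the default described earlier for character devices.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_MAJOR 32  /* hypothetical table size */
#define FOO_MAJOR       30  /* hypothetical major number */

/* Reduced stand-in for struct file_operations. */
struct model_fops {
    int (*read)(char *buf, int count);
    int (*write)(const char *buf, int count);
};

struct model_chrdev {
    const char *name;
    struct model_fops *fops;
};

static struct model_chrdev model_chrdevs[MODEL_MAX_MAJOR];

/* Model of register_chrdev(): claim a slot in the table. */
int model_register_chrdev(unsigned int major, const char *name,
                          struct model_fops *fops)
{
    if (major >= MODEL_MAX_MAJOR)
        return -EINVAL;
    if (model_chrdevs[major].fops != NULL &&
        model_chrdevs[major].fops != fops)
        return -EBUSY;
    model_chrdevs[major].name = name;
    model_chrdevs[major].fops = fops;
    return 0;
}

/* Route a read to the registered driver, or take the default action. */
int model_vfs_read(unsigned int major, char *buf, int count)
{
    if (major >= MODEL_MAX_MAJOR || model_chrdevs[major].fops == NULL)
        return -ENODEV;
    if (model_chrdevs[major].fops->read == NULL)
        return -EINVAL;
    return model_chrdevs[major].fops->read(buf, count);
}

/* Route a write the same way. */
int model_vfs_write(unsigned int major, const char *buf, int count)
{
    if (major >= MODEL_MAX_MAJOR || model_chrdevs[major].fops == NULL)
        return -ENODEV;
    if (model_chrdevs[major].fops->write == NULL)
        return -EINVAL;
    return model_chrdevs[major].fops->write(buf, count);
}

/* A driver that implements read() only. */
static int foo_read(char *buf, int count)
{
    int i;

    assert(buf != NULL);
    for (i = 0; i < count; i++)
        buf[i] = 'f';
    return count;
}

static struct model_fops foo_fops = { foo_read, NULL };

/* The driver's init function: all it must do is register itself. */
int foo_init(void)
{
    return model_register_chrdev(FOO_MAJOR, "foo", &foo_fops);
}
```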
The init() function usually displays some information about the driver, and usually reports all hardware found. All reporting is done via the printk() function.
Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, [email protected].