197 lines
10 KiB
HTML
197 lines
10 KiB
HTML
<h2>How mount actually works</h2>
|
|
|
|
<p>The <a href=https://landley.net/toybox/help.html#mount>mount comand</a>
|
|
calls the <a href=https://man7.org/linux/man-pages/man2/mount.2.html>mount
|
|
system call</a>, which has five arguments:</p>
|
|
|
|
<blockquote><b>
|
|
int mount(const char *source, const char *target, const char *filesystemtype,
|
|
unsigned long mountflags, const void *data);
|
|
</b></blockquote>
|
|
|
|
<p>The command "<b>mount -t ext2 /dev/sda1 /path/to/mntpoint -o ro,noatime</b>"
|
|
parses its command line arguments to feed them into those five system call
|
|
arguments. In this example, the <b>source</b> is "/dev/sda1", the <b>target</b>
|
|
is "/path/to/mountpoint", and the <b>filesystemtype</b> is "ext2".
|
|
|
|
<p>The other two syscall arguments (<b>mountflags</b> and </b>data</b>)
|
|
come from the "-o option,option,option" argument. The mountflags argument goes
|
|
to the VFS (explained below), and the data argument is passed to the filesystem
|
|
driver.</p>
|
|
|
|
<p>The mount command's options string is a list of comma separated values. If
|
|
there's more than one -o argument on the mount command line, they get glued
|
|
together (in order) with a comma. The mount command also checks the file
|
|
<b>/etc/fstab</b> for default options, and the options you specify on the command
|
|
line get appended to those defaults (if any). Most other command line mount
|
|
flags are just synonyms for adding option flags (for example
|
|
"mount -o remount -w" is equivalent to "mount -o remount,rw"). Behind the
|
|
scenes they all get appended to the -o string and fed to a common parser.</p>
|
|
|
|
<p>VFS stands for "Virtual File System" and is the common infrastructure shared
|
|
by different filesystems. It handles common things like making the filesystem
|
|
read only. The mount command assembles an option string to supply to the "data"
|
|
argument of the option syscall, but first it parses it for VFS options
|
|
(ro,noexec,nodev,nosuid,noatime...) each of which corresponds to a flag
|
|
from <b>#include <sys/mount.h></b>. The mount command removes those options
|
|
from the string and sets the corresponding bit in mountflags, then the
|
|
remaining options (if any) form the data argument for the filesystem driver.</p>
|
|
|
|
<blockquote>
|
|
<p>Implementation details: the mountflag MS_SILENCE gets set by
|
|
default even if there's nothing in /etc/fstab. Some actions (such as --bind
|
|
and --move mounts, I.E. -o bind and -o move) are just VFS actions and don't
|
|
require any specific filesystem at all. The "-o remount" flag requires looking
|
|
up the filesystem in /proc/mounts and reassembling the full option string
|
|
because you don't _just_ pass in the changed flags but have to reassemble
|
|
the complete new filesystem state to give the system call. Some of the options
|
|
in /etc/fstab are for the mount command (such as "user" which only does
|
|
anything if the mount command has the suid bit set) and don't get passed
|
|
through to the system call.</p>
|
|
</blockquote>
|
|
|
|
<p>When mounting a new filesystem, the "<b>filesystem</b>" argument to the mount system
|
|
call specifies which filesystem driver to use. All the loaded drivers are
|
|
listed in /proc/filesystems, but calling mount can also trigger a module load
|
|
request to add another. A filesystem driver is responsible for putting files
|
|
and subdirectories under the mount point: any time you open, close, read,
|
|
write, truncate, list the contents of a directory, move, or delete a file,
|
|
you're talking to a filesystem driver to do it. (Or when you call
|
|
ioctl(), stat(), statvfs(), utime()...)</p>
|
|
|
|
<h2>Four filesystem types (block backed, server backed, ramfs, synthetic).</h2>
|
|
|
|
<p>Different drivers implement different filesystems, which come in four
|
|
different types: the filesystem's backing store can be a fixed length
|
|
block of storage, the backing store can be some server the driver connects to,
|
|
the files can remain in memory with no backing store,
|
|
or the filesystem driver can algorithmically create the filesystem's contents
|
|
on the fly.</p>
|
|
|
|
<ol>
|
|
<li><h3>Block device backed filesystems, such as ext2 and vfat.</h3>
|
|
|
|
<p>This kind of filesystem driver acts as a lens to look at a block device
|
|
through. The source argument for block backed filesystems is a path to a
|
|
block device (such as "/dev/hda1") which stores the contents of the
|
|
filesystem in a fixed length block of sequential storage, with a seperate
|
|
driver providing that block device.</p>
|
|
|
|
<p>Block backed filesystems are the "conventional" filesystem type most people
|
|
think of when they mount things. The name means that the "backing store"
|
|
(where the data lives when the system is switched off) is on a block device.</p>
|
|
</li>
|
|
|
|
<li><h3>Server backed filesystems, such as cifs/samba or fuse.</h3>
|
|
|
|
<p>These drivers convert filesystem operations into a sequential stream of
|
|
bytes, which it can send through a pipe to talk to a program. The filesystem
|
|
server could be a local Filesystem in Userspace daemon (connected to a local
|
|
process through a pipe filehandle), behind a network socket (CIFS and v9fs),
|
|
behind a char device (/dev/ttyS0), and so on. The common attribute is there's
|
|
some program on the other end sending and receiving a sequential bytestream.
|
|
The backing store is a server somewhere, and the filesystem driver is talking
|
|
to a process that reads and writes data in some known protocol.</p>
|
|
|
|
<p>The source argument for these filesystems indicates where the filesystem
|
|
lives. It's often in a URL-like format for network filesystems, but it's
|
|
really just a blob of data that the filesystem driver understands.</p>
|
|
|
|
<p>A lot of server backed filesystems want to open their own connection so they
|
|
don't have to pass their data through a persistent local userspace process,
|
|
not really for performance reasons but because in low memory situations a
|
|
chicken-and-egg situation can develop where all the process's pages have
|
|
been swapped out but the filesystem needs to write data to its backing
|
|
store in order to free up memory so it can swap the process's pages back in.
|
|
If this mechanism is providing the root filesystem, this can deadlock and
|
|
freeze the system solid. So while you _can_ pass some of them a filehandle,
|
|
more often than not you don't.</p>
|
|
|
|
<p>These are also known as "pipe backed" filesystems (or "network filesystems"
|
|
because that's a common case, although a network doesn't need to be inolved).
|
|
Conceptually they're char device backed filesystems analogous to the block
|
|
backed filesystems (block devices provide seekable storage, char devices
|
|
provide serial I/O), but you don't commonly specify a character device in
|
|
/dev when mounting them because you're talking to a specific server process,
|
|
not a whole machine.</p>
|
|
</li>
|
|
|
|
<li><h3>Ram backed filesystems (ramfs and tmpfs).</h3>
|
|
|
|
<p>These are very simple filesystems that don't implement a backing store,
|
|
but just keep the data in memory. Data
|
|
written to these gets stored in the disk cache, and the driver ignores requests
|
|
to flush it to backing store (reporting all the pages as pinned and
|
|
unfreeable).</p>
|
|
|
|
<p>These filesystem drivers essentially mount the VFS's page/dentry cache as if it was a
|
|
filesystem. (Page cache stores file contents, dentry cache stores directory
|
|
entries.) They grow and shrink dynamically as needed: when you write files
|
|
into them they allocate more memory to store it, and when you delete files
|
|
the memory is freed.</p>
|
|
|
|
<p>The "ramfs" driver provides the simplest possible ram filesystem,
|
|
which is too simple for most real use cases. The "tmpfs" driver adds
|
|
a size limitation (by default 50% of system RAM, but it's adjustable as a mount
|
|
option) so the system doesn't run out of memory and lock up if you
|
|
"cat /dev/zero > file", can report how much space is remaining
|
|
when asked (ramfs always says 0 bytes free), and can write its data
|
|
out to swap space (like processes do) when the system is under memory pressure.</p>
|
|
|
|
<blockquote>
|
|
<p>Note that "ramdisk" is not the same as "ramfs". The ramdisk driver uses a
|
|
chunk of memory to implement a block device, and then you can format that
|
|
block device and mount it with a block device backed filesystem driver.
|
|
(This is the same "two device drivers" approach you always have with block
|
|
backed filesystems: one driver provides /dev/ram0 and the second driver mounts
|
|
it as vfat.) Ram disks are significantly less efficient than ramfs,
|
|
allocating a fixed amount of memory up front for the block device instead of
|
|
dynamically resizing itself as files are written into an deleted from the
|
|
page and dentry caches the way ramfs does.</p>
|
|
</blockquote>
|
|
|
|
<p>Initramfs (I.E. rootfs) is a ram backed filesystem mounted on / which
|
|
can't be unmounted for the same reason PID 1 can't exit. The boot
|
|
process can extract a cpio.gz archive into it (either statically linked
|
|
into the kernel or loaded as a second file by the bootloader), and
|
|
if it contains an executable "init" binary at the top level that
|
|
will be run as PID 1. If you specify "root=" on the kernel command line,
|
|
initramfs will be ramfs and will get overmounted with the specified
|
|
filesystem if no "/init" binary can be run out of the initramfs.
|
|
If you don't specify root= then initramfs will be tmpfs, which is probably
|
|
what you want when the system is running from initramfs.</p>
|
|
</li>
|
|
|
|
<li><h3>Synthetic filesystems (proc, sysfs, devtmpfs, devpts...)</h3>
|
|
|
|
<p>These filesystems don't have any backing store because they don't
|
|
store arbitrary data the way the first three types of filesystems do.</p>
|
|
|
|
<p>Instead they present artificial contents, which can represent processes or
|
|
hardware or anything the driver writer wants them to show. Listing or reading
|
|
from these files calls a driver function that produces whatever output it's
|
|
programmed to, and writing to these files submits data to the driver which
|
|
can do anything it wants with it.</p>
|
|
|
|
<p>Synthetic filesystems are often implemented to provide monitoring and control
|
|
knobs for parts of the operating system, as an alternative to adding more
|
|
system calls (or ioctl, sysctl, etc). They provide a more human friendly user
|
|
interface which programs can use but which users can also interact with
|
|
directly from the command line via "cat" and redirecting the output of
|
|
"echo" into special files.</p>
|
|
|
|
<blockquote>
|
|
<p>The first synthetic filesystem in Linux was "proc", which was initially
|
|
intended to provide a directory for each process in the system to provide
|
|
information to tools like "ps" and "top" (the /proc/[0-9]* entries)
|
|
but became a dumping ground for any information the kernel wanted to export.
|
|
Eventually the kernel developers <a href=https://lwn.net/Articles/57369/>genericized</a>
|
|
the synthetic filesystem infrastructure so the system could have multiple
|
|
different synthetic filesystems, but /proc remains full
|
|
unrelated historic legacy exports kept for backwards compatibility.</p>
|
|
</blockquote>
|
|
</li>
|
|
</ol>
|
|
|
|
<p>TODO: explain overmounts, mount --move, mount namespaces.</p>
|