Storage Systems – NAS vs. SAN


There are several options for where to place your own or customer data that the server needs to do its job (files, emails, etc.). We will take a look at these options, explain what so-called storage is, and what the terms DAS, NAS and SAN mean.

File system cache

First, a digression to explain one very important feature of operating systems when dealing with data they read from and write to disk. For now, let's forget about storage and assume the data is stored directly on the server. The operating system reads data from the disk in blocks, not in individual bytes; one block can be 4 kB in size. It is also important that hard disks are very slow (reading or writing one block takes on the order of 10 ms), while the RAM in the computer is significantly faster (roughly a thousand times).

Now imagine that an application asks the operating system to read block no. 7353. The operating system sends a command to the disk, waits for the data to be read, and passes it to the application. And what happens if someone asks to read the same block again after a while? If the block has not been written to since then, isn't it wasteful for the operating system to send another disk command and wait another 10 ms for the result?
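
To make the block idea concrete, here is a minimal Python sketch of what "read block no. 7353" amounts to. The file name disk.img is hypothetical, and in reality this happens inside the kernel's block layer, not in application code:

```python
import os

BLOCK_SIZE = 4096  # 4 kB, matching the block size mentioned above

def read_block(path, block_no):
    """Read one block from a file or raw device by its block number."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, block_no * BLOCK_SIZE, os.SEEK_SET)  # jump to the block
        return os.read(fd, BLOCK_SIZE)                    # read exactly one block
    finally:
        os.close(fd)

# e.g. block no. 7353 from a disk image (hypothetical path)
data = read_block("disk.img", 7353)
```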

This is where the filesystem cache comes in. The operating system and applications do not typically consume all the available memory; there is always some left over. It is this unused remainder that the operating system uses to store copies of data read from the disk. So when a block is read for the first time, it really is read from the disk, but its contents are also immediately stored in memory. Whenever anyone asks to read the same block again, the result comes straight from memory and no read command is sent to the disk at all.
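
A minimal sketch of this read-through caching idea, with a Python dictionary standing in for the kernel's page cache (a real cache also serves writes and evicts blocks under memory pressure, which is not modeled here):

```python
import os

BLOCK_SIZE = 4096

class BlockCache:
    """Toy read-through cache: the first read hits the disk, repeats hit memory."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.cache = {}  # block number -> block contents

    def read_block(self, block_no):
        if block_no in self.cache:            # cache hit: no disk command at all
            return self.cache[block_no]
        os.lseek(self.fd, block_no * BLOCK_SIZE, os.SEEK_SET)
        data = os.read(self.fd, BLOCK_SIZE)   # cache miss: the slow ~10 ms path
        self.cache[block_no] = data           # keep a copy for next time
        return data
```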

It may not seem like much, but this “minor” trick brings a radical increase in disk performance. Notice that when you run an application (like Word) for the first time after starting your computer, it takes quite a while and the disk works hard. When you run the same application again a while later, it starts several times faster and the disk activity light may not flash at all. That is simply because the system cached all the files read from the disk during the first start.

Direct-attached Storage (DAS)

The most common arrangement is that all data is placed directly on the server, i.e. on the hard disks connected to it. This has advantages and disadvantages: the advantage is fast access, the disadvantage is that sharing this data with other servers is more complicated. This is called Direct-attached Storage (DAS).

Storage

However, if you want to operate on a larger scale and, for example, build server clusters (groups of servers that perform the same work on the same data, take over for each other when one fails, and distribute the load among themselves), then you need them to share the necessary data (for example, customer websites and emails). For this purpose there are specialized devices designed to store this data: storage, or disk arrays. A storage is typically a machine with a large number of disks that it manages and whose contents it makes available to others.

The data is then not placed on the individual servers (e.g. web servers) but lives on the storage, which separates data storage from data processing logic. Each server then only needs a small, slow disk holding the operating system and software, while alongside it sits the storage with large, very fast and more reliable disks holding the data. The server is connected to the storage via a computer network.

Network-attached storage (NAS)

The usual way to connect to storage over a network is a NAS. In this case, data is shared at the file level. A familiar example is mapping a network drive in Windows (via the SMB protocol), or the equivalent on Linux in the form of the NFS protocol. You “connect” a virtual disk from the storage over the network to your computer or server; it looks as if it were a directly attached disk and you work with it normally.

The crucial point is that it works at the file level. The client (that is, your computer or server) knows nothing about the filesystem where the files are actually stored. The storage takes care of the filesystem, and the client sends it commands like “give me the first X bytes of file Y” or “write data Z at position X in file Y”.
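
To illustrate, here is a toy sketch of such file-level requests in Python. The class, method names and paths are made up for illustration and bear no relation to the real SMB or NFS wire formats:

```python
import os

class ToyNasServer:
    """Serves file-level requests; only the storage side knows the filesystem."""

    def __init__(self, export_root):
        self.root = export_root  # directory the storage exports to clients

    def read(self, name, offset, length):
        # a "give me X bytes of file Y at position Z" style request
        with open(os.path.join(self.root, name), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def write(self, name, offset, data):
        # a "write data Z at position X in file Y" style request
        with open(os.path.join(self.root, name), "r+b") as f:
            f.seek(offset)
            f.write(data)

# the client never sees blocks or inodes, only file names and offsets
server = ToyNasServer("/srv/export")        # hypothetical export directory
header = server.read("report.doc", 0, 64)   # hypothetical file name
```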

These protocols are often stateless: the storage does not keep track of which client has which files open and what it is doing with them. It simply executes one-off requests. This applies, for example, to file locks as well.

And now the crucial insight, the reason this article exists: with a NAS, you cannot use the filesystem cache on the client, precisely because of this statelessness. The client has no way of knowing that another user has changed a file on the storage and that it should evict the stale version from its cache. So the cache simply does not work here. If you send a command to read the same thing 100 times, that command really is transmitted 100 times over the network, even if the data has not changed at all in the meantime. It is not quite that bad: the file cache on the storage side still works. But not on the client side, so we have to factor in the overhead of network communication on every access.
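
The following sketch shows why a naive client-side cache goes wrong when the storage keeps no state about its clients; all names here are hypothetical:

```python
class ToyStorage:
    """Stateless storage: answers reads and writes, remembers no clients."""
    def __init__(self):
        self.files = {"index.html": b"version 1"}
    def read(self, name):
        return self.files[name]
    def write(self, name, data):
        self.files[name] = data  # no notification goes out to anyone

class NaiveCachingClient:
    """A NAS client that (incorrectly) caches reads on its own side."""
    def __init__(self, storage):
        self.storage = storage
        self.cache = {}
    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.storage.read(name)
        return self.cache[name]  # may be stale: nobody tells us about writes

storage = ToyStorage()
a, b = NaiveCachingClient(storage), NaiveCachingClient(storage)
print(a.read("index.html"))                 # b"version 1", fetched and cached
storage.write("index.html", b"version 2")   # another client changes the file
print(a.read("index.html"))                 # still b"version 1" -- stale cache
```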

A NAS is suitable for casual sharing of documents and other files within a business, a home, and so on. But it is unsuitable for heavy server traffic (web data, emails).

Storage area network (SAN)

For servers and server clusters, the third option, a SAN, is the clear choice. Here data is shared at the block level, not at the file level. That is, the storage works with raw data and has no concept of the filesystem built on top of it. To the server, a virtual disk connected from the storage appears as an unformatted hard disk, behaving and being used as if it were attached directly, and the server builds its own filesystem on top of it. The communication between the server and the storage is therefore “read block X” and “write block X”. You can imagine it as simply inserting a network between the server and a hard disk that would normally be connected directly.
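
A minimal sketch of this block-level interface in Python; the device path is hypothetical, and in reality the kernel's block layer does this beneath the filesystem:

```python
import os

BLOCK_SIZE = 4096

class BlockDevice:
    """Block-level access: the same interface whether the 'disk' is a local
    drive or a LUN exported over iSCSI/FC. The server cannot tell the
    difference and builds its own filesystem on top."""

    def __init__(self, path):
        # path might be e.g. "/dev/sdb" for an attached LUN (hypothetical)
        self.fd = os.open(path, os.O_RDWR)

    def read_block(self, n):
        os.lseek(self.fd, n * BLOCK_SIZE, os.SEEK_SET)
        return os.read(self.fd, BLOCK_SIZE)    # "read block X"

    def write_block(self, n, data):
        assert len(data) == BLOCK_SIZE
        os.lseek(self.fd, n * BLOCK_SIZE, os.SEEK_SET)
        os.write(self.fd, data)                # "write block X"
```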

But what makes this so great? At first glance it looks the same. The crucial difference is that here the file cache on the server side does work. So if the server reads the same file 1000 times, it reads it once from the storage (sends a command over the network and waits for the response) and 999 times from its RAM (sending nothing over the network).

This is exactly the typical pattern on web servers: the vast majority of files are only ever read and change very rarely. So the vast majority of read requests are satisfied from the server's RAM and the storage never even sees them. Working with the disk is then almost as fast as with DAS. This is less true of email and database servers, where relatively small files are intensively created, read and deleted, and databases are constantly written to; the cache is not as effective there.

A server can have a SAN connected in several ways:

  • Fibre Channel (FC) – an optical interconnect technology specialized for network storage; very fast and powerful, but significantly more expensive and requiring special switches
  • iSCSI – the SCSI protocol over IP – a cheaper solution that needs no special interconnect hardware (it uses an existing IP network and common Ethernet switches, though it is better to dedicate a separate network to it so other IP traffic does not mix in), but it is not as powerful

Clustered file systems

But we still have one problem to solve. If we have multiple servers and one storage, each server can be given its own part of the space the storage provides (a virtual disk on the storage is called a LUN), and the servers then do not work with the same data. However, if we need a cluster of servers working with the same data, we run into the problem that multiple servers cannot simply be connected to the same LUN. Each server assumes the disk is its own, as if directly attached, and they would destroy the data almost immediately as they all started modifying the filesystem's data structures without knowing about each other.

This is where specialized cluster filesystems come in. A typical example is GFS2 from Red Hat, which we also use. It is a filesystem supplemented by a distributed lock manager: the servers exchange information about who is about to read or write which file.

In a nutshell: if a server wants to read a file, it must first obtain a read lock on it. Multiple servers can hold a read lock at the same time (readers do not collide), but while any of them do, no server can hold a write lock. To write, a server must hold a write lock, which only one server can hold at a time, and no server may then hold a read lock on the same file (the write lock is exclusive).
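
Here is a single-machine sketch of these shared/exclusive lock semantics, with threads standing in for cluster nodes; a distributed lock manager such as the one used by GFS2 enforces the same rules, with the coordination happening over the network:

```python
import threading

class FileLock:
    """Shared/exclusive lock: many readers OR one writer, never both.
    (Simplified: a waiting writer can be starved by a stream of readers.)"""

    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0
        self.writer = False

    def acquire_read(self):
        with self.cond:
            while self.writer:                  # wait until no write lock is held
                self.cond.wait()
            self.readers += 1                   # many nodes may read concurrently

    def release_read(self):
        with self.cond:
            self.readers -= 1
            if self.readers == 0:
                self.cond.notify_all()          # a writer may now proceed

    def acquire_write(self):
        with self.cond:
            while self.writer or self.readers:  # exclusive: no readers, no writer
                self.cond.wait()
            self.writer = True

    def release_write(self):
        with self.cond:
            self.writer = False
            self.cond.notify_all()
```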

So the servers exchange information about who holds which locks on which files, and if a server needs a lock that another server's lock prevents, it must ask that server to release it.

How does the file cache fit into this? As long as everyone is only reading a particular file, everyone can cache its contents. But as soon as one of the servers asks for a write lock on it, all the others must evict that file from their caches at that moment.
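
A tiny sketch of that eviction step; the callback name and file names are invented for illustration:

```python
class CachedClusterNode:
    """A node that evicts a file from its cache when it must give up its
    read lock so that another node can take the write lock."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # file name -> cached contents

    def on_lock_revoked(self, filename):
        # called by the lock manager before a write lock is granted elsewhere
        self.cache.pop(filename, None)  # drop the soon-to-be-stale copy
        print(f"{self.name}: evicted {filename} from cache")

nodes = [CachedClusterNode("web1"), CachedClusterNode("web2")]
for n in nodes:
    n.cache["index.html"] = b"cached contents"

# a third node asks for a write lock on index.html -> everyone else evicts
for n in nodes:
    n.on_lock_revoked("index.html")
```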