Distributed File Systems, Simplified!

Rajesh Dangi / July 22nd, 2020

Beyond the Cloud

File systems are responsible for organizing, storing, updating and retrieving data. Each file system takes its own approach, by design, to data availability, read and write speed, flexibility, security, and the size or volume of data it can handle. In a conventional file system, all users and all storage of the data are located on a single system.

A distributed file system (DFS), however, is the classical file system model spread across different logical or physical systems in a cluster (read: systems connected over a network, local or remote). It stores and shares dispersed chunks of files across multiple systems, also called nodes or chunk servers. The DFS provides the mechanism by which the file systems of these physically dispersed systems are integrated and presented as a single logical system for file management (typically of unstructured data). Clients and application workloads thus access their files as if they were on a local system, even though in reality the files are distributed across a cluster of nodes sharing the same network. This location transparency frees clients from managing details such as the current location and attributes of a file, and it enables mobility by allowing access to files from any location. A DFS thus provides automatic storage replication, a common namespace and an atomic transaction mechanism, giving high reliability and improved availability in the distributed environment.

Any file system consists of two or three layers of functionality. The top interface layer, called the logical file system, interacts with user applications. It maintains file table entries and per-process file descriptors for accessing the contents of a file, and it maintains the directory (index) and the integrity of the file as a logical entity. The second layer is the virtual file system. It is at times optional, depending on the type of file system implementation (single or distributed), and it manages support for multiple concurrent file system instances when the file system is distributed across the cluster. The third layer is the physical file system, where the file system interacts with the actual storage layer for the physical operation of storage devices: reading and writing blocks on the storage medium, and handling buffers, cache and memory management in conjunction with the respective device drivers, controllers and channels that reach the storage devices. In a distributed file system all these layers exist, but the physical layers may be localized to disparate systems or nodes that handle the actual physical operation of the storage medium for CRUD operations (read: Create, Read, Update, Delete).

In the era of faster processing with GPUs, FPGAs and even CPU accelerators, compute elements are increasingly designed alongside or on the storage devices to handle in-line compute requirements. Data analytics and AI/ML-driven computer vision use cases typically rest on distributed or parallel file system deployments. This approach facilitates faster computation, application-specific programming of hardware, and offloading of computation from the server to the storage nodes or devices where the data resides.

A parallel file system is a type of distributed file system. The term typically means that the file system was built with the understanding that multiple processes can write to files at the same time, so it uses a block allocation strategy that lets concurrent writers place their data contiguously in different regions of the storage. Other distributed file systems focus on communicating with multiple servers to distribute the fragments of a file across the cluster, reducing the load on any single disk.

Broadly, a distributed file system provides many advantages, such as:

  • Capacity to handle more storage than can fit on a single system, including very large files (in the TB range), with faster retrieval and network efficiency.
  • Seamless scalability from a single system to a multi-node system, transparent to clients with respect to the access policies and data replication involved.
  • A unified and consistent view of shared folders and resources. Clients need not know where files are stored; they access files through the namespace as if they were on local drives. All modifications to a file are coherently visible to all connected clients, and files can move from one system to another while the file system is running.
  • Data access transparency and location independence, with fault tolerance and automated data recovery through erasure coding or data replicas, depending on the implementation strategy.
  • Heterogeneity by design, allowing deployment across different hardware and operating system platforms.

DFS Deployment tenets

Distributed file systems provide persistent storage of unstructured data, organized in a hierarchical namespace of files that is shared among networked nodes. It is therefore very important to look at the design considerations in light of the likely use cases and application integration requirements. The deployment tenets define performance and usability from the application's point of view, since a distributed file system can offer different levels of integration. Most DFS implementations provide a library with an interface that closely resembles a POSIX file system, yet might not fully comply with the POSIX standard.

Portable Operating System Interface (POSIX) is a family of standards specified by the IEEE Computer Society for maintaining compatibility between operating systems. It defines the application programming interface (API), along with command line shells and utility interfaces, for software compatibility with variants of Unix and other operating systems. It therefore governs how application workloads interface and integrate with file systems.

Fundamentally, each distributed file system needs to be tested with real applications. A few functionalities are often poorly supported in distributed file systems: file locking, atomic renaming of files and directories, multiple hard links, and deletion of open files. At times, deviations from the POSIX standard are subtle. HDFS, for instance, writes file sizes asynchronously and thus returns the real size of a file only some time after the file has been written. Some distributed file systems are implemented as an extension of the operating system kernel (read: NFS, AFS, Lustre). These are designed to offer better performance than interposition systems, which redirect an application's file system calls to a library or a user-space process, but their deployment is complex and implementation errors might crash the operating system kernel.
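
A quick way to act on this advice is a small probe that exercises some of the behaviours listed above on the mount point in question. The sketch below is only an illustration: the function name, the file names and the result keys are made up for this example, file locking is left out to keep it portable, and the rename check looks at the end state only, not at atomicity under concurrent access.

```python
import os
import tempfile


def probe_posix_features(directory):
    """Try a few POSIX behaviours inside `directory`; return {feature: bool}."""
    results = {}

    def make(name, text):
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            f.write(text)
        return path

    # Multiple hard links: a second directory entry for the same file.
    src = make("link_src.txt", "hello")
    try:
        os.link(src, os.path.join(directory, "link_dst.txt"))
        results["hard_link"] = os.stat(src).st_nlink == 2
    except OSError:
        results["hard_link"] = False

    # Rename over an existing target (end state only, not atomicity).
    new = make("rename_new.txt", "new")
    old = make("rename_old.txt", "old")
    try:
        os.replace(new, old)
        with open(old) as f:
            results["rename_over_existing"] = f.read() == "new"
    except OSError:
        results["rename_over_existing"] = False

    # Deleting an open file: the open handle should remain readable.
    victim = make("delete_open.txt", "still here")
    try:
        with open(victim) as f:
            os.remove(victim)
            results["delete_open_file"] = f.read() == "still here"
    except OSError:
        results["delete_open_file"] = False

    # File size visible immediately after the file is closed.
    sized = make("size.txt", "x" * 1024)
    results["size_visible_after_close"] = os.path.getsize(sized) == 1024

    return results


if __name__ == "__main__":
    # Point this at a directory on the DFS mount; a temp dir is used here.
    with tempfile.TemporaryDirectory() as d:
        for feature, ok in probe_posix_features(d).items():
            print(f"{feature}: {'ok' if ok else 'NOT supported'}")
```

A probe like this does not replace testing with the real application, but it can flag the obvious gaps early.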

In object-based file systems, data management and meta-data management are separated (e.g. GFS). Files are spread over a number of servers that handle read and write operations. A meta-data server maintains the directory tree and takes care of data placement. As long as the meta-data load is much smaller than the data load (i.e. the files are large), this architecture allows for incremental scaling. As the load increases, data servers (nodes) can be added one by one with minimal administrative overhead and little manual reconfiguration of the cluster.
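
The split described above can be pictured with a toy model. This is a minimal in-memory sketch, not how GFS or any real system is implemented: the class names, the round-robin placement and the tiny chunk size are all assumptions chosen for readability.

```python
class DataServer:
    """Holds file chunks keyed by (path, chunk index); stands in for a chunk server."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}


class MetadataServer:
    """Keeps the namespace and decides placement; never touches file data."""

    def __init__(self, data_servers, chunk_size=4):
        self.data_servers = data_servers
        self.chunk_size = chunk_size
        self.layout = {}  # path -> list of server names, one per chunk

    def create(self, path, size):
        n_chunks = max(1, -(-size // self.chunk_size))  # ceiling division
        # Assumed policy: simple round-robin over the known data servers.
        names = [self.data_servers[i % len(self.data_servers)].name
                 for i in range(n_chunks)]
        self.layout[path] = names
        return names

    def lookup(self, path):
        return self.layout[path]


def write_file(mds, servers_by_name, path, data):
    """Ask the meta-data server for placement, then send chunks to data servers."""
    for i, name in enumerate(mds.create(path, len(data))):
        chunk = data[i * mds.chunk_size:(i + 1) * mds.chunk_size]
        servers_by_name[name].chunks[(path, i)] = chunk


def read_file(mds, servers_by_name, path):
    """Look up the layout once, then fetch chunks directly from data servers."""
    return b"".join(servers_by_name[name].chunks[(path, i)]
                    for i, name in enumerate(mds.lookup(path)))


servers = [DataServer("a"), DataServer("b"), DataServer("c")]
by_name = {s.name: s for s in servers}
mds = MetadataServer(servers)
write_file(mds, by_name, "/logs/app.log", b"hello world!")
print(mds.lookup("/logs/app.log"), read_file(mds, by_name, "/logs/app.log"))
```

Adding capacity in this model is just a matter of appending another `DataServer` to the list the meta-data server knows about, which is the incremental scaling the paragraph refers to.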

The architecture is further refined by parallel file systems (e.g. Lustre) that cut every file into small blocks and distribute the blocks over many nodes. Read and write operations can then be executed in parallel by multiple object storage server (OSS) nodes on their object storage target (OST) devices, for higher maximum throughput. Lustre has metadata server (MDS) nodes, each with one or more metadata target (MDT) devices per Lustre file system, that store namespace metadata such as filenames, directories, access permissions and file layout. The MDS is only involved in pathname and permission checks and is not involved in any file I/O operations, which avoids I/O scalability bottlenecks on the metadata server. A distributed meta-data architecture (as, for instance, in Ceph) overcomes the bottleneck of the central meta-data server. Distributed meta-data handling is more complex than distributed data handling because all meta-data servers are closely coupled and need to agree on a single state of the file system tree.
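
To make the block distribution concrete, here is a small sketch of how a byte offset in a file could map to a storage target under plain round-robin (RAID-0 style) striping. The function name and the example numbers are assumptions for illustration; real file layouts carry more state than a stripe size and a stripe count.

```python
def locate_stripe(offset, stripe_size, stripe_count):
    """Map a file byte offset to (target index, offset inside that target's object)."""
    stripe_number = offset // stripe_size
    target_index = stripe_number % stripe_count
    # Each full round over all targets adds one stripe to every target's object.
    round_number = stripe_number // stripe_count
    object_offset = round_number * stripe_size + offset % stripe_size
    return target_index, object_offset


MIB = 1024 * 1024
for offset in (0, MIB, 4 * MIB + 10, 5 * MIB + 7):
    print(offset, locate_stripe(offset, stripe_size=MIB, stripe_count=4))
```

Because consecutive stripes land on different targets, a large sequential read or write naturally fans out over several servers at once.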

Decomposition and modularization are well adopted in various distributed file systems. Examples include the offloading of authentication to Kerberos in AFS, the offloading of distributed consensus to Chubby in GFS or ZooKeeper in HDFS, and the multi-layered implementation of Ceph, with the independent RADOS key-value store as the building block beneath the file system. Ceph uses a B-tree-based design to ensure that the set of key-value pairs is distributed among many Ceph objects, which leads to efficient handling of objects.

A distributed file system should be fast and must scale with ease to large volumes of files, users and nodes. It must sustain hardware failures and recover gracefully from them, ensure the integrity of the file system over long data retention periods, and survive flaky network connections. Caching and file striping are standard techniques to improve the speed of distributed file systems. Caches can be located in memory, in flash memory, or on solid state drives or disks. They are filled transparently on access. Opportunistic pre-placement of data sets is typically not part of distributed file systems; it is implemented in data management systems on top of them, for faster data writes and retrieval, depending on the deployment choices made at the time of cluster design and implementation. At times, cache sizes need to be manually tuned according to the working set size of the applications.
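
A cache that is "filled transparently on access" can be sketched as a read-through cache with least-recently-used eviction. This is a generic illustration rather than the cache of any particular DFS; `fetch` is a hypothetical callable standing in for a remote read, and LRU is just one possible eviction policy.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve reads from memory when possible, otherwise fetch and remember."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch        # hypothetical: key -> bytes, e.g. a remote read
        self.capacity = capacity  # maximum number of cached entries
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.fetch(key)            # filled transparently on access
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


remote = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
cache = ReadThroughCache(fetch=remote.__getitem__, capacity=2)
for key in ("a", "b", "a", "c", "b"):
    cache.read(key)
print(cache.hits, cache.misses)
```

With a capacity of two and a working set of three keys, most reads in this run are misses, which is the situation the tuning remark above is about: the cache size has to fit the working set of the application.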

Striping, that is, reading and writing a file in small blocks on many servers in parallel, works only for large files. For small files, (memory) caches can improve read performance, but write performance on hard drives can suffer from the seek operations necessary to write data and meta-data to disk. The Ceph file system uses dynamic workload adaptation to change the mapping of meta-data to meta-data servers based on server load. To balance the cluster, the file system tree is dynamically re-partitioned and re-distributed, and single hot directories are split across multiple nodes.

When file contents are distributed across multiple nodes, replication or erasure coding is used to avoid data loss and to continue operation in case of hardware failures, that is, for fault tolerance. An engineering challenge is to place redundant data so that the redundancy crosses multiple failure domains. The Ceph file system, for instance, can parse the physical layout of a data center together with policies in order to distribute redundant data across multiple racks or disk shelves, maintaining a logical index of placement groups and pools. While replication is simple and fast, it also results in a large storage overhead. Most systems use a data replication factor of three. Erasure coding, on the other hand, can be seen as a more sophisticated, distributed RAID system, where every file is chunked (read: broken into pieces or fragments) and additional redundancy blocks are computed. Typically the storage overhead of erasure codes is much smaller than a factor of two, but this comes at the cost of computational complexity and more compute power. Modifications or updates to a file, as well as recovery from hardware faults, are expensive because the redundancy blocks have to be recalculated and redistributed.
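
The trade-off can be made concrete with a little arithmetic and the simplest possible erasure code, a single XOR parity block (the idea behind RAID 5). This is a sketch under stated assumptions: the function names and the block counts are made up for the example, and production systems typically use stronger codes such as Reed-Solomon that survive more than one lost block.

```python
def storage_overhead(data_blocks, redundancy_blocks):
    """Raw bytes stored per byte of user data."""
    return (data_blocks + redundancy_blocks) / data_blocks


def split_into_blocks(data, k):
    """Cut data into k equal-sized blocks, zero-padding the tail."""
    block_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_len * k, b"\0")
    return [padded[i * block_len:(i + 1) * block_len] for i in range(k)]


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Three copies of every block versus 10 data blocks plus 4 redundancy blocks.
print(storage_overhead(1, 2), storage_overhead(10, 4))

blocks = split_into_blocks(b"distributed file systems", k=4)
parity = xor_blocks(blocks)

# Simulate losing block 2 and rebuild it from the survivors plus the parity.
survivors = [b for i, b in enumerate(blocks) if i != 2]
recovered = xor_blocks(survivors + [parity])
print(recovered == blocks[2])
```

Note that rebuilding one lost block means reading every surviving block of the stripe, and any update to a data block forces the parity to be recomputed. That is the recovery and update cost the paragraph mentions.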

Lastly, geo-distributed file systems often need to transfer data via untrusted connections and still ensure the integrity and authenticity of the data. Cryptographic hashes of file contents are often used to ensure data integrity. A cryptographic hash provides a short, constant-length identifier, unique for all practical purposes, for data of any size. Collisions are virtually impossible, whether by chance or by clever crafting, which makes cryptographic hashes a means to protect against data tampering. Many globally distributed file systems use cryptographic hashes in the form of content-addressable storage.
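
Content-addressable storage is easy to sketch: the address of a blob is the hash of its content, so a reader can verify what it received without trusting the transport. The class below is a toy in-memory illustration; its name and the choice of SHA-256 are assumptions for this example, not a description of a specific file system.

```python
import hashlib


class ContentAddressableStore:
    """Store blobs under the SHA-256 digest of their content."""

    def __init__(self):
        self.blobs = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data  # identical content always maps to the same key
        return key

    def get(self, key):
        data = self.blobs[key]
        # Re-hash on read: tampering or corruption changes the digest.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its address")
        return data


store = ContentAddressableStore()
key = store.put(b"sensor reading 42")
print(len(key), store.get(key))
```

A side effect of this design is deduplication, since storing the same content twice yields the same key and therefore a single stored copy.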

In summary..

Distributed file systems provide a relatively well-defined and general-purpose interface for applications to use large-scale persistent storage. Two aspects are key for any DFS: how it handles location transparency via the namespace, and how it handles redundancy via replication of data across different servers. The Namespace Referral Protocol allows file system clients to resolve names from a namespace distributed across many servers and geographies into local names on specific file servers. Once the names are resolved, DFS clients can directly access files on the identified servers using the supported file system protocols. The hierarchical namespace provides a natural and flexible way to store large amounts of replicated, unstructured data. The implementation of a distributed file system, however, is always tailored to a particular class of applications. Even though a large number of distributed file systems are available today, some practical and important use cases are still not covered.
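
The idea of resolving a namespace name into a local name on a specific server can be illustrated with a longest-prefix lookup. This sketch shows only the concept and not the wire format of any real referral protocol; the function name, the referral table, the server names and the export paths are all hypothetical.

```python
def resolve(referrals, path):
    """Return (server, local_path) for the longest matching namespace prefix."""
    best = None
    for prefix, (server, export) in referrals.items():
        base = prefix.rstrip("/")
        # Match whole path components only, so "/corp" does not match "/corporate".
        if path == prefix or path.startswith(base + "/"):
            if best is None or len(base) > len(best[0]):
                best = (base, server, export)
    if best is None:
        raise KeyError(path)
    base, server, export = best
    return server, export.rstrip("/") + path[len(base):]


# Hypothetical referral table: namespace prefix -> (file server, local export).
referrals = {
    "/corp": ("fs1", "/export/corp"),
    "/corp/projects": ("fs2", "/data/projects"),
}
print(resolve(referrals, "/corp/projects/plan.txt"))
print(resolve(referrals, "/corp/hr/policy.txt"))
```

Once the client has the server and local path, it talks to that server directly, as described above.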

Today, digital transformation is driving non-transactional data from devices, sensors and event-based, low-latency workloads that demand performance at scale. The proven architectures of distributed computing are making the world more connected than ever before, and distributed file systems have made this a reality. The future of DFS will be dictated by OTT, IoT and compute-intensive AI/ML workloads, for sure!

July 2020. Compilation from various publicly available internet sources; the author's views are personal.