CEPH Unified Storage, Simplified!
In today's era, data dictates the dynamics of applications and must support the mobility and scale of an interconnected digital world. Data storage is becoming a challenge for organizations that keep relying on legacy approaches, which maintain separate silos of file systems, object storage (read: file servers) and block storage systems. Most of these organizations have stopped innovating, because innovation demands reinvesting in a completely new set of hardware and network gear with no guarantee of interoperability, on top of the lock-in of the existing legacy systems.
Current storage systems and solutions span a few decades and have been hardwired, so they struggle with scalability and interoperability; their performance improves only by adding newer and faster hardware and disk arrays. With the unprecedented growth of data and processing needs, these hierarchical systems become bottlenecks for real-time data storage and retrieval. Distributed applications and containers demand fast access to data along with consistency and partition tolerance, so solving only for availability, as traditional storage systems do, no longer works. From the beginning, file, object and block storage systems have not talked to each other and have been positioned as separate solutions, maintaining silos.
Convergence in storage demands a radical approach that builds in performance, future enhancements, scalability and availability, since all of these factors directly affect how effectively and efficiently a storage system works. Such a system should solve the evident problems of the isolated legacy storage solutions. It is thus prudent to have these essential features as part of a single unified design rather than as add-ons. The silver bullet for solving all these storage problems is Ceph…
Ceph is open source (read: free to use). It implements object storage on a single distributed cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level, and freely available, says Wikipedia.
Ceph is an open source, reliable and easy-to-manage, next-generation distributed object store based storage platform that provides different interfaces for the storage of unstructured data from applications and pools in the various storage resources as a unified system. Developed at University of California, Santa Cruz, by Sage Weil in 2003 as a part of his PhD project, the name "Ceph" is an abbreviation of "cephalopod", a class of molluscs that includes the octopus. The name (emphasized by the logo) suggests the highly parallel behavior of an octopus.
In 2012, Sage Weil created Inktank Storage for professional services and support for Ceph. In October 2015, the Ceph Community Advisory Board was formed to assist the community in driving the direction of open source software-defined storage technology. The charter advisory board includes Ceph community members from global IT organizations that are committed to the Ceph project, including individuals from Canonical, CERN, Cisco, Fujitsu, Intel, Red Hat, SanDisk, and SUSE.
The Ceph ecosystem can be broadly divided into four segments (see Figure 1): clients (users of the data), metadata servers (which cache and synchronize the distributed metadata), an object storage cluster (which stores both data and metadata as objects and implements other key responsibilities), and finally the cluster monitors (which implement the monitoring functions). One of the key differences between Ceph and traditional file systems is that rather than focusing the intelligence in the file system itself, the intelligence is distributed around the ecosystem.
Ceph key components & daemons
A high-level component view of a Ceph system provides insight into its five key components.
A Ceph Storage Cluster may contain thousands of storage nodes spread across multiple clusters. A minimal system will have at least one Ceph Monitor and two Ceph OSD Daemons for data replication. The Ceph Filesystem, Ceph Object Storage and Ceph Block Devices read data from and write data to the Ceph Storage Cluster.
A Ceph system thus relies on four key daemons, namely ceph-mon, ceph-mds, ceph-osd and ceph-rgw. Together these daemons help the Ceph system monitor and track what is stored, maintain metadata (read: data about data), journal the changes, and expose the object storage layer as an interface via APIs for ease of integration with Ceph clients and thus applications.
How does Ceph work?
The Ceph Client is the user of the Ceph file system. The Ceph Metadata Daemon provides the metadata services, while the Ceph Object Storage Daemon provides the actual storage (for both data and metadata). Finally, the Ceph Monitor provides cluster management. There can be many Ceph clients, many object storage endpoints, numerous metadata servers (depending on the capacity of the file system), and at least a redundant pair of monitors.
All of these daemons are fully distributed and can run on the same set of servers, and clients can interact with them directly, which makes Ceph agile, adaptive and horizontally scalable storage. It is truly a scalable, high-performance distributed file system that ensures performance, reliability, and scalability. Ceph stores data as objects within logical storage pools. Using the CRUSH algorithm, Ceph calculates which placement group should contain the object, and further calculates which Ceph OSD Daemon should store the placement group. This enables the Ceph Storage Cluster to scale, rebalance, and recover dynamically.
Ceph maps placement groups onto the OSDs using the CRUSH algorithm. These placement groups can be combined to form a pool, which is like a logical partition for storing the objects in Ceph. Pools can help differentiate between storage hardware based on performance. Ceph also has cache tiering, which creates a pool of faster storage devices that acts as cache storage for expensive read/write operations. This improves performance and makes more efficient use of the storage hardware. With OpenStack as the cloud platform, Ceph can be used as a Swift object store and Cinder block store, utilizing the same storage hardware for multiple needs. Ceph can also be used with other cloud platforms like CloudStack, Eucalyptus and OpenNebula. Ceph is thus not only a file system but an object storage ecosystem with enterprise-class features.
The user's perspective of Ceph is transparent. From the users' point of view, they have access to a large storage system and are not aware of the underlying metadata servers, monitors, and individual object storage devices that aggregate into a massive storage pool. Users simply see a mount point, from which standard file I/O can be performed. With Ceph, the file system's intelligence is distributed across the nodes, which simplifies the client interface and also gives Ceph the ability to scale massively (even dynamically).
The Ceph metadata server
The job of the metadata server (cmds) is to manage the file system's namespace. Even though both metadata and data are stored in the object storage cluster, they are managed separately to support scalability. Metadata is further split among a cluster of metadata servers that can adaptively replicate and distribute the namespace to avoid hot spots and ensure redundancy. As shown below, the metadata servers manage portions of the namespace and can overlap (for redundancy and also for performance). The mapping of metadata servers to namespace is performed in Ceph using dynamic subtree partitioning, which allows Ceph to autonomously adapt to changing workloads (migrating namespaces between metadata servers) while preserving locality for performance and efficiency of read operations.
Each metadata server simply manages the namespace for the population of clients. Its primary role is that of an intelligent metadata cache, because the actual metadata is eventually stored within the object storage cluster. Metadata to be written is cached in a short-term journal, which is eventually pushed to physical storage. This behavior allows the metadata server to serve recent metadata back to clients, a common pattern in metadata operations, and provides faster response times for retrieval. The journal is also useful for failure recovery: if the metadata server fails, its journal can be replayed to ensure that metadata is safely stored on disk. Metadata servers manage the inode space, converting file names to metadata. The metadata server transforms the file name into an inode, file size, and striping data (layout) that the Ceph client uses for file I/O.
Managing Objects in Ceph
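The journal-then-flush behavior described above can be illustrated with a small Python sketch. This is a toy model, not Ceph's metadata server code: the class name MetadataServer, the recover helper and the dictionary that stands in for the object storage cluster are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataServer:
    """Toy metadata server: a journal in front of a slower backing store."""
    journal: list = field(default_factory=list)         # short-term, append-only
    cache: dict = field(default_factory=dict)           # recent metadata served to clients
    backing_store: dict = field(default_factory=dict)   # stands in for the object storage cluster

    def write(self, path, inode_info):
        # Record the update in the journal first, then make it visible in the cache.
        self.journal.append((path, inode_info))
        self.cache[path] = inode_info

    def lookup(self, path):
        # Recent metadata is answered from the cache, older entries from the store.
        return self.cache.get(path, self.backing_store.get(path))

    def flush(self):
        # Push journaled updates to long-term storage and trim the journal.
        for path, inode_info in self.journal:
            self.backing_store[path] = inode_info
        self.journal.clear()


def recover(journal, backing_store):
    """Replay a surviving journal into the store after a metadata server failure."""
    for path, inode_info in journal:
        backing_store[path] = inode_info
    return backing_store


mds = MetadataServer()
mds.write("/home/a.txt", {"ino": 1, "size": 10})
mds.write("/home/a.txt", {"ino": 1, "size": 42})   # later update wins on replay
surviving_journal = list(mds.journal)              # pretend the journal outlived a crash
restored = recover(surviving_journal, {})
print(restored)
```

Replaying the journal in order is what makes the later update win, which is why an append-only log is enough to rebuild the most recent state in this simplified model.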
A file / object is assigned an inode number (INO) from the metadata server, which is a unique identifier. Using the INO and the object number (ONO), each object is assigned an object ID (OID). Using a simple hash over the OID, each object is assigned to a placement group.
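As a rough sketch of this step, the Python below builds an object ID from an inode number and an object number and hashes it to a placement group. The naming format (hex inode, a dot, eight hex digits), the use of CRC32 as the hash and the value of PG_NUM are assumptions for illustration only; Ceph uses its own hash function and pool settings.

```python
import zlib


def object_id(ino: int, ono: int) -> str:
    # Assumed naming scheme: inode number in hex, a dot, object number as 8 hex digits.
    return f"{ino:x}.{ono:08x}"


def placement_group(oid: str, pg_num: int) -> int:
    # CRC32 stands in for Ceph's real hash; any stable hash illustrates the idea.
    return zlib.crc32(oid.encode("utf-8")) % pg_num


PG_NUM = 128  # hypothetical number of placement groups in the pool
for ono in range(3):
    oid = object_id(0x10000000000, ono)
    print(oid, "-> pg", placement_group(oid, PG_NUM))
```

Because the hash depends only on the object ID, any client can compute the placement group on its own, without asking a central service.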
The placement group (identified as a PGID) is a conceptual container for all objects. Ceph must handle many types of operations, including data durability via replicas or erasure code chunks, data integrity by scrubbing or CRC checks, replication, rebalancing and recovery. Consequently, managing data on a per-object basis presents a scalability and performance bottleneck. Ceph addresses this bottleneck by sharding a pool into these placement groups.
Finally, the mapping of the placement group to object storage devices is a pseudo-random mapping using an algorithm called Controlled Replication Under Scalable Hashing (CRUSH). The final component for allocation is the cluster map. The cluster map is an efficient representation of the devices that make up the storage cluster. With a PGID and the cluster map, Ceph can locate any file or object.
Ceph includes monitors that implement management of the cluster map, but some elements of fault management are implemented in the object store itself. When object storage devices fail or new devices are added, monitors detect and maintain a valid cluster map. This function is performed in a distributed fashion where map updates are communicated with existing traffic. Ceph uses Paxos, which is a family of algorithms for distributed consensus.
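Consensus protocols in the Paxos family need a majority of participants to agree, which has a practical consequence for how many monitors a cluster runs. The arithmetic can be sketched as follows; the function names are made up for this example, and the only assumption is that a strict majority of monitors must be reachable.

```python
def quorum_size(monitors: int) -> int:
    # Smallest group that is a strict majority of the monitors.
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    # Monitors that can be lost while a majority is still reachable.
    return monitors - quorum_size(monitors)


for n in (1, 2, 3, 4, 5):
    print(n, "monitors: quorum", quorum_size(n), "tolerates", tolerated_failures(n), "failure(s)")
```

Under this assumption an odd monitor count is the usual choice, since going from three to four monitors does not raise the number of failures the cluster can tolerate.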
Ceph object storage
Similar to traditional object storage, Ceph storage nodes include not only storage but also intelligence. Traditional drives are simple targets that only respond to commands from initiators. But object storage devices are intelligent devices that act as both targets and initiators to support communication and collaboration with other object storage devices.
The CRUSH Algorithm & Maps
The architecture of Ceph might seem familiar to those who know the Google File System (GFS) and the Hadoop Distributed File System (HDFS), but it differs from them in several ways. Ceph uses the CRUSH algorithm (which its designers describe as controlled, scalable, decentralized placement of replicated data) for pseudo-random, distributed data placement among the OSDs. Ceph doesn’t need two round trips for data retrieval like HDFS or GFS, in which one trip goes to the central lookup table to find the data location and the second goes to the located data node. The location of every piece of data stored in Ceph OSDs is calculated using the CRUSH algorithm, independent of any other attribute. When a client requests data from Ceph, the CRUSH algorithm is used to find the exact location of all the requested blocks, and the data is transferred by the responsible OSD nodes. Whenever an OSD goes down, a new cluster map is generated in the background and the replicated data of the failed OSD is transferred to a new node based on results from the CRUSH algorithm.
Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the cluster topology, which is inclusive of 5 maps collectively referred to as the “Cluster Map”: the Monitor Map, the OSD Map, the PG Map, the MDS Map and the CRUSH Map.
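To make the idea of computed placement concrete, here is a minimal Python sketch that uses rendezvous (highest random weight) hashing as a stand-in for CRUSH. It is not the CRUSH algorithm, which is hierarchical and weight-aware; the OSD names, the flat list used as a cluster map and the replica count are assumptions for illustration.

```python
import hashlib


def score(pgid: int, osd: str) -> int:
    # Deterministic pseudo-random score for a (placement group, OSD) pair.
    digest = hashlib.sha256(f"{pgid}:{osd}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(pgid: int, osds: list, replicas: int = 3) -> list:
    # Highest-scoring OSDs win; the first one plays the role of the primary.
    return sorted(osds, key=lambda osd: score(pgid, osd), reverse=True)[:replicas]


cluster = [f"osd.{i}" for i in range(8)]   # hypothetical cluster map: just a list of OSD names
before = {pg: place(pg, cluster) for pg in range(64)}

# osd.3 fails: every client recomputes placements from the new map.
survivors = [osd for osd in cluster if osd != "osd.3"]
after = {pg: place(pg, survivors) for pg in range(64)}

moved = [pg for pg in before if before[pg] != after[pg]]
untouched_ok = all(before[pg] == after[pg] for pg in before if "osd.3" not in before[pg])
print(len(moved), "of 64 placement groups changed; others untouched:", untouched_ok)
```

Two properties carry over to the real system in spirit: any party holding the same map computes the same placement with no lookup table, and removing one OSD only disturbs the placement groups that had a copy on it.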
Each map maintains an iterative history of its operating state changes. Ceph Monitors maintain a master copy of the cluster map including the cluster members, state, changes, and the overall health of the Ceph Storage Cluster.
High Availability Authentication
To identify users and protect against man-in-the-middle attacks, Ceph provides its cephx authentication system to authenticate users and daemons. Cephx uses shared secret keys for authentication, meaning both the client and the monitor cluster have a copy of the client’s secret key. The authentication protocol is such that both parties are able to prove to each other they have a copy of the key without actually revealing it. This provides mutual authentication, which means the cluster is sure the user possesses the secret key, and the user is sure that the cluster has a copy of the secret key.
A key scalability feature of Ceph is to avoid a centralized interface to the Ceph object store, which means that Ceph clients must be able to interact with OSDs directly. To protect data, Ceph provides its cephx authentication system, which authenticates users operating Ceph clients. The cephx protocol operates in a manner similar to Kerberos.
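The idea of proving possession of a shared key without revealing it can be sketched with a generic HMAC challenge-response exchange in Python. This is not the cephx wire protocol; the prove and verify helpers and the message flow are assumptions chosen only to illustrate mutual authentication with a shared secret.

```python
import hashlib
import hmac
import secrets


def prove(secret: bytes, challenge: bytes) -> bytes:
    # Proof of possession: a keyed digest of the other side's random challenge.
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret: bytes, challenge: bytes, proof: bytes) -> bool:
    # Constant-time comparison of the expected and received proofs.
    return hmac.compare_digest(prove(secret, challenge), proof)


shared_secret = secrets.token_bytes(32)   # held by both the client and the monitor

# The monitor challenges the client.
monitor_challenge = secrets.token_bytes(16)
client_proof = prove(shared_secret, monitor_challenge)
client_ok = verify(shared_secret, monitor_challenge, client_proof)

# The client challenges the monitor.
client_challenge = secrets.token_bytes(16)
monitor_proof = prove(shared_secret, client_challenge)
monitor_ok = verify(shared_secret, client_challenge, monitor_proof)

# An impostor without the key cannot answer the challenge.
impostor_ok = verify(shared_secret, monitor_challenge, prove(b"guess", monitor_challenge))
print(client_ok, monitor_ok, impostor_ok)
```

Only the random challenges and the keyed digests cross the network in this sketch, so an eavesdropper never sees the secret itself.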
Ceph Monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. Like Kerberos, cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained surreptitiously. This form of authentication will prevent attackers with access to the communications medium from either creating bogus messages under another user’s identity or altering another user’s legitimate messages, as long as the user’s secret key is not divulged before it expires.
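Ticket expiry can be illustrated with a simplified signed-ticket sketch in Python. Real cephx tickets are structured differently; the issue_ticket and osd_accepts functions, the JSON ticket body, the placeholder secret and the example user name are all assumptions for illustration.

```python
import hashlib
import hmac
import json
import time


def issue_ticket(service_secret: bytes, user: str, ttl_seconds: int, now: float) -> dict:
    # The monitor signs the user name and expiry time with the secret it shares with OSDs.
    body = json.dumps({"user": user, "expires": now + ttl_seconds}, sort_keys=True)
    sig = hmac.new(service_secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}


def osd_accepts(service_secret: bytes, ticket: dict, now: float) -> bool:
    expected = hmac.new(service_secret, ticket["body"].encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["sig"]):
        return False                                      # forged or altered ticket
    return now < json.loads(ticket["body"])["expires"]    # stale tickets are refused


secret = b"monitor-and-osd-shared-secret"   # placeholder value for illustration
t0 = time.time()
ticket = issue_ticket(secret, "client.admin", ttl_seconds=3600, now=t0)
print(osd_accepts(secret, ticket, now=t0 + 60))      # checked within the ticket lifetime
print(osd_accepts(secret, ticket, now=t0 + 7200))    # checked after the ticket has expired
```

Because the OSD holds the same secret as the monitor, it can validate the ticket locally without contacting the monitor for every request.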
In a nutshell: Ceph logical data flow
The management of data as it flows through a Ceph cluster involves each of the components, and the coordination among these components empowers Ceph to provide a reliable and robust storage system. Data management begins with clients writing data to pools. When a client writes data to a Ceph pool, the data is sent to the primary OSD. The primary OSD commits the data locally and sends an immediate acknowledgement to the client if the replication factor is 1. If the replication factor is greater than 1 (as it should be in any serious deployment), the primary OSD issues write subops to each subsidiary (secondary, tertiary, etc.) OSD and awaits a response. Depending on the configuration, these subops ensure that multiple copies of the data are written to the respective OSDs, and the metadata is updated subsequently. Here is a simplified illustration of the Ceph logical data flow.
Getting There: Enabling Hyperscale
The primary reason we are going after Ceph is scalability, and one of the known challenges in scalability is cluster awareness. In many clustered architectures, the primary purpose of cluster membership is to let a centralized interface know which nodes it can access. The centralized interface then provides services to the client through a double dispatch, which is a huge bottleneck at the petabyte-to-exabyte scale. To overcome this barrier, Ceph's OSD Daemons and Ceph Clients are both cluster aware by design. Like Ceph Clients, each Ceph OSD Daemon knows about the other Ceph OSD Daemons in the cluster, which enables Ceph OSD Daemons to interact directly with each other and with Ceph Monitors. It also enables Ceph Clients to interact directly with Ceph OSD Daemons, thus removing the bottleneck.
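The write path described above can be modeled in a few lines of Python. This is a deliberately simplified, synchronous sketch: the OSD class, the client_write function and the in-memory dictionaries are assumptions for illustration, and real OSDs handle these steps asynchronously over the network.

```python
class OSD:
    """Toy object storage daemon that keeps objects in a dictionary."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def commit(self, oid, data):
        self.store[oid] = data
        return True   # acknowledgement back to the caller


def client_write(oid, data, acting_set):
    """Write through the primary; acknowledge only after every replica has committed."""
    primary, *replicas = acting_set
    primary.commit(oid, data)
    # With a replication factor of 1 there are no replicas and the ack is immediate.
    acks = [replica.commit(oid, data) for replica in replicas]
    return all(acks)


osds = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
acked = client_write("10000000000.00000000", b"hello", osds)
print(acked, [osd.name for osd in osds if "10000000000.00000000" in osd.store])
```

The client only talks to the primary in this model, and the primary takes responsibility for the remaining copies, which mirrors the division of work in the paragraph above.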
Key Considerations / Recommendations
Deployment and Hardware Considerations
Ceph deployments can become complex because Ceph offers unified storage with varied options: bare-metal servers, virtualized instances and even public cloud storage buckets can all be part of a Ceph system. There are other options too, such as using Ceph as part of Apache Mesos, with containers, or simply as a stand-alone Ceph cluster. If you already have a few high-end servers and wish to build a high-performance Ceph platform, you need a deployment strategy for how you will use the CPU, RAM and disks of the bare-metal servers to get the best out of the infrastructure resources at hand.
A good strategy is to follow a few rules of thumb: for every disk on the bare-metal server create one corresponding OSD and earmark 1 GB of RAM per TB of storage; similarly, allocate one CPU thread for each OSD, since RADOS needs a lot of processing for the data replication, erasure coding, rebalancing, recovery, monitoring and reporting functions that run locally on each OSD node. Remember to have different CRUSH maps for different disk types in a high-performance cluster. For clusters with medium or low storage latency requirements, the minimum hardware requirements are listed in the Ceph documentation under hardware recommendations.
In conclusion, Ceph can serve all three commonly found storage types (object, block, and file), a flexibility that separates it from the rest of the SDS herd. Its inherent scale-out support aligns well with horizontal scaling, which means you can gradually build large systems as cost and demand require, and it sports enterprise-grade features such as erasure coding, thin provisioning, cloning, load balancing, automated tiering between flash and hard drives, and simplified maintenance and debugging. Its Network File System (NFS) interface allows access to the same data through both public cloud storage API interfaces and NFS file interfaces, and it is compatible with the Hadoop S3A filesystem client, enabling developers to use Apache Hadoop MapReduce, Hive, and Spark for seamless integration.
The recent acquisition of Red Hat by IBM raised questions about the future of Ceph. Analysts, however, see an advantage for IBM: Ceph would give it a truly unified software-defined storage platform covering block, file and object storage, which provides a wider strategic advantage and certainly brings IBM closer to the open source community.
According to Gartner Inc., Ceph has made a successful strategic entry into the enterprise IT storage space and has proved to be the next big evolution in storage technology. With the current adoption rate, Ceph will soon surpass the existing storage solutions at enterprises. There is a lot of development happening with Ceph, which will bring about significant performance improvements to match the current proprietary solutions. Even if you don’t take future enhancements into consideration, Ceph is a storage platform that definitely needs to be looked at for any big deployment that demands data at large scale and is riding high on the software-defined storage (SDS) digital transformation curve.
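The rules of thumb above translate directly into a small sizing calculation. The Python sketch below applies them to one node; the NodePlan type, the plan_node function and the example disk layout are made up for illustration, and the result is a starting estimate rather than a substitute for the official hardware recommendations.

```python
from dataclasses import dataclass


@dataclass
class NodePlan:
    osds: int          # one OSD per disk
    ram_gb: float      # 1 GB of RAM per TB of disk capacity
    cpu_threads: int   # one CPU thread per OSD


def plan_node(disk_sizes_tb):
    osds = len(disk_sizes_tb)
    ram_gb = float(sum(disk_sizes_tb))
    return NodePlan(osds=osds, ram_gb=ram_gb, cpu_threads=osds)


# Hypothetical server with four 4 TB disks and two 8 TB disks.
plan = plan_node([4, 4, 4, 4, 8, 8])
print(plan)
```

For this example layout the estimate comes to 6 OSDs, 32 GB of RAM and 6 CPU threads for the OSD daemons alone, before counting the operating system or any co-located monitor or metadata daemons.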