Aiming for Strong Data Consistency in Notoriously Concurrent and Fault-Prone Environments

Developing a Strongly Consistent, Long-Lived, Fault-Tolerant, Distributed Storage System with a Failure Prediction Mechanism

Distributed Storage Systems (DSS) encompass the technology powering modern cloud storage services, such as Dropbox and Google Drive, which millions of users rely on as networked platforms for collaborative applications and data storage. DSS algorithms ensure data availability and survivability by replicating data across geographically dispersed network locations. A major challenge of data distribution, however, is consistency, especially when the storage is accessed concurrently by multiple processes; such concurrent access is precisely what enables collaboration. Numerous strategies have been devised to mitigate these issues, but a robust and efficient solution remains elusive. In this project, we propose a novel atomic DSS built on top of asynchronous, message-passing, failure-prone commodity devices. Ultimately, the project aims at the implementation of a DSS with the following characteristics.

Strong Consistency (Atomicity)

Despite concurrent operations, asynchrony, and node failures, our goal is to design algorithms for read/write objects guaranteeing that each read operation returns a value no older than the value written by its latest preceding write, and no older than the value returned by any preceding read. This consistency guarantee is known as atomicity. Atomicity is the most natural consistency guarantee, as it provides the illusion of a centralized, sequentially accessed storage.
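A minimal, single-process sketch of this guarantee, in the spirit of classic quorum-based atomic register emulations (a two-phase query/propagate structure; the in-memory replicas and all names are illustrative, not the project's actual protocol):

```python
import random

class Replica:
    """A data host storing a (tag, value) pair; tag = (sequence, writer_id)."""
    def __init__(self):
        self.tag = (0, 0)
        self.value = None

    def query(self):
        return self.tag, self.value

    def update(self, tag, value):
        # Adopt the incoming pair only if its tag is strictly newer.
        if tag > self.tag:
            self.tag, self.value = tag, value

def majority(replicas):
    # Contact a random majority, emulating that some hosts may be slow or crashed.
    return random.sample(replicas, len(replicas) // 2 + 1)

def write(replicas, writer_id, value):
    # Phase 1: discover the highest tag held by some majority.
    max_tag = max(r.query()[0] for r in majority(replicas))
    # Phase 2: propagate a strictly higher tag to a majority.
    new_tag = (max_tag[0] + 1, writer_id)
    for r in majority(replicas):
        r.update(new_tag, value)

def read(replicas):
    # Phase 1: find the value with the highest tag among a majority.
    tag, value = max((r.query() for r in majority(replicas)), key=lambda p: p[0])
    # Phase 2 (write-back): ensure a majority holds it before returning,
    # so no later read can observe an older value.
    for r in majority(replicas):
        r.update(tag, value)
    return value

replicas = [Replica() for _ in range(5)]
write(replicas, writer_id=1, value="v1")
print(read(replicas))  # → v1
```

Because any two majorities intersect, a read always encounters at least one replica holding the latest written tag, and the write-back phase preserves the read-after-read ordering described above.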

Fault Tolerance

The service will guarantee the termination of read/write operations despite transient or persistent failures of data hosts in the system. In this project, we focus on crash failures.
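Under the crash-failure model, the tolerance of a quorum-based design can be quantified as follows (a sketch assuming majority quorums, one common choice rather than the project's fixed design):

```python
def quorum_size(n):
    # Majority quorum: any two quorums of n hosts intersect in at least one host.
    return n // 2 + 1

def max_crash_failures(n):
    # Operations terminate as long as one full quorum still responds,
    # so up to n - quorum_size(n) hosts may crash.
    return n - quorum_size(n)

for n in (3, 5, 7):
    print(f"n={n}: quorum {quorum_size(n)}, tolerates {max_crash_failures(n)} crashes")
```

In other words, with n replicas the service survives any minority of crashes, which is optimal for asynchronous, crash-prone message passing.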

Longevity

To ensure that persistent faults will not affect the operation of the service in the future, the service will implement mechanisms to remove faulty data hosts, add new healthy replacements, and migrate the data, providing a seamless, uninterrupted experience to the clients. Such mechanisms are known as reconfigurations, since they update the membership of the host nodes.
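The membership update and data migration could be sketched roughly as follows. This is a toy model: it assumes the surviving hosts have already converged on the latest state, whereas a real reconfiguration protocol would first read the freshest state from a quorum of the old configuration; all names are illustrative:

```python
class Configuration:
    """Membership of data hosts; a reconfiguration installs a new membership
    and migrates the replicated state to the newly added hosts."""
    def __init__(self, hosts):
        self.hosts = set(hosts)

def reconfigure(config, store, remove, add):
    # store maps host -> replicated object state (one dict per host).
    new_hosts = (config.hosts - set(remove)) | set(add)
    # Migrate: seed every new host with the state of a surviving host
    # (assumption: survivors have converged; a real protocol reads a quorum).
    survivor = next(iter(config.hosts - set(remove)))
    for h in set(add):
        store[h] = dict(store[survivor])
    # Retire the removed hosts.
    for h in set(remove):
        store.pop(h, None)
    return Configuration(new_hosts)

cfg = Configuration(["a", "b", "c"])
store = {"a": {"x": 1}, "b": {"x": 1}, "c": {"x": 1}}
cfg = reconfigure(cfg, store, remove=["c"], add=["d"])
print(sorted(cfg.hosts))  # → ['a', 'b', 'd']
```

Clients keep operating against whichever configuration is current, which is what makes the transition seamless from their point of view.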

Failure Prediction

It is one thing to reconfigure and another to know when to reconfigure. The last characteristic of the service is to employ machine learning algorithms to predict which storage devices are about to fail. This makes it possible to determine which hosts will become unavailable and, in turn, how the service needs to reconfigure to maintain functionality.
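As a placeholder for how such a predictor could feed the reconfiguration mechanism, here is a hand-weighted linear risk score over hypothetical device health metrics (the metric names, weights, and threshold are all illustrative; a real deployment would learn them from historical failure data, e.g. SMART logs):

```python
def failure_risk(metrics, weights=None):
    """Tiny linear risk score over hypothetical device health metrics.
    A trained ML model over historical failure logs would replace the
    hand-picked weights below."""
    weights = weights or {
        "reallocated_sectors": 0.5,   # illustrative metric names
        "read_error_rate": 0.3,
        "temperature_excess": 0.2,
    }
    score = sum(weights[k] * metrics.get(k, 0.0) for k in weights)
    return min(1.0, score)  # clamp to [0, 1]

def hosts_to_replace(fleet, threshold=0.5):
    # Flag hosts whose predicted risk exceeds the threshold; these become
    # removal candidates for the next reconfiguration.
    return [h for h, m in fleet.items() if failure_risk(m) > threshold]

fleet = {"h1": {"reallocated_sectors": 2.0}, "h2": {"temperature_excess": 0.5}}
print(hosts_to_replace(fleet))  # → ['h1']
```

The point of the sketch is the interface, not the model: the predictor outputs a set of at-risk hosts, and the reconfiguration service consumes that set to decide its next membership change.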

Minimum Viable Prototype

Essentially, we would like to devise an efficient prototype of an atomic, distributed storage system by combining the following key services:

  1. Distributed Object Management,
  2. Data Fragmentation,
  3. Object Reconfiguration, and
  4. Failure Prediction.