Network bandwidth variation-adapted state transfer for geo-replicated state machines and its application to dynamic replica replacement

Abstract

This paper proposes a new state transfer method for geographic state machine replication (SMR) that dynamically allocates the state to be transferred among replicas according to changes in communication bandwidth. SMR improves fault tolerance by replicating a service across multiple replicas. When a replica is newly added or recovers from a failure, the other replicas transfer the current state of the service to it. In geographic SMR, however, communication bandwidth differs from replica to replica and changes constantly. As a result, existing state transfer methods cannot fully utilize the available bandwidth, and their state transfer time increases. To overcome this problem, our method divides the state into multiple chunks and assigns them to replicas in accordance with each replica's bandwidth, so that a replica with higher bandwidth transfers more chunks. The proposed method also updates each replica's chunk assignment dynamically, based on the currently estimated bandwidth. A performance evaluation on Amazon EC2 shows that the proposed method reduces the state transfer time by up to 47% compared to the existing method. In addition, we apply the proposed method to dynamic replacement of replicas, which can mitigate latency degradation caused by network problems, and evaluate how quickly the method can relocate a replica.
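The chunk assignment idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names (`assign_chunks`, `reassign_remaining`), the strictly proportional split, and the largest-remainder rounding are all assumptions made for illustration, and the bandwidth estimates are taken as given inputs rather than measured.

```python
from typing import Dict, List


def assign_chunks(num_chunks: int, bandwidths: Dict[str, float]) -> Dict[str, int]:
    """Split num_chunks among replicas in proportion to estimated bandwidth.

    Hypothetical helper: the proportional rule and the rounding scheme
    are assumptions, not taken from the paper.
    """
    total = sum(bandwidths.values())
    if total <= 0:
        raise ValueError("total estimated bandwidth must be positive")

    # Ideal (fractional) share of chunks for each replica.
    shares = {r: num_chunks * bw / total for r, bw in bandwidths.items()}
    # Start from the integer part of each share.
    counts = {r: int(s) for r, s in shares.items()}

    # Hand the leftover chunks to the largest fractional remainders.
    leftover = num_chunks - sum(counts.values())
    by_remainder = sorted(shares, key=lambda r: shares[r] - counts[r], reverse=True)
    for r in by_remainder[:leftover]:
        counts[r] += 1
    return counts


def reassign_remaining(
    remaining_chunks: List[int], bandwidths: Dict[str, float]
) -> Dict[str, List[int]]:
    """Re-split the chunks not yet transferred using fresh bandwidth estimates."""
    counts = assign_chunks(len(remaining_chunks), bandwidths)
    assignment: Dict[str, List[int]] = {}
    start = 0
    for replica, n in counts.items():
        assignment[replica] = remaining_chunks[start:start + n]
        start += n
    return assignment


if __name__ == "__main__":
    # Initial split of 10 chunks; replica "a" has three times the bandwidth.
    print(assign_chunks(10, {"a": 30.0, "b": 10.0, "c": 10.0}))
    # Later, six chunks remain and the estimates have changed.
    print(reassign_remaining(list(range(6)), {"a": 20.0, "b": 10.0}))
```

Calling `reassign_remaining` each time the bandwidth estimates are refreshed stands in for the dynamic update step: only chunks that have not yet been sent are redistributed, so a replica whose bandwidth drops hands part of its pending work to faster replicas.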

Publication
Concurrency and Computation: Practice and Experience