Latencies Gone Wild!

gpang

Cloud services are becoming popular for large-scale computing and data management.  Amazon EC2 is a cloud service widely used by individuals and companies, and has clusters in 5 regions: US East, US West, EU, Asia (Singapore) and Asia (Tokyo).  However, failures can happen, even to entire clusters and regions.  In April 2011, Amazon suffered a failure in the US East region that lasted several days.  Most services were eventually restored, but 0.4% of database data could not be recovered and was lost.  Distributed systems that must be highly available therefore need to be replicated across data centers.  In addition, avoiding data loss entirely and providing higher levels of consistency can only be achieved through synchronous replication.

Because reliable distributed systems need to span multiple regions, their network messages must travel long distances, and latency suffers.  When two machines across the country or the globe need to communicate, the speed of light sets a lower bound on network latency.  For example, if California and Virginia are 4,000 kilometers apart, a round trip covers 8,000 kilometers, so at roughly 300,000 kilometers per second no round-trip message can complete in less than about 26.7 milliseconds.  RPCs within a single data center usually take less than 1 millisecond to complete, but RPCs to different regions are expected to take around 100 milliseconds or more.  We ran a few experiments on EC2 to measure cross data center message delays and get a better idea of how different regions affect latency.
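The speed-of-light bound can be checked with a few lines of arithmetic.  This is a minimal sketch, assuming a straight-line path and the speed of light in a vacuum; the function name min_round_trip_ms is ours, chosen for illustration.

```python
# Speed of light in a vacuum, in kilometers per second.
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def min_round_trip_ms(distance_km: float) -> float:
    """Theoretical lower bound on round-trip time over a one-way distance."""
    round_trip_km = 2 * distance_km
    return round_trip_km / SPEED_OF_LIGHT_KM_PER_S * 1000


# 4,000 km is the California-Virginia distance assumed in the text.
print(f"{min_round_trip_ms(4000):.1f} ms")  # prints "26.7 ms"
```

Real networks cannot reach this bound, since signals do not travel in a straight line at vacuum speed, which is consistent with the measured cross-region RPC times being several times higher.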

For the first experiment, we measured simulated 2048-byte echo RPCs between two machines in 3 different scenarios: both machines in the same data center, both machines in the same region but in different data centers, and the two machines in different regions.
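For readers who want to try a similar measurement, the following is a rough sketch of a 2048-byte echo round-trip timer over plain TCP.  It is not the harness used for the experiment: the helper names (recv_exact, serve_echo, measure_echo_latencies, run_local_demo) are hypothetical, and the demo runs both ends on localhost, so its numbers say nothing about EC2.

```python
import socket
import threading
import time

PAYLOAD_SIZE = 2048  # bytes, matching the message size in the experiment


def recv_exact(conn, n):
    """Read exactly n bytes, or fewer if the peer closes the connection."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            break
        buf.extend(chunk)
    return bytes(buf)


def serve_echo(listener):
    """Accept one connection and echo back every full payload received."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = recv_exact(conn, PAYLOAD_SIZE)
            if len(data) < PAYLOAD_SIZE:
                break  # client closed the connection
            conn.sendall(data)


def measure_echo_latencies(host, port, count):
    """Send `count` echo requests and return round-trip times in ms."""
    payload = b"x" * PAYLOAD_SIZE
    latencies = []
    with socket.create_connection((host, port)) as sock:
        # Disable Nagle's algorithm so small messages are sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            reply = recv_exact(sock, PAYLOAD_SIZE)
            latencies.append((time.perf_counter() - start) * 1000)
            if reply != payload:
                raise RuntimeError("echo reply did not match request")
    return latencies


def run_local_demo(count=100):
    """Run server and client on localhost; returns latencies in ms."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_echo, args=(listener,), daemon=True)
    server.start()
    latencies = measure_echo_latencies("127.0.0.1", port, count)
    server.join(timeout=5)
    listener.close()
    return latencies


if __name__ == "__main__":
    samples = run_local_demo()
    print(f"average: {sum(samples) / len(samples):.3f} ms")
```

To compare scenarios, the server half would run on one machine and measure_echo_latencies would be pointed at its address from a machine in the same data center, a different data center, or a different region.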

Data Center Labels:
west1 – data center 1 in the US West region
west2 – data center 2 in the US West region
east1 – data center 1 in the US East region

                   west1-west1   west1-west2   west1-east1
average            0.68 ms       1.68 ms       83.11 ms
99th percentile    0.88 ms       1.90 ms       83.68 ms

The numbers show that network latencies between the west and east coasts of the US are about 2 orders of magnitude higher than latencies within a single data center (83.11 ms versus 0.68 ms on average).
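As a quick check of the "2 orders of magnitude" figure, and as a sketch of how an average and a 99th percentile might be computed from raw samples, consider the snippet below.  The nearest-rank percentile definition is an assumption on our part; the post does not say which definition was used.  The function name summarize is hypothetical.

```python
import math


def summarize(latencies_ms):
    """Return (average, 99th percentile) using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = math.ceil(99 * n / 100)  # 1-based rank of the 99th percentile
    return sum(ordered) / n, ordered[rank - 1]


# Average latencies from the first experiment, in milliseconds.
averages_ms = {"west1-west1": 0.68, "west1-west2": 1.68, "west1-east1": 83.11}

ratio = averages_ms["west1-east1"] / averages_ms["west1-west1"]
print(f"{ratio:.0f}x, about {math.log10(ratio):.1f} orders of magnitude")
# prints "122x, about 2.1 orders of magnitude"
```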

Our second experiment measured latencies between some of the other regions over a longer period of time.  We collected RPC latency measurements between pairs of regions for about a week.

The 4 cross-region pairs tested:
East (US) – EU (Ireland)
East (US) – Tokyo
West (US) – EU (Ireland)
West (US) – Tokyo

The measurements show that latencies between distant regions can vary wildly.  Some RPCs spiked to longer than a minute, and there were periods of time when latencies were consistently almost a second long.
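The two patterns described here, isolated spikes and sustained slow periods, can be separated in a latency trace with a simple pass over the samples.  The sketch below is only illustrative: the function names, the thresholds, and the tiny trace are all made up for the example and are not our measured data.

```python
from statistics import median


def find_spikes(samples, threshold_ms):
    """Return the (timestamp_s, latency_ms) samples above the threshold."""
    return [(ts, lat) for ts, lat in samples if lat > threshold_ms]


def find_slow_windows(samples, window_s, threshold_ms):
    """Return (window_start_s, median_ms) for windows with a high median.

    Using the median per window keeps a single spike from marking an
    otherwise normal window as slow.
    """
    windows = {}
    for ts, lat in samples:
        start = int(ts // window_s) * window_s
        windows.setdefault(start, []).append(lat)
    slow = []
    for start, lats in sorted(windows.items()):
        mid = median(lats)
        if mid > threshold_ms:
            slow.append((start, mid))
    return slow


# Synthetic trace of (timestamp in seconds, latency in ms); NOT real data.
trace = [
    (0, 180.0), (10, 185.0), (20, 61000.0), (30, 182.0),  # one long spike
    (60, 950.0), (70, 980.0), (80, 990.0),                # sustained slowness
    (120, 181.0),
]

print(find_spikes(trace, threshold_ms=60_000))
# prints [(20, 61000.0)]
print(find_slow_windows(trace, window_s=60, threshold_ms=900))
# prints [(60, 980.0)]
```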

These experiments show that message delays for cross data center network traffic vary widely, and that spikes of high latency should be expected.  Globally reliable systems will need to anticipate longer network message delays and cope with them.  Common techniques either risk data loss or perform poorly under the longer latencies.  New solutions will have to be developed in order to provide fault tolerant, reliable, globally distributed systems with usable performance.  Stay tuned for details on our new project addressing this issue.