Sunday, August 9, 2009

Skinny Straw in the Cloud Shake

A recent article by Bernard Golden argues that network constraints (bandwidth and latency), together with the associated bandwidth usage cost, remain one of the main obstacles in cloud computing.

There are two concerns here. One is failing to meet the application's performance goals (throughput and response time). The other is the cost of running in the cloud (think of receiving a large phone bill, except from your cloud provider).

The goal is to reduce the total amount of data transfer. A number of cloud app design patterns can be used ...

How do you bring the code and the data together before processing can start?

Try to be as stateless as possible
There is no data to transfer if your component is stateless by nature. The following techniques assume that some unavoidable stateful components are involved.

Move your data creation process into the cloud first
Instead of uploading a huge volume of data from your data center into the cloud before processing can start, can you move the data creation process into the cloud? Of course, you need to carefully evaluate the security implications here.

Distribute the architecture of your data creation
If the subsequent processing is based on a parallel execution architecture, why not distribute the data creation process as well? Creating the data already partitioned the way the processing expects saves a data repartition step.

Move the code to the data
Code usually has a much smaller footprint than the data it processes. Therefore it is more economical to move the processing logic to the data than to download the data to where the logic runs. Of course, we need to make sure the machine hosting the data has enough CPU power to execute the processing logic.

Do as much as possible along current partition
A typical parallel processing architecture partitions the data along some dimensions, conducts the processing in parallel, then repartitions the data along other dimensions, conducts the next stage of processing, and so on ...

See if you can rearrange the order of processing such that you can do as much as possible within the current partition. The goal is to minimize the number of repartitions where a lot of data transfer is needed.
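One common instance of this idea is to aggregate locally before repartitioning, so only small partial results cross the network. Below is a minimal sketch, assuming a simple word-count style job; the function names local_counts and merge_counts are made up for illustration.

```python
from collections import Counter

def local_counts(partition):
    """Aggregate inside the current partition; no network transfer needed."""
    return Counter(partition)

def merge_counts(partials):
    """Combine the per-partition partial results after the repartition step."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical data: each inner list lives on a different machine.
partitions = [["a", "b", "a"], ["b", "b", "c"]]

# Only the partial counts are shipped, not the raw records.
partials = [local_counts(p) for p in partitions]
totals = merge_counts(partials)

raw_records = sum(len(p) for p in partitions)
shipped_records = sum(len(p) for p in partials)
print(totals, raw_records, shipped_records)
```

The saving grows with the amount of repetition inside each partition. If every key is unique, local aggregation buys nothing.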

Minimize data redistribution at grow/shrink
How do you redistribute data to a newly joined VM such that the overall data transfer is minimized? For example, a "consistent hashing" algorithm can be used so that data redistribution only happens between the newly joined VM and its neighbors, rather than among all existing VMs.
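To make the consistent hashing idea concrete, here is a minimal sketch of a hash ring. The class name HashRing, the use of MD5, and the 100 virtual points per VM are my own assumptions for illustration, not a reference implementation. With virtual points, each VM has many "neighbors" on the ring, but the key property still holds: when a VM joins, keys only move to that new VM.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas      # virtual points per VM (assumed value)
        self._ring = []               # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place the node's virtual points on the ring."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def owner(self, key):
        """Return the node owning the first ring point at or after the key."""
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

# Hypothetical VM names and keys.
keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["vm-1", "vm-2", "vm-3"])
before = {k: ring.owner(k) for k in keys}

ring.add("vm-4")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to vm-4")
```

Compare this with a plain "hash(key) mod N" scheme, where changing N reshuffles most of the keys across every VM.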

Conduct data redistribution in the background
Data redistribution may affect performance, but it should never affect accuracy. In other words, a newly joined VM should be able to serve requests immediately while data redistribution proceeds in the background. The data redistribution algorithm (which may take a long time to finish) also needs to adapt to VMs that keep joining. In a highly dynamic workload environment, data redistribution can simply be an ongoing performance improvement process.

Place component with bandwidth cost in mind
Besides the amount of data being transferred (which should be minimized anyway), it is equally important to look at the bandwidth cost. Cloud providers typically charge a substantial amount for bandwidth used across the cloud boundary. Therefore, it is important to place the components such that if data transfer does need to occur, it occurs within the cloud rather than across the cloud boundary. This requires a careful analysis of the communication patterns among application components, and grouping frequently communicating components so that they are deployed within the same cloud.
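As a rough illustration of this analysis, the sketch below compares two candidate placements by the traffic that crosses the cloud boundary. The component names, traffic numbers, and the function cross_boundary_bytes are all hypothetical.

```python
def cross_boundary_bytes(traffic, placement):
    """Sum the traffic between components that sit in different locations.

    traffic:   {(component_a, component_b): bytes exchanged per day}
    placement: {component: location name}
    """
    return sum(
        volume
        for (a, b), volume in traffic.items()
        if placement[a] != placement[b]
    )

# Hypothetical daily traffic between application components.
traffic = {
    ("web", "app"): 500,
    ("app", "db"): 2000,
    ("app", "report"): 100,
}

# Candidate A keeps the database on premises.
placement_a = {"web": "cloud", "app": "cloud", "db": "on_prem", "report": "on_prem"}
# Candidate B moves the database next to the app that talks to it most.
placement_b = {"web": "cloud", "app": "cloud", "db": "cloud", "report": "on_prem"}

cost_a = cross_boundary_bytes(traffic, placement_a)
cost_b = cross_boundary_bytes(traffic, placement_b)
print(cost_a, cost_b)
```

With only a handful of components you can simply enumerate the placements. For larger systems this turns into a graph partitioning problem.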

Migrate data as communication pattern changes
Communication patterns may change after the system is deployed. It is important to continuously monitor the actual communication patterns and determine whether a migration is needed to minimize the bandwidth cost, weighing the gain against the cost of migration. The gain is estimated by multiplying the communication frequency by the time that the new communication pattern is expected to persist. The cost is estimated as the total amount of data redistribution traffic caused by the component migration. The migration should take place only when its cost is smaller than the gain.
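The gain versus cost rule can be written down as a small check. This is a sketch under one extra assumption of mine: to compare the gain with the migration traffic, both are expressed in bytes, so the communication frequency is multiplied by an average message size. The function and parameter names are hypothetical.

```python
def should_migrate(messages_per_hour, bytes_per_message,
                   expected_hours, migration_bytes):
    """Return True when the migration traffic is smaller than the traffic saved.

    gain = communication frequency * message size * how long the pattern lasts
    cost = data that must be moved to perform the migration
    """
    gain = messages_per_hour * bytes_per_message * expected_hours
    return migration_bytes < gain

# A chatty pattern expected to last a day justifies moving 50 MB of state.
print(should_migrate(1000, 10_000, 24, 50_000_000))

# A short-lived, quiet pattern does not.
print(should_migrate(10, 1_000, 1, 50_000_000))
```

The hard part in practice is the estimate of how long the new pattern will persist, so it is worth being conservative with that number.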

Exploit Caching
Use a local cache to reduce the need for remote data access, especially if the data is relatively static.
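Here is a minimal sketch of such a local cache, assuming that slightly stale data is acceptable for a fixed time-to-live. The class TTLCache and the fetch callback are hypothetical names; fetch stands for whatever remote call you are trying to avoid.

```python
import time

class TTLCache:
    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch        # the expensive remote lookup
        self._ttl = ttl_seconds    # how long an entry is considered fresh
        self._clock = clock        # injectable so the cache is easy to test
        self._entries = {}         # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # fresh hit: no network traffic
        value = self._fetch(key)   # miss or expired: go to the remote store
        self._entries[key] = (now, value)
        return value

# Usage with a stand-in for the remote call.
remote_calls = []

def fetch_from_remote(key):
    remote_calls.append(key)
    return key.upper()

cache = TTLCache(fetch_from_remote, ttl_seconds=60)
cache.get("config")
cache.get("config")
print(len(remote_calls))  # the second get is served locally
```

The more static the data, the longer the time-to-live can be, and the less traffic crosses the network.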

Allow direct access to data
This goes against the philosophy of SOA, where internal state should be encapsulated behind an API. In the SOA model, when a client wants to extract data, it first needs to make a request to the owning application, which then makes a request to the DB, gets the data, encodes it into the web service response, and passes the result back to the client. If network bandwidth is costly, it is much more efficient for the client to have direct access to the DB.

Expose latency information to the application
Provide a latency map so that applications can dynamically choose which partners they communicate with.
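A sketch of how an application might use such a latency map follows. I am assuming the map is exposed as a dictionary of measured latencies in milliseconds keyed by (source, destination). The node names and the function pick_partner are hypothetical.

```python
def pick_partner(latency_ms, me, candidates):
    """Return the candidate with the lowest known latency from `me`.

    Candidates missing from the latency map are skipped.
    Returns None when no candidate has a known latency.
    """
    reachable = [c for c in candidates if (me, c) in latency_ms]
    if not reachable:
        return None
    return min(reachable, key=lambda c: latency_ms[(me, c)])

# Hypothetical latency map published by the infrastructure.
latency_ms = {
    ("app-1", "db-east"): 2.5,
    ("app-1", "db-west"): 70.0,
    ("app-1", "db-eu"): 140.0,
}

best = pick_partner(latency_ms, "app-1", ["db-east", "db-west", "db-eu"])
print(best)
```

Because the application re-reads the map on each decision, it can switch partners as the measured latencies change.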
