Their most recent release was a software package enabling a distributed protein-folding model, in which internal and external machines could work together to compute protein-folding simulations.
While this was an excellent, cost-effective solution to a computationally intensive problem, their new architecture was not performing as well as they had hoped: whenever they created a new dataset to work on, there was a 5-6 minute delay before any of the clients started doing computation.
Their architecture was rather simple. When a new genome dataset arrived, it was replicated across regions, then chopped up into manageable blocks and uploaded to Google Cloud Storage by a Data Coordinator. Clients listening on a Pub/Sub topic were notified of available work and immediately fetched the blocks from GCS to begin processing.
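For illustration only, a minimal sketch of that coordinator pattern might look like the following. The bucket, project, topic names, and block size are all assumptions; the actual Data Coordinator logic isn't shown in the source.

import itertools

from google.cloud import pubsub_v1, storage

# Hypothetical names, purely for illustration.
BUCKET_NAME = "protein-folding-work"
PROJECT_ID = "my-project"
TOPIC_ID = "work-available"
BLOCK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB per block


def publish_dataset(dataset_path: str) -> None:
    """Split a dataset into blocks, upload each to GCS, then notify clients."""
    bucket = storage.Client().bucket(BUCKET_NAME)
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

    with open(dataset_path, "rb") as f:
        for index in itertools.count():
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            object_name = f"blocks/{index:06d}"
            # Upload the block, then tell listening clients it is ready.
            bucket.blob(object_name).upload_from_string(block)
            publisher.publish(topic_path, b"", object_name=object_name).result()

On the client side, a Pub/Sub subscriber would read the object_name attribute from each message and download that object from GCS before starting computation.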
The issue was that it took too long to upload all of the files to GCS. The interesting part was that the Data Coordinator would adjust the number and size of blocks based on the anticipated number of clients in each region: with a large number of clients, a larger number of smaller files would be passed around. This suggested there must be a relationship between file size and GCS upload performance.
Trying to reproduce the problem
This appears to be a simple enough problem to reproduce: generate a large number of small files, upload them one by one to a GCS bucket, and measure how long the uploads take at each file size.
To test this, I wrote a small Python script that generated files of varying sizes and uploaded each file 100 times to a regional GCS bucket (with a different object name each time to avoid clashes). The performance (bytes/sec) is plotted against object size below.
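The original script isn't shown, but a rough reconstruction of that kind of benchmark, assuming the google-cloud-storage Python client and a made-up bucket name and size sweep, might look like this:

import os
import time

from google.cloud import storage

# Assumed values; adjust to your own bucket and the sizes you care about.
BUCKET_NAME = "upload-benchmark-bucket"
SIZES = [2 ** n for n in range(10, 25)]  # 1 KiB up to 16 MiB
UPLOADS_PER_SIZE = 100


def benchmark() -> None:
    bucket = storage.Client().bucket(BUCKET_NAME)
    for size in SIZES:
        payload = os.urandom(size)  # stand-in for a generated file of this size
        start = time.monotonic()
        for i in range(UPLOADS_PER_SIZE):
            # A unique object name per upload avoids name clashes.
            bucket.blob(f"bench/{size}/{i}").upload_from_string(payload)
        elapsed = time.monotonic() - start
        throughput = size * UPLOADS_PER_SIZE / elapsed
        print(f"{size:>10} bytes/object: {throughput / 1e6:.2f} MB/s")


if __name__ == "__main__":
    benchmark()

Each upload here is a separate request, so per-request overhead (connection setup, TLS, HTTP round trips) dominates for small objects, which is exactly the effect the plot is meant to expose.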