Cost of locally using HPC and cloud computing
This post is to share my thoughts on High Performance Computing to use the CFD models. These days most of the jobs involve using many nodes on a local cluster. Between 2010-2014, I spent plenty of hours moving from one cluster to another during my PhD as we exhausted our CPU hours. Once we consumed the hours on one cluster, I had to move to another. At one point of time, I was using 512 processors for solving over 140 million mesh points. These numbers even seem daunting now because to get access to 512 processors is frustrating when one has to sit in the queues and then hope for the simulation to show good results. Having said that, atleast I had DOD’s resources available at the time and the support staff was really helpful when one moved from one cluster to another. The support staff forms an integral part of HPC clusters because one has to get used to the local Linux environments and Fortran/MPI/GPU compilers.
At my present job I do not run computationally heavy jobs all the time because I am doing model development and try to get the model validated with smaller test cases. However, I do have some jobs that require me to go to the HPC units, wait for the job to get queued and then wait for the model output to arrive in your local directory. I have also seen that research groups are focused on getting their runs/jobs completed but scientists are not normally aware of the cost of HPC computing. Usually the business model is to operate on local clusters and pay an upfront cost of buying nodes or computational time.
As the simulations get more and more computationally heavy, one has to think of a way to efficiently utilize the time of setting up scientists on new machines. In general, one needs to have a dedicated training program to make sure that scientists are aware of the involved costs and not waste the resources. I have seen people running jobs directly into standard queues and then after waiting for few days realizing that they had a small bug in their setup. Then the cycle repeats itself. I am guilty of that too. I think we all are focused on getting the right answer. However, it also leads to inefficiencies.
In near future, there are several things that one has to consider costwise while working on local clusters.
- Life of a HPC system-Local clusters come with an expiry date and the machines that have state of the art equipment today would be outdated in 3-5 years.
- Operational costs include electricity and cooling costs.
- Maintainence costs involved with personal that manage HPC - From my graduate school until now, I have seen that most inefficiencies with local clusters lie with managing HPC personal who maintain the clusters. It is hard for institutions to retain good HPC specialists and also the fact that institutions may depend on a handful individuals (sometimes even 1 or 2) for deciding the work output of several scientists.
- Cost of data storage-When we use local clusters, we have to make sure that enough redundancies are available in case one or two backup storage disks crash.
- Cost of bringing the data back for post processing- This can be another inefficiency.
While local clusters have been the main tool for most research groups, cloud computing may overcome some of the above listed issues with local clusters.
- One has to not worry about the life of HPC systems as the technology companies would always find it easier to upgrade their systems.
- With tools like Docker for deploying the models easily on cloud, installation challenges could be removed for users.
- Personal management would be most likely less of an issue because technology companies would always have a better chance of hiring competitive HPC specialists than local institutions.
- Backup of data would be on multiple locations.
- Tools used to postprocess directly on HPC clusters using Jupyter notebooks can solve the issue of bringing the data back locally. By doing that, one can also use the parallel processing capability of cloud to do post processing.
Rescale is one of the companies that is using cloud technology to run CFD models for the last few years. All of these arguments sound like that local clusters should not be supported but I think having some hard numbers attached to the involved cost would be the best way for each research team to move forward. There is definitely a factor of comfort when you have things operated locally and there can be security considerations in favor for data to be placed locally.