Enter the New Era of High Performance Computing (HPC)
Imagine a new era of HPC for a moment. You always submit your jobs to a single node with minimal RAM and a single CPU. The cluster automatically scales and adds the appropriate types of nodes. It then deletes nodes automatically when your job does not need them anymore. You don't have to specify the memory, the CPUs, or the number of nodes you need. The cluster magically figures all that out for you. And you pay per second for whatever you use. Sounds like a dream? Not anymore. Get ready for the Kogence kScaling Smart HPC Clusters! But before that, let's take a moment to contrast this with what exists today.
Existing Choice #1: Traditional On-Prem Clusters
On-prem clusters have a fixed number of nodes. They also have fixed types of nodes (think of the number of CPUs per node, the type of CPU, the amount of RAM per node, etc.). Each queue is homogeneous, and if you are lucky you may have access to 2 or 3 queues made of 2 or 3 different types of nodes: one queue with 4 CPUs per node, another with 16 CPUs per node, and perhaps another whose nodes are each accelerated with one Nvidia GPU.
On-prem clusters are multi-user and multi-application. All users submit their jobs to the same queues. Jobs may have to wait in a queue until infrastructure becomes available to execute them.
On the positive side, on-prem clusters are very efficient and usually have >90% utilization. But as the average utilization goes up, the average queue wait time also goes up.
On-prem clusters are not scalable: neither horizontally (number of nodes) nor vertically (size or type of nodes).
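The link between utilization and wait time can be illustrated with a textbook single-server queueing model (M/M/1). This model is an assumption made here purely for illustration; real cluster schedulers with many nodes and job priorities behave in more complex ways. In this model the mean time a job waits in the queue is the mean job duration multiplied by u / (1 - u), where u is the utilization. The function name and parameters below are hypothetical.

```python
def mean_queue_wait(utilization: float, mean_job_hours: float = 1.0) -> float:
    """Mean time a job waits in the queue under an M/M/1 model (illustrative only).

    utilization    -- fraction of time the resource is busy, 0 <= u < 1
    mean_job_hours -- average job run time in hours (hypothetical default)
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    # Standard M/M/1 result: Wq = (mean service time) * u / (1 - u)
    return mean_job_hours * utilization / (1.0 - utilization)


if __name__ == "__main__":
    for u in (0.5, 0.8, 0.9, 0.95):
        print(f"utilization {u:.0%}: mean wait ~ {mean_queue_wait(u):.1f} job durations")
```

Under these assumptions, the wait grows slowly at moderate utilization and then rises steeply as utilization approaches 100%.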
Existing Choice #2: So-Called "On-Demand" Clusters from Cloud Computing Providers
Cloud HPC was one of the fastest-growing enterprise software industries in 2018. There are now many cloud HPC providers that let you choose a number of nodes of a given type for a given duration, and they will start a cluster for you.
On-demand is a myth. Even though cloud vendors like to market these as "on-demand" clusters, they are actually quite static and much less efficient than traditional on-prem clusters. Let's take a closer look.
A typical HPC job consists of many steps in a workflow. For example, your job submission script might first do a lot of pre-processing: setting up geometries, discretizing the geometry into meshes, setting up material properties, excitation conditions, etc. These steps generally do not scale across multiple nodes and are best run on a single node. Maybe you want this single node to have lots of CPUs or lots of RAM, or maybe you want it to be GPU accelerated. Then you run the numerical computation on a multi-node cluster. After all computing is finished, you may want to do a lot of post-processing. Again, post-processing generally does not scale with the availability of multiple nodes. Your workflow may be more complex, and you may be feeding the output of one type of simulation into a second simulation. This may require you to go through these basic steps multiple times.
On-prem clusters have evolved to handle this very efficiently. Your job takes more nodes as needed, when they are available, and releases them when they are no longer needed so that other users can use them. But these so-called "on-demand" cloud HPC clusters are highly inefficient. You get your own personal cluster. You have to define the number of nodes, the types of nodes, and the duration upfront, before you submit your job. While your job is going through pre-processing and post-processing, most of your cluster is sitting idle and you are paying for all the nodes that are not doing anything.
Alternatively, you can choose to sacrifice the "on-demand" nature of the cloud and mimic an on-prem multi-user, multi-application cluster on the cloud. This brings the same efficiency as an on-prem cluster, but then you lose almost all the benefits of cloud computing. Cloud clusters mimicking static multi-user on-prem clusters are always less cost-efficient than bare-metal on-prem clusters, because bare-metal on-prem clusters involve fewer middle parties.
At Kogence we call this the Cloud-HPC conundrum. Now let's look at how we solve it at Kogence.
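The cost of idle nodes can be made concrete with a small back-of-the-envelope calculation. The sketch below compares a statically sized personal cluster, which holds its peak node count for the whole job, with a cluster that holds only the nodes each step needs. The workflow steps, durations, and node counts are made-up numbers for illustration, not measurements, and the names `Step` and `node_hours` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    name: str
    hours: float  # wall-clock duration of the step (hypothetical)
    nodes: int    # nodes the step can actually use (hypothetical)


def node_hours(steps: List[Step]) -> Tuple[float, float]:
    """Return (static, per_step) node-hours for a workflow.

    static   -- peak node count held for the entire job duration
    per_step -- only the nodes each step uses, for that step's duration
    """
    peak = max(s.nodes for s in steps)
    total_hours = sum(s.hours for s in steps)
    static = peak * total_hours
    per_step = sum(s.nodes * s.hours for s in steps)
    return static, per_step


# Made-up workflow: single-node pre/post-processing around a 16-node solve.
workflow = [
    Step("pre-processing", hours=2, nodes=1),
    Step("numerical solve", hours=4, nodes=16),
    Step("post-processing", hours=2, nodes=1),
]

static, per_step = node_hours(workflow)
idle = static - per_step
print(f"static cluster:   {static:.0f} node-hours")
print(f"per-step scaling: {per_step:.0f} node-hours")
print(f"paid but idle:    {idle:.0f} node-hours ({idle / static:.0%})")
```

With these assumed numbers, the static cluster is billed for 128 node-hours while only 68 are used, so almost half of the bill pays for idle nodes.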
Kogence Smart Scaling HPC Clusters with kScaling Technology
On Kogence, you always submit your jobs to a single node with minimal RAM and a single CPU. Kogence kScaling technology is a smart orchestration engine. The Kogence cluster automatically scales and adds the appropriate types of nodes as required by your submitted job. It then deletes nodes automatically when your job does not need them anymore.
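The general idea of growing and shrinking a cluster to follow a job's demand can be sketched as a toy calculation. This is not Kogence's kScaling implementation, whose internals are not described here; it is only a minimal illustration that assumes the node demand of each workflow step is already known. The function name `scaling_plan` and its parameters are hypothetical.

```python
from typing import List, Tuple


def scaling_plan(demand_per_step: List[int], baseline: int = 1) -> List[Tuple[int, int, int]]:
    """Return (step_index, nodes_added, nodes_removed) for each workflow step.

    The cluster starts at `baseline` nodes and never drops below it.
    Before each step it is resized to match that step's demand.
    """
    plan = []
    current = baseline
    for index, demand in enumerate(demand_per_step):
        target = max(baseline, demand)
        added = max(0, target - current)    # nodes to add before this step
        removed = max(0, current - target)  # nodes to delete before this step
        plan.append((index, added, removed))
        current = target
    return plan


# Toy demand: single-node pre-processing, a 16-node solve, single-node post-processing.
for step, added, removed in scaling_plan([1, 16, 1]):
    print(f"step {step}: add {added} node(s), remove {removed} node(s)")
```

In this toy run the cluster adds 15 nodes before the solve step and removes them again before post-processing, so extra nodes exist only while a step can use them.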
Please contact us to discuss further.