100 Trillion Parameter AI Training Models

Recommender systems are an important component of today's Internet services: billion-dollar businesses such as Amazon and Netflix are directly driven by recommendation services.

AI recommenders get better as they get bigger. Models with billions of parameters have already been released, and very recently models with as many as a trillion. Every jump in model capacity has brought a significant improvement in quality. The era of 100 trillion parameters is just around the corner.

The complicated, dense part of the neural network is increasingly computation-intensive, at more than 100 TFLOPs per training iteration. It is therefore important to have a sophisticated mechanism for managing a cluster with heterogeneous resources for such training tasks.

Recently, Kwai Seattle AI Lab and DS3 Lab from ETH Zurich collaborated to propose a novel system named “Persia” that tackles this problem through careful co-design of the training algorithm and the training system. At the algorithm level, Persia adopts a hybrid training algorithm that handles the embedding layer and the dense neural network modules differently. The embedding layer is trained asynchronously to improve the throughput of training samples, while the rest of the neural network is trained synchronously to preserve statistical efficiency. At the system level, a wide range of optimizations for memory management and communication reduction have been implemented to unleash the full potential of the hybrid algorithm.
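The hybrid idea can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions and not Persia's actual API or implementation: the function name `train_hybrid`, the scalar embeddings, the single dense weight `w`, and the use of a fixed `staleness` delay as a stand-in for asynchronous updates are all hypothetical choices made for illustration.

```python
from collections import deque
import random


def train_hybrid(samples, num_workers=4, staleness=2, lr=0.05, steps=200, seed=0):
    """Toy hybrid training loop (illustrative only, not Persia's code).

    Model: prediction = w * embedding[item_id], loss = 0.5 * (prediction - target)^2.
    The dense weight w is updated synchronously from the averaged worker gradients.
    Embedding gradients are applied `staleness` steps late to mimic asynchrony.
    """
    rng = random.Random(seed)
    embeddings = {}    # sparse embedding table: item_id -> scalar embedding
    w = 0.5            # the single "dense" parameter
    pending = deque()  # embedding gradients produced but not yet applied
    losses = []

    for _ in range(steps):
        # Each simulated worker processes one sample in this step.
        batch = [rng.choice(samples) for _ in range(num_workers)]
        dense_grads, emb_grads, step_loss = [], [], 0.0

        for item_id, target in batch:
            e = embeddings.setdefault(item_id, 0.1)
            err = w * e - target
            step_loss += 0.5 * err * err
            dense_grads.append(err * e)           # dLoss/dw
            emb_grads.append((item_id, err * w))  # dLoss/d(embedding)

        # Synchronous path: average dense gradients over all workers
        # (a stand-in for an all-reduce) and apply them immediately.
        w -= lr * sum(dense_grads) / num_workers

        # Asynchronous path: queue embedding gradients and apply the ones
        # that are `staleness` steps old, so workers never wait on them.
        pending.append(emb_grads)
        if len(pending) > staleness:
            for item_id, grad in pending.popleft():
                embeddings[item_id] -= lr * grad

        losses.append(step_loss / num_workers)

    return w, embeddings, losses


if __name__ == "__main__":
    toy_samples = [(0, 1.0), (1, 2.0), (2, -1.0)]  # (item_id, target) pairs
    w, table, losses = train_hybrid(toy_samples)
    print("first-step loss:", losses[0])
    print("last-step loss:", losses[-1])
```

In this sketch the dense update waits for every worker, which keeps all workers on the same dense weights, while the embedding table tolerates slightly stale gradients so that sample throughput is not limited by the sparse updates.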

Cloud Resources for 100 Trillion Parameter AI Models

The Persia 100-trillion-parameter AI workload runs on the following heterogeneous resources:

3,000 cores of compute-intensive Virtual Machines
8 A2 Virtual Machines with a total of 64 NVIDIA A100 GPUs
30 High Memory Virtual Machines, each with 12 TB of RAM, totalling 360 TB
Orchestration with Kubernetes

All resources had to be launched concurrently in the same zone to minimize network latency. Google Cloud was able to provide the required capacity with very little notice.
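The figures in the resource list can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the bytes-per-parameter values are assumptions for illustration, since the text does not say at what precision the parameters are stored, and terabytes are treated as decimal (10^12 bytes).

```python
TB = 10**12  # decimal terabyte, an assumption; the text does not say TB vs TiB


def parameter_footprint_tb(num_params, bytes_per_param):
    """Raw storage needed for num_params parameters, in terabytes."""
    return num_params * bytes_per_param / TB


# Numbers taken from the resource list.
a2_vms = 8
total_gpus = 64
high_mem_vms = 30
ram_per_high_mem_vm_tb = 12

gpus_per_a2_vm = total_gpus // a2_vms                 # GPUs attached to each A2 VM
total_ram_tb = high_mem_vms * ram_per_high_mem_vm_tb  # aggregate high-memory RAM

# Hypothetical precisions: 4 bytes (32-bit) and 2 bytes (16-bit) per parameter.
num_params = 100 * 10**12
footprint_32bit_tb = parameter_footprint_tb(num_params, 4)
footprint_16bit_tb = parameter_footprint_tb(num_params, 2)

if __name__ == "__main__":
    print("GPUs per A2 VM:", gpus_per_a2_vm)
    print("Total high-memory RAM (TB):", total_ram_tb)
    print("100T parameters at 4 bytes each (TB):", footprint_32bit_tb)
    print("100T parameters at 2 bytes each (TB):", footprint_16bit_tb)
```

The raw parameter footprint is of the same order of magnitude as the aggregate RAM of the high-memory machines, which is consistent with those machines being sized to hold a model of this scale; how Persia actually lays out the parameters is not stated here.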

AI Training needs resources in bursts.

Google Kubernetes Engine (GKE) was utilized to orchestrate the deployment of the 138 VMs and software containers. Having the workload containerized also al

By indianadmin
