Last month, the DeepSpeed Team announced ZeRO-Infinity, a step forward in training models with tens of trillions of parameters. In addition to creating optimizations for scale, our team strives to introduce features that also improve speed, cost, and usability. As the DeepSpeed optimization library evolves, we are listening to the growing DeepSpeed community to learn how users are engaging with the library and to take on new frontiers to expand the capabilities of DeepSpeed.

One important aspect of large AI models is inference: using a trained AI model to make predictions against new data. But inference, especially for large-scale models, like many aspects of deep learning, is not without its hurdles. Two of the main challenges with inference are latency and cost. Large-scale models are extremely computationally expensive and often too slow to respond in many practical scenarios. Moreover, models with tens or hundreds of billions of parameters, trained with aggregated memory from multiple GPUs, simply become too large to fit on a single GPU's device memory for inference. For example, a single NVIDIA V100 Tensor Core GPU with 32 GB of memory can only fit up to a 10-billion-parameter model for inference, and the latency is limited by single-GPU performance.
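To see why a 32 GB GPU tops out around 10 billion parameters, a rough back-of-envelope estimate helps: in half precision (FP16), each parameter occupies 2 bytes, and the weights must still leave headroom for activations and workspace. The sketch below is only illustrative arithmetic, not a measurement from the DeepSpeed team.

```python
# Back-of-envelope GPU memory estimate for inference (illustrative only).
# Assumes FP16 weights (2 bytes per parameter) and ignores activations,
# attention caches, and framework overhead, which all need extra headroom.

GIB = 1024 ** 3

def fp16_weight_memory_gib(num_params: float) -> float:
    """Approximate memory needed just to hold FP16 model weights, in GiB."""
    return num_params * 2 / GIB

for name, params in [("10B model", 10e9), ("Turing-NLG 17B", 17e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: ~{fp16_weight_memory_gib(params):.0f} GiB of FP16 weights")

# 10B  -> ~19 GiB: fits on a 32 GB V100 with room left for activations
# 17B  -> ~32 GiB: weights alone already exhaust a single 32 GB GPU
# 175B -> ~326 GiB: needs many GPUs even before activations are counted
```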
To accommodate even bigger models, and to achieve faster and cheaper inference, we have added DeepSpeed Inference, with high-performance multi-GPU inferencing capabilities.

**DeepSpeed Inference at a glance:** As requested by many users, DeepSpeed rolls out high-performance inference support for large Transformer-based models with billions of parameters, like those at the scale of Turing-NLG 17B and OpenAI GPT-3 175B. Our new technologies for optimizing inference cost and latency include:

- Inference-adapted parallelism allows users to efficiently serve large models by adapting to the best parallelism strategies for multi-GPU inference, accounting for both inference latency and cost.
- Inference-optimized CUDA kernels boost per-GPU efficiency by fully utilizing the GPU resources through deep fusion and novel kernel scheduling.
- Effective quantize-aware training allows users to easily quantize models that can efficiently execute with low precision, such as 8-bit integer (INT8) instead of 32-bit floating point (FP32), leading to both memory savings and latency reduction without hurting accuracy (see the sketch below).

Together, DeepSpeed Inference shows 1.9-4.4x latency speedups and 3.4-6.2x throughput gain and cost reduction when compared with existing work.
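As a hedged illustration of the memory side of the quantize-aware training point above (the speedup figures come from DeepSpeed; the numbers below are only arithmetic), INT8 storage is one quarter the size of FP32:

```python
# Illustrative arithmetic only: weight storage footprint at different precisions.
# Real INT8 deployments also keep per-tensor or per-channel scale factors,
# a small overhead not modeled here.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
GIB = 1024 ** 3

def weight_footprint_gib(num_params: float, dtype: str) -> float:
    return num_params * BYTES_PER_PARAM[dtype] / GIB

params = 17e9  # e.g., a Turing-NLG 17B-scale model
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: ~{weight_footprint_gib(params, dtype):.0f} GiB")
# fp32: ~63 GiB, fp16: ~32 GiB, int8: ~16 GiB -- a 4x reduction from FP32 to INT8
```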
**Affordable, fast, and accurate training:** Beyond inference, another key ask from DeepSpeed users is to reduce the training time of large-scale models without adding additional hardware. In this release, we introduce new compressed-training strategies to support fast and low-cost training while simultaneously delivering high accuracy. We also provide a new profiling tool to identify training performance bottlenecks.

- Compressed training exploits coarse-grained sparsity in Transformer layers via Progressive Layer Dropping during training to reduce training cost, yielding 2.8x faster convergence without hurting accuracy.
- 1-bit LAMB enables communication-efficient large-scale training with a 4.6x communication volume reduction, which accelerates training of large-scale models even in clusters with low-bandwidth interconnects.
- The DeepSpeed Profiler performance tool shows model complexity and training efficiency to help users identify performance bottlenecks.

**Multi-GPU inference with DeepSpeed for large-scale Transformer models**

While DeepSpeed supports training advanced large-scale models, using these trained models in the desired application scenarios is still challenging due to three major limitations in existing inference solutions: 1) lack of support for multi-GPU inference to fit large models and meet latency requirements, 2) limited GPU kernel performance when running inference with small batch sizes, and 3) difficulties in exploiting quantization, which includes both quantizing the model to reduce model size and latency as well as supporting high-performance inference of quantized models without specialized hardware. To handle these challenges, we introduce DeepSpeed Inference, which seamlessly adds high-performance inference support to large models trained in DeepSpeed with three key features: inference-adapted parallelism for multi-GPU inference, inference-optimized kernels tuned for small batch sizes, and flexible support for quantize-aware training and inference kernels for quantized models.

Large models can require more memory than what is available on a single GPU. Therefore, multi-GPU parallelism is a necessary first step to enable inference for these large models, as sketched below.
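The following is a minimal sketch of what multi-GPU, kernel-injected inference can look like with the DeepSpeed API, assuming a Hugging Face GPT-2 checkpoint stands in for a much larger model; argument names follow the DeepSpeed Inference API as documented around this release and may differ in newer versions, so treat it as illustrative rather than as the reference usage.

```python
# Illustrative sketch (not the official tutorial): wrap an existing Hugging Face
# model with DeepSpeed Inference so its layers are partitioned across GPUs and
# replaced with inference-optimized kernels.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"  # stand-in; the same pattern targets much larger checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Inference-adapted parallelism: mp_size sets how many GPUs the model is split
# across; kernel injection swaps in DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=2,                       # tensor-parallel degree for multi-GPU inference
    dtype=torch.half,                # FP16 inference
    replace_with_kernel_inject=True  # use inference-optimized kernels
)

inputs = tokenizer("DeepSpeed Inference makes large models", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Launch with the DeepSpeed launcher so each GPU gets a rank, e.g.:
#   deepspeed --num_gpus 2 run_inference.py
```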