publications
2024
- UNAST: Unified Framework for Neural Architecture Search for Transformers. Ilia Markov, Chenhan D. Yu, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jan Kautz, and 2 more authors. Submitted to NeurIPS’24, 2024.
We introduce UNAST, a new approach to optimizing Large Language Models (LLMs) post-training. UNAST combines Neural Architecture Search (NAS) with sparsity and quantization for LLM compression. Starting with a trained model, UNAST replaces layers (e.g., attention and MLP) with more efficient alternatives by adjusting attention heads, KV projection dimensions, and MLP expansion factors. Local distillation pretrains layer candidates to mimic the original layers. Scores and costs (latency, number of parameters, etc.) of each operator are fed into an Integer Linear Optimizer to find the optimal architecture under predefined constraints. Our experiments show that UNAST scales to large models, reducing training costs by up to 10 times compared to training smaller models from scratch. Validation on GPT-3 and LLaMa models demonstrates that UNAST improves latency and memory footprint by up to 60% with minimal accuracy loss. UNAST also provides insights into the effects of different compression types on Transformer layers, aiding in the development of non-uniform models.
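The operator-selection step described in this abstract can be pictured with a small toy example. The sketch below picks one candidate operator per layer so as to maximize a total score under a latency budget; all layer names, scores, and latencies are hypothetical, and a brute-force search stands in for the Integer Linear Optimizer used in the paper.

```python
# Toy operator selection: one candidate per layer, maximize total score under a
# latency budget. UNAST formulates this as an Integer Linear Program; brute force
# is enough for this illustrative three-layer example.
from itertools import product

# (candidate name, accuracy-proxy score, latency in ms) per layer -- made-up numbers
candidates = {
    "attn_0": [("full_heads", 1.00, 3.0), ("half_heads", 0.97, 1.8)],
    "mlp_0":  [("expand_4x", 1.00, 4.0), ("expand_2x", 0.96, 2.2)],
    "attn_1": [("full_heads", 1.00, 3.0), ("low_rank_kv", 0.98, 2.0)],
}
latency_budget_ms = 7.5

best_choice, best_score = None, float("-inf")
for combo in product(*candidates.values()):
    score = sum(c[1] for c in combo)
    latency = sum(c[2] for c in combo)
    if latency <= latency_budget_ms and score > best_score:
        best_choice, best_score = combo, score

print("selected operators:", [c[0] for c in best_choice])
print("total score:", round(best_score, 3), "under budget:", latency_budget_ms, "ms")
```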
- Sparse Expansion and Neuronal Disentanglement. Shashata Sawmya, Linghao Kong, Ilia Markov, Nir N. Shavit, and Dan Alistarh. Submitted to NeurIPS’24, 2024.
We show how to improve the inference efficiency of an LLM by expanding it into a mixture of sparse experts, where each expert is a copy of the original weights, one-shot pruned for a specific cluster of input values. We call this approach Sparse Expansion. We show that, for models such as Llama 2 70B, as we increase the number of sparse experts, Sparse Expansion outperforms all other one-shot sparsification approaches for the same inference FLOP budget per token, and that this gap grows as sparsity increases, leading to inference speedups. But why? To answer this, we provide strong evidence that the mixture of sparse experts is effectively disentangling the input-output relationship of every individual neuron across clusters of inputs. Specifically, sparse experts approximate the dense neuron’s output distribution with fewer weights by decomposing the distribution into a collection of simpler ones, each with a separate sparse dot product covering it. Interestingly, we show that the Wasserstein distance between a neuron’s output distribution and a Gaussian distribution is an indicator of its entanglement level and contribution to the accuracy of the model. Every layer of an LLM has a fraction of highly entangled Wasserstein neurons, and model performance suffers more when these are sparsified as opposed to others.
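As a rough picture of the mechanism described above, the sketch below clusters the inputs to a single linear layer, builds one sparse copy of the weights per cluster, and routes each input to its cluster's expert. The saliency-based pruning rule and the synthetic data are stand-ins chosen for brevity; the paper uses a one-shot pruner calibrated per cluster, not this exact rule.

```python
# Cluster the inputs, prune one weight copy per cluster, route by cluster assignment.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d_in, d_out, n, k, sparsity = 64, 32, 2000, 4, 0.75

W = rng.normal(size=(d_out, d_in))                                 # dense layer
X = rng.normal(size=(n, d_in)) + rng.integers(0, 4, size=(n, 1))   # clusterable inputs
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

def prune_for_cluster(w, xc, frac):
    """Keep weights with the largest |w_ij| * mean|x_j| over this cluster's inputs
    (a simple saliency proxy standing in for the paper's one-shot pruner)."""
    saliency = np.abs(w) * np.mean(np.abs(xc), axis=0)
    return np.where(saliency >= np.quantile(saliency, frac), w, 0.0)

experts = [prune_for_cluster(W, X[km.labels_ == c], sparsity) for c in range(k)]

# Route each input to its cluster's sparse expert and compare with the dense layer.
Y_sparse = np.stack([experts[c] @ x for c, x in zip(km.labels_, X)])
Y_dense = X @ W.T
print("relative error of the sparse mixture:",
      np.linalg.norm(Y_sparse - Y_dense) / np.linalg.norm(Y_dense))
```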
2023
- Quantized Distributed Training of Large Models with Convergence Guarantees. Ilia Markov, Adrian Vladu, Qi Guo, and Dan Alistarh. ICML’23, 2023.
Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model’s weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights, and thus the domain over which we train consists of quantized points and is, therefore, highly non-convex. We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
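The core idea summarized above, training over a domain of quantized points, can be illustrated on a toy problem: keep the weights on a fixed quantization grid at all times, take an SGD step, and stochastically round the result back onto the grid. The objective, grid size, and step size below are arbitrary choices for illustration, not QSDP's implementation.

```python
# SGD that only ever stores weights on a quantization grid (toy quadratic objective).
import numpy as np

rng = np.random.default_rng(0)

def stochastic_quantize(x, step=2 ** -8):
    """Round each coordinate to a multiple of `step`, up or down at random so the
    quantization is unbiased in expectation."""
    lower = np.floor(x / step)
    prob_up = x / step - lower
    return (lower + (rng.random(x.shape) < prob_up)) * step

d = 100
A = rng.normal(size=(d, d))
A = A.T @ A / d + np.eye(d)                                # well-conditioned quadratic
w = stochastic_quantize(rng.normal(size=d))                # start on the grid

for t in range(500):
    grad = A @ w + 0.01 * rng.normal(size=d)               # stochastic gradient
    w = stochastic_quantize(w - 0.05 * grad)               # step, then re-quantize

print("final loss with quantized weights:", 0.5 * w @ A @ w)
```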
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models. Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, and 2 more authors. Submitted to ACL’24, 2023.
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher-precision. The key feature of our scheme is that it is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity.
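The hybrid scheme described above can be sketched in a few lines: a handful of outlier feature columns stay in higher precision, the rest of the weights and activations are quantized to 4 bits with per-row scales, and the two partial products are summed. The numpy code below only simulates this in floating point (no real INT4 kernels), with made-up shapes and an arbitrary outlier count.

```python
# Hybrid quantization, simulated in floating point: 4-bit base + full-precision outliers.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 256, 128, 16

W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(batch, d_in))
X[:, :4] *= 20.0                                   # inject a few outlier features

def quantize_rows(M, bits=4):
    """Symmetric per-row quantization to signed `bits`-bit integer levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(M).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(M / scale), -qmax - 1, qmax)
    return q, scale

# Keep the highest-magnitude input features in full precision, quantize the rest.
outlier = np.argsort(-np.abs(X).mean(axis=0))[:8]
base = np.setdiff1d(np.arange(d_in), outlier)

qW, sW = quantize_rows(W[:, base])
qX, sX = quantize_rows(X[:, base])

Y = (qX * sX) @ (qW * sW).T + X[:, outlier] @ W[:, outlier].T
print("relative error vs. dense FP matmul:",
      np.linalg.norm(Y - X @ W.T) / np.linalg.norm(X @ W.T))
```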
2022
- L-GreCo: An Efficient and General Framework for Layerwise-Adaptive Gradient Compression. Ilia Markov*, Mohammadreza Alimohammadi*, Elias Frantar, and Dan Alistarh. MLSys’24, 2022.
Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks due to gradient transmission. To address this issue, entire families of lossy gradient compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model’s layers dynamically during training, significantly improving the overall compression without sacrificing accuracy. Our framework, called L-GreCo, is based on an efficient adaptive algorithm, which automatically picks the optimal compression parameters for model layers guaranteeing the best compression ratio while respecting a theoretically-justified error constraint. Our extensive experimental study over image classification and language modeling tasks shows that L-GreCo is effective across all three compression families, and achieves up to 2.5× training speedup and up to 5× compression improvement over efficient implementations of standard approaches while recovering full accuracy. Moreover, we show that L-GreCo is complementary to existing adaptive algorithms improving their compression ratio by 50% and practical throughput by 66%.
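To make the selection problem above concrete, the toy sketch below assigns each layer one of several compression levels, each with a traffic cost and an estimated error, and searches for a cheap per-layer assignment whose total error stays under a budget. L-GreCo solves this with an efficient exact algorithm; the greedy refinement and all numbers here are purely illustrative.

```python
# Per-layer compression-level selection under a total error budget (toy numbers).
# Options per layer: (name, relative bytes sent, estimated compression error),
# ordered from most to least aggressive compression.
layers = {
    "conv1": [("2bit", 0.25, 0.40), ("4bit", 0.50, 0.10), ("8bit", 1.00, 0.01)],
    "conv2": [("2bit", 0.50, 0.80), ("4bit", 1.00, 0.20), ("8bit", 2.00, 0.02)],
    "fc":    [("2bit", 1.00, 0.30), ("4bit", 2.00, 0.08), ("8bit", 4.00, 0.01)],
}
error_budget = 0.5

choice = {name: 0 for name in layers}          # start with maximum compression everywhere

def total(metric_idx):
    return sum(layers[n][choice[n]][metric_idx] for n in layers)

while total(2) > error_budget:
    # Upgrade the layer that buys the largest error reduction per extra byte sent.
    def gain(n):
        if choice[n] + 1 >= len(layers[n]):
            return -1.0
        cur, nxt = layers[n][choice[n]], layers[n][choice[n] + 1]
        return (cur[2] - nxt[2]) / (nxt[1] - cur[1])
    choice[max(layers, key=gain)] += 1

print({n: layers[n][choice[n]][0] for n in layers},
      "total error:", round(total(2), 3), "total bytes:", total(1))
```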
- CGX: Adaptive System Support for Communication-Efficient Deep Learning. Ilia Markov, Hamidreza Ramezanikebrya, and Dan Alistarh. In Proceedings of the 23rd ACM/IFIP International Middleware Conference, 2022. Best Paper Award Runner-Up.
The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth over-provisioning. Overprovisioning comes at a cost: there is an order-of-magnitude price difference between "cloud-grade" servers with such support and their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training, as well as larger-scale multi-node training. CGX is based on two technical advances: at the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users do not have to modify training recipes or make significant changes to training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
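The compressed-communication path described above can be mimicked in a single process: each simulated worker quantizes its per-layer gradient buffers before the allreduce, and different layers may use different bit-widths, echoing the layer-wise adaptive compression mentioned in the abstract. Everything below (layer names, shapes, bit-widths) is invented for the illustration; it is not CGX's communication stack or API.

```python
# Simulated compressed allreduce: quantize per-layer gradients, then average.
import numpy as np

rng = np.random.default_rng(0)
n_workers = 4
layer_bits = {"embedding": 8, "attention": 4, "mlp": 4}
layer_shapes = {"embedding": (1000, 64), "attention": (64, 64), "mlp": (64, 256)}

def quantize(g, bits):
    """Uniform stochastic quantization of a tensor onto 2**bits levels over its range,
    returned already dequantized (the 'received' buffer)."""
    levels = 2 ** bits - 1
    lo, hi = g.min(), g.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.floor((g - lo) / scale + rng.random(g.shape))     # stochastic rounding
    return np.clip(q, 0, levels) * scale + lo

# Each worker produces gradients; the "allreduce" averages the compressed buffers.
grads = [{name: rng.normal(size=shape) for name, shape in layer_shapes.items()}
         for _ in range(n_workers)]
reduced = {name: np.mean([quantize(g[name], layer_bits[name]) for g in grads], axis=0)
           for name in layer_shapes}
exact = {name: np.mean([g[name] for g in grads], axis=0) for name in layer_shapes}

for name in layer_shapes:
    err = np.linalg.norm(reduced[name] - exact[name]) / np.linalg.norm(exact[name])
    print(f"{name}: {layer_bits[name]}-bit allreduce, relative error {err:.3f}")
```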
2021
- Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent. Giorgi Nadiradze, Ilia Markov, Bapi Chatterjee, Vyacheslav Kungurtsev, and Dan Alistarh. Proceedings of the AAAI Conference on Artificial Intelligence, May 2021.
One key element behind the recent progress of machine learning has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Most of these models are trained employing variants of stochastic gradient descent (SGD) based optimization, but most methods involve some type of consistency relaxation relative to sequential SGD, to mitigate its large communication or synchronization costs at scale. In this paper, we introduce a general consistency condition covering communication-reduced and asynchronous distributed SGD implementations. Our framework, called elastic consistency, decouples the system-specific aspects of the implementation from the SGD convergence requirements, giving a general way to obtain convergence bounds for a wide variety of distributed SGD methods used in practice. Elastic consistency can be used to re-derive or improve several previous convergence bounds in message-passing and shared-memory settings, but also to analyze new models and distribution schemes. As a direct application, we propose and analyze a new synchronization-avoiding scheduling scheme for distributed SGD, and show that it can be used to efficiently train deep convolutional models for image classification.
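As a concrete instance of the kind of relaxation this framework is designed to cover, the toy simulation below runs SGD on a quadratic while computing each update from a parameter view that may be several steps stale, as happens in asynchronous or communication-reduced training. It only shows that bounded staleness still converges on a simple problem; it is not the paper's analysis or its proposed scheduling scheme.

```python
# SGD with bounded-staleness gradients on a toy quadratic objective.
import numpy as np

rng = np.random.default_rng(0)
d, lr, max_staleness, steps = 50, 0.05, 4, 800

A = rng.normal(size=(d, d))
A = A.T @ A / d + np.eye(d)
x = rng.normal(size=d)
history = [x.copy()]                                  # past iterates ("views")

for t in range(steps):
    # Pick a view that is up to `max_staleness` steps behind the current iterate.
    stale = history[max(0, len(history) - 1 - rng.integers(0, max_staleness + 1))]
    grad = A @ stale + 0.01 * rng.normal(size=d)      # stochastic gradient at the stale view
    x = x - lr * grad
    history.append(x.copy())

print("final loss with stale gradients:", 0.5 * x @ A @ x)
```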
- NUQSGD: Provably Communication-Efficient Data-Parallel SGD via Nonuniform Quantization. Ali Ramezani-Kebrya, Fartash Faghri, Ilia Markov, Vitalii Aksenov, Dan Alistarh, and Daniel M. Roy. J. Mach. Learn. Res., Jul 2021.
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has stronger theoretical guarantees than QSGD while matching or exceeding the empirical performance of the QSGDinf heuristic and of other compression methods.
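The nonuniform scheme proposed in this paper places quantization levels densely near zero, where most normalized gradient coordinates fall. The sketch below rounds normalized magnitudes stochastically onto exponentially spaced levels and rescales by the gradient norm; it illustrates the shape of the scheme, not the paper's encoder or its variance analysis.

```python
# Nonuniform gradient quantization onto levels {0, 2^-(s-1), ..., 1/2, 1}.
import numpy as np

rng = np.random.default_rng(0)

def nuq(g, s=4):
    levels = np.concatenate(([0.0], 2.0 ** np.arange(-(s - 1), 1)))  # nonuniform grid
    norm = np.linalg.norm(g)
    if norm == 0:
        return g
    r = np.abs(g) / norm                              # normalized magnitudes in [0, 1]
    idx = np.searchsorted(levels, r, side="right") - 1
    lo, hi = levels[idx], levels[np.minimum(idx + 1, len(levels) - 1)]
    gap = np.where(hi > lo, hi - lo, 1.0)
    prob_up = np.where(hi > lo, (r - lo) / gap, 0.0)  # stochastic rounding probabilities
    rounded = np.where(rng.random(g.shape) < prob_up, hi, lo)
    return np.sign(g) * norm * rounded                # unbiased reconstruction

g = rng.normal(size=10000)
q = nuq(g)
print("distinct magnitudes used:",
      np.unique(np.round(np.abs(q) / np.linalg.norm(g), 6)).size)
print("relative error:", np.linalg.norm(q - g) / np.linalg.norm(g))
```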
- Keep the Dirt: Tainted TreeKEM, Adaptively and Actively Secure Continuous Group Key Agreement. Karen Klein, Guillermo Pascual-Perez, Michael Walter, Chethan Kamath, Margarita Capretto, Miguel Cueto, and 4 more authors. In 2021 IEEE Symposium on Security and Privacy (SP), Jul 2021. Work done during a rotation in the Pietrzak group.
While messaging systems with strong security guarantees are widely used in practice, designing a protocol that scales efficiently to large groups and enjoys similar security guarantees remains largely open. The two existing proposals to date are ART (Cohn-Gordon et al., CCS ’18) and TreeKEM (IETF, The Messaging Layer Security Protocol, draft). TreeKEM is the candidate currently considered by the IETF MLS working group, but dynamic group operations (i.e., adding and removing users) can cause efficiency issues. In this paper we formalize and analyze a variant of TreeKEM which we term Tainted TreeKEM (TTKEM for short). The basic idea underlying TTKEM was suggested by Millican (MLS mailing list, February 2018). This version is more efficient than TreeKEM for some natural distributions of group operations; we quantify this through simulations. Our second contribution is two security proofs for TTKEM which establish post-compromise and forward secrecy even against adaptive attackers. The security loss (to the underlying PKE) is a polynomial factor in the Random Oracle Model, and a quasipolynomial one in the Standard Model. Our proofs can be adapted to TreeKEM as well. Before our work, no security proof for any TreeKEM-like protocol establishing tight security against an adversary who can adaptively choose the sequence of operations was known. We are also the first to prove (or even formalize) active security, where the server can arbitrarily deviate from the protocol specification. Proving fully active security, where the users can also arbitrarily deviate, remains open.
2020
- Adaptive Gradient Quantization for Data-Parallel SGD. Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel M. Roy, and Ali Ramezani-Kebrya. In Advances in Neural Information Processing Systems, Jul 2020.
Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.
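A minimal way to picture the adaptive idea described above: periodically re-fit a simple parametric model of the gradient magnitudes from cheap sufficient statistics and place quantization levels at quantiles of the fitted distribution, so the levels track the drifting gradient statistics. The lognormal fit and nearest-level rounding below are stand-ins for exposition, not the ALQ/AMQ update rules from the paper.

```python
# Re-fit quantization levels to the current gradient distribution from sufficient statistics.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def fit_levels(g, n_levels=8):
    """Place level magnitudes at quantiles of a lognormal fitted to the nonzero |g|."""
    logs = np.log(np.abs(g[g != 0]))
    mu, sigma = logs.mean(), logs.std()               # sufficient statistics of the fit
    qs = (np.arange(n_levels) + 0.5) / n_levels
    return np.exp(norm.ppf(qs, loc=mu, scale=sigma))

def quantize(g, levels):
    """Round each coordinate to the nearest level magnitude, keeping its sign."""
    idx = np.argmin(np.abs(np.abs(g)[:, None] - levels[None, :]), axis=1)
    return np.sign(g) * levels[idx]

for step, scale in [(0, 1.0), (1000, 0.1)]:           # gradients shrink as training goes on
    g = scale * rng.normal(size=50000)
    levels = fit_levels(g)                            # adapt levels to the current statistics
    err = np.linalg.norm(quantize(g, levels) - g) / np.linalg.norm(g)
    print(f"step {step}: levels span [{levels.min():.4f}, {levels.max():.4f}], rel err {err:.3f}")
```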