Quantize and deploy Llama 2 70B on cost-effective NVIDIA A10 Tensor Core GPUs in OCI Data Science
NVIDIA A10 GPUs have been around for a couple of years. They are much cheaper than the newer A100 and H100, yet they remain very capable of running AI workloads, and their price point makes them cost-effective. By quantizing the model weights down to 4-bit precision, even the powerful Llama 2 70B model can be deployed on two A10 GPUs. In this blog post, we show how to quantize the foundation model and then how to deploy it.
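To make the idea concrete before we walk through the full workflow, here is a minimal sketch of loading Llama 2 70B in 4-bit precision with Hugging Face Transformers and bitsandbytes, sharded across the available GPUs. This is only an illustration of the general technique: the model ID `meta-llama/Llama-2-70b-hf` (a gated model that requires access approval) and the bitsandbytes NF4 configuration are assumptions for this sketch, and the quantization toolchain used later in this post may differ.

```python
# Sketch: 4-bit loading of Llama 2 70B across two A10 GPUs (24 GB each).
# Assumes the transformers, accelerate, and bitsandbytes packages are installed
# and that you have been granted access to the gated Llama 2 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"  # assumed Hugging Face model ID

# NF4 4-bit quantization shrinks the ~140 GB of fp16 weights to roughly 35 GB,
# which fits across two 24 GB A10s when sharded with device_map="auto".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across the available GPUs
)

prompt = "Explain 4-bit quantization of large language models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same idea carries over to the rest of the post: once the weights are quantized to 4 bits, the memory footprint is small enough that a pair of A10s can serve the model instead of a much more expensive A100 or H100.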