Throughout the past year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new technology and models were released was astounding! As a result, we have many different standards and ways of working with LLMs.
In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.
Throughout the examples, we will use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).
🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and clearing your cache like so:
# Delete any models previously created
del model, tokenizer, pipe

# Empty VRAM cache
import torch
torch.cuda.empty_cache()
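If memory still lingers after deleting the variables, you can also trigger Python's garbage collector before emptying the CUDA cache. This is a minimal sketch and an addition to the snippet above, not part of the original code:

import gc
import torch

# Release unreferenced Python objects first, then clear the CUDA cache
gc.collect()
torch.cuda.empty_cache()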
You can also follow along with the Google Colab Notebook to make sure everything works as intended.
The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!
We will start by installing HuggingFace Transformers, among other packages, from its main branch to support newer models:
# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers
After installation, we can use the following pipeline to easily load our LLM:
from torch import bfloat16
from transformers import pipeline

# Load in your LLM without any compression tricks
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=bfloat16,
    device_map="auto"
)
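Once the pipeline is loaded, generating text is simply a matter of passing it a prompt. The snippet below is a minimal sketch; the prompt string and generation parameters (max_new_tokens, temperature) are illustrative assumptions, not taken from the original article:

# Illustrative prompt; any instruction-style string works here
prompt = "Tell me something interesting about Large Language Models."

# Generation parameters below are example values
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])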