Loading Transformers Models
Introduction
As mentioned in a previous post, the transformers library is a powerful tool for working with a host of AI models, either locally or by “deploying” your own Space.
To make creating HuggingFace Spaces easier, I have created a GitHub Template with a GitHub Actions workflow set up to CI/CD to the Space’s repository and to create the requirements.txt
file automatically using uv1.
After a few attempts to debug a phantom OOM bug2 while building this Video Caption Demo using VLMs, here are a few lessons learnt on how to load models with the transformers library in the most memory-efficient manner.
Loading Transformers Models
Here’s a code snippet for an example HuggingFace Space Gradio app that illustrates best practice for loading transformers models efficiently. The details are explained in the numbered code annotations after the listing.
app.py
```python
import spaces, torch, time
import gradio as gr
from transformers import (
    AutoModelForImageTextToText,
    Gemma3nForConditionalGeneration,
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)

# Flash Attention for ZeroGPU
import subprocess

subprocess.run(
    "pip install flash-attn --no-build-isolation",
    env={"FLASH_ATTENTION_SKIP_CUDA_BUILD": "TRUE"},
    shell=True,
)

# Set target DEVICE and DTYPE
DTYPE = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
)
DEVICE = "auto"
print(f"Device: {DEVICE}, dtype: {DTYPE}")


def load_model(
    model_name: str = "chancharikm/qwen2.5-vl-7b-cam-motion-preview",
    use_flash_attention: bool = False,
    apply_quantization: bool = True,
):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,  # Load model weights in 4-bit
        bnb_4bit_quant_type="nf4",  # Use NF4 quantization (or "fp4")
        bnb_4bit_compute_dtype=DTYPE,  # Perform computations in bfloat16/float16
        bnb_4bit_use_double_quant=True,  # Optional: further quantization for slightly more memory saving
    )

    # Determine model family from model name
    model_family = model_name.split("/")[-1].split("-")[0]

    # Common model loading arguments
    common_args = {
        "torch_dtype": DTYPE,
        "device_map": DEVICE,
        "low_cpu_mem_usage": True,
        "quantization_config": bnb_config if apply_quantization else None,
    }
    if use_flash_attention:
        common_args["attn_implementation"] = "flash_attention_2"

    # Load model based on family
    match model_family:
        case "qwen2.5" | "Qwen2.5":
            model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                model_name, **common_args
            )
        case "gemma":
            model = Gemma3nForConditionalGeneration.from_pretrained(
                model_name, **common_args
            )
        case "InternVL3":
            model = AutoModelForImageTextToText.from_pretrained(
                model_name, **common_args
            )
        case _:
            raise ValueError(f"Unsupported model family: {model_family}")

    # Set model to evaluation mode for inference (disables dropout, etc.)
    return model.eval()


def load_processor(model_name="Qwen/Qwen2.5-VL-7B-Instruct"):
    return AutoProcessor.from_pretrained(
        model_name,
        device_map=DEVICE,
        use_fast=True,
        torch_dtype=DTYPE,
    )


print("Loading Models and Processors...")
MODEL_ZOO = {
    "qwen2.5-vl-7b-instruct": load_model(
        model_name="Qwen/Qwen2.5-VL-7B-Instruct",
        use_flash_attention=False,
        apply_quantization=False,
    ),
    "InternVL3-1B-hf": load_model(
        model_name="OpenGVLab/InternVL3-1B-hf",
        use_flash_attention=False,
        apply_quantization=False,
    ),
    "InternVL3-2B-hf": load_model(
        model_name="OpenGVLab/InternVL3-2B-hf",
        use_flash_attention=False,
        apply_quantization=False,
    ),
    "InternVL3-8B-hf": load_model(
        model_name="OpenGVLab/InternVL3-8B-hf",
        use_flash_attention=False,
        apply_quantization=True,
    ),
}
PROCESSORS = {
    "qwen2.5-vl-7b-instruct": load_processor("Qwen/Qwen2.5-VL-7B-Instruct"),
    "InternVL3-1B-hf": load_processor("OpenGVLab/InternVL3-1B-hf"),
    "InternVL3-2B-hf": load_processor("OpenGVLab/InternVL3-2B-hf"),
    "InternVL3-8B-hf": load_processor("OpenGVLab/InternVL3-8B-hf"),
}
print("Models and Processors Loaded!")


# Our Inference Function
@spaces.GPU(duration=120)
def video_inference(
    video_path: str,
    prompt: str,
    model_name: str,
    fps: int = 8,
    max_tokens: int = 512,
    temperature: float = 0.1,
):
    s_time = time.time()
    model = MODEL_ZOO[model_name]
    processor = PROCESSORS[model_name]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]
    with torch.no_grad():
        model_family = model_name.split("-")[0]
        match model_family:
            case "InternVL3":
                inputs = processor.apply_chat_template(
                    messages,
                    add_generation_prompt=True,
                    tokenize=True,
                    return_dict=True,
                    return_tensors="pt",
                    fps=fps,  # num_frames = 8
                ).to("cuda", dtype=DTYPE)
                output = model.generate(
                    **inputs,
                    max_new_tokens=max_tokens,
                    temperature=float(temperature),
                    do_sample=temperature > 0.0,
                )
                output_text = processor.decode(
                    output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True
                )
            case _:
                raise ValueError(f"{model_name} is not currently supported")
    return {
        "output_text": output_text,
        "fps": fps,
        "inference_time": time.time() - s_time,
    }


# the Gradio App
app = gr.Interface(
    fn=video_inference,
    inputs=[
        gr.Video(label="Input Video"),
        gr.Textbox(
            label="Prompt",
            lines=3,
            info="Some models like [cam motion](https://huggingface.co/chancharikm/qwen2.5-vl-7b-cam-motion-preview) are trained on specific prompts",
            value="Describe the camera motion in this video.",
        ),
        gr.Dropdown(label="Model", choices=list(MODEL_ZOO.keys())),
        gr.Number(
            label="FPS",
            info="inference sampling rate (Qwen2.5VL is trained on videos with 8 fps); a value of 0 means the FPS of the input video will be used",
            value=8,
            minimum=0,
            step=1,
        ),
        gr.Slider(
            label="Max Tokens",
            info="maximum number of tokens to generate",
            value=128,
            minimum=32,
            maximum=512,
            step=32,
        ),
        gr.Slider(
            label="Temperature",
            value=0.0,
            minimum=0.0,
            maximum=1.0,
            step=0.1,
        ),
    ],
    outputs=gr.JSON(label="Output JSON"),
    title="Video Chat with VLM",
    description='comparing various "small" VLMs on the task of video captioning',
    api_name="video_inference",
)

app.launch(mcp_server=True)
```
1. you’ll need `spaces` for access to ZeroGPU, `torch` for setting the device and data type for your models, and `time` to understand how long each inference takes.
2. you’ll need `gradio` to create the Gradio interface when deploying to a HuggingFace Space.
3. `bitsandbytes` quantizes models “on the fly” but requires a CUDA-enabled GPU3.
4. installing `flash-attn` for ZeroGPU requires special handling (the `subprocess` call with `FLASH_ATTENTION_SKIP_CUDA_BUILD` shown above).
5. Using `torch.float16` (half-precision) or `torch.bfloat16` reduces memory usage for model weights and activations by half compared to `torch.float32`; for example, a 7-billion-parameter model needs roughly 28 GB for weights in `float32` but about 14 GB in half precision. `bfloat16` is generally preferred for training stability due to its wider dynamic range, but `float16` is often sufficient for inference and widely supported (see the memory-check sketch after this list).
6. `device_map="auto"` (requires `accelerate`) will try to fit the model layers across the available GPUs as much as possible and then offload the rest to CPU (a sketch of capping the per-device budget with `max_memory` follows this list).
7. `low_cpu_mem_usage=True` (also requires `accelerate`) tells transformers to load the model directly onto the target device, or to stream it in a more memory-efficient way, avoiding a large CPU RAM spike.
8. Flash Attention is an optimized attention algorithm designed to address the memory and computational bottlenecks of the standard attention mechanism in Transformers. A CUDA-enabled GPU is required. Read this lecture note for a deep dive.
9. Always set your model to evaluation mode (`model.eval()`) for inference. This disables layers like Dropout and BatchNorm, which behave differently during inference, and can sometimes free up a little memory.
10. Models should be loaded well before inference to avoid inflating inference time.
11. `duration` is the maximum time (in seconds) the inference function can take before timing out. The user must also have at least this much ZeroGPU quota left before calling the function.
12. Wrap your inference calls in the `torch.no_grad()` context manager. This prevents PyTorch from building the computation graph for gradients, saving a significant amount of memory on intermediate activations.
13. `mcp_server=True` turns on the creation of an MCP server (in addition to the API), but it requires a detailed docstring for the inference function, which we don’t have here (a sketch of such a docstring follows this list). See the official doc for more details on Gradio’s MCP server.
14. with some models, like Qwen2.5VL, the `fps` value determines the frame-extraction rate.
15. `temperature` controls the randomness of the generated output, i.e. how “creative” or “deterministic” the model’s responses will be.
16. This is crucial: `temperature` only has an effect if sampling is enabled. If `do_sample=False` (the default if temperature is not explicitly set or is 0), the model performs greedy decoding, always picking the most probable next token, regardless of the temperature value.
17. when you are trying to load a gated model, like Gemma-3n-E4B-it, the transformers library automatically checks for an `HF_TOKEN` environment variable. If set, it will use that token for authentication without needing `login()` or the `use_auth_token`/`token` argument in `.from_pretrained()` (see the token sketch after this list).
18. For InternVL3 the `num_frames` parameter is also available, but it is mutually exclusive with `fps`.
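To sanity-check how much the dtype and 4-bit choices actually save, you can measure a loaded model’s weight footprint directly. The sketch below is not part of the Space’s code; `report_memory` is a hypothetical helper, and the figures in the comments are rough estimates.

```python
import torch

def report_memory(model, label=""):
    # Sum the bytes held by the model's parameters.
    # Rough expectations: a 7B-parameter model is ~28 GB in float32,
    # ~14 GB in bfloat16/float16, and roughly 4-5 GB once NF4-quantized.
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    print(f"{label}: {param_bytes / 1e9:.2f} GB of weights")
    if torch.cuda.is_available():
        # What the CUDA caching allocator has actually handed out
        print(f"{label}: {torch.cuda.memory_allocated() / 1e9:.2f} GB allocated on GPU")

# e.g. report_memory(MODEL_ZOO["InternVL3-8B-hf"], label="InternVL3-8B (4-bit)")
```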
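If `device_map="auto"` still overcommits the GPU, `accelerate` also lets you cap each device explicitly through the `max_memory` argument of `from_pretrained` and spill the remaining layers to CPU RAM. A minimal sketch, assuming `accelerate` is installed; the budgets are illustrative, not tuned for any particular hardware:

```python
import torch
from transformers import AutoModelForImageTextToText

# Illustrative budgets only: cap GPU 0 at 20 GiB and let accelerate offload
# whatever does not fit to CPU RAM instead of raising an OOM error.
model = AutoModelForImageTextToText.from_pretrained(
    "OpenGVLab/InternVL3-8B-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    low_cpu_mem_usage=True,
    max_memory={0: "20GiB", "cpu": "48GiB"},
)
```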
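If you do want the MCP tool to be usable, Gradio builds the tool description from the inference function’s docstring and type hints. Below is a sketch of what such a docstring might look like for `video_inference`; the wording is mine, not from the Space.

```python
def video_inference(
    video_path: str,
    prompt: str,
    model_name: str,
    fps: int = 8,
    max_tokens: int = 512,
    temperature: float = 0.1,
):
    """Caption or answer questions about a video with a selected VLM.

    Args:
        video_path: Path (or URL) to the input video file.
        prompt: Instruction for the model, e.g. a captioning request.
        model_name: Key into MODEL_ZOO selecting which VLM to run.
        fps: Frame-sampling rate used when extracting frames from the video.
        max_tokens: Maximum number of new tokens to generate.
        temperature: Sampling temperature; 0 means greedy decoding.
    """
    ...
```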
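For gated checkpoints, it is usually enough to add `HF_TOKEN` as a secret on the Space and let transformers pick it up from the environment. The sketch below passes the token explicitly via the `token` argument instead, which does the same thing but makes the dependency visible; the Gemma repo id is just an example of a gated model.

```python
import os
import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

# On a Space, add HF_TOKEN as a repository secret; transformers will read it
# from the environment automatically. Passing token= explicitly (below) is
# equivalent, but makes the dependency obvious when reading the code.
hf_token = os.environ.get("HF_TOKEN")
model = Gemma3nForConditionalGeneration.from_pretrained(
    "google/gemma-3n-E4B-it",  # example of a gated checkpoint
    token=hf_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("google/gemma-3n-E4B-it", token=hf_token)
```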
Resources
In addition to the tips and tricks shared above, here are a few more resources worth checking out:
Footnotes
1. this is optional, and in some cases it might be better to create the `requirements.txt` file manually, especially when working with ZeroGPU, since it would be non-trivial to reproduce the Space’s hardware setup. However, note that some Spaces, like this one for gemma-3n-E4B-it built by the HuggingFace team, do use uv.↩︎
2. as of the writing of this post, ZeroGPU uses an H200 GPU, which has about 141 GB of VRAM, so you’re unlikely to encounter any OOM issues. My “bug” was actually due to version pinning of the `torch`, `torchvision`, and `transformers` libraries, which I realized once I looked at another Space’s `requirements.txt` file. In the end I was able to load up 7 VLMs, including one with 8 billion parameters. But the “real bug” is in not being able to use flash attention with the Qwen2.5VL model.↩︎
3. Without access to CUDA? You could actually use this Space by the HuggingFace team to create your own quantized model using their hardware and the `bitsandbytes` library.↩︎