A beginner's guide to loading up an LLM for app development
Intro
For this post, we are going to use the multi-modal vision-language model by Meta: llama3.2-vision.
And we are going to load it up in three different ways!
Let's import all the packages we need and load an image to test the model!
import os, sys, base64, httpx
from PIL import Image
from io import BytesIO
import warnings

warnings.filterwarnings('ignore')  # make blog post pretty ;)

image_url = "https://images2.9c9media.com/image_asset/2025_1_11_0c10be59-dad3-4f90-a1bd-087133dea6d2_jpg_1920x1080.jpg?width=320&height=180"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
pil_im = Image.open(BytesIO(httpx.get(image_url).content))
pil_im
Prompt Template
Let's create a prompt template that will be shared across the three different instances of the model.
from langchain_core.prompts import ChatPromptTemplate

template_message = [
    (
        "user",
        [
            {
                "type": "text",
                "text": "please describe this image and list out all the numbers and name seen. "
                        "no need to identify the person, just tell me what can be read on their jersey",
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
            },
        ],
    ),
]

prompt_template = ChatPromptTemplate.from_messages(template_message)
image_messages = prompt_template.format_messages(image_data=image_data)
Ollama (local)
For how to set up Ollama, see this post. Also make sure you have run pip install langchain-ollama.
Then we can pull our VLM using the CLI with ollama pull llama3.2-vision,
which should install the 11B-parameter version. For all available versions, see here.
from langchain_ollama import ChatOllama

chat_ollama = ChatOllama(model='llama3.2-vision', temperature=0.0)
The first time you load the model from a cold start will be a bit slow; expect subsequent loads to be noticeably faster.
%%time
r_ollama = chat_ollama.invoke(image_messages)
CPU times: user 16.3 ms, sys: 12.2 ms, total: 28.4 ms
Wall time: 1min 10s
Results
print(r_ollama.content)
The image shows a basketball player standing on the court with his hands on his hips.
* The number 4 is visible on the back of one player's jersey.
* The name "Westbrook" is written across the top of that same jersey.
* The number 15 is visible on the back of another player's jersey.
* The name "Jokic" is written across the top of that same jersey.
The image depicts a basketball game in progress, with two players standing on the court. The player on the left is wearing a jersey with the number "4" and the name "WESTBROOK" in yellow letters. The player on the right is wearing a jersey with the number "15" and the name "JOKIC" in yellow letters.
Here is a list of the numbers and names visible on the jerseys:
* 4 (on the left player's jersey)
* WESTBROOK (on the left player's jersey)
* 15 (on the right player's jersey)
* JOKIC (on the right player's jersey)
Note: The numbers and names are written in yellow letters on the jerseys.
HuggingFace (hosted)
HuggingFace actually offers a serverless Inference API that makes literally thousands of models available, and all you need is a free API key!
This makes it a solid choice for any data scientist running experiments!
Make sure you first pip install langchain-huggingface and follow the official docs here.
Note that llama-3.2 is a gated model: you will have to give up some personal information on the model card's page before you can load it.
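Here is a minimal sketch of how loading the model with langchain-huggingface could look. Treat the repo id (meta-llama/Llama-3.2-11B-Vision-Instruct), the HUGGINGFACEHUB_API_TOKEN environment variable, and the assumption that the serverless endpoint accepts our base64 image payload as things to verify against the official docs and the model card.

import os
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

# sketch only: check the repo id and serverless availability on the model card
llm_hf = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    max_new_tokens=512,
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
)
chat_hf = ChatHuggingFace(llm=llm_hf)

# reuse the exact same prompt template / image messages from above
r_hf = chat_hf.invoke(image_messages)
print(r_hf.content)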
The image is a photograph of two top NBA player (Russell Westbrook & Nikola Jokic) with their backs to the camera.
Key elements of the jerseys include:
- "Westbrook" (last name) and the number "4"
- "Jokic" ( last name) and the number "15"
In the background crowd of people are sitting attending the game.
The image conveys a sense of professional superiority due to fact that it captures the two top NBA players.
Conclusion
LLMs are huge (the llama3.2-vision 11B model used in this example is about 7.9 GB on disk), but they can be made to run locally with Ollama.
We also explored 2 additional ways to load the same open-sourced model from hosted providers. Both options are free and offer lower latency (just a simple API call).
Note that results for the same prompt do vary even when using the same underlying model and temperature=0.0. The difference is not big for the given image example, but it could be significant for a specific image type or prompt, so make sure you test your model's results qualitatively across different providers when developing your application!
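To make that comparison concrete, a quick side-by-side loop over the chat models built earlier might look like this (a sketch, assuming the chat_ollama and chat_hf objects from above are still in scope):

# run the same image prompt against each provider and eyeball the outputs
providers = {
    "ollama (local)": chat_ollama,
    "huggingface (hosted)": chat_hf,
}

for name, chat_model in providers.items():
    response = chat_model.invoke(image_messages)
    print(f"--- {name} ---")
    print(response.content)
    print()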