LLM app with LangChain
Intro
For this post, we are going to use the multi-modal LVLM from Meta: llama3.2-vision, and we are going to load it up in 3 different ways!
Let’s import all the packages we need and load an image to test the model:

import os, sys, base64, httpx
from PIL import Image
from io import BytesIO
import warnings

warnings.filterwarnings('ignore')  # make blog post pretty ;)

image_url = "https://images2.9c9media.com/image_asset/2025_1_11_0c10be59-dad3-4f90-a1bd-087133dea6d2_jpg_1920x1080.jpg?width=320&height=180"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")  # base64 string for the prompt template
pil_im = Image.open(BytesIO(httpx.get(image_url).content))  # PIL image, just for display
pil_im
Prompt Template
Let’s create a prompt template that will be shared across the three different instances of the model:
from langchain_core.prompts import ChatPromptTemplate

template_message = [
    (
        "user",
        [
            {
                "type": "text",
                "text": "please describe this image and list out all the numbers and name seen. \
no need to identify the person, just tell me what can be read on their jersey",
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
            },
        ],
    ),
]

prompt_template = ChatPromptTemplate.from_messages(template_message)
image_messages = prompt_template.format_messages(image_data=image_data)
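To sanity-check that the base64 string actually got injected, you can peek at the formatted message. This is just a quick sketch; it assumes the two-part content layout defined in template_message above.

msg = image_messages[0]  # the single formatted user message
print(msg.content[0]["text"])                   # the instruction text
print(msg.content[1]["image_url"]["url"][:50])  # start of the base64 data URL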
Ollama (local)
For how to set up Ollama, see this post. Also make sure you have run pip install langchain-ollama.
Then we can pull our VLM using the CLI with ollama pull llama3.2-vision, which should install the 11B-parameter version. For all available versions, see here.
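If you would rather trigger the pull from a Python setup script instead of the terminal, a minimal sketch is to simply shell out to the same CLI command:

import subprocess

# equivalent to running `ollama pull llama3.2-vision` in a terminal;
# requires the Ollama CLI to be installed and the server to be running
subprocess.run(["ollama", "pull", "llama3.2-vision"], check=True)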
from langchain_ollama import ChatOllama

chat_ollama = ChatOllama(model='llama3.2-vision', temperature=0.0)
Loading the model from a cold start will be a bit slower the first time; expect subsequent loads to be noticeably faster.
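If you don’t want the cold start to dominate the timing below, one option is to warm the model up first with a trivial text-only call (a quick sketch; ChatOllama also accepts a plain string via invoke):

# optional warm-up so the weights are already loaded before the timed call below
_ = chat_ollama.invoke("Reply with the single word: ready")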
%%time
r_ollama = chat_ollama.invoke(image_messages)

CPU times: user 16.3 ms, sys: 12.2 ms, total: 28.4 ms
Wall time: 1min 10s
Results
print(r_ollama.content)
The image shows a basketball player standing on the court with his hands on his hips.
* The number 4 is visible on the back of one player's jersey.
* The name "Westbrook" is written across the top of that same jersey.
* The number 15 is visible on the back of another player's jersey.
* The name "Jokic" is written across the top of that same jersey.
OpenRouter (hosted)
OpenRouter is a hosted model aggregator and offers a list of free models!
They are nicely integrated into LangChain through the OpenAI-compatible interface, so make sure you also pip install langchain-openai.
from langchain_openai import ChatOpenAI

chat_llama = ChatOpenAI(
    openai_api_key=os.environ.get("OPENROUTER_KEY"),
    openai_api_base="https://openrouter.ai/api/v1",
    model_name="meta-llama/llama-3.2-11b-vision-instruct:free",
    temperature=0.0,
)
%%time
r_openrouter = chat_llama.invoke(image_messages)

CPU times: user 44.4 ms, sys: 16.6 ms, total: 60.9 ms
Wall time: 1.73 s
Results
Noticeably different from the Ollama version’s output:
print(r_openrouter.content)
The image depicts a basketball game in progress, with two players standing on the court. The player on the left is wearing a jersey with the number "4" and the name "WESTBROOK" in yellow letters. The player on the right is wearing a jersey with the number "15" and the name "JOKIC" in yellow letters.
Here is a list of the numbers and names visible on the jerseys:
* 4 (on the left player's jersey)
* WESTBROOK (on the left player's jersey)
* 15 (on the right player's jersey)
* JOKIC (on the right player's jersey)
Note: The numbers and names are written in yellow letters on the jerseys.
HuggingFace (hosted)
Hugging Face offers a serverless Inference API that makes literally thousands of models available, and all you need is a free API key!
This makes them a solid choice for any data scientist running experiments!
Make sure you first pip install langchain-huggingface and follow the official docs here.
Note that llama-3.2 is a gated model: you will have to share your personal information on the model card’s page before you can load the model.
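Once your access request is approved, the endpoint only needs your token. If you have not exported it as an environment variable yet, a minimal sketch of authenticating the session with the huggingface_hub client looks like this (the environment variable name below is just the one used in this post; use whatever you stored your token under):

import os
from huggingface_hub import login

# authenticate this session with your (free) Hugging Face access token
login(token=os.environ.get("HUGGINGFACEHUB_API_TOKEN"))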
%%time
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    task="text-generation",
    do_sample=False,
    repetition_penalty=1.03,
    temperature=0.0,
    huggingfacehub_api_token=os.environ.get("HUGGINGFACEHUB_API_TOKEN"),
)

chat_hf_llama = ChatHuggingFace(llm=llm)

CPU times: user 246 ms, sys: 35.5 ms, total: 281 ms
Wall time: 391 ms
%%time
r_hf_llama = chat_hf_llama.invoke(image_messages)

CPU times: user 7.15 ms, sys: 2.82 ms, total: 9.97 ms
Wall time: 4.94 s
Results
Again different from Ollama’s and OpenRouter’s:
print(r_hf_llama.content)
The image is a photograph of two top NBA player (Russell Westbrook & Nikola Jokic) with their backs to the camera.
Key elements of the jerseys include:
- "Westbrook" (last name) and the number "4"
- "Jokic" ( last name) and the number "15"
In the background crowd of people are sitting attending the game.
The image conveys a sense of professional superiority due to fact that it captures the two top NBA players.
Conclusion
- LLMs are huge (the llama3.2-vision 11B used in this example is about 7.9GB on disk), but they can be made to run locally with Ollama.
- We also explored 2 additional ways to call the same open-source model as a hosted service. Both options are free and offer much lower latency than the local run (just a simple API call).
- Note that results for the same prompt do vary even when using the same underlying model with temperature=0.0. The difference in content is not big for this image example, but it could be significant for a specific image type or prompt, so make sure you test your model’s results qualitatively across different providers when developing your application (see the sketch below)!
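As a closing sketch, here is one way you could run that qualitative comparison across the three providers set up in this post (assuming chat_ollama, chat_llama, chat_hf_llama and image_messages from above are still in scope):

# run the same formatted prompt against all three providers and eyeball the differences
providers = {
    "ollama (local)": chat_ollama,
    "openrouter": chat_llama,
    "huggingface": chat_hf_llama,
}

for name, model in providers.items():
    response = model.invoke(image_messages)
    print(f"\n===== {name} =====")
    print(response.content)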