HuggingFace Freebie: Automatic Speech Recognition

HuggingFace
speech recognition
Author

im@johnho.ca

Published

Monday, February 3, 2025

Abstract
Zero-cost Automatic Speech Recognition using HuggingFace's Serverless Inference API

Intro

HuggingFace’s Serverless Inference API has changed

See the update post from July 2025 for how to call the updated API.

With HuggingFace’s Serverless Inference API, a world of AI superpowers is available at zero cost. This post focuses on the power of Automatic Speech Recognition (ASR).

Previously, we talked about using OpenAI’s Whisper, but that required a pip install and local inference.

Here we’ll show that SOTA ASR is accessible with just an API call!

First, we’ll define some API calling functions.

import requests, os, base64, json
from IPython.display import Audio

def data_query(filename: str, api_url: str):
    '''send raw audio bytes to the inference endpoint'''
    assert os.environ.get('HUGGINGFACEHUB_API_TOKEN'), 'HUGGINGFACEHUB_API_TOKEN not set in env'
    headers = {"Authorization": f"Bearer {os.environ['HUGGINGFACEHUB_API_TOKEN']}"}

    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(api_url, headers=headers, data=data)
    return response.json()

def json_query(payload, api_url: str):
    '''send a JSON payload when extra parameters are needed'''
    assert os.environ.get('HUGGINGFACEHUB_API_TOKEN'), 'HUGGINGFACEHUB_API_TOKEN not set in env'
    headers = {"Authorization": f"Bearer {os.environ['HUGGINGFACEHUB_API_TOKEN']}"}

    response = requests.post(api_url, headers=headers, json=payload)
    return response.content

def base64_encode_audio(filename: str, get_dataUrl: bool = False):
    '''base64-encode an audio file, optionally wrapped as a data URL'''
    _, fext = os.path.splitext(os.path.basename(filename))
    with open(filename, "rb") as f:
        audio_data = f.read()
    audio_str = base64.b64encode(audio_data).decode('utf-8')
    # the data URL embeds a MIME type derived from the file extension,
    # e.g. f'data:audio/octet-stream;base64,{audio_str}'
    return f'data:audio/{fext[1:]};base64,{audio_str}' if get_dataUrl else audio_str
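One caveat with serverless inference: a cold model can take a while to spin up, and the endpoint may answer with an error payload instead of a transcript. Below is a minimal retry sketch built on the data_query helper above; the wrapper name, retry count, and wait interval are my own illustrative choices, not part of the official API.

import time

def data_query_with_retry(filename: str, api_url: str, retries: int = 3, wait_s: float = 20.0):
    '''hypothetical wrapper: retry while the serverless model is still warming up'''
    for attempt in range(1, retries + 1):
        result = data_query(filename, api_url)
        # the API reports failures (e.g. model still loading) as a dict with an 'error' key
        if isinstance(result, dict) and 'error' in result:
            print(f"attempt {attempt} failed: {result['error']}; retrying in {wait_s}s")
            time.sleep(wait_s)
        else:
            return result
    raise RuntimeError(f'ASR request still failing after {retries} attempts')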

Simple Transcript Generation

From this list of available models, I chose OpenAI’s Whisper Large v3 Turbo.

For this example, we’ll recycle the audio from this previous Whisper vs. YouTube Transcript API showdown:

Code
audio_b64_data = base64_encode_audio("DeepSeek.m4a", get_dataUrl=False)
# decode back to raw bytes so we can preview the clip in the notebook
sound_bytes = base64.b64decode(audio_b64_data)
Audio(sound_bytes)

To generate just the transcript, we only need to send the audio in binary format to the model’s endpoint:

%%time
asr_model_url = "https://api-inference.huggingface.co/models/openai/whisper-large-v3-turbo"
output_asr = data_query("DeepSeek.m4a", api_url=asr_model_url)
print(output_asr['text'])
 As big tech is getting hammered in today's selloff, CNBC's Magnificent 7 index dropping more than 1%. Well, there's a new emerging threat to mega caps, massive spending in America's dominance in the AI race. Deirdre Bosa digs into that for today's tech check. Hey, Dee. Hey, good morning, Leslie. So here's a name that our audience may want to write down, DeepSeq. This is a new free open source AI model that beats the latest open AI and meta models on key benchmarks. And it was made for a fraction of a fraction of the cost. Now, it was trained by a Chinese research lab that used NVIDIA H800s. That's a lower performance version of the H100 chips that are cheaper, more available and tailored for restricted markets like China. Now, I've been testing it out this morning and on the surface, it looks and acts just like open AI's chat GPT. And in fact, it actually thinks that is chat GPT. When I asked what model are you, it answered, I'm an AI language model created by open AI specifically based on the GPT for architecture, suggesting that it was trained on chat GPT outputs, which leaving aside terms of service violations, it means that entirely new state of the art models can be built on what is already out there. In other words, open AI's moat may be shrinking. If a model like DeepSea can emerge with competitive performance, minimal cost, and reliance on existing outputs, it signals a rapidly shrinking barrier to entry in AI development, challenging the current dominance of industry leaders like OpenAI. Based on technical tests designed to measure its coding performance, DeepSeq outperformed other models, including Meta's Llama 3.1 and OpenAI's GPT-40. And by the way, those are the latest state-of-the-art models. And that led Andrzej Kaparthi, a founding team member at OpenAI to post DeepSeek making it look easy today with an open frontier grade LLM trained on a joke of a budget. Now, to put that budget in perspective, it's pretty mind boggling. It costs just $5.5 million versus hundreds of millions of dollars for Meta's latest llama model and billions of dollars for GBT and Gemini models. This all raises an important question for investors as the AI trade evolves and technological progress stalls. Is training frontier models even a good investment anymore. Microsoft, Google, Amazon, Meta, OpenAI, they have made that a core mission over the last few years as they continue to spend billions building out AI infrastructure to train ever bigger and better models. DeepSeq did something highly competitive in just two months with dumbed down GPUs for less than $6 million. And guys, let's not forget this fact. It comes from China, whom Sam Altman and others say poses the greatest competitive threat to the dominance of U.S.-led AI. Now, this model is going to be tested in the weeks and months ahead, but the implication, guys, of this is just massive and will ripple through the AI community. Wow. Deirdre, I mean, we've been talking this morning about, obviously, the CapEx that we're expecting to be spent in 2025 on exactly what you're talking about. What is the core competency then here of the Chinese? Obviously, they're not using the latest chips. They don't have the computing power. They're not even spending on that. So what are they doing well? And yet, I know, right? And yet, you're right. They don't have all those things. They don't even have access to H100s. And yet they have built a model that competes with the ones from OpenAI and Meta. 
I mean, they've done this out of necessity using H800s, which, like I said, is a dumbed down version of the H100s. They've been able to do this, which essentially just means that these models are becoming more commoditized than ever. As that progress, the technological progress stalls, it's all about what can you build with the existing frontier models. And this is a perfect example of what a company that doesn't have to have built all the architecture, doesn't need those high-end GPs, is able to do simply by taking existing outputs. It also raises questions between open and closed source models, right? They were able to do so. This is an open source model. That was always the fear here with Meta's Lama, is that it gives the Chinese an advantage. It turns out the Chinese didn't even need Lama. They just needed stuff that was already put out there by ChatGPT and others. Now, wow, Deirdre, I feel like this is going to be a huge story in 2025 and the geopolitical ramifications of all of this as well. Really appreciate it. Deirdre Bosa.
CPU times: user 13.7 ms, sys: 20.2 ms, total: 33.9 ms
Wall time: 4.73 s
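As an aside, if you'd rather not hand-roll the HTTP calls, the huggingface_hub client library wraps the same endpoint (it is a pip install, but inference still happens server-side). A minimal sketch, assuming huggingface_hub is installed and the token is set in the environment:

from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ['HUGGINGFACEHUB_API_TOKEN'])
# accepts a local file path and returns an object with a .text attribute
result = client.automatic_speech_recognition("DeepSeek.m4a", model="openai/whisper-large-v3-turbo")
print(result.text)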

Transcript with Timestamps

If we want more granular detail for downstream tasks, we might also want timestamps.

In that case, we need to send a JSON payload with multiple parameters (see the official docs for the full list of configurable params):

%%time
payload = {
    'inputs': audio_b64_data,
    'parameters': {'return_timestamps': True}
}
output_asr = json_query(payload, api_url=asr_model_url)
output_asr = json.loads(output_asr.decode('utf-8'))
output_asr['chunks']
CPU times: user 44.1 ms, sys: 34.5 ms, total: 78.6 ms
Wall time: 5.95 s
[{'timestamp': [0.0, 8.38],
  'text': " As big tech is getting hammered in today's selloff, CNBC's Magnificent 7 index dropping"},
 {'timestamp': [8.38, 14.78],
  'text': " more than 1%. Well, there's a new emerging threat to mega caps, massive spending in America's"},
 {'timestamp': [14.78, 19.44],
  'text': " dominance in the AI race. Deirdre Bosa digs into that for today's tech check. Hey, Dee."},
 {'timestamp': [20.18, 24.6],
  'text': " Hey, good morning, Leslie. So here's a name that our audience may want to write down,"},
 {'timestamp': [0.0, 6.68],
  'text': ' DeepSeq. This is a new free open source AI model that beats the latest open AI and meta models on'},
 {'timestamp': [6.68, 11.68],
  'text': ' key benchmarks. And it was made for a fraction of a fraction of the cost. Now, it was trained by a'},
 {'timestamp': [11.68, 17.44],
  'text': " Chinese research lab that used NVIDIA H800s. That's a lower performance version of the H100"},
 {'timestamp': [17.44, 22.88],
  'text': ' chips that are cheaper, more available and tailored for restricted markets like China.'},
 {'timestamp': [23.2, 27.76],
  'text': " Now, I've been testing it out this morning and on the surface, it looks and acts just like open"},
 {'timestamp': [0.0, 6.08],
  'text': " AI's chat GPT. And in fact, it actually thinks that is chat GPT. When I asked what model are you,"},
 {'timestamp': [6.16, 11.54],
  'text': " it answered, I'm an AI language model created by open AI specifically based on the GPT for"},
 {'timestamp': [11.54, 17.8],
  'text': ' architecture, suggesting that it was trained on chat GPT outputs, which leaving aside terms of'},
 {'timestamp': [17.8, 23.32],
  'text': ' service violations, it means that entirely new state of the art models can be built on what is'},
 {'timestamp': [23.32, 29.62],
  'text': " already out there. In other words, open AI's moat may be shrinking. If a model like DeepSea can"},
 {'timestamp': [0.0, 4.5],
  'text': ' emerge with competitive performance, minimal cost, and reliance on existing outputs,'},
 {'timestamp': [4.88, 9.78],
  'text': ' it signals a rapidly shrinking barrier to entry in AI development, challenging the current dominance'},
 {'timestamp': [9.78, 15.06],
  'text': ' of industry leaders like OpenAI. Based on technical tests designed to measure its coding'},
 {'timestamp': [15.06, 22.6],
  'text': " performance, DeepSeq outperformed other models, including Meta's Llama 3.1 and OpenAI's GPT-40."},
 {'timestamp': [22.78, 27.12],
  'text': ' And by the way, those are the latest state-of-the-art models. And that led Andrzej Kaparthi,'},
 {'timestamp': [0.0, 5.64],
  'text': ' a founding team member at OpenAI to post DeepSeek making it look easy today with an open frontier'},
 {'timestamp': [5.64, 11.28],
  'text': " grade LLM trained on a joke of a budget. Now, to put that budget in perspective, it's pretty"},
 {'timestamp': [11.28, 17.7],
  'text': " mind boggling. It costs just $5.5 million versus hundreds of millions of dollars for Meta's latest"},
 {'timestamp': [17.7, 22.74],
  'text': ' llama model and billions of dollars for GBT and Gemini models. This all raises an important'},
 {'timestamp': [22.74, 28.24],
  'text': ' question for investors as the AI trade evolves and technological progress stalls. Is training'},
 {'timestamp': [0.0, 4.76],
  'text': ' frontier models even a good investment anymore. Microsoft, Google, Amazon, Meta, OpenAI,'},
 {'timestamp': [4.96, 9.04],
  'text': ' they have made that a core mission over the last few years as they continue to spend billions'},
 {'timestamp': [9.04, 15.32],
  'text': ' building out AI infrastructure to train ever bigger and better models. DeepSeq did something'},
 {'timestamp': [15.32, 20.88],
  'text': ' highly competitive in just two months with dumbed down GPUs for less than $6 million.'},
 {'timestamp': [21.4, 27.56],
  'text': " And guys, let's not forget this fact. It comes from China, whom Sam Altman and others say poses"},
 {'timestamp': [0.0, 5.54],
  'text': ' the greatest competitive threat to the dominance of U.S.-led AI. Now, this model is going to be'},
 {'timestamp': [5.54, 10.2],
  'text': ' tested in the weeks and months ahead, but the implication, guys, of this is just massive and'},
 {'timestamp': [10.2, 15.34],
  'text': " will ripple through the AI community. Wow. Deirdre, I mean, we've been talking this morning"},
 {'timestamp': [15.34, 21.4],
  'text': " about, obviously, the CapEx that we're expecting to be spent in 2025 on exactly what you're talking"},
 {'timestamp': [21.4, 27.46],
  'text': " about. What is the core competency then here of the Chinese? Obviously, they're not using the"},
 {'timestamp': [0.0, 3.64],
  'text': " latest chips. They don't have the computing power. They're not even spending on that. So what are"},
 {'timestamp': [3.64, 8.84],
  'text': " they doing well? And yet, I know, right? And yet, you're right. They don't have all those things."},
 {'timestamp': [8.9, 15.66],
  'text': " They don't even have access to H100s. And yet they have built a model that competes with the"},
 {'timestamp': [15.66, 21.72],
  'text': " ones from OpenAI and Meta. I mean, they've done this out of necessity using H800s, which, like I"},
 {'timestamp': [21.72, 26.02],
  'text': " said, is a dumbed down version of the H100s. They've been able to do this, which essentially"},
 {'timestamp': [0.0, 5.16],
  'text': ' just means that these models are becoming more commoditized than ever. As that progress,'},
 {'timestamp': [5.3, 10.02],
  'text': " the technological progress stalls, it's all about what can you build with the existing frontier"},
 {'timestamp': [10.02, 15.48],
  'text': " models. And this is a perfect example of what a company that doesn't have to have built all"},
 {'timestamp': [15.48, 21.06],
  'text': " the architecture, doesn't need those high-end GPs, is able to do simply by taking existing"},
 {'timestamp': [21.06, 26.18],
  'text': ' outputs. It also raises questions between open and closed source models, right? They were able'},
 {'timestamp': [0.0, 4.04],
  'text': " to do so. This is an open source model. That was always the fear here with Meta's Lama,"},
 {'timestamp': [4.32, 7.94],
  'text': " is that it gives the Chinese an advantage. It turns out the Chinese didn't even need Lama."},
 {'timestamp': [8.24, 11.68],
  'text': ' They just needed stuff that was already put out there by ChatGPT and others.'},
 {'timestamp': [12.3, 16.68],
  'text': ' Now, wow, Deirdre, I feel like this is going to be a huge story in 2025 and the geopolitical'},
 {'timestamp': [16.68, 20.22],
  'text': ' ramifications of all of this as well. Really appreciate it. Deirdre Bosa.'}]
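Notice that the timestamps above reset to 0.0 roughly every 30 seconds: Whisper transcribes long audio in windows, so each chunk’s timestamp is relative to its own window. For a downstream task like subtitle generation, we’d need to accumulate an offset whenever time jumps backwards. Here is a minimal sketch using the chunks above; the srt_time helper and the offset heuristic (approximating each window’s length by its last chunk’s end time) are my own, not part of the API:

def srt_time(seconds: float) -> str:
    '''format seconds as an SRT timestamp, e.g. 00:00:08,380'''
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((seconds - int(seconds)) * 1000))
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

offset, prev_end, lines = 0.0, 0.0, []
for i, chunk in enumerate(output_asr['chunks'], start=1):
    start, end = chunk['timestamp']
    if start < prev_end:   # timestamps went backwards: a new ~30s window began
        offset += prev_end
    prev_end = end
    lines.append(f'{i}\n{srt_time(offset + start)} --> {srt_time(offset + end)}\n{chunk["text"].strip()}\n')

with open('DeepSeek.srt', 'w') as f:
    f.write('\n'.join(lines))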

Conclusion

HuggingFace’s Serverless Inference API puts AI superpowers just an API call away. With ASR at your fingertips and the continuing advancement of LLMs, the apps we can create are limited only by our imagination! The only question is: what will you build with it?