Vibing with AI đŸ“ș Episode 7

This week we vibe with AI, create a text-based golf game, and research Pujols Homers.

Vibe Checking AI Models

Image via Ideogram.ai

New large language models are being released every day. Each model claims to break new ground in reasoning, image generation, or even playing PokĂ©mon Red. Selecting the “right” model can be anxiety-inducing, much like watching my oldest daughter carefully weigh the benefits of the perfect Build-a-Bear in a jam-packed mall.

Leading AI companies typically release their own benchmarks alongside their latest models (like Claude 3.7 Sonnet or ChatGPT 4.5). They often prefer to control their own benchmarks or haven't prioritized appearing on independent leaderboards such as Hugging Face's Open LLM Leaderboard, which can make it difficult to compare the effectiveness of each model.

The best way to get a handle on how these models perform for you is to test them out. Pick a few and give them the same task. See how they feel. Get a sense of the vibes. Most have free tiers that are perfectly adequate for experimenting.

You’ll start to notice differences between the “personality” of the models. I find ChatGPT feels a bit more, well, chatty, while Claude is more succinct and to the point, which I appreciate.

As models advance, the key differentiators become platform-specific features. If you’re looking for a research assistant or image generation alongside your chat functionality, you may go with ChatGPT for its Deep Research and DALL·E integrations. If you’re looking for coding assistance, Claude 3.7 Sonnet may be the current leader.

Pro tip: Tools like GitHub Copilot and Cursor let you easily switch between multiple models.

When you’re really ready to get into the nitty gritty, Hugging Face can be a great resource for trying out specialized models for specific tasks like visual processing, math, or image generation.

Ultimately, your selection comes down to personal preference and specific needs. The next time you face a task, try rubber duck debugging it with several models and evaluate which response best serves your purpose. We could all take a cue from my youngest and start grabbing every bear within reach. We can check the vibes later.

đŸ€ż Dive Deeper

Battle Of The AI Bots

Go to WebDev Arena and make sure you’re on the Battle tab in the left-hand navigation.

Add a prompt for a web application you’d like to see or copy and paste the following:

Create a text-based golf game that generates the length and par values of a course. The user rolls a virtual 20-sided die that determines the accuracy of their shot. Low rolls result in bad shots. High rolls result in fairway hits. Show the distance remaining to the hole after each shot and where the ball lies. The user should be able to select between a standard set of clubs.

Hit enter and see what the two competing models spin up. Depending on the complexity of your ask, one or both models may fail, but you can always try again.

I could have done without the golf game being so true to life
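If you’re curious what the prompt above boils down to mechanically, here’s a minimal Python sketch of the core loop: a d20 roll maps to a fraction of the chosen club’s distance, low rolls mishit, high rolls fly true. The club yardages and the roll-to-distance curve are my own assumptions, not anything the arena models actually generated.

```python
import random

# Hypothetical club set (max carry in yards) -- an assumption for this sketch.
CLUBS = {"driver": 250, "5-iron": 180, "wedge": 100, "putter": 30}

def shot_distance(club_yards: int, roll: int) -> int:
    """Map a d20 roll to a fraction of the club's max distance.

    A roll of 1 is a mishit at 30% power; a roll of 20 is a flush
    strike at 100%. Everything in between scales linearly.
    """
    if not 1 <= roll <= 20:
        raise ValueError("roll must be 1-20")
    return round(club_yards * (0.3 + 0.7 * (roll - 1) / 19))

def play_hole(length: int, rng: random.Random) -> int:
    """Play one hole with naive club selection; return the stroke count."""
    remaining, strokes = length, 0
    while remaining > 0 and strokes < 20:  # cap strokes so the loop always ends
        # Pick the club whose max distance is closest to what's left.
        club, yards = min(CLUBS.items(), key=lambda kv: abs(kv[1] - remaining))
        roll = rng.randint(1, 20)
        remaining = abs(remaining - shot_distance(yards, roll))
        strokes += 1
        if remaining == 0:
            return strokes       # holed it
        if remaining < 10:
            return strokes + 1   # tap-in range: one more stroke to finish
    return strokes
```

Swapping the linear roll curve for something lumpier (critical misses on a 1, bonus distance on a 20) is where the "true to life" pain comes in.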


The Pujols Test

I’m from (the greater) St. Louis (area), so I grew up in the prime of Albert Pujols. I’ve seen him hit more homers in person than any other player. It’s March and baseball is just around the corner, so the long ball naturally came to mind, which is why I decided to use Pujols’ homers to test ChatGPT, Claude, Gemini, and Perplexity.

I started with a straightforward question: "How many home runs did Albert Pujols hit?" All four models correctly identified his career total of 703. However, Google Gemini incorrectly ranked him 5th all-time when he's actually 4th. When asked to name the players ahead of him, Gemini erroneously listed Alex Rodriguez with 696 home runs.

That question was easy enough, but let’s make it harder. The first game I went to at Busch Stadium III was on Easter Sunday in 2006. Albert hit three dingers that day, including a walk-off in a victory over the Reds. All the while, my brother and I let Cincinnati right fielder Austin Kearns know that Albert was, in fact, on it each time he came to the plate.

I asked each model, “How many homers did Albert Pujols hit on Easter?” Only Claude didn’t track down the April 16, 2006 game, admitting to not having access to that specific data. ChatGPT, Gemini, and Perplexity all mentioned that game as their answer, but only Perplexity copped to that possibly not being all his Easter homers.

For my final test, I asked each model to provide the dates for Easter from 2001 to 2022, then requested a tally of Pujols' home runs on those specific days. ChatGPT confidently but incorrectly claimed Pujols hit two home runs on Easter Sunday, April 4, 2010 (though he did hit two the following day). The other models maintained that April 16, 2006, was his only Easter Sunday with home runs.
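Incidentally, you don’t need a chat model to produce the Easter dates themselves: the Anonymous Gregorian algorithm (the classic "computus") pins down Easter Sunday for any year with a few lines of integer arithmetic. A quick Python sketch:

```python
def easter_date(year: int) -> tuple[int, int]:
    """Return (month, day) of Easter Sunday via the Anonymous
    Gregorian algorithm (computus)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return month, day + 1

# The date range from the test above.
for year in range(2001, 2023):
    month, day = easter_date(year)
    print(f"{year}: {month:02d}-{day:02d}")
```

It correctly yields April 16, 2006 (the three-homer game) and April 4, 2010, matching the dates discussed here.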

All models missed Pujols' single home run games on Easter Sunday on April 11, 2004, April 8, 2007, and April 17, 2022, according to the Baseball Almanac. Claude deserves credit for not attempting to provide potentially incorrect information, and Perplexity did list the Baseball Almanac page as a resource.

That puts Albert at six total homers on Easter, all as a St. Louis Cardinal. This was admittedly an unscientific test of the models, but it’s a good reminder to check the output from these models rather than take it as gospel.

đŸ‹ïžâ€â™‚ïž Unique Benchmarks

Around the Horn

Term of the Day

Arenas - Spaces where models are evaluated head-to-head on a given task by human-based voting systems.

Next Time On

Have an idea for something you’d like to see more of or just want to get in touch? Send a reply to [email protected].

Until next time!