I tested whether Gemini, ChatGPT, and Claude can analyze videos – this one wins


David Gewirtz / Elyse Betters Picaro / ZDNET



ZDNET’s key takeaways

  • Gemini can watch YouTube, MP4, and MOV files.
  • Claude still can’t process video directly.
  • ChatGPT needs Codex help for deeper video work.

AIs do a fine job understanding text from prompts and documents. Most do quite well interpreting images, but what about video? Can your favorite AI actually understand what’s in a video? If it does understand, what can you do with it?

Also: I tested ChatGPT Plus vs. Gemini Pro to see which is better – and if it’s worth switching

In this article, we test ChatGPT, Claude, and Gemini to see how well they grok the contents of videos, both from YouTube and local files. The results are surprising.

The tests

I fed each AI a set of three videos. One is a YouTube video I published last year about the scientific process of annealing (yes, I am as exciting on video as I am on ZDNET). I tested the AIs to see whether they could understand what’s in the video. Then, I tried to see whether they could create a better thumbnail than the one I used on my YouTube channel.

The second video is a motion test for the DJI Neo 2 drone. It’s just a video of me standing in front of the drone, using gestures to control how the drone flies. No audio. I wanted to see if the AIs understand what’s happening there. That’s in MP4 format.

Also: I tested ChatGPT vs. Claude to see which is better – and if it’s worth switching

Finally, I have the original MOV file for a walk-and-talk about my YouTube posting strategy, which I published last year. For this test, I used the local file, the same one I actually uploaded to YouTube, because I wanted to see how well the AIs could ascertain what I’m talking about without any metadata, transcripts, or hints provided by YouTube. It’s just the video itself. If you want to see the uploaded version, here’s a link.

I tested the latest and best models. I tested the $20-per-month ChatGPT Plus plan, the $20-per-month Gemini Pro plan, and the $100-per-month Claude Max plan, which I use for Claude Code.

My prompt was simple: “Can you watch this video?” I found that “watch this video” worked better than “understand” or “summarize,” because both of those sent the AIs looking for metadata. By contrast, when I asked the AIs to “watch” the videos, none of them went hunting for metadata. All three seemed to know that “watch” meant I wanted them to actually watch the videos themselves and demonstrate an understanding of what was in them.

And with that, let’s dig in.

Claude: nope

Sometimes prolonging the pain is just cruel. Rip that Band-Aid off and get it over with. So, Claude is a big no. Neither the app nor the web interface can process video at all. Well, let’s have Claude tell us, shall we?

Also: How I used Claude AI to plan an entire hiking trip to the Adirondacks in 30 minutes – for free

Claude says, “I can’t watch video content directly. I can’t watch videos. I don’t have the ability to process video or audio content from YouTube links. I don’t have the ability to process the visual or audio frames of an MP4 file. Same answer here — I can’t directly watch or listen to a .mov file. I don’t process video or audio streams.”

And so it goes.

Gemini: yep

Gemini does have a standalone app, but I didn’t need to run it. Gemini’s web interface did a great job understanding all the video formats.

It didn’t matter whether I fed Gemini a YouTube URL, a 625MB MP4 file, or a whopping 1.65GB MOV file. Gemini, right in a browser tab, could handle it.

Also: This powerful Gemini setting made my AI results way more personal and accurate

The most interesting one is my drone test. That’s because the video contains no audio (not even background noise) and no context other than me standing there and moving my arms.


Drone shot by David Gewirtz/ZDNET

And yet, Gemini was able to ascertain exactly what I was doing:

In the video, you’re testing out some hand gestures — raising your palm to the camera as if signaling it to stop or move. The camera follows your lead, changing its angle and distance as you guide it through the yard and eventually back toward the house.

Looks like a successful test of those drone gestures! Is there something specific about the footage or the camera’s response you wanted to go over?

I mean, like, wow. Let’s be clear. The drone itself was not visible in the video. It was acting as the camera. I’m betting there are a lot of humans who wouldn’t understand what was happening there (I’m looking at you, my neighbors!), let alone an AI.

It did successfully understand my annealing video. It was able to identify sections, report on specific points I made verbally, and otherwise demonstrate its understanding.

It also understood the uploaded walk-and-talk video, not only identifying the location, but also the various aspects of my commentary throughout the video.

Also: I tested ChatGPT Images 2.0 vs. Gemini Nano Banana to see which is better – this model wins

The one place Gemini fell down was in the transition from its video-understanding mode to Nano Banana’s image-generation mode. Despite Nano Banana’s ability to make awesome images, it doesn’t understand life and the world the way the new ChatGPT Images 2.0 does, and it shows.

I fed Gemini the original thumbnail for the video and told it, “Choose a single frame for the maximum impact as a YouTube thumbnail, then, based on context of the video and my existing YouTube thumbnail style, use Nano Banana to create a high-click-value thumbnail.”

The image on the left is my original thumbnail. The next two were Gemini’s attempts. They’re certainly vivid and might attract more clicks, but Gemini decided to make up a dude with a beard and place him in the image rather than use my suave and sophisticated visage. And it spelled “FIRE” as “FCIRE.” And so it goes.


Screenshot by David Gewirtz/ZDNET and Gemini

If you want to quickly get an understanding of a video or ask questions about it, Gemini is definitely the best way to go. If you want to create thoughtful imagery from the video, Gemini is not yet up to the task.

ChatGPT: sorta, with a bit more work

Then there’s ChatGPT, which is more like a bad news, good news story.

First, the bad news. ChatGPT itself failed my tests. It couldn’t read the YouTube link. And while it theoretically can understand videos, they need to be under 500MB in size. Mine were not.
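If you want to try squeezing an oversized video under that cap yourself, a quick ffmpeg re-encode usually does it. Here’s a minimal sketch in Python; it assumes ffmpeg is installed and on your PATH, and the 500MB figure reflects the limit I ran into, not a documented constant:

```python
import os
import subprocess

# 500MB was the ceiling in my testing; adjust if OpenAI changes it.
LIMIT_BYTES = 500 * 1024 * 1024

def shrink_for_upload(src: str, dst: str = "upload.mp4") -> str:
    """Return a path to a version of the video that fits the upload cap."""
    if os.path.getsize(src) <= LIMIT_BYTES:
        return src  # already small enough; upload as-is
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            "-vf", "scale=1280:-2",           # downscale to ~720p width
            "-c:v", "libx264", "-crf", "28",  # heavier video compression
            "-c:a", "aac", "-b:a", "96k",     # modest audio bitrate
            dst,
        ],
        check=True,
    )
    return dst
```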

Then, the good news. If you combine the OpenAI Codex app with ChatGPT, it can do some very cool stuff. Codex is the agentic workhorse of OpenAI’s offerings. I kind of liken Codex to that handy neighbor we all know who can build almost anything out of almost anything.

Also: I stopped using ChatGPT for everything: These AI models beat it at research, coding, and more

Codex was able to read both local files and understand their meaning. For the drone test, Codex reported:

It looks like a backyard drone test shot.

What’s happening:

  1. A person stands in a residential backyard and faces the camera/drone.
  2. They gesture a few times (including a hand raise/wave-like motion).
  3. The camera viewpoint moves around them over time, changing angle and distance while keeping them mostly centered.
  4. The sequence ends still in the same yard area, with no major scene change or additional people/activities.

Yep. That’s what it’s about.

Codex was also able to watch and understand the walk-and-talk MOV file I uploaded. Initially, it couldn’t process the content, so Codex asked permission to install some Python code and libraries to allow it to do audio transcription. Once it built all that out, it was able to not only view the video, but also understand its context.
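Codex didn’t show me the exact code it assembled, but a typical do-it-yourself version of that pipeline would transcribe the audio track with a local Whisper model, which handles the ffmpeg audio extraction under the hood. A rough sketch under those assumptions (the file name is a stand-in for my walk-and-talk MOV):

```python
import whisper  # pip install openai-whisper (needs ffmpeg on your PATH)

# "walk-and-talk.mov" is a placeholder for whatever local file you're testing.
model = whisper.load_model("base")              # small, CPU-friendly model
result = model.transcribe("walk-and-talk.mov")  # Whisper extracts the audio itself

# Timestamped segments are roughly what an agent needs to reason about
# what's being said and when.
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s] {seg['text'].strip()}")
```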

But then Codex couldn’t watch the YouTube stream. OK, fine. This is Codex. So, I asked, “Can you download the full video and then work on it locally?”

That worked. It automagically wrote a Python script, installed some libraries, assembled its own video-downloading tooling on the fly, and then watched my YouTube video.
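I didn’t dig into the script Codex generated, but the standard Python route for this job is the yt-dlp library. A minimal sketch, with a placeholder URL:

```python
from yt_dlp import YoutubeDL  # pip install yt-dlp

URL = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder URL

opts = {
    "format": "mp4/best",                   # prefer a single MP4 file
    "outtmpl": "downloaded_video.%(ext)s",  # local output filename
}
with YoutubeDL(opts) as ydl:
    ydl.download([URL])
```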

But then I wanted it to create a thumbnail. I first asked if it had access to ChatGPT Images 2.0 (remember, they’re both OpenAI tools). It responded, “I have access to image generation tools in this session, but I don’t have a tool explicitly labeled Images 2.0 exposed to me.”

Also: I tested ChatGPT and Perplexity AI as my CarPlay voice assistants – both made Siri look bad

Let’s just ignore the unfortunate “exposed to me” phrasing. I had to explain to Codex that Images 2.0 was a thing, and point it to OpenAI’s site for it to understand. At that point, the agentic tool was aware of the images tool, but still couldn’t do much with it.

So, that’s when I acted as the conduit between Codex and ChatGPT. I told Codex, “Choose a single frame for the maximum impact as a YouTube thumbnail, export that thumbnail somewhere so ChatGPT can get to it, or so I can upload it to ChatGPT, and then, based on context of the video and my existing YouTube thumbnail style, write a prompt for ChatGPT to create a high-click-value thumbnail.”
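Codex never showed me how it picked its frame, but if you wanted to replicate that step by hand, one plausible heuristic is to sample frames at intervals and keep the sharpest one, scored by the variance of the Laplacian (a common blur metric). A sketch using OpenCV, with a hypothetical file name; this is one reasonable approach, not necessarily what Codex did:

```python
import cv2  # pip install opencv-python

def pick_thumbnail_frame(path: str, out: str = "thumb_frame.jpg",
                         step_secs: float = 2.0) -> None:
    """Sample a frame every step_secs and save the sharpest one found."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * step_secs))
    best_score, best_frame, idx = -1.0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Variance of the Laplacian: higher means crisper edges.
            score = cv2.Laplacian(gray, cv2.CV_64F).var()
            if score > best_score:
                best_score, best_frame = score, frame
        idx += 1
    cap.release()
    if best_frame is not None:
        cv2.imwrite(out, best_frame)

pick_thumbnail_frame("annealing.mov")  # hypothetical local file name
```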

Then, in ChatGPT, I uploaded the original thumbnail image I showed you earlier, and the frame that Codex chose. I then pasted in the prompt Codex created. As you can see, Codex/ChatGPT got more right than Gemini did.


Screenshot via ChatGPT and Codex by David Gewirtz/ZDNET

It picked up on the white, yellow, and black color scheme for my lettering. It didn’t include my logo, and it didn’t include the yellow stripe I use for my titles, but I actually quite like the juxtaposition of my picture over the torch flame. ChatGPT and Codex actually used my image, unlike Gemini. But I do take issue with the aluminum bar. I used flat material. For some reason, the OpenAI tools decided to make it into square tubing.

Also: I tried ChatGPT Images 2.0: A fun, huge leap – and surprisingly useful for real work

Here’s where ChatGPT’s better image knowledge comes into play. I corrected it on the tubing vs. bar situation and asked it to regenerate. I prompted, “That aluminum is flat bar material about 1/8-inch thick, not square tubing. Keep everything else, but please fix the aluminum.”


Screenshot via ChatGPT and Codex by David Gewirtz/ZDNET

We were close. I didn’t like how it placed the Sharpie marks (which are used to tell when the metal is hot enough to bend), and the actual bend was far too sharp. One more prompt: “Good, but the bend is too sharp. It’s not a perfect right angle. There’s a curve because the aluminum needs to flex as it bends. Please revise. Also, the sharpie marks are perpendicular to the edge of the bar, not on an angle. They indicate where to bend.”


Screenshot via ChatGPT and Codex by David Gewirtz/ZDNET

That’s good enough. I think it’s possible to feed Codex and ChatGPT a video with no additional context and get out a YouTube thumbnail. You could probably use it to analyze other types of videos and produce images from those as well.

It’s not super-convenient, but it does work pretty well.

AI can indeed watch video

There are a few things to note. First, the AIs were able to fully interpret the videos in much less time than their actual play time. Both the science video and the walk-and-talk are about 15 minutes long, but both Gemini and ChatGPT were able to “watch” and parse them for understanding in what I would say was about two or three minutes each.

Second, both show fairly powerful interpretation skills. I found their ability to understand that the silent video I gave them was a drone test to be rather impressive. The drone mostly stayed at human height, yet they were both able to ascertain context from the frames in the video.

There are certainly some practical uses. I gave Gemini a YouTube video of a CBS report on the OpenAI trial and asked it to provide details about what was discussed. I can definitely see using it to scan through security camera video to quickly find a specific type of action.

I can also definitely see giving the AI a longer video and having it pull out the major points. What was particularly useful is that Gemini time-stamped each of the key thoughts, so I could just click the time stamps and drop into the video at that point.
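Everything in my tests ran through the web interface, but if you wanted to script this kind of timestamped video Q&A, Google’s Gemini API accepts uploaded video files. A minimal sketch using the google-generativeai Python SDK; the API key, file name, and model name are placeholders:

```python
import time
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the video, then wait for server-side processing to finish.
video = genai.upload_file("security-cam.mp4")  # placeholder file
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name
response = model.generate_content(
    [video, "List the key moments in this video, with timestamps."]
)
print(response.text)
```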

Then, of course, there’s the actual use of these tools to create YouTube thumbnails. I still prefer to do it by hand. But the fact that these AIs can extract usable frames and construct thumbnails means that creators have a new tool at their disposal.

Also: I used Claude Code to vibe code a Mac app in 8 hours, but it was more work than magic

Overall, I’m impressed with Gemini and the pairing of ChatGPT and Codex for video-watching ability. Isn’t it interesting that Gemini doesn’t need two tools (after all, it is called “Gemini”), but ChatGPT needs Codex? Things like this amuse me.

Even though Claude bombed at this test, it still has value. Claude is one of my favorites for vibe coding.

What productivity benefits can you see getting from the video-watching capabilities of these AIs? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.







The Windows Insider Program is about to get much easier

Ed Bott / Elyse Betters Picaro / ZDNET



ZDNET’s key takeaways

  • Microsoft is making the Insider Program less complicated.
  • Beta channel will be a more reliable preview of the next retail release.
  • Other changes will allow testers to quickly enable/disable new features.

Last month, Microsoft took official notice of its customers’ many complaints about Windows 11. Pavan Davuluri, the executive vice president who runs the Windows and Devices group, promised sweeping changes to Windows 11. Today, the company announced the first of those changes in a post authored by Alec Oot, who’s been the principal group product manager for the Windows Insider Program since January 2024.

Those changes will streamline the Insider program, which has lost sight of its original goals in the past few years. (For a brief history of the program and what had gone wrong, see my post from last November: “The Windows Insider Program is a confusing mess.”)

Also: If Microsoft really wants to fix Windows 11, it should do these four things ASAP

If you’re currently participating in the Windows Insider Program, these are meaningful changes. Here’s what you can expect.

Simplifying the Insider channel lineup

Throughout the Windows 11 era, signing up for the Insider program has required choosing one of four channels using a dialog in Windows Settings. Here’s what those options look like today on one of my test PCs.


The current Insider channel lineup is confusing, to say the least.

Screenshot by Ed Bott/ZDNET

Which channel should you choose? As the company admitted in today’s post, “the channel structure became confusing. It was not clear what channel to pick based on what you wanted to get out of the program.”

The new lineup consists of two primary channels: Experimental and Beta. The Release Preview channel will still be available, primarily for the benefit of corporate customers who want early access to production builds a few days before their official release. That option will be available under the Advanced Options section.


This simplified lineup is easier to follow. Beta is the upcoming retail release; Experimental is for the adventurous.

Screenshot courtesy of Microsoft

Here’s Microsoft’s official description of what’s in each channel now, with the company’s emphasis retained:

  • Experimental replaces what were previously the Dev and Canary channels. The name is deliberate: you’re getting early access to features under active development, with the understanding that what you see may change, get delayed, or not ship at all. We’ve heard your feedback that you want to access and contribute to features early in development and this is the channel to do that.
  • Beta is a refresh of the previous Beta Channel and previews what we plan to ship in the coming weeks. The big change: we’re ending gradual feature rollouts in Beta. When we announce a feature in a Beta update and you take that update, you will have that feature. You may occasionally see small differences within a feature as we test variations, but the feature itself will always be on your device.

These changes will apply to the Windows Insider Program for Business as well.

Offering a choice of platforms

For those testers who want to tinker with the bleeding edge of Windows development, a few additional options will be available in the Experimental channel. These advanced options will allow you to choose a platform that’s aligned to a currently supported retail build. Currently, that’s Windows 11 version 25H2 or 26H1, with the latter being exclusively for new hardware arriving soon with Snapdragon X2 Arm chips.

Also: Microsoft account vs. local account: How to choose

There will also be a Future Platforms option, which represents a preview build that is not aligned to a retail version of Windows. According to today’s announcement, this option is “aimed at users who are looking to be at the forefront of platform development. Insiders looking for the earliest access to features should remain on a version aligned to a retail build.”


The Future Platforms option is the equivalent of the current Canary channel.

Screenshot courtesy of Microsoft

Minimizing the chaos of Controlled Feature Rollout

Last month, I urged Microsoft to stop using its Controlled Feature Rollout technology, especially for builds in the Beta channel. Apparently, someone in Redmond was listening.

One of the most common questions we receive from Insiders is “why don’t I have access to a feature that’s been announced in a WIP blog?” This is usually due to a technology called Controlled Feature Rollout (CFR), a gradual process of rolling out new features to ensure quality before releasing to wider audiences. These gradual rollouts are an industry standard that help us measure impact before releasing more broadly. But they also make your experience unpredictable and often mean you don’t get the new features that motivated many of you to join the Insider program to begin with.

Moving forward, Insider builds in the Beta channel will no longer suffer from this gradual rollout of features. Meanwhile, the company says, “Insiders in the Experimental channel will have a new ability to enable or disable specific features via the new Feature Flags page on the Windows Insider Program settings page.”


Builds in the Experimental channel will include the option to turn new features on or off.

Screenshot courtesy of Microsoft

Not every feature will be available from this list, but the intent is to add those flags for “visible new features” that are announced as part of a new Insider build.

Making it easier to change channels

The final change announced today is one I didn’t see coming. Historically, leaving the Windows Insider Program or downgrading a channel (from Dev to Beta, for example) has required a full wipe and reinstall. That’s a major hurdle and a big impediment to anyone who doesn’t have the time or technical skills to do that sort of migration.

Also: Why Microsoft is forcing Windows 11 25H2 update on all eligible PCs

Beginning with the new channel lineup, it should be easier to change channels or leave the program without jumping through a bunch of hoops.

To make this a more streamlined and consistent experience, we’re making some behind the scenes changes to enable Insider builds to use an in-place upgrade (IPU) to hop between versions. This will allow in most cases Insiders to move between Experimental, Beta, and Release Preview on the same Windows core version, or leave the program without a clean install. An IPU takes a bit more time than your normal update but migrates your apps, settings, and data in-place.

If you’ve chosen one of the future platforms from the Experimental channel, those options don’t apply. To move back to a supported retail platform, you’ll need to do a clean install.

Also: Apple, Google, and Microsoft join Anthropic’s Project Glasswing to defend world’s most critical software

The upshot of all these changes: things should be a lot clearer for anyone trying to figure out what’s in the pipeline. Beta channel updates, for example, should offer a more accurate preview of the next big feature update, so over the next month or two we should get a better picture of what’s coming in the 26H2 release, due in October.

When can we start to see those changes rolling out to the general public? Stay tuned.




