Imagine this: cutting-edge AI can generate videos so realistic they could be mistaken for actual surgical footage. But what if these videos, while visually impressive, completely misunderstand the very procedures they depict? This is precisely the issue researchers uncovered when testing Google's Veo-3 video AI, revealing a significant gap between its visual capabilities and its grasp of medical reality.
To put Veo-3 to the test, an international team of researchers created the SurgVeo benchmark from 50 real videos of abdominal and brain surgeries. The AI was given a single frame from each recording and tasked with predicting the next eight seconds of the procedure.
Four experienced surgeons then meticulously reviewed the AI-generated clips, evaluating them across four key criteria: visual appearance, instrument use, tissue feedback, and overall medical sense.
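The review protocol described above, four surgeons each rating a generated clip on four axes using a 1-to-5 plausibility scale, can be sketched as a simple aggregation. This is a hypothetical illustration: the axis names are paraphrased from the article, and the ratings below are invented, not the study's data.

```python
# Hypothetical sketch of aggregating per-axis plausibility ratings
# in a SurgVeo-style evaluation. Ratings are invented for illustration.
from statistics import mean

AXES = ["visual_appearance", "instrument_use", "tissue_feedback", "surgical_logic"]

# Each surgeon rates one generated clip on every axis
# (1 = implausible, 5 = fully plausible).
ratings = {
    "surgeon_1": {"visual_appearance": 4, "instrument_use": 2, "tissue_feedback": 2, "surgical_logic": 1},
    "surgeon_2": {"visual_appearance": 4, "instrument_use": 2, "tissue_feedback": 1, "surgical_logic": 2},
    "surgeon_3": {"visual_appearance": 3, "instrument_use": 1, "tissue_feedback": 2, "surgical_logic": 2},
    "surgeon_4": {"visual_appearance": 4, "instrument_use": 2, "tissue_feedback": 2, "surgical_logic": 1},
}

def axis_means(ratings):
    """Average each axis across all raters, yielding one score per criterion."""
    return {axis: mean(r[axis] for r in ratings.values()) for axis in AXES}

print(axis_means(ratings))
```

The pattern in this toy data mirrors the study's finding: visual scores stay high while the clinically meaningful axes lag far behind.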
The results were eye-opening. Veo-3 initially produced videos that looked remarkably authentic, with some surgeons even describing the quality as "shockingly clear." However, this initial impression quickly dissolved upon closer inspection. In tests involving abdominal surgery, the model scored a respectable 3.72 out of 5 for visual plausibility after just one second. But as the need for medical accuracy increased, its performance plummeted.
For abdominal procedures, the AI's instrument handling earned a mere 1.78 points, tissue response scored only 1.64, and surgical logic was the lowest at 1.61. It could convincingly create images, but it couldn't replicate the reality of an operating room.
The challenges were even more pronounced with brain surgery footage. From the very first second, Veo-3 struggled with the intricate precision required in neurosurgery. For brain operations, instrument handling dropped to 2.77 points (compared to 3.36 for abdominal), and surgical logic fell to just 1.13 after eight seconds.
The researchers also meticulously analyzed the types of errors. Over 93 percent of the errors were related to medical logic: the AI invented tools, imagined impossible tissue responses, or performed actions that made no clinical sense. Only a small fraction of errors (6.2 percent for abdominal and 2.8 percent for brain surgery) were tied to image quality.
Researchers attempted to provide Veo-3 with more context, such as the specific type of surgery or the exact phase of the procedure. However, these attempts yielded no meaningful or consistent improvement. The team concluded that the real problem wasn't the information provided, but the model's fundamental inability to process and understand it.
So, what does this mean for the future of AI in medicine? The SurgVeo study highlights how far current video AI is from achieving true medical understanding. While future systems could potentially assist in training doctors, help with surgical planning, or even guide procedures, today's models are not yet capable. They can produce videos that appear real, but they lack the essential knowledge to make safe or meaningful decisions.
But here's where it gets controversial... The researchers plan to release the SurgVeo benchmark, inviting other teams to test and improve their models. This could lead to faster progress, but it also raises concerns about the potential for misuse.
Consider this: synthetic AI-generated videos could be used for medical training. Unlike general-purpose robotics, where AI-generated training videos can tolerate some error, in healthcare these kinds of AI "hallucinations" could be dangerous. If a system like Veo-3 generates videos that look plausible but depict medically incorrect procedures, it could inadvertently teach robots or trainees the wrong techniques.
These findings also suggest that the concept of video models as "world models" is still a distant goal. Current systems can imitate how things look and move, but they lack a reliable grasp of physical or anatomical logic. As a result, their videos might seem convincing at a glance, but they can't capture the real logic or cause-and-effect behind surgery.
And this is the part most people miss... While video AI struggles, text-based AI is already making real gains in medicine. In one study, Microsoft's "MAI Diagnostic Orchestrator" delivered diagnostic accuracy four times higher than experienced general practitioners in complex cases, although the study notes some methodological limitations.
What are your thoughts? Do you think the potential benefits of AI in medical training outweigh the risks? Share your opinions in the comments below!