There are more AI health tools than ever—but how well do they work?


Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.

Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated. 

Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers. 

Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.

Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There’s lots of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”

They key there is “third party.” No matter how extensively companies evaluate their own products, it’s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.

OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.” 

Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluations suites—such as Stanford’s MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI’s GPT-5 holds the highest MedHELM score.



Source link

Leave a Reply

Subscribe to Our Newsletter

Get our latest articles delivered straight to your inbox. No spam, we promise.

Recent Reviews


spring-sale-imagery

DeWalt/ZDNET

Spring means lawn and garden prep and DIY projects around the house. And if you’ve been looking for a handy gadget to help you with small repairs and crafts, you can pick up the DeWalt MT21 11-in-1 multitool at Amazon ahead of its Big Spring Sale for 25% off, bringing the price down to $30 (matching the lowest price of the year so far). It also comes with a belt sheath to keep it close by on jobsites.

Also: 10 DIY gadgets I never leave out of my toolkit

The MT21 has a compact design, measuring just 4 inches when fully folded and expanding to 6 inches when the pliers are deployed. The hinged handle is made of durable steel with a rubberized grip in iconic DeWalt yellow and black, adding a bit of visual flair while making the multitool more comfortable to use. Each of the included tools is also made of stainless steel for strength and reliability on jobsites and in the garage.

Also: The best Amazon Spring Sale DeWalt deals

The 11 featured tools include: regular and needlenose pliers, wire cutters, two flathead screwdrivers, a Phillips screwdriver, a file, a can and bottle opener, a saw blade, a straight-edge blade, and an awl tool. Each tool folds into the handle to keep them out of the way until needed and to protect your hands while using the multitool. 

We’re big fans of multitools here at ZDNET, and definitely recommend this highly rated one from DeWalt.

How I rated this deal 

DeWalt is one of the leading names in power tools, and if you’re looking for a handy EDC gadget or just need something for occasional DIY repairs, the MT21 multitool is a great choice. With 11 tools in a single gadget, you can do everything from assembling flat-pack furniture to minor electrical repairs. While not the steepest discount, getting your hands on a high-quality multitool for 25% off is still a great value. That’s why I gave this deal a 3/5 Editor’s rating.

Amazon’s Big Spring Sale runs March 25-31, 2026. 


Show more

Deals are subject to sell out or expire anytime, though ZDNET remains committed to finding, sharing, and updating the best product deals for you to score the best savings. Our team of experts regularly checks in on the deals we share to ensure they are still live and obtainable. We’re sorry if you’ve missed out on this deal, but don’t fret — we’re constantly finding new chances to save and sharing them with you at ZDNET.com


Show more

We aim to deliver the most accurate advice to help you shop smarter. ZDNET offers 33 years of experience, 30 hands-on product reviewers, and 10,000 square feet of lab space to ensure we bring you the best of tech. 

In 2025, we refined our approach to deals, developing a measurable system for sharing savings with readers like you. Our editor’s deal rating badges are affixed to most of our deal content, making it easy to interpret our expertise to help you make the best purchase decision.

At the core of this approach is a percentage-off-based system to classify savings offered on top-tech products, combined with a sliding-scale system based on our team members’ expertise and several factors like frequency, brand or product recognition, and more. The result? Hand-crafted deals chosen specifically for ZDNET readers like you, fully backed by our experts. 

Also: How we rate deals at ZDNET in 2026


Show more





Source link