Gemini Robotics-ER 1.5: High real-world accuracy?

Google has released a preview model, Gemini Robotics-ER 1.5, which it claims specializes in understanding the physical world. This article runs several tests to try out its capabilities.


1. Counting objects

I provide a picture of a tray of cookies and ask the model to count them. The task is straightforward; the only challenge is that some cookies overlap each other.

How many pink monster cookies are there?

The result:

Robotics ER 1.5:

Based on the image, there appear to be 18 pink monster cookies on the tray.

OpenAI GPT-5 mini:

There are 14 pink monster cookies visible in the photo.

xAI Grok 4:

There are 14 pink monster cookies left in the basket.

 

  • Robotics ER is the only model that got the correct answer (18).
  • Grok 4 even misidentifies the tray as a basket.

2. Estimating Distance

I provide a picture of a pathway with one of my feet in the frame. The prompt asks the model to estimate the number of steps between me and the end of the path.

Estimate number of steps to the end of the path

The result:

Robotics ER 1.5:

Based on the image, a reasonable estimate would be approximately 15 to 20 steps to reach the end of the visible path where it goes between the buildings.

OpenAI GPT-5 mini:

About 10–12 steps (roughly 11 steps).

How I estimated:
- I scaled from the shoe in the photo (assumed ~0.28 m long) and counted ~30 shoe‑lengths to the end → distance ≈ 30 × 0.28 m ≈ 8.4 m.

- Using a typical adult step/stride ≈ 0.70–0.80 m, steps ≈ 8.4 / 0.70–0.80 ≈ 10–12 → ~11 steps.

If you want a tighter estimate, measure your shoe length (or your normal step length) and I’ll recalculate.

xAI Grok 4:

Based on the image, the visible path appears to be about 8-10 meters long (using the shoe as a rough scale, assuming it's ~30 cm). With an average adult step length of 0.75 m, that would take approximately 10-13 steps to reach the end near the building entrance.

 

  • The correct answer is 15 (or 16 if including my first step).
  • Robotics ER is the only model whose estimate includes the correct answer, though it also gives the widest range.
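The shoe-scale arithmetic that GPT-5 mini spells out in its answer can be reproduced in a few lines, which also makes it easy to check each model's reported range against the measured step count. This is only a sketch: the function names are my own, and the shoe length, shoe count, and step lengths are the assumptions GPT-5 mini stated, not measurements.

```python
def steps_from_shoe_scale(shoe_lengths, shoe_m, step_m):
    """Convert a distance counted in shoe lengths into a number of steps."""
    distance_m = shoe_lengths * shoe_m
    return distance_m / step_m


def covers(step_range, actual):
    """True if the (low, high) range reported by a model includes the actual count."""
    low, high = step_range
    return low <= actual <= high


# GPT-5 mini's stated assumptions: ~30 shoe lengths, 0.28 m shoe, 0.70-0.80 m step.
low = steps_from_shoe_scale(30, 0.28, 0.80)
high = steps_from_shoe_scale(30, 0.28, 0.70)
print(f"GPT-5 mini's method gives {low:.1f} to {high:.1f} steps")

# Ranges as reported in the answers above, checked against the measured 15 steps.
ACTUAL_STEPS = 15
reported = {
    "Robotics ER 1.5": (15, 20),
    "GPT-5 mini": (10, 12),
    "Grok 4": (10, 13),
}
for model, step_range in reported.items():
    verdict = "includes" if covers(step_range, ACTUAL_STEPS) else "misses"
    print(f"{model}: {step_range[0]}-{step_range[1]} {verdict} {ACTUAL_STEPS}")
```

The arithmetic itself is sound, so the undershoot most likely comes from the inputs: the path length read off the photo, the assumed step length, or both.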

3. Extracting information from a wet surface

I provide a picture of a wheel and ask for the size of the rim. The tyre is slightly wet, which makes the tyre spec on the sidewall harder to read (the spec gives the rim size directly).

How many inches is this rim?

The results:

Robotics ER 1.5:

Based on the numbers visible on the tire sidewall, the rim size for this tire is 18 inches.

The numbers on the tire read "215/55R18", where "18" indicates the rim diameter in inches.

OpenAI GPT-5 mini:

Can you take a close photo of the tire sidewall showing the numbers (or tell me the tire size printed, e.g., `205/55R16`)? 
The number after the `R` is the rim diameter in inches.

xAI Grok 4:

The rim in the image is 17 inches in diameter, based on the tire sidewall marking (P215/55R17) and the wheel design matching a Hyundai Kona Electric.

 

  • Once again, Robotics ER 1.5 is the only one giving the correct answer (18 inches).
  • That is pretty impressive, as I could not read the markings on the sidewall myself, even when I zoomed in on the picture.
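Once the sidewall marking is read correctly, getting the rim size is mechanical: as GPT-5 mini notes, the number after the `R` is the rim diameter in inches. A minimal sketch of that lookup follows. The regular expression and the `rim_inches` helper are my own illustration and only cover common markings such as the ones quoted above, not every tyre size format.

```python
import re

# Common metric tyre marking, e.g. "215/55R18" or "P215/55R17":
# optional service prefix, section width (mm), aspect ratio (%),
# construction letter, rim diameter (inches).
TYRE_SIZE = re.compile(
    r"^(?P<prefix>P|LT)?"
    r"(?P<width>\d{3})/(?P<aspect>\d{2})"
    r"(?P<construction>Z?R)"
    r"(?P<rim>\d{2})$"
)


def rim_inches(marking):
    """Return the rim diameter in inches from a tyre size marking."""
    cleaned = marking.upper().replace(" ", "")
    match = TYRE_SIZE.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognised tyre marking: {marking!r}")
    return int(match.group("rim"))


# The markings reported by Robotics ER 1.5 and Grok 4, plus GPT-5 mini's example.
for marking in ("215/55R18", "P215/55R17", "205/55R16"):
    print(marking, "->", rim_inches(marking), "inches")
```

This shows that Robotics ER 1.5 and Grok 4 differ only in how they read the last two digits on the wet sidewall, not in how they interpret the marking.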
