I just one-shot it with Claude Code (Opus 4.5) using this prompt. It took about 5 mins, and that included it catching itself cheating at first (it drew a line around the boundary of the maze instead of through it), so it added guardrails for that:
```
Create a devenv project that does the following:
- Read the image at maze.jpg
- Write a script that solves the maze in the most optimal way between the mouse and the cheese
- Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate
```
Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
Solution: https://imgur.com/a/bkJloPT

Programs can solve mazes and LLMs can program. That's a different thing completely.
That just seems like an arbitrary limitation. It's like asking someone to answer a math problem but with "no thinking allowed". Like, I guess we can gauge whether a model just _knows all knowable things in the universe_ using that method... but anything of any value that you're gauging in terms of 'intelligence' is really validating its ability to go "outside the scope" of what it actually is (an autocomplete on steroids).
We know there are very simple maze-solving algorithms you could code in a few lines of Python, but no one would claim those constitute intelligence. The difference is between applying intuitive logic and reaching for a predetermined tool.
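For reference, here's roughly what that looks like: a plain breadth-first search over a grid, which finds a shortest path by construction. The grid, start, and goal below are made up for illustration; a real solver would first have to extract this grid from the maze image.

```python
from collections import deque

def solve_maze(grid, start, goal):
    """BFS over a grid where 1 = open cell and 0 = wall.
    Returns a shortest path from start to goal as (row, col) pairs."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent pointers back to reconstruct the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # no route between start and goal

# Toy example: 1 = corridor, 0 = wall; "mouse" top-left, "cheese" bottom-right.
maze = [[1, 0, 1, 1],
        [1, 1, 1, 0],
        [0, 1, 0, 1],
        [1, 1, 1, 1]]
print(solve_maze(maze, (0, 0), (3, 3)))
```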
If you allow tool use, much simpler models can solve it.