Programs can solve mazes and LLMs can program. Those are completely different things.
That just seems like an arbitrary limitation. It's like asking someone to answer a math problem but with "no thinking allowed". Like, I guess we can gauge whether a model just _knows all knowable things in the universe_ using that method... but anything of real value that you're gauging in terms of 'intelligence' is actually going to be validating their ability to go "outside the scope" of what they actually are (an autocomplete on steroids).
We know there are very simple maze-solving algorithms you could code in a few lines of Python, but no one would claim that constitutes intelligence. The difference is between applying intuitive logic and using a predetermined tool.
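For what it's worth, here's roughly the kind of thing I mean: a plain breadth-first search over a grid maze. Just a sketch; the grid-of-strings representation, the `solve_maze` name, and the `'#'`-for-wall convention are all my own framing, not anything standard.

```python
from collections import deque

def solve_maze(maze, start, goal):
    """Return a shortest path from start to goal as a list of (row, col), or None."""
    rows, cols = len(maze), len(maze[0])
    queue = deque([start])
    came_from = {start: None}  # remembers each cell's predecessor
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            # Walk the predecessor links backwards to rebuild the path.
            path, node = [], goal
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] != '#' and (nr, nc) not in came_from):
                came_from[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # no path exists

# Tiny example: 'S' at (0, 0), 'G' at (2, 2), one wall in the middle.
print(solve_maze(["S..", ".#.", "..G"], (0, 0), (2, 2)))
```

Twenty-odd lines, it solves any solvable maze you hand it, and nobody would call it intelligent. It's just mechanically expanding a frontier.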