I think what would be interesting is if it could play the game with vision-only inputs. That would represent a massive leap in multimodal understanding.