
What is the methodology of your benchmark?

On the contrary, I personally think these broad benchmarks are meaningless, and personalized benchmarks are the way to go. A benchmark should answer "How does this model perform for MY use case?" rather than trying to answer "How does this model perform across all coding environments?"

Case in point: I use Elixir, which is not as popular as Python, and it is hit or miss with most of the SOTA models at the top of these benchmarks. Meanwhile, the models in the middle of the benchmarks (like GLM) almost always outperform even the SOTA models from Google / Anthropic on my Elixir work. However, this is relevant only to my use case, and I wouldn't advocate a model for everyone based on my use case alone.
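
For what it's worth, a personal benchmark doesn't have to be elaborate. Here is a rough Python sketch of the idea, not a real harness: every name in it (`pass_rate`, `rank_models`, the fake models) is made up for illustration, and I'm assuming each model can be wrapped as a plain function that takes a prompt and returns text. The checks are whatever "good enough" means for your own work.

```python
from typing import Callable, Dict, List, Tuple

# A task is a prompt plus a check that decides whether a reply is acceptable.
Task = Tuple[str, Callable[[str], bool]]
Model = Callable[[str], str]  # hypothetical wrapper: prompt in, reply text out


def pass_rate(generate: Model, tasks: List[Task]) -> float:
    """Fraction of tasks whose check accepts the model's reply."""
    if not tasks:
        return 0.0
    passed = sum(1 for prompt, check in tasks if check(generate(prompt)))
    return passed / len(tasks)


def rank_models(models: Dict[str, Model], tasks: List[Task]) -> List[Tuple[str, float]]:
    """Score every model on the same personal task set, best first."""
    scores = [(name, pass_rate(generate, tasks)) for name, generate in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Stand-in "models" so the sketch runs without any API access (assumption:
# in practice these would call whatever model endpoints you actually use).
def fake_model_a(prompt: str) -> str:
    return "Enum.map(list, fn x -> x * 2 end)"


def fake_model_b(prompt: str) -> str:
    return "[x * 2 for x in list]"


# One toy task drawn from "my" use case; a real set would have many more.
tasks: List[Task] = [
    ("Double every element of a list in Elixir.", lambda reply: "Enum.map" in reply),
]

ranking = rank_models({"model_a": fake_model_a, "model_b": fake_model_b}, tasks)
print(ranking)
```

The point is that the task list is yours: swap in prompts and checks from your own codebase and the ranking only claims to hold for that workload, which is all I'd want it to claim.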