by Reubend 2 days ago
After trying to understand their method, I think you're right. It doesn't seem like anything I would personally call "diffusion". It's much closer to multi-token prediction (MTP) plus speculative decoding.
Then again, their results with it are great. It would be interesting to benchmark it against standard speculative decoding on a model that already uses MTP.
Yeah, I think it's a super neat way to do MTP. Conceptually it's much simpler and more pleasing than existing methods, especially since it should make scaling `k` easier as models get better. I wish it had been presented as such.
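
For anyone following along, here is a rough sketch of the generic greedy draft-and-verify step that "speculative decoding" usually refers to. It is not the method under discussion, just the textbook idea. I'm assuming `k` above means the number of tokens drafted per step, and `verify_draft`, `target_next`, and `toy_target` are made-up names for illustration.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of k drafted tokens (k = len(draft)).

    Accepts the longest prefix of `draft` that matches the target model's
    own greedy choices, then appends one token from the target: either the
    correction at the first mismatch, or a bonus token if all k matched.
    A real system would score all k positions in one batched forward pass;
    the sequential calls here are only for readability.
    """
    context = list(prefix)
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the target model: counts upward mod 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    print(verify_draft([1, 2, 3], [4, 5, 7, 8], toy_target))
    print(verify_draft([1, 2, 3], [4, 5, 6, 7], toy_target))
```

Each step yields between 1 and `k + 1` tokens, so a larger `k` only pays off when the drafted tokens are usually right. That is presumably why a better model makes scaling `k` more attractive.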