This reminds me a lot of the tricks to turn BERT into a generative model. I guess the causal masking that keeps it to essentially be autoregressive is an important difference though. Kind of best of both worlds.