by a1j9o94 3 days ago

You would only use the base model during training, not at inference. This is a distillation technique.
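
To make that concrete, here is a minimal sketch of the general idea, not the specific method under discussion. It assumes the base model plays the usual "teacher" role and that both models emit logits over the same classes; the function names and the temperature value are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Training-time only: KL(teacher || student) on softened distributions.
    # This is the one place the base (teacher) model's outputs are needed.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def predict(student_logits):
    # Inference: only the student's logits are used, no teacher involved.
    return max(range(len(student_logits)), key=student_logits.__getitem__)
```

The loss is zero when the student matches the teacher and grows as they diverge. Once training is done, `predict` never touches the teacher, which is why the base model is not needed at serving time.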