What are the trade-offs of using next-token-prediction and masked-language-modeling training techniques?
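
To make the comparison concrete, here is a minimal sketch of how the two objectives turn the same token sequence into (input, target) pairs. Everything in it is an illustrative assumption rather than any library's API: the function names, the `[MASK]` symbol, the `IGNORE` marker, and the `mask_prob` parameter are hypothetical, and the masking is simplified to plain replacement with a fixed probability.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not tied to any real tokenizer
IGNORE = None    # hypothetical marker: this position contributes no loss


def next_token_pairs(tokens):
    """Next-token prediction: the target at each position is the following token."""
    inputs = tokens[:-1]
    targets = tokens[1:]
    return inputs, targets


def masked_lm_pairs(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling (simplified): hide some tokens, predict only those."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)     # the model sees the mask symbol here
            targets.append(tok)     # and must recover the original token
        else:
            inputs.append(tok)      # unmasked tokens are visible context
            targets.append(IGNORE)  # and produce no training signal
    return inputs, targets


def supervised_fraction(targets):
    """Fraction of positions that carry a prediction target."""
    if not targets:
        return 0.0
    return sum(t is not IGNORE for t in targets) / len(targets)


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(next_token_pairs(sentence))
    print(masked_lm_pairs(sentence, mask_prob=0.5))
```

In this sketch, next-token prediction yields a target at every input position but only ever conditions on tokens to the left, while the masked variant yields targets only at the masked positions but lets the model see context on both sides of each gap.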