Hi,
Great work on this, and thanks for releasing the training code.
I have a question about how model-specific the long-context training methodology is: is the approach intended to transfer to models other than LLaMA? In particular, I tried adapting the code to Qwen3, but kept running into a variety of errors.
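For reference, this is roughly the kind of change I attempted (a minimal sketch, not my exact diff; the model name and the `rope_scaling` keys follow the usual Hugging Face conventions and are assumptions on my part, not something taken from your repo):

```python
# Sketch of my attempted adaptation: swap the LLaMA checkpoint for a Qwen3
# one and apply the same RoPE scaling through the Hugging Face config.
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen3-8B"  # the Qwen3 checkpoint I tested with

config = AutoConfig.from_pretrained(model_name)
config.max_position_embeddings *= 4  # target 4x the original context length
# LLaMA-style RoPE scaling keys; I am not sure Qwen3 honors the same
# convention, which may itself be part of my problem.
config.rope_scaling = {"rope_type": "linear", "factor": 4.0}

model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
```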
I wanted to check whether these errors are likely due to mistakes or missing adjustments in my changes, in which case I would appreciate any pointers, or whether extending the context length this way is fundamentally incompatible with Qwen3's architecture, for example because of design choices such as grouped-query attention (GQA). Any guidance would be much appreciated.
Best regards,
Amirhossein Abaskohi
PhD Student, UBC