Malpractices in LLM research
https://aclanthology.org/2024.eacl-long.5.pdf
This paper, published at EACL 2024, argues that evaluation results for closed-source models like GPT-4 are severely unreliable because of malpractices in how they are obtained. Those malpractices include:
- Data contamination: leaking the test set to the model itself (via the ChatGPT web interface, which seems to allow the model to learn from user prompts)
- Reproducibility issues:
  - Only half of the papers examined (212) provided a code repository. Funnily and sadly, some papers link to an empty or invalid repository… (Why do you do that, researchers?)
- Evaluation fairness issues:
  - ChatGPT is often treated as the only model worth measuring: many papers don’t bother to compare its performance against other models.
  - No statistical tests are performed on the models’ performance results.
  - Closed-source and open-source models are evaluated on different samples, which makes the comparison unfair.
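
As a rough illustration of the last two fairness points, here is a minimal sketch of what a fair, tested comparison could look like: both models are scored on the same samples, and an exact McNemar test checks whether the accuracy gap could be due to chance. This is not the paper's procedure; the function name `mcnemar_exact` and the toy correctness lists are made up for the example.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-sample correctness.

    correct_a[i] and correct_b[i] say whether model A and model B
    answered sample i correctly. Both lists must cover the SAME samples.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same samples")

    # Only discordant pairs (exactly one model correct) carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a gap

    # Under the null hypothesis, each discordant pair favours either
    # model with probability 0.5 (a binomial with p = 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): 20 shared samples, A wins 8 disagreements, B wins 1.
correct_a = [True] * 8 + [False] * 1 + [True] * 11
correct_b = [False] * 8 + [True] * 1 + [True] * 11

p_value = mcnemar_exact(correct_a, correct_b)
print(f"accuracy A = {sum(correct_a) / len(correct_a):.2f}")
print(f"accuracy B = {sum(correct_b) / len(correct_b):.2f}")
print(f"McNemar exact p-value = {p_value:.4f}")
```

The length check is the important part: if the two models were scored on different subsets, the pairing breaks and the test (like the comparison itself) stops being meaningful.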
To protect the integrity of NLP research, report indirect data leakage at https://leak-llm.github.io/.