Scientific Experiments in Machine Learning
I just made my first paper submission in the second year of my PhD. It is a resource paper releasing a dataset, where the second contribution is the set of insights from empirical studies on the data itself. My labmate did the hard work of leading the annotation process, while I did the not-so-easy job of running experiments to fine-tune LLMs and run inference with them. I learned quite a bit from it.
Fine-tuning LLMs
- From the implementation perspective, VLMs are only slightly different from LLMs. The difference is not significant compared to the differences between different types of LLMs.
  - For the input to the forward pass, in addition to `input_ids` and `attention_mask`, there is also a `pixel_values` tensor (see the sketch after this list).
  - The output is also different, defined entirely by the implementation of the `forward()` function of the model class.
- `Trainer` is a very powerful approach for training and evaluating LLMs. However, it did not work for my specific use case. I wanted to fine-tune a low-rank adapter (LoRA) for LLaVA such that performance on the evaluation set is measured by the NLI score between the generated text and the ground truth. The evaluation metric, by design, must be computed in a `compute_metrics` function. To calculate the NLI score, that function must have access to the generated text. But it does not – only the logits for the FIRST token are available. You may say one can just implement a generation strategy, like greedy decoding, from scratch. However, for LLaVA, that requires the `pixel_values` tensor, which represents the image. And guess what? Only the `input_ids` are available in `compute_metrics`!
  - It is tempting to raise a PR to the `transformers` library to add this feature :) I probably will raise an issue and see what happens.
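
To make both points concrete, here is a minimal sketch, assuming the public `llava-hf/llava-1.5-7b-hf` checkpoint and a hypothetical local image (not necessarily the exact setup from the paper): the processor produces the extra `pixel_values` tensor that the forward pass needs, while `compute_metrics` only ever receives arrays of predictions and label ids.

```python
# A minimal sketch, assuming the public llava-hf/llava-1.5-7b-hf checkpoint
# and an arbitrary local image; argument names may differ across model classes.
from PIL import Image
from transformers import AutoProcessor, EvalPrediction, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # hypothetical image path
prompt = "USER: <image>\nDescribe the picture. ASSISTANT:"

# The processor returns input_ids, attention_mask, AND pixel_values.
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Forward pass: pixel_values is the extra piece a VLM needs; the output
# structure is whatever the model's forward() chooses to return.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss, outputs.logits.shape)


# By contrast, Trainer's compute_metrics only sees prediction and label-id
# arrays -- no generated text and no pixel_values, which is what blocks
# computing an NLI score against generations here.
def compute_metrics(eval_pred: EvalPrediction) -> dict:
    predictions, label_ids = eval_pred.predictions, eval_pred.label_ids
    return {"placeholder_metric": 0.0}  # real NLI scoring would need decoded generations
```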
Organizing experiments
In principle, given an experimental design, one can easily organize the code and data folders neatly. However, in practice, lots of things can go wrong: the experimental design can change, independent variables can be introduced or removed, the dependent variables can change, jobs can fail, there may be bugs in the experiment code, etc. Any of those things, when they happen, mess with the organization of code and data.
Given those challenges, I think a good ML researcher needs to achieve two things with their experiments:
- Reproducibility: Ideally, for each number reported in the paper, both the code and the data behind that number must be saved in a readable way. That becomes tricky when the final result is preceded by not just one program execution but several (e.g., one data processing script, one model training script, another patchy model training script, then an inference script, then a metric generation script).
- Efficiency: the ability to produce results fast. The time taken to produce the results, (strongly) assuming the data is already available, can be decomposed into (1) human programming time and (2) machine running time. While (2) can be trivially optimized via more efficient use of software and some clever algorithmic tricks, (1) is not that easy. Due to the dynamic nature of the requirements for ML code, it may take significant time to modify a codebase to accommodate every new experiment. A quick and dirty way is to duplicate an existing script and modify it just for the new experiment. But that causes code duplication, which in turn significantly harms the controllability (e.g., when a change is needed across many settings, edits have to be made across many files, which is prone to errors) and readability of the codebase (because the codebase becomes ~10 times larger). A sketch of one alternative to such duplication follows this list.
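
For what it's worth, here is a minimal sketch of the kind of setup I have in mind to avoid duplicating scripts: a single entry point parameterized by a per-experiment config, with each run's config and outputs written to its own folder. The flag names and folder layout are hypothetical, not taken from any actual codebase.

```python
# A minimal sketch of one way to avoid script duplication: one entry point
# parameterized by a per-experiment config, with each run's config and
# outputs saved together. Flags and folder layout here are hypothetical.
import argparse
import json
import time
from pathlib import Path


def run_experiment(config: dict) -> dict:
    """Placeholder for the actual training/inference logic."""
    return {"metric": 0.0, "config_name": config.get("name", "unnamed")}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=Path, required=True,
                        help="Path to a JSON file describing one experiment.")
    parser.add_argument("--out-dir", type=Path, default=Path("results"))
    args = parser.parse_args()

    config = json.loads(args.config.read_text())

    # One folder per run keeps the code inputs and data needed to reproduce
    # each reported number in a single, readable place.
    run_dir = args.out_dir / f"{config['name']}_{int(time.time())}"
    run_dir.mkdir(parents=True, exist_ok=True)

    results = run_experiment(config)

    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))


if __name__ == "__main__":
    main()
```

The duplication then moves into small config files, which are cheap to diff and easy to keep next to the numbers they produced.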
As such, there is a reproducibility-efficiency trade-off in developing a codebase for ML experiments. I need more experience to figure out a system that works for me. Coincidentally, at EMNLP this year (which I am going to attend!), a keynote speaker – Percy Liang – works on exactly those two things. He made CodaLab with those two criteria as the selling points. I should probably check it out.