I just finished my first paper submission in the second year of my PhD. It is a resource paper releasing a dataset, with a second contribution being the insights from empirical studies on the data itself. My labmate did the hard work of leading the annotation process, while I did the not-so-easy job of running experiments to fine-tune LLMs and run inference with them. I learned quite a bit from it.

Fine-tuning LLMs

  • From the implementation perspective, VLMs are only slightly different from LLMs; the gap is smaller than the differences between different families of LLMs.
    • For the input to the forward pass, in addition to input_ids and attention_mask, there is also a pixel_values tensor (see the first sketch after this list).
    • The output is also different, defined entirely by the implementation of the forward() function of the model class.
  • Trainer is a very powerful approach for training and evaluating LLMs. However, it did not work for my specific use case. I wanted to fine-tune a low-rank adapter (LoRA) for LLaVA such that performance on the evaluation set is measured by the NLI score between the generated text and the ground truth. The evaluation metric, by design, must be computed in a compute_metrics function. To calculate the NLI score, that function needs access to the generated text. But it does not have it – only the logits for the FIRST token are available. You may say one could just implement a generation strategy, like greedy decoding, from scratch. However, for LLaVA, that requires pixel_values, which represents the image. And guess what? Only the input_ids are available to compute_metrics! (One workaround, sketched after this list, is to run evaluation in a separate loop outside Trainer.)
    • It is tempting to open a PR against the transformers library to add this feature :) I will probably open an issue first and see what happens.
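
To make the first point concrete, below is a minimal sketch of a LLaVA forward pass through transformers. The model id, image path, and prompt template are illustrative choices on my part, not the setup from the paper.

```python
# Minimal sketch: the only input-side difference from a plain LLM is the extra
# pixel_values tensor produced by the processor. Model id, image, and prompt are illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
print(inputs.keys())  # input_ids, attention_mask, pixel_values

# The output structure is whatever the model class's forward() returns
# (for LLaVA: logits, past_key_values, ...).
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)
```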
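
As for the Trainer limitation, one way around it (a sketch under my own assumptions, not the project's actual code) is to run evaluation outside Trainer: call generate() yourself, so pixel_values is available, then score the decoded text. Here nli_score and the example fields ("image", "prompt", "reference") are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def evaluate_with_generation(model, processor, eval_examples, nli_score, device="cuda"):
    # eval_examples is assumed to be an iterable of dicts with "image", "prompt",
    # and "reference" keys; nli_score is a placeholder for any text-pair metric.
    # The model is assumed to already live on `device`.
    model.eval()
    scores = []
    for example in eval_examples:
        inputs = processor(images=example["image"], text=example["prompt"],
                           return_tensors="pt").to(device)
        generated = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        # generate() returns prompt + continuation, so strip the prompt tokens first.
        new_tokens = generated[:, inputs["input_ids"].shape[1]:]
        text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
        scores.append(nli_score(text, example["reference"]))
    return sum(scores) / len(scores)
```

This sidesteps compute_metrics entirely; the cost is reimplementing the evaluation bookkeeping (batching, logging, checkpoint selection) that Trainer would otherwise handle.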

Organizing experiments

In principle, given an experimental design, one can organize the code and data folders neatly. In practice, however, lots of things go wrong: the experimental design changes, independent variables get added or removed, the dependent variables change, jobs fail, the experiment code has bugs, etc. Any of these, when it happens, messes up the organization of code and data.

Given those challenges, I think a good ML researcher needs to achieve two things with their experiments:

  • Reproducibility: Ideally, for each number reported in the paper, both the code and the data behind that number should be saved in a readable way. This becomes tricky when the final result is produced not by a single program execution but by several (e.g., a data processing script, a model training script, another patchy model training script, then an inference script, then a metric computation script).
  • Efficiency: the ability to produce results fast. The time taken to produce results, (strongly) assuming the data is already available, can be decomposed into (1) human programming time and (2) machine running time. While (2) can be trivially optimized via more efficient use of software and some clever algorithmic tricks, (1) is not that easy. Due to the dynamic nature of the requirements for ML code, modifying a codebase to accommodate every new experiment can take significant time. A quick and dirty way is to duplicate an existing script and modify it just for the new experiment. But that causes code duplication, which in turn significantly harms the controllability of the codebase (when a change is needed across many settings, it has to be made in many files, which is prone to errors) and its readability (because the codebase becomes ~10 times larger). One lightweight pattern that addresses both concerns is sketched below.
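
The sketch below is one possible pattern, not what we actually used for the paper: a single parameterized entry point (so a new experiment is a new set of flags rather than a duplicated script), with each run writing its own config, git commit, and results into a dedicated directory so that any reported number can be traced back to exactly one run. All names and flags here are hypothetical.

```python
# Hypothetical experiment entry point: argparse flags instead of duplicated scripts,
# and a per-run directory recording everything needed to reproduce a number.
import argparse, json, subprocess, time
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="llava-hf/llava-1.5-7b-hf")
    parser.add_argument("--lora_rank", type=int, default=8)
    parser.add_argument("--runs_dir", default="runs")
    args = parser.parse_args()

    # One directory per run, named by timestamp.
    run_dir = Path(args.runs_dir) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # Record hyperparameters and the exact code version used.
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
    config = {**vars(args), "git_commit": commit}
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))

    # ... run the actual training / inference / metric computation here ...
    results = {"nli_score": None}  # placeholder for whatever the experiment produces
    (run_dir / "results.json").write_text(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()
```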

As such, there is a reproducibility-efficiency trade-off in developing a codebase for ML experiments. I need more experience in order to figure out a system that works for myself. Coincidentally, in EMNLP this year (that I am going to attend!), a keynote speaker – Percy Liang – works on those two things. He made CodaLab with those two criteria as the selling points. I should probably check it out.