The state of NLP: a lens via EMNLP 2024
EMNLP is one of the world’s three biggest conferences in the CL/NLP fields (along with ACL and NAACL). This year, I had a chance to attend EMNLP[^1], joining four thousand other participants and witnessing over one thousand publications. While attending, I took some “field notes” about the conference, which speak to the current state of NLP research[^2].
[Photo: Opening ceremony of EMNLP 2024]
What is NLP?
Let’s talk about the current meaning of the phrase “NLP” first. We are at a time when “NLP” is a popular phrase, and when a phrase gets used a lot, its meaning shifts.
Natural Language Processing (NLP) is often used interchangeably with Computational Linguistics (CL). However, they are not the same thing. Given that linguistics is “the study of languages”, CL is the study of languages using computational methods. Here, both the subject of study (languages) and the methodology (computation) are defined. Meanwhile, NLP is the study of methods to computationally process languages. As such, NLP’s subject of study is actually the methods that process languages.
But:
- What does it mean to “process” languages in the NLP sense?
- Isn’t that “processing” one of the things we study when studying languages, which is the definition of CL?
- If so, is NLP a subfield of CL?
These questions cannot be answered from the literal words anymore. Instead, one must look at how the two research communities implicitly define them. To answer this, recall that both fields are sciences about languages. At the same time, recall that science has four goals with respect to its subject: to describe, explain, predict, and control. Based on this framework and my few years of observing these fields, I would argue that, in 2024, CL is about describing and explaining languages, while NLP is about predicting and controlling language-related phenomena and artifacts.
Characterizing EMNLP 2024
The plenary events of the conference this year all addressed LLMs:
- The panel discussion this year was entitled “Increasing significance of NLP in the age of Large Language Models”[^3]. The overall sentiment in the panel was that LLMs are impressive, but they are far from acing the entire NLP field. The panelists urged students to focus on the important problems and to think critically about their research methods, independently of LLMs. For example, panelist Sasha recommended that students build a strong technical foundation in PyTorch programming and in statistical approaches to language, as in Jurafsky and Martin’s SLP3 book. He argued that there should be a cornerstone textbook on which all new students are trained. I agree with this: we need to embrace a richer set of common technical vocabulary to talk about NLP research; otherwise, things will be as superficial as prompt engineering and as absurd as the GPU arms race.
- The first keynote, by Percy Liang, pointed out the harms of closed-source LLMs, mostly from private companies. He defined three levels of openness in LLM releases: black-box API, open weights, and open-sourced code and “data”. The more closed the LLM, the less useful (and reliable) it is for scientific research[^4].
- The second keynote, by Anca Dragan, brought a fresh perspective: RL is dangerous if used carelessly. That is relevant to NLP because RLHF is applied more and more, often treated as a patch for all “alignment” problems in LLMs.
- The third keynote, by Tom Griffiths, discussed two Bayesian ways to think about LLMs, which I found quite forward-looking. The first is to view a forward pass as Bayesian inference, where the prompt is the conditioning evidence and the LLM’s knowledge is the likelihood model; this view has been tested in multiple scenarios and suggests more concrete expectations about LLM behavior (sketched below). The second way is something I couldn’t catch.
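To make the first view concrete, here is the formulation as I understood it, written in my own notation (a sketch of my paraphrase, not necessarily the keynote’s exact formalization): the prompt acts as evidence about some latent task or “concept”, over which the model’s pretraining knowledge defines a prior, and generation marginalizes over the resulting posterior.

```latex
% My paraphrase of the "forward pass as Bayesian inference" view.
% Notation is mine (hypothetical), not necessarily the keynote's:
%   x      -- the prompt (conditioning evidence)
%   y      -- the generated continuation
%   \theta -- a latent task/"concept" with a pretraining-induced prior
\[
  p(y \mid x) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid x)\, d\theta,
  \qquad
  p(\theta \mid x) \;\propto\; p(x \mid \theta)\, p(\theta).
\]
```

Under this reading, informal claims like “the model inferred the task from the prompt” become statements about the posterior p(θ | x), which is what makes the view testable.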
The keynotes and panels usually feature thought leaders doing things in a visionary way. Venturing into the poster sessions and chatting with fellow participants allowed me to see directly what current NLP research practices look like. Below are some key highlights:
- 😯 EMNLP is more about NLP than CL. EMNLP, as expected from its name (i.e., Empirical Methods in NLP), is an NLP venue. As a participant at this year’s EMNLP, I observed that the majority of the work presented was indeed NLP rather than CL: mostly proposed methods for language tasks and datasets to evaluate those methods.
- 🥳 Performance on traditional NLP tasks has been enhanced a lot by LLMs. NLP has traditional tasks that sound very geeky, like coreference resolution, sentiment analysis, and dependency parsing. They were the main focus of NLP before the deep-learning era that started around 2010. But as of 2024, all of them have seen significant performance boosts thanks to LLMs (besides large-scale datasets and compute). I think this is a nice benefit of LLMs being able to produce natural-sounding text in various settings. But I am wary because these boosts are only with respect to current measures, which have their own limitations. For example, LLMs are widely known to memorize their training text, which can be a confounding factor behind the boost (see the contamination-check sketch after this list).
- 😕 Research questions in EMNLP are homogeneously LLM-centric.
- The popular questions being answered right now are:
- Have traditional tasks actually been solved by LLMs?
- What great new things have LLMs enabled us to “do”?
- How silly are LLMs?
- What should we do next, in an LLM-driven way? (One of the five Best Paper Awards here went to a paper proposing a novel image-translation task whose novelty is demonstrated via VLMs’ poor performance.)
- These are all valid questions, but the fact that the entire NLP research community focuses on such a narrow set of them is concerning.
- This monotony in research findings is bad. To be fair, there are legitimate reasons to study LLMs: they seem to be the best tool in the NLP toolkit for solving NLP problems. However, by my earlier definition, NLP is about methods for language tasks; it is not just about LLMs. Studying only LLMs in NLP is like cooking your favorite meal: it feels good and is beneficial occasionally, but doing it excessively means missing out on the entire cuisine.
- Why is this the case? I think it is because of three things: the AI hype from companies, the failure of the review process to control the quality of papers, and the lack of a prestigious venue dedicated to language modeling (CoLM only just started this year).
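The memorization worry above is easy to state but rarely operationalized. Below is a minimal sketch of the kind of contamination check I have in mind: flag test examples whose word n-grams overlap heavily with a pretraining corpus. The function names, the n-gram length, and the threshold are all hypothetical choices of mine, not taken from any paper at the conference.

```python
# Minimal (hypothetical) contamination check: flag test examples whose
# word n-grams overlap heavily with a pretraining corpus.
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example: str, corpus_ngrams: Set[Tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the example's n-grams that also appear in the corpus."""
    example_ngrams = ngrams(example, n)
    if not example_ngrams:  # example shorter than n words
        return 0.0
    return len(example_ngrams & corpus_ngrams) / len(example_ngrams)


def flag_contaminated(test_set: Iterable[str],
                      corpus_ngrams: Set[Tuple[str, ...]],
                      threshold: float = 0.5) -> List[str]:
    """Return test examples whose overlap with the corpus exceeds the threshold."""
    return [ex for ex in test_set if overlap_ratio(ex, corpus_ngrams) > threshold]
```

In practice, `corpus_ngrams` would be built by streaming a pretraining dump (or approximated with a Bloom filter), and one would report the full distribution of overlap scores alongside the task metric rather than a single hard flag.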
Personal remarks
As a new student in NLP, I feel puzzled about the state of the field right now. NLP publications, even at such a top-tier conference, are homogeneous. The collective picture the papers create is not pretty: it does not convey a sense of excitement about breakthroughs, or of collective progress on some grand challenges.
I even feel that the quality of the papers is not that high. Many times, a paper presents neither a new method nor significant results. Perhaps, when these conferences scale up to thousands of papers, some aspect of quality control is overlooked.
Attending ICAPS 2024 was a better experience than EMNLP 2024. ICAPS is smaller, with only around 200 papers. The community is of a manageable size, so it is easier to get to know what everyone is doing. That is important, because having a complete picture of a field prevents FOMO and creates a sense of belonging for the participants.
[^1]: I presented our survey on computational meme understanding.

[^2]: EMNLP, ACL, and NAACL papers all go through the same review system, ACL Rolling Review (ARR). Papers at EMNLP are, therefore, a uniformly random sample representative of contemporary NLP research.

[^3]: The EMNLP 2024 panel was more insightful than the ICAPS 2024 one. The EMNLP moderator did a more elegant job, and the EMNLP panelists also seemed to give deeper answers.

[^4]: During the talk, Percy aggressively advertised his group’s work, which doesn’t seem to effectively support his argument.