Today I read more about coreference resolution and started to feel that this problem is one of the frontier problems in NLP, where solving it would make AI much more intelligent. The reason is that, while language models have been good with generating sensible texts, classifying texts, and so on, there are still things that seeems challenging.

For example, in the sentence ‘He cannot put the trophy in the suitcase because it is too big’, to know if ‘it’ refers to ‘the trophy’ or ‘the suitcase’ requires several things. Firstly, one must know that someone is trying to ‘put’ something (A) into another thing (B), so that A should be smaller than B. This is ‘commonsense’ reasoning, or I would say, physical knowledge. Therefore, if the action fails (‘cannot’) because something is ‘too big’, that thing should be A, which is ‘the trophy’.

This task is an effort to formalize the expectation for AI to be able to talk about anything with humans. Maybe such an AI should be multimodal because humans require multiple senses to make sense of the world. Those modalities could be images, sound, smell, taste, and touching sensation (seems like the first 2 account for most of our information). So where is NLP? Maybe there is another modality for it, which is thoughts and emotions inside human’s head expressed via texts everywhere. Indeed, there is some weak mapping between language and thoughts, as formalized by Saussure’s signifier and signified. So NLP is essentially an information processing problem on a abstract modality called thoughts that can possibly contains almost all of human’s logic.

Back to Coref. I like this formalization because it moves away from statistically trying to make a language model says something exactly like the internet, to explicitly making correct reasoning about the real world via language. What an interesting challenge to think about.