First of all, let me explain what “hapax legomena” is: it refers to words (and, by extension, concepts) that occurred just once throughout an entire corpus of text. An example is the word “hebenon”, occurring just once within Shakespeare’s Hamlet. Therefore, “hebenon” is a hapax legomenon. The “hapax legomenon” concept itself is a kind of hapax legomenon, IMO.
According to Wikipedia, hapax legomena are generally discarded from NLP as they hold “little value for computational techniques”. By extension, the same applies to LLMs, I guess.
While “hapax legomena” originally refers to words/tokens, I’m extending it to entire concepts, described by these extremely unknown words.
I am a curious mind, actively seeking knowledge, and I’m constantly trying to learn a myriad of “random” topics across the many fields of human knowledge, especially rare/unknown concepts (that’s how I learnt about “hapax legomena”, for example). I use three LLMs on a daily basis (GPT-3, LLama and Gemini), expecting to get to know about words, historical/mythological figures and concepts unknown to me, lost in the vastness of human knowledge, but I now know, according to Wikipedia, that general LLMs won’t point me anything “obscure” enough.
This leads me to wonder: are there LLMs and/or NLP models/datasets that do not discard hapax? Are there LLMs that favor less frequent data over more frequent data?
My intuition:
So I don’t think this approach will help you a lot even for finding words and phrases. And everything I’ve said can be extended to semantic noise too, so your extended question also seems a hopeless endeavour when approached specifically with LLMs or big data analysis of text.