Discussion on comparison with previous work and citation?

#1
by JeremiahZ - opened

Various experiment designs and results in this blog bear striking similarities to an earlier EMNLP 2024 Findings paper, Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia, in which the authors replace the GPT-2 tokenizer with a synthetic base-100 tokenizer. That paper already shows that 1-digit tokenizers perform best, and also analyses failure patterns in length extrapolation. However, this blog neither cites the paper nor discusses its relationship to previous work. Could the team please address this?

Hugging Face org

Hi, thank you for bringing this to our attention.
This space is just a blog post and does not purport to be a full scientific paper, but we do cite the previous work that was a strong inspiration for our experimental setup, namely https://arxiv.org/pdf/2402.14903 (which, incidentally, is not cited in your own paper).
While your paper is very interesting work, its focus seems narrower than that of the blog post, which additionally compares R2L and L2R tokenization (from a quick read, this comparison appears to be absent from your paper) and examines how training a tokenizer with, for example, BPE arbitrarily assigns dedicated tokens to some numbers but not to others.
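To make that BPE point concrete, here is a minimal sketch (not part of the original blog post or this reply) that inspects how the publicly released GPT-2 BPE tokenizer splits a few three-digit numbers; it assumes the Hugging Face `transformers` library is installed, and the specific numbers are chosen purely for illustration.

```python
# Minimal sketch: probe which numbers the GPT-2 BPE vocabulary happens to
# cover with a single token. The example numbers are arbitrary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for number in ["200", "250", "327", "831", "999"]:
    pieces = tok.tokenize(number)
    print(f"{number:>4} -> {pieces} ({len(pieces)} token(s))")
```

Running this typically shows that some numbers come back as a single token while others are split into two or three pieces, which is the arbitrariness referred to above.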
Additionally, your paper was published on arXiv in late September, around the same time we were already communicating publicly about experiments that made it into the final version of the blog post (https://x.com/garrethleee/status/1853870656506798454), which itself was published less than a month later. It seems a stretch to call the (again, very interesting) paper you mention an "earlier" paper rather than concurrent work, but I am happy to add a citation where it might make sense if you feel strongly about this.
Best
