Could you clarify the Zipf weighting from https://huggingface.co/blog/Pringled/model2vec? The post says, in the "Zipf" section:
As we take a simple mean over tokens in the space, it is important that the vectors are weighted correctly. Normally, a sentence transformer would be there to correctly weight all the tokens for us given the context, but we don't have that luxury any more. Intuitively, we would like to use something like Inverse Document Frequency (IDF) to down-weight very frequent or uninteresting words. But we don't have access to a corpus over which to compute document frequencies.
To overcome this, we opt to use a well-known principle from language sciences, which is that, given a frequency ranked list, the frequency of the items in that list follows a power law distribution. This is called Zipf's law. So, if we take the assumption that a vocabulary is ranked by frequency, we can accurately down-weight really frequent items without needing to have access to actual frequencies. As tokenizer vocabularies are sorted by frequency, we already have access to a ranked list, so this optimization can be applied without any additional work.
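(As I understand it, Zipf's law here means that the frequency of the item at rank r is roughly proportional to 1 / r^s with s close to 1, so the rank alone is enough to derive a down-weighting.)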
So, for a hypothetical input of token vectors
[[0.2, 0.5, 0.7], [1.2, 0.9, 0.2], [0.4, 0.3, 0.2], [1.3, 2.4, 3.2]],
my understanding is that you:
1. Sort the input by each vector's norm, so you get
[[0.4, 0.3, 0.2], [0.2, 0.5, 0.7], [1.2, 0.9, 0.2], [1.3, 2.4, 3.2]]
2. Divide each vector by its norm:
[[0.4, 0.3, 0.2]/n1, [0.2, 0.5, 0.7]/n2, [1.2, 0.9, 0.2]/n3, [1.3, 2.4, 3.2]/n4]
3. Take the final embedding as the mean of these down-weighted vectors:
([0.4, 0.3, 0.2]/n1 + [0.2, 0.5, 0.7]/n2 + [1.2, 0.9, 0.2]/n3 + [1.3, 2.4, 3.2]/n4) / 4

Is that the correct algorithm?
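In code, my understanding of these three steps would be something like the sketch below (just my interpretation, not code from model2vec; the function name is mine):

```python
import numpy as np

def embed_by_norm_weighting(token_vectors: np.ndarray) -> np.ndarray:
    """Steps 1-3 as I understand them: sort by norm, normalize, average."""
    norms = np.linalg.norm(token_vectors, axis=1)      # n1..n4
    order = np.argsort(norms)                          # step 1: sort by vector norm
    sorted_vectors = token_vectors[order]
    sorted_norms = norms[order]
    weighted = sorted_vectors / sorted_norms[:, None]  # step 2: divide each vector by its norm
    return weighted.mean(axis=0)                       # step 3: mean of the down-weighted vectors

tokens = np.array([[0.2, 0.5, 0.7],
                   [1.2, 0.9, 0.2],
                   [0.4, 0.3, 0.2],
                   [1.3, 2.4, 3.2]])
print(embed_by_norm_weighting(tokens))
```

(One thing I notice writing it out: the sort in step 1 does not change the mean, so I am probably missing where the frequency ranking actually enters.)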