I have been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model's weights are then used to encode the part to be recommended as a vector, which is written out into a search index, in this case OpenSearch. At inference time, the same model is used to encode the input string, and OpenSearch's vector search is used to find and recommend the nearest string to the input in vector space.
My dataset consisted of about 10,000 input-output pairs. I split it into a 90% training set (approx 9,000 pairs) and a 10% test set (approx 1,000 pairs). I chose the sentence-transformers/all-MiniLM-L6-v2 model, a good pre-trained general-purpose model for vector search in its own right, which maps pairs into a dense 384-dimensional space. Sentence Transformer models use Contrastive Learning, which means I need both positive and negative pairs to train it, but my training set is all positive pairs by definition. Rather than try to generate negative pairs on my own, I used the built-in MultipleNegativesRankingLoss (MNR), which takes positive pairs and generates negative pairs by sampling from the batches. I trained for 10 epochs, using the AdamW optimizer (the default) with learning rate 2e-5 (also the default), saving the best checkpoint (based on similarity on the validation set).
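As a rough sketch of this setup, the fine-tuning loop with the sentence-transformers fit API might look like the code below; the variable names, batch size, warmup steps, evaluator choice, and output path are my assumptions for illustration, not details from the original run.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# train_pairs / val_pairs: lists of (input_str, output_str) positive pairs (assumed names)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_examples = [InputExample(texts=[inp, out]) for inp, out in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)  # batch size assumed

# MNR loss treats the other outputs in each batch as in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

# validation evaluator: each input's own output is its only relevant document
queries = {str(i): inp for i, (inp, _) in enumerate(val_pairs)}
corpus = {str(i): out for i, (_, out) in enumerate(val_pairs)}
relevant_docs = {str(i): {str(i)} for i in range(len(val_pairs))}
val_evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=val_evaluator,
    epochs=10,
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,                       # assumed
    output_path="models/minilm-finetuned",  # assumed; best checkpoint is saved here
    save_best_model=True,
)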
To evaluate, I generated the top-100 nearest neighbors for each input of the test set pairs, and then computed Recall @k and MRR (Mean Reciprocal Rank) @k for k = 1, 3, 5, 10, 20, 50 and 100. For recall, I scored a match @k as successful if the output of the pair appeared within the top k nearest neighbors returned from the vector search. This score is then averaged across all 1,000 test set pairs for each value of k. MRR is similar, except that the score for each test set pair and k is divided by the (1-based) position at which the match occurred (and is 0 if there is no match).
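As a concrete illustration of these two metrics, a small helper along the following lines (names and data layout assumed) computes both from the ranked neighbor lists.

import numpy as np

def recall_and_mrr_at_k(ranked_lists, expected_outputs, k):
    # ranked_lists[i]: neighbors for test input i, best first
    # expected_outputs[i]: the true output string for that input
    recalls, reciprocal_ranks = [], []
    for ranked, target in zip(ranked_lists, expected_outputs):
        top_k = ranked[:k]
        if target in top_k:
            recalls.append(1.0)
            reciprocal_ranks.append(1.0 / (top_k.index(target) + 1))  # 1-based rank
        else:
            recalls.append(0.0)
            reciprocal_ranks.append(0.0)
    return np.mean(recalls), np.mean(reciprocal_ranks)

# e.g. recall_at_10, mrr_at_10 = recall_and_mrr_at_k(neighbors, targets, k=10)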
The baseline for the experiment was computed using an index of encodings of the input part created using the stock all-MiniLM-L6-v2 model.
I had also recently read Jo Kristian Bergum's blog posts Billion Scale Vector Search with Vespa, parts one and two, and more recently Matryoshka Binary Vectors: Slash Vector Search Costs with Vespa on the Vespa blog, where he compares the performance of vectors of different storage types, among other things. Vespa allows for many different storage types, but I am using OpenSearch, which offers support for float (float32), byte (int8) and binary (bool) storage types. I was curious to see (a) if I could replace my float based vectors with these, and (b) if so, how their performance would compare. That is what this post is about.
The index mappings for the float, byte and binary vector fields are shown below. These need to be set during index creation.
Float vector
{
    "type": "knn_vector",
    "dimension": 384,
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}
Byte vector
{
    "type": "knn_vector",
    "dimension": 384,
    "data_type": "byte",
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}
Binary vector
{
    "type": "binary",
    "doc_values": "true"
}
To generate the vectors, I used the fine-tuned version of the all-MiniLM-L6-v2 model as my encoder, and post-processed the float32 vectors returned from the encoder into int8 and binary using the following functions.
import binascii

import numpy as np

def convert_to_bytes(arr: np.ndarray):
    # scale the float values by 128 and quantize to int8
    return np.floor(arr * 128).astype(np.int8)

def convert_to_bits(arr: np.ndarray):
    # binarize each dimension (1 if positive, else 0), pack into bytes,
    # and return one hex string per input vector
    bits = np.packbits(
        np.where(arr > 0, 1, 0)).astype(np.int8)
    arr_b = bits.reshape(arr.shape[0], -1)
    hex_bits = []
    for row in arr_b:
        hex_bits.append(str(binascii.hexlify(row), "utf-8"))
    return hex_bits
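Putting the pieces together, the encoding and conversion step might look something like this; the model path and variable names are assumptions carried over from the earlier sketches.

from sentence_transformers import SentenceTransformer

# assumed path to the fine-tuned checkpoint from the training step
model = SentenceTransformer("models/minilm-finetuned")

# texts: the ~10,000 strings to be recommended (assumed variable)
float_vecs = model.encode(texts)            # (n, 384) float32 matrix
byte_vecs = convert_to_bytes(float_vecs)    # (n, 384) int8 matrix
binary_vecs = convert_to_bits(float_vecs)   # list of n hex strings (48 bytes = 384 bits each)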
At the end of this indexing process, I had three separate indexes of roughly 10,000 records each: one with one part of the pair encoded as a float32 vector, another encoded as a byte (int8) vector, and the last as a binary vector. To give you a rough idea of the storage requirements (rough because there are fields other than the vectors in each index), the sizes reported by /_cat/indices are shown below.
Vector Storage Type | Index Size |
---|---|
float | 184.9 MB |
byte | 39.6 MB |
binary | 15.6 MB |
On the query side, I use the following Script Score queries, as described in the Byte-quantized vectors in OpenSearch blog post and the Exact k-NN with scoring script documentation pages. The queries are all scoring scripts as shown below. The queries for float and byte vectors are the same; the only difference is that the float_vector is quantized down to int8 in the second case.
Float and byte vector
{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{float_or_byte_vector}",
                "space_type": "cosinesimil"
            }
        }
    }
}
Binary vector
{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{binary_vector}",
                "space_type": "hammingbit"
            }
        }
    }
}
The charts below show the Recall and MRR @k for the various values of k described above.
The first thing to note is that fine-tuning helps, or at least it helped a lot in this case, probably because the notion of similarity I was working with is more nuanced than cosine similarity. With respect to float vectors versus byte vectors, float vectors have a slight edge, as you can see from the charts below (if you look really hard, you can also see the float vector (orange line) and byte vector (green line) almost overlaid on each other). While binary vectors do not perform as well, they still perform better than the baseline, and they are much faster, so they can be useful as the first stage of a two-stage retrieval pipeline.
[Charts: Mean Reciprocal Rank @k and Recall @k for the baseline and the float, byte and binary vector indexes]
In terms of response time, binary vectors are the fastest, followed by byte vectors, followed by float vectors. I measured the response time for each of the 1,000 test set queries when retrieving the 100 nearest neighbors from OpenSearch, then calculated the mean and standard deviation for each vector storage type. The results are shown below, followed by a sketch of the measurement loop.
Vector Storage Type | Mean Response Time | Standard Deviation |
---|---|---|
float | 0.382 s | 0.097 s |
byte | 0.231 s | 0.050 s |
binary | 0.176 s | 0.048 s |
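The measurement loop itself was straightforward; something along the lines of the sketch below (reusing the hypothetical client, field names, and helpers from the earlier sketches) reproduces it.

import time

import numpy as np

# test_query_vecs: encoded test set inputs (assumed variable)
timings = []
for query_vec in test_query_vecs:
    body = knn_script_query("output_vec", query_vec.tolist(), "cosinesimil", k=100)
    start = time.perf_counter()
    client.search(index="pairs-float", body=body)
    timings.append(time.perf_counter() - start)

print(f"mean: {np.mean(timings):.3f} s, std: {np.std(timings):.3f} s")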
I originally brought this up in the Relevance Slack channel because I had made some mistakes in my evaluation and was getting results that did not agree with common sense. Thanks to all my “non-work colleagues” on the channel who helped me out by validating my observations and being sounding boards, which eventually helped me find the errors in my analysis. In any case, I figured that the final results may be useful to the broader community of vector search enthusiasts and professionals, so I am sharing this. Hope you find it useful.