
Vector Index (ANN) Plugins

Weaviate stores data in a vector-first manner. Vectors are indexed with an approximate nearest neighbor (ANN) algorithm, currently HNSW. Because Weaviate's vector indexing is pluggable, other ANN methods could be used instead of HNSW. Stay tuned for updates on the software.

Introduction

Weaviate’s vector-first storage system handles all storage operations through a pluggable vector index. Storing data in a vector-first manner not only allows for semantic or context-based search, but also makes it possible to store very large amounts of data without degrading performance (assuming the setup is scaled well horizontally or has sufficient shards for the indices).

What is a vector?

A vector is a long list of numbers. A data object is stored by choosing the numbers in its vector so that they are specific to that object.

Why index data as vectors?

A long list of numbers does not carry any meaning by itself. But if the numbers are chosen so that the distance to other vectors reflects how semantically similar the data object is to the objects those vectors represent, then the vector contains information about the data object’s meaning and its relation to other data.

To make this concept more tangible, think of vectors as coordinates in an n-dimensional space. For example, we can represent words in a 2-dimensional space. If you use an algorithm that has learned the relations between words, or their co-occurrence statistics, from a corpus (like GloVe), then single words can be given coordinates (vectors) according to their similarity to other words. These algorithms are powered by Machine Learning and Natural Language Processing concepts. The picture below shows a simplified version of this concept. The words Apple and Banana are close to each other: the distance between those words, given by the distance between their vectors, is small. But these two fruits are further away from the words Newspaper and Magazine.

2D Vectors visualization
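
The idea of "close" and "far" words can be sketched in a few lines of Python. This is only an illustration: the 2-dimensional coordinates below are made up by hand, not produced by GloVe or any Weaviate module, and Euclidean distance is just one possible distance measure.

```python
import math

# Hypothetical 2-D coordinates, chosen by hand for illustration only.
# A trained model would produce vectors with many more dimensions.
word_vectors = {
    "apple": (1.0, 1.2),
    "banana": (1.3, 0.9),
    "newspaper": (6.0, 5.5),
    "magazine": (6.4, 5.1),
}


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_between(word_a, word_b):
    """Distance between the vectors assigned to two words."""
    return euclidean_distance(word_vectors[word_a], word_vectors[word_b])


# Print the distance for a few word pairs.
for pair in [("apple", "banana"), ("newspaper", "magazine"), ("apple", "newspaper")]:
    print(pair, round(distance_between(*pair), 2))
```

With these made-up coordinates, the two fruits end up much closer to each other than either is to Newspaper or Magazine, which is the property the picture illustrates.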

Another way to think of this is how products are placed in a supermarket. You’d expect to find Apples close to Bananas, because they are both fruits. But when you are searching for a Magazine, you would move away from the Apples and Bananas, towards the aisle with, for example, Newspapers. This is how the semantics of concepts can be stored in Weaviate as well, depending on the module you’re using to calculate the numbers in the vectors. Not only words or text can be indexed as vectors, but also images, video, DNA sequences, etc. Read more about which model to use here.

Supermarket map visualization

How to choose the right vector index plugin

HNSW is the first vector index plugin Weaviate supports, and it is also the default vector index type. HNSW is typically very fast at query time, but more costly when it comes to building the index (adding data with vectors). If your use case values fast data upload over very fast query time and high scalability, then another vector index type may be a better fit (e.g. Spotify’s Annoy). If you want to contribute a new index type, you can always contact us or make a pull request to Weaviate and build your own index type. Stay tuned for updates!
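
To see why an index is worth its build cost, it helps to look at search without one. The sketch below is an exact (brute-force) nearest neighbor search that compares the query to every stored vector. It is not how Weaviate or HNSW works internally; it is the baseline that ANN indexes avoid by doing extra work when data is added. The function name `exact_nearest` and the sample vectors are hypothetical.

```python
import heapq
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_nearest(query, vectors, k=2):
    """Return the k closest (distance, key) pairs by scanning every vector.

    Nothing is prepared ahead of time, so adding data is cheap, but every
    query costs one distance computation per stored object.
    """
    scored = ((euclidean_distance(query, vec), key) for key, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Hypothetical stored vectors, chosen by hand for illustration only.
stored = {
    "apple": (1.0, 1.2),
    "banana": (1.3, 0.9),
    "newspaper": (6.0, 5.5),
    "magazine": (6.4, 5.1),
}

for distance, key in exact_nearest((1.0, 1.1), stored, k=2):
    print(key, round(distance, 2))
```

An ANN index makes the opposite trade: it spends time organizing vectors at import so that a query only needs to visit a small part of the data, at the price of results that are approximate rather than guaranteed exact.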

Configuration of vector index type

The index type can be specified per data class. HNSW is currently the only index type and the default, so all data objects will be indexed using the HNSW algorithm unless you specify otherwise in your data schema.

Example of a class vector index configuration in your data schema:

{
  "class": "Article",
  "description": "string",
  "properties": [ 
    {
      "name": "title",
      "description": "string",
      "dataType": ["string"]
    }
  ],
  "vectorIndexType": " ... ",
  "vectorIndexConfig": { ... }
}

Note that the vector index type only specifies how the vectors of data objects are indexed; the index is then used for data retrieval and similarity search. How the vectors are determined (which numbers they contain) is specified by the "vectorizer" parameter, which points to a module such as "text2vec-contextionary" (or is set to "none" if you want to import your own vectors). Learn more about all parameters in the data schema here.
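
As an illustration of how the index type and the vectorizer sit side by side in one class definition, the sketch below builds a class as a Python dictionary and prints it as JSON. Several things here are assumptions and should be checked against the data schema reference: the value "hnsw" for "vectorIndexType", the keys and values inside "vectorIndexConfig", and the description strings. No Weaviate client is used; the code only produces the JSON document.

```python
import json

# Assumptions (not taken from the text above): "hnsw" as the index type
# identifier and the example keys/values inside "vectorIndexConfig".
article_class = {
    "class": "Article",
    "description": "A written article",
    # "none" means vectors are imported by you rather than computed by a module.
    "vectorizer": "none",
    # How the vectors are indexed, independent of how they were computed.
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "efConstruction": 128,  # illustrative value only
        "maxConnections": 64,   # illustrative value only
    },
    "properties": [
        {
            "name": "title",
            "description": "Title of the article",
            "dataType": ["string"],
        }
    ],
}

# Serialize to the JSON form used in a data schema.
schema_json = json.dumps(article_class, indent=2)
print(schema_json)
```

Keeping "vectorizer" and "vectorIndexType" as separate keys mirrors the point above: one decides which numbers go into the vector, the other decides how those vectors are organized for search.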

More Resources

If you can’t find the answer to your question here, please look at:

  1. The Frequently Asked Questions.
  2. The knowledge base of old issues.
  3. Stack Overflow, for questions.
  4. GitHub, for issues.
  5. The Slack channel, to ask your question.