Welcome to the Weaviate documentation! Here you will find what Weaviate is all about, how to start your own Weaviate instance and interact with it, and how to use it to perform semantic search and automatic classification.
Like what you see? Consider giving us a ⭐ on GitHub.
Weaviate is a cloud-native, realtime vector search engine integrating scalable machine learning models. It uses a word vector storage mechanism called the “Contextionary”, which applies Natural Language Processing principles to give context to the data in your dataset and to the language it uses. This makes Weaviate capable of semantic search and automatic classification of unstructured data.
Demo and overview
Concept and demo presented at FOSDEM 2020.
In almost any situation where you work with data, you store information related to something in the real world. This can be data about transactions, cars, airplanes, products; you name it. The challenge with current databases is that it is difficult for the software to grasp the context of the entity you refer to in your datasets. Do the characters “Apple” refer to the company or the fruit?
The Weaviate Vector Search Engine aims to solve this problem. Every time you store data, Weaviate indexes the data based on its linguistic context through a feature called the Contextionary. For example, when you store data about a company called “Apple”, Weaviate automatically contextualizes the data as related to an iPhone.
If you want to learn how the Contextionary does this, see the section About the Contextionary below. We don’t just want to store the data, but also the information and its context, so that knowledge can be derived from it.
Because most data is related to something (e.g., “Amsterdam” is the capital of “The Netherlands”), we store not only the concept itself but also the relation to other concepts (e.g., “the city Amsterdam” to “the country The Netherlands”). This means that the data you add to a Weaviate instance creates a network of knowledge, better known as a graph.
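The idea of storing concepts together with their relations can be sketched as a small labeled graph. This is a hypothetical in-memory illustration, not Weaviate's storage model; the helper names `add_relation` and `related` and the relation `inCountry` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory graph: each concept maps to its outgoing relations.
graph = defaultdict(list)

def add_relation(source, relation, target):
    """Store a directed, labeled edge from one concept to another."""
    graph[source].append((relation, target))

def related(source, relation):
    """Return every concept reachable from `source` via `relation`."""
    return [target for rel, target in graph[source] if rel == relation]

# "inCountry" is an illustrative relation name, not taken from the docs.
add_relation("The Netherlands", "hasCapital", "Amsterdam")
add_relation("Amsterdam", "inCountry", "The Netherlands")

print(related("The Netherlands", "hasCapital"))
```

Each stored relation adds an edge, so the more data you add, the denser the network of knowledge becomes.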
Core use cases
- Semantic Search to search on both concepts and keywords.
- Automatic Classification to automatically extend your graph.
- Knowledge Representation to represent information that systems and humans understand.
Weaviate consists of four core features:
- The Contextionary (c11y) is a vector index which stores all data objects based on their semantics (their meaning). This allows users not only to search and retrieve data directly, but also to search for its concepts.
- We believe that GraphQL combined with a RESTful API provides the best user experience to query Weaviate.
- Weaviate can automatically build its own graph relations through conceptual classification.
- With Weaviate you can create a semantic Knowledge Network based on a P2P network of Weaviates.
Core developer features
- Contextionary - the core graph embedding mechanism (i.e., ML-model) that indexes all data objects.
- GraphQL API - an easy-to-use interface to query a Weaviate.
- RESTful API - an easy-to-use interface to populate a Weaviate.
- Containerized - with Docker and Kubernetes, to run it efficiently.
- Scalable - to support huge graph sizes with fast vector space querying.
About the Contextionary
The Contextionary (derived from dictionary, aka C11Y) gives context to the language used in your dataset (there is an individual Contextionary per language). A Weaviate instance comes with an out-of-the-box Contextionary trained on Common Crawl, Wikipedia, and Wiktionary. We aim to make the C11Y available for use cases in any domain, regardless of whether they are business-related, academic, or other. You can also create your own Contextionary if desired.
Instead of a traditional storage and indexing mechanism, the Contextionary uses vector positions to place data into a 300-dimensional space. The pre-trained Contextionary that ships with Weaviate (you never have to do any training yourself) contains the contextual representation that allows Weaviate to store data based on its contextual meaning.
An empty Weaviate (with preloaded Contextionary) could be envisioned like this:
When using Weaviate’s RESTful API to add data, the Contextionary calculates the position in the vector space that represents the real-world entity.
The vector position of a data object is calculated as the centroid of its words, weighted by how often each word occurs in the original training text corpus (e.g., the word “the” is seen as less important than a less frequent word).
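A minimal sketch of a weighted centroid, under stated assumptions: the word vectors and occurrence counts below are made up, the space has 3 dimensions instead of 300, and inverse frequency is only one plausible weighting. The actual weighting used by the Contextionary is not specified here.

```python
# Toy 3-dimensional word vectors and corpus counts, invented for illustration.
WORD_VECTORS = {
    "the": [0.1, 0.1, 0.1],
    "company": [0.8, 0.2, 0.0],
    "apple": [0.6, 0.9, 0.3],
}
OCCURRENCES = {"the": 1_000_000, "company": 5_000, "apple": 2_000}

def word_weight(word):
    # Assumption: rarer words weigh more; inverse frequency is one simple choice.
    return 1.0 / OCCURRENCES[word]

def weighted_centroid(words):
    """Weighted average of the word vectors, one component at a time."""
    weights = [word_weight(w) for w in words]
    total = sum(weights)
    dims = len(WORD_VECTORS[words[0]])
    return [
        sum(wt * WORD_VECTORS[w][d] for w, wt in zip(words, weights)) / total
        for d in range(dims)
    ]

position = weighted_centroid(["the", "company", "apple"])
print(position)
```

With these numbers the very frequent word "the" contributes almost nothing, so the resulting position lies between the vectors for "company" and "apple".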
When a new class object is created, it will be added to a Weaviate.
When using the GraphQL interface, you can target a thing or action directly, or search for a nearby concept. For example, the company Apple from the previous illustration can be found by searching for a related concept.
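Searching by a nearby concept can be pictured as ranking stored objects by their similarity to the concept's vector. The sketch below is a toy illustration, not Weaviate's GraphQL API: the vectors are invented, the query concept "iphone" simply borrows the iPhone example from earlier, and cosine similarity is an assumed distance measure.

```python
import math

# Made-up 3-dimensional vectors standing in for Contextionary positions.
CONCEPTS = {"iphone": [0.9, 0.8, 0.1]}
OBJECTS = {
    "Company: Apple": [0.8, 0.7, 0.2],
    "City: Amsterdam": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(concept):
    """Return stored objects ordered from most to least similar to the concept."""
    query = CONCEPTS[concept]
    return sorted(
        OBJECTS,
        key=lambda name: cosine_similarity(query, OBJECTS[name]),
        reverse=True,
    )

print(search("iphone"))
```

The object whose vector points in nearly the same direction as the query concept is ranked first, even though the query never mentions its name.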
Because Weaviate converts all data objects into vector positions based on their semantic meaning, data objects have a logical distance from each other. This allows Weaviate to perform a variety of automated classification tasks in near-realtime.
Example of a classification task
Inside the Weaviate below, three data objects are stored: a country and two cities.
The country has a property called hasCapital, of which the reference is unset. We can now ask Weaviate to connect the most likely candidate as the capital. Because Weaviate knows through the schema that the value of hasCapital must be a City, it can choose between Amsterdam and New York. Because of the semantic relation of Amsterdam to The Netherlands, a decision can be made.
When creating automatic classification tasks, the user can define how certain Weaviate needs to be about a connection before making it. During querying, the user can see whether a relation was made automatically or manually.
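The classification example can be sketched as picking the nearest candidate and accepting it only above a certainty threshold. This is a hypothetical illustration: the vectors are invented, cosine similarity stands in for the certainty score, and `classify_capital` and its return fields are names assumed for the example, not part of Weaviate's API.

```python
import math

# Made-up vectors; in Weaviate these would come from the Contextionary.
COUNTRY = {"name": "The Netherlands", "vector": [0.2, 0.9, 0.4], "hasCapital": None}
CITIES = {
    "Amsterdam": [0.3, 0.8, 0.5],
    "New York": [0.9, 0.1, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_capital(country, candidates, min_certainty):
    """Pick the closest City, or return None if the best match is not certain enough."""
    scored = (
        (name, cosine_similarity(country["vector"], vector))
        for name, vector in candidates.items()
    )
    best_name, best_score = max(scored, key=lambda pair: pair[1])
    if best_score < min_certainty:
        return None  # leave the reference unset
    # Record that the relation was made automatically, not by a user.
    return {"hasCapital": best_name, "certainty": best_score, "source": "automatic"}

print(classify_capital(COUNTRY, CITIES, min_certainty=0.9))
print(classify_capital(COUNTRY, CITIES, min_certainty=0.99))
```

Raising the threshold trades coverage for precision: with the stricter value the same data yields no connection at all, and the reference stays unset.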