Creating a Database Schema Configuration File

The local Hyperspace Client uses a database configuration schema to define the Collection to be created (as described in the next step). The schema can be provided as a Python dictionary or as a file (for example, 'schema.json'). Similar to other search databases, Hyperspace uses this schema to outline the data structure and any required custom settings.

The database configuration schema consists of a dictionary containing key-value pairs that lay out the structure (data schema fields) of the data to be uploaded into a Collection. Defined in standard .json format, the configuration is provided under the dict key 'configuration' in the form {'configuration': {'fieldname1': {'type': type1}, 'fieldname2': {'type': type2}, ...}}. Each attribute (field) is described by an attribute name (such as city, country, and street), one or more keys (such as type and low_cardinality), and a value for each key (such as keyword, boolean, and true).

Create a data schema document as a local variable –

The data schema should be provided as a .json file, or as a Python dictionary with the following structure:

{
    'configuration': {
        'Metadata key1': {'type': 'keyword'},
        'Metadata key2': {'type': 'int'},
        'Metadata key3': {
            'type': 'float',
            'struct_type': 'list'
        },
        'vector1': {
            'type': 'dense_vector',
            'index_type': 'index type',
            'dim': dimension
        },
        'vector2': {
            'type': 'dense_vector',
            'index_type': 'index type',
            'dim': dimension
        }
    }
}

Vector fields should be given the type "dense_vector", while metadata fields can be given any type from the supported data types list.

Hyperspace does not currently support signed integers.
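As a minimal sketch of the two delivery options (Python dictionary or .json file), the snippet below builds a schema dictionary and writes it to 'schema.json' using only the Python standard library. The field names, the 'hnsw' index type, the dimension 768 and the helper names save_schema and load_schema are illustrative assumptions, not part of the Hyperspace client.

```python
import json
import tempfile
from pathlib import Path

# Illustrative schema. Field names, 'hnsw' and 768 are placeholder choices.
schema = {
    "configuration": {
        "title": {"type": "keyword"},
        "rating": {"type": "float"},
        "genres": {"type": "int", "struct_type": "list"},
        "open_now": {"type": "boolean", "index": False},
        "embedding": {"type": "dense_vector", "index_type": "hnsw", "dim": 768},
    }
}


def save_schema(config: dict, path: Path) -> Path:
    """Write the configuration dictionary to a .json file (hypothetical helper)."""
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def load_schema(path: Path) -> dict:
    """Read a .json schema file back into a Python dictionary (hypothetical helper)."""
    return json.loads(path.read_text(encoding="utf-8"))


# Written to the system temp directory here; any writable path works.
schema_path = save_schema(schema, Path(tempfile.gettempdir()) / "schema.json")
reloaded = load_schema(schema_path)
```

Note that the json module converts Python's True and False to the JSON literals true and false, so the same schema can be kept in either form.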

Optional Fields

The following optional type attributes can be added –

Index

The key 'index' allows you to disable the indexing of a field. When set to False, the relevant field is included in the dataset but is not indexed and does not contribute to search results. The default value for 'index' is True. For a usage example, see the 'open_now' field in the second example configuration below.

Struct_type – List

Values (non-keyword fields) are configured as scalars by default.

To build a list of the same data type, add the key and value 'struct_type': 'list'. For example –

'genres': {
    'struct_type': 'list',
    'type': 'int'
}

A document can then populate this field with a list of integers, for example –

document['genres'] = [0, 1, 0]

Metadata fields of type "keyword" can hold a single keyword (str) or a list of keywords (list[str]), without the need to set 'struct_type': 'list'.

The length of each keyword is limited to 256 characters.
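The keyword rules above (a single string or a list of strings, each at most 256 characters) can be checked before upload. The following is a sketch of such a check. The helper name keyword_problems is hypothetical and the check runs purely on the client side; it does not call any Hyperspace API.

```python
MAX_KEYWORD_LENGTH = 256  # limit stated in the text above


def keyword_problems(value) -> list:
    """Return a list of problems for a keyword field value (str or list of str)."""
    if isinstance(value, str):
        items = [value]
    elif isinstance(value, list):
        items = value
    else:
        return [f"expected str or list of str, got {type(value).__name__}"]

    problems = []
    for position, item in enumerate(items):
        if not isinstance(item, str):
            problems.append(f"item {position} is not a string")
        elif len(item) > MAX_KEYWORD_LENGTH:
            problems.append(
                f"item {position} has {len(item)} characters "
                f"(limit is {MAX_KEYWORD_LENGTH})"
            )
    return problems
```

An empty result means the value satisfies both rules; for example, keyword_problems(["en", "fr"]) returns an empty list.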

Nested Objects

Hyperspace supports the use of nested objects as part of a document. To include nested objects, define the relevant field type as 'nested', and the corresponding sub items under the 'fields' key.

For example –

config = {
    "configuration": {
        "id": {
            "type": "keyword",
            "id": True
        },
        "description": {
            "type": "keyword"
        },
        "paragraphs": {
            "type": "nested",
            "fields": {
                "text": {
                    "type": "keyword"
                },
                "count": {
                    "type": "integer"
                },
                "value": {
                    "type": "float"
                }
            }
        }
    }
}

In the above example, "paragraphs" is a nested object with subfields named "text", "count" and "value".
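To see how the 'fields' key nests sub items under a parent field, the sketch below walks a schema and lists every field with its declared type. The function name field_paths and the dotted path notation (for example "paragraphs.text") are assumptions made for illustration; they are not claimed to be Hyperspace query syntax.

```python
def field_paths(fields: dict, prefix: str = "") -> dict:
    """Map each field path to its declared type, descending into 'nested' fields."""
    paths = {}
    for name, spec in fields.items():
        path = f"{prefix}{name}"
        paths[path] = spec["type"]
        if spec["type"] == "nested":
            # Sub items of a nested object live under the 'fields' key.
            paths.update(field_paths(spec.get("fields", {}), prefix=f"{path}."))
    return paths


config = {
    "configuration": {
        "id": {"type": "keyword", "id": True},
        "description": {"type": "keyword"},
        "paragraphs": {
            "type": "nested",
            "fields": {
                "text": {"type": "keyword"},
                "count": {"type": "integer"},
                "value": {"type": "float"},
            },
        },
    }
}

paths = field_paths(config["configuration"])
```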

Cardinality

Cardinality refers to the number of unique values an attribute can have.

Hyperspace provides the option to accelerate search performance by setting one of the following cardinality attributes to true.

To accelerate the search, apply the appropriate cardinality attribute where relevant –

  • 'low_cardinality' – Indicates that this attribute has up to 10 possible unique values. It is suitable for fields with a limited set of possible values.

  • 'high_cardinality' – Indicates that this attribute has more than 100 possible unique values, meaning that it has a broad range of distinct values.

For example –

'vertical': {
    'type': 'keyword',
    'low_cardinality': True
}
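One way to choose between the two attributes is to count the distinct values of a field in a representative sample of documents. The sketch below applies the thresholds given above (up to 10 and more than 100). The helper name suggest_cardinality is hypothetical, and a sample can only underestimate the number of possible values, so the result is a hint rather than a guarantee.

```python
from typing import Iterable, Optional


def suggest_cardinality(values: Iterable) -> Optional[str]:
    """Suggest a cardinality attribute from the observed values of one field."""
    unique = set()
    for value in values:
        if isinstance(value, (list, tuple)):
            unique.update(value)  # list fields contribute each element
        else:
            unique.add(value)

    if len(unique) <= 10:
        return "low_cardinality"
    if len(unique) > 100:
        return "high_cardinality"
    return None  # between 11 and 100 unique values: neither attribute applies
```

For example, a 'vertical' field whose sampled values are "food", "retail" and "food" has two unique values, so the function suggests 'low_cardinality'.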

Dense Vector

The 'dense_vector' value assigned to the 'type' attribute instructs Hyperspace to index and map the imported data to be suitable for a Vector Search.

Note – Currently, 'dense_vector' is the only 'type' attribute value that is supported for Vector Search. In the future, an additional option called 'sparse_vector' will be supported.

  • 'type' [string] – For vectors, specify 'dense_vector'.

  • 'dim' [integer] – Specifies the dimension of the vector, that is, the number of values that the vector contains. The database needs this value to optimize storage and search and to handle vector operations efficiently and accurately. For binary vectors, this number must be divisible by 8. Note – 'index_type' and 'dim' must always be provided together or not at all, and they cannot be used in combination with 'struct_type', 'low_cardinality', or 'high_cardinality'.

  • 'index_type' [string] – Specifies the indexing method (data distribution) to be used for this vector, which influences both the speed of operations performed on the vector and their accuracy. Choosing the highest speed may necessitate a minor trade-off in accuracy. This choice also impacts the types of mathematical operations that can be conducted.

    • 'brute_force' – KNN using brute force, which is accurate yet time-consuming.

    • 'hnsw' – Indexing by Hierarchical Navigable Small World method.

    • 'ivf' – Indexing by Inverted File Index scheme.

    • 'bin_ivf' – Indexing by Inverted File Index scheme for binary vectors.

{
    'xyz': {
        'type': 'dense_vector',
        'index_type': 'hnsw',
        'dim': 768,
        'metric': 'ip'
    }
}

  • 'metric' – Specifies the metric to be employed to calculate the similarity (or distance) between vectors as one of the following options –

    • 'ip' – Inner Product / Cosine Similarity – This option must be specified when the 'hnsw' or the 'ivf' 'index_type' (described above) is selected.

    • 'hamming' – Hamming Distance – This option must be specified when the 'bin_ivf' 'index_type' (described above) is selected.

  • 'nlist' (int, default 128) – Only used for index_type = ivf or bin_ivf. This option is used during index creation and represents the number of buckets used during clustering. A larger nlist leads to quicker search with lower accuracy.

  • 'm' (int, default 30) – Used exclusively for index_type = hnsw. It specifies the number of arcs per new element. Use a higher 'm' value for datasets with higher intrinsic dimensionality and/or when higher recall is required. In other words, if the dataset has more complex features or you want more accurate results, consider using a higher 'm' value.

  • 'ef' (int, default 16) – Used exclusively for index_type = hnsw. This represents the dynamic list size for nearest neighbors. A larger 'ef' value results in better accuracy but slower search times, because the algorithm considers more potential neighbors for each match. 'ef' must be larger than the number of queried nearest neighbors (NN).

  • 'ef_construction' (int, default 360) – Used exclusively for index_type = hnsw. It is similar to 'ef', but applies during index creation. A larger value may lengthen upload and index creation, but produces more precise search results.
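The constraints listed above can be collected into a client-side check that runs before a schema is submitted. The following is a sketch under stated assumptions: the function name vector_field_problems is hypothetical, the rules are taken only from the descriptions in this section, and the metric is checked only when it is present in the field definition.

```python
FORBIDDEN_WITH_VECTORS = ("struct_type", "low_cardinality", "high_cardinality")
EXPECTED_METRIC = {"hnsw": "ip", "ivf": "ip", "bin_ivf": "hamming"}
HNSW_ONLY_KEYS = ("m", "ef", "ef_construction")


def vector_field_problems(spec: dict) -> list:
    """Return rule violations for one 'dense_vector' field definition."""
    problems = []
    index_type = spec.get("index_type")
    dim = spec.get("dim")

    # 'index_type' and 'dim' must be provided together or not at all.
    if ("index_type" in spec) != ("dim" in spec):
        problems.append("'index_type' and 'dim' must be provided together")

    for key in FORBIDDEN_WITH_VECTORS:
        if key in spec:
            problems.append(f"'{key}' cannot be combined with vector settings")

    # Binary vectors need a dimension divisible by 8.
    if index_type == "bin_ivf" and isinstance(dim, int) and dim % 8 != 0:
        problems.append("'dim' must be divisible by 8 for binary vectors")

    expected = EXPECTED_METRIC.get(index_type)
    if "metric" in spec and expected is not None and spec["metric"] != expected:
        problems.append(f"'metric' should be '{expected}' for '{index_type}'")

    if "nlist" in spec and index_type not in ("ivf", "bin_ivf"):
        problems.append("'nlist' is only used with 'ivf' or 'bin_ivf'")

    for key in HNSW_ONLY_KEYS:
        if key in spec and index_type != "hnsw":
            problems.append(f"'{key}' is only used with 'hnsw'")

    return problems
```

An empty list means that no rule from this section is violated; it does not guarantee that the server accepts the field.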

Id

Each document must have a unique identifier, or "id". You can manually set an id per document by defining a designated field in the config schema file, as in the following example –

"id_field": {
  "type": "keyword",
  "id": true
},

An id per document will then be required during the upload step. If no id field is set, the system assigns ids automatically.
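Because a manually defined id must be present and unique for every document, it can help to verify a batch before uploading it. The sketch below is one way to do that. The helper name id_problems is hypothetical, and "id_field" is the field name used in the example above.

```python
from collections import Counter


def id_problems(documents: list, id_field: str) -> tuple:
    """Return (positions of documents without an id, sorted duplicate ids)."""
    missing = [i for i, doc in enumerate(documents) if id_field not in doc]
    counts = Counter(doc[id_field] for doc in documents if id_field in doc)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


batch = [
    {"id_field": "a", "title": "first"},
    {"id_field": "b", "title": "second"},
    {"id_field": "a", "title": "third"},   # repeats id "a"
    {"title": "fourth"},                   # has no id at all
]
missing, duplicates = id_problems(batch, "id_field")
```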

Example Configuration

The following configuration describes a hybrid combination of vector fields (text embedding, vector1 and vector2) designated for vector search, with various metadata fields (series, genre_ids, etc.) that can be used as part of classic search.

{
    'configuration': {
        'series': {'type': 'keyword'},
        'genre_ids': {'struct_type': 'list',
                      'type': 'int',
                      'low_cardinality': True},
        'id': {'type': 'int'},
        'text embedding': {'dim': 1024,
                           'metric': 'ip',
                           'type': 'dense_vector',
                           'index_type': 'hnsw'},
        'production_companies': {'type': 'keyword'},
        'production_countries': {'type': 'keyword'},
        'rating': {'type': 'float'},
        'spoken_languages': {'struct_type': 'list', 'type': 'keyword'},
        'title': {'type': 'keyword'},
        'vector1': {'type': 'dense_vector',
                    'index_type': 'hnsw',
                    'dim': 768},
        'vector2': {'type': 'dense_vector',
                    'index_type': 'hnsw',
                    'dim': 768,
                    'm': 15,
                    'ef': 100,
                    'ef_construction': 192}
    }
}

The following is a second example file –

{
    "configuration": {
        "city": {"type": "keyword"},
        "country": {"type": "keyword"},
        "open_now": {"type": "boolean", "index": false},
        "zip_code": {"type": "integer"},
        "street": {"type": "keyword"},
        "vertical": {"type": "keyword", "low_cardinality": true},
        "embedded_name": {
            "type": "dense_vector",
            "index_type": "bin_ivf",
            "dim": 768,
            "metric": "hamming"
        }
    }
}

The 'type' attribute is valid for both Classic and Vector search.

struct_type, low_cardinality and high_cardinality are only valid for Classic Search.

dense_vector, index_type, dim and metric are only valid for Vector Search.

In the example above, the city, country, open_now, zip_code and street attributes are valid for both Classic Search and Vector Search. The vertical attribute is only valid for Classic Search. The embedded_name attribute is only valid for Vector Search.
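The classification rules above can be expressed as a small function that inspects the keys of each field definition. This is only a sketch of the rules as stated in this section. The function name search_scope and the labels "classic", "vector" and "both" are illustrative assumptions.

```python
CLASSIC_ONLY_KEYS = {"struct_type", "low_cardinality", "high_cardinality"}
VECTOR_ONLY_KEYS = {"index_type", "dim", "metric"}


def search_scope(spec: dict) -> str:
    """Classify one field definition as 'vector', 'classic' or 'both'."""
    if spec.get("type") == "dense_vector" or spec.keys() & VECTOR_ONLY_KEYS:
        return "vector"
    if spec.keys() & CLASSIC_ONLY_KEYS:
        return "classic"
    return "both"  # only keys such as 'type' are used


fields = {
    "city": {"type": "keyword"},
    "open_now": {"type": "boolean", "index": False},
    "vertical": {"type": "keyword", "low_cardinality": True},
    "embedded_name": {
        "type": "dense_vector",
        "index_type": "bin_ivf",
        "dim": 768,
        "metric": "hamming",
    },
}
scopes = {name: search_scope(spec) for name, spec in fields.items()}
```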
