Hyperspace Query Flow

Candidate Generation and Scoring

Database search is the process of retrieving the documents that best meet the query conditions. This flow consists of two main steps:

  • Space Reduction and Candidate Generation: The search space is reduced, and a list of candidate documents that passed the query filtering is generated.

  • Candidate Ranking: Candidates are ranked according to a score that reflects how well they match the query.

Given a query, the naïve approach is to evaluate score(i) = user_query(Document_i) for every document in the collection and return the top K matching documents. However, this approach does not scale, because it is impractical to review every document in the collection for each query. To overcome this problem, the search space must be reduced dramatically, from the full dataset down to thousands or even hundreds of documents, so that user_query(Document_i) is evaluated on only a small fraction of the dataset. This is called space reduction or filtering, and the reduced group of documents is called the candidate group. Once the search space is reduced, it is cheap to evaluate the score of each candidate and return the top K matching documents. The next sections describe how filtering and scoring are specified in the DSL syntax.
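The difference between the two approaches can be sketched in plain Python. This is only an illustration under stated assumptions, not Hyperspace's implementation: the toy collection, the user_query scoring expression, and the dictionary-based index are all hypothetical.

```python
import heapq

# Hypothetical toy collection; field names and values are illustrative only.
documents = [
    {"id": 0, "email_domain": "yahoo.com", "first_name": "John"},
    {"id": 1, "email_domain": "gmail.com", "first_name": "John"},
    {"id": 2, "email_domain": "yahoo.com", "first_name": "Ann"},
    {"id": 3, "email_domain": "gmail.com", "first_name": "Bob"},
]

def user_query(doc):
    """Hypothetical scoring expression; 0.0 means the document does not qualify."""
    if doc["email_domain"] != "yahoo.com":
        return 0.0
    score = 2.2
    if doc["first_name"] == "John":
        score += 8.2
    return score

def naive_top_k(docs, k):
    """Naive approach: score every document, then keep the K best."""
    scored = [(user_query(d), d["id"]) for d in docs]
    return heapq.nlargest(k, [pair for pair in scored if pair[0] > 0])

def build_index(docs, field):
    """Inverted index mapping each field value to document positions."""
    index = {}
    for pos, d in enumerate(docs):
        index.setdefault(d[field], []).append(pos)
    return index

def filtered_top_k(docs, index, value, k):
    """Space reduction first, then scoring on the candidate group only."""
    candidates = [docs[pos] for pos in index.get(value, [])]
    scored = [(user_query(d), d["id"]) for d in candidates]
    return heapq.nlargest(k, scored)

domain_index = build_index(documents, "email_domain")
top_naive = naive_top_k(documents, 2)
top_filtered = filtered_top_k(documents, domain_index, "yahoo.com", 2)
```

Both functions return the same ranking, but filtered_top_k calls user_query only on the documents found through the index, which is what keeps the per-query cost independent of the total collection size.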

Query DSL Interface

The following code snippet shows the same query as above, written in the Elastic DSL interface format. Even without considering the additional functionality, the Python syntax is much simpler and more readable than the DSL syntax.

{
  "query": {
    "bool": {
      "must": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "email domain": "yahoo.com"
              }
            },
            "boost": 2.2
          }
        }
      ],
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "first name": "John"
              }
            },
            "boost": 8.2
          }
        }
      ]
    }
  }
}
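To make the semantics of this query concrete, the following minimal sketch evaluates a query of this shape against a single document in plain Python. It assumes the standard Elasticsearch behavior, where a constant_score clause contributes its boost when its filter matches and a bool query sums the scores of its matching must and should clauses. The helper names are hypothetical, only term filters are handled, and minimum_should_match is ignored.

```python
def term_matches(term_filter, doc):
    """True if the document holds the given value for every field in the filter."""
    return all(doc.get(field) == value for field, value in term_filter.items())

def clause_score(clause, doc):
    """Boost of a constant_score clause if its term filter matches, else None."""
    constant = clause["constant_score"]
    if term_matches(constant["filter"]["term"], doc):
        return constant.get("boost", 1.0)
    return None

def bool_score(query, doc):
    """Score of a bool query; None means the document is filtered out."""
    bool_query = query["query"]["bool"]
    total = 0.0
    for clause in bool_query.get("must", []):
        score = clause_score(clause, doc)
        if score is None:
            return None  # a failed must clause excludes the document
        total += score
    for clause in bool_query.get("should", []):
        score = clause_score(clause, doc)
        if score is not None:
            total += score  # should clauses only add to the score
    return total

def constant_term(field, value, boost):
    """Build one constant_score clause wrapping a term filter."""
    return {"constant_score": {"filter": {"term": {field: value}}, "boost": boost}}

# Same structure as the query shown above.
query = {"query": {"bool": {
    "must": [constant_term("email domain", "yahoo.com", 2.2)],
    "should": [constant_term("first name", "John", 8.2)],
}}}

john = {"email domain": "yahoo.com", "first name": "John"}
ann = {"email domain": "yahoo.com", "first name": "Ann"}
bob = {"email domain": "gmail.com", "first name": "Bob"}
scores = [bool_score(query, doc) for doc in (john, ann, bob)]
```

In this sketch the must clause acts as the filter that defines the candidate group, while the should clause only affects the ranking of documents that already passed it.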

TF-IDF

Hyperspace supports TF-IDF scoring for keyword field types. This scoring method gives rarer terms more weight than common ones. For example, depending on the user's query logic, a match on "yahoo.com" in the email_domain field may be deemed more relevant than a match on "gmail.com", because Gmail is far more widely used. In other words, a "gmail.com" filter is considered weak: it matches a large share of the documents and therefore does little to distinguish between them. Further details on this topic can be found in the Scoring section.
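The effect of term rarity can be illustrated with the textbook inverse document frequency, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. This is a sketch under that assumption only; the exact formula Hyperspace uses is not stated here and may differ, and the sample domains are made up.

```python
import math
from collections import Counter

def idf_table(values):
    """Textbook IDF for a keyword field holding one value per document."""
    total_docs = len(values)
    doc_freq = Counter(values)  # number of documents holding each value
    return {term: math.log(total_docs / count) for term, count in doc_freq.items()}

# Hypothetical email_domain values for ten documents.
domains = ["gmail.com"] * 6 + ["yahoo.com"] * 3 + ["example.org"]
weights = idf_table(domains)

# Rarer values receive larger weights, so a gmail.com match counts least.
ranked_terms = sorted(weights, key=weights.get, reverse=True)
```

With this data, ranked_terms lists the rarest domain first and "gmail.com" last, which matches the intuition that a filter on a very common value is a weak one.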
