Hyperspace Query Flow

Candidate Generation and Scoring

Database search is the process of retrieving the documents that best meet the query conditions. This flow is comprised of two main steps:

  • Space Reduction and Candidate Generation: The search space is reduced, and a list of candidate documents that passed the query filtering is generated.

  • Candidate Ranking: Candidates are ranked according to a score that corresponds to how well that match to the query.

Given a query, the naïve approach is to consider all the documents in the collection, while evaluating the expression score(i) = user_query(Document_i) and return the top K matching documents. However, this approach does not scale because it is impractical to review all the documents in the collection for each query. To overcome this problem, one needs some way to reduce the search space dramatically, from all dataset documents down to thousands or even hundreds of documents, so that user_query(Di) evaluation is only performed on a small fraction of the dataset. This is called space reduction or filtering, and the reduced group of documents is called the candidate group. Once the search space is reduced, it is easy to evaluate the score per document over this space and to return the K top matching documents. The next sections describe how filtering and scoring are specified in Hyperspace Python and DSL syntaxes.

Query Python Interface

The Hyperspace engine uses a primary conditional expression to narrow down the search space and generate a candidate group. Thus, only documents meeting this primary condition are considered as candidates.

The following code snippets show a basic query in the Hyperspace Python interface. In this example, the match("email domain") condition is the primary condition (lines 1–4), and lines 6–11 apply to the documents that meet the primary condition.

  • The "email domain" field is matched with the corresponding value, "yahoo.com".

  • Only documents that satisfy the primary condition are advanced to the candidate group. Consequently, the system only conducts an in-depth evaluation of the logic of documents within this group.

params = {
 "email domain": "yahoo.com",
 "first name": "John"
}
 
def user_query(params,doc) :
    if match("email domain") :
        score0 = 3.4
        if match("first name") :
            score0 += 7.0
        else :
            score0 -= 1.2
        return score0 

Query DSL Interface

The following code snippet shows the same query as in the Elastic DSL interface format, shown above. Even without considering the additional functionality, the Python syntax is much simpler and more readable than the DSL syntax.

{
  "query": {
    "bool": {
      "must": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "email domain": "yahoo.com"
              }
            },
            "boost": 2.2
          }
        }
      ],
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "first name": "John"
              }
            },
            "boost": 8.2
          }
        }
      ]
    }
  }
}

The Query Logic and Query Document Concept

In many applications, queries are used repeatedly with the same logic but with different parameters. For instance, different instances of the above query might substitute the values "yahoo.com" and "John" with other values. Hyperspace engines allows to distinguish between 'Query logic' and 'Query documents'. While the Query documents vary, the Query logic remains constant.

This distinction is crucial as it eliminates the need for recompiling the query logic before execution, thereby significantly reducing latency.

The following code snippets show the Query logic and the Query documents derived from the above user query.

Query Logic

params = {
    "email domain" : "yahoo.com",
    "first name" : "John"
    }
def user_query_logic(params, doc) : 
    score0 = 0.0
    if match("email domain") :    
        score0 = 3.4
        if match("first name") :
            score0 += 7.0
        else :
            score0 -= 1.2
    return score0 
    

TF-IDF

Hyperspace supports TF-IDF scoring for keyword field types. This scoring method emphasizes rarer terms over more common ones. For example, depending on the user's query logic, documents containing "gmail.com" in the email_domain field may be deemed more relevant than those with "yahoo.com", reflecting Gmail's widespread use. This principle suggests that a "gmail.com" filter is considered weak due to its widespread usage. Further details on this topic can be found in the Scoring section.

Last updated