Below is a quick guide to getting Search.io search set up and configured. It is broken into several sections for convenience.

Before you begin

  1. We use a schema for high performance. You can create this via the console, the API, or the command-line tool. To add records, their fields must exist in the schema.

  2. You need to create pipelines to add, replace, and search data. You can’t get far without creating these.

  3. When you make a pipeline API request you must specify a pipeline name and optionally a version. If you don’t specify a version, the default version is used (set using the API).

  4. Pipelines have pre and post steps. In a record pipeline, post happens after the record has been added/updated. In a query pipeline, post happens after a query has returned (async).

  5. Pipelines are immutable once created. You can create as many versions as you like, but each version is immutable so machine learning can rely on the data it generates.

  6. The order of steps in a pipeline can be important. For example, query rewriting should happen before synonyms and before any conditional filters that rely on NLP.

  7. All pipeline steps can be conditionally executed.

Command-line tool

The scloud command can be used to set up and test a Search.io collection. You can do almost everything with this command-line tool.

Getting started

To get the tool and workspace up and running follow these steps:

  1. Downloads are available for macOS x86-64 (Intel CPUs), macOS ARM (M1 CPUs), Linux, and Windows, or you can use the Homebrew tap available here: https://github.com/search-io/homebrew-tap

  2. Add the location of scloud to your system PATH and ensure it is executable (chmod +x <file-path>/scloud)

  3. Run the following init command replacing relevant fields:

scloud init -project="<project_ID>" -collection="<collection_name>" -creds="<key>,<secret>"
CODE
⚠️ Security & Privacy settings on macOS might block the app from running. If so, go to Security & Privacy settings and allow it to run.

Or you can just use the base command and enter the values when prompted:

scloud init
CODE

You can check this worked using:

scloud config get
CODE

You can also list your different configurations:

scloud config list
CODE

To switch to a different config:

scloud config set <name>
CODE

To delete a config:

scloud config delete <name>
CODE

Schema

The easiest way to get started with a schema is to infer it from your data. You can do this for CSV and JSONL data sets; the command outputs a schema JSON file. Example:

# scloud will attempt to infer the format from the file extension
scloud schema detect "data.json" -out "schema.json"

# alternatively you can specify the format 
scloud schema detect "data.csv" -format csv -out "schema.json"
scloud schema detect "data.json" -format jsonl -out "schema.json"
CODE

Detecting a schema is the fastest way to get up and running. You can also import CSV and JSON directly from the command line after using this tool. See Record section below.

If working with an existing collection, you can download the schema into a JSON file using:

scloud schema get > schema.json
CODE

Once you’ve edited the schema JSON file, you can update it using:

scloud schema add schema.json
CODE

Note that changing the type of an existing field is not currently allowed.

An example schema field will look like:

    {
      "name": "brand",
      "description": "The product brand",
      "type": "STRING",
      "repeated": false,
      "mode": "NULLABLE"
    }
JSON

Things to note here:

  • type is the data type for the field. Search.io supports STRING, INTEGER, FLOAT, BOOLEAN, and TIMESTAMP.

  • repeated denotes whether the field is an array. To create a vector, for example, you would use a repeated field with type FLOAT (see the example after this list).

  • mode supports NULLABLE (the default), UNIQUE (the field acts as a primary key for the collection), and REQUIRED (the field cannot be NULL).
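
For example, a vector field and a unique identifier field could look like the following (a sketch in the same schema format as above; the field names are illustrative):

    {
      "name": "imageVector",
      "description": "Image embedding used for vector similarity",
      "type": "FLOAT",
      "repeated": true,
      "mode": "NULLABLE"
    },
    {
      "name": "id",
      "description": "Unique identifier for records in this collection",
      "type": "STRING",
      "repeated": false,
      "mode": "UNIQUE"
    }
JSON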

Pipelines

Pipelines are typically defined in YAML, which allows engineers to comment the file with annotations where needed. They have a name and a version, which are both strings. We typically use semver notation for the versions.

In order to create a pipeline you need to understand what steps are and which ones are available in your account. You can list the steps using:

scloud pipeline steps
CODE

This will return a list of query pipeline steps by default, but you can also return a list of record pipeline steps by adding the -type=record flag to the above command, e.g.

scloud pipeline steps -type=record
CODE

Record and query steps are typically not shown together, so the -type=record flag is required on other pipeline commands when working with record pipelines and steps.

You can also look at steps that are post specific using:

scloud pipeline steps -stepType=post
CODE

Post steps can also do powerful things. Examples include merchandising injections of specific results for matching queries and learn-to-rank models.

To retrieve the documentation for a specific step, use the following command:

scloud pipeline step -step=<id>
CODE

When looking at a step’s config, it will denote whether the step is specific to pre or post steps, or supported by both. It can also have several key sections:

  1. CONSTANTS are configuration that is not externally editable by the caller of the pipeline.

  2. INPUTS are configuration options that expect pipeline inputs to control the data flow. For example, “q” is an input.

  3. OUTPUTS are parameters that get written back to the Params object that flows through the pipeline. The params are initialized with the INPUTS object and then modified throughout the execution. This allows steps to augment the params. So outputs may modify existing params, or create new ones. At the completion of the pipeline, the params that are different from the original inputs are returned to the caller as outputs.

When defining a pipeline, the CONSTANTS and INPUTS define what can be configured for a step. So for example the documentation for the index-spelling step says:

Type:          TYPE_UNSPECIFIED
Step Types:    []
Identifier:    index-spelling
Title:         Index spelling
Description:   augment query input terms with spelling suggestions

INPUTS
ID        NAME      TYPE      DEFAULT VALUE   DESCRIPTION
lang      lang      string    "en"            language pack to use when processing text
text      text      string    ""              query string to perform autocomplete or spell check on

OUTPUTS
ID                  NAME                TYPE                              DEFAULT VALUE   DESCRIPTION
wordSuggestions     wordSuggestions     comma-separated list of strings   ""              list of query word suggestions
phraseSuggestions   phraseSuggestions   comma-separated list of strings   ""              list of query phrase suggestions

CONSTANTS
NAME         TYPE                          VALUE       DESCRIPTION
model        string                        "en.dict"   model to use for spelling suggestions
skipLabels   comma-separated string list   ""          don't provide suggestions for words which have training data for the given labels
CODE

To configure this step in a pipeline you could do something like:

- id: index-spelling
  params:
    text:
    - name: q
  consts:
    model:
    - value: en.dict
YAML

This is saying we are setting the text input of the step to use the input param q (by default the step looks for an input called text, which we are overriding). From the description we can see this is the input that feeds the spell check.

The second thing configured above sets the model constant to en.dict. Because this is a constant, it cannot be modified by the pipeline inputs. This constant is not actually required, as the default value is already en.dict.

Other things to note here:

  • There is another input, lang, which has a default value of en. This is not configured in our example YAML above, so the default English processing will be used. It is possible to change this default and/or set the input on any query execution to control the language used (see the example after this list).

  • The outputs don’t have default values and aren’t configured, so they won’t be written to the params and returned in the output.
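
For example, assuming a query pipeline containing the index-spelling step above has been created and a matching language pack is available, the language can be controlled per query by passing lang as an input (a sketch; the pipeline name and version are placeholders):

scloud pipeline query -name=<name> -version=<version> -inputs='{"q":"zapatos deportivos","lang":"es"}'
CODE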

Managing pipelines

List the pipelines for a collection (optionally add the -type=record flag) using:

scloud pipeline list
CODE

Download a specific pipeline in YAML format using the following:

scloud pipeline get -name=<name> -version=<version>
CODE

Download a specific pipeline in legacy YAML format:

scloud pipeline get -name=<name> -version=<version> -output-format=legacy
CODE

Create a new pipeline using:

scloud pipeline create -type=query|record -path=<path>
CODE

Where <path> is the file path to the YAML file. The name specified in the pipeline YAML definition will be used. The type has to be “query” (default if not specified) or “record”.

Check the default version:

scloud pipeline get-default -name=<name>
CODE

Set the default version:

scloud pipeline set-default -name=<name> -version=<version>
CODE

Run a query pipeline:

scloud pipeline query -name=<name> -version=<version> -inputs='{"q":"nike"}'
CODE

Run a query pipeline and only return certain fields:

scloud pipeline query -name=<name> -version=<version> -inputs='{"q":"nike","fields":"name,brand"}' 
CODE

Run an experiment query using pipeline YAML (should only be used for testing purposes):

scloud pipeline experiment -path=<path> -inputs='{"q":"nike"}'
CODE

Run a replace pipeline:

scloud pipeline replace -name=<name> -version=<version> -inputs='{"visionIn":"image", "visionOut":"imageTags"}' -field=id -value=12345 -record='{"id": 12345,"name":"Big Screen TV", "price":2000.00}'
CODE

Record pipelines

In order to ingest records, you first need to create a record pipeline. The most basic record pipeline simply denotes which fields should be searchable. An example is shown below:

name: record
version: 1
description: starter pipeline that makes some fields searchable
pre-steps:
- id: create-indexes
  consts:
    fields:
    - value: name,description,brand,type,categories,imageTags
CODE

The create-indexes step is a pre-step, which runs before the record is added. There are no post-steps. If you were to check the documentation on this step you would also see the default language is en. Interestingly, you can create multiple indexes for different languages.
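
For example, indexes for a second language might be added with another create-indexes step. This is only a sketch: it assumes the step exposes a lang constant and uses illustrative field names, so check the exact names with scloud pipeline step -step=create-indexes -type=record:

- id: create-indexes
  consts:
    fields:
    - value: name_fr,description_fr
    lang:
    - value: fr
CODE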

Creating indexes makes fields searchable. But we also want to know which fields are most important for training spelling and suggestions. We can add an additional section to the above record pipeline YAML to train this process:

post-steps:
- id: train-autocomplete
  consts:
    model:
    - value: en.dict
    trainQueryFields:
    - value: name:name,brand:brand,categories:categories
    trainSpellingFields:
    - value: name:name,brand:brand,description:description,categories:categories,type:type
CODE

In the above we can see this is defined as a post-step, so it happens after the record is successfully added. This is done afterwards for two reasons: 1) if the add fails, it won’t train the spelling with data that doesn’t exist, and 2) it won’t block the add request from returning, but will happen asynchronously afterward (it calls a separate service internally).

Aside from simple steps for processing inbound data, you can also call external services. One example is the http-fetch-json step, which in the example below calls a model to do image analysis and add tags representing the image to the record:

- id: http-fetch-json
  consts:
    url: 
    - value: https://us-central1-example.cloudfunctions.net/vision-api
    timeout: 
    - value: 5000ms
    payloadFields:
    - value: image
    payloadParams:
    - value: visionIn,visionOut
    authToken:
    - value: keds3js#jd@
CODE

This step sends the payloadFields to the specified URL endpoint and merges the response back into the input params. The payloadParams in this case determine that the step expects an input field and an output field to be defined. They are given these longer names in case the pipeline has many inputs controlling multiple steps, where “in” and “out” would carry little meaning. The authToken is used for authentication. There are other types supported for this step also.
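
As a rough sketch of the data flow (the exact request and response shapes depend on the external service, so treat both bodies below as assumptions), the step sends the configured payload field as JSON and merges the JSON returned by the service back into the record:

# request body sent to the vision-api endpoint (payloadFields: image)
{"image": "https://cdn.example.com/products/12345.jpg"}

# example response body; the returned fields are merged back into the record (here imageTags)
{"imageTags": "television,screen,electronics"}
CODE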

Steps such as the above one allow powerful data processing pipelines to be configured and run when loading in records.

The http-fetch-json step is a great way to quickly connect external services into the data processing pipeline. In general, if a data processing change is not possible in an existing step, you can create a custom service or function to do it.

Once created, we can get the usage of our pipeline using:

scloud pipeline usage -name=record -version=1 -type=record
CODE

To load data using this pipeline, we can now use one of the API clients to send a record into this pipeline. An example in Go looks like the following:

// Get a handle to the record pipeline, then upsert the record through it.
pipe := client.Pipeline(name, version)

// The map sets the pipeline inputs controlling the vision step. NewKey builds the
// record key ("searchio" is the Search.io Go SDK package import; ID is the key field name).
key, outputs, err := pipe.ReplaceRecord(ctx,
	map[string]string{"visionIn": "image", "visionOut": "imageTags"},
	searchio.NewKey(ID, rec["id"]), rec)
if err != nil {
	log.Printf("err: %v (%v)", err.Error(), rec)
}
GO

Notes:

  • The above supports concurrency over gRPC, so you can use the pipe safely in multiple worker goroutines

  • You will notice this uses a Replace pipeline and not an Add. This is because a replace is actually an upsert, which will create the record if it does not exist, or update it if it does.

  • The visionIn and visionOut are pipeline inputs, which control the vision API step we defined earlier.

  • The response includes a key for the record as well as outputs (if any are defined in the steps). These are useful as they can inform the caller of important things that happened during the processing.

Query pipelines

In order to run query pipelines, you need to create some first. Query pipelines are highly configurable, so this intro is not exhaustive, but should be a generic starting point.

We are in the process of greatly simplifying the current options. Some of these are highly configurable, but difficult to understand. Ask if you are unsure.

The most basic query pipeline YAML starts the same as a record pipeline:

name: my-pipeline
version: 1
description: not much happening here!
YAML

Next is a series of steps turning basic functionality on:

pre-steps:
- id: set-filter
- id: set-fields
- id: pagination
- id: count-aggregate
  params:
    fields:
      - name: "countAggregate"
- id: count-aggregate-filter
  params:
    fields:
      - name: "count"
    filters:
      - name: "countFilters"
- id: max-aggregate
  params:
    fields:
    - name: "max"
- id: min-aggregate
  params:
    fields:
    - name: "min"
- id: date-aggregate
  params:
    fields:
    - name: "date"
- id: sort
  params:
    fields:
    - name: "sort"
CODE

The above is verbose, but it allows almost unlimited configuration. One example: you may wish to restrict which fields can be returned, which you can do by removing the option to specify them. The same applies to aggregates, sorting and pagination.

Now we have enabled filters, the default fields to return (all), aggregates, filter aggregates, sorting by fields, pagination, etc.
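
With these steps enabled, the corresponding inputs can be passed when querying. The filter and sort input names below are assumptions based on the step configuration above (q and fields are documented earlier), so check each step’s documentation for the exact input IDs:

scloud pipeline query -name=<name> -version=<version> -inputs='{"q":"tv","filter":"price < 500","sort":"price","fields":"name,brand,price"}'
CODE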

Query rewriting

This is the process of rewriting the query with known changes. There are many ways to do this. Below is an example that does a simple find and replace. Interestingly this can set the replacements to a different param. In the below case the output is the same as the input (q), but it doesn’t have to be.

- id: string-replace
  params:
    outText:
    - name: q
    text:
    - name: q
  consts:
    replace:
    - value: '*:,''s:s,northface:north face,airforce:air force'
CODE

To do more complex rewrites, you can use regular expressions. These are very configurable. Below is an example containing two steps. The first, string-regexp-extract, looks for gmail-style filter patterns, such as in:inbox. When one is found, it is extracted from the input q: the text minus the match is put back into q, and the match is put in a parameter called in.

The second step then looks for the new param in and, if it matches the condition, dynamically adds a filter.

- id: string-regexp-extract
  title: gmail style in:tag-name style filters
  consts:
    pattern:
    - value: in:(?P<inValue>[a-zA-Z_\-]+)
    matchTemplate:
    - value: ${inValue}
  params:
    text:
    - name: "q"
    outText:
    - name: "q"
    match:
    - name: "in"  

- id: add-filter
  title: in:cheap
  condition: in = 'cheap'
  consts:
    filter:
    - value: price_range = '1 - 50'
CODE

Looking at an example here, say we had the input query:

{
  "q": "phone in:cheap"
}
JSON

After the string-regexp-extract step the params would then look like:

{
  "q": "phone",
  "in": "cheap"
}
CODE

The add-filter step is then dynamically activated because the params meet the specified condition (condition: in = 'cheap').

All steps in pipelines can have an activation condition expression. This allows functionality to be dynamically switched on or off based on inputs and augmentation by prior steps.

This is also why the order of steps is sometimes important.

Note: we can easily set up more combinations here as the extraction step is pulling out a pattern match:

- id: add-filter
  title: in:top-rated
  condition: in = 'top-rated'
  consts:
    filter:
    - value: rating >= 5
CODE

The other main query rewriting step is nlp, which typically uses a custom model, so it is not discussed here, but it can do very advanced NLP for specific use cases.

Spell correction

Spell correction is done with a single step. This step creates a matrix of alternative spellings and their associated probabilities of being correct. The query then executes all possible combinations based on their probabilities, which enables misspellings to typically hit their target. The model takes into account word and phrase sequence probabilities.

- id: index-spelling
  params:
    text:
    - name: q
CODE

Synonyms

Synonyms are created and managed in the admin console currently. These apply to all pipelines in the collection that include the following step:

- id: synonym
  params:
    text:
    - name: q
CODE

Index score boosting

This step runs periodic updates to the reverse index scores, adjusting them relative to how each record has performed for that particular query intersection. In contrast to TF-IDF and BM25, this actually learns what each term is worth for each result and promotes results accordingly.

- id: index-text-score-instance-boost
  consts:
    minCount:
    - value: 50
    threshold:
    - value: 0.5
CODE

Without significant query volume and result interactions (clicks, purchases, etc.), this will have minimal or no impact. It works best for high-volume queries. The update frequency is ~30 minutes.

Field importance

These steps set which indexes are read and scored and how important each field is when running a query.

The current method of field scoring is very granular. This is done so a query can use different inputs to query different fields and apply different weightings, which is useful for personalization, etc. We will soon release simpler steps to make this less verbose and handle the most common cases.

- id: feature-boost-value
  consts:
    value:
    - value: 0.2
- id: set-score-mode
  consts:
    mode:
    - value: MAX
- id: index-text-index-boost
  params:
    text:
    - name: q
  consts:
    field:
    - value: name
    score:
    - value: 1.0
- id: index-text-index-boost
  params:
    text:
    - name: q
  consts:
    field:
    - value: description
    score:
    - value: 0.5
CODE

Notes:

  • feature-boost-value sets the additive boost weight as a % of the total score. Setting this to 0.2 means the indexScore (reverse index matches) will be worth 80% of the score and additive feature boosts will be worth the remaining 20%. This is a trade-off between business logic and text relevance in the overall ranking score.

  • set-score-mode can use MAX or ABS. We recommend MAX for most cases.

  • the score for each field can be set in a multitude of ways and will be normalized. If you want a very fixed ordering of field importance you can use descending scores of 1.0, 0.5, 0.25, etc. This would mean a match in the 2nd and 3rd most important fields would still be worth less (0.5+0.25) than a match in the 1st field (1.0).

Boosting

Boosting is used to change the way different records are ranked based on business logic, machine learning feedback or other reasons. Boosting can be applied multiplicatively or additively (typical case and the default). The scores are added up and normalized to the feature-boost-value.

Examples:

Below is a range-boost, which is a linear booster from start (100% of the score) to end (0% of the score). The score is the total strength of the boost. So a bestSellingRank=10000 will get the full 0.1 score contribution.

- id: range-boost
  title: bestSellingRank is important -> lower is better
  consts:
    field:
    - value: bestSellingRank
    score:
    - value: 0.1
    start:
    - value: 10000
    end:
    - value: 0
CODE

A multi-range-boost is similar to a range-boost except it has more than one interval. This allows any distribution to be quickly approximated. Using collection statistics, we can generate a set of intervals to apply a shape correction. In the case below, averageProfit has a non-linear shape applied to it.

- id: add-multi-range-boost
  consts:
    field:
    - value: averageProfit
    pointScores:
    - value: 0:0,5:0.2,20:0.4,50:0.6,500:1.0
    score:
    - value: 0.1
CODE

A filter-boost applies a boost if a record matches a specific filter expression. In the case below, if the input q contains "airpod", "ipad", "iphone" or "macbook", we boost items where brand = 'Apple'. This is an example of a conditional boost, as it is only added if the query inputs meet a certain condition, but filter-boost is used a lot in general. Another example is boosting items that are in stock, etc.

- id: filter-boost
  condition: q ~ 'airpod' OR q ~ 'iphone' OR q ~ 'ipad' OR q ~ 'macbook'
  consts:
    filter:
    - value: brand = 'Apple'
    score:
    - value: 0.2
CODE

Personalization-based steps that change the ranking based on non-query attributes can also be easily configured once up and running. An example is shown below, where brandAffinity is another expected input. When this input is set, the boost activates and uses the brandAffinity value in the query to boost items that match. This is a simple example, but much more complex ones can be supported.

- id: filter-boost
  condition: brandAffinity != ''
  consts:
    filter:
    - value: brand = brandAffinity
    score:
    - value: 0.4
CODE

Even if you are unsure how to weight boosts for personalization, it can be useful to pass in the input parameters anyway. We record all the inputs and can correlate them against performance, so it’s possible to predict the ideal boost to use here after collecting some data.
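
For example, with the filter-boost above in place, the personalization input can be passed alongside the query (a sketch; the pipeline name, version and brand value are placeholders):

scloud pipeline query -name=<name> -version=<version> -inputs='{"q":"running shoes","brandAffinity":"Nike"}'
CODE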

Geo-boosting can be done by passing a geo function in the filter expressions. For example:

- id: filter-boost
  title: distance from center
  consts:
    filter:
    - value: GEO_INSIDE(lat, long, location_lat, location_lng, 1)
    score:
    - value: "0.015625"
  condition: location_lat != '' AND location_lng != ''
CODE

There are many other boost types:

  • element-boost compares two lists and does a cosine overlap between them. The cosine similarity score is then multiplied by the boost score

  • vector-boost compares two vectors and does a cosine overlap between them. The cosine similarity score is then multiplied by the boost score

  • percentage-boost is a convenience wrapper for a range-boost from 0.0-1.0

Post-steps

After a query has completed and returned to the caller, it is also possible to run post steps. These are useful for many functions, including:

  • train-autocomplete trains future autocompletion improvements using the input query text

  • promotions adds in merchandising style record injections into results when matching specified query inputs

  • learn-to-rank can re-rank results based on machine learning models, such as XGBoost.

learn-to-rank currently requires an offline, manual model creation process. It works best with lots of training data; be mindful of overfitting.

Records

Aside from using pipelines to add and update records, you can also interact with records directly.

Get a record from a collection

A record can be retrieved from a collection by specifying the name of a unique identifier field and its corresponding value. Search.io assigns its own unique identifier called “_id” to each record; however, you can use any unique identifier field in the record.

scloud record get -field=<name> -value=<value>

# using Search.io's unique identifier
scloud record get -field="_id" -value="1234"
CODE

Delete a record from a collection

A record can be deleted in a collection by specifying the name of a unique identifier field and its corresponding value.

scloud record delete -field=<name> -value=<value>

# Example using Search.io's unique identifier
scloud record delete -field="_id" -value="1234"
CODE

Edit a record in a collection

New fields can be added into the existing record, assuming those new fields are defined in the schema. To add new fields to an existing record:

scloud record mutate -field=<name> -value=<value> -data=<JSON encoded map of field:value>

# Example using Search.io's unique identifier, adding a new "company" field
scloud record mutate -field="_id" -value="1234" -data='{"company":"Search.io"}'
CODE

Count the number of records in a collection

To retrieve the total number of records in a collection:

scloud record count
CODE

Importing records into a collection

You can import data in CSV or JSONL (newline-delimited JSON) format into a collection using a record pipeline:

# scloud will attempt to infer the format from the file extension
scloud record import -file "data.json" -name="record" -version=1

# alternatively you can specify the format
scloud record import -file "data.csv" -format="csv" -name="record" -version=1
scloud record import -file "data.json" -format="jsonl" -name="record" -version=1
CODE

Filters

Basic supported filter operators are shown below:

Operator                        Description                                            Example
Equal To (=)                    Field is equal to a value (numeric or string)          brand = 'Apple'
                                List elements are all equal                            tags = ['apple','orange','banana']
Not Equal To (!=)               Field is not equal to a value (numeric or string)      brand != 'Apple'
Greater Than (>)                Field is greater than a numeric value                  age > 21
Greater Than Or Equal To (>=)   Field is greater than or equal to a numeric value      age >= 21
Less Than (<)                   Field is less than a given numeric value               price < 50.00
Less Than Or Equal To (<=)      Field is less than or equal to a given numeric value   price <= 50.00
Begins With (^)                 Field begins with a string                             domain ^ 'www'
Ends With ($)                   Field ends with a string                               domain $ '.com'
Contains (~)                    Field contains a string                                q ~ 'apple'
                                List contains an element                               tags ~ ['apple']
Does Not Contain (!~)           Field does not contain a string                        q !~ 'apple'
                                List does not contain an element                       tags !~ ['apple']
IS_NULL()                       Field is NULL                                          IS_NULL(domain)
IS_NOT_NULL()                   Field is NOT NULL                                      IS_NOT_NULL(domain)
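
Operators can typically be combined with AND and OR (as seen in the boost condition examples above), for example:

brand = 'Apple' AND price < 500 AND IS_NOT_NULL(image)
CODE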

User Interfaces

There are several ways to build interfaces client-side, including the JavaScript and React clients.

An example for ecommerce is shown below. This was built with the JavaScript client and Tailwind.