> Documentation for Machine Learning Functions

# Machine Learning Functions

<h2 id="evalmlmethod">
  evalMLMethod
</h2>

Use the `evalMLMethod` function to make predictions with a fitted regression model. See the link in `linearRegression`.

<h2 id="stochasticlinearregression">
  stochasticLinearRegression
</h2>

The [stochasticLinearRegression](/reference/functions/aggregate-functions/stochasticLinearRegression) aggregate function implements the stochastic gradient descent method using a linear model and the MSE loss function. Use `evalMLMethod` to predict on new data.

<h2 id="stochasticlogisticregression">
  stochasticLogisticRegression
</h2>

The [stochasticLogisticRegression](/reference/functions/aggregate-functions/stochasticLogisticRegression) aggregate function implements the stochastic gradient descent method for binary classification problems. Use `evalMLMethod` to predict on new data.

<h2 id="naivebayesclassifier">
  naiveBayesClassifier
</h2>

Classifies input text using a Naive Bayes model with n-grams and Laplace smoothing. The model must be configured in ClickHouse before use.

**Syntax**

```sql theme={null}
naiveBayesClassifier(model_name, input_text);
```

**Arguments**

* `model_name` — Name of the pre-configured model. [String](/reference/data-types/string)
  The model must be defined in ClickHouse's configuration files (see below).
* `input_text` — Text to classify. [String](/reference/data-types/string)
  Input is processed exactly as provided (case/punctuation preserved).

**Returned Value**

* Predicted class ID as an unsigned integer. [UInt32](/reference/data-types/int-uint)
  Class IDs correspond to categories defined during model construction.

**Example**

Classify text with a language detection model:

```sql theme={null}
SELECT naiveBayesClassifier('language', 'How are you?');
```

```response theme={null}
┌─naiveBayesClassifier('language', 'How are you?')─┐
│ 0                                                │
└──────────────────────────────────────────────────┘
```

*Result `0` might represent English, while `1` could indicate French; class meanings depend on your training data.*

***

<h3 id="implementation-details">
  Implementation Details
</h3>

**Algorithm**
Uses the Naive Bayes classification algorithm on n-gram probabilities, with [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) to handle unseen n-grams, as described in [this reference](https://web.stanford.edu/~jurafsky/slp3/4.pdf).
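
The scoring rule can be sketched in Python. This is the textbook formulation of Naive Bayes with additive smoothing, not ClickHouse's source code: the function name `classify`, the dictionary layout, and the choice of vocabulary size (the number of distinct n-grams across all classes) are assumptions made for illustration.

```python
import math


def classify(ngrams, counts, priors, alpha=1.0):
    """Return the class ID with the highest smoothed log-probability.

    ngrams: n-grams extracted from the input text
    counts: {class_id: {ngram: count}} taken from a trained model
    priors: {class_id: prior probability}
    alpha:  Laplace smoothing factor
    """
    # Assumption: the vocabulary is every distinct n-gram seen in any class.
    vocabulary = {g for per_class in counts.values() for g in per_class}
    best_class, best_score = None, -math.inf
    for class_id in sorted(counts):
        per_class = counts[class_id]
        denom = sum(per_class.values()) + alpha * len(vocabulary)
        # Work in log space so that many small probabilities do not underflow.
        score = math.log(priors[class_id])
        for g in ngrams:
            score += math.log((per_class.get(g, 0) + alpha) / denom)
        if score > best_score:
            best_class, best_score = class_id, score
    return best_class
```

With `alpha > 0`, an n-gram that never occurred in a class still contributes a small non-zero probability instead of forcing that class's score to zero.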

**Key Features**

* Supports n-grams of any size
* Three tokenization modes:
  * `byte`: Operates on raw bytes. Each byte is one token.
  * `codepoint`: Operates on Unicode scalar values decoded from UTF‑8. Each codepoint is one token.
  * `token`: Splits on runs of Unicode whitespace (regex `\s+`). Tokens are substrings of non‑whitespace; punctuation is part of the token if adjacent (e.g., "you?" is one token).
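
The three modes can be approximated in Python as follows. This is a minimal sketch, assuming the input is a valid UTF-8 string; the helper name `tokenize` is hypothetical, and Python's definition of whitespace in `\s` may not match ClickHouse's in every edge case.

```python
import re


def tokenize(text: str, mode: str) -> list:
    """Split text into tokens for one of the three tokenization modes."""
    if mode == "byte":
        # One token per raw byte of the UTF-8 encoding.
        return [bytes([b]) for b in text.encode("utf-8")]
    if mode == "codepoint":
        # Python strings iterate by code point, so each element is one token.
        return list(text)
    if mode == "token":
        # Split on runs of whitespace; adjacent punctuation stays attached.
        return [t for t in re.split(r"\s+", text) if t]
    raise ValueError(f"unknown mode: {mode}")
```

For example, `tokenize("How are you?", "token")` yields `["How", "are", "you?"]`, while a two-byte character such as `é` is one token in `codepoint` mode and two tokens in `byte` mode.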

***

<h3 id="model-configuration">
  Model Configuration
</h3>

You can find sample source code for creating a Naive Bayes model for language detection [here](https://github.com/nihalzp/ClickHouse-NaiveBayesClassifier-Models).

Additionally, sample models and their associated config files are available [here](https://github.com/nihalzp/ClickHouse-NaiveBayesClassifier-Models/tree/main/models).

Here is an example configuration for a naive Bayes model in ClickHouse:

```xml theme={null}
<clickhouse>
    <nb_models>
        <model>
            <name>sentiment</name>
            <path>/etc/clickhouse-server/config.d/sentiment.bin</path>
            <n>2</n>
            <mode>token</mode>
            <alpha>1.0</alpha>
            <priors>
                <prior>
                    <class>0</class>
                    <value>0.6</value>
                </prior>
                <prior>
                    <class>1</class>
                    <value>0.4</value>
                </prior>
            </priors>
        </model>
    </nb_models>
</clickhouse>
```

**Configuration Parameters**

| Parameter  | Description                                                                                                           | Example                                                  | Default            |
| ---------- | --------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | ------------------ |
| **name**   | Unique model identifier                                                                                               | `language_detection`                                     | *Required*         |
| **path**   | Full path to model binary                                                                                             | `/etc/clickhouse-server/config.d/language_detection.bin` | *Required*         |
| **mode**   | Tokenization method:<br />- `byte`: Byte sequences<br />- `codepoint`: Unicode characters<br />- `token`: Word tokens | `token`                                                  | *Required*         |
| **n**      | N-gram size (`token` mode):<br />- `1`=single word<br />- `2`=word pairs<br />- `3`=word triplets                     | `2`                                                      | *Required*         |
| **alpha**  | Laplace smoothing factor used during classification to address n-grams that do not appear in the model                | `0.5`                                                    | `1.0`              |
| **priors** | Class probabilities (% of the documents belonging to a class)                                                         | 60% class 0, 40% class 1                                 | Equal distribution |

**Model Training Guide**

**File Format**
In human-readable format, for `n=1` and `token` mode, the model might look like this:

```text theme={null}
<class_id> <n-gram> <count>
0 excellent 15
1 refund 28
```

For `n=3` and `codepoint` mode, it might look like:

```text theme={null}
<class_id> <n-gram> <count>
0 exc 15
1 ref 28
```

Human-readable format is not used by ClickHouse directly; it must be converted to the binary format described below.
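
As a first step of such a conversion, each human-readable line can be parsed into a record. The sketch below assumes that fields are separated by single spaces and that an n-gram containing spaces (for example, a `token`-mode n-gram with `n > 1` written as space-joined tokens) sits between the first and last field; neither the helper name `parse_line` nor this layout is defined by ClickHouse.

```python
def parse_line(line: str):
    """Parse '<class_id> <n-gram> <count>' into (class_id, ngram, count)."""
    # The n-gram itself may contain spaces, so treat the first field as the
    # class ID and the last field as the count; everything between is the n-gram.
    class_part, rest = line.rstrip("\n").split(" ", 1)
    ngram, count_part = rest.rsplit(" ", 1)
    return int(class_part), ngram, int(count_part)
```

For the sample above, `parse_line("0 excellent 15")` returns `(0, "excellent", 15)`.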

**Binary Format Details**
Each n-gram is stored as:

1. 4-byte `class_id` (UInt, little-endian)
2. 4-byte `n-gram` bytes length (UInt, little-endian)
3. Raw `n-gram` bytes
4. 4-byte `count` (UInt, little-endian)
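
The record layout above maps directly onto Python's `struct` module, where `"<I"` denotes a 4-byte little-endian unsigned integer. The following is a sketch, assuming the model file is a plain concatenation of such records with no header or trailer (the text above does not say either way); `pack_record` and `unpack_records` are hypothetical helper names.

```python
import struct


def pack_record(class_id: int, ngram: bytes, count: int) -> bytes:
    """Serialize one n-gram record in the four-field layout listed above."""
    return (
        struct.pack("<I", class_id)      # 4-byte class_id, little-endian
        + struct.pack("<I", len(ngram))  # 4-byte length of the n-gram in bytes
        + ngram                          # raw n-gram bytes
        + struct.pack("<I", count)       # 4-byte count, little-endian
    )


def unpack_records(data: bytes):
    """Yield (class_id, ngram, count) tuples from concatenated records."""
    offset = 0
    while offset < len(data):
        class_id, length = struct.unpack_from("<II", data, offset)
        offset += 8
        ngram = data[offset:offset + length]
        offset += length
        (count,) = struct.unpack_from("<I", data, offset)
        offset += 4
        yield class_id, ngram, count
```

Reading the records back with `unpack_records` is a quick way to check a generated file before pointing the `path` setting at it.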

**Preprocessing Requirements**
Before the model is created from the document corpus, the documents must be preprocessed to extract n-grams according to the specified `mode` and `n`. The following steps outline the preprocessing:

1. **Add boundary markers at the start and end of each document based on tokenization mode:**

   * **Byte**: `0x01` (start), `0xFF` (end)
   * **Codepoint**: `U+10FFFE` (start), `U+10FFFF` (end)
   * **Token**: `<s>` (start), `</s>` (end)

   *Note:* `(n - 1)` boundary markers are added at both the beginning and the end of the document.

2. **Example for `n=3` in `token` mode:**

   * **Document:** `"ClickHouse is fast"`
   * **Processed as:** `<s> <s> ClickHouse is fast </s> </s>`
   * **Generated trigrams:**
     * `<s> <s> ClickHouse`
     * `<s> ClickHouse is`
     * `ClickHouse is fast`
     * `is fast </s>`
     * `fast </s> </s>`

To simplify model creation for `byte` and `codepoint` modes, it may be convenient to first tokenize the document into tokens (a list of `byte`s for `byte` mode and a list of `codepoint`s for `codepoint` mode). Then, append `n - 1` start tokens at the beginning and `n - 1` end tokens at the end of the document. Finally, generate the n-grams and write them to the serialized file.
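
The padding and windowing steps can be sketched as follows. The boundary markers come from the list above; the function name `ngrams_with_boundaries` and the choice to return each n-gram as a tuple of tokens are assumptions for illustration. How the tokens of one n-gram are joined into the serialized n-gram bytes is not covered here.

```python
# Boundary markers per tokenization mode, as listed in the preprocessing steps.
START = {"byte": b"\x01", "codepoint": "\U0010FFFE", "token": "<s>"}
END = {"byte": b"\xff", "codepoint": "\U0010FFFF", "token": "</s>"}


def ngrams_with_boundaries(tokens: list, n: int, mode: str) -> list:
    """Pad tokens with n - 1 markers on each side, then slide a window of size n."""
    padded = [START[mode]] * (n - 1) + list(tokens) + [END[mode]] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]
```

Calling `ngrams_with_boundaries(["ClickHouse", "is", "fast"], 3, "token")` produces the five trigrams from the example above, starting with `("<s>", "<s>", "ClickHouse")` and ending with `("fast", "</s>", "</s>")`.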

***

{/*AUTOGENERATED_START*/}
