# Natural language processing (NLP) functions

{/*AUTOGENERATED_START*/}

## detectCharset

Introduced in: v22.2.0

Detects the character set of an input string that is not UTF8-encoded.

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
detectCharset(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns a string containing the code of the detected character set. [`String`](/zh/reference/data-types/string)

**Examples**

**Basic usage**

```sql title=Query theme={null}
SELECT detectCharset('Ich bleibe für ein paar Tage.')
```

```response title=Response theme={null}
WINDOWS-1252
```
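The setting named above can be enabled for a whole session or scoped to a single query. The following is a minimal sketch of both forms, assuming a ClickHouse build that includes the NLP functions:

```sql
-- Enable the experimental NLP functions for the current session.
SET allow_experimental_nlp_functions = 1;

SELECT detectCharset('Ich bleibe für ein paar Tage.');

-- Alternatively, scope the setting to one query only.
SELECT detectCharset('Ich bleibe für ein paar Tage.')
SETTINGS allow_experimental_nlp_functions = 1;
```

The per-query form is convenient when the session should otherwise keep experimental features disabled.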

## detectLanguage

Introduced in: v22.2.0

Detects the language of a UTF8-encoded input string. The function uses the [CLD2 library](https://github.com/CLD2Owners/cld2) for detection and returns the 2-letter ISO language code. The longer the input, the more accurate the detection.

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
detectLanguage(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns the 2-letter ISO code of the detected language. Other possible results: `un` = unknown, no language could be detected; `other` = the detected language has no 2-letter code. [`String`](/zh/reference/data-types/string)

**Examples**

**Mixed-language text**

```sql title=Query theme={null}
SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there\'s a will, there\'s a way.')
```

```response title=Response theme={null}
fr
```
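Because the function returns a single code per string, it can be used as a grouping key. The sketch below counts texts per detected language; the sample sentences are illustrative only, and `arrayJoin` is used purely to build a small inline data set:

```sql
SET allow_experimental_nlp_functions = 1;

-- arrayJoin turns the array into one row per sample text.
SELECT
    detectLanguage(text) AS lang,
    count() AS texts
FROM
(
    SELECT arrayJoin([
        'Je pense que je ne parviendrai jamais à parler français comme un natif.',
        'Ich bleibe für ein paar Tage.',
        'Where there is a will, there is a way.'
    ]) AS text
)
GROUP BY lang
ORDER BY lang;
```

Short inputs may be classified as `un`, so in practice it can be worth filtering that value out before aggregating.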

## detectLanguageMixed

Introduced in: v22.2.0

Similar to the [`detectLanguage`](#detectLanguage) function, but `detectLanguageMixed` returns a `Map` in which each 2-letter language code is mapped to the proportion of the text written in that language.

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
detectLanguageMixed(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns a map whose keys are 2-letter ISO codes and whose values are the proportion of the text found in that language. [`Map(String, Float32)`](/zh/reference/data-types/map)

**Examples**

**Mixed languages**

```sql title=Query theme={null}
SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.')
```

```response title=Response theme={null}
{'ja':0.62,'fr':0.36}
```
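The returned `Map` can be inspected with the usual map operations. The following sketch, which reuses the example text above, lists the detected codes and reads the value stored for one of them:

```sql
SET allow_experimental_nlp_functions = 1;

WITH detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.') AS langs
SELECT
    mapKeys(langs) AS detected_codes,  -- every language code found in the text
    langs['fr'] AS fr_share;           -- value for one code; a missing key yields the default value (0)
```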

## detectLanguageUnknown

Introduced in: v22.2.0

Similar to the [`detectLanguage`](#detectLanguage) function, except that `detectLanguageUnknown` works with strings that are not UTF8-encoded. Prefer this version when the character set is UTF-16 or UTF-32.

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
detectLanguageUnknown(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns the 2-letter ISO code of the detected language. Other possible results: `un` = unknown, no language could be detected; `other` = the detected language has no 2-letter code. [`String`](/zh/reference/data-types/string)

**Examples**

**Basic usage**

```sql title=Query theme={null}
SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.')
```

```response title=Response theme={null}
de
```
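When the encoding of the input is not known in advance, it can be useful to look at the detected character set and the detected language side by side. A minimal sketch, using the same sample string as the examples on this page:

```sql
SET allow_experimental_nlp_functions = 1;

WITH 'Ich bleibe für ein paar Tage.' AS s
SELECT
    detectCharset(s) AS charset,        -- detected character set code
    detectLanguageUnknown(s) AS lang;   -- detected 2-letter language code
```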

## detectTonality

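Introduced in: v22.2.0

Determines the sentiment of the given text.

**Limitations**

The current version of this function uses a built-in sentiment dictionary and works only for Russian.

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
detectTonality(s)
```

**Arguments**

* `s` — The text to analyze. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns the average sentiment value of the words in the text. [`Float32`](/zh/reference/data-types/float)

**Examples**

**Russian sentiment analysis**

```sql title=Query theme={null}
SELECT detectTonality('Шарик - хороший пёс'), detectTonality('Шарик - пёс'), detectTonality('Шарик - плохой пёс')
```

```response title=Response theme={null}
0.44445, 0, -0.3
```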
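Since the result is an average score, a coarse label can be derived from its sign. The sketch below does this for the three example sentences; treating `0` as the boundary between positive and negative is an assumption made for illustration, not a documented threshold:

```sql
SET allow_experimental_nlp_functions = 1;

SELECT
    text,
    detectTonality(text) AS score,
    -- Label by sign of the score; the cut-off at 0 is an illustrative choice.
    multiIf(score > 0, 'positive', score < 0, 'negative', 'neutral') AS label
FROM
(
    SELECT arrayJoin(['Шарик - хороший пёс', 'Шарик - пёс', 'Шарик - плохой пёс']) AS text
);
```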

## lemmatize

Introduced in: v21.9.0

Performs lemmatization on a given word. The function needs dictionaries to operate, which can be obtained from [github](https://github.com/vpodpecan/lemmagen3/tree/master/src/lemmagen3/models). For more details on loading a dictionary from a local file, see the page ["Defining Dictionaries"](/zh/reference/statements/create/dictionary/sources/local-file).

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
lemmatize(lang, word)
```

**Arguments**

* `lang` — The language whose rules are applied. [`String`](/zh/reference/data-types/string)
* `word` — The lowercase word to lemmatize. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns the lemmatized form of the word. [`String`](/zh/reference/data-types/string)

**Examples**

**English lemmatization**

```sql title=Query theme={null}
SELECT lemmatize('en', 'wolves')
```

```response title=Response theme={null}
wolf
```
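Because `word` must be lowercase, input of unknown case can be normalized with `lower` first, and `arrayMap` can apply the function to several words at once. This is a sketch only: it assumes an English lemmatizer dictionary is configured on the server under the name `'en'`, and the sample words are illustrative:

```sql
SET allow_experimental_nlp_functions = 1;

-- lower() normalizes each word before it is passed to lemmatize.
SELECT arrayMap(w -> lemmatize('en', lower(w)), ['Wolves', 'Geese']) AS lemmas;
```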

## stem

Introduced in: v21.9.0

Performs stemming on a single word or an array of words using the Snowball algorithm. Each input string must be a single lowercase word; a string containing whitespace raises an exception. Passing uppercase characters produces undefined results. For scalar input (including FixedString) the function returns String; for array input it returns Array(String). Nullable and LowCardinality variants of String and FixedString are supported.

**Syntax**

```sql theme={null}
stem(word, language)
```

**Arguments**

* `word` — A single lowercase word (or an array of words) to stem. Must be lowercase; uppercase characters produce undefined results. Accepts String, FixedString, Array(String), Array(FixedString), Array(Nullable(String)), or Array(Nullable(FixedString)). [`String`](/zh/reference/data-types/string) or [`FixedString`](/zh/reference/data-types/fixedstring) or [`Array(String)`](/zh/reference/data-types/array) or [`Array(FixedString)`](/zh/reference/data-types/array)
* `language` — The language whose stemming rules are applied. Use a two-letter ISO 639-1 code (for example 'en', 'de', 'fr'); see [List of ISO 639 language codes](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes). [`String`](/zh/reference/data-types/string)

**Returned value**

The stemmed form of the word (String), or an array of stemmed words (Array(String)). [`String`](/zh/reference/data-types/string) or [`Array(String)`](/zh/reference/data-types/array)

**Examples**

**Stemming a single word**

```sql title=Query theme={null}
SELECT stem('blessing', 'en') AS res
```

```response title=Response theme={null}
bless
```

**Stemming an array of words**

```sql title=Query theme={null}
SELECT stem(['blessing', 'disguise'], 'en') AS res
```

```response title=Response theme={null}
['bless','disguis']
```

**Stemming a FixedString**

```sql title=Query theme={null}
SELECT stem(toFixedString('blessing', 10), 'en') AS res
```

```response title=Response theme={null}
bless
```

**Stemming a Nullable word**

```sql title=Query theme={null}
SELECT stem(toNullable('blessing'), 'en') AS res
```

```response title=Response theme={null}
bless
```
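Because each input must be one lowercase word without whitespace, a sentence has to be lowercased and split before it is stemmed. The sketch below uses `lower` and `splitByWhitespace` for that and relies on the array form of `stem` described above; the sample sentence is illustrative:

```sql
-- If your server version gates stem behind the experimental NLP setting
-- (an assumption, not stated for this function above), enable it first:
-- SET allow_experimental_nlp_functions = 1;

-- lower() satisfies the lowercase requirement; splitByWhitespace yields one word per array element.
SELECT stem(splitByWhitespace(lower('Blessing in disguise')), 'en') AS stems;
```

Splitting on whitespace leaves punctuation attached to words, so real text may need a stricter tokenizer before stemming.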

## synonyms

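Introduced in: v21.9.0

Finds synonyms of a given word.

There are two types of synonym extensions:

* `plain`
* `wordnet`

With the `plain` extension type, you need to provide a path to a simple text file in which each line corresponds to one synonym set. Words on a line must be separated by spaces or tabs.

With the `wordnet` extension type, you need to provide a path to a directory containing a WordNet thesaurus. The thesaurus must contain a WordNet sense index.

This function is experimental and may change in unpredictable, backwards-incompatible ways in future releases. Set `allow_experimental_nlp_functions = 1` to enable it.

**Syntax**

```sql theme={null}
synonyms(ext_name, word)
```

**Arguments**

* `ext_name` — The name of the extension in which the search is performed. [`String`](/zh/reference/data-types/string)
* `word` — The word to search for in the extension. [`String`](/zh/reference/data-types/string)

**Returned value**

Returns an array of synonyms for the given word. [`Array(String)`](/zh/reference/data-types/array)

**Examples**

**Finding synonyms**

```sql title=Query theme={null}
SELECT synonyms('list', 'important')
```

```response title=Response theme={null}
['important','big','critical','crucial']
```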
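The returned array can be unfolded into rows, for example to expand a search term into all of its synonyms. A minimal sketch, assuming a synonyms extension named `'list'` is configured on the server as in the example on this page:

```sql
SET allow_experimental_nlp_functions = 1;

-- arrayJoin produces one row per synonym returned by the extension.
SELECT arrayJoin(synonyms('list', 'important')) AS term;
```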