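What is an Elasticsearch LTR feature
Elasticsearch LTR features are closely tied to Elasticsearch queries. For example, the similarity (relevance) score of the user's search query against each document field can serve as a feature in the training set, and the same score is used as a feature when the model makes predictions.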
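Feature definitions in Elasticsearch LTR are mainly written as mustache templates, in which variables are enclosed in double curly braces, for example {{keywords}}, {{users_lat}}, and {{users_lon}}.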
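To make the placeholder mechanism concrete, here is a minimal Python sketch of how a {{keywords}} variable could be filled into a template. It is a simplified stand-in for real mustache rendering, not the plugin's implementation; the `render` helper and the sample value "rambo" are hypothetical.

```python
import re

# Matches a mustache-style variable such as {{keywords}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, params):
    """Recursively replace {{name}} placeholders in a JSON-like template."""
    if isinstance(template, dict):
        return {key: render(value, params) for key, value in template.items()}
    if isinstance(template, list):
        return [render(item, params) for item in template]
    if isinstance(template, str):
        # A missing parameter raises KeyError rather than rendering as empty.
        return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)
    return template


title_template = {"match": {"title": "{{keywords}}"}}
rendered = render(title_template, {"keywords": "rambo"})
print(rendered)  # {'match': {'title': 'rambo'}}
```

The rendered result is an ordinary match query, which is why the score it produces can be used directly as a feature value.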
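Feature store
Elasticsearch LTR has the concept of a feature store. A feature store is simply an Elasticsearch index that holds the metadata of features and models.
Initialization
To initialize the default feature store: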
PUT _ltr
To delete the default feature store:
DELETE _ltr
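Delete a feature store with caution: doing so clears all of its feature sets, models, and other data.
Business-specific feature stores
In production, create a separate feature store for each business domain, which makes the stores easier to maintain and to read: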
PUT _ltr/{featurestore}
For example, to create a feature store for a question-answering (QA) service:
PUT _ltr/qa_featurestore
The response is:
{ "acknowledged" : true, "shards_acknowledged" : true, "index" : ".ltrstore_qa_featurestore" }
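The response shows that a named store is backed by an index called `.ltrstore_<name>`. The sketch below captures that naming rule in Python. The helper names are hypothetical, and the index name `.ltrstore` for the default store is an assumption that does not appear in the response above.

```python
def store_path(store=None):
    """REST path of a feature store; None means the default store."""
    return "/_ltr" if store is None else f"/_ltr/{store}"


def store_index(store=None):
    """Backing index name, following the `.ltrstore_<name>` pattern."""
    # Assumption: the default store lives in an index named ".ltrstore".
    return ".ltrstore" if store is None else f".ltrstore_{store}"


print(store_path("qa_featurestore"))   # /_ltr/qa_featurestore
print(store_index("qa_featurestore"))  # .ltrstore_qa_featurestore
```

Knowing the backing index name helps when checking whether a store exists or when inspecting what it holds with ordinary index APIs.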
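Features and feature sets
A feature set is a collection of features. It is used for feature logging and offline training: feature logging produces the samples for offline training, and the resulting model is bound to the feature set it was trained on.
Creating a feature set
A feature set is created with a POST request. Against the default feature store, the request has the form: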
POST _ltr/_featureset/{featureset_name}
{
......
}
In practice you usually work with a business-specific feature store, whose name goes between _ltr and _featureset:
POST _ltr/movie_search/_featureset/more_movie_features
{
"featureset": {
"features": [
{
"name": "title_query",
"params": [
"keywords"
],
"template_language": "mustache",
"template": {
"match": {
"title": "{{keywords}}"
}
}
},
{
"name": "title_query_boost",
"params": [
"some_multiplier"
],
"template_language": "derived_expressions",
"template": "title_query * some_multiplier"
},
{
"name": "custom_title_query_boost",
"params": [
"some_multiplier"
],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "params.feature_vector.get('title_query') * (long)params.some_multiplier",
"params": {
"some_multiplier": "some_multiplier"
}
}
}
]
}
}
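Request bodies like the one above are easy to get wrong by hand, so one option is to build them in code. The following Python sketch assembles a feature set body and checks that every {{placeholder}} in a mustache template is declared in `params`. All helper names are hypothetical, and the uniqueness check is a local sanity check (features such as title_query_boost refer to other features by name), not a description of what the plugin validates.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def match_feature(name, field, param):
    """A mustache feature scoring `field` against the query parameter `param`."""
    return {
        "name": name,
        "params": [param],
        "template_language": "mustache",
        "template": {"match": {field: "{{" + param + "}}"}},
    }


def undeclared_params(feature):
    """Placeholders used in the template but missing from `params`."""
    used = set(PLACEHOLDER.findall(json.dumps(feature["template"])))
    return sorted(used - set(feature["params"]))


def featureset_body(features):
    """Wrap features in the {"featureset": {"features": [...]}} envelope."""
    names = [feature["name"] for feature in features]
    if len(names) != len(set(names)):
        raise ValueError("feature names must be unique within a feature set")
    return {"featureset": {"features": list(features)}}


title_query = match_feature("title_query", "title", "keywords")
body = featureset_body([title_query])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the POST request shown above.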
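Adding features
Elasticsearch LTR also provides an API for adding features to an existing feature set. In practice, new features appear as the business changes, for example a match on a newly added text field. For such a feature to take part in later feature logging and model prediction, add it with the _addfeatures API: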
POST /_ltr/_featureset/my_featureset/_addfeatures
{
"features": [{
"name": "user_rating",
"params": [],
"template_language": "mustache",
"template" : {
"function_score": {
"functions": [{
"field_value_factor": {
"field": "vote_average",
"missing": 0
}
}],
"query": {
"match_all": {}
}
}
}
}]
}
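The same request can be assembled in code when features are added from a script. This Python sketch builds the method, path, and body for an _addfeatures call. The helper name, the `overview` field, and the `overview_query` feature are hypothetical, and the path variant with a store name between _ltr and _featureset is assumed to mirror the feature set creation path.

```python
def add_features_request(featureset, features, store=None):
    """Return (method, path, body) for appending features to an existing set."""
    # Assumption: a named store is addressed the same way as when creating a set.
    prefix = "/_ltr" if store is None else f"/_ltr/{store}"
    path = f"{prefix}/_featureset/{featureset}/_addfeatures"
    return "POST", path, {"features": list(features)}


# Hypothetical new feature: match the query against a newly added text field.
overview_query = {
    "name": "overview_query",
    "params": ["keywords"],
    "template_language": "mustache",
    "template": {"match": {"overview": "{{keywords}}"}},
}

method, path, body = add_features_request("my_featureset", [overview_query])
print(method, path)  # POST /_ltr/_featureset/my_featureset/_addfeatures
```

Note that the body has a top-level "features" array, unlike the creation request, which wraps the array in "featureset".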
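Example
QA feature store
Take the features of a QA system the author worked on as an example. They consist mainly of relevance scores against the question text and the answer text, plus historical statistics of the question-answer pair itself: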
PUT _ltr/qa_featurestore/_featureset/qa_v1_featureset
{
"featureset": {
"features": [
{
"name": "qc_match_score_original",
"params": [
"original_query"
],
"template_language": "mustache",
"template": {
"match": {
"question_content": "{{original_query}}"
}
}
},
{
"name": "ac_match_score_original",
"params": [
"original_query"
],
"template_language": "mustache",
"template": {
"match": {
"answer_content": "{{original_query}}"
}
}
},
{
"name": "qc_4_display_match_score_original",
"params": [
"original_query"
],
"template_language": "mustache",
"template": {
"match": {
"question_content_4_display": "{{original_query}}"
}
}
},
{
"name": "ac_4_display_match_score_original",
"params": [
"original_query"
],
"template_language": "mustache",
"template": {
"match": {
"answer_content_4_display": "{{original_query}}"
}
}
},
{
"name": "like_rate",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['like_rate'].size()>0){return doc['like_rate'].value;}else{return 0;}"
}
},
{
"name": "like_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['like_cnt'].size()>0){return doc['like_cnt'].value;}else{return 0;}"
}
},
{
"name": "like_cnt_4_display",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['like_cnt_4_display'].size()>0){return doc['like_cnt_4_display'].value;}else{return 0;}"
}
},
{
"name": "view_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['view_cnt'].size()>0){return doc['view_cnt'].value;}else{return 0;}"
}
},
{
"name": "view_cnt_4_display",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['view_cnt_4_display'].size()>0){return doc['view_cnt_4_display'].value;}else{return 0;}"
}
},
{
"name": "quality_score",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['quality_score'].size()>0){return doc['quality_score'].value;}else{return 0;}"
}
},
{
"name": "word_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['word_cnt'].size()>0){return doc['word_cnt'].value;}else{return 0;}"
}
},
{
"name": "question_view_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_view_cnt'].size()>0){return doc['question_view_cnt'].value;}else{return 0;}"
}
},
{
"name": "question_answer_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_answer_cnt'].size()>0){return doc['question_answer_cnt'].value;}else{return 0;}"
}
},
{
"name": "question_answer_like_cnt_4_display",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_answer_like_cnt_4_display'].size()>0){return doc['question_answer_like_cnt_4_display'].value;}else{return 0;}"
}
},
{
"name": "question_uv_ctr_28d_wilson_95",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_uv_ctr_28d_wilson_95'].size()>0){return doc['question_uv_ctr_28d_wilson_95'].value;}else{return 0;}"
}
},
{
"name": "question_uv_ctr_28d",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_uv_ctr_28d'].size()>0){return doc['question_uv_ctr_28d'].value;}else{return 0;}"
}
},
{
"name": "question_ctr_28d",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_ctr_28d'].size()>0){return doc['question_ctr_28d'].value;}else{return 0;}"
}
},
{
"name": "impression_28d",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['impression_28d'].size()>0){return doc['impression_28d'].value;}else{return 0;}"
}
},
{
"name": "audio_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['audio_cnt'].size()>0){return doc['audio_cnt'].value;}else{return 0;}"
}
},
{
"name": "video_cnt",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['video_cnt'].size()>0){return doc['video_cnt'].value;}else{return 0;}"
}
},
{
"name": "question_ctr_28d_wilson_95",
"params": [],
"template_language": "script_feature",
"template": {
"lang": "painless",
"source": "if(doc['question_ctr_28d_wilson_95'].size()>0){return doc['question_ctr_28d_wilson_95'].value;}else{return 0;}"
}
}
]
}
}
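The corresponding Elasticsearch fields are question_content (question text), answer_content (answer text), question_content_4_display (displayed question text), and answer_content_4_display (displayed answer text).
The first four features in the feature set are text relevance scores; the rest are static attributes and historical statistics of the question-answer pair.
The relevance-score features use mustache as the template language (template_language), while the attribute features use script_feature.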
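The script_feature definitions above all follow one pattern: read a doc value and fall back to 0 when the field is missing. As a sketch, they could be generated from a list of field names instead of being written by hand. The helper name and the shortened field list are illustrative only.

```python
def doc_value_feature(field, default=0):
    """script_feature that returns a doc value, or `default` when it is missing."""
    source = (
        f"if(doc['{field}'].size()>0)"
        f"{{return doc['{field}'].value;}}else{{return {default};}}"
    )
    return {
        "name": field,
        "params": [],
        "template_language": "script_feature",
        "template": {"lang": "painless", "source": source},
    }


# A few of the statistic fields from the feature set above.
STAT_FIELDS = ["like_rate", "like_cnt", "view_cnt", "quality_score"]
features = [doc_value_feature(field) for field in STAT_FIELDS]
print(features[0]["template"]["source"])
```

Generating the entries keeps the missing-value guard identical across features and reduces the chance of a mismatch between a feature name and the field it reads.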