Search Modules¶
Indexing¶
- class app.indexing.BaseDocumentPreprocessor(config: IndexConfig)[source]¶
Base class for document preprocessors.
Classes referenced in the index configuration's preprocess field must be derived from it.
- abstract preprocess(document: dict[str, Any]) FetcherResult [source]¶
Preprocess the document before data ingestion in Elasticsearch.
This can be used to make document schema compatible with the project schema or to add custom fields.
- Returns:
a FetcherResult object:
the status can be used to control whether or not to index the document (or even delete it)
the document is the transformed document
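A minimal sketch of a concrete preprocessor, as referenced from the index configuration's preprocess field. The import path for FetcherResult / FetcherStatus and the FOUND / SKIP member names are assumptions; adapt them to the actual project layout:

```python
from typing import Any

from app.indexing import BaseDocumentPreprocessor
from app._types import FetcherResult, FetcherStatus  # assumed import path


class MyPreprocessor(BaseDocumentPreprocessor):
    """Hypothetical preprocessor adding a custom field before indexing."""

    def preprocess(self, document: dict[str, Any]) -> FetcherResult:
        # skip documents without a code (hypothetical business rule)
        if not document.get("code"):
            return FetcherResult(status=FetcherStatus.SKIP, document=None)
        # add a custom field derived from existing data
        document["name_length"] = len(document.get("name", ""))
        return FetcherResult(status=FetcherStatus.FOUND, document=document)
```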
- class app.indexing.BaseTaxonomyPreprocessor(config: IndexConfig)[source]¶
Base class for taxonomy entries preprocessors.
Classes referenced in the index configuration's preprocess field must be derived from it.
- abstract preprocess(taxonomy: Taxonomy, node: TaxonomyNode) TaxonomyNodeResult [source]¶
Preprocess the taxonomy entry before ingestion in Elasticsearch, and before synonym generation.
This can be used to make the entry schema compatible with the project schema or to add custom fields.
- Returns:
a TaxonomyNodeResult object:
the status can be used to control whether or not to index the entry (or even delete it)
the entry is the transformed entry
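A similar sketch for taxonomy entries, assuming TaxonomyNodeResult mirrors FetcherResult with a status and the (possibly transformed) node; all import paths, the status member name, and the node attributes are assumptions:

```python
from app.indexing import BaseTaxonomyPreprocessor
from app.taxonomy import Taxonomy, TaxonomyNode  # assumed import path
from app._types import TaxonomyNodeResult, TaxonomyNodeStatus  # assumed import path


class MyTaxonomyPreprocessor(BaseTaxonomyPreprocessor):
    """Hypothetical preprocessor tweaking taxonomy entries before ingestion."""

    def preprocess(self, taxonomy: Taxonomy, node: TaxonomyNode) -> TaxonomyNodeResult:
        # hypothetical tweak, assuming node exposes a names dict keyed by language
        node.names.setdefault("en", node.id)
        # SUCCESS member name is an assumption
        return TaxonomyNodeResult(status=TaxonomyNodeStatus.SUCCESS, node=node)
```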
- class app.indexing.DocumentProcessor(config: IndexConfig)[source]¶
DocumentProcessor is responsible for converting an item to index into a dict that is ready to be indexed by Elasticsearch.
- from_result(result: FetcherResult) FetcherResult [source]¶
Generate an item ready to be indexed by elasticsearch-dsl from a fetcher result.
- Parameters:
result – the input data
- Returns:
a new result with transformed data, ready to be indexed or removed or skipped.
In case of indexing or removal, the document always contains an id_ item
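A usage sketch, assuming FetcherResult / FetcherStatus as above and an IndexConfig instance at hand:

```python
from app.indexing import DocumentProcessor
from app._types import FetcherResult, FetcherStatus  # assumed import path

processor = DocumentProcessor(index_config)  # index_config: an IndexConfig
result = processor.from_result(
    FetcherResult(status=FetcherStatus.FOUND, document={"code": "123", "name": "Granola"})
)
if result.status == FetcherStatus.FOUND:
    # result.document now contains the id_ item required for indexing
    es_document = result.document
```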
- app.indexing.generate_dsl_field(field: FieldConfig, supported_langs: Iterable[str]) Field [source]¶
Generate Elasticsearch DSL field from a FieldConfig.
This will be used to generate the Elasticsearch mapping.
This is an important part, because it will define the behavior of each field.
- Parameters:
field – the field to use as input
supported_langs – an iterable of languages (2-letter codes), used to know which sub-fields to create for text_lang and taxonomy field types
- Returns:
the elasticsearch_dsl field
- app.indexing.generate_index_object(index_name: str, config: IndexConfig) Index [source]¶
Index configuration for the project index, which will contain the data
- app.indexing.generate_mapping_object(config: IndexConfig) Mapping [source]¶
ES mapping for the project index, which will contain the data
- app.indexing.generate_taxonomy_index_object(index_name: str, config: IndexConfig) Index [source]¶
Index configuration for indexes containing taxonomy entries
- app.indexing.generate_taxonomy_mapping_object(config: IndexConfig) Mapping [source]¶
ES mapping for indexes containing taxonomy entries
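A sketch of how these generators fit together with elasticsearch-dsl; the connection setup and index name are illustrative:

```python
from elasticsearch_dsl import connections

from app.indexing import generate_index_object

connections.create_connection(hosts=["http://localhost:9200"])  # illustrative endpoint
index = generate_index_object("food_products_v1", index_config)
index.save()  # create the index in Elasticsearch with its mapping
```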
- app.indexing.process_taxonomy_field(data: dict[str, Any], field: FieldConfig, taxonomy_config: TaxonomyConfig, split_separator: str) dict[str, Any] | None [source]¶
Process data for a taxonomy field type.
There is not much to be done here, as the magic of synonyms etc. is handled by ES itself, thanks to our mapping definition, and partly at query time.
- Parameters:
data – input data, as a dict
field – the field config
taxonomy_config – the taxonomy configuration
split_separator – the separator used to split the input field value, in case of multi-valued input (if field.split is True)
- Returns:
the processed value
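A call sketch; the field and taxonomy configuration objects come from the index configuration, and all names are illustrative:

```python
from app.indexing import process_taxonomy_field

# a comma-separated, multi-valued taxonomy field (field.split is True)
data = {"categories": "en:breakfast-cereals,en:mueslis"}
processed = process_taxonomy_field(
    data=data,
    field=categories_field,           # FieldConfig for "categories", from the index config
    taxonomy_config=taxonomy_config,  # TaxonomyConfig, from the index config
    split_separator=",",
)
```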
- app.indexing.process_text_lang_field(data: dict[str, Any], input_field: str, split: bool, lang_separator: str, split_separator: str, supported_langs: set[str]) dict[str, Any] | None [source]¶
Process data for a text_lang field type.
Generates a dict ready to be indexed by Elasticsearch, with a subfield for each language.
- Parameters:
data – input data, as a dict
input_field – the name of the field to use as input
split – whether to split the input field value, using split_separator as separator
lang_separator – the separator used to separate the language code from the field name
split_separator – the separator used to split the input field value, in case of multi-valued input (if split is True)
supported_langs – a set of supported languages (2-letter codes), used to know which sub-fields to create
- Returns:
the processed data, as a dict
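An input/output sketch, assuming language-suffixed keys like name_en in the source data; the exact output shape is illustrative:

```python
from app.indexing import process_text_lang_field

# language-suffixed input keys, e.g. name_en / name_fr
data = {"name_en": "Orange juice", "name_fr": "Jus d'orange"}
processed = process_text_lang_field(
    data=data,
    input_field="name",
    split=False,
    lang_separator="_",
    split_separator=",",
    supported_langs={"en", "fr"},
)
# expected shape (illustrative): {"en": "Orange juice", "fr": "Jus d'orange"}
```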
Query¶
- app.query.add_languages_suffix(analysis: QueryAnalysis, langs: list[str], config: IndexConfig) QueryAnalysis [source]¶
Add the correct language suffixes to fields of type text_lang or taxonomy
The resulting query matches in one language OR another
- app.query.boost_phrases(analysis: QueryAnalysis, boost: float, proximity: int | None) QueryAnalysis [source]¶
Boost all phrases in the query
- app.query.build_completion_query(q: str, taxonomy_names: list[str], langs: list[str], size: int, config: IndexConfig, fuzziness: int | None = 2)[source]¶
Build an elasticsearch_dsl completion Query.
- Parameters:
q – the user autocomplete query
taxonomy_names – a list of taxonomies we want to search in
langs – the languages we want search in
size – number of results to return
config – the index configuration to use
fuzziness – fuzziness parameter for completion query
- Returns:
the built Query
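A call sketch for autocompleting a user query across two taxonomies; all values are illustrative:

```python
from app.query import build_completion_query

query = build_completion_query(
    q="choco",
    taxonomy_names=["categories", "brands"],
    langs=["en", "fr"],
    size=10,
    config=index_config,
    fuzziness=1,
)
```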
- app.query.build_elasticsearch_query_builder(config: IndexConfig) ElasticsearchQueryBuilder [source]¶
Create the ElasticsearchQueryBuilder object according to our configuration
- app.query.build_search_query(params: SearchParameters, es_query_builder: ElasticsearchQueryBuilder) QueryAnalysis [source]¶
Build an elasticsearch_dsl Query.
- Parameters:
params – SearchParameters containing all search parameters
es_query_builder – the builder to transform the luqum tree to an elasticsearch query
- Returns:
the built Search query
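A sketch of how the two functions combine; reusing the builder across requests is a natural usage pattern rather than a documented requirement:

```python
from app.query import build_elasticsearch_query_builder, build_search_query

# build once per index configuration, reuse for every search request
es_query_builder = build_elasticsearch_query_builder(index_config)
analysis = build_search_query(params, es_query_builder)  # params: SearchParameters
```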
- app.query.check_query(params: SearchParameters, analysis: QueryAnalysis) None [source]¶
Run some sanity checks on the luqum query
- app.query.compute_facets_filters(q: QueryAnalysis) QueryAnalysis [source]¶
Extract facets filters from the query
For now it only handles SearchField nodes under a top-level AND operation, whose expression is a bare term or an OR operation of bare terms.
We do not verify whether the field is an aggregation field or not; that can be done at a later stage
- Returns:
a new QueryAnalysis with facets_filters attribute as a dictionary of field names and list of values to filter on
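A shape sketch; in the real pipeline other analysis steps run around this one, so this only illustrates the kind of query the extraction handles and the resulting filters (output shape illustrative):

```python
from app.query import compute_facets_filters, parse_query

# search fields at the top level (implicit AND), each a bare term
# or an OR of bare terms
analysis = parse_query("nutella brands:ferrero categories:(cereals OR mueslis)")
analysis = compute_facets_filters(analysis)
# analysis.facets_filters (illustrative):
# {"brands": ["ferrero"], "categories": ["cereals", "mueslis"]}
```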
- app.query.create_aggregation_clauses(config: IndexConfig, fields: set[str] | list[str] | None) dict[str, Agg] [source]¶
Create term bucket aggregation clauses for all fields corresponding to facets, as defined in the config
- app.query.parse_query(q: str | None) QueryAnalysis [source]¶
Begin query analysis by parsing the query.
- app.query.parse_sort_by_field(sort_by: str | None, config: IndexConfig) str | None [source]¶
Parse the sort_by parameter; special handling is performed for the text_lang subfield.
- Parameters:
sort_by – the raw sort_by value
config – the index configuration to use
- Returns:
None if sort_by is not provided or the final value otherwise
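A call sketch; the field name and the "-" descending-sort prefix are illustrative:

```python
from app.query import parse_sort_by_field

# for a text_lang field, the language sub-field is resolved here
sort_field = parse_sort_by_field("-completeness", index_config)
```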
- app.query.parse_sort_by_script(sort_by: str, params: dict[str, Any] | None, config: IndexConfig, index_id: str) dict[str, Any] [source]¶
Create the ES sort expression to sort by a script
- app.query.resolve_open_ranges(analysis: QueryAnalysis) QueryAnalysis [source]¶
Resolve open ranges to closed ranges before using the Elasticsearch query builder
- app.query.resolve_unknown_operation(analysis: QueryAnalysis) QueryAnalysis [source]¶
Resolve unknown operations in the query to an AND
Search¶
- app.search.search(params: SearchParameters) ErrorSearchResponse | SuccessSearchResponse [source]¶
Run a search
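A usage sketch; the SearchParameters import path and field names are assumptions:

```python
from app.search import search
from app._types import SearchParameters  # assumed import path

params = SearchParameters(q='categories:"en:breakfast-cereals"', page_size=10)
response = search(params)
# response is either a SuccessSearchResponse or an ErrorSearchResponse
```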
Facets¶
A module to help build facets from aggregations
- app.facets.build_facets(search_result: SuccessSearchResponse, query_analysis: QueryAnalysis, lang: str, index_config: IndexConfig, facets_names: list[str] | None) dict[str, FacetInfo] [source]¶
Given a search result with aggregations, build a list of facets for the API response
- app.facets.translate_facets_values(lang: str, facets: dict[str, FacetInfo], index_config: IndexConfig)[source]¶
Translate values of facets
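A usage sketch chaining the two functions; the facet names are illustrative:

```python
from app.facets import build_facets, translate_facets_values

facets = build_facets(
    search_result,    # a SuccessSearchResponse with aggregations
    query_analysis,   # the QueryAnalysis used for the search
    lang="en",
    index_config=index_config,
    facets_names=["brands", "categories"],
)
translate_facets_values("en", facets, index_config)
```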
Charts¶
Vega is used as a JSON visualization grammar (doc: https://vega.github.io/vega/docs/). It would have been possible to use the higher-level vega-lite API, which is able to write Vega specifications, but it's probably too much for our usage. Inspired by: https://vega.github.io/vega/examples/bar-chart/
- app.charts.build_charts(search_result: SuccessSearchResponse, index_config: IndexConfig, requested_charts: list[DistributionChart | ScatterChart] | None) dict[str, dict[str, Any]] [source]¶
Build and return Vega chart representations for the requested charts
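A call sketch, assuming a successful search response is at hand; the chart objects come from the request and the output shape is illustrative:

```python
from app.charts import build_charts

# search_result: a SuccessSearchResponse
# requested_charts: list of DistributionChart / ScatterChart objects
charts = build_charts(search_result, index_config, requested_charts)
# charts (illustrative): {"categories_distribution": {...vega spec...}, ...}
```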
- app.charts.build_distribution_chart(chart: DistributionChart, values, index_config: IndexConfig)[source]¶
Return the Vega structure for a bar chart. Inspiration: https://vega.github.io/vega/examples/bar-chart/
- app.charts.build_scatter_chart(chart_option: ScatterChart, search_result, index_config: IndexConfig)[source]¶
Build a scatter plot only for values from search_result (only values in the current page). TODO: use values from the whole search? Inspiration: https://vega.github.io/vega/examples/scatter-plot/
- app.charts.empty_chart(chart_name)[source]¶
Return a responsive Vega chart using signals and auto-size. See https://gist.github.com/donghaoren/023b2246569e8f0615017507b473e55e
ES Scripts¶
Module to manage ES scripts that can be used for personalized sorting
- app.es_scripts.get_script_id(index_id: str, script_id: str)[source]¶
Return the script id for a script specific to an index: we prefix it with the index_id.
- app.es_scripts.get_script_prefix(index_id: str)[source]¶
Return the prefix used for scripts specific to an index (based on the index_id).
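A sketch of the naming convention; the script name is illustrative and the exact prefix separator is not shown here:

```python
from app.es_scripts import get_script_id, get_script_prefix

prefix = get_script_prefix("food")                   # prefix for the "food" index
script_id = get_script_id("food", "personal_score")  # prefixed script id
```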
- app.es_scripts.sync_scripts(index_id: str, index_config: IndexConfig) dict[str, int] [source]¶
Resync the scripts between the configuration and Elasticsearch.
Taxonomy ES¶
Operations on taxonomies in Elasticsearch
See also app.taxonomy
- app.taxonomy_es.create_synonyms_files(taxonomy: Taxonomy, langs: list[str], target_dir: Path)[source]¶
Create a set of files that can be used to define a Synonym Graph Token Filter
We will match every known synonym in a language to the identifier of the entry. We do this because we are not sure which is the main language for an entry.
Also, the special xx language, if it exists, is added to every language.
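A usage sketch; the target directory must be reachable by Elasticsearch, since the files feed a Synonym Graph Token Filter, and the path is illustrative:

```python
from pathlib import Path

from app.taxonomy_es import create_synonyms_files

# each generated file maps synonyms to the entry identifier, in the Solr
# synonym format ES consumes, e.g. "soy milk, soya milk => en:soy-milk"
create_synonyms_files(
    taxonomy,  # a Taxonomy object
    langs=["en", "fr"],
    target_dir=Path("/usr/share/elasticsearch/config/synonyms"),
)
```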
- app.taxonomy_es.get_taxonomy_names(items: list[tuple[str, str]], config: IndexConfig) dict[tuple[str, str], dict[str, str]] [source]¶
Given a set of terms in different taxonomies, return their names
Analyzers (Utils)¶
Defines some analyzers for the Elasticsearch fields.
- app.utils.analyzers.get_autocomplete_analyzer(lang: str) CustomAnalysis [source]¶
Return the search analyzer to use for the autocomplete field
- app.utils.analyzers.get_taxonomy_indexing_analyzer(taxonomy: str, lang: str) CustomAnalysis [source]¶
We want to index taxonomy terms as keywords (as we only store the id), but with a specific tweak: transform hyphens into underscores.
- app.utils.analyzers.get_taxonomy_search_analyzer(taxonomy: str, lang: str, with_synonyms: bool) CustomAnalysis [source]¶
Return the search analyzer to use for the taxonomized field
- Parameters:
taxonomy – the taxonomy name
lang – the language code
with_synonyms – whether to add the synonym filter
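A call sketch; the taxonomy name is illustrative:

```python
from app.utils.analyzers import get_taxonomy_search_analyzer

# search analyzer for the "categories" taxonomy in English, with synonyms
analyzer = get_taxonomy_search_analyzer("categories", "en", with_synonyms=True)
```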
- app.utils.analyzers.get_taxonomy_stop_words_filter(taxonomy: str, lang: str) TokenFilter | None [source]¶
Return the stop words filter to use for the taxonomized field analyzer
IMPORTANT: deactivated for now! If we want to handle stop words, we have to remove them from synonyms too, so we need the list.