Search Modules

Indexing

class app.indexing.BaseDocumentPreprocessor(config: IndexConfig)[source]

Base class for document preprocessors.

Classes referenced in the index configuration's preprocess field must derive from this class.

abstract preprocess(document: dict[str, Any]) FetcherResult[source]

Preprocess the document before data ingestion in Elasticsearch.

This can be used to make the document schema compatible with the project schema or to add custom fields.

Returns:

a FetcherResult object:

  • the status can be used to control whether the document is indexed, skipped, or even deleted

  • the document is the transformed document
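
For illustration, a minimal subclass might look like the sketch below (the FetcherResult constructor and the FetcherStatus names are assumptions based on this page, not taken from the code base):

    from typing import Any

    from app.indexing import BaseDocumentPreprocessor

    class MyPreprocessor(BaseDocumentPreprocessor):
        def preprocess(self, document: dict[str, Any]) -> "FetcherResult":
            # skip documents missing a mandatory field
            # (FetcherResult / FetcherStatus import path not shown, as it is
            # not given on this page; the status names are illustrative)
            if "code" not in document:
                return FetcherResult(status=FetcherStatus.SKIPPED, document=None)
            # add a custom field to align with the project schema
            document["name_length"] = len(document.get("name", ""))
            return FetcherResult(status=FetcherStatus.FOUND, document=document)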

class app.indexing.BaseTaxonomyPreprocessor(config: IndexConfig)[source]

Base class for taxonomy entries preprocessors.

Classes referenced in the index configuration's preprocess field must derive from this class.

abstract preprocess(taxonomy: Taxonomy, node: TaxonomyNode) TaxonomyNodeResult[source]

Preprocess the taxonomy entry before ingestion in Elasticsearch, and before synonym generation.

This can be used to make the document schema compatible with the project schema or to add custom fields.

Returns:

a TaxonomyNodeResult object:

  • the status can be used to control whether the entry is indexed, skipped, or even deleted

  • the entry is the transformed entry
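
Analogously, a minimal taxonomy preprocessor sketch (the node attributes and the TaxonomyNodeResult constructor are assumptions):

    from app.indexing import BaseTaxonomyPreprocessor

    class MyTaxonomyPreprocessor(BaseTaxonomyPreprocessor):
        def preprocess(self, taxonomy: "Taxonomy", node: "TaxonomyNode") -> "TaxonomyNodeResult":
            # hypothetical: trim whitespace in names before synonym generation,
            # assuming nodes carry a names dict keyed by language code
            for lang, name in getattr(node, "names", {}).items():
                node.names[lang] = name.strip()
            # status handling mirrors FetcherResult above (names illustrative)
            return TaxonomyNodeResult(status=TaxonomyNodeStatus.SUCCESS, node=node)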

class app.indexing.DocumentProcessor(config: IndexConfig)[source]

DocumentProcessor is responsible for converting an item to index into a dict that is ready to be indexed by Elasticsearch.

from_result(result: FetcherResult) FetcherResult[source]

Generate an item ready to be indexed by elasticsearch-dsl from a fetcher result.

Parameters:

result – the input data

Returns:

a new result with transformed data, ready to be indexed, removed, or skipped.

In case of indexing or removal, the document always contains an id_ item.

inputs_from_data(id_, processed_data: dict[str, Any]) dict[str, Any][source]

Generate a dict with the data to be indexed in ES

app.indexing.generate_dsl_field(field: FieldConfig, supported_langs: Iterable[str]) Field[source]

Generate Elasticsearch DSL field from a FieldConfig.

This will be used to generate the Elasticsearch mapping.

This is an important part, because it will define the behavior of each field.

Parameters:
  • field – the field to use as input

  • supported_langs – an iterable of languages (2-letter codes), used to know which sub-fields to create for text_lang and taxonomy field types

Returns:

the elasticsearch_dsl field
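
To give a feel for the result: for a text_lang field with supported_langs = ["en", "fr"], the generated field plausibly resembles the hand-written elasticsearch_dsl structure below (a sketch of the shape only; the actual analyzers and sub-fields are defined by the project configuration):

    from elasticsearch_dsl import Object, Text

    # one text sub-field per supported language,
    # e.g. product_name.en and product_name.fr
    product_name = Object(
        properties={
            "en": Text(analyzer="english"),
            "fr": Text(analyzer="french"),
        }
    )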

app.indexing.generate_index_object(index_name: str, config: IndexConfig) Index[source]

Index configuration for the project index, which will contain the data

app.indexing.generate_mapping_object(config: IndexConfig) Mapping[source]

ES Mapping for the project index, which will contain the data

app.indexing.generate_taxonomy_index_object(index_name: str, config: IndexConfig) Index[source]

Index configuration for indexes containing taxonomy entries

app.indexing.generate_taxonomy_mapping_object(config: IndexConfig) Mapping[source]

ES Mapping for indexes containing taxonomy entries

app.indexing.process_taxonomy_field(data: dict[str, Any], field: FieldConfig, taxonomy_config: TaxonomyConfig, split_separator: str) dict[str, Any] | None[source]

Process data for a taxonomy field type.

There is not much to be done here, as the magic of synonyms etc. happens in ES itself, thanks to our mapping definition, and partly at query time.

Parameters:
  • data – input data, as a dict

  • field – the field config

  • split_separator – the separator used to split the input field value, in case of multi-valued input (if field.split is True)

Returns:

the processed value

app.indexing.process_text_lang_field(data: dict[str, Any], input_field: str, split: bool, lang_separator: str, split_separator: str, supported_langs: set[str]) dict[str, Any] | None[source]

Process data for a text_lang field type.

Generates a dict ready to be indexed by Elasticsearch, with a subfield for each language.

Parameters:
  • data – input data, as a dict

  • input_field – the name of the field to use as input

  • split – whether to split the input field value, using split_separator as separator

  • lang_separator – the separator used to separate the language code from the field name

  • split_separator – the separator used to split the input field value, in case of multi-valued input (if split is True)

  • supported_langs – a set of supported languages (2-letter codes), used to know which sub-fields to create

Returns:

the processed data, as a dict
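
An illustrative call (field names and the exact output shape are assumptions, not taken from the code base):

    data = {
        "product_name_en": "Chocolate",
        "product_name_fr": "Chocolat",
        "product_name_it": "Cioccolato",
    }
    result = process_text_lang_field(
        data,
        input_field="product_name",
        split=False,
        lang_separator="_",
        split_separator=",",
        supported_langs={"en", "fr"},
    )
    # plausible result: {"en": "Chocolate", "fr": "Chocolat"}
    # the "it" entry is dropped because it is not in supported_langs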

Query

app.query.add_languages_suffix(analysis: QueryAnalysis, langs: list[str], config: IndexConfig) QueryAnalysis[source]

Add the correct language suffixes to fields of type text_lang or taxonomy.

The resulting query matches in one language OR another.

app.query.boost_phrases(analysis: QueryAnalysis, boost: float, proximity: int | None) QueryAnalysis[source]

Boost all phrases in the query

app.query.build_completion_query(q: str, taxonomy_names: list[str], langs: list[str], size: int, config: IndexConfig, fuzziness: int | None = 2)[source]

Build an elasticsearch_dsl completion Query.

Parameters:
  • q – the user autocomplete query

  • taxonomy_names – a list of taxonomies we want to search in

  • langs – the languages we want to search in

  • size – number of results to return

  • config – the index configuration to use

  • fuzziness – fuzziness parameter for completion query

Returns:

the built Query
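
A sketch of a call (argument values are illustrative; index_config stands for an IndexConfig instance):

    query = build_completion_query(
        q="choco",
        taxonomy_names=["categories", "labels"],
        langs=["en", "fr"],
        size=10,
        config=index_config,
        fuzziness=1,  # or None to disable fuzzy matching
    )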

app.query.build_elasticsearch_query_builder(config: IndexConfig) ElasticsearchQueryBuilder[source]

Create the ElasticsearchQueryBuilder object according to our configuration

app.query.build_search_query(params: SearchParameters, es_query_builder: ElasticsearchQueryBuilder) QueryAnalysis[source]

Build an elasticsearch_dsl Query.

Parameters:
  • params – SearchParameters containing all search parameters

  • es_query_builder – the builder to transform the luqum tree to an elasticsearch query

Returns:

the built Search query

app.query.check_query(params: SearchParameters, analysis: QueryAnalysis) None[source]

Run some sanity checks on the luqum query

app.query.compute_facets_filters(q: QueryAnalysis) QueryAnalysis[source]

Extract facets filters from the query

For now it only handles a SearchField under a top-level AND operation, whose expression is a bare term or an OR operation of bare terms.

We do not verify whether the field is an aggregation field; that can be done at a later stage.

Returns:

a new QueryAnalysis with the facets_filters attribute set to a dictionary mapping field names to the list of values to filter on
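
An illustrative run (facet field names and the exact output shape are assumptions):

    analysis = parse_query("states:france AND labels:(organic OR fair-trade) AND chocolate")
    analysis = compute_facets_filters(analysis)
    # plausible facets_filters value:
    #   {"states": ["france"], "labels": ["organic", "fair-trade"]}
    # the bare term "chocolate" is not a SearchField, so it adds no filter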

app.query.create_aggregation_clauses(config: IndexConfig, fields: set[str] | list[str] | None) dict[str, Agg][source]

Create term bucket aggregation clauses for all fields corresponding to facets, as defined in the config
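
The clauses are plain terms bucket aggregations, roughly equivalent to building them by hand with elasticsearch_dsl (facet names illustrative):

    from elasticsearch_dsl import A

    clauses = {
        "brands": A("terms", field="brands"),
        "labels": A("terms", field="labels"),
    }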

app.query.parse_query(q: str | None) QueryAnalysis[source]

Begin query analysis by parsing the query.

app.query.parse_sort_by_field(sort_by: str | None, config: IndexConfig) str | None[source]

Parse the sort_by parameter; special handling is performed for text_lang subfields.

Parameters:
  • sort_by – the raw sort_by value

  • config – the index configuration to use

Returns:

None if sort_by is not provided or the final value otherwise

app.query.parse_sort_by_script(sort_by: str, params: dict[str, Any] | None, config: IndexConfig, index_id: str) dict[str, Any][source]

Create the ES sort expression to sort by a script
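
The result follows the standard Elasticsearch script-based sorting structure; a plausible shape (script id, prefix separator, and params are illustrative):

    sort_clause = {
        "_script": {
            "type": "number",
            "script": {
                # a stored script id, prefixed with the index id (see get_script_id)
                "id": "my_index:personal_score",
                "params": {"preferences": ["en:organic"]},
            },
            "order": "desc",
        }
    }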

app.query.resolve_open_ranges(analysis: QueryAnalysis) QueryAnalysis[source]

We need to resolve open ranges to closed ranges before using the Elasticsearch query builder

app.query.resolve_unknown_operation(analysis: QueryAnalysis) QueryAnalysis[source]

Resolve unknown operations in the query to an AND (e.g. the query chocolate milk is interpreted as chocolate AND milk)

Search

app.search.search(params: SearchParameters) ErrorSearchResponse | SuccessSearchResponse[source]

Run a search

Facets

A module to help build facets from aggregations

app.facets.build_facets(search_result: SuccessSearchResponse, query_analysis: QueryAnalysis, lang: str, index_config: IndexConfig, facets_names: list[str] | None) dict[str, FacetInfo][source]

Given a search result with aggregations, build a list of facets for the API response

app.facets.translate_facets_values(lang: str, facets: dict[str, FacetInfo], index_config: IndexConfig)[source]

Translate values of facets

Charts

app.charts.build_charts(search_result: SuccessSearchResponse, index_config: IndexConfig, requested_charts: list[DistributionChart | ScatterChart] | None) dict[str, dict[str, Any]][source]

Build and return Vega chart representations for the requested charts

app.charts.build_distribution_chart(chart: DistributionChart, values, index_config: IndexConfig)[source]

Return the Vega structure for a bar chart. Inspiration: https://vega.github.io/vega/examples/bar-chart/

app.charts.build_scatter_chart(chart_option: ScatterChart, search_result, index_config: IndexConfig)[source]

Build a scatter plot only for values from search_result (only values in the current page). TODO: use values from the whole search? Inspiration: https://vega.github.io/vega/examples/scatter-plot/

app.charts.empty_chart(chart_name)[source]

Return a responsive Vega chart using signals and auto-size. See: https://gist.github.com/donghaoren/023b2246569e8f0615017507b473e55e

Vega is used as a JSON visualization grammar (doc: https://vega.github.io/vega/docs/). It would have been possible to use the higher-level Vega-Lite API, which can generate Vega specifications, but it is probably too much for our usage. Inspired by: https://vega.github.io/vega/examples/bar-chart/

ES Scripts

Module to manage ES scripts that can be used for personalized sorting

app.es_scripts.get_script_id(index_id: str, script_id: str)[source]

We prefix scripts specific to an index with the index_id.

app.es_scripts.get_script_prefix(index_id: str)[source]

We prefix scripts specific to an index with the index_id.

app.es_scripts.sync_scripts(index_id: str, index_config: IndexConfig) dict[str, int][source]

Resync the scripts between configuration and elasticsearch.

Taxonomy ES

Operations on taxonomies in Elasticsearch

See also app.taxonomy

app.taxonomy_es.create_synonyms_files(taxonomy: Taxonomy, langs: list[str], target_dir: Path)[source]

Create a set of files that can be used to define a Synonym Graph Token Filter

We will match every known synonym in a language to the identifier of the entry. We do this because we are not sure which is the main language for an entry.

Also, the special xx language, if present, is added to every language.

see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-with-synonyms.html#synonyms-store-synonyms-file
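
The files use the standard Solr/Elasticsearch synonyms format, mapping every synonym to the entry identifier. An illustrative line (entry id, synonyms, and file layout invented for the example):

    # e.g. in <target_dir>/en.txt
    yoghurt, yogurt, yoghourt => en:yogurts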

app.taxonomy_es.get_taxonomy_names(items: list[tuple[str, str]], config: IndexConfig) dict[tuple[str, str], dict[str, str]][source]

Given a set of terms in different taxonomies, return their names

Analyzers (Utils)

Defines some analyzers for the Elasticsearch fields.

app.utils.analyzers.get_autocomplete_analyzer(lang: str) CustomAnalysis[source]

Return the search analyzer to use for the autocomplete field

app.utils.analyzers.get_taxonomy_indexing_analyzer(taxonomy: str, lang: str) CustomAnalysis[source]

We want to index taxonomy terms as keywords (as we only store the id), but with a specific tweak: transform hyphens into underscores.

app.utils.analyzers.get_taxonomy_search_analyzer(taxonomy: str, lang: str, with_synonyms: bool) CustomAnalysis[source]

Return the search analyzer to use for the taxonomized field

Parameters:
  • taxonomy – the taxonomy name

  • lang – the language code

  • with_synonyms – whether to add the synonym filter

app.utils.analyzers.get_taxonomy_stop_words_filter(taxonomy: str, lang: str) TokenFilter | None[source]

Return the stop words filter to use for the taxonomized field analyzer

IMPORTANT: deactivated for now! If we want to handle them, we have to remove them from the synonyms as well, so we need the list.

app.utils.analyzers.get_taxonomy_synonym_filter(taxonomy: str, lang: str) TokenFilter[source]

Return the synonym filter to use for the taxonomized field analyzer

app.utils.analyzers.number_of_fields(mapping: Mapping | dict[str, dict[str, Any]]) int[source]

Return the number of fields in the mapping

Connection (Utils)

app.utils.connection.current_es_client()[source]

Return the default Elasticsearch connection