This repository was archived by the owner on Sep 8, 2021. It is now read-only.

Improve search accuracy #1235

Merged 2 commits on Dec 3, 2019

Contributor

@tesshucom tesshucom commented Sep 15, 2019

Related to #1142.
This PR proposes the following two improvements.

  • Setting boost values. The legacy code anticipated boosting, but that part was dead code.
  • Airsonic-specific stopwords for music search.

In a search system, these are usually decided at design time.
The current search doesn't take into account the convenience of users who type short phrases.

Here is a simple example.


Stopword change example

"Will" cannot be searched for in the legacy search.

image

image


Example of boost value adjustment

It is reasonable to give priority to the leftmost item in the search results.
Because the boost is set to a very small value, the priority can still be reversed when the artist name scores high.

before

image

after

image
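
The effect of the boost adjustment can be sketched in plain Java without Lucene. This is only an illustration under assumptions: the field names, the boost values (1.1 and 1.0), and the additive scoring rule are hypothetical and are not the values or the scoring used in this PR.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy additive scorer for illustration; not Lucene's scoring and not Airsonic code.
final class FieldBoostSketch {

    // Builds a per-field boost table (both values are assumptions).
    static Map<String, Double> boosts(double title, double artist) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", title);
        boosts.put("artist", artist);
        return boosts;
    }

    // Adds the field's boost once for every query term found in that field.
    static double score(Map<String, String> doc, String query, Map<String, Double> boosts) {
        double total = 0.0;
        String[] terms = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (Map.Entry<String, Double> boost : boosts.entrySet()) {
            String text = doc.getOrDefault(boost.getKey(), "").toLowerCase(Locale.ROOT);
            for (String term : terms) {
                if (!term.isEmpty() && Arrays.asList(text.split("\\s+")).contains(term)) {
                    total += boost.getValue();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> titleMatch = Map.of("title", "Queen", "artist", "Some Band");
        Map<String, String> artistMatch = Map.of("title", "Some Song", "artist", "Queen");

        // Without a boost both documents tie, so their order is arbitrary.
        Map<String, Double> flat = boosts(1.0, 1.0);
        System.out.println(score(titleMatch, "queen", flat) + " vs " + score(artistMatch, "queen", flat));

        // A small boost on the first-shown field breaks the tie in its favour.
        Map<String, Double> boosted = boosts(1.1, 1.0);
        System.out.println(score(titleMatch, "queen", boosted) + " vs " + score(artistMatch, "queen", boosted));
    }
}
```

Because the boost is small, a document that matches several terms in the lower-priority field can still outrank one that matches a single term in the boosted field.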


The values in the stopword lists need brainstorming.
(Opinions from English speakers are particularly needed, because I am Japanese.)

  • Once this is done, maintenance is easy and anyone can do it. For example, if a composer field is added, it can be given its own priority.
  • Stopwords are assigned separately for Artist and for the other fields, because the words to ignore, such as feat and with, differ slightly between them.
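
The idea of separate stopword lists per field can be sketched roughly as follows. The list contents and the tokenization rule here are assumptions made for illustration; the actual words live in the PR's stopword files, and the real analysis is done by Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: a hand-rolled tokenizer and stopword filter, not Lucene's Analyzer.
final class FieldStopwordsSketch {

    // Hypothetical lists; the artist list additionally drops collaboration markers.
    static final Set<String> GENERAL_STOPWORDS = Set.of("a", "an", "the", "of");
    static final Set<String> ARTIST_STOPWORDS = Set.of("a", "an", "the", "of", "feat", "with");

    // Lowercases, splits on anything that is not a letter or digit, and drops stopwords.
    static List<String> analyze(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // In the artist field, "feat" is noise between two artist names.
        System.out.println(analyze("Alice feat. Bob", ARTIST_STOPWORDS));
        // In a title field, "with" may be a meaningful part of the title.
        System.out.println(analyze("With or Without You", GENERAL_STOPWORDS));
    }
}
```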

@muff1nman

  • Although not required, there is one consideration. This change affects the index version, so releasing this PR together with the Lucene update would suit users best (if the releases are split, the index must be rebuilt each time). This PR does not include the version increment.

@eharris eharris added the "in: search/sort" (Issues in the searching/indexing of media.) and "usability" labels Sep 16, 2019
@tesshucom tesshucom force-pushed the improve-search-accuracy branch from aec5e75 to d069947 Compare September 24, 2019 13:41
@fxthomas
Contributor

fxthomas commented Sep 25, 2019

A few questions about how this works:

  • Is it possible to not ignore stopwords but leave them at a low priority? This way one can still search for a stop word if we really want it. How do you think this would interfere with the accuracy?
  • How consistent is your PR with the "articles to ignore" field in Settings > General that specifically mentions that these articles will be ignored in the index? Is that setting still relevant?
  • Edge case related to the setting: depending on the language, some stop words might be different. For example, los in German means either "go!" or "without" and should probably not be ignored if your collection consists mostly of German artists/albums. That's the reason why I like the idea of keeping it configurable.

Otherwise I like what your screenshots show, it sounds reasonable to boost the priority of the artist field in the Artist column. Great job!

@tesshucom
Contributor Author

  • Is it possible to not ignore stopwords but leave them at a low priority? This way one can still search for a stop word if we really want it. How do you think this would interfere with the accuracy?

We could write logic that lowers the boost for specific words.
That is technically possible.

However, I have never seen it done, and it is probably not common,
because it complicates the score calculation.

Even if the boost is lowered, such a word still produces a hit,
so false hits increase when a song with a long title is entered.
If a query contains more noise words than important keywords, the noise has a negative impact on the score.

As a result, it becomes difficult to display results in the order the user wants.
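
A toy comparison may make this argument more concrete. The sketch below is an assumption-laden illustration, not Lucene scoring: the noise-word list and the weights are hypothetical, and a noise weight of 0.0 stands in for removing stopwords entirely.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: compares "low boost for noise words" with "drop noise words".
final class LowBoostStopwordSketch {

    // Hypothetical noise words.
    static final Set<String> NOISE = Set.of("a", "in", "the");

    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Each query term found in the title adds 1.0, or noiseWeight if it is a noise word.
    static double score(String query, String title, double noiseWeight) {
        List<String> titleTokens = tokens(title);
        double total = 0.0;
        for (String term : tokens(query)) {
            if (titleTokens.contains(term)) {
                total += NOISE.contains(term) ? noiseWeight : 1.0;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String query = "the man in the mirror";
        for (String title : List.of("Man in the Mirror", "The Cat in the Hat")) {
            System.out.println(title
                    + " | low boost: " + score(query, title, 0.2)
                    + " | removed: " + score(query, title, 0.0));
        }
    }
}
```

With the lowered weight, the second title shares only noise words with the query but still gets a positive score, so it appears as a hit. With the noise words removed, it scores zero and drops out of the results.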

  • How consistent is your PR with the "articles to ignore" field in Settings > General that specifically mentions that these articles will be ignored in the index? Is that setting still relevant?

They are statically defined in the following files:

stopwords.txt
stopwords_artist.txt

They could be made changeable at runtime, but the following points need to be considered.

  • The operation is complex (a full scan is required).
  • Stopwords are part of the Analyzer mechanism and depend strongly on the Tokenizer.
    (In other words, this setting alone cannot provide complete multilingual support.)

Generally, stopwords are managed as files such as stopwords.txt, as is done in this PR.

  • Edge case related to the setting: depending on the language, some stop words might be different. For example, los in German means either "go!" or "without" and should probably not be ignored if your collection consists mostly of German artists/albums. That's the reason why I like the idea of keeping it configurable.

I'll take that into account.


I would be happy if feedback could be collected in the following form:

Candidates to remove from the default stopwords

los -> to be deleted, because it causes problems for German.

la  -> otherwise an artist called "la la" or a song called "la la la" cannot be found.

Candidates to add

None for now

Considered adding, but decided against

x  -> "x" often denotes a collaboration, but as a stopword it would make an artist named X impossible to find.

@fxthomas
Contributor

fxthomas commented Sep 30, 2019

Complex operation (full scan required)

That's correct, although we could automatically start a rescan if that happens.

Stopwords is part of the Analyzer mechanism. Especially strongly dependent on Tokenizer (In other words, multilingual support is not possible completely only with this setting value.)

You're right on this. If we want to implement this we should probably add a Tokenizer setting in addition to the stop words, but not in this PR.

I will pull & test, will let you know if everything looks good for me.

EDIT: A couple of searches are already much, much better. I like the improvements!

Thinking about this we should probably remove the "articles to ignore" field from the UI if it's not used anymore.

@tesshucom
Contributor Author

Currently, "articles to ignore" is a setting for the index in the domain sense.
(It is used when building the index on the left side of the screen, to strip the leading article and classify entries properly.)

"articles to ignore" is part of the domain design and is also an output item of the REST API.
So it probably cannot be removed.

The search stopwords are not directly related to the domain design.
The legacy implementation uses Lucene's defaults, which are not visible in the Airsonic code.
The previous stopwords are defined in EnglishAnalyzer.ENGLISH_STOP_WORDS_SET.

@fxthomas
Contributor

fxthomas commented Oct 1, 2019

Ah, I missed that it was never affecting Lucene at all (the "index" the field was mentioning was the left sidebar index, not the Lucene index). I don't have an issue with the changes if that is the case.

I found no issues so far on my server (was expecting none, but still it's nice to know).

@tesshucom tesshucom force-pushed the improve-search-accuracy branch 2 times, most recently from e762b06 to 4788992 Compare October 4, 2019 00:24
@tesshucom tesshucom changed the title from "[WIP] Improve search accuracy" to "Improve search accuracy" Oct 4, 2019
@tesshucom
Contributor Author

Thank you very much.
Removed los and la from the stopwords.

Language switching can be achieved by replacing the Analyzer and the stopwords, if the language is similar to English.
I think fxthomas could probably do that.
It is a little bothersome, but this is still progress, because the legacy implementation could not do it.

Implementing a simple switch may be possible, but it is impractical
because additional analyzers and data are required.

I saw in previous issues the opinion that plugins offer little benefit for the effort they take.
For this reason, there are currently no plans to follow this PR with language switching.

(Depending on the language, the additional logic and dictionaries can increase the WAR size by 30 MB.)

Contributor

@muff1nman muff1nman left a comment

One small nitpick, but otherwise looks fine to me. I agree with @fxthomas that there is future work that we can do here especially with regards to hardcoded stop words.

@tesshucom
Contributor Author

For example, it might be useful to load an additional stopword file if one exists in a specific directory.
(This could also be kept as an unofficial feature for a while, without announcing it to users.)
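
A rough sketch of that idea, under stated assumptions: the class and method names are hypothetical, and the file format (one word per line, blank lines and lines starting with # ignored) is assumed for illustration rather than taken from this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustration only: optional stopword override read from a directory.
final class StopwordOverrideSketch {

    // Returns the words from dir/fileName if that file exists, otherwise the given defaults.
    static Set<String> loadOrDefault(Path dir, String fileName, Set<String> defaults) {
        Path file = dir.resolve(fileName);
        if (!Files.isRegularFile(file)) {
            return defaults;
        }
        Set<String> words = new LinkedHashSet<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                // Assumed format: skip blank lines and '#' comment lines.
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> defaults = Set.of("a", "an", "the");
        // Hypothetical location; any directory would do.
        Path dir = Path.of(System.getProperty("user.home"), "stopword-overrides");
        System.out.println(loadOrDefault(dir, "stopwords.txt", defaults));
    }
}
```

Since stopwords are applied when the index is built, a changed override file would only take full effect after a rescan.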

Adding a setting to the screen may be a little odd.

  • It benefits some core users, while beginners get confused.
  • Stopwords are generally handled in files.

English alone needs only a few stopwords, but for some languages the list is huge.
Multilingual lists tend to be huge.
Given that the content gets complex, it is easier to manage as files.

In any case, I agree that the stopwords should be changeable in some way.
It is very useful to have a way to test which stopwords are suitable.

@tesshucom tesshucom force-pushed the improve-search-accuracy branch 2 times, most recently from bf08f3c to 63d39e2 Compare October 10, 2019 17:20
@tesshucom
Contributor Author

You can squash the two commits.
I will leave that to you.

In terms of content, I think they are slightly different modifications.
The boost value is small this time, but it can be increased a little more.
This value should be tuned to the purpose.

@fxthomas
Contributor

I do believe that this one is ready to merge as soon as conflicts are resolved?

@tesshucom
Contributor Author

Please note only the following point:

This change affects the index version. If it is released separately from the Lucene update, the index must be rebuilt.

At the time this PR was created, it was impossible to predict when it would be released.

@fxthomas
Contributor

Good point. If we release both Lucene upgrades and this PR at the same time, things should be fine without additional upgrades, am I right?

@muff1nman
Contributor

Ah shucks this should have gotten pulled into the previous release. I'm sorry I forgot about this one. Well this will need to bump the index version now.

@fxthomas
Contributor

fxthomas commented Dec 2, 2019

We just need to update INDEX_VERSION, right? If you are okay I think I can merge this one 😉
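
For readers unfamiliar with why a version bump matters: stopwords are applied when documents are indexed, so an index built with the old lists no longer matches what the new query analysis expects. One common way to force a rebuild is sketched below. It assumes the index directory name embeds the version; that layout, the directory names, and the constant's value are assumptions for illustration, not details confirmed in this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration only: a versioned index directory that triggers a rebuild after a bump.
final class IndexVersionSketch {

    // Hypothetical value; incremented whenever analysis (e.g. stopwords) changes.
    static final int INDEX_VERSION = 2;

    // Assumed layout: <home>/index<version>
    static Path indexDirectory(Path home, int version) {
        return home.resolve("index" + version);
    }

    // An index for this version has never been built if its directory is missing.
    static boolean needsRebuild(Path home, int version) {
        return !Files.isDirectory(indexDirectory(home, version));
    }

    public static void main(String[] args) {
        Path home = Path.of(System.getProperty("java.io.tmpdir"), "index-version-sketch");
        System.out.println("rebuild needed: " + needsRebuild(home, INDEX_VERSION));
    }
}
```

Under this scheme, shipping two index-affecting changes in separate releases means two version bumps and two full rescans, which is the cost discussed above.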

@tesshucom tesshucom force-pushed the improve-search-accuracy branch 2 times, most recently from b5439f9 to 858e7c3 Compare December 3, 2019 16:46
@tesshucom tesshucom force-pushed the improve-search-accuracy branch from cd312ac to dba8610 Compare December 3, 2019 17:09
@tesshucom
Contributor Author

After the rebase, the INDEX_VERSION update is merged into the stopword update.

@fxthomas
Contributor

fxthomas commented Dec 3, 2019

Thank you so much for the help, merging now!
