Improve search accuracy #1235

tesshucom · 2019-09-15T15:35:42Z

Related to #1142.
This PR proposes the following two improvements.

Setting boost values. The legacy code anticipated boosts, but that code was dead.
Airsonic-specific stopwords for music search.

In a search system, these are usually considered at design time.
The current search doesn't take into account the convenience of users who type short phrases.

Here is a simple example.

Stopword change example

The word "will" cannot be used in legacy searches, because it is treated as a stopword.

Example of boost value adjustment

It is reasonable to give priority to the leftmost item in the search results.
Because the boost values are very small, the priority can still be reversed when the score for the artist name is high.

before

after
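
As a rough illustration of the boost idea (a toy model, not Lucene's scoring formula and not the code in this PR), the sketch below multiplies a raw relevance value by a small per-field boost. The class name, field names, and boost values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of query-time field boosts; not Lucene's scoring formula. */
final class BoostSketch {

    // Hypothetical boost values: small, so they mostly act as tie-breakers.
    static final Map<String, Double> BOOSTS = new LinkedHashMap<>();
    static {
        BOOSTS.put("artist", 1.2);
        BOOSTS.put("album", 1.1);
        BOOSTS.put("title", 1.0);
    }

    /** Final score = raw relevance of the match * boost of the matched field. */
    static double score(String field, double rawRelevance) {
        return rawRelevance * BOOSTS.getOrDefault(field, 1.0);
    }

    public static void main(String[] args) {
        // Equal raw relevance: the boosted field ranks first.
        System.out.println(score("artist", 1.0) > score("title", 1.0)); // true
        // A much stronger match in an unboosted field still wins.
        System.out.println(score("title", 2.0) > score("artist", 1.0)); // true
    }
}
```

With boosts this small, they only decide the order between matches of similar relevance, which is the trade-off described above.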

The contents of the stopword list need brainstorming.
(Opinions from native English speakers are especially needed, because I am Japanese.)

Once this is done, maintenance is easy and anyone can do it. For example, if you add a composer, you can give it a new priority.
Stopwords are assigned differently for the Artist field and for other fields, because the words you want to ignore, such as "feat" and "with", are a little different.
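
A minimal sketch of per-field stopword filtering, assuming a simple tokenizer that splits on non-alphanumeric characters. The three sets are small illustrative subsets chosen for this example (the `LEGACY` words are all in Lucene's default English stop set); they are not the lists shipped in this PR, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stopword filter with a different word list per field. */
final class StopwordSketch {

    // Illustrative subsets only; assumptions for this example.
    static final Set<String> LEGACY = new HashSet<>(Arrays.asList("a", "the", "of", "will", "with"));
    static final Set<String> GENERAL = new HashSet<>(Arrays.asList("a", "the", "of"));
    static final Set<String> ARTIST = new HashSet<>(Arrays.asList("a", "the", "of", "feat", "with"));

    /** Lowercases the text, splits on non-alphanumerics, and drops stopwords. */
    static List<String> tokens(String text, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Will Smith", LEGACY));     // [smith]
        System.out.println(tokens("Will Smith", ARTIST));     // [will, smith]
        System.out.println(tokens("Jay feat. Kay", ARTIST));  // [jay, kay]
        System.out.println(tokens("Jay feat. Kay", GENERAL)); // [jay, feat, kay]
    }
}
```

The first two lines show why a generic English list is a poor fit for music search: with "will" as a stopword, only "smith" remains to match on.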

@muff1nman

Although not required, there is one consideration. These changes affect the index version, so it would suit users best to release this PR at the same time as the Lucene update (if the releases are split, the index must be rebuilt again). This PR does not include the index version increment.

fxthomas · 2019-09-25T20:16:22Z

A few questions about how this works:

Is it possible to not ignore stopwords but leave them at a low priority? This way one can still search for a stop word if we really want it. How do you think this would interfere with the accuracy?
How consistent is your PR with the "articles to ignore" field in Settings > General that specifically mentions that these articles will be ignored in the index? Is that setting still relevant?
Edge case related to the setting: depending on the language, some stop words might be different. For example los in German means either "go!" or "without" and should probably not be ignored if your collection consists mostly of German artists/albums. That's the reason why I like the idea of keeping it configurable.

Otherwise I like what your screenshots show, it sounds reasonable to boost the priority of the artist field in the Artist column. Great job!

tesshucom · 2019-09-27T15:12:41Z

Is it possible to not ignore stopwords but leave them at a low priority? This way one can still search for a stop word if we really want it. How do you think this would interfere with the accuracy?

We can create logic that lowers the boost for specific words.
That is technically possible.

I have never seen it done, and it is probably not common, because it complicates the score calculation.

Even if the boost value is lowered, stopwords still produce hits, so false matches increase when a song with a long title is entered.
If there are more noise words than important keywords, this has a negative impact on the score.

As a result, it becomes difficult to display results in the order the user wants.

How consistent is your PR with the "articles to ignore" field in Settings > General that specifically mentions that these articles will be ignored in the index? Is that setting still relevant?

The stopwords are statically defined in the following files:

stopwords.txt
stopwords_artist.txt

It can be changed dynamically, but the following points need to be considered.

Complex operation (full scan required)
Stopwords are part of the Analyzer mechanism and depend especially strongly on the Tokenizer.
(In other words, this setting value alone cannot provide complete multilingual support.)

Generally, stopwords are managed per file, such as stopwords.txt, as is done in this PR.
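
For illustration, a file-based list could be read roughly as follows. This is a sketch under assumptions: one word per line, blank lines skipped, and lines starting with `#` treated as comments. The class name and the file format are hypothetical, not taken from this PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Reads a stopword list: one word per line, '#' starts a comment line. */
final class StopwordFileSketch {

    static Set<String> load(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                // Skip blank lines and comment lines.
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "# articles\nthe\na\n\n# music-specific\nfeat\n";
        System.out.println(load(new StringReader(sample))); // [the, a, feat]
    }
}
```

Because the list is baked into the index at analysis time, changing the file only takes effect after the index is rebuilt.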

Edge case related to the setting: depending on the language, some stop words might be different. For example los in German means either "go!" or "without" and should probably not be ignored if your collection consists mostly of German artists/albums. That's the reason why I like the idea of keeping it configurable.

I'll incorporate that.

I would be happy to collect feedback in the following form:

Candidates to remove from the default stopwords

los -> to be removed, because it causes problems for German.

la -> as a stopword, an artist called "la la" or a song called "la la la" cannot be found.

Candidates to add

None for now

Considered adding, but did not

x -> "x" often denotes a collaboration, but as a stopword an artist named X could not be found.

fxthomas · 2019-09-30T20:45:28Z

Complex operation (full scan required)

That's correct, although we could automatically start a rescan if that happens.

Stopwords are part of the Analyzer mechanism and depend especially strongly on the Tokenizer. (In other words, this setting value alone cannot provide complete multilingual support.)

You're right on this. If we want to implement this we should probably add a Tokenizer setting in addition to the stop words, but not in this PR.

I will pull & test, will let you know if everything looks good for me.

EDIT: A couple of searches are already much, much better. I like the improvements!

Thinking about this we should probably remove the "articles to ignore" field from the UI if it's not used anymore.

tesshucom · 2019-09-30T23:11:34Z

Currently, "articles to ignore" is a setting for the index in the domain sense.
(It is used when creating the index on the left side of the screen, to remove the leading article and classify each entry properly.)

"articles to ignore" is part of the domain design and is also an output item of the REST API.
So it probably cannot be deleted.

The search stopwords are not directly related to the domain design.
The legacy implementation uses Lucene's defaults, which are not visible in the Airsonic code.
The previous stopwords are defined in EnglishAnalyzer.ENGLISH_STOP_WORDS_SET.

fxthomas · 2019-10-01T20:30:58Z

Ah, I missed that it was never affecting Lucene at all (the "index" the field was mentioning was the left sidebar index, not the Lucene index). I don't have an issue with the changes if that is the case.

I found no issues so far on my server (was expecting none, but still it's nice to know).

tesshucom · 2019-10-04T00:57:42Z

Thank you very much.
Removed los and la from the stopwords.

Language switching can be achieved by replacing the Analyzer and the stopwords, if the language is similar to English.
I think fxthomas can probably do that.
It is a little annoying, but this is still progress, because the legacy code could not do it.

Implementing a simple switch may be possible but impractical
because additional analyzers and data are required.

I saw in the previous issues that "plugins have little merit to hard work".
For this reason, there are currently no plans for language switching as a successor to this PR.

(Depending on the language, additional logic and dictionaries can increase the WAR size by 30 MB.)

muff1nman

One small nitpick, but otherwise looks fine to me. I agree with @fxthomas that there is future work that we can do here especially with regards to hardcoded stop words.

airsonic-main/src/main/java/org/airsonic/player/service/search/AnalyzerFactory.java

tesshucom · 2019-10-10T16:54:39Z

For example, it might be useful to load an additional stopword file if one exists in a specific directory.
(It could also be kept as an unofficial feature for a while, without being announced to users.)

Adding a setting to the screen may be a little strange.

It benefits some core users, while beginners get confused.
Stopwords are generally handled in files.

English alone has few stopwords, but the list can be huge depending on the language.
Multilingual lists tend to be huge.
Assuming the content is complex, it is easier to manage in files.

In any case, I agree with the idea that the stopwords should be changeable in some way.
It is very useful to have a way to test which stopwords are suitable.

tesshucom · 2019-10-10T17:34:49Z

You can squash the two commits.
I will leave that to you.

In terms of content, I think they are slightly different modifications.
The boost values are small this time, but they can be increased a little more.
These values should be adjusted according to the purpose.

fxthomas · 2019-10-20T13:35:18Z

I do believe that this one is ready to merge as soon as conflicts are resolved?

tesshucom · 2019-10-20T16:04:27Z

Please note only the following point:

These changes affect the index version. If this release and the Lucene update release are split, the index must be rebuilt again.

At the time this PR was created, it was impossible to predict when this PR would be released.

fxthomas · 2019-10-20T16:33:34Z

Good point. If we release both Lucene upgrades and this PR at the same time, things should be fine without additional upgrades, am I right?

muff1nman · 2019-11-12T06:18:47Z

Ah shucks this should have gotten pulled into the previous release. I'm sorry I forgot about this one. Well this will need to bump the index version now.

fxthomas · 2019-12-02T20:43:27Z

We just need to update INDEX_VERSION, right? If you are okay I think I can merge this one 😉

tesshucom · 2019-12-03T17:14:16Z

After the rebase, the INDEX_VERSION update was merged into the stopword commit.
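
To illustrate why the version bump matters, here is a hedged sketch of one common approach: the index directory name carries the version, so an index written by an older version is detected as stale and rebuilt. The constant value, the directory naming scheme, and the method names are assumptions for this example, not a description of Airsonic's actual code.

```java
/** Toy model of versioned index directories; names and values are hypothetical. */
final class IndexVersionSketch {

    // Hypothetical value; incremented whenever the analyzers or stopwords change.
    static final int INDEX_VERSION = 2;

    /** Directory name for the index format this build writes. */
    static String currentDirName() {
        return "index" + INDEX_VERSION;
    }

    /** An index directory is stale when it was written for another version. */
    static boolean isStale(String dirName) {
        return dirName.startsWith("index") && !dirName.equals(currentDirName());
    }

    public static void main(String[] args) {
        System.out.println(isStale("index1")); // true: old format, rebuild needed
        System.out.println(isStale("index2")); // false: current format
    }
}
```

Releasing two index-affecting changes separately means two such bumps, and therefore two full rescans for users.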

fxthomas · 2019-12-03T20:17:36Z

Thank you so much for the help, merging now!

eharris added the in: search/sort and usability labels Sep 16, 2019

tesshucom force-pushed the improve-search-accuracy branch from aec5e75 to d069947 Compare September 24, 2019 13:41

tesshucom force-pushed the improve-search-accuracy branch 2 times, most recently from e762b06 to 4788992 Compare October 4, 2019 00:24

tesshucom changed the title ~~[WIP] Improve search accuracy~~ Improve search accuracy Oct 4, 2019

muff1nman suggested changes Oct 6, 2019

airsonic-main/src/main/java/org/airsonic/player/service/search/AnalyzerFactory.java (outdated)

tesshucom force-pushed the improve-search-accuracy branch 2 times, most recently from bf08f3c to 63d39e2 Compare October 10, 2019 17:20

fxthomas approved these changes Oct 20, 2019

jvoisin approved these changes Oct 24, 2019

Apply boost values to search queries

bb464f1

tesshucom force-pushed the improve-search-accuracy branch 2 times, most recently from b5439f9 to 858e7c3 Compare December 3, 2019 16:46

Apply stopwords dedicated to music search

dba8610

- Iterate index version.

tesshucom force-pushed the improve-search-accuracy branch from cd312ac to dba8610 Compare December 3, 2019 17:09

fxthomas merged commit 2d30a37 into airsonic:master Dec 3, 2019

tesshucom deleted the improve-search-accuracy branch December 18, 2019 07:15

randomnicode mentioned this pull request Mar 18, 2020

Improve Lucene Search and upgrade Lucene (8.2.0 -> 8.4.1) airsonic-advanced/airsonic-advanced#140

Merged
