Skip to content
This repository was archived by the owner on Sep 8, 2021. It is now read-only.

Split SearchService #1130

Closed
wants to merge 15 commits into from
Closed

Conversation

tesshucom
Copy link
Contributor

@tesshucom tesshucom commented Jun 15, 2019

This is a draft of SearchService's split refactoring related to #1113.
I divided the class to withstand Lucene update and added the test cases to the high importance class.

The relationship between the divided classes is as follows.
(Filter and Tokenizer may be added after updating)

The granularity of the division prioritizes that differences in test results can be easily seen during Lucene updates, and that improvements can easily be made individually after the update.


class


Additional point of view

  • SearchService is interfaced and holds the previous methods completely.
  • The following is written in the SearchService comment: Performs Lucene-based searching and indexing.
  • Now, startIndexing index stopIndex can be deleted from SearchService. MediaScannerService refers to SearchService, but can be replaced with IndexManager (now, IndexManager is a deligate in SearchService).
  • This means that the SearchService design is completely split up in search and indexing.

It is not included in PR because it is not related to Lucene's update. If it is desirable to make a change, will commit at the same time.

Is there a cool name for SearchServiceTermination ...?

Test cases for class division and version upgrade.
Extract interface of SearchService.
And extract 2 classes from SearchService

 - Extract enum class of IndexType
 - Differentiate DocumentFactory from IndexType

As the package of IndexType changes,
fixed the class that referred to IndexType.
And correct the service reference setting of SonosHelper.
Because SonosHelper refers to the service entity, not auto inject.

DocumentFactory is heavily influenced by Lucene version.
These two classes have a major impact on index file management.
It is desirable to rebuild the index file
 at the time of design change due to requirements
 or implementation change of Lucene version upgrade.
Create a factory class of Analyzer.
Processing did not change, and test cases
for current status observation were added.
Create a factory class of queries.
Processing did not change, and test cases
for current status observation were added.
The lock mechanism is left in SearchService
to cope with the transition period.
After the update, it is unnecessary
because the framework manages locks.

The present condition is only extraction of processing.
There are many improvements in this class after the update.
Moved a lot of redundant scripts into one class.
Judgment processing specific to SearchService, conversion of Object,
refilling processing to list, etc.

The random function is refactored.
The list copy has been changed to a partial copy due to overprocessing.
This will eliminate the dependency on guava.

On the other hand, explicit declarations were added to pom
to use CollectionUtils for addIgnoreNull.
analyzer = new CustomAnalyzer();
} catch (Exception e) {
/*
* It is a transient description, and it is appropriate to detect an IOException at Lucene
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant "temporary" :)

Copy link
Contributor Author

@tesshucom tesshucom Jun 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right.
Analyzer initialization process after update needs to catch IOException.
I removed it because it is confusing.


public static final class FieldNames {

private FieldNames() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Contributor Author

@tesshucom tesshucom Jun 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FieldNames can be package access.
It moved to another new class and deleted from enum.

Because the Id3 method was overloaded with createDocument.
Fixed in createAlbumId3Document and createArtistId3Document.
@tesshucom tesshucom force-pushed the separate-search-serv branch from 82ff245 to 36b704f Compare June 15, 2019 17:06
Change index field name definition to package access.
@tesshucom tesshucom force-pushed the separate-search-serv branch from 2943597 to 9672421 Compare June 15, 2019 17:36
Remove exception processing that is currently unnecessary.
Copy link
Contributor

@jvoisin jvoisin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to add a PDF file :)

@tesshucom
Copy link
Contributor Author

I erased.

@tesshucom
Copy link
Contributor Author

Currently Airsonic has a reproducible Path search bug.
The content is very related to this PR ... what next?
Do you want to handle here or issue ?

@jvoisin
Copy link
Contributor

jvoisin commented Jun 17, 2019

I would prefer to have it in a new issue :)

@tesshucom
Copy link
Contributor Author

If you scan with a specific pass pattern, the data is corrupted and incorrect search.
(I thought it was related to this PR, but it might not be related ...)
I'll Investigate a little more the playback procedure and then write a new issue.

@tesshucom
Copy link
Contributor Author

tesshucom commented Jun 18, 2019

As for the scan, it seems that my confirmation was missing.
It used to be confused with the problem of nesting of directories with reported examples and children not being updated.

I wrote a new issue for shuffle false search.

@tesshucom
Copy link
Contributor Author

tesshucom commented Jun 20, 2019

Earlier, I wrote that my clones contained some spec changes.
That is the problem above.
Folders are excluded from queries, and path comparisons exact-match.
This is a simpler solution as Japanese are traditionally plagued by native language pass strings.
There is a bug in path comparison, and folder specification is only a factor of false search.

Doing the same with Airsonic means that removes folder condition from query.
If legacy should be respected, There is a alternative proposal such as :

  • update as it is
  • change the analysis of the FOLDER field to PathHierarchyTokenizer or a new dedicated PathTokenizer.

The most reliable and robust is to modify the logic so that the path comparison will be done by exact-match.
In my clone, I deleted the folder from the search condition, but can keep this condition by using another index.

In such a design, boost value(current dead code) allows for score adjustments.
I think the first design idea was like that.

Currently legacy code is implemented to reduce index size, in exchange for inaccuracies.
This is a tricky approach that is difficult to understand without careful reading.

In any case there is a better solution.
The benefits of Lucene's version upgrade outweigh the slight increase in processing.
(Memory improvement, index compression rate, search speed improvement, etc.)
If that is determined to be necessary, it is good to change to a better process.

@tesshucom tesshucom changed the title WIP:Split SearchService Split SearchService Jun 20, 2019
@tesshucom tesshucom mentioned this pull request Jun 23, 2019
@jvoisin
Copy link
Contributor

jvoisin commented Jun 24, 2019

I like the idea of changing the analysis of the FOLDER field to PathHierarchyTokenizer or a new dedicated PathTokenizer.

Worse case, increasing the size of the index is completely acceptable in my opinion.

@tesshucom
Copy link
Contributor Author

tesshucom commented Jun 29, 2019

It solved by doing some experiments.

In my clone, I wrote that the folder contains specification changes.
Not SpanOr but Or.
When I verified this accurately, the current SpanOr implementation can not be migrated to 7.x due to technical limitations.
(Available with the expert option. . .)

Legacy saves the path as follows:
In 7.x, SpanOr can not be used for NO_NORMS.

doc.add(new Field(FieldNames.FOLDER, mediaFile.getFolder(), Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS

When Path is processed with Single Token by KeywordAnalyzer instead of NO_NORMS, Spanor can be used.
(Of course it is an undesirable implementation. For verification.)
In this case, if there are Music Folders such as Music, Music2 and Music3, misanalysis occurs.
For example, shuffle search will pick up the mess.
The result that can be observed by the user is a bug close to #1139.
But the cause is a potential defect different from #1139.


There is another legacy mysterious behavior.

"+((art:abc* f:abc*) (art:def* f:def*)) +spanOr([f:" + PATH1 + "])",

The path of Music Folder is specified in the main query, and comparison with the input value is performed.

I think this is not a meaningful specification.
(I do not understand unless I ask the author)

  • Folder is specified as NO_NORMS at the time of document generation. At least not related to scoring
  • However, since Folder is explicitly added to IndexType, it is analyzed by MultiFieldQueryParser. Due to the input is affected evil by analysis by MultiFieldQueryParser and there is no substantial meaning as a search of Folder

The correspondence method is either of the following:

  • The intention is unknown but to be implement as it is (It does not function as a proper search condition).
  • To be omit it because it causes false search.
  • Direct search of music folders is considered as a requirement for user, so to be fix.

In the case of the third policy, search will be realized using PathHierarchyTokenizer, KeywordTokenizer and PathTokenizer etc...


Besides replacing the simple API syntax, rewriting SpanOr to Or is a condition for updating.
UAX#29 needs some confirmation about the handling of the underbar.

I can create a PR keeping this in mind.

@jvoisin
Copy link
Contributor

jvoisin commented Jun 30, 2019

It's ok to break retro-compatibility: nobody is going to complain that the search is working way better :D

@tesshucom
Copy link
Contributor Author

tesshucom commented Jun 30, 2019

I will create a PR from this.

If you want to increase the variation of the test and apply it after and before the update, it is useful if the PR so far is merged into the master or the verification branch.
If you want to make a decision on discard including push after this, I will push on this PR.
(Which one is better?)


If you want to postpone the decision to discard , there are also ideas to improve the Reader and include it in the evaluation..

The overall performance by the upgrade tends to improve, but the search overhead time increases by about 1 ms.
It will be improved by reusing the reader.

Reader is open every search and not closed 😣
it is recommended to fix.
I'll push in succession if necessary.


MediaScannerService#
deleteOldIndexFiles() probably does not work.

It is a mechanism to change the generation of an existing index.
Database schemas also have generation management, but they can be thought of as similar roles.

This does not have to be immediate, but a fix in the future is ​​desirable.

@jvoisin jvoisin requested review from muff1nman and jooola June 30, 2019 14:03
@jvoisin
Copy link
Contributor

jvoisin commented Jun 30, 2019

I'm ok with merging this PR so you can work on top of the master, since it's making the code a bit more clear: everyone prefers to review small PR, instead of massive ones.

But I would like that at least someone else review this PR, since it's moving a lot of things around.

Please close the Reader and fix deleteOldIndexFiles method is possible :)

@tesshucom
Copy link
Contributor Author

For the time being, i will proceed locally.
Once the policy is decided, I'll PR it to the desired location.

I hope the tests are done as thoroughly as possible.
I will do my best as much as possible, but it would be better if different people would be involved.

* Enum that symbolizes the field name used for lucene index.
* This class is a division of what was once part of SearchService and added functionality.
*/
class FieldNames {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason for the changing of the field names?

Copy link
Contributor Author

@tesshucom tesshucom Jul 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value has been changed simply to shorten the query.
(I just want to reduce the line breaks in assert code in QueryFactoryTestCase.
I'll restore it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised that according to your advice.

* @return result
* @throws IOException
*/
private final <D> List<D> createRandomFiles(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method needs to be renamed. How about createRandomDocsList?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised that according to your advice.

* so do not include exception handling in this class.
*/
@Component
public class SearchServiceTermination {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely needs a different name. It seems this class is mainly a holder for various utility functions related to searching. A generic name therefore might be SearchServiceUtilities?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised that according to your advice.

@@ -0,0 +1,574 @@

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use 4 space indentation. @jvoisin please add this to the list in #1121

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revised that according to your advice.

Copy link
Contributor

@muff1nman muff1nman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this seems quite clean and is mainly just moving things around which is a great first step here. In fact the only functional change I can find is the field names. Can we keep the field names the same for this PR so that testing wise we can be certain there should be no functional changes? We can do the field name changes in a separate PR and would help us test that this PR keeps identical behavior.

Other than that I added some comments about naming.

@muff1nman muff1nman removed the request for review from jooola July 1, 2019 17:48
@tesshucom
Copy link
Contributor Author

I`ll correct the point you pointed out.

There are currently inevitable CVE issues in Airsonic.
I may not be able to respond immediately.

spring-security-core-4.2.12.RELEASE.jar
(cpe:/a:pivotal_software:spring_security:4.2.12,
 org.springframework.security:spring-security-core:4.2.12.RELEASE) : CVE-2019-11272

spring-security-ldap-4.2.12.RELEASE.jar
(org.springframework.security:spring-security-ldap:4.2.12.RELEASE,
 cpe:/a:pivotal_software:spring_security:4.2.12,
 cpe:/a:pivotal_software:spring-ldap:4.2.12) : CVE-2019-11272

tomcat-embed-core-8.5.40.jar
(org.apache.tomcat.embed:tomcat-embed-core:8.5.40,
 cpe:/a:apache_software_foundation:tomcat:8.5.40,
 cpe:/a:apache_tomcat:apache_tomcat:8.5.40,
 cpe:/a:apache:tomcat:8.5.40) : CVE-2019-10072

jackson-databind-2.9.9.jar
 (cpe:/a:fasterxml:jackson:2.9.9,
  cpe:/a:fasterxml:jackson-databind:2.9.9,
  com.fasterxml.jackson.core:jackson-databind:2.9.9) : CVE-2019-12814

Jdom2.x exists in Airsonic classpath.

$ mvn dependency:tree | grep -i jdom
[INFO] +- org.jdom:jdom:jar:2.0.2:compile

In this case, jackson-databind causes a security issue.
There is no effective countermeasure at present.

The 2.9.9.1 or 2.10.0 release of jackson-databind is planned and is pending for several days.
It may be easier if this is released.

@muff1nman
Copy link
Contributor

@tesshucom id rebase with master. those cve issues should be fixed there.

@tesshucom
Copy link
Contributor Author

I understand.
It was already solved in upstream!

@muff1nman
Copy link
Contributor

Looks good. Can you rebase? (I can too if you want).

@tesshucom
Copy link
Contributor Author

I'll leave it to you which granularity of rebase is appropriate!

@muff1nman
Copy link
Contributor

Rebased in 767b39e

@muff1nman muff1nman closed this Jul 2, 2019
@tesshucom tesshucom deleted the separate-search-serv branch July 3, 2019 20:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants