Index.UN_TOKENIZED overrides other tokenized fields that share the same name

Description

Marking one field as un-tokenized causes all other fields with the same name to be un-tokenized as well.

i.e.

The resulting behaviour is that "simple_search" will be made up of un-tokenized 'string' and 'string2' values, even though 'string2' was specified to be tokenized.

Environment

3.4.0 Final

Activity

Sanne Grinovero
October 27, 2011, 12:12 AM

OK, you convinced me of its usefulness. Now we just need to find the time to fix it. Any volunteers?

Hardy Ferentschik
November 10, 2011, 6:33 PM

After reviewing some code and digging a little deeper, I am not so sure that we can actually do something. The problem lies not so much on the Hibernate Search side as in the Lucene API.

On the Search side we actually keep the metadata per field and use the right option when building the document. See assertValuesAreIndexedWithDifferentAnalyzeSettings of this test

The problem is that the analysis step does not occur when the Document is built, but when the document is added to the index (see e.g. AddWorkDelegate).

Analyzers work per field name, and we have our own implementation, ScopedAnalyzer, which returns an analyzer depending on the field name. For a non-analyzed field we use PassThroughAnalyzer. To implement the described use case, our ScopedAnalyzer would have to return the PassThroughAnalyzer in one case and the StandardAnalyzer in the other, for the same field name. The field name alone does not carry enough information to make that distinction.
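To illustrate the limitation described above, here is a minimal, self-contained sketch of an analyzer lookup keyed only by field name. It is a simplified model, not the Hibernate Search or Lucene API: the class ScopedAnalyzerSketch and the two stand-in analyzers are hypothetical names introduced for illustration. In this sketch the last registration wins; the report observes that in practice the un-tokenized setting wins.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified model of a name-scoped analyzer lookup.
// It does not use or reproduce the real ScopedAnalyzer implementation.
class ScopedAnalyzerSketch {

    // Stand-in for a pass-through analyzer: the whole value is one token.
    static final Function<String, List<String>> PASS_THROUGH =
            value -> Collections.singletonList(value);

    // Stand-in for a tokenizing analyzer: lower-case, split on whitespace.
    static final Function<String, List<String>> TOKENIZING =
            value -> Arrays.asList(value.toLowerCase().split("\\s+"));

    // Exactly one analyzer can be stored per field name.
    private final Map<String, Function<String, List<String>>> byFieldName =
            new HashMap<>();

    void register(String fieldName, Function<String, List<String>> analyzer) {
        // A second registration under the same name replaces the first.
        byFieldName.put(fieldName, analyzer);
    }

    List<String> analyze(String fieldName, String value) {
        return byFieldName.get(fieldName).apply(value);
    }

    public static void main(String[] args) {
        ScopedAnalyzerSketch scoped = new ScopedAnalyzerSketch();
        // 'string2' asks for tokenization, 'string' asks for none,
        // but both are indexed under the name "simple_search".
        scoped.register("simple_search", TOKENIZING);
        scoped.register("simple_search", PASS_THROUGH);

        // The value that was meant to be tokenized is kept as one token.
        System.out.println(scoped.analyze("simple_search", "Hello World"));
        // prints [Hello World]
    }
}
```

Because the lookup receives only the field name, nothing at analysis time can tell the two contributing properties apart, which is why a warning is the most that can be offered without changing where analysis happens.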

I think we are better off logging a warning or throwing an exception. Or does anyone have a better idea?

Sanne Grinovero
November 10, 2011, 6:48 PM

Right. Your explanation makes it sound like a limitation of how we do it, but it's just a Lucene API limitation; sorry for not thinking about that right away.

I'd vote for the warning to be logged for now.

In the long term, we could actually work around Lucene's limitation by implementing pre-index tokenization, something which is on my wish list to improve clustering.
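As a rough sketch of what pre-index tokenization could mean, the example below attaches already-computed tokens to each field instance, so that two fields with the same name can carry different analysis results. This is only a simplified model under stated assumptions: PreTokenizedFieldSketch, Field, unTokenized, tokenized and indexedTerms are hypothetical names, and nothing here reflects how Hibernate Search or Lucene would actually implement it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model: tokens are computed when the document is built,
// so no lookup by field name is needed at indexing time.
class PreTokenizedFieldSketch {

    // A field instance that carries its own, already-computed tokens.
    static final class Field {
        final String name;
        final List<String> tokens;

        Field(String name, List<String> tokens) {
            this.name = name;
            this.tokens = tokens;
        }
    }

    // Un-tokenized: the whole value becomes a single token.
    static Field unTokenized(String name, String value) {
        return new Field(name, Collections.singletonList(value));
    }

    // Tokenized: lower-case the value and split it on whitespace.
    static Field tokenized(String name, String value) {
        return new Field(name, Arrays.asList(value.toLowerCase().split("\\s+")));
    }

    // Collects the terms that would be indexed under the given field name.
    static List<String> indexedTerms(List<Field> document, String fieldName) {
        List<String> terms = new ArrayList<>();
        for (Field field : document) {
            if (field.name.equals(fieldName)) {
                terms.addAll(field.tokens);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Field> document = Arrays.asList(
                unTokenized("simple_search", "Foo Bar"),  // from 'string'
                tokenized("simple_search", "Baz Qux"));   // from 'string2'

        System.out.println(indexedTerms(document, "simple_search"));
        // prints [Foo Bar, baz, qux]
    }
}
```

The design choice being illustrated is that the analysis decision travels with the field instance rather than with the field name. The sketch says nothing about the performance cost of doing this outside Lucene.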

Hardy Ferentschik
November 10, 2011, 9:52 PM

Right, let's go for a warning for now.

I like the idea of taking care of the analysis step ourselves. As you say, we could do this earlier, similar to what we do in Analyzer utils. It would also make dynamic analyzer selection easier. Do you see any drawback in doing it ourselves?

Nevertheless, for now let's stick with what we have and add a warning.

Sanne Grinovero
November 10, 2011, 11:56 PM

"Do you see any drawback in doing it ourselves?"

Lucene can do some crazy performance optimizations, for example (but not only) by skipping String instance generation. We should make sure we don't lose such benefits, so it's not trivial, and it introduces even more maintenance for every Lucene version change.

Fixed

Assignee

Hardy Ferentschik

Reporter

John-Michael Au

Suitable for new contributors

Yes, likely

Pull Request

None

Feedback Requested

None

Priority

Major