Support for the new Solr's character filters (Gustavo Fernandes)
Description
Activity

Gustavo Fernandes April 5, 2010 at 12:58 PM
Enriched patch with documentation changes, corrected styling, and a modified declaration order in @AnalyzerDef.

Sanne Grinovero April 1, 2010 at 11:37 PM
ah that makes a lot of sense

Gustavo Fernandes April 1, 2010 at 10:32 PM
CharFilters sit between the Reader and the Tokenizers [1], thus they are supposed to filter the stream produced by the reader before the tokenization.
For an illustration of how CharFilters are used in Solr, please refer to [2].
[1] http://issues.apache.org/jira/browse/LUCENE-1466
[2] http://issues.apache.org/jira/browse/SOLR-822
The order of application would be: first the charFilters in their declaration order, then all the tokenFilters, also in their own order. The @AnalyzerDef is probably better represented this way:
Thoughts?
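For illustration, the ordering described above (char filters on the raw text, then the tokenizer, then token filters) can be sketched in plain Java. This is a minimal sketch, not the attached patch and not Solr's TokenizerChain: `AnalysisChainSketch`, its fields, and the sample filters are all hypothetical names, and strings stand in for Lucene's Reader and TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an analysis chain; it does not use Lucene or Solr classes.
final class AnalysisChainSketch {
    private final List<UnaryOperator<String>> charFilters;        // applied to the raw text
    private final Function<String, List<String>> tokenizer;       // splits text into tokens
    private final List<UnaryOperator<List<String>>> tokenFilters; // applied to the token list

    AnalysisChainSketch(List<UnaryOperator<String>> charFilters,
                        Function<String, List<String>> tokenizer,
                        List<UnaryOperator<List<String>>> tokenFilters) {
        this.charFilters = new ArrayList<>(charFilters);
        this.tokenizer = tokenizer;
        this.tokenFilters = new ArrayList<>(tokenFilters);
    }

    List<String> analyze(String input) {
        String text = input;
        // 1. Char filters run first, in declaration order, before any tokenization.
        for (UnaryOperator<String> charFilter : charFilters) {
            text = charFilter.apply(text);
        }
        // 2. The tokenizer sees only the already-filtered text.
        List<String> tokens = tokenizer.apply(text);
        // 3. Token filters run last, in their own declaration order.
        for (UnaryOperator<List<String>> tokenFilter : tokenFilters) {
            tokens = tokenFilter.apply(tokens);
        }
        return tokens;
    }

    // Assumed sample configuration, chosen only to make the ordering visible.
    static AnalysisChainSketch example() {
        UnaryOperator<String> expandAmpersand = s -> s.replace("&", " and ");
        UnaryOperator<String> dropHyphen = s -> s.replace("-", "");
        Function<String, List<String>> whitespace = s -> Arrays.asList(s.trim().split("\\s+"));
        UnaryOperator<List<String>> lowerCase = tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        };
        return new AnalysisChainSketch(
                Arrays.asList(expandAmpersand, dropHyphen),
                whitespace,
                Collections.singletonList(lowerCase));
    }
}
```

With this sample configuration, `AnalysisChainSketch.example().analyze("Tom&Jerry E-Mail")` yields the tokens `tom`, `and`, `jerry`, `email`. The ampersand mapping changes where the token boundaries fall, which is something a token filter cannot do once tokenization has already happened.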

Sanne Grinovero April 1, 2010 at 10:08 AM
I assume there might be a need to define the order in which TokenFilter(s) and CharFilter(s) are applied?
Maybe filters should be made of type Object so that it could contain both types, although that is not nice for type safety and self-documentation.

Gustavo Fernandes April 1, 2010 at 1:29 AM
Attached is a patch to support Solr's CharStream. A new kind of filter factory was introduced to AnalyzerDef:
It is a new annotation, defined as:
That will allow the use of MappingCharFilters, as requested by users:
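The snippets from this comment are not preserved here. As a rough, self-contained illustration of the shape being described, the sketch below declares local stand-in annotations and reads them back by reflection. Everything in it is an assumption made for illustration: the annotations are declared locally rather than imported from Hibernate Search, factories are given as plain strings instead of Solr factory classes so the example compiles on its own, and the mapping file name is hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-ins for illustration only; these are NOT the Hibernate Search annotations.
@Retention(RetentionPolicy.RUNTIME)
@Target({})
@interface Parameter {
    String name();
    String value();
}

@Retention(RetentionPolicy.RUNTIME)
@Target({})
@interface CharFilterDef {
    String factory();                 // assumption: a name instead of a factory class
    Parameter[] params() default {};
}

@Retention(RetentionPolicy.RUNTIME)
@Target({})
@interface TokenizerDef {
    String factory();
}

@Retention(RetentionPolicy.RUNTIME)
@Target({})
@interface TokenFilterDef {
    String factory();
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface AnalyzerDef {
    String name();
    CharFilterDef[] charFilters() default {};   // applied before the tokenizer
    TokenizerDef tokenizer();
    TokenFilterDef[] filters() default {};      // applied after the tokenizer
}

// Hypothetical entity using a mapping char filter; the file name is made up.
@AnalyzerDef(
        name = "mappingAnalyzer",
        charFilters = @CharFilterDef(
                factory = "MappingCharFilterFactory",
                params = @Parameter(name = "mapping", value = "mapping-chars.properties")),
        tokenizer = @TokenizerDef(factory = "StandardTokenizerFactory"),
        filters = @TokenFilterDef(factory = "LowerCaseFilterFactory"))
class Article {
}

final class AnalyzerDefReader {
    // Lists the declared components in the order they would be applied.
    static String describe(Class<?> type) {
        AnalyzerDef def = type.getAnnotation(AnalyzerDef.class);
        StringBuilder sb = new StringBuilder(def.name()).append(':');
        for (CharFilterDef charFilter : def.charFilters()) {
            sb.append(" char=").append(charFilter.factory());
        }
        sb.append(" tokenizer=").append(def.tokenizer().factory());
        for (TokenFilterDef filter : def.filters()) {
            sb.append(" filter=").append(filter.factory());
        }
        return sb.toString();
    }
}
```

Calling `AnalyzerDefReader.describe(Article.class)` lists the char filter before the tokenizer and the token filter after it. Keeping char filters in a separate attribute preserves the type of each list, which avoids the need for a single untyped list of filters.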
Solr 1.4 introduced CharacterFilters [1], which are based on Lucene's CharStream. Those filters are currently incompatible with the annotation @TokenFilterDef, which accepts only TokenFilterFactories:
One idea is to keep the same annotation, "generalize" the token filter factory type in the annotation, and have SolrAnalyzerBuilder construct a TokenizerChain that accepts both types of filters [2].
[1] http://lucene.apache.org/solr/api/org/apache/solr/analysis/CharFilterFactory.html
[2] http://lucene.apache.org/solr/api/org/apache/solr/analysis/TokenizerChain.html
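The "generalize the factory type" idea can be sketched as a builder step that accepts one mixed list of factories and splits it by kind before assembling the chain. This is only a sketch under stated assumptions: `FilterFactorySketch`, `CharFilterFactorySketch`, `TokenFilterFactorySketch`, and `ChainPartitioner` are hypothetical stand-ins, not the Solr factory types or the actual SolrAnalyzerBuilder logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common supertype that a generalized annotation attribute could accept.
interface FilterFactorySketch {
    String name();
}

final class CharFilterFactorySketch implements FilterFactorySketch {
    private final String name;
    CharFilterFactorySketch(String name) { this.name = name; }
    public String name() { return name; }
}

final class TokenFilterFactorySketch implements FilterFactorySketch {
    private final String name;
    TokenFilterFactorySketch(String name) { this.name = name; }
    public String name() { return name; }
}

// Splits a mixed declaration list into the two groups a chain needs,
// preserving declaration order within each group.
final class ChainPartitioner {
    final List<CharFilterFactorySketch> charFilters = new ArrayList<>();
    final List<TokenFilterFactorySketch> tokenFilters = new ArrayList<>();

    ChainPartitioner(List<? extends FilterFactorySketch> declared) {
        for (FilterFactorySketch factory : declared) {
            if (factory instanceof CharFilterFactorySketch) {
                charFilters.add((CharFilterFactorySketch) factory);
            } else if (factory instanceof TokenFilterFactorySketch) {
                tokenFilters.add((TokenFilterFactorySketch) factory);
            } else {
                throw new IllegalArgumentException("Unsupported factory: " + factory.name());
            }
        }
    }
}
```

The trade-off is that a single mixed list defers the type check to runtime, whereas separate attributes for each kind of filter let the compiler enforce it.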