
Creating a Custom Language Analyzer for Umbraco 8

To give a little bit of background to this topic, recently we had a customer who was having problems with search functionality involving special characters.

As a specific example, when a user searched for a word that included 'æ', they got no results, even though many pages had that word in the values of indexed fields. This is because the standard index creator uses the standard analyzer, which flattens special characters when the indexes are created; in this case, 'æ' is tokenized as 'ae'.

So we had two options:

We could either use a different language analyzer for indexing, so that special characters wouldn't be flattened, or sanitize the query before the actual search happens.

To achieve the first option, you can use either the standard index creator with a custom analyzer, or a custom index creator with either a standard or a custom analyzer.

We decided to use a different analyzer with the default index creator. It would make search work better for this language by ignoring filler words (the, and, it, etc.) and word endings (searching for "searching" can now return results containing the word "search"), so it was the safest option for us.

If only it were V Next :)

With Examine V2 in Umbraco V9+, there are plenty of language analyzers that you can easily pick from and swap between. With Examine V1.2 there are not as many, and there wasn't one I could use, so I had to create my own custom language analyzer for the task.

Normally it doesn't have to be an analyzer for the same language, as long as the accented characters match. That wouldn't improve the search functionality, but it could have fixed the bug. So I could have borrowed another language's analyzer for the task, but no luck there either. 🤷‍♀️

What’s out there 👀

I dove deep into the web to find a proper, suitable, easy-to-understand solution for this task. But there is a lot of noise, with bits and bobs of information scattered across the Our Umbraco forum, Stack Overflow, and many personal blogs.

Nothing did the job, yet all of it was useful to read.

So I decided to write an article as a future reference for myself, in the hope that it will be useful to someone else too.

In this article, we will create a custom Turkish language analyzer for Umbraco 8 with Examine 1.2, and use it with both the standard and a custom index creator.

First, let's look at what I did for the task and why.

Lucene.Net.Analysis.Common has many different language analyzers, but it's only available for .NET Core, in the 4.8.0-beta00016 release. I needed to use Lucene 3.0.3, so it wasn't an option for me yet. Instead, I studied an analyzer that Lucene.Net.Analysis ships with on .NET Framework to understand what I needed to add and/or modify to create a custom one.

Here's an analyzer example from .NET Framework, Lucene 3.0.3:

using System;
using System.Collections.Generic;
using System.IO;
using System.Collections;
using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis;
using Version = Lucene.Net.Util.Version;

namespace Lucene.Net.Analysis.De
{
    public class GermanAnalyzer : Analyzer
    {
        private static readonly String[] GERMAN_STOP_WORDS =
        {
            "einer", "eine", "eines", "einem", "einen",
            "der", "die", "das", "dass", "daß",
            "du", "er", "sie", "es",
            "was", "wer", "wie", "wir",
            "und", "oder", "ohne", "mit",
            "am", "im", "in", "aus", "auf",
            "ist", "sein", "war", "wird",
            "ihr", "ihre", "ihres",
            "als", "für", "von",
            "dich", "dir", "mich", "mir",
            "mein", "kein",
            "durch", "wegen"
        };

        public static ISet<string> GetDefaultStopSet()
        {
            return DefaultSetHolder.DEFAULT_SET;
        }

        private static class DefaultSetHolder
        {
            internal static readonly ISet<string> DEFAULT_SET = CharArraySet.UnmodifiableSet(
                new CharArraySet((IEnumerable<string>)GERMAN_STOP_WORDS, false));
        }

        private ISet<string> stopSet;

        private ISet<string> exclusionSet;

        private Version matchVersion;
        
        private readonly bool _normalizeDin2;

        [Obsolete("Use GermanAnalyzer(Version) instead")]
        public GermanAnalyzer()
            : this(Version.LUCENE_CURRENT)
        {
        }

        public GermanAnalyzer(Version matchVersion)
            : this(matchVersion, DefaultSetHolder.DEFAULT_SET)
        { }

        public GermanAnalyzer(Version matchVersion, bool normalizeDin2)
            : this(matchVersion, DefaultSetHolder.DEFAULT_SET, normalizeDin2)
        { }

        public GermanAnalyzer(Version matchVersion, ISet<string> stopwords)
            : this(matchVersion, stopwords, CharArraySet.EMPTY_SET)
        {
        }

        public GermanAnalyzer(Version matchVersion, ISet<string> stopwords, bool normalizeDin2)
            : this(matchVersion, stopwords, CharArraySet.EMPTY_SET, normalizeDin2)
        {
        }

        public GermanAnalyzer(Version matchVersion, ISet<string> stopwords, ISet<string> stemExclusionSet)
            : this(matchVersion, stopwords, stemExclusionSet, false)
        { }


        public GermanAnalyzer(Version matchVersion, ISet<string> stopwords, ISet<string> stemExclusionSet, bool normalizeDin2)
        {
            stopSet = CharArraySet.UnmodifiableSet(CharArraySet.Copy(stopwords));
            exclusionSet = CharArraySet.UnmodifiableSet(CharArraySet.Copy(stemExclusionSet));
            this.matchVersion = matchVersion;
            _normalizeDin2 = normalizeDin2;
            SetOverridesTokenStreamMethod();
        }

         [Obsolete("use GermanAnalyzer(Version, Set) instead")]
        public GermanAnalyzer(Version matchVersion, params string[] stopwords)
            : this(matchVersion, StopFilter.MakeStopSet(stopwords))
        {
        }

        [Obsolete("Use GermanAnalyzer(Version, ISet)")]
        public GermanAnalyzer(Version matchVersion, IDictionary<string, string> stopwords)
            : this(matchVersion, stopwords.Keys.ToArray())
        {

        }

        [Obsolete("Use GermanAnalyzer(Version, ISet)")]
        public GermanAnalyzer(Version matchVersion, FileInfo stopwords)
            : this(matchVersion, WordlistLoader.GetWordSet(stopwords))
        {
        }

        [Obsolete("Use GermanAnalyzer(Version, ISet, ISet) instead")]
        public void SetStemExclusionTable(String[] exclusionlist)
        {
            exclusionSet = StopFilter.MakeStopSet(exclusionlist);
            PreviousTokenStream = null;
        }

        [Obsolete("Use GermanAnalyzer(Version, ISet, ISet) instead")]
        public void SetStemExclusionTable(IDictionary<string, string> exclusionlist)
        {
            exclusionSet = Support.Compatibility.SetFactory.CreateHashSet(exclusionlist.Keys);
            PreviousTokenStream = null;
        }

        [Obsolete("Use GermanAnalyzer(Version, ISet, ISet) instead")]
        public void SetStemExclusionTable(FileInfo exclusionlist)
        {
            exclusionSet = WordlistLoader.GetWordSet(exclusionlist);
            PreviousTokenStream = null;
        }

        public override TokenStream TokenStream(String fieldName, TextReader reader)
        {
            TokenStream result = new StandardTokenizer(matchVersion, reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(StopFilter.GetEnablePositionIncrementsVersionDefault(matchVersion), result, stopSet);
            result = new GermanStemFilter(result, exclusionSet, _normalizeDin2);
            return result;
        }
    }
}

Here's an analyzer example from .NET Core, Lucene 4.8.0

There's an existing Turkish analyzer in v2, so we can take inspiration from how that works, but we will write it to fit the patterns of an existing analyzer in v1:

using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Snowball;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Tartarus.Snowball.Ext;
using Lucene.Net.Util;
using System;
using System.IO;

namespace Lucene.Net.Analysis.Tr
{
    public sealed class TurkishAnalyzer : StopwordAnalyzerBase
    {
        private readonly CharArraySet stemExclusionSet;

        public const string DEFAULT_STOPWORD_FILE = "stopwords.txt";

        private const string STOPWORDS_COMMENT = "#";

        public static CharArraySet DefaultStopSet => DefaultSetHolder.DEFAULT_STOP_SET;

        private class DefaultSetHolder
        {
            internal static readonly CharArraySet DEFAULT_STOP_SET = LoadDefaultStopSet();

            private static CharArraySet LoadDefaultStopSet() // LUCENENET: Avoid static constructors (see https://github.com/apache/lucenenet/pull/224#issuecomment-469284006)
            {
                try
                {
                    return LoadStopwordSet(false, typeof(TurkishAnalyzer), DEFAULT_STOPWORD_FILE, STOPWORDS_COMMENT);
                }
                catch (Exception ex) when (ex.IsIOException())
                {
                    throw RuntimeException.Create("Unable to load default stopword set", ex);
                }
            }
        }

        public TurkishAnalyzer(LuceneVersion matchVersion)
            : this(matchVersion, DefaultSetHolder.DEFAULT_STOP_SET)
        {
        }

        public TurkishAnalyzer(LuceneVersion matchVersion, CharArraySet stopwords)
            : this(matchVersion, stopwords, CharArraySet.EMPTY_SET)
        {
        }

        public TurkishAnalyzer(LuceneVersion matchVersion, CharArraySet stopwords, CharArraySet stemExclusionSet) 
            : base(matchVersion, stopwords)
        {
            this.stemExclusionSet = CharArraySet.UnmodifiableSet(CharArraySet.Copy(matchVersion, stemExclusionSet));
        }

        protected internal override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
        {
            Tokenizer source = new StandardTokenizer(m_matchVersion, reader);
            TokenStream result = new StandardFilter(m_matchVersion, source);
            if (m_matchVersion.OnOrAfter(LuceneVersion.LUCENE_48))
            {
                result = new ApostropheFilter(result);
            }
            result = new TurkishLowerCaseFilter(result);
            result = new StopFilter(m_matchVersion, result, m_stopwords);
            if (stemExclusionSet.Count > 0)
            {
                result = new SetKeywordMarkerFilter(result, stemExclusionSet);
            }
            result = new SnowballFilter(result, new TurkishStemmer());
            return new TokenStreamComponents(source, result);
        }
    }
}

The only bits we need to worry about are the stop words and the overridable TokenStream method.

Custom analyzer

Let's create our custom analyzer.

The custom class needs to inherit from Lucene.Net.Analysis.Analyzer.
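In skeleton form, that gives us a class with one method to override, TokenStream, which we'll fill in over the next two sections. A minimal shell, not the final implementation:

using System.IO;
using Lucene.Net.Analysis;

namespace ExamineProject
{
    public class CustomTurkishAnalyzer : Analyzer
    {
        // The only abstract member of Analyzer in Lucene 3.0.3:
        // turn the raw text of a field into a stream of tokens.
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            // The tokenizer and filter chain will go here.
            throw new System.NotImplementedException();
        }
    }
}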

Stop words

A stop word is a commonly used word (such as "the") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We need to tell the analyzer which words, letters, characters, and accents it needs to stop on before creating the index, and expose them as an unmodifiable set. The .NET Core version keeps them in a text file in the related analyzer's folder; in our version we need to hard-code all the words. Let's just add this to our class.

Turkish stop words: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Analysis.Common/Analysis/Tr/stopwords.txt
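If you would rather not hard-code the list, you can load it from a file instead, the way the v2 analyzers do with their stopwords.txt. Here's a minimal sketch using Lucene 3.0.3's WordlistLoader (the same helper the GermanAnalyzer's obsolete FileInfo constructor uses above); the App_Data location, file name, and the comment-marker overload are my assumptions:

using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;

namespace ExamineProject
{
    public static class StopWordLoader
    {
        // Reads a stop word list (one word per line) from disk.
        public static ISet<string> Load()
        {
            // Hypothetical path; ship the file wherever suits your deployment.
            var file = new FileInfo(Path.Combine(
                AppDomain.CurrentDomain.BaseDirectory, "App_Data", "stopwords.txt"));

            // "#" marks the comment lines used in the stopwords.txt linked above.
            return WordlistLoader.GetWordSet(file, "#");
        }
    }
}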

Token Stream

The next thing we need to do is create the token stream. A token stream enumerates the sequence of tokens produced from the search term or document.

A tokenizer is a token stream whose input is a reader, and a token filter is also a token stream, but one whose input is another token stream. Tokenizers break text into legible pieces, usually words, though they can be modified to categorize text in any number of ways. Filters then work further on the resulting tokens by modifying them in some way.

We'll need a standard tokenizer and then a few token filters that take their input from the tokenizer.

For this, we also need a stemmer for our language. A stemmer simply transforms a word into its root form. For example, the stem of the words eating, eats, and eaten is eat. In Turkish, the word "doktoruymuşsunuz" means "you had been the doctor of him", and its stem is "doktor".

Normally each language that has an analyzer also has its stemmer under the same folder, but for languages that don't have analyzers by default, Snowball provides some stemmers: https://lucenenet.apache.org/docs/3.0.3/d9/df6/namespace_s_f_1_1_snowball_1_1_ext.html
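To see what such a chain actually produces, you can run an analyzer over a piece of text and print the resulting tokens. This is a minimal sketch against the Lucene 3.0.3 API; with the CustomTurkishAnalyzer we build below, a stop word like "ve" should disappear and the remaining words should be reduced towards their stems:

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

namespace ExamineProject
{
    public static class AnalyzerDemo
    {
        public static void PrintTokens(Analyzer analyzer, string text)
        {
            // Run the analyzer's full tokenizer/filter chain over the text.
            TokenStream stream = analyzer.TokenStream("content", new StringReader(text));
            ITermAttribute term = stream.AddAttribute<ITermAttribute>();

            // Each IncrementToken() call advances to the next processed token.
            while (stream.IncrementToken())
            {
                Console.WriteLine(term.Term);
            }
        }
    }
}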

Custom analyzer using standard indexer

So here is what our custom analyzer looks like:

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;
using SF.Snowball.Ext;

namespace ExamineProject
{
    public class CustomTurkishAnalyzer : Analyzer
    {
        private static readonly string[] TURKISH_STOP_WORDS = new string[]
        {
            "acaba", "altmış", "altı", "ama", "ancak",
            "arada", "aslında", "ayrıca", "bana", "bazı",
            "belki", "ben", "benden", "beni", "benim",
            "beri", "beş", "bile", "bin", "bir", "birçok",
            "biri", "birkaç", "birkez", "birşey", "birşeyi",
            "biz", "bize", "bizden", "bizi", "bizim", 
"böyle", "böylece", "bu", "buna", "bunda",
            "bundan", "bunlar", "bunları", "bunların",
            "bunu", "bunun", "burada", "çok", "çünkü",
"da", "daha", "dahi", "de", "defa","değil",
            "diğer", "diye", "doksan", "dokuz", "dolayı",
            "dolayısıyla", "dört", "edecek", "eden",
            "ederek", "edilecek", "ediliyor", "edilmesi",
            "ediyor", "eğer", "elli", "en", "etmesi",
            "etti", "ettiği", "ettiğini", "gibi", "göre",
            "halen", "hangi", "hatta", "hem", "henüz",
            "hep", "hepsi", "her", "herhangi", "herkesin",
            "hiç", "hiçbir", "için", "iki", "ile", "ilgili", "ise",
            "işte", "itibaren", "itibariyle", "kadar", "karşın",
            "katrilyon", "kendi", "kendilerine", "kendini", 
"kendisi", "kendisine", "kendisini", "kez", "ki",
            "kim", "kimden", "kime", "kimi", "kimse", "kırk",
            "milyar", "milyon", "mu", "mü", "mı", "nasıl",
            "ne", "neden", "nedenle", "nerde", "nerede",
            "nereye", "niye", "niçin", "o", "olan", "olarak",
            "oldu", "olduğu", "olduğunu", "olduklarını",
            "olmadı", "olmadığı", "olmak", "olması",
            "olmayan", "olmaz", "olsa", "olsun", "olup",
            "olur", "olursa", "oluyor", "on", "ona", "ondan",
            "onlar", "onlardan", "onları", "onların", "onu",
            "onun", "otuz", "oysa", "öyle", "pek", "rağmen",
            "sadece","sanki", "sekiz", "seksen", "sen",
            "senden", "seni", "senin", "siz", "sizden",
            "sizi", "sizin", "şey", "şeyden", "şeyi", "şeyler",
            "şöyle", "şu", "şuna", "şunda", "şundan",
            "şunları", "şunu", "tarafından", "trilyon", "tüm",
            "üç", "üzere", "var", "vardı", "ve", "veya",
            "ya", "yani", "yapacak", "yapılan", "yapılması",
            "yapıyor", "yapmak", "yaptı", "yaptığı", "yaptığını",
            "yaptıkları", "yedi", "yerine", "yetmiş", "yine",
            "yirmi", "yoksa", "yüz", "zaten"
        };

        internal static readonly ISet<string> DEFAULT_SET = CharArraySet.UnmodifiableSet(new CharArraySet(TURKISH_STOP_WORDS, ignoreCase: false));

        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_CURRENT, reader);

            tokenStream = new StandardFilter(tokenStream);

            tokenStream = new LowerCaseFilter(tokenStream);

            tokenStream = new StopFilter(StopFilter.GetEnablePositionIncrementsVersionDefault(Version.LUCENE_CURRENT), tokenStream, DEFAULT_SET);

            return new SnowballFilter(tokenStream, new TurkishStemmer());
        }
    }
}
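A quick note on searching: the query text needs to be analyzed the same way as the indexed text, otherwise the special characters still won't match. The searcher you get from an index in Examine 1.2 uses the analyzer the index was configured with, so searching as usual picks up our custom analyzer. A hedged sketch (the index, category, and field names are just examples):

using System;
using Examine;

namespace ExamineProject
{
    public class SiteSearchService
    {
        private readonly IExamineManager _examineManager;

        public SiteSearchService(IExamineManager examineManager)
        {
            _examineManager = examineManager;
        }

        public ISearchResults Search(string term)
        {
            if (!_examineManager.TryGetIndex("ExternalIndex", out var index))
                throw new InvalidOperationException("ExternalIndex not found");

            // GetSearcher() returns a searcher that analyzes the query text
            // with the same analyzer the index was built with.
            return index.GetSearcher()
                .CreateQuery("content")
                .Field("nodeName", term)
                .Execute();
        }
    }
}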

Custom analyzer using a custom indexer

Our Umbraco has good documentation for this: https://our.umbraco.com/documentation/reference/searching/examine/indexing/

A custom index creator uses the analyzer we pick to create all, or some, of the indexes we want to customize. With the one below, the external index's full-text fields are created using our custom language analyzer. We can add any field we want to any index we want this custom index creator to build.

using System.Collections.Generic;
using System.Linq;
using Examine;
using Umbraco.Core;
using Umbraco.Core.Logging;
using Umbraco.Core.Services;
using Umbraco.Examine;
using Umbraco.Web.Search;

namespace ExamineProject
{
    public class CustomUmbracoIndexesCreator : UmbracoIndexesCreator
    {
        public CustomUmbracoIndexesCreator(IProfilingLogger profilingLogger, ILocalizationService languageService, IPublicAccessService publicAccessService, IMemberService memberService, IUmbracoIndexConfig umbracoIndexConfig)
            : base(profilingLogger, languageService, publicAccessService, memberService, umbracoIndexConfig)
        {
        }

        public override IEnumerable<IIndex> Create()
        {
            var defaultIndexes = base.Create().ToDictionary(x => x.Name, x => x);

            var internalIndex = defaultIndexes[Umbraco.Core.Constants.UmbracoIndexes.InternalIndexName];

            var memberIndex = defaultIndexes[Umbraco.Core.Constants.UmbracoIndexes.MembersIndexName];

            var externalIndex = CreateExternalIndex();

            return new IIndex[]
            {
                internalIndex,
                externalIndex,
                memberIndex
            };
        }

        private IIndex CreateExternalIndex()
        {
            var index = new UmbracoContentIndex(
                Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexName,
                CreateFileSystemLuceneDirectory(Umbraco.Core.Constants.UmbracoIndexes.ExternalIndexPath),
                new UmbracoFieldDefinitionCollection(),
                new CustomTurkishAnalyzer(),
                ProfilingLogger,
                LanguageService,
                UmbracoIndexConfig.GetPublishedContentValueSetValidator());

            return index;
        }
    }
}
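Finally, Umbraco needs to be told to use this creator instead of the default one. Following the registration pattern from the Our documentation linked above, a composer sketch (the class name is mine, and the namespaces are as I recall them in Umbraco 8):

using Umbraco.Core.Composing;
using Umbraco.Examine;

namespace ExamineProject
{
    // Swaps Umbraco's default index creator for our custom one.
    public class CustomIndexesComposer : IUserComposer
    {
        public void Compose(Composition composition)
        {
            composition.RegisterUnique<IUmbracoIndexesCreator, CustomUmbracoIndexesCreator>();
        }
    }
}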

In Summary

When searching with Examine, the special characters of some languages are flattened, and this causes wrong or missing results. I've shared two different ways to fix this.

Hope I could be of help!

Stay friendly, stay safe!

Busra Sengul

Busra is an Umbraco and .NET developer at Bump Digital and a Documentation Curator at Umbraco, working remotely from Turkey. She is also an Umbraco Certified Grand Master, a two-time Umbraco MVP, and a meetup organizer. An active member of the Umbraco community and a general Umbraco lover, she has dedicated her life to lifelong learning and gets over-excited about cool projects. She also enjoys Latin dancing, hiking, swimming, reading, and knitting. She is the big sister of two and loves to listen and play the role of in-house therapist.
