While browsing the Internet just now, I came across a bulletin-board discussion from last December about Google having started to use stemming. This surprised me, since I hadn't seen any evidence of it in my own googling. I also seemed to recall Google stating that it was not using stemming at all.

A little digging revealed that this earlier statement on The Basics of Google Search page:

Word Variations (Stemming)
To provide the most accurate results, Google does not use "stemming" or support "wildcard" searches. In other words, Google searches for exactly the words that you enter in the search box. Searching for "book" or "book*" will not yield "books" or "bookstore". If in doubt, try both forms: "airline" and "airlines," for instance.

had now been changed to:

Word Variations (Stemming)
Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms, but also for words that are similar to some or all of those terms. If you search for "pet lemur dietary needs", Google will also search for "pet lemur diet needs", and other related variations of your terms. Any variants of your terms that were searched for will be highlighted in the snippet of text accompanying each result.

Only certain terms seem to trigger the feature: "plural words", "snorkelling gear", and "inkjet cartridge" are among a few that I managed to discover.

Search results for 'plural words' on Google
Notice that not only "words" and "plural" but also "word" and "plurals" are highlighted.

Only a fixed set of words is treated in this manner, and then only the plural form and the -ing form are conflated with the base word. This is clearly a simple Table Lookup stemmer and not a full-fledged stemming algorithm.
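To make the idea concrete, here is a minimal sketch of what a table-lookup stemmer does with a query. Everything in it is an assumption for illustration: the table entries, the names `STEM_TABLE`, `stem`, `variants` and `expand_query`, and the way the query is expanded. It is not Google's implementation, which is not public.

```python
# Hypothetical lookup table: variant form -> base form.
# The entries are illustrative only, not taken from any real search engine.
STEM_TABLE = {
    "words": "word",
    "plurals": "plural",
    "cartridges": "cartridge",
    "snorkelling": "snorkel",
}


def stem(term):
    """Return the base form if the term is in the table, else the term itself."""
    term = term.lower()
    return STEM_TABLE.get(term, term)


def variants(term):
    """Return every known form that shares a base form with the term."""
    base = stem(term)
    forms = {base, term.lower()}
    # Collect all table entries that map to the same base form.
    forms.update(v for v, b in STEM_TABLE.items() if b == base)
    return sorted(forms)


def expand_query(query):
    """Expand each query term into the list of forms to search for."""
    return [variants(term) for term in query.split()]


print(expand_query("plural words"))
# A term missing from the table is searched for exactly as typed.
print(variants("lemmatizers"))
```

With this table, "plural words" expands to both "plural"/"plurals" and "word"/"words", which matches the highlighting seen in the results. A term that is not in the table, such as "lemmatizers", gets no variants at all.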

It's strange that they should reverse their earlier policy of not using a stemmer ("[t]o provide the most accurate results") only to implement this kind of stemmer. The Table Lookup approach has a few problems. The table itself has to be maintained manually, an enormous and endless task. Many domain-dependent terms found in the Google database will not be represented in the table, since it will most likely cover only everyday English. The table also requires some extra storage space.

P.S. Wildcard searches where the asterisk can stand for a whole word are also allowed now: "rip * rup".
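A rough way to picture a whole-word wildcard is as a phrase pattern in which each asterisk matches exactly one word. The sketch below uses a regular expression for that; the function name `phrase_pattern` and the one-word-per-asterisk rule are my assumptions, not a description of how Google actually matches such queries.

```python
import re


def phrase_pattern(query):
    """Build a regex for a phrase in which each '*' stands for one whole word."""
    parts = [r"\w+" if token == "*" else re.escape(token)
             for token in query.split()]
    # Words must be separated by whitespace and sit on word boundaries.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


pattern = phrase_pattern("rip * rup")
print(bool(pattern.search("they sang rip rap rup all night")))  # one word fills the gap
print(bool(pattern.search("rip rup")))                          # no word in the gap
```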


