When someone uses an ambiguous search on Google, the algorithm immediately needs to identify the language preferences of the user. This is necessary in order to determine which category of results to return, before ranking those results.
Although the word football is spelt the same by all English speakers, a human audience would not be able to determine which type of game is being referenced in a conversation unless they knew the geographical location of the person talking about the game.
The games that go by that name share a lot of similarities, such as plenty of running, passing, and even goal kicking.
You can make it easier for a search engine to understand the language your web page is written in by choosing the right character set. There are other ways for a search engine to determine the language of a web page, but this is the simplest method.
But what about when someone types in a query?
A search engine uses a variety of techniques to determine the language of a query, including looking at the language settings of the user’s browser and evaluating the contents of the query itself.
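Browser language settings normally reach a web server through the HTTP Accept-Language request header. The snippet below is a minimal sketch of reading that header, assuming a simple, well-formed value. The function name parse_accept_language is made up for illustration, and nothing here is taken from Google's implementation.

```python
from typing import List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Return (language_tag, weight) pairs, most preferred first."""
    preferences = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        weight = 1.0  # a tag without a q-value defaults to weight 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # assumption: treat a malformed weight as "not preferred"
        preferences.append((tag.strip().lower(), weight))
    # sorted() is stable, so tags with equal weights keep their header order
    return sorted(preferences, key=lambda item: item[1], reverse=True)


print(parse_accept_language("de-CH,de;q=0.9,en;q=0.8"))
```

A browser set to Swiss German first and English last would produce a header like the one in the example, so the first tag in the returned list is the strongest hint about the searcher's preferred language.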
How does it handle queries in different languages made on devices that cannot easily produce accented or non-Latin characters?
Do webpages using a specific character set have a higher chance of being easily identified by a search engine?
Google Patent Applications on the Language of a Query
Google has recently published four patent applications that explore the language of a query and how to handle the uncertainty of language when processing search queries and documents over the web.
A search engine’s job is to index and search documents written in various languages, including individual documents that contain more than one language.
Keyboards Without Non-Latin Characters
Different devices can have trouble displaying some of the characters used in different languages.
When people are typing on a small keyboard or a handheld device, they often use characters that are similar to the ones they actually want to use, such as an unaccented character in place of an accented one.
If a search engine stripped accents and converted special characters into a standard set of characters when processing the content it has indexed, it would lose information from the search index. Some content could then become impossible to retrieve when a searcher uses their natural language in a query that includes accented or non-Latin characters.
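To see why stripping accents from the index loses information, here is a small sketch using Python's standard unicodedata module. The word list is invented for illustration, and the folding shown is a generic accent-stripping technique, not the method described in the patent applications.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical indexed terms: two pairs that differ only by their accents.
indexed_terms = ["résumé", "resume", "peña", "pena"]

# Group the original terms by their accent-free form.
folded = {}
for term in indexed_terms:
    folded.setdefault(strip_accents(term), []).append(term)

print(folded)
```

Each accent-free key ends up pointing at two different original words. An index that stored only the folded form could no longer tell them apart.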
Search Engines Learning the Language of a Query and Documents
Under the approach in these patents, a training model is created to identify the language used in documents to be searched.
The training model focuses upon a specific body of documents when training, and those can be a mix of different types of documents, such as:
- Text documents,
- Word processing documents,
- Usenet articles, or
- Any other kinds of documents having text content, including metadata content.
These documents should be representative of the Web, or of a snapshot of the Web.
The ideal body of documents for training would include representatives from all the languages used on the internet, with a sufficiently large number of documents from each language so that they would be likely to include a significant proportion of all the words used in that language on the internet.
The Role of Character Encoding
The system would work best if the documents were encoded in a character encoding that is known and consistent, such as 8-bit Unicode Transformation Format (UTF-8).
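If you look on the internet, you will find many pages that do not have a character set defined, or that use a completely different character set. In HTML, a page using UTF-8 usually declares it in the head of the document with a meta element along the lines of <meta charset="utf-8">, or the older form <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.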
If the language determination process works in UTF-8 and a page uses some other encoding, that page might first get converted into UTF-8. Converting text carries a risk: if the source encoding is misidentified, some strange characters will appear in the results.
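Here is a short sketch of what that conversion looks like, and how the strange characters can creep in. The helper name convert_to_utf8 and the sample word are assumptions for illustration; the encode and decode calls are standard Python.

```python
def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes with their declared encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# A page saved in ISO-8859-1 stores "é" as the single byte 0xE9.
latin1_page = "café".encode("iso-8859-1")

# Correct: the declared encoding matches the bytes.
good = convert_to_utf8(latin1_page, "iso-8859-1").decode("utf-8")

# Wrong: UTF-8 bytes mislabeled as ISO-8859-1 turn into mojibake.
garbled = "café".encode("utf-8").decode("iso-8859-1")

# Wrong the other way: ISO-8859-1 bytes read as UTF-8 are invalid, so the
# decoder substitutes the replacement character U+FFFD.
replaced = latin1_page.decode("utf-8", errors="replace")

print(good)      # café
print(garbled)   # cafÃ©
print(replaced)  # "caf" followed by the U+FFFD replacement character
```

The conversion itself is harmless when the declared character set is accurate. The damage comes from pages whose declared (or guessed) encoding does not match the bytes actually served.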
Language Detection on Pages, Using Probabilities
The language detection process uses statistical models to estimate the language of a document.
The most likely class (that is, language) of a text page is determined from the text on the page and the URL of the page.
This can be done by analyzing the words in the text and comparing them to words in different languages to predict which language the text is in.
So, if the word “Hello” occurs frequently on a page, and in the training model that word appears most frequently on English pages and then on German pages, there is a higher probability that the page is in English, followed by German.
Looking at certain characters can be helpful, too. If characters don’t appear frequently in some languages, then pages containing words with those characters might be less likely to be in those languages.
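The idea of scoring a page against each language can be sketched with a tiny naive Bayes style classifier. Everything in it is an assumption made for illustration: the word counts are invented, the languages are given equal prior probability, and add-one smoothing stands in for whatever the real training model does.

```python
import math
from collections import Counter

# Toy training data: how often each word was seen on pages of each language.
# These counts are invented for illustration only.
TRAINING_COUNTS = {
    "en": Counter({"hello": 50, "the": 200, "phone": 30}),
    "de": Counter({"hello": 10, "der": 180, "handy": 40}),
    "fr": Counter({"bonjour": 60, "le": 190, "baguette": 20}),
}


def language_scores(words):
    """Return a log-probability score per language for a list of words."""
    vocabulary = set().union(*TRAINING_COUNTS.values())
    scores = {}
    for lang, counts in TRAINING_COUNTS.items():
        total = sum(counts.values())
        log_prob = 0.0
        for word in words:
            # Add-one smoothing keeps unseen words from zeroing the score.
            log_prob += math.log((counts[word] + 1) / (total + len(vocabulary)))
        scores[lang] = log_prob
    return scores


def rank_languages(words):
    """Languages ordered from most to least likely for the given words."""
    scores = language_scores(words)
    return sorted(scores, key=scores.get, reverse=True)


print(rank_languages(["hello", "hello", "hello"]))  # ['en', 'de', 'fr']
```

With these toy counts, a page full of “hello” ranks English first and German second, which mirrors the example in the text. The same scoring could be run over characters instead of words to capture the point about characters that are rare in some languages.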
The Use of Character Mapping
One of the keys to this process is creating character maps that are specific to individual languages. Language-specific character maps allow for more precise conversions, because the common form of a word in a given language may contain accented characters that a searcher leaves out when typing.
The query patents discuss character mappings in great detail, including how they can be used in different ways.
One is to help identify languages for some queries.
When a searcher is unable to type specific characters because the device they are using is not equipped for them, the simplified query they enter might be mapped back to the word they intended. The patent applications give many examples of this.
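A language-specific map of this kind could be sketched as follows. The vocabularies, counts, and function names are all hypothetical, and picking the most frequent accented form for each simplified spelling is my own simplification rather than the procedure the patent applications describe.

```python
import unicodedata


def simplify(word: str) -> str:
    """Lowercase a word and remove its accent marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


# Hypothetical per-language vocabularies with invented occurrence counts.
VOCAB = {
    "fr": {"élève": 120, "eleve": 3, "café": 300},
    "es": {"año": 500, "ano": 40},
}


def build_conversion_map(vocab):
    """Map each simplified spelling to its most frequent form in one language."""
    best = {}
    for word, count in vocab.items():
        key = simplify(word)
        if key not in best or count > vocab[best[key]]:
            best[key] = word
    return best


MAPS = {lang: build_conversion_map(words) for lang, words in VOCAB.items()}


def restore(query_word: str, lang: str) -> str:
    """Return the common form for a typed word, or the word itself if unknown."""
    return MAPS[lang].get(simplify(query_word), query_word)


print(restore("eleve", "fr"))  # élève
print(restore("ano", "es"))    # año
print(restore("cafe", "es"))   # cafe (no Spanish entry in this toy map)
```

Because each language gets its own map, the same unaccented input can resolve differently, or not at all, depending on which language the query is believed to be in.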
Levers to Determine Language Preferences
Despite having gathered a lot of data on users, there will still be many instances where past history will not be helpful.
In order to interpret a query, Google looks at five different areas.
The AdWords support page states that AdWords uses only the user’s settings. However, ads in other languages may be displayed based on the determined language of the query.
User Account Preferences
If a user signs up for a Google account, they are either forced to choose a language and location during the sign-up process, or they are automatically assigned one.
Google will assume that the likely language of any query will be American English if the user has declared their preferences to be English and US. The preferences you set also apply to the default search settings, which you can find on a Google search page under search settings.
If a user wants to see results in another language on their Google account, they need to change their language preferences manually.
The settings for your Google account can be changed to affect either all Google products, or just the search function. Changing your language and location preferences on one device will affect your search settings on all other devices where you are logged in.
Google’s first backup for account level language settings is a similar setting at the browser level. This is for people who don’t have Google accounts or are not always logged in.
In all modern browsers, there is a default setting that declares the user’s language preferences. Google will use clues like a browser’s language and location preferences to determine a user’s language intent.
The language of the browser is usually set to the language the user selected when they installed the software.
The language and location of a browser is typically set based on where the browser was downloaded from. So, if a browser was downloaded in English from a United States mirror, it is likely that the browser will be set to English and the United States.
Adjusting these settings can be done at the browser level for Chrome and Firefox, but for IE and Safari, it needs to be done at the system level.
Just using either a Google account or browser settings isn’t always enough to convince Google’s algorithm of the language you want to use. To increase certainty, they will see where the user is physically located.
Google relies heavily on a user’s physical location to improve the accuracy of their search results.
San Francisco users searching for “Giants” are more likely to see results for the baseball team, even during the offseason, while users on the East Coast are more likely to see results for the football team.
There is usually not a big difference in the search results on Google.com from different locations, but there are some queries that see significant changes.
A search for the word “football” will produce similar results in the US, Canada, and the UK, while a search for the word “holiday” will produce different results in the UK than in the US.
TLD of Google Domain
Although physical location can give clues about a user’s language intentions, it is rarely more important than account or browser language settings. The TLD that the query was conducted on, however, can override language settings.
If a user is logged into their Google account, they will usually see Google.com as the default homepage even if they are in a different country. A user who isn’t logged in will be redirected to a local version of Google based on their country code, even if their browser is set to English and the United States.
The Top Level Domain (TLD) is a very important factor in determining the language in which to return results. If there was a hierarchy in Google’s language determination processing, it could either be first or simply go hand-in-hand with location targeting.
Google can best interpret a user’s language intent by looking at which top-level domain (TLD) they have chosen. TLDs can provide clues as to whether a user is intentionally targeting a specific language.
A user in the US who conducts a search on Google.com.br is more likely to prefer Portuguese results. On the other hand, it can be a poor clue if the user was simply directed to that TLD by their location as a traveler might have been. This is because the user’s location might not be indicative of their nationality.
A US resident traveling in Germany who conducts a Google search while logged out of their account would see Google.de by default because of their location.
If Google only looks at the top-level domain when determining a user’s language intent, it might not provide accurate results.
The word “handy” in German refers to a mobile phone. Therefore, if a German user searches for the word “handy”, they will see results related to mobile phones.
If the user had wanted to see results from Google US, they would have had to pick the right language.
Google will typically assume the primary language of a country when utilizing a TLD.
In Canada, a search for the word “baguette” would return English results even though the word is French.
If someone queried Google in Switzerland, Google would automatically assume that they were speaking German, even though German, French, and Italian are all widely spoken languages in Switzerland.
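The “primary language per TLD” assumption amounts to a lookup table. The sketch below uses only the domain and language pairs suggested by the examples in this article; the table, the fallback value, and the function name are illustrative assumptions, not a list published by Google.

```python
# Assumed defaults, taken from the examples discussed in this article.
DEFAULT_LANGUAGE_BY_DOMAIN = {
    "google.com": "en",
    "google.com.br": "pt",
    "google.de": "de",
    "google.ca": "en",  # English is assumed even though French is also spoken
    "google.ch": "de",  # German is assumed despite French and Italian speakers
}


def default_language(host: str, fallback: str = "en") -> str:
    """Return the language assumed for a Google host name."""
    host = host.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return DEFAULT_LANGUAGE_BY_DOMAIN.get(host, fallback)


print(default_language("www.google.ch"))  # de
print(default_language("google.com.br"))  # pt
```

A single value per domain is exactly what makes this signal weak for multilingual countries, which is why it has to be weighed against the other clues.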
Query Parsing and Matching
Lastly, Google breaks down the query itself, looking for any clues as to the language. The algorithm searches for each word in the most common languages.
This means that if a keyword matches a word in a particular language, the results will probably be in that language. This is simple when the word is spelled correctly and only matches one popular language. If it’s not an exact match, it’s a bit more complicated.
In these cases, Google will look for statistically likely misspellings in a specific language.
Although the word “football” can be spelled various ways, Google will use other information to try and determine if the user made a spelling mistake or if they are looking for results in another language.
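One rough way to picture this matching is to compare the typed word against a small word list for each language and keep the closest hits. The sketch below uses Python's standard difflib module as a stand-in for whatever statistical misspelling model Google actually uses. The word lists, the cutoff value, and the function name are invented for illustration.

```python
import difflib

# Tiny hypothetical word lists per language.
LEXICONS = {
    "en": ["football", "holiday", "phone"],
    "es": ["fútbol", "vacaciones", "teléfono"],
    "de": ["fußball", "urlaub", "handy"],
}


def candidate_languages(query_word: str, cutoff: float = 0.75):
    """Return (language, closest_word, similarity) tuples, best match first."""
    word = query_word.lower()
    results = []
    for lang, lexicon in LEXICONS.items():
        # Keep only the single closest word that meets the similarity cutoff.
        matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        if matches:
            score = difflib.SequenceMatcher(None, word, matches[0]).ratio()
            results.append((lang, matches[0], score))
    return sorted(results, key=lambda item: item[2], reverse=True)


print(candidate_languages("futbol"))   # closest to Spanish "fútbol"
print(candidate_languages("footbal"))  # closest to English "football"
```

With these word lists, “futbol” lands closest to the Spanish entry and “footbal” to the English one. A real system would combine a score like this with the other signals before deciding whether the searcher made a typo or wants another language.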
For readers who are interested in the technical details of this process, Google has published a patent on the topic.
SEO specialists typically concentrate on the parts of Google’s algorithm that decide where a page should rank. But Google’s algorithm is not just a simple ordering of content based on scores.
Every time a user submits a query, the search engine needs to determine the user’s language in order to start retrieving results from its index and ranking them accordingly.
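To make the interplay of the five signals concrete, here is a purely hypothetical sketch that combines them by weighted voting. Google has not published how the signals are weighed, so the weights, the signal names, and the voting scheme itself are assumptions chosen only to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical weights: invented for this sketch, not taken from any Google source.
SIGNAL_WEIGHTS = {
    "account": 3.0,
    "browser": 2.0,
    "location": 1.0,
    "tld": 3.0,
    "query": 2.0,
}


def pick_language(signals: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the language with the highest total weight, or None if no signal."""
    votes = defaultdict(float)
    for name, lang in signals.items():
        if lang is not None:  # a missing signal simply casts no vote
            votes[lang] += SIGNAL_WEIGHTS.get(name, 0.0)
    if not votes:
        return None
    return max(votes, key=votes.get)


# A logged-in US traveler in Germany who lands on the German domain.
traveler = {"account": "en", "browser": "en", "location": "de", "tld": "de", "query": None}
print(pick_language(traveler))  # en
```

In this toy setup the account and browser settings together outweigh the location and TLD, so the traveler still gets English. Different weights would flip that outcome, and weights like these are exactly the part we cannot see from the outside.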
This brief overview of how Google determines a query’s language should give you some insights into how hard Google works to satisfy a user and provide high-quality results.
I have not found any Google source which explicitly states how they determine ranking, but from my own research I have found the following. If you know of something different or have just discovered something, please let me know. I’m always interested in hearing more.
If you work with websites written in non-Latin characters, you may find these patent applications interesting and worth learning more about.
There is another patent application, not yet published, that looks like it goes into even more depth on this topic.
Some of the languages and conversion maps created for those languages discussed in the patent filings include:
Catalan, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, French, German, Greek, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovakian, Slovenian, Spanish, Swedish, Finnish, Turkish, and Ukrainian.