Mon, 09 May 2011

Google's New Quality Ranking Standards

Google on Friday posted a new blog entry, "More guidance on building high-quality sites". While they don't disclose any of the direct signals their algorithm uses, they do pose some questions you should ask yourself about whether your site offers quality content. (Rather than rehash them here, I suggest you check out the article.)

In reviewing the questions they suggest, one can back into the kinds of signals their algorithm likely uses. For example, because they ask the question "Would you trust the information presented in this article?", one can infer the kinds of things that Google's indexing algorithm is able to detect via "machine learning". Given their questions, here are a few signals (most already known or suspected) that matter not only to your visitors, but that Google could plausibly attempt to measure about your site and apply to your results:

  • How long has the domain been around?
  • How many other trusted sites also link to this domain or this page?
  • Does it have an https address, and is the certificate self-signed or signed by a third party?
  • What kind of related keyword density is there in the article? For example, if it's talking about zebras, does it just talk about black and white stripes, or does it mention words and phrases like "Equus Dolichohippus", "open‐grassland grazers" and "gregarious herd animals", indicating greater depth of knowledge on the author's part? (A rough sketch of how such a check might be computed follows this list.)
  • Is there significant breadth across the consistent set of topics discussed, or is the same topic rehashed in different ways?
  • Do checkout pages use https?
  • Do checkout pages reference resources from off the site (or resources that would generate warnings under browsers' default settings)?
  • Does it pass basic MS Word-style grammar and spelling checks?
  • If there are verifiable facts in the article, do they add up? For example, if you end up with something other than roughly 306 parsecs per millennium as the speed of light, you might not be seen as an authority. (Google the following, as an example: "speed of light in parsecs per millennium"; the sketch after this list works the figure out.)
  • To what extent is your page/site bookmarked or shared via social networks and revisited directly, rather than served by search queries?
  • How often do people revisit the site at large, directly or via broad terms, and how deep do they go, rather than coming in via search queries?
  • Does the article coin any phrases that were seen here first and later parroted elsewhere?
  • How much time do users spend on the page compared to other pages with similar keywords/density?
  • For a given topic, are there any glaring omissions of keywords indicating a particular kind of bias?
  • Do sites that fit into content clusters considered high value link to this site, or are they linked to from this site?
  • Is the structure of the document concise, and does it follow patterns of effective communication? (That is, an establishing paragraph, followed by one of exposition, a detailed list of items, and a summary paragraph or two? Or is it a long, rambling screed with no thought or organization to its content?)
  • Is the information density high, or is it diminished by ads? Is there a superabundance of distracting ads that take over the user experience?
  • Are there indications that the reading level of the article is higher than average?
  • Does the length of the article match the keyword density?
  • Does the HTML of the page validate? Are there glaring errors in the markup that would cause the user's experience to suffer in various browsers? Did the author take pains to choose a meaningful title, description, and section headings that make sense?
  • Are there any bait-and-switch tactics going on that would frustrate users clicking on the search results?
  • Does the page load quickly?
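
To make the point that these checks are automatable a bit more concrete, here is a rough, purely illustrative Python sketch of two of the on-page signals above: the density of related terms for a topic (the zebra example), and the speed-of-light fact check. The vocabulary list, constants, and thresholds are placeholders of my own invention, not anything Google has published.

    import re

    # Hypothetical vocabulary of "depth" terms for the zebra example above;
    # a real system would derive related terms statistically, not from a hand-made list.
    RELATED_TERMS = {
        "equus", "dolichohippus", "grassland", "grazer", "grazers",
        "gregarious", "herd", "stripes", "savanna",
    }

    def related_term_density(text):
        """Fraction of the article's words that belong to the topic vocabulary."""
        words = re.findall(r"[a-z]+", text.lower())
        if not words:
            return 0.0
        hits = sum(1 for w in words if w in RELATED_TERMS)
        return hits / len(words)

    # Fact check from the list above: the speed of light in parsecs per millennium.
    C_M_PER_S = 299792458.0              # speed of light, metres per second
    PARSEC_M = 3.0857e16                 # metres in one parsec
    MILLENNIUM_S = 1000 * 365.25 * 86400  # seconds in a Julian millennium

    def light_parsecs_per_millennium():
        return C_M_PER_S * MILLENNIUM_S / PARSEC_M

    if __name__ == "__main__":
        sample = ("Zebras are gregarious herd animals of the genus Equus, "
                  "subgenus Dolichohippus, and are open-grassland grazers.")
        print(round(related_term_density(sample), 3))    # rough topical-depth signal
        print(round(light_parsecs_per_millennium(), 1))  # ~306.6, matching the figure above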

All of these are automatable, machine-learnable, and semantically deducible factors. Whether Google draws on explicit data from its own products (like the streams it collects from the Toolbar, Chrome, or Buzz), on data feeds it has arranged with specific companies (like Twitter or news outlets), on crawling the Web directly, or on implicit factors (calculated rankings like PageRank, or indicators like which sites link to you and how often), the questions posed in Google's newest blog post are worth contemplating to work out how, precisely, their software engineers might have instructed a computer to automatically determine and assign a relative quality value to your content.
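
To illustrate the "machine learnable" half of that claim, here is a minimal, hypothetical sketch of how per-page signals like the ones above might be folded into a single relative quality score. The feature names and weights are invented placeholders; a real system would learn them from human-rated example pages rather than have them set by hand.

    import math

    # Hypothetical per-page signals, each normalized to roughly the 0..1 range.
    # None of these names or weights come from Google; they are placeholders.
    WEIGHTS = {
        "trusted_inbound_links": 2.0,
        "related_term_density":  1.5,
        "direct_revisit_rate":   1.0,
        "page_load_speed":       0.5,
        "ad_density":           -1.5,   # heavy ad load counts against the page
        "spelling_error_rate":  -2.0,
    }
    BIAS = -1.0

    def quality_score(signals):
        """Combine weighted signals with a logistic squash into a 0..1 quality value.
        A production system would learn WEIGHTS and BIAS from labeled pages."""
        z = BIAS + sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
        return 1.0 / (1.0 + math.exp(-z))

    if __name__ == "__main__":
        page = {
            "trusted_inbound_links": 0.8,
            "related_term_density":  0.4,
            "direct_revisit_rate":   0.6,
            "page_load_speed":       0.9,
            "ad_density":            0.2,
            "spelling_error_rate":   0.05,
        }
        print(round(quality_score(page), 3))  # relative quality value for this page

The point isn't the particular weights; it's that once each question above has been reduced to a number, combining those numbers into a ranking signal is straightforward.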




Khan Klatt
