a blog about news and politics by steve janke
 

HTML validation and Google rankings

I've been doing some housekeeping today, focused on something I've wanted to tackle for some time, and that is cleaning up the HTML on my pages. Why? Because you'd be surprised just how important it is for getting ranked well.




Reviewing my Google Analytics stats, I see that 60% of my visitors are first-time visitors. Over 10% of my referrals are from search engines. That's a significant chunk of visitors coming to my site for the first time from search engines.

Still, it would be nice to do better. Of that 10% from search engines, 6% comes from Google and 4% from Yahoo. But Google controls just under 50% of the search market, and Yahoo has 28%. Clearly Google is underperforming for me. Yahoo is ranking me higher for the same stories, so relative to Google I get more traffic from Yahoo than the market shares suggest I should.

Or to put it another way, I'm not getting enough traffic from Google.
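To put rough numbers on that, here is a small sketch in Python. It assumes that referrals should split between the two engines in the same ratio as their market share, which is a simplification, and it rounds "just under 50%" up to 50. The variable names are mine.

```python
# Figures from the paragraphs above (percent of all referrals, percent of search market).
google_referrals, yahoo_referrals = 6.0, 4.0
google_market, yahoo_market = 50.0, 28.0   # "just under 50%" rounded to 50

# Assumption: if both engines ranked the site equally well, referrals would
# split in the same ratio as market share.
expected_ratio = google_market / yahoo_market        # Google-to-Yahoo, by market share
actual_ratio = google_referrals / yahoo_referrals    # Google-to-Yahoo, by referrals

# Google referrals implied by the Yahoo figure under that assumption.
expected_google_referrals = yahoo_referrals * expected_ratio

print(f"expected ratio {expected_ratio:.2f}, actual ratio {actual_ratio:.2f}")
print(f"implied Google referrals: {expected_google_referrals:.1f}% (actual {google_referrals:.1f}%)")
```

Under that assumption, Google would be sending roughly 7% of referrals instead of 6%.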

Why is that? It could be any number of reasons, and many are hard to control. But there is one aspect that is always under my control, and that is getting my content pages (pages with actual stories or story excerpts that are the most important to get ranked) to validate properly. You'd be surprised just how awful most web pages are at delivering syntactically correct HTML. The reason is that browsers make a huge effort to render bad HTML in a manner that is useful to the reader. A browser can afford to make the effort because it is working in "people time" -- the extra half-second or so of uber-fast processing required to fix the results of a <div> lacking a closing tag is hardly noticeable to the individual user. A web developer might see that his page is rendering correctly, and figure his work is done. He is writing for people, after all.

But that's not true. He is also writing for search bots from Google, Yahoo, MSN, and so on. A search engine bot is trying to parse and analyze as many pages as possible. If Googlebot comes across a page that makes no sense because of bad HTML, then after a half-hearted effort it'll just give up and move on:

Bad HTML can hurt your site in the search engines without you ever realizing it.

What exactly is HTML validation? It's the process of checking the syntax of your HTML code to find places where you've violated the rules of the language. The official rules for writing HTML are defined by the World Wide Web Consortium (W3C). Those rules include strict definitions stating which HTML tags are legitimate parts of the language, and how you should structure your HTML documents.

HTML errors that violate these rules include things like badly nested tags (where you incorrectly close one element before another), content model violations (where you nest tags that aren't allowed inside one another), and badly formed tables.
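As an illustration of the first kind of error, badly nested tags can be caught with a simple stack. The sketch below uses Python's standard html.parser module; the class and function names are hypothetical, the list of void elements is partial, and it applies the stricter XHTML rule that every other element needs an explicit closing tag. It is not a substitute for a real validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, enough for this sketch).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

class NestingChecker(HTMLParser):
    """Collects badly nested or unmatched tags using a simple stack."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.errors = []      # human-readable problems found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if not self.open_tags:
            self.errors.append(f"</{tag}> has no matching opening tag")
        elif self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(
                f"</{tag}> closed while <{self.open_tags[-1]}> was still open"
            )
            if tag in self.open_tags:
                # Discard everything down to and including the matching opening tag.
                while self.open_tags.pop() != tag:
                    pass

def check_nesting(html):
    """Return a list of nesting problems found in the given HTML string."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    errors = list(checker.errors)
    errors.extend(f"<{tag}> was never closed" for tag in checker.open_tags)
    return errors

for problem in check_nesting("<p><b>bold</p></b>"):
    print(problem)
```

A browser would render that sample as bold text without complaint, which is exactly why this kind of mistake goes unnoticed.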

It helps to think of a search engine spider as a web browser - just like a browser, the spider needs to interpret your page and figure out what you're saying. Only then can it properly index your page. Search engine spiders also care about the structure of your Web page because they give extra weight to keywords placed inside certain HTML tags.

I have direct experience with bad HTML hurting a search engine ranking. A few months ago I helped a webmaster who had lost his Top 10 ranking because of a simple typo in his HTML. One badly placed angle bracket kept Googlebot from correctly parsing the home page, causing it to fall completely out of the index. The page displayed correctly under all the major browsers, but it still caused problems for Googlebot.

So how do you make sure that your HTML is correct? The most popular way is to use the HTML Validator at the World Wide Web Consortium site. Provide the URL of the page in question, and within moments you'll be shocked at the dozens, perhaps hundreds, of errors that come up.
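If you would rather script the check than paste URLs into a form, the request can be built by hand. This is only a sketch: it assumes the W3C "Nu" checker endpoint and its doc and out query parameters, the function name is hypothetical, and it builds the URL without sending anything.

```python
from urllib.parse import urlencode

# Assumption: the W3C "Nu" checker endpoint and its doc/out query parameters.
VALIDATOR_ENDPOINT = "https://validator.w3.org/nu/"

def validator_url(page_url, output="json"):
    """Build the URL that asks the validator to check page_url."""
    query = urlencode({"doc": page_url, "out": output})
    return f"{VALIDATOR_ENDPOINT}?{query}"

print(validator_url("http://example.com/my-post.html"))
```

Fetching that URL and reading the response is left out here, since the response format is not something this post covers.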

Of course, many errors are cascades, so like any debugging effort, your best approach is to fix the first couple of errors, and then revalidate. I discovered all sorts of silly little bugs that have crept in over my various redesigns -- paragraph tags not closed off, dangling div tags, misspelled attributes, and the bane of HTML developers everywhere, the unescaped ampersand.
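The unescaped ampersand is easy to hunt down mechanically. Here is a minimal sketch in Python, assuming that any ampersand which does not begin a named, decimal, or hexadecimal character reference should become &amp;. The function name is hypothetical, and the pattern does not check that a named reference actually exists, so something like &bogus; would slip through.

```python
import re

# An ampersand is left alone when it starts a named (&amp;), decimal (&#169;)
# or hexadecimal (&#xA9;) character reference; anything else gets escaped.
UNESCAPED_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_stray_ampersands(text):
    """Replace every ampersand that is not part of a character reference."""
    return UNESCAPED_AMP.sub("&amp;", text)

sample = '<a href="story.php?id=7&page=2">Tom & Jerry &amp; friends &#169;</a>'
print(escape_stray_ampersands(sample))
```

Note that the ampersand inside the link's query string needs escaping too, which is the case that catches most people out.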

The biggest challenge for me was putting in correct XHTML code for my embedded videos. The classic "embed" tag is actually not valid HTML, but an artifact from Netscape. Its wide use meant that most browsers could interpret it, and it is also almost certain that search engine spiders simply ignored it without any ill effects. More than that, though, the use of the embed tag was legally problematic because of the lawsuit between Microsoft and Eolas. Notice how you have to click on the control to activate it? That was part of a kludge to try to satisfy Eolas, which held the patent on embedded controls:

Interactive controls are ActiveX controls that provide user interfaces. When a web page uses the APPLET, EMBED, or OBJECT elements to load an ActiveX control, the control's user interface is blocked until the user activates it. If a page uses these elements to load multiple controls, each interactive control must be individually activated.

This control-by-control click is there for legal reasons, not technical ones.

Frankly, I didn't care all that much, but it was always frustrating to have these ugly HTML validation errors.

So I've swapped out the old "embed" video control on my page with one that uses a JavaScript program called SWFObject. Now the entire page validates without any trouble. I've also checked the last twenty or so entries to confirm that each one validates individually.

I doubt I'll leap up the Google rankings, but at least it's one less thing to be wondering about.

But there is one more reason to validate well -- I'm thinking more and more about readers who are not reading my content through a traditional browser. Mini-browsers are popping up everywhere -- on cell phones and PDAs and Blackberries. The CPU constraints on these devices mean that they are more like search engine bots than traditional browsers in their lack of tolerance for HTML errors. A bad HTML page that looks fine under Internet Explorer or Firefox on my PC might look like gibberish on a Blackberry. By trying to maintain 100% HTML validation, I hope that my pages will be legible on any browser on the market now, or in the future.



Comments

I didn't know blogging was a competitive sport! When will it be an Olympic sport? (OOPS. I guess I better send the I.O.C. some money, I didn't get permission for using the name, Olympic)

Posted by: Dave at March 24, 2007 06:37 PM



Technorati rankings, the Ecosystem, Blogshares -- oh yeah, it's competitive!

Posted by: Steve Janke at March 24, 2007 07:13 PM



If you use Firefox when you're developing, there's a terrific extension at http://users.skynet.be/mgueury/mozilla/ for keeping HTML clean. Any errors in your code are highlighted and easily corrected.

Posted by: Darrell at March 26, 2007 08:48 AM