Rants, rambles, news and notes from another geek

Cleaning Up Badly Generated Html

After reading this post by The .NET Guy, I realized that many people out there aren’t using HTML Tidy to clean up nasty HTML. Outlook and Word are notoriously bad HTML generators, and if NewsGator is using the built-in editors, no one should be surprised at the quality of the markup.

I use TidyATL, which is a COM port of the original Tidy code maintained by Charlie Reitzel. He also has some .NET wrappers available on his site, but since they are just wrappers around the ATL library anyway, I just let VS.NET do the Interop for me.

Using TidyATL to clean up bad markup is a breeze. It will even convert your markup to XHTML if you want, and it is spectacular at removing bad MS-Word markup. Here is a sample:

public string TidyDocument( string html )
{
    Tidy.Document doc = new Tidy.DocumentClass();
    doc.ParseString( html );
    doc.SetOptBool( TidyATL.TidyOptionId.TidyWord2000, 1 );
    doc.SetOptBool( TidyATL.TidyOptionId.TidyBodyOnly, 1 );
    return doc.SaveString();
}

In this example I am telling Tidy that I want it to clean up Word markup and that I want it to produce only the body content. If I didn’t provide that last option, it would have created the required <html>, <head>, and <body> elements.
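For the XHTML conversion mentioned earlier, a variant might look like the sketch below. It reuses only the calls from the sample above. The one new name, TidyXhtmlOut, is an assumption: it is the option id HTML Tidy itself uses for XHTML output, and I am assuming the interop enum exposes it under the same name. TidyDocumentAsXhtml is just a made-up method name for illustration.

```csharp
// Sketch only: requires a COM reference to TidyATL, as in the sample above.
public string TidyDocumentAsXhtml( string html )
{
    Tidy.Document doc = new Tidy.DocumentClass();
    doc.ParseString( html );

    // Same Word cleanup as in the original sample.
    doc.SetOptBool( TidyATL.TidyOptionId.TidyWord2000, 1 );

    // ASSUMPTION: TidyXhtmlOut is the enum member for XHTML output,
    // mirroring the option id of the same name in HTML Tidy.
    doc.SetOptBool( TidyATL.TidyOptionId.TidyXhtmlOut, 1 );

    // TidyBodyOnly is deliberately not set here, so the result should be
    // a complete document with <html>, <head>, and <body> elements.
    return doc.SaveString();
}
```

If the enum member turns out to have a different name in your copy of the interop assembly, the object browser in VS.NET will show the full list of option ids.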

As you can see, cleaning up bad HTML is easy. I don’t know why more people aren’t doing it.