My last post talked about XML repositories and content management in terms of separating content from format. The idea is that XML content can easily be reformatted and reused. But in order to reuse content, you must first refind it (a word?). Thus another important feature of a CMS is structuring content so it can be searched in a variety of ways.
For a lot of people, search engines mean full-text search. And almost every web application we build includes a full-text search function. Full-text search typically means the ability to use complex algorithms to find words and phrases within the body of the content we are looking for.
We also look toward systems like Ektron, which provide facilities for extensible content management: creating structures and metadata (data about data) that store contextual or categorical information in addition to the body of the content. This provides additional fuel for the search engine to burn to find what you are looking for. This is what we call Findability.
For a different view on how structured metadata can assist in the complex world of searching enormous digital data stores and finding answers to complex questions, take a look at this month’s Technology Review. Editor Jason Pontin (@jason_pontin) discusses this topic in his article called "On Answers, Four kinds of search engines."
Content metadata is also enhanced by user participation. I’ve talked before about tagging and user-generated content (UGC) and mentioned two excellent books on the subject, “The Long Tail” and “Everything is Miscellaneous.” Systems that support the addition of UGC metadata add meaning to content and allow systems to grow organically over time.
As an example of the ways we use systems to enhance findability, here are screen shots from a recent web-to-print solution. The problem to be solved in this case is how a large number of users can find the assets they need out of hundreds of templates and thousands of images. We used Ektron CMS 400.Net as an extensible XML-based repository, content management and search platform to help.

The first method of finding content is to allow users to select from complex categories of metadata assigned to each item. The meta-select function returns assets that match the criteria specified. The metadata assignments cover the departments, uses or product categories that a typical user is most likely to look for.
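
As a rough illustration of a metadata select like this one, here is a minimal Python sketch. It is not Ektron's API: the `Asset` class, the `meta_select` function, the field names and the sample assets are all hypothetical, and it assumes an asset matches only when it carries every value the user picked.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A hypothetical asset record: a name plus categorical metadata."""
    name: str
    # Maps a metadata field (e.g. "department") to the set of values assigned.
    metadata: dict = field(default_factory=dict)


def meta_select(assets, criteria):
    """Return the assets whose metadata matches every criterion.

    `criteria` maps a metadata field to the single value the user selected.
    An empty `criteria` matches every asset.
    """
    return [
        asset for asset in assets
        if all(value in asset.metadata.get(key, set())
               for key, value in criteria.items())
    ]


# Made-up sample data for illustration only.
assets = [
    Asset("Spring brochure", {"department": {"Marketing"}, "use": {"Print"}}),
    Asset("Price sheet", {"department": {"Sales"}, "use": {"Print", "Web"}}),
    Asset("Banner ad", {"department": {"Marketing"}, "use": {"Web"}}),
]

matches = meta_select(assets, {"department": "Marketing", "use": "Print"})
print([a.name for a in matches])
```

Requiring every criterion to match narrows results quickly, which suits a library of thousands of images; a looser "match any" select would be a one-word change from `all` to `any`.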

The second method is a flexible categorical assignment select in the form of a tag cloud. Administrators and users can associate assets with keywords that make sense to them. The tag cloud shows users a large number of keywords and the relative number of assets that have been associated with them. Over time, these assignments get richer and richer, allowing the system to dynamically improve.
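
The counting behind a tag cloud can be sketched in a few lines of Python. This is an illustration only, not how Ektron builds its cloud: the `tag_cloud` function, the linear 1-to-5 size scale and the sample tags are assumptions.

```python
from collections import Counter


def tag_cloud(asset_tags, min_size=1, max_size=5):
    """Map each keyword to a (number of assets, display size) pair.

    `asset_tags` maps an asset name to the keywords attached to it.
    Display sizes are scaled linearly between min_size and max_size.
    """
    # Count each keyword once per asset, even if it was attached twice.
    counts = Counter(tag for tags in asset_tags.values() for tag in set(tags))
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    spread = high - low or 1  # avoid dividing by zero when all counts are equal
    return {
        tag: (count, min_size + round((count - low) * (max_size - min_size) / spread))
        for tag, count in counts.items()
    }


# Made-up sample data for illustration only.
asset_tags = {
    "Spring brochure": ["seasonal", "brochure"],
    "Fall brochure": ["seasonal", "brochure"],
    "Price sheet": ["pricing"],
    "Holiday mailer": ["seasonal"],
}

cloud = tag_cloud(asset_tags)
for tag, (count, size) in sorted(cloud.items()):
    print(tag, count, size)
```

Because the sizes are recomputed from the current counts, every new tag a user adds shifts the cloud, which is the organic growth described above.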

The third method is traditional full-text search. The engine compares search phrases to asset names, descriptions, metadata and taxonomy categories. Items are returned ranked by how closely they match the phrase entered.
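
To make the idea of weighted matching concrete, here is a deliberately simple Python sketch. It is not the algorithm the search engine actually uses: it just counts matching words per field and multiplies by a weight. The field weights, function names and sample assets are all assumptions.

```python
import re

# Hypothetical weights: a hit in the name counts more than one in the description.
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0, "metadata": 2.0, "taxonomy": 2.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(asset, phrase):
    """Sum the weighted number of phrase words found in each field."""
    terms = set(tokenize(phrase))
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        words = tokenize(asset.get(field_name, ""))
        total += weight * sum(1 for word in words if word in terms)
    return total


def search(assets, phrase):
    """Return the names of matching assets, best score first."""
    scored = [(score(asset, phrase), asset) for asset in assets]
    hits = [(s, asset) for s, asset in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [asset["name"] for _, asset in hits]


# Made-up sample data for illustration only.
assets = [
    {"name": "Spring brochure", "description": "Tri-fold brochure for the spring sale",
     "metadata": "Marketing Print", "taxonomy": "Brochures"},
    {"name": "Price sheet", "description": "Spring price list",
     "metadata": "Sales Print", "taxonomy": "Pricing"},
    {"name": "Banner ad", "description": "Web banner",
     "metadata": "Marketing Web", "taxonomy": "Ads"},
]

print(search(assets, "spring brochure"))
```

A real engine would also handle stemming (so "brochures" matches "brochure"), phrase proximity and term rarity; this sketch only shows why an asset whose name matches outranks one that mentions the word in passing.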

The final method is a list of favorites that individual users have previously selected. This allows the user to select most-used assets in the quickest possible way. This is similar to bookmarking web pages in a browser.
For internal applications, these techniques can be controlled to achieve our desired result. But what does a CMS do for external systems that we cannot control? The structured content managed by a CMS can be used to generate structured XHTML and CSS that enhance the ability of external search engines to find meaning or semantics in a web page.
We store meaning in addition to raw content in the CMS. For managed XML documents, there is structure in the form of tags that turn text into data. For instance, in our site we store contact information for news releases as structured content. When the page is generated, we construct the HTML with specific CSS class names:
<div class="adr">
<div class="street-address">134 N Main St, Suite 101</div>
<span class="locality">Rockford</span>,
<span class="region">IL</span>
<span class="postal-code">61101</span>
</div>
<div class="tel">866.799.2879</div>
This is known as microformatting. Adding microformats to web content is a big part of creating the semantic web and enhancing content for Search Engine Optimization (SEO). By the use of standard structures in content, search engines can begin to understand meaning and relationships that humans interpret naturally.
All of the major search engines are supporting microformats. Google, for instance, uses them to support Rich Snippets. Yahoo supports microformats in SearchMonkey via Common Tag. And Microsoft Bing encourages structured data use for SEO. Trekk uses structured content to support many of the SEO best practices, from microformats to HTML metadata, to enhanced titles and XML sitemaps.
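
To show why these class names matter to a machine, here is a small Python sketch that reads the contact markup shown above back into structured data using only the standard library. It illustrates the principle and is not how any search engine actually parses microformats; the `ContactParser` class and the `FIELDS` set are assumptions, and the parser only handles flat markup like this example.

```python
from html.parser import HTMLParser

# The contact markup from the example above.
MARKUP = """
<div class="adr">
<div class="street-address">134 N Main St, Suite 101</div>
<span class="locality">Rockford</span>,
<span class="region">IL</span>
<span class="postal-code">61101</span>
</div>
<div class="tel">866.799.2879</div>
"""

# Class names this sketch treats as data fields.
FIELDS = {"street-address", "locality", "region", "postal-code", "tel"}


class ContactParser(HTMLParser):
    """Collect the text of elements whose class is one of FIELDS."""

    def __init__(self):
        super().__init__()
        self.current = None  # field currently being read, if any
        self.contact = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in FIELDS:
                self.current = name

    def handle_endtag(self, tag):
        # Any closing tag ends the field being read.
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.contact[self.current] = data.strip()


parser = ContactParser()
parser.feed(MARKUP)
parser.close()
contact = parser.contact
print(contact)
```

Without the class names, the same page is just a run of text; with them, a program can tell the city from the postal code without guessing.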
The best way to check how well your content is tagged and indexed is to search the three major engines. BlindSearch (http://blindsearch.fejus.com) allows you to enter a search phrase and it will compare Google, Bing and Yahoo results without identification. It also allows you to be the judge and vote on which of the three works best for you. At last check of overall voting, Google was in the lead with 44%, Bing was in second with 32% and Yahoo was last at 24%.
As more and more of the world's information moves online, structured content is essential. Trekk’s use of CMS to enhance findability and search is a crucial part of our arsenal. So how is your findability?
Posted by JA Stewart at 07/12/2009 01:30:35 PM