
2.12.2005

Finding a Needle in a Stack of Needles: The “Googlization” of the World Wide Web 

Once upon a time, most people were not particularly interested in access to information. Those who were knew where to find it, and those who weren't didn't bother much about it. With the dawn of the Information Age, this began to change. Suddenly, information was readily available to anyone who wanted it, simply by getting on the Web and running a search. As the value of data increased, so did the sophistication of the methods designed to make it easier to find.

Search engines started out as mostly human-curated databases, meaning people determined the relevancy and timeliness of the web sites linked to search results. As the Web multiplied, this practice became impractical; no human being could hope to properly sort the available information, which has continued to grow exponentially. This led to the use of automated programs, often called "bots" or "agents," to sift and sort the various search categories and criteria. HTML, the language used to write web pages, included a way to hand search engines the needed data quickly in the form of "meta-tags." Simply put, a web page author would include two pieces of information at the top of a page: a set of keywords and a short description of the page's subject. By tuning search programs to look for these keywords and descriptions, it was easier to determine whether the search criteria matched the result criteria. This worked for a time, and meta-tag-driven search engines flourished. This is about when Google first appeared on the scene.
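
For the curious, here is a rough sketch (in Python, and not any particular engine's actual code) of how a meta-tag matcher might work: pull the keywords and description out of a page and score it against the user's query terms.

    # Toy meta-tag matcher -- an illustration only, not any real engine's code.
    from html.parser import HTMLParser

    class MetaTagParser(HTMLParser):
        """Collects the content of <meta name="keywords"> and <meta name="description">."""
        def __init__(self):
            super().__init__()
            self.keywords = []
            self.description = ""

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            name = attrs.get("name", "").lower()
            content = attrs.get("content", "")
            if name == "keywords":
                self.keywords = [k.strip().lower() for k in content.split(",")]
            elif name == "description":
                self.description = content.lower()

    def score_page(html_text, query):
        """Count how many of the query terms appear in the page's meta-tags."""
        parser = MetaTagParser()
        parser.feed(html_text)
        terms = query.lower().split()
        return sum(1 for t in terms
                   if t in parser.keywords or t in parser.description)

    page = '''<html><head>
    <meta name="keywords" content="search, relevance, meta-tags">
    <meta name="description" content="notes on web search relevance">
    </head><body>...</body></html>'''

    print(score_page(page, "search relevance"))   # prints 2

The obvious weakness, of course, is that the page author controls the tags, which is exactly what made this scheme so easy to abuse.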

Google was different from the human-curated or meta-tag-driven engines; the method it uses to produce results rests on a set of algorithms that remain a closely guarded secret to this day. One of the most striking differences between Google and other search engines is that its ranking is based more on popularity than on relevance. Whatever the reasons, Google has rapidly become the search engine of choice for the vast majority of people using the Web. The reliance on popularity has had the effect of skewing results toward sites that are visited regularly rather than sites that may hold real value for the searcher. As a result, sites of commercial, scholarly, and personal value are missed, while the most popularly accessed sites are presented first. Google's patented ranking methodology has spawned a blooming industry of methods designed to "hack" or fool Google into ranking certain websites higher than others (it's not nice to fool with Mother-Google). Google has responded to these attempts by removing any website whose designers resort to trickery to promote themselves. This "Google Effect" on certain websites has helped create another phenomenon discussed in an earlier post to this blog: a totally submerged World Wide Web underneath the one we interact with every day, often referred to as the Deep Web. The Deep Web consists of sites that might have intrinsic value but aren't included in Google's search criteria or ranking system. Many of the sites ignored by search engines lack the meta-tagging that let other engines find them, and thus never enter Google's ranking system; others simply never reach the threshold of popularity needed to be counted.
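
Google's actual algorithm is secret, but the popularity idea itself is public in the original PageRank paper. The toy sketch below (again Python, again only an illustration, with a made-up four-page "web") shows how link popularity alone decides the ordering, regardless of what the searcher actually needs.

    # A toy popularity ranking in the spirit of the published PageRank idea.
    # NOT Google's actual algorithm; it only shows that link popularity,
    # not content relevance, drives the order of results.

    def popularity_rank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:        # dangling page: spread its weight evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
            rank = new_rank
        return rank

    # Tiny hypothetical web: the obscure-but-valuable page gets almost no links.
    web = {
        "popular-portal": ["news", "shop"],
        "news": ["popular-portal"],
        "shop": ["popular-portal"],
        "obscure-but-valuable": ["popular-portal"],
    }
    for page, score in sorted(popularity_rank(web).items(), key=lambda kv: -kv[1]):
        print(f"{score:.3f}  {page}")

Run it and the heavily linked portal floats to the top while the obscure page sinks, which is the "Google Effect" in miniature.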

As a student of information (who isn't?), I have found myself using Google less and less because its results have become clogged with irrelevant, duplicate, and obsolete sources. I am certain this observation is nothing new, and considering the many new features and options available via Google Scholar and the digitization of some of the better-known university libraries in the United States, I know this is an issue of concern for your company. Google occupies a unique position on the World Wide Web as the primary "Gopher hole" of its day, and should it fail to keep innovating and adapting to the needs of the times, it could end up as much of an anachronism as the University of Minnesota's once-famed portal has become. Having used Google and the beta of Google Scholar for a good portion of my graduate work, I can say that someone with experience in Boolean logic and a knack for sorting information can find valuable research data in both. However, I am not the "norm," nor should the bulk of Google's marketing be aimed at people like me; one would assume your resources would be focused on providing search services for the center of the bell curve rather than its ends.

My greatest concern with using Google is that, even with the advanced search options, it is difficult to narrow the criteria enough to prevent a large number of "false positive" hits. Many of these hits are dated, or they contain irrelevant information that merely happens to include text resembling the search terms. Limiting the search to the last three months does not prevent far older sites from being presented. Another issue is that the language options offer no way to separate US English from the English of the UK or the Commonwealth; there are distinct differences in language, economy, and culture that often make Anglo-English sites less useful than American-English ones. It seems the value of meta-tag searching has reached a limit it cannot overcome. There is also the issue of the "deep web": sites that have been completely submerged beneath the many layers of popularity.
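
As a stopgap, one can post-filter results on the client side. Here is a small Python sketch of that idea (the URLs are hypothetical, and this is a workaround, not a fix for the engine): ask each result's server when the page was last modified and drop anything older than the window. Tellingly, many servers omit the Last-Modified header altogether, which is part of why engine-side date filters are so unreliable.

    # Client-side freshness check -- an illustration of a workaround, nothing more.
    from urllib.request import Request, urlopen
    from email.utils import parsedate_to_datetime
    from datetime import datetime, timedelta, timezone

    def is_recent(url, max_age_days=90):
        """True if the server reports a Last-Modified date within the window.
        If the header is missing or the request fails, we cannot tell, so keep the hit."""
        try:
            response = urlopen(Request(url, method="HEAD"), timeout=10)
            last_modified = response.headers.get("Last-Modified")
        except OSError:
            return True
        if not last_modified:
            return True
        modified = parsedate_to_datetime(last_modified)
        cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
        return modified >= cutoff

    # Hypothetical list of result URLs returned by a search:
    results = ["http://example.com/old-page.html", "http://example.com/new-page.html"]
    fresh = [u for u in results if is_recent(u)]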

Google Scholar is an interesting concept in theory; however, it seems to fall short in application. While I have yet to do a more quantitative study on the matter, I can say with confidence that a large share of the hits on any given topic sit behind some proprietary information portal that requires the user to be a paying member. As my research usually involves technology, computers, and information systems, I see the ACM and IEEE sites most often. While I do have access to such portals by other means, again, I am not part of the "norm." For more "mainstream" scholars, who are more than likely undergraduates without institutional access to many resources, this seems a serious limitation of Google Scholar. One of the things that would seem crucial in a world where information travels at the speed of light is access to that information that is as open as possible. Unavailable information is the breeding ground of ignorance.

My thinking on a solution is an expansion of the meta-tagging idea: perhaps a sort of "manifest," similar to the one compiled with .NET programs, that provides extremely succinct query information which could then serve as the basis for building a temporary relational database around the user's search criteria. The whole thing could be run much like Microsoft Access, and perhaps even be stored on the user's own computer rather than elsewhere. The W3C, the international consortium on all things Web, is working on something along these lines with its XML Query project; however, not even this will solve the entire problem. The problem of providing information relevancy on the Web seems to be two-fold. First, the languages used to write web pages need a standard for query criteria that is flexible and adaptive to change; a sketch of what that might enable follows below. Thanks in great part to Google, meta-tagging is rapidly becoming an obsolete method of sorting search results, and mere subjects and keywords cannot hope to narrow results down enough to produce succinct output. A newer, standardized method must be envisioned, and it must be embraced by the search engine community as a whole. Second, creating a standard query method is not enough; those who publish their work to the Web must be willing to use the standard query language for it to be useful. That means it must be succinct, easy to work with (not just for programmers), and able to provide the context for more relevant search criteria. This seems like a tall order, but I am convinced it must be considered for search engine optimization to become a reality. A key component of such a search engine must be a more sophisticated discrimination scheme, one that matches search criteria with result criteria far more precisely than is currently possible.
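
To make the manifest idea concrete, here is a minimal Python sketch with sqlite3 standing in for an Access-style database. The manifest fields (subject, keywords, audience, updated) are my own invention for illustration; any real standard would have to come out of the W3C and the search engine community. The point is only that once pages publish structured manifests, a search becomes an ordinary relational query built from the user's criteria rather than a popularity contest.

    # A minimal sketch of the "manifest -> temporary relational database" idea.
    # Field names are invented for illustration, not an existing standard.
    import sqlite3

    # Hypothetical manifests published alongside web pages.
    manifests = [
        {"url": "http://example.edu/search-theory", "subject": "information retrieval",
         "keywords": "relevance,ranking,query", "audience": "scholarly", "updated": "2005-01-20"},
        {"url": "http://example.com/buy-widgets", "subject": "widgets",
         "keywords": "shopping,widgets,sale", "audience": "commercial", "updated": "2004-06-01"},
    ]

    # Build a temporary relational database from the manifests -- in memory,
    # so it could just as easily live on the user's own computer.
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE pages
                  (url TEXT, subject TEXT, keywords TEXT, audience TEXT, updated TEXT)""")
    db.executemany("INSERT INTO pages VALUES (:url, :subject, :keywords, :audience, :updated)",
                   manifests)

    # The user's criteria become an ordinary relational query:
    # scholarly pages about relevance, updated this year.
    rows = db.execute("""SELECT url, subject FROM pages
                         WHERE audience = 'scholarly'
                           AND keywords LIKE '%relevance%'
                           AND updated >= '2005-01-01'""").fetchall()
    for url, subject in rows:
        print(url, "-", subject)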

TANSTAAFL!



©2005, J.S.Brown


