Why Censorware Can't Work

by Michael Sims
April 6, 1998

This is a longer, and hopefully complete, version of an essay which has undergone a few changes over time. I am attempting to explore the minimum requirements that an "honest" censorware company would have to meet in order to fulfill its marketing claims. My thesis is that the money required - in hardware, bandwidth, and human salaries - is too great for any product sold at $29.95 to be able to effectively review the internet.

In December 1997, I participated in an effort to evaluate the accuracy of one commercially available censorware product, CyberPatrol. In our report, we showed that CyberPatrol, which has received good press for accuracy, is grossly overbroad. We showed that CyberPatrol blocked well over a million individual user homepages under their "Full Nudity" and "Sexual Acts" categories. Imagine if you had written your Great American Novel, and found it banned from public libraries because it contained full frontal nudity and graphic description of sexual acts. But it didn't. In fact, it was about biochemistry, or pet grooming, or Nike shoes, or mathematics, or mirrors, or the Army Corps of Engineers, or any of the many other sites we found banned for containing full frontal nudity, when they never had and never would. And this gross error couldn't be corrected. The library told you they subscribed to a list, managed by a private company, which even they couldn't look at, which determined what books they would make available, and it seems that your novel wasn't on it.[1] Every morning, the company employees came in with two big black bags. They put some books on the shelves from the one bag, and removed unknown books from the shelves and took them away in the other, and wouldn't respond to any queries about why YOUR book was called "pornography" and had been taken off the shelves.

You would be fighting mad in short order, I imagine.

This is precisely the situation with implementing censorware in libraries today. CyberPatrol's accuracy can best be called abysmal. If you defined "accuracy" as

"number of correctly banned web pages divided by total number of webpages banned"

then CyberPatrol is rather less than 1% accurate - if you come across a banned webpage, the odds are much less than 1% that it actually contains any "pornography" by any definition. But you won't be able to evaluate that for yourself - since you'll be prevented from viewing the site at all - so there's no way for the average user to remedy the inherent flaws of companies which try to categorize hundreds of millions of webpages using a staff of a dozen part-time, minimum-wage employees.
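The definition above is what search-engine people would call precision. As a minimal sketch, it can be written out in a few lines of Python. The function name and the counts below are hypothetical, chosen purely for illustration; they are not figures from our report.

```python
def blacklist_accuracy(correctly_banned: int, total_banned: int) -> float:
    """Fraction of banned pages that actually deserve the ban."""
    if total_banned <= 0:
        raise ValueError("total_banned must be positive")
    return correctly_banned / total_banned

# Hypothetical counts, for illustration only (not data from the report).
example = blacklist_accuracy(correctly_banned=5_000, total_banned=1_000_000)
print(f"{example:.2%}")  # 0.50%
```

Note that this measure only looks at what is on the blacklist. It says nothing about the pages the blacklist never saw, which is a separate problem.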

Companies rely upon computers to determine what should and shouldn't be blocked. The vast size and rates of both growth and change of internet websites make human evaluation impossible - if a company used humans to evaluate all banned sites, it would have to employ THOUSANDS of people, and the software would consequently cost thousands of dollars per copy. The search website Yahoo.com has graphically demonstrated this. Yahoo employs fewer than fifty people to maintain its directory of websites. They don't have to search the internet for new sites - all they have to do is categorize submissions from people who want to be listed in the directory (who must request a specific category to be placed in to begin with). Yet Yahoo is deluged. They are swamped with requests to be put in the directory. Most submissions are discarded without action, since the number of submissions so far outstrips the abilities of this group of people to categorize them.

Imagine now that you had to search the internet to find all potential sites. The problem becomes hundreds of times worse. These companies employ only a few humans and some computer search tools to automate the process of scanning the WWW for potentially offensive sites. Many sites get added to the blacklist without ever being seen by a human. For example, a page discussing Windows Emulation software was banned under CyberPatrol's category for alcohol. Why? Because the acronym used for the software was WINE. No human ever viewed this page filled with densely packed text about obscure software, despite CyberPatrol's obviously false claim that humans see all banned sites. A computer "viewed" it, a computer banned it.

Let us look now at the case of the AltaVista Search Engine versus the World Wide Web.

A recent study[2] reported in the prestigious journal Science came to the conclusion that there are roughly 320 million separate web pages - 320 million unique HTML files - available on the World Wide Web. They estimate this comes to some 15 billion words of information available over the web. For comparison, the Library of Congress, which has been collecting books since 1800, possesses 17 million books and 95 million other works.

But the web is different from the Library of Congress. Web pages are easier to catalog - feed them through a computer database indexer, and you can quickly create a full-text searchable index of all the pages you can suck through your system. But on the other hand, it's harder to find pages to feed, because they're not confined to your library storage archives; indeed, they are scattered throughout the world with no easy way to locate them all, and by the time you're finished looking through them, a great many of them have changed, moved, or been deleted, and new ones have been created. This is why the Census Bureau employs a great many people during a very short period to attempt to take a snapshot of the American populace.

Taking a snapshot of the WWW doesn't require survey-takers, it requires computing power and bandwidth.

AltaVista is one of the fastest search engines, perhaps the fastest. Created as a showcase for Digital Equipment Corporation's high-end line of superfast computers, the search engine processes a mammoth amount of data daily to create an up-to-date, searchable index of web documents.

To do this, it takes money, and hardware, and bandwidth. AltaVista is connected to the rest of the net by links totalling 25 gigabits/second of bandwidth. For comparison, this is enough bandwidth to saturate roughly 870,000 phone lines running 28.8 kilobit/second modems. If downloading that new game demo or the latest version of Netscape took two hours with your 28.8 kilobit/second modem, it would take .008 seconds for the AltaVista search engine to download that same file.[3]
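The download comparison is simple division, and a rough sketch makes it easy to check. The code below assumes the modem runs at 28.8 kilobits per second and reads AltaVista's figure as 25 gigabits per second. Those units are my assumption, but the answer is the same as long as both rates are measured in the same unit, since only their ratio matters. The constant and function names are made up for this example.

```python
MODEM_BPS = 28.8e3     # 28.8 kilobit/second modem
ALTAVISTA_BPS = 25e9   # the essay's figure, read as 25 gigabits/second (assumption)

def transfer_seconds(size_bits: float, rate_bps: float) -> float:
    """Time to move size_bits over a link running at rate_bps."""
    return size_bits / rate_bps

# A file that takes two hours to download over the modem.
file_bits = MODEM_BPS * 2 * 60 * 60
altavista_seconds = transfer_seconds(file_bits, ALTAVISTA_BPS)

print(f"file size: {file_bits / 8 / 1e6:.1f} MB")                # 25.9 MB
print(f"over AltaVista's links: {altavista_seconds:.4f} s")      # 0.0083 s
```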

AltaVista runs on a tremendous amount of hardware, as well. Currently 28 of Digital's top-of-the-line AlphaServer 8400 5/440's form the backbone of the search engine[4], with several other computers for various specialized tasks. The total value of the hardware that runs AltaVista is something over $35,000,000.[5]

Yet with all this hardware, all this bandwidth, and no human interaction in the indexing of webpages (adding a new page takes a tiny fraction of a second), AltaVista still can't index the entire web. According to the _Science_ study, they have about 28% of the web in their database. Three quarters of all web pages are NOT in the AltaVista database.

Let's compare this achievement to the claims of censorware makers. All of them claim to have scanned more or less the entire web for naughty words, homosexual references, and anything else they don't want you to see. Not only that, which would be remarkable, but they claim that each page is reviewed by a human before being added to the blacklist.

CyberPatrol, for instance, employs about a dozen people to review sites for their list.[6] Other censorware makers seem to have even fewer, relying (though they deny it) on a computerized search to determine which pages are fit or unfit for human perusal.

But how effective can their blocking be? AltaVista covers a quarter of the web. The most complete (and newest) of the search engines studied is HotBot, covering some 34% of the web. Yet somehow, running some home-grown software on a few Pentiums connected to the net by a T-1 line at best - a T-1 connection is .0062% of the speed of AltaVista's connection to the net - these companies assert that they index the ENTIRE web. No censorware company has more than a tiny fraction of AltaVista's bandwidth. No censorware company has more than a tiny fraction of the hardware, or technical expertise, of the people at Digital Equipment Corporation. And supposedly these pages are reviewed by a human, taking a minute or more instead of a split second?
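For anyone who wants to check the .0062% figure, here is a minimal sketch. It uses the standard T-1 line rate of 1.544 megabits per second and assumes AltaVista's links total 25 gigabits per second, which is the reading of the bandwidth figure that reproduces the percentage. The variable names are illustrative only.

```python
T1_BPS = 1.544e6       # standard T-1 line rate, in bits/second
ALTAVISTA_BPS = 25e9   # assumed: AltaVista's links read as 25 gigabits/second

# Share of AltaVista's bandwidth that a single T-1 line provides.
t1_share = T1_BPS / ALTAVISTA_BPS
# Number of T-1 lines it would take to match AltaVista's links.
t1_lines_needed = ALTAVISTA_BPS / T1_BPS

print(f"{t1_share:.4%}")          # 0.0062%
print(f"{t1_lines_needed:,.0f}")  # 16,192
```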

Don't delude yourself.

Not only are the criteria for adding pages to the blacklists arbitrary and vague, the companies simply cannot even *view* any significant portion of the web. It is mathematically impossible. A T-1 line sucking data full-time 365 days a year does not take in enough data to index the entire web, at its current size and rate of change, yet the claim is still made, with a straight face.

It doesn't take more than basic math to determine that their claims of human review cannot even approach the truth. They can't even do a *computerized review* of the whole web; they simply haven't got the hardware, bandwidth, or processing power for it. And even if they tried to do so, how could they do better than the major search engines, which have *at best* indexed 34% of the Web?
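Here is that basic math as a rough sketch. The page count, the one minute per page, and the dozen reviewers come from the figures quoted earlier in this essay. The 2,000 working hours per year (50 weeks of 40 hours) is my own assumption, and a generous one, since it treats every reviewer as full-time. The sketch also covers only a single pass over a web that stands still, with every page looked at once. Reviewing only keyword-flagged pages would shrink the numbers by whatever fraction gets flagged, which nobody publishes.

```python
WEB_PAGES = 320_000_000      # page count from the Science study
MINUTES_PER_PAGE = 1         # "a minute or more" per human review
REVIEWERS = 12               # "about a dozen people"
WORK_HOURS_PER_YEAR = 2_000  # assumption: 50 weeks x 40 hours, full-time

def review_years(pages: int, minutes_per_page: float, reviewers: int,
                 hours_per_year: float = WORK_HOURS_PER_YEAR) -> float:
    """Years for `reviewers` people to look at every page exactly once."""
    person_hours = pages * minutes_per_page / 60
    return person_hours / hours_per_year / reviewers

years_for_dozen = review_years(WEB_PAGES, MINUTES_PER_PAGE, REVIEWERS)
# Person-years of work, i.e. the staff needed to finish in a single year.
staff_for_one_year = review_years(WEB_PAGES, MINUTES_PER_PAGE, 1)

print(f"{years_for_dozen:.0f} years for a dozen reviewers")          # 222
print(f"{staff_for_one_year:,.0f} reviewers to finish in one year")  # 2,667
```

Even under these friendly assumptions, a dozen people would need more than two centuries for one pass, and a one-year pass would take a staff in the thousands.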

The American Library Association said, in a statement to a Senate committee:

"When a library installs commercial filters or blocking software, it transfers the professional judgement about the information needs of the community from the librarian to anonymous third parties - often part time workers with no credentials and no ties to the community - who evaluate sites for the software manufacturer."

They are wrong. More correctly, the librarian transfers their judgement to a computer program, which may or may not be superficially overseen by a few humans and which cannot even approach "completeness". This is no substitute for a human with a Master's in Library Science. Yet people trust these programs to do what they say they do, block all the "evil" material and let through all the "good". The companies haven't even viewed most of the web! How can they possibly block it?

Censorware companies have tried to extol the virtues of their squad of dedicated porn-reviewers, slogging through the day looking at disgusting porn sites to keep you safe. The reality is that the only way to take an effective look at the Web is with computers, taking in a tremendous amount of data from the constantly-changing web and constantly updating a database of tremendous size. These techniques still only gather a quarter to a third of the entire Web. Bring human review into the equation and the impossible claims of censorware companies become still more ludicrous.

Parents, libraries, and companies should all recognize censorware for what it is: snake oil. A product sold more to soothe the mental state of the purchaser than to achieve any real effect. Those demonstrations which purport to "show" that Product X blocks 99.44% of the porn? Completely staged. If a censorware company did as well in reviewing the Web as AltaVista (hah!), three-quarters of all the "evil" pages out there would never have been seen by them, could NOT have been banned, and will not be stopped. These facts can't be overcome by more ludicrous claims - "OUR new program technique lets us index the entire Web in only 2 seconds, so we don't have those problems!" - but censorware makers will continue to make them until confronted with cold hard facts.

If you're ever speaking with a censorware company representative, ask them what kind of hardware they're running to spider the web. $35,000,000 worth? (Where'd they get that much venture capital? Suuuure.....) Ask them about their connection to the net. A T-1 line? That won't get far. .0062% of AltaVista's connection speed, remember. Ask them how many people they employ to check sites. A dozen? A tiny fraction of what would be needed.

Don't be misled by snake oil vendors. At an AltaVista-level of expenditure, you could properly computer-index a quarter of the Web. I have no idea what it would cost to make a try for 99+% indexing, but it would be much, much more than Digital has spent on AltaVista. If they wanted to have humans review all the webpages found by searching on keywords, you'd see advertisements in your local paper just like the ones the Census Bureau runs every ten years - thousands and thousands of workers needed. Except this would be an ongoing thing - they'd need all those workers all the time. When you see that level of expenditure - and remember, the Web is growing at a rate of 30% per year, so each day that passes increases the expenditure needed - THEN you can say, "Here's a censorware company that has the minimum requirements needed to be able to index the Web and search out all the undesirable elements." Until then ... don't be suckered.

-- Michael Sims,
April 6, 1998

[1] Ironically, this is exactly what is happening in the workplace. On one webpage: http://www.intellectualcapital.com/issues/98/0101/icopinions1.asp , which contains an essay by Nadine Strossen called "Filtering out the Truth", the following comment was submitted by a reader:

12/31/97 Rick emactkd@aol.com
Recently I was blocked from a site at work. It was Project MAC, the mathmatics and Computation project at the AI lab at MIT. When I complained to our Net police, they said the site had nudity, full frontal. When pressed they said the software contractor for the service had determined this, and I was SOL. This type of censorship is hurting PROFESSIONALS, not just limiting access by children.

This website is one of the ones we pointed out was banned by CyberPatrol in our report. Someone at this person's workplace was trusting in CyberPatrol's description of the site as containing full frontal nudity, and obviously they were mistaken - but still unwilling or unable to make the difficult configuration changes needed to unban the site. The software says it's nudity, so it's nudity.

[2] Reported at: http://www.news.com/News/Item/0,4,20728,00.html , among other places.

[3] Assuming the originating computer can handle sending the file at that rate, of course.

[4] http://www.zdnet.com/pcweek/news/0406/06mic.html

[5] Retail. Estimated from retail prices of Digital's servers in standard configurations. Considering the extreme amount of disk space and memory used in the search engine (which isn't in the standard configurations and doesn't come cheap), this estimate of total hardware value may be substantially low.

[6] See: http://www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/1998/01/15/BU34919.DTL&type=printable . Microsystems, recently purchased by The Learning Company, is the maker of CyberPatrol.