IN THE UNITED STATES DISTRICT COURT
FOR THE EASTERN DISTRICT OF VIRGINIA
Alexandria Division

 

______________________________________________________  
                                                      )
MAINSTREAM LOUDOUN, LOREN                             )
KROPAT, MARY C. DUCHATEAU,                            )
and JOHN S. WHITE,                                    )
                                                      )
                      Plaintiffs,                     )
                                                      )
               v.                                     )        Case No. 97-2049-A
                                                      )
                                                      )
BOARD OF TRUSTEES OF THE                              )
LOUDOUN COUNTY LIBRARY                                )
                                                      )
                      Defendant.                      )
                                                      )
______________________________________________________)
                                                      )
THE SAFER SEX PAGE; BANNED                            )
BOOKS ON-LINE, owned and operated                     )
by JOHN OCKERBLOOM; AMERICAN                          )
ASSOCIATION OF UNIVERSITY                             )
WOMEN MARYLAND; ROB MORSE;                            )
BOOKS FOR GAY AND LESBIAN                             )
TEENS/YOUTH PAGE, owned and                           )
operated by JEREMY MEYERS;                            )
SERGIO ARAU; RENAISSANCE                              )
TRANSGENDER ASSOCIATION, and                          )
THE ETHICAL SPECTACLE,                                )
THE SAFER SEX PAGE, et al.,                           )
                                                      )
                     Plaintiff-Intervenors,           )
                                                      )
            v.                                        )
                                                      )
BOARD OF TRUSTEES OF THE                              )
LOUDOUN COUNTY LIBRARY                                )
                                                      )
                     Defendant.                       )
                                                      )
______________________________________________________)

 

 

EXPERT REPORT OF MICHAEL WELLES

INTRODUCTION

I was asked by the Plaintiff-Intervenors to write a report on the feasibility of using technological methods, either with or without human review, to create a system that blocks certain subject matter on the Internet that the blocker wishes to block without blocking subject matter that the blocker does not wish to block. In my view, it is not possible to achieve that result. The technological and human limitations on such an effort are such that it will inevitably be seriously flawed both as a result of significant overblocking (blocking of sites that are not intended to be blocked) and significant underblocking (failing to block sites that are intended to be blocked).

I was retained by counsel for the Plaintiff-Intervenors in this case on the basis of my experience and expertise with respect to computer functions. I am currently a senior system engineer in the Multimedia Department of Associated Press. A copy of my curriculum vitae is attached to this report. With the exception of my vitae, I do not currently plan to use any exhibits as part of my testimony. I am not charging a fee for this work.

Counsel supplied me with copies of the Plaintiffs' Complaint, the Complaint of the Plaintiff Intervenors, the Loudoun County library policy on Internet access (Policy), and a disk with a copy of the X-Stop Librarian II software (X-Stop).

Although I was supplied with the Policy, it was not part of my assignment, nor is it part of my expertise, to evaluate the terms used to define the material sought to be blocked. I was asked to assume that the material sought to be blocked was defined by subject matter, not only by specific words. Although I am not evaluating the parameters of the subject matter defined by the Policy, I am fully aware of the Policy. I am aware that the goal of the Loudoun County library is to block certain sites that are considered "pornographic." I am also aware that the primary if not exclusive concern of the Library is pictorial, not textual.

Further, although I have loaded and utilized X-Stop, it is not my purpose to evaluate the particular product. Instead, I was to evaluate if products such as X-Stop could be used to assist in establishing a blocking system that blocked material defined by subject matter with any degree of reliability. In addition to the other materials, I reviewed some of the literature on the various blocking and filtering software available on the market. Among the sites I reviewed are those of Peacefire and the Ethical Spectacle.

Finally, I was asked to assume that the only blocking would be of material on the World Wide Web (Web). I was instructed to ignore any problems that would arise in an effort to block other parts of the Internet such as newsgroups, chat rooms, and email.

BASIC FACTS ABOUT THE WEB

The amount of material on the World Wide Web is vast. This fact has been repeated enough to have become a cliche, but it remains true. It is a simple matter to add a computer, anywhere in the world, to this network. All that is required is a wire, and an electronic address, called an "I.P. address".

The address of a page on a web site, such as "http://www.yahoo.com/Reference/Codes/index.html", is not the I.P. address of the site. It is instead what is called the Uniform Resource Locator (URL) of the site. The portion of the URL between the first slashes ("www.yahoo.com") is the name of the machine ("host name") that the web site is on. It is also sometimes called the "domain name." The rest of the URL is the location and the filename of a document on the machine. ("[H]ttp," which appears before the name of the machine, and "html," which appears at the end of the full URL, describe the software used to create and access web pages.) Web sites are accessed using a form of software called a browser. The browser uses the Domain Name System (DNS), which provides the actual I.P. address of the machine (in this case, "204.71.200.75"). The browser then contacts the machine at that address and asks it for the document identified by the portion of the URL after the slashes ("Reference/Codes/index.html").
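
The way a URL breaks into a machine name and a document location can be sketched in a few lines of Python. This is only an illustration using the standard urllib.parse module; the helper name split_url is hypothetical, and the sketch deliberately performs no DNS lookup, since the I.P. address behind a name can change over time.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the parts a browser works with (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,  # protocol used to fetch the page, e.g. "http"
        "host": parts.hostname,  # machine name; DNS turns this into an I.P. address
        "path": parts.path,      # document requested from that machine
    }

parts = split_url("http://www.yahoo.com/Reference/Codes/index.html")
for name, value in parts.items():
    print(f"{name}: {value}")
```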

As the Yahoo example indicates, I.P. addresses currently consist of 4 sets of numbers separated by "dots." Each set of numbers can range from 0-255. Thus, there are 4.3 billion I.P. addresses available -- and thus room for about as many computers on the current Internet.
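
The 4.3 billion figure follows directly from four sets of numbers with 256 possible values each. A minimal Python sketch of that arithmetic, using the standard ipaddress module to show that a dotted address is simply one number in that range (the variable names are illustrative only):

```python
import ipaddress

SETS_OF_NUMBERS = 4
VALUES_PER_SET = 256  # each set ranges from 0 to 255

total_addresses = VALUES_PER_SET ** SETS_OF_NUMBERS
print(f"{total_addresses:,} possible addresses")

# The Yahoo address from the example above, viewed as a single number.
example = ipaddress.IPv4Address("204.71.200.75")
print(int(example))
```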

Recent surveys suggest that there are 122 million people who have access to the Internet and are thus "online." (http://www.nua.ie/surveys/how_many_online/index.html) Every person who uses a computer to "read" the Internet can use the same computer to "write" to it. This is the essential nature of a network -- it allows machines to "speak" to each other.

Large as it may be, the Internet is still growing. Netcraft, a company that tracks webserver usage, found more than twice as many webservers in June 1998 as it did in June 1997 (http://www.netcraft.com/). There is no indication that this rate of growth is decreasing. In fact, current expectation is that the Internet will expand 1000% over the next few years. (J.M. Barrie and D.E. Presti, Science, (October 18, 1996)).

Steve Lawrence and C. Lee Giles, researchers for the NEC corporation, used the existing search engines to estimate the size of the indexable World Wide Web (as will be explained later on, the indexable World Wide Web is only a subset of the actual World Wide Web) by running a long list of searches on each of the major search engines and comparing the results.

Their results were published in the April 3, 1998, issue of Science Magazine, and their conclusion was that there were at least 320 million pages of material available on the World Wide Web. A web page is not necessarily the same size as a page of commonly typewritten text. For example, web pages can be a paragraph or a single photograph; they can also be the equivalent of hundreds of typewritten pages or photographs. They can also include audio and/or video. As of March 1997, the best published estimates had the size of the web at 50 million pages. This represents almost a sixfold growth per year.

RATE OF CHANGE

None of the 320 million existing web pages is permanent. Any web page can be changed or removed at any time, by typing a few keystrokes or by turning off the webserver. The founder of the Internet Archive project (which actually attempts to archive the entire Internet), Brewster Kahle, estimated that the average lifespan of a document on the World Wide Web is 75 days. (Brewster Kahle, "Archiving the Internet", Scientific American, March 1997. Article also available at http://www.archive.org/sciam_article.html).

I have been administering webservers since early 1994 (when the size of the web was not much more than 500 sites). I know of no other statistics about the rate of change on existing sites. In my experience, Mr. Kahle's estimate is fairly accurate.

The web site I currently administer for the Associated Press has over 1,600 changes on it during the course of an average day. This is fairly high, as the nature of news is transitory and dynamic. While this may be an extreme example, I've never administered a webserver whose content remained static: The first webserver I administered, on a University system, changed whenever any of the 3,500+ users that had an account on it decided to change their personal web space -- and change was pretty much constant. When I was working for a company that designed and hosted web sites for about twenty corporate customers, we generally did two or three bulk "publishes" from our development systems to our live systems in a day. In other words, two or three times a day, we added material or changed material on our online sites. The least active webserver I administered, for a major financial company, which served mostly as a flat "billboard" for their services, was changed with less frequency -- once a week, as all material added had to pass legal review before it could be published.

AMOUNTS OF DATA, LINE SPEEDS

The 320 million pages on the World Wide Web contain a great deal of data. A single byte represents a single character of data -- such as the letter "a". A kilobyte represents 1024 bytes. The average web page, according to Kahle, is on the order of 30 kilobytes. Extrapolating from this, the 320 million pages on the indexable web represent 9,600,000,000 kb -- 9.6 billion kb. An average typed page contains approximately one kilobyte. Thus, the indexable web, if printed on ordinary paper, would require 9.6 billion pages.
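
The extrapolation is a single multiplication, written out here in Python as a check. The constants restate the estimates quoted above; the variable names are illustrative only.

```python
PAGES = 320_000_000     # estimated pages on the indexable web
KB_PER_PAGE = 30        # average web page size, per Kahle
KB_PER_TYPED_PAGE = 1   # approximate size of an ordinary typed page

total_kb = PAGES * KB_PER_PAGE
typed_pages = total_kb // KB_PER_TYPED_PAGE
print(f"{total_kb:,} kb, or about {typed_pages:,} typed pages")
```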

The average 28.8 home modem can retrieve or "pull down" about 3 kilobytes per second. At that rate, it would take roughly 100 years to pull down a current snapshot of the Internet (9.6 billion kilobytes at 3 kilobytes per second is about 3.2 billion seconds). After the first 75 days, however, much of it would have been removed or replaced, so the century of effort would be wasted.

The fastest common connection to the Internet backbone (ordinary computers connect to more high powered computers that comprise the "backbone" of the Internet) that is available currently is a "T3" link, which has a peak rate of 45 megabits (about 5,760 kilobytes) a second. If the entire Internet ran at this speed, pulling down the contents of the web would take far less time.

However, the Internet does not run at this speed. Many, if not most, web sites are served by "T1" (1.5 megabit/s, 192 kilobyte/s), or "56k" (56 kilobit/s, 7 kilobyte/s) connections. In other words, even if the user could pull down the data at a faster rate, the web site cannot handle that faster rate and will make the data available more slowly. In addition, all data has to pass across a common, congested backbone. In practice, the maximum sustained transfer speeds I've personally seen have been on the order of 200-300 kilobytes/second, and rates are usually much slower. Given that fact, the best case estimate (assuming optimal throughput) for pulling down all 9.6 billion kilobytes of the data on the indexable web would be about a year.
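
The transfer-time estimates can be checked with a short Python sketch. It assumes a constant transfer rate, a 365-day year, and a web that neither grows nor changes during the download, all of which are simplifications; the helper name years_to_download is hypothetical.

```python
TOTAL_KB = 9_600_000_000           # estimated size of the indexable web
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_to_download(kb_per_second, total_kb=TOTAL_KB):
    """Years needed to transfer total_kb at a constant rate (hypothetical helper)."""
    return total_kb / kb_per_second / SECONDS_PER_YEAR

# Rates in kilobytes per second, taken from the figures quoted above.
rates = {
    "28.8 modem": 3,
    "sustained, low": 200,
    "sustained, high": 300,
    "T3 peak": 5760,
}
for name, rate in rates.items():
    print(f"{name:>16}: {years_to_download(rate):8.2f} years")
```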

Considering that the current growth rate of the web is about 600% per annum, by the time the indexable web was pulled down, the amount of data available would have grown sixfold. If the user started again immediately, the next effort to pull down the entire web would take almost 7 years at current optimal transfer rates. It is safe to assume that in that time, the rate by which data can be pulled down would have increased. However, even assuming an almost tenfold increase in the expected transfer rates, pulling down the entire web on this second effort would still take almost a year.

With this stated, we can proceed to the basic question: Given a particular subject matter, can one block access to material which contains that subject matter without blocking access to material that doesn't? The only assured way would be to have the perfect judge screen all the information on the Internet. But how to get this information?

Unlike television, which pushes content out to the viewer, the World Wide Web is primarily a "pull" medium. Information is only displayed upon request - by either clicking on a link or typing in the name of the site one wishes to visit. Although this ensures that people get access to content which is relevant to them (and, inversely, also ensures that people rarely view content that they have no interest in -- a person searching the Internet for information on astronomy would be very hard pressed to stumble upon "pornography"), it also means that it is difficult, if not impossible to get any sort of coherent picture as to what is available.

COMPREHENSIVE SURVEY

Just finding out what information is available is a daunting task. As stated earlier, there are 4.3 billion possible I.P. addresses. (http://www.nw.com/zone/WWW/new-survey.html). The best estimates are that there are only about 25-30 million of these addresses in use. However, the only way to identify those in use is to try to contact each address.

One method of attempting to find all the sites on the web that have "pornography" on them would be to look at every site. It is possible to program a computer to go through every possible I.P. address, searching for ones in use. If one is in use, the job isn't done, since there are over sixty five thousand "ports" that the computer at that I.P. address can offer services on. A good analogy would be that a computer at a certain address is a post office, and each "port" is a P.O. Box inside that post office. There could easily be hundreds of webservers running on a single machine, each one at a different port, just the way there could be hundreds of P.O. boxes in use at a Post Office, and each could be entirely unrelated to the other. A comprehensive survey to see what servers are available would have to try every assigned port on every available address -- which means over 280 trillion address combinations to try. At the rate of 1000 queries a second (an extremely high rate), it would take almost 9,000 years for the survey to be complete. And only then would the task of pulling down the data that is available begin.
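
The size of such a survey can be reproduced with a few lines of Python. The constants restate the figures above; a 365-day year is assumed, and the variable names are illustrative only.

```python
ADDRESSES = 256 ** 4        # every possible I.P. address
PORTS = 65_536              # ports available at each address
QUERIES_PER_SECOND = 1_000  # the deliberately high rate assumed above
SECONDS_PER_YEAR = 365 * 24 * 3600

combinations = ADDRESSES * PORTS
survey_years = combinations / QUERIES_PER_SECOND / SECONDS_PER_YEAR
print(f"{combinations:,} combinations, about {survey_years:,.0f} years")
```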

Finding all the pages that are available on these servers would be impossible as well, as there is no way for a remote computer to find the complete contents of a site without that site explicitly allowing it to do so.

BEST GUESS METHOD -- CRAWLING THE WEB

A comprehensive survey not being possible for the above reasons, the next step would be to have a "best guess" list of pages to evaluate. One possibility would be to evaluate pages by their URL. Some URL's, such as ACLU.org, for example, give an indication of the organization or individual who hosts that page. However, many URL's are not so descriptive and some are deceptive. For example, "whitehouse.com" is a site containing "pornographic" images. Thus, it would be essential to search more thoroughly than is possible from URL names.

There are several ways of doing this, but for the remainder of this discussion we can assume that the list was generated by a robot of some sort. Commonly (and interchangeably) known as robots, spiders, or crawlers, these are software programs that wander the web looking at documents. According to a web page that authoritatively describes robots, robots are "...a program that automatically traverses the web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. . . . Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this not the case, a robot simply visits sites by requesting documents from them." (http://info.webcrawler.com/mak/projects/robots/faq.html). Various web robots traverse the web in various ways, but all use essentially the same technology. Robots look at pages on the web, starting at a certain given point, and then following every link that they come across, storing certain information from every page they visit.

One of the most common methods of finding sites on the web is to use a search engine. There are several leading search engines on the Internet (Lycos, AltaVista, Excite). A search engine user enters one or more words and the search engine identifies and lists all web sites that reference the word(s) entered. Search engines generate their databases with robot programs.

Robots do manage to generate a large amount of data, but there is much more that they do not find. Robots can only examine sites they are told about, either through a link from another web site that they examine, or by being explicitly programmed to look at a certain site. They act like a human being with a mouse, clicking on every available "link", and saving some subset of the text on the page they visit for later review.

There are several categories of web sites that robots cannot read and therefore cannot add to any list or index. A robot cannot get a site that no other site links to -- my personal home page, for example, is not on any of the search engines, since no one has created any "links" to it, and I haven't told any of them about it.
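
The link-following behavior of a robot, and its blindness to unlinked sites, can be illustrated with a toy model in Python. This is a sketch, not any real robot's code: the "web" is a small in-memory dictionary, every page name in it is made up, and nothing is fetched over a network.

```python
from collections import deque

# A toy "web": each page maps to the pages it links to. All names are made up.
TOY_WEB = {
    "start.example/index": ["a.example/home", "b.example/home"],
    "a.example/home": ["a.example/news", "b.example/home"],
    "a.example/news": [],
    "b.example/home": ["start.example/index"],
    "unlinked.example/home": ["a.example/home"],  # no other page links here
}

def crawl(seed, web):
    """Breadth-first traversal: start at the seed and follow every link found."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

found = crawl("start.example/index", TOY_WEB)
missed = set(TOY_WEB) - found
print("found: ", sorted(found))
print("missed:", sorted(missed))
```

The page that nothing links to is never visited, even though it exists and itself links into the rest of the toy web.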

A robot cannot read a site that depends on user input. In this way, it is much like a human being with a mouse, but without a keyboard. For example, some sites begin with a search program similar to that provided by the search engines and require the user to insert a word or words for the search. A robot will not be able to feed in any terms to search for (and would have no idea what to search for). Other sites require registration; they provide a fill-in-the-blank form and ask the user to enter certain information before that user can have access to the site. The New York Times is an example of such a site. If a site requires registration, a robot will not be able to access it. If a site requires a username and password to get in, a robot will not be able to access it.

There is a limit to the frequency with which a robot can revisit a site. Thus, a robot cannot adequately read and index work with dynamic content. At the news site of the Associated Press (http://wire.ap.org), where I work, there are over 800 new stories added daily, and about as many removed. A robot visiting our site would find its index made obsolete in a matter of hours.

Even if a site is robot-readable, the operator of a web site can program it to prevent the robot from being able to index all or part of its content.
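
One common way for an operator to do this is a "robots.txt" exclusion file, which well-behaved robots consult before indexing a site. The sketch below assumes that mechanism (the paragraph above does not name one) and uses Python's standard urllib.robotparser module; the rules and the example.com addresses are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules a site operator might publish.
RULES = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)

# A compliant robot asks before fetching each address.
allowed = parser.can_fetch("*", "http://www.example.com/public/index.html")
excluded = parser.can_fetch("*", "http://www.example.com/private/page.html")
print("public page may be indexed: ", allowed)
print("private page may be indexed:", excluded)
```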

But if the idea is to generate a "best guess" list of sites out there, a robot is the best of the bad alternatives. However, for the reasons just discussed, it is by no means comprehensive. Empirical evidence comparing the information available on the major search engines supports this conclusion. The researchers at NEC, while guessing the size of the web, found that there was very little overlap in pages that the search engines knew about. The best of them all, HotBot (http://www.hotbot.com), only knew about 34% of the total documents returned from all the sites. The worst, Lycos, knew only 3% of the total documents returned from all the sites. Even the combined total of all the sites found by all the search engines is not the total of all that is out there -- not even the indexable portion of it. The inherent limitations in robot technology mean that there are large numbers of web sites that cannot and have not been located by robots. There is no way to say how big the actual number is, because, as stated before, a comprehensive survey would take thousands of years.

AUTOMATED CHECKING

The list of sites produced by robots is incomplete, but it is still large -- 320 million pages is the "best guess", and growing -- at current rates of growth, there will be 1.9 billion pages on the indexable web by April 1999.

Given this amount of material, how can it be reviewed to determine if it fits the parameters of the subject matter sought to be blocked? Ideally, it would be possible to write a program that would identify the appropriate subject matter. But programs cannot recognize subject matter -- they can only recognize words.

A program could be written to block access to sites containing certain words or combinations of words. A search for "sex", on AltaVista, for example, gives almost ten million results. One could have a program block this entire list, but not all of these ten million pages were sexually explicit -- one of the early ones contained a link to "The Holy letter : a study in Jewish sexual morality." There is no way for the computer to know that that site was different from the others without knowing the meaning of the words -- and the interaction between them. No computer exists that could understand and flag the fundamental differences between the phrases "study in Jewish sexuality" and (for example) "study live Jewish sex!," while the average human would conclude that the former is more likely to include educational speech about sex and the latter to include "pornography." Some programs were written to block all sites using certain words, but those programs have proven troublesome. For example, a program that blocked "breast" was found to block sites discussing treatment for breast cancer.
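
The overblocking that results from word matching can be seen in a minimal Python sketch. The word list and the is_blocked helper are hypothetical and are not taken from any actual blocking product; the point is only that a match on words carries no information about subject matter.

```python
BLOCKED_WORDS = {"sex", "breast"}  # hypothetical block list

def is_blocked(text, blocked_words=BLOCKED_WORDS):
    """Block the text if any listed word appears in it, ignoring case."""
    lowered = text.lower()
    return any(word in lowered for word in blocked_words)

samples = [
    "The Holy letter : a study in Jewish sexual morality",
    "Treatment options for breast cancer",
    "Photographs of the night sky",
]
results = {sample: is_blocked(sample) for sample in samples}
for sample, blocked in results.items():
    print(f"{'BLOCKED' if blocked else 'allowed'}: {sample}")
```

Both the scholarly title and the medical page are blocked, because the program sees only the listed words and not the context in which they appear.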

If the library's primary concern is pictures, a program that could recognize and judge pictures when it saw them would be a help -- unfortunately, such a technology doesn't exist. The most sophisticated software that I know of that is available currently has quite a hard time distinguishing a close up picture of a face from a landscape -- let alone the niceties of distinguishing a pornographic picture from an artistic nude or a medical text. Blocking all sites with pictures would block an enormous percentage of the web as the vast majority of sites now contain pictures.

The same problem applies to video and audio. No program exists that can distinguish a video or audio clip by subject matter. No program can distinguish a clip that contains "pornography" from one that contains landscapes much less pictures of the nude body that are of educational or scientific or artistic value. Each video or audio clip would need to be screened by a human being.

HUMAN CHECKING

A human being has to be the judge, but a human being can't possibly review all the material available. As stated before, just to get a truly comprehensive list of sites would take a few thousand years.

Even accepting that a comprehensive survey is impossible and using the assistance of a robot or robots, the numbers of sites are still too daunting. As a back of the envelope figure, if one were to hire a team of 25 people, and ask them to review all of the estimated 320 million documents that are currently on the indexable web, it would take them 58 years to finish the current list (assuming a 35 hour, 5 day week per person, and 30 seconds a page to review, an assumption that is deliberately optimistic given the difficulties of sitting at a computer steadily and given the time necessary, for example, to view a video). Moreover, in order to assure that human judgments were being made to a sufficient degree of reliability (i.e. each of the 25 people applied the same standards), something like 10% of the sites would have to be read by more than one person, further increasing the time necessary. After the 58 years, the staff would only then be finished with the current list. Not only would most of the pages they were supposed to check probably no longer be available or have changed, but if the growth rate of the Internet continues, then the number of pages on the indexable web in the year 2056 would be around 4.8 times 10^52 pages, and the 320 million that they had already checked would be far less than 1% of it. In addition, every single one of the 320 million pages would have to be rechecked since most would have been deleted or changed a few hundred times.
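
The back of the envelope figure can be reproduced in Python under the stated assumptions (25 reviewers, 30 seconds a page, a 35 hour week) plus one added assumption of 52 working weeks a year. The helper name review_years is hypothetical, and the sketch ignores double review, growth, and change.

```python
def review_years(pages, reviewers=25, seconds_per_page=30,
                 hours_per_week=35, weeks_per_year=52):
    """Years for a team to look at every page once (hypothetical helper)."""
    total_hours = pages * seconds_per_page / 3600
    hours_per_reviewer = total_hours / reviewers
    return hours_per_reviewer / hours_per_week / weeks_per_year

indexable_web = review_years(320_000_000)
ten_million_pages = review_years(10_000_000)
print(f"320 million pages: {indexable_web:.1f} years")
print(f" 10 million pages: {ten_million_pages:.1f} years")
```

The second call applies the same assumptions to a list of ten million pages, the approximate size of the AltaVista result for "sex" mentioned earlier.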

TECHNOLOGY ASSISTED HUMAN CHECKING, WORD SEARCHING AND CRAWLING

To try and screen out sexually explicit material on even today's web, some combination of technology and human intervention would be needed, so that the technology can pre-screen some of the content. But how to screen effectively? Some word matching to screen is in order -- to search through the list of available material and only flag suspect sites for humans to review. Of course, sites that do not contain any words will not be identified. Thus a "pornographic" site of pictures of people identified only by their names will not be picked up by a robot or search designed to find specific words about sex.

Technological pre-screening has all of the same difficulties discussed earlier; it also has its own difficulties. The amount of data that even a selective search returns is still overwhelming. It would still take the same team of 25 people approximately two years to search the entire list of sites that matched "sex" on AltaVista. By the time they were done, the list would (if "sex" matching sites grow at the same rate as the rest of the Internet) have grown to 12 times its size, and the original ten million pages would, in all likelihood, have changed or disappeared, as the average 75 day lifespan for a page would have been passed more than 6 times over. Moreover, as indicated in the discussion of robots, a large number of sites would not even have been in the original AltaVista data base from which the search started.

The only way to make the list manageable would be to search for more than just a single term, so that the list of sites that the human reviewer has to view shrinks -- to "fine tune" the list. For example, refining the search for "sex" on AltaVista to be "sex and hardcore" gave a list of 6,500,000 pages. Still huge, and still changing and growing daily.

The problem is that the narrower your search is, the more you miss. While the page regarding the history of Jewish sexual morality no longer was in the list of pages returned, neither were quite a few hundred thousand probably "pornographic" links -- and this is only an example of the most coarse grained tuning of parameters.

The only sites returned were in English, as well. How to find "pornographic" material that is in Japanese? Or German, French, Russian or Swahili? A comprehensive search would have to know every word with a sexual connotation in just about every language known to human beings, and be fine tuned enough so that it would return a list of "mostly pornographic" sites, without missing any. Not an easy task at all -- in fact, impossible, as a broad search will always return too much, and a narrow search will always allow "pornographic" material to pass. There is no perfect answer, and there cannot be, given the amount of information there is to process.

One needs a human judge, and a human judge or a team of human judges cannot work quickly enough to process the amount of material that is out there -- and either offensive material will slip through the cracks, or benign material will be blocked. It is simply too large a task, and the size of the task is growing in proportion with the Internet itself.

For all these reasons, I have concluded that it is not possible to find a technological method by which a person seeking to establish a blocking system could reliably block sites that must be identified by their subject matter without blocking sites that contain a different subject matter. Indeed, as a result of the facts discussed above, any such effort will inevitably result in significant overblocking and significant underblocking. It will therefore only poorly accomplish the goal set for it and, in the meantime, will prevent people from accessing speech that even the blocker thinks should be accessible.

ADDITIONAL FLAWS

Even if blocking software cannot be developed that would reliably block sites it wants to block without blocking sites that it does not want to block, such software may block some sites that the software wants to block. Proponents of such a solution might argue that despite the over- and under-blocking problems, even a solution with these flaws is better than no solution at all. In order to evaluate this argument, it is necessary to determine whether the software can be easily defeated. If so, the argument loses considerable force.

In order to evaluate this question, I installed X-Stop on a PC at my office. I did a search for "sex" at Yahoo!. I clicked on the ad that appeared, for a site called "fastporn". I entered this URL into my browser (http://www.fastporn.com), and X-Stop blocked my access to the site.

However, X-Stop blocks sites by listing those sites that the user cannot access. It does not do the opposite, that is, list those sites that are the only sites a user can access. As long as X-Stop functions this way, it can be worked around.

From my office PC, using a program (telnet) that is shipped on every Windows 95 machine, I connected to another computer on the Internet that I have an account on. Such accounts, called "shell accounts," are available for about $15 a month at most of the thousands of Internet Service Providers in the country. I typed in a single command ("ssh -R 8888:www.fastporn.com:80 localhost") and that computer began forwarding connections from it over to fastporn.com. I went back to my browser and typed in the address for this forward (http://mmdevel.ap.org:8888/). X-Stop, knowing nothing about the new address, and following its "default allow" policy, let the request through, and within a few seconds there were some very explicit photos displayed on the terminal in my office on which X-Stop was running and from a site X-Stop attempted to block.

Every computer on the Internet has the capability to do this type of forward. As long as blocking software lets through connections to unknown machines, anyone with a smattering of technical knowledge or the ability to follow instructions will be able to access everything that's out there. It is as simple as call forwarding.
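
The weakness of a "default allow" list can be modeled in a few lines of Python. This is a simplified sketch and not X-Stop's actual implementation, which I have not examined at the code level; the one-entry block list and the default_allow helper are hypothetical, while the two addresses are the ones used in the test described above.

```python
from urllib.parse import urlsplit

BLOCKED_HOSTS = {"www.fastporn.com"}  # hypothetical one-entry block list

def default_allow(url, blocked_hosts=BLOCKED_HOSTS):
    """Allow a request unless its host name is explicitly listed."""
    return urlsplit(url).hostname not in blocked_hosts

direct = default_allow("http://www.fastporn.com")
forwarded = default_allow("http://mmdevel.ap.org:8888/")
print("direct request allowed:   ", direct)
print("forwarded request allowed:", forwarded)
```

The forwarding machine is not on the list, so the request passes even though the content delivered is the same.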

                              __________________________________
                              Michael L. Welles