Wednesday, May 26, 2010

Reprint of DocuBurst and the Future of the Book Article

You are looking at an open book
Created as a course project by a U of T student, this conceptual pinwheel could change the way we read and retrieve information
June 10, 2007 | RYAN BIGGE | Toronto Star



Books are wonderful things, but they tend to release information rather slowly. Bound ink and paper remain stubbornly linear, as sentences unspool across the page in an orderly but time-consuming fashion.

Those seeking a shortcut often rely on book reviews (and, as it so happens, this paper prints many excellent ones each week). But if you're an academic or a lawyer or anyone else with a specialized area of interest, reviews are unlikely to address your info niche. The solution to assessing unfamiliar books or articles might reside in the colourful pinwheels pictured here.

Aesthetics might not be the first thing that comes to mind when thinking about how to navigate the ever-increasing overload of information. Google Book Search and Amazon's Search Inside feature are great tools – provided you've already read the book in question. For a journalist trying to remember who once boasted about being able to float off the floor like a soap bubble, these databases are a godsend. (O'Brien, as it happens, in Orwell's Nineteen Eighty-Four.)

If, on the other hand, you're a first-year university student working on an essay about surveillance, neither Book Search nor Search Inside is likely to steer you toward Big Brother.

For those who spend a significant portion of their lives sifting through the alphabetic silt, DocuBurst (the software prototype that produced these pinwheels) may become as valuable as the scanned pages of Amazon and Google themselves.

At the risk of oversimplification, building an enormous library (be it real or virtual) requires mainly money and time. Effectively navigating the result necessitates innovative approaches to information retrieval and display.

Or, to put it another way, massive amounts of digitized knowledge are useless if they cannot be accessed and assessed quickly and meaningfully.

DocuBurst – the name is a mash-up of document and sunburst – is the brainchild of 28-year-old University of Toronto student Christopher Collins. It's a new method of information visualization (a.k.a. InfoViz) that allows a person to quickly determine the cumulative theme(s) of a given book or document, while at the same time allowing specific keyword searches.

If you're in a hurry, DocuBurst can instantly tell you if a book is sufficiently obsessed with quantum physics, while bookworms can perform detailed literary analysis on a single word, like "nosegay."

"It's a really beautiful book," says Collins, in reference to Jacques Bertin's classic Semiology of Graphics, an exploration of information design that helped inspire DocuBurst. "He talks about the inherent meaning that can be conveyed through data graphics."

The power of information made visual is the difference between an Al Gore speech and an Al Gore PowerPoint presentation. (Of course, in the wrong hands, InfoViz can make things worse, not better, as our indecipherable hydro bills demonstrate.)

Despite their beauty, these pinwheels at first appear strange and opaque, at least without a little explanation. Don't panic. In only a few short paragraphs their mysteries will be fully revealed.

Collins, who is a year away from completing his Ph.D. in computer science, created DocuBurst as a final course project. He began by pouring the contents of a textbook from 1912 called General Science by Bertha Clark into a Java-based toolkit called Prefuse. (General Science, with 20,000 other books, is available as a free e-text download from Project Gutenberg.)

Next, Collins made the textbook searchable. "DocuBurst works like a regular index," explains Collins, "except it's interactive and is able to jump you to whatever part of the document you're interested in."

In normal mode (that is, without the pinwheels), a DocuBurst keyword search will include a small window along the bottom of the screen that shows the half-dozen words before and after your search term. This is not unlike the book-scan snippets offered by Google Book Search or Amazon's Search Inside.
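For the curious, the snippet window described above can be approximated in a few lines of Python. This is only a rough sketch, not DocuBurst's actual code (which is built on the Java toolkit Prefuse); the function name `keyword_in_context` and the sample sentence are made up for illustration.

```python
import re

def keyword_in_context(text, term, window=6):
    """Return each occurrence of `term` with up to `window` words on either side."""
    # Split the text into simple word tokens (letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text)
    snippets = []
    for i, word in enumerate(words):
        if word.lower() == term.lower():
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            # Upper-case the hit so it stands out in the snippet.
            snippets.append(" ".join(before + [word.upper()] + after))
    return snippets

# Hypothetical sample text, not taken from any real book.
sample = "Heat is a form of energy and energy can be neither created nor destroyed"
for line in keyword_in_context(sample, "energy", window=3):
    print(line)
```

With the default `window=6`, each hit is shown with the half-dozen words on either side of it.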

None of this, of course, explains the pinwheels. That's because DocuBurst is two things in one: an interactive index and a method of determining the overall theme of a given book or document. The funky pinwheels will be explained next, after a brief but necessary tangent.

A "treeware" index often includes the infamous "See Also." So if you look up the word Energy, you might be told to See Also: Heat, Radiation and Electromagnetic Radiation.

The structure of DocuBurst pinwheels relies upon an exhaustive type of See Also list generated from WordNet. WordNet is a sophisticated database that groups words into distinct sets of synonyms, a hybrid of thesaurus and dictionary developed by Princeton's Cognitive Science Laboratory. Specifically, WordNet provides "is a" linkages. A cat "is a" kind of pet. Heat "is a" type of energy. This linguistic genealogy provides second, third and fourth cousins removed for any given word.
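To make the "is a" idea concrete, here is a toy version in Python. The handful of links in `IS_A` are hand-written stand-ins, not real WordNet data, and the function names are invented for this sketch.

```python
# Hypothetical, hand-written "is a" links; real WordNet data is far larger.
IS_A = {
    "heat": "energy",
    "radiation": "energy",
    "electromagnetic radiation": "radiation",
    "cat": "pet",
    "pet": "animal",
}

def ancestors(word):
    """Follow "is a" links upward from `word` to the most general term."""
    chain = []
    while word in IS_A:
        word = IS_A[word]
        chain.append(word)
    return chain

def descendants(term):
    """Collect every word that "is a" kind of `term`, directly or indirectly."""
    found = []
    for child, parent in IS_A.items():
        if parent == term:
            found.append(child)
            found.extend(descendants(child))
    return found

print(ancestors("cat"))
print(descendants("energy"))
```

Walking the links downward from a word such as energy yields the kind of exhaustive See Also list the pinwheels are built on.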



In the pinwheel pictured on the cover of this section, this results in a See Also list that starts out general and becomes increasingly specific with each additional orbit ring.
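One plausible way to picture the rings, sketched in Python: start at the centre term and step outward one level of "is a" links at a time. The `CHILDREN` table is a made-up miniature, not WordNet itself, and the sketch assumes the links form a simple tree with no loops.

```python
# Hypothetical parent -> children links standing in for WordNet data.
CHILDREN = {
    "energy": ["heat", "radiation"],
    "radiation": ["electromagnetic radiation"],
    "electromagnetic radiation": ["light"],
}

def rings(centre):
    """Group terms by distance from the centre term, one list per ring."""
    result = []
    current = [centre]
    while current:
        result.append(current)
        # The next ring holds the children of every term in this ring.
        current = [child for term in current for child in CHILDREN.get(term, [])]
    return result

for depth, terms in enumerate(rings("energy")):
    print(depth, terms)
```

Ring 0 is the search term itself; each later ring holds narrower terms than the one before it.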

Collins isn't the first person to take advantage of WordNet, but he is the first person to transpose the "is a" relationships of a given word across a "radial space-filling graph," that being the technical term for the pinwheels.

From here on, things get simpler. Enter a search term like energy and DocuBurst colours the pinwheel based on the frequency of words that are related to the central term. This allows you to visualize the overall theme of a particular document or book.
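A back-of-the-envelope version of that colouring step might look like the Python below. It is a sketch under stated assumptions rather than DocuBurst's real method: the list of related terms, the five-step shade scale and the function name `term_shades` are all hypothetical, and only single-word terms are handled.

```python
import re
from collections import Counter

# Hypothetical list of single-word terms related to the search term "energy".
RELATED_TERMS = ["energy", "heat", "radiation", "light"]

def term_shades(text, terms, levels=5):
    """Map each term to a shade from 0 (absent) to `levels` (most frequent)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    peak = max((counts[t] for t in terms), default=0)
    shades = {}
    for t in terms:
        if counts[t] == 0:
            shades[t] = 0
        else:
            # Scale against the most frequent term; any hit gets at least shade 1.
            shades[t] = max(1, round(levels * counts[t] / peak))
    return shades

# Hypothetical sample text, not taken from any real book.
sample = "Heat is energy. Light is energy too, and heat can become light or more heat."
print(term_shades(sample, RELATED_TERMS))
```

Here a higher number would be drawn as a darker wedge, so the busiest part of the pinwheel shows at a glance where the document's attention lies.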

Without DocuBurst, you're stuck skim-reading or schlepping through the index tucked in the back of a book. Let's say you're interested in intuition, and you picked up Malcolm Gladwell's book Blink. There are about nine entries for intuition, with See Also's for introspection and decision making. Which means you'd need to spend a few minutes flipping back and forth between index and book proper to see if the book met your needs.

Now imagine Blink as a DocuBurst. Given the central search-term intuition, DocuBurst would use WordNet to build a pinwheel from words related to that term, and then indicate the approximate frequency of those terms and concepts (including, perhaps, ESP) through colouration.

"Our perception of colour ranges is culturally based," explains Collins. "Does red come before blue? I don't know. There's no inherent meaning there. But we do assume that a light colour means less than a dark colour."

Utilizing the same rapid-cognition and thin-slicing skills Gladwell discusses in his book, DocuBurst makes it very easy to determine whether Blink is worth reading.

"We're not particularly interested in people being able to immediately read specific numbers off this graph," explains Collins, in reference to DocuBurst's radial view. "We want to give more of an impression or theme."

The radial space-filling graph can also expand and contract at the click of a mouse. Like most information visualization projects, DocuBurst is guided by three key principles: overview, zoom and filter, and details-on-demand.

For Collins, information visualization is a tricky mixture of art and science. He borrows techniques from hobbies such as painting, especially colour mixing. In a nod to his M.Sc. in Computational Linguistics, Collins is wearing a red T-shirt with white lettering that reads "I'm a noun!" on the day we meet.

The killer app for these pinwheels will be side-by-side document comparison. Imagine trawling through a legal database, searching for articles about file-sharing. With DocuBurst, you'd be able to see, at a glance, which articles lit up the relevant area of the pinwheel.

This is similar to the relationship between a map of Canada and weather patterns. Superimpose a storm front atop Hamilton and you can tell at a glance that it's a bad day to wash your car. The map is static, the weather patterns fluctuate. In the same way, DocuBurst can take the temperature of a given book or document, thus facilitating rapid pattern recognition.

A cross-comparison feature will require another four to six months of coding, however. In the meantime, DocuBurst's keyword search feature will be road-tested at the University of England this fall, when Collins will be asked to help analyze Victorian literature to determine how often authors used rare words, and in what context. Collins also just entered the Future of the Book contest (www.futureofthebook.org) in which he converted McKenzie Wark's open source book Gamer Theory into DocuBurst format.

Google anticipates it will take about a decade to scan an estimated 30 million books for its comprehensive library. Which, coincidentally, is about how long it typically takes to implement an interface like DocuBurst so that the general public can comfortably use it at their local library.

Not that Collins is sitting on his hands waiting. He has spent the past year jumping from conference to conference, and is doing a three-month internship at an IBM research centre in New York.

His Ph.D. project, meanwhile, will allow users to compare two-dimensional graphs in a three-dimensional space. Imagine two pie charts able to talk to each other and compare notes, while suspended in space like plates stacked in a dishwasher, and you get some hint as to Collins' ambition.

To try to describe the project further would require many more words that would ultimately fail to convey the elegance and power of the software prototype that Collins let me sneak preview.

Which, come to think of it, perfectly demonstrates the limitations of language and the power and strength of information visualization.

If a picture is worth a thousand words, then DocuBurst creates colourful information graphs whose pinwheel patterns are composed of thousands of words.
