A similar concern, less commercial but equally avaricious in the information sense, lies behind one of the projects from Ximian (now Novell) to generate the most buzz: founder Nat Friedman's Dashboard project. Despite a promising prototype, implementing Dashboard turned out to raise a lot of deep and difficult questions, but its supporters believe they have a way forward. A future version of Dashboard will be rebuilt on the Beagle project, led by Jon Trowbridge.
The GNOME Foundation is treating Dashboard and Beagle as extremely important. Trowbridge gave an informal keynote-like talk on them today at the 4th GNOME Developer's Summit.
The issue does not concern GNOME alone. Dashboard and Beagle are desktop-independent; they could be accessed by KDE as well. And Microsoft has announced a similar system that automatically indexes your entire computer system and turns up everything related to some topic of importance to you.
Now I'm trying to figure out where I saw this survey. Was it a Web site I visit regularly, something on an RSS feed, an email sent by a colleague, or a hallucination induced by listening to too much modern jazz this week?
I don't think either Dashboard or Longhorn will help me search that last category any time soon. But they are supposed to help turn up results from all the other categories--and (thanks to real-time indexing) turn them up nearly instantly, even on a hard disk with multiple gigabytes of information in a variety of formats.
Trowbridge complains that your computer throws away a lot of information you give it (such as the fact that you saved a file from an email message). But I wonder about the push to save so much metainformation. True, we now have the processing power and storage space to save all kinds of junk. But can we predict what information will really be useful? I'll return to this question at the end of this article.
Fast search depends on an up-to-date database, whether one is talking about the spidering done all the time by Internet search engines, or a repository of terms used by files on your own hard disk. Thanks to events generated by the new D-BUS interface being developed for Linux, a kernel subsystem called inotify can collect changes to files as they happen and pass them to interested userspace tools.
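To make the idea of an up-to-date database concrete, here is a minimal sketch in Python of detecting which files an indexer would need to revisit. It is not how Beagle or inotify work: inotify reports changes as they happen, whereas this stand-in compares two snapshots of a directory tree, which is slower but portable. The names take_snapshot and diff_snapshots are invented for this illustration.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical type for this sketch: path -> (mtime in nanoseconds, size in bytes)
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(root: str) -> Snapshot:
    """Record modification time and size for every file under root."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # the file vanished between listing and stat
            snapshot[path] = (st.st_mtime_ns, st.st_size)
    return snapshot


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Return the paths an indexer would need to add, re-read, or drop."""
    return {
        "created": sorted(p for p in new if p not in old),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
        "deleted": sorted(p for p in old if p not in new),
    }
```

A polling loop would call take_snapshot periodically and hand the result of diff_snapshots to the indexer. An event-driven design avoids the repeated full walk of the tree, which is the advantage the kernel-level approach is after.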
Beagle depends on an indexing tool called Lucene to keep track of what's in various files on the system. It essentially checks everything except files in dot directories and others that traditionally contain throw-away data. As I already mentioned, it records the contents of Web pages you visit. It can also search your email, your IM logs, and anything else that exists as a file.
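The kind of index described here can be illustrated with a toy inverted index in Python. This is a sketch under loud assumptions: it is not Lucene and not Beagle's code, it treats every file as plain text, and the function names build_index and search are made up for the example. It does show the one rule the article mentions, skipping dot directories.

```python
import os
import re
from collections import defaultdict
from typing import Dict, Set

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(root: str) -> Dict[str, Set[str]]:
    """Map each term to the set of files under root that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune dot directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort
            for term in TOKEN.findall(text.lower()):
                index[term].add(path)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the files that contain every term in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A real indexer would also need format-specific text extraction, ranking, and incremental updates, none of which this sketch attempts.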
The next step is to store the metainformation collected in various ways with the files. Microsoft's Longhorn will theoretically involve an entirely new filesystem called WinFS. (When this will happen is anybody's guess, but it won't happen soon.) One of Linux's strengths is its support for multiple filesystems, and Trowbridge doesn't expect them all to be enhanced just to support Beagle. However, many filesystems support a feature called "extended attributes," small name-value pairs attached to a file and often used to implement Access Control Lists and other new features. Beagle can use these to store its metadata.
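As a rough illustration of storing metadata in extended attributes, the following Python sketch records where a file came from, such as the email message it was saved from. The attribute name user.example.saved_from and both function names are hypothetical, not anything Beagle defines. os.setxattr and os.getxattr are available only on Linux, and the filesystem must permit user attributes, so the sketch reports failure instead of raising.

```python
import os
from typing import Optional

# Hypothetical attribute name chosen for this example only.
ATTR_NAME = "user.example.saved_from"


def tag_origin(path: str, origin: str) -> bool:
    """Try to attach an origin note to the file; return False if unsupported."""
    if not hasattr(os, "setxattr"):
        return False  # platform without the Linux xattr calls
    try:
        os.setxattr(path, ATTR_NAME, origin.encode("utf-8"))
    except OSError:
        return False  # filesystem refuses or lacks user attributes
    return True


def read_origin(path: str) -> Optional[str]:
    """Return the stored origin note, or None if there is none."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR_NAME).decode("utf-8")
    except OSError:
        return None  # attribute missing or attributes unsupported
```

Because the note travels with the file rather than living in a separate database, it survives a rename within the same filesystem, though many copy and archive tools drop extended attributes unless told to preserve them.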
For each file format or type of information (email, for instance) Beagle will have a back-end API to do searching. The developers are even looking for ways to associate metainformation with pictures. Beagle combines all the results and presents them in a single front-end API. Applications that want to do system-wide searches, therefore, will need to understand just the Beagle API in order to access all types of data on the system. The current utility used to demo Beagle is called best, for Bleeding Edge Search Tool.
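The split between per-format back ends and a single front end might look something like the Python sketch below. Every name in it (Hit, Backend, DictBackend, SearchFrontEnd) is invented for illustration and does not reflect Beagle's actual API. The toy back end searches an in-memory dictionary where a real one would query mail folders or IM logs.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Hit:
    source: str   # which back end produced the hit
    title: str    # human-readable label for the item
    score: float  # higher means a better match


class Backend(Protocol):
    """What the front end expects from any back end."""
    name: str

    def search(self, query: str) -> List[Hit]: ...


class DictBackend:
    """Toy back end over an in-memory mapping of title -> text."""

    def __init__(self, name: str, documents: Dict[str, str]) -> None:
        self.name = name
        self.documents = documents

    def search(self, query: str) -> List[Hit]:
        words = query.lower().split()
        hits = []
        for title, text in self.documents.items():
            tokens = text.lower().split()
            count = sum(tokens.count(w) for w in words)
            if count:
                hits.append(Hit(self.name, title, float(count)))
        return hits


class SearchFrontEnd:
    """Single entry point that fans a query out to every back end."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends

    def search(self, query: str) -> List[Hit]:
        hits = [hit for b in self.backends for hit in b.search(query)]
        # Best score first; ties broken by source and title for stable output.
        return sorted(hits, key=lambda h: (-h.score, h.source, h.title))
```

An application would construct one SearchFrontEnd and never touch the back ends directly, which is the point of the design: adding a new data type means adding a back end, not changing every caller.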
Privacy fears come to mind when one considers a tool that does instant searches. Remember that (currently) Dashboard and Beagle are meant for use by an individual on his or her personal data. One approach to the issue is to say "Privacy is overrated" and assume that one is doing the user a favor by presenting his or her entire disk contents on demand. Another approach would be to divide information into categories, such as to separate work data from personal data. But that's hard to do: asking the user to distinguish them is adding work, while trying to do it from context risks oversimplifying the complex lives led by users.
But I worry about clever schemes to track and save information--and not for privacy reasons. I just wonder whether we'll know what we'll want in the future.
Archaeologists have found marvelous ways to deduce ancient people's lifestyles from the facts they turn up. They make deductions based on whether an artifact is upside-down or right-side-up, and from chemical traces found nearby. Still, we often wish people in the past had left more clues.
We also do archaeological searches on our computers' data, which is just as strangely organized and off-balance as the MIT Stata Center that hosts today's GNOME summit. Once again, the data we left behind on our computers proves frustratingly inadequate for today's purposes. And I would guess that increasing the data we collect will do little to close the gap.
When one starts creating filesystem attributes and instrumenting applications, one makes choices that will continue to have impacts thirty years later. What new application will arise just a year or two from now that will make the Beagle developers kick themselves because they forgot to prepare for it?
So I'm not saying full system search is unfeasible. I'm just asking how long it takes to prepare a system for the search, in comparison to how long it takes for the system to become obsolete. I'd like to try the results, in any case.
Andy Oram is an editor for O'Reilly Media, specializing in Linux and free software books, and a member of Computer Professionals for Social Responsibility. His web site is www.praxagora.com/andyo.
oreillynet.com Copyright © 2006 O'Reilly Media, Inc.