I ordered an 8GB flash disk last week (turns out there is a 16GB one around, but I am modest ;-) ). Since I don't have a whole lot of media content to put on this fat-stick, I will instead end up putting all my work from the past few years on it. If I factor in the emails, I should easily fill up 8GB with a couple of years' worth of data. Wow. I remember having a hard time filling a DSDD 5.25 inch 576 kB diskette back in the early 90s.
The lowdown is that we have *lots* of data. 7MP pictures, podcasts, email archives, documents and web downloads, not to mention audio/visual media - all this can quickly add up. Fortunately storage has kept up, or perhaps the pace of storage growth encourages more data generation in the first place? Whatever the truth, we now have a whole lot of data sitting in our computers.
Large volumes of data are nothing new outside the PC world. For example, real databases have routinely run into terabytes. The key difference between user-PC data and these databases is the heterogeneity and the lack of structure in the former. User-PC data comes in various formats and is generated by completely different applications. Even when the same application generates the data (e.g. an email mailbox file), the goal has never been to store the data in a way that makes it easy to extract global information later.
Desktop search software is the first step in mining information from user-PC data. But search is really a very preliminary tool because it only flags the existence of the information sought via the specified keywords. There is very little cognizance of the bigger picture. Data mining - that power tool which works so beautifully for databases and other highly structured data - does not exist yet for user-PC data.
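To make the contrast concrete, here is a rough Python sketch of the two ideas. Everything in it is made up for illustration: the `DOCS` dictionary is a hypothetical stand-in for files on a disk, and `keyword_search` and `document_frequency` are names I picked, not anything a real desktop search product exposes. The first function only tells you which files contain your keywords; the second takes one tiny step toward a global view by counting how many files mention each term.

```python
import re
from collections import Counter

# Hypothetical stand-in for files on a user's disk: name -> text content.
DOCS = {
    "mail/2006-03.txt": "budget review meeting with alice about the storage project",
    "notes/todo.txt": "email alice the storage budget",
    "papers/draft.txt": "desktop search flags keywords in documents",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(docs, keywords):
    """Search: return the names of documents that contain every keyword."""
    wanted = {k.lower() for k in keywords}
    return sorted(
        name for name, text in docs.items() if wanted <= set(tokenize(text))
    )


def document_frequency(docs):
    """Crude 'global' view: for each term, count how many documents mention it."""
    counts = Counter()
    for text in docs.values():
        counts.update(set(tokenize(text)))  # set() so each file counts once
    return counts


if __name__ == "__main__":
    print(keyword_search(DOCS, ["storage", "budget"]))
    print(document_frequency(DOCS).most_common(5))
```

Even this toy version shows the gap: the search answers only the question you already knew to ask, while the frequency count starts to surface what the collection as a whole is about. Real mining of user-PC data would of course need far more than word counts.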
Isn't it time we started building algorithms beyond just search to help users extract useful information from their gigabytes of data?