Data Is Finer Grained Than Documents
Consider Google's task. Google has to index every document on
the web. That's fairly onerous. But indexing just means keeping
track of which words are in which documents. Google doesn't
have to care where the words fall within a document. It gets to
return the whole document when it finds a hit.
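As a rough illustration of that kind of word-to-document index, here is a minimal sketch. It is not a description of how Google actually does it; the function names and the two sample documents are made up for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Only record *that* the word occurs, not where it occurs.
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the ids of whole documents containing the word, sorted."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample documents.
documents = {
    "a.html": "ground temperature readings",
    "b.html": "satellite imaging of ground cover",
}
index = build_index(documents)
```

A search for "ground" hands back both document ids, and the caller fetches each document in full. Nothing finer grained than the document is ever indexed.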
For our part, imaging data may be much the same. We can index
the images. Then, we can return the whole image when it matches
someone's search.
But, other data is much trickier. Suppose, for some algorithm,
we'd like to know the ground temperature at the time the data
was collected. If we kept an index of all surface temperature
information for all collecting stations, then we'd be entirely
duplicating databases that already exist. In those cases, there
is no portion of the data that we wouldn't be indexing.
Of course, some such data sets cost money to use. That's a whole
different problem. It would be great if we could index them in
such a way that we're not giving away the store, but we can still
say with certainty that the data the scientist is seeking is in
there if she's willing to pay for it.
So, this leads to the idea of indexing what the data covers without
indexing the data values themselves. Suppose that rather
than indexing all of the temperatures for all of the locations, we
somehow indexed just which places the data covered and which days it
covered them. Maybe with a list of exceptions, so that
we know no data was collected at station YYY between 4pm and 7pm on
April 8th, 2002.
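One way such a coverage record might look is sketched below. The `Coverage` class and its field names are hypothetical, and the overall 2002 date range for station YYY is an assumption made only so the example runs; the 4pm to 7pm gap is the one described above.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Coverage:
    """Says where and when data exists, without holding any readings."""
    station: str
    start: datetime          # first instant covered (inclusive)
    end: datetime            # end of coverage (exclusive)
    gaps: list = field(default_factory=list)  # (gap_start, gap_end) pairs

    def covers(self, when):
        """True if data should exist for this instant."""
        if not (self.start <= when < self.end):
            return False
        return not any(g0 <= when < g1 for g0, g1 in self.gaps)

# Assumed coverage: all of 2002, minus the known exception.
yyy = Coverage(
    station="YYY",
    start=datetime(2002, 1, 1),
    end=datetime(2003, 1, 1),
    gaps=[(datetime(2002, 4, 8, 16), datetime(2002, 4, 8, 19))],
)
```

The record stays tiny no matter how many readings the station took, and it reveals nothing that the data's owner would want to charge for.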
Now, when someone does a search on the temperature data for a place,
they get a result that tells them how to get the data, but doesn't
actually give them the data. And, it comes with a certainty metric
that tells them how certain we are that their exact data is actually
available from said spot.
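The certainty metric is left open here. One possible reading, sketched below, is the fraction of the requested time span that the source claims to cover. That definition, the function names, and the `hypothetical://` locator are all assumptions for illustration, and the sketch assumes the listed gaps do not overlap one another.

```python
from datetime import datetime

def overlap_seconds(a0, a1, b0, b1):
    """Length in seconds of the overlap of [a0, a1) and [b0, b1)."""
    start, end = max(a0, b0), min(a1, b1)
    return max((end - start).total_seconds(), 0.0)

def certainty(want_start, want_end, have_start, have_end, gaps=()):
    """Fraction of the wanted span covered by the source, minus its gaps."""
    wanted = (want_end - want_start).total_seconds()
    if wanted <= 0:
        return 0.0
    covered = overlap_seconds(want_start, want_end, have_start, have_end)
    for g0, g1 in gaps:
        # Clip each gap to the source's coverage before subtracting it.
        lo, hi = max(g0, have_start), min(g1, have_end)
        if lo < hi:
            covered -= overlap_seconds(want_start, want_end, lo, hi)
    return max(covered, 0.0) / wanted

def make_hit(locator, want, have, gaps=()):
    """A search result: where to get the data, not the data itself."""
    return {"locator": locator, "certainty": certainty(*want, *have, gaps)}

# Request: noon to midnight on April 8th, 2002 (12 hours).
want = (datetime(2002, 4, 8, 12), datetime(2002, 4, 9, 0))
# Assumed source coverage: all of 2002, with a 4pm to 7pm gap that day.
have = (datetime(2002, 1, 1), datetime(2003, 1, 1))
gaps = [(datetime(2002, 4, 8, 16), datetime(2002, 4, 8, 19))]

hit = make_hit("hypothetical://stations/YYY", want, have, gaps)
```

Here three of the twelve requested hours fall in the gap, so the hit carries a certainty of 0.75 along with the pointer to the source.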
There could be more than one hit. Maybe some quad-hourly data is
available for free. Maybe some hourly data from a place a little
further away is cheaper or more complete.
--
PatrickStein - 25 Mar 2005