Data Is Finer Grained Than Documents

Consider Google's task. Google has to index every document on the web. That's fairly onerous. But, at its simplest, indexing just means keeping track of which words are in which documents. Google doesn't have to care where the words are in the document. It gets to return the whole document when it finds a hit.
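As a rough illustration of that kind of index, here is a minimal sketch of a word-to-document map. It is a toy under stated assumptions, not a description of how Google works: the names `build_index` and `search` and the two sample documents are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it.

    Word positions are deliberately thrown away: the index only
    records *which* documents a word is in, not *where*.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, docs, word):
    """Return the whole text of every document containing the word."""
    matching_ids = sorted(index.get(word.lower(), ()))
    return {doc_id: docs[doc_id] for doc_id in matching_ids}


# Hypothetical sample documents.
docs = {
    "a": "ground temperature readings",
    "b": "satellite imaging of ground cover",
}
index = build_index(docs)
hits = search(index, docs, "ground")
```

A hit hands back the entire document, which is exactly the shortcut that stops working once the data is finer grained than a document.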

For our part, imaging data may be much the same. We can index the images. Then, we can return the whole image when it matches someone's search.

But, other data is much trickier. Suppose, for some algorithm, we'd like to know the ground temperature at the time the data was collected. If we kept an index of all surface temperature information for all collecting stations, then we'd be entirely duplicating databases that already exist. In those cases, any single value might be the one someone is searching for, so there is no portion of the data that we wouldn't be indexing.

Of course, some such data sets cost money to use. That's a whole different problem. It would be great if we could index them in such a way that we're not giving away the store, but we can still say with certainty that the data the scientist is seeking is in there if she's willing to pay for it.

So, this leads to the idea of indexing what a data set covers without indexing the values it contains. Suppose that rather than indexing all of the temperatures for all of the locations, we somehow indexed just which places the data covered and which days it covered them. Maybe with a list of exceptions or something... so that we know no data was collected at station YYY between 4pm and 7pm on April 8th, 2002.
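One way to picture such an index entry is a coverage record: a station, a covered time span, and a list of exception windows. The sketch below is only an assumption about what that record might look like. The `Coverage` class, its field names, and the choice of all of 2002 as the covered span are hypothetical; only the station YYY gap comes from the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Coverage:
    """What a data set covers, without any of the measured values."""
    station: str
    start: datetime  # first moment covered (inclusive)
    end: datetime    # end of coverage (exclusive)
    # Exception windows as (gap_start, gap_end), end exclusive.
    gaps: list = field(default_factory=list)

    def covers(self, when):
        """True if data should exist for this moment."""
        if not (self.start <= when < self.end):
            return False
        return not any(g0 <= when < g1 for g0, g1 in self.gaps)


# Assumed coverage span; the gap is the one from the example above.
yyy = Coverage(
    station="YYY",
    start=datetime(2002, 1, 1),
    end=datetime(2003, 1, 1),
    gaps=[(datetime(2002, 4, 8, 16), datetime(2002, 4, 8, 19))],
)

has_5pm = yyy.covers(datetime(2002, 4, 8, 17))   # inside the gap
has_noon = yyy.covers(datetime(2002, 4, 8, 12))  # outside the gap
```

The record stays tiny no matter how many readings the underlying database holds, and it reveals nothing a paying customer would be paying for.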

Now, when someone does a search on the temperature data for a place, they get a result that tells them how to get the data, but doesn't actually give them the data. And, it comes with a certainty metric that tells them how certain we are that their exact data is actually available from said spot.

There could be more than one hit. Maybe some quad-hourly data is available for free. Maybe some hourly data from a place a little further away is cheaper or more complete.
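To make the idea of several hits with a certainty metric concrete, here is a small sketch. Everything in it is an assumption for illustration: the `Source` fields, the two sample sources, and especially the metric itself, which is just the fraction of requested hours a source should have a sample for. It also reads "quad-hourly" as one sample every four hours, which may not be what was meant.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    how_to_get: str           # pointer to the data, not the data itself
    interval_hours: int       # one sample every N hours
    distance_km: float        # distance from the place searched for
    cost: float
    missing_hours: frozenset  # hours of the day known to have no data


def certainty(source, wanted_hours):
    """Fraction of the wanted hours the source should have a sample for."""
    wanted = list(wanted_hours)
    if not wanted:
        return 0.0
    have = [h for h in wanted
            if h % source.interval_hours == 0
            and h not in source.missing_hours]
    return len(have) / len(wanted)


def search(sources, wanted_hours, max_distance_km):
    """Return (name, how_to_get, certainty) per hit, most certain first."""
    hits = [(certainty(s, wanted_hours), s)
            for s in sources if s.distance_km <= max_distance_km]
    hits = [(c, s) for c, s in hits if c > 0]
    hits.sort(key=lambda cs: (-cs[0], cs[1].cost))  # ties go to the cheaper
    return [(s.name, s.how_to_get, c) for c, s in hits]


# Hypothetical sources for one day at one place.
sources = [
    Source("four-hourly-free", "public download", 4, 0.0, 0.0, frozenset()),
    Source("hourly-paid", "contact the provider", 1, 12.0, 5.0,
           frozenset({16, 17, 18})),
]
results = search(sources, range(12, 20), max_distance_km=20.0)
```

Each result says where to get the data and how sure we are it is there, and leaves the trade-off between cost, distance and completeness to the person searching.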

-- PatrickStein - 25 Mar 2005

History: r1 - 24 Mar 2005 - 22:17:08 - PatrickStein
 