The truth is out where?
February 21, 2007
I would be fairly surprised if any of the readers of this blog didn’t also read Jon Udell, but just in case: Jon Udell has been blogging about the two fundamentally different approaches to maintaining metadata about files.
One way is to use metadata embedded in the file (or potentially in a sidecar file). The other is to “stand-off” the metadata in a database. This debate will sound familiar to many, and especially to anyone who followed the early DSpace 2 discussions in 2005, where this was one of the most contentious issues.
Now to my mind, if you’re just talking about tagging photos, then the correct approach is to keep the data in the file. Most OSes have mechanisms to observe file changes, so it seems sensible to let all the applications that index tags just observe changes.
Isn’t this the case for a repository application like DSpace? It would make for really great decoupling if we could just let multiple update components independently write to a filesystem and have reading components observe the changes. It would also probably scale well with respect to system complexity and load, since any work done by the observers wouldn’t block the update.
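To make the observer idea concrete, here is a minimal sketch in Python. It assumes metadata lives in sidecar files named `*.meta.json` (a made-up convention, not anything DSpace does), and it polls the directory rather than hooking into OS change notifications, purely to stay within the standard library. The class name `SidecarObserver` is hypothetical.

```python
import json
from pathlib import Path


class SidecarObserver:
    """Polls a directory of hypothetical '*.meta.json' sidecar files for changes."""

    def __init__(self, root):
        self.root = Path(root)
        # Sidecar path -> (mtime in ns, size in bytes) as of the last poll.
        self._seen = {}

    def poll(self):
        """Return {file name: parsed metadata} for sidecars added or changed since the last poll."""
        changed = {}
        for path in sorted(self.root.glob("*.meta.json")):
            stat = path.stat()
            stamp = (stat.st_mtime_ns, stat.st_size)
            if self._seen.get(path) != stamp:
                self._seen[path] = stamp
                changed[path.name] = json.loads(path.read_text())
        return changed
```

Each reading component (an indexer, a thumbnailer, whatever) would hold its own observer and call `poll()` on its own schedule, so a slow reader never holds up a writer. Note that nothing here checks what was written, only that something changed.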
The big wrinkle comes when you want to constrain the metadata you write. If you write a “part of” field referring to another resource in the repo, does that resource exist? Are the controlled vocab fields valid?* If you want to maintain consistent state like this you need a single point of access for storage updates.
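A single point of access of this kind might look something like the following sketch. Everything in it is an assumption made for illustration: the `MetadataStore` class, the `part_of` field name, and the idea of a vocabulary as a plain set of allowed values. It is held in memory, so it says nothing about how a real repository would persist or lock anything.

```python
class ConstraintError(ValueError):
    """Raised when a metadata update would leave the store inconsistent."""


class MetadataStore:
    """Hypothetical gatekeeper: every metadata write goes through put()."""

    def __init__(self, vocab):
        # Field name -> set of allowed values for that controlled field.
        self.vocab = vocab
        self.records = {}

    def put(self, resource_id, metadata):
        # A "part of" reference must point at a resource that already exists.
        parent = metadata.get("part_of")
        if parent is not None and parent not in self.records:
            raise ConstraintError(f"part_of refers to unknown resource {parent!r}")
        # Controlled-vocabulary fields must hold one of the allowed values.
        for field, allowed in self.vocab.items():
            if field in metadata and metadata[field] not in allowed:
                raise ConstraintError(f"{metadata[field]!r} is not valid for {field!r}")
        # Only a fully valid record is stored.
        self.records[resource_id] = dict(metadata)
```

The checks are only meaningful because nothing else can write to `records`; as soon as a second component writes to storage directly, the guarantee is gone.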
Of course you don’t have to maintain consistent state. You can accept that state might be corrupted and make sure that anything dealing with relationship metadata is fault tolerant. The web has scaled well because of (rather than despite) the ability to create broken links, and to break existing ones by moving content.
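As a rough illustration of what “fault tolerant” could mean for a reader of relationship metadata, the sketch below walks a chain of “part of” links and simply reports a dangling link instead of failing. The `part_of` field name, the dictionary of records, and the function itself are all hypothetical.

```python
def resolve_part_of(records, resource_id):
    """Follow 'part_of' links upward, stopping quietly at a dangling or cyclic link.

    records maps resource id -> metadata dict. Returns a list of
    (ancestor id, status) pairs, where status is "ok" or "missing".
    """
    chain = []
    seen = {resource_id}
    current = records.get(resource_id, {}).get("part_of")
    while current is not None and current not in seen:
        if current not in records:
            # Broken link: note it and stop, rather than raising.
            chain.append((current, "missing"))
            break
        chain.append((current, "ok"))
        seen.add(current)
        current = records[current].get("part_of")
    return chain


records = {"a": {"part_of": "b"}, "b": {"part_of": "c"}}
print(resolve_part_of(records, "a"))  # [('b', 'ok'), ('c', 'missing')]
```

The caller decides what a missing parent means (show a placeholder, queue a repair job, ignore it), which is roughly how a browser treats a 404.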
I’m undecided about what this means for repository systems. Are the rigours of worrying about transactions, correctness and constraints more of a pain than cleaning up after (hopefully infrequent) conflicts occur?
* A couple of years ago I was chatting to Mick Bass (now head of the SIMILE project) and wondered how you could do this using RDF. “How closed-world of you” was the reply. Why is the answer always slightly zen when the question involves the semantic web?