Open Text Mining Interface

November 22, 2006

The Open Text Mining Interface (OTMI) is an initiative from Nature Publishing Group (NPG). It aims to enable scientific, technical and medical (STM) publishers, among others, to disclose their full text for indexing and text-mining purposes but without giving it away in a form that is readily human-readable.

There’s an argument that this is a solution to the wrong problem, the problem being closed or partial access. But limited access is a fact of life, and this looks like a great idea. Now we need a way for search engines to negotiate for the OTMI representation of a resource. One more step towards RESTful nirvana.
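To make the idea concrete: one way a publisher could expose text for indexing without handing over readable prose is to publish word counts rather than the running text. The sketch below is only an illustration of that general idea; it is not the actual OTMI format, and the function name `word_vector` is made up for this example.

```python
import re
from collections import Counter


def word_vector(text):
    """Reduce a passage to word counts, discarding word order.

    The result is still useful for indexing and term statistics,
    but the original sentences cannot be read back from it.
    """
    # Lower-case the text and keep only runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    vector = word_vector("The crystal structure was refined. The structure is stable.")
    # Print the terms in alphabetical order with their counts.
    for term, count in sorted(vector.items()):
        print(term, count)
```

A search engine could index the terms and their frequencies, while a human reader would get little from the list on its own.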

Dynamic Browse Prototype

November 21, 2006

Richard Jones has recently updated the Wiki page on his Dynamic Browse Prototype. The functionality is interesting; the prototype aims to increase the scalability (w.r.t. #items) of the browse interface by adding pagination of long result sets, and to increase the flexibility of the browse system, so any metadata field can be indexed and browsed. This would be a really great enhancement with a good wow factor – like a pre-configured version, with a simple interface, of the kind of functionality that Longwell offers.
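As a rough illustration of the two ideas (browse on any metadata field, and page through long result sets), here is a minimal sketch in Python. It is not Richard's implementation and has nothing to do with the DSpace code itself; the item structure and the names `build_browse_index` and `browse_page` are assumptions made up for the example.

```python
def build_browse_index(items, field):
    """Build a sorted browse index for one metadata field.

    `items` is a list of dicts, each with an "id" and any number of
    metadata fields holding lists of string values (a hypothetical
    structure chosen for this sketch).
    """
    index = []
    for item in items:
        # An item may carry several values for a field, or none at all.
        for value in item.get(field, []):
            index.append((value.lower(), item["id"]))
    index.sort()
    return index


def browse_page(index, page, page_size=20):
    """Return one page of the index; pages are numbered from 0."""
    start = page * page_size
    return index[start:start + page_size]


if __name__ == "__main__":
    items = [
        {"id": 1, "author": ["Jones, R."], "subject": ["Repositories"]},
        {"id": 2, "author": ["Bailey, C."]},
        {"id": 3, "author": ["Chabot, S."]},
    ]
    author_index = build_browse_index(items, "author")
    print(browse_page(author_index, page=0, page_size=2))
    print(browse_page(author_index, page=1, page_size=2))
```

A real system would keep the index in the database rather than in memory, so that only the requested page is ever fetched.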

I’ve packaged up the SPECTRa crystallography tool as a 1.0b5. SPECTRaX is a super simple web interface that takes CIF files, converts them to CML, packages them up with a METS manifest, and deposits them into DSpace via the DSpace LNI, all in the course of very few user clicks.

This is the first public release of the code, which you can download from the codebase website.

From TechCrunch: Yahoo, Google and MSN have agreed to standardize on a sitemaps protocol. The new standard apparently looks a lot like Google Sitemaps, but is numbered 0.90, so perhaps there are a few features still to go in.

Why is this important for me in the data/repositories world? Well, it’s becoming increasingly clear that getting search engines to harvest metadata, or getting them to crawl metadata-only splash pages, doesn’t work, and we should be directing them straight to the full text if we want our content to be indexed effectively. Sitemaps allow us to build search-optimized representations of our content (this applies double for data without a default textual representation) and point the engines straight to them.
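For the curious, a sitemap is just a small XML file listing URLs. Here is a minimal sketch of generating one in Python that points at full-text files rather than splash pages. The repository URLs are invented for the example and `build_sitemap` is a hypothetical helper, not part of DSpace; the namespace is the one published for the 0.9 protocol at sitemaps.org.

```python
from xml.etree import ElementTree as ET

# Namespace of the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from (url, lastmod) pairs.

    `lastmod` is a 'YYYY-MM-DD' string, or None to leave it out.
    Returns the XML as a string.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical full-text URLs, listed in place of metadata-only pages.
    print(build_sitemap([
        ("https://repository.example.org/bitstream/1/paper.pdf", "2006-11-20"),
        ("https://repository.example.org/bitstream/2/structure.cml", None),
    ]))
```

The point is the choice of URLs: each `loc` goes straight to the content we want indexed.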

What Heather Morrison said.

Before the technical review, a survey went out to the DSpace community in order to help the review group prioritize what users actually wanted improving. The results have been back in for a week or so, and Charles Bailey has posted a summary of the results.

Steven Chabot has posted an analysis of the DSpace project and software (Full report in PDF).

As has been addressed, there are some problems with DSpace. In the first place, the software is open source. While this does come with its own benefits, it also comes with its own problems. Commercial support for the software does not exist at this time, neither for installation nor for later technical issues. Libraries used to working with commercial software or ILS vendors may find implementation difficult. Furthermore, some who have previously implemented the software have had problems with performance while updating files and with the structure of the communities, although these may have been fixed in successive releases of the software.

The major difficulty we have found is with DSpace’s handling of metadata. While we feel that the number of fields in Dublin Core is adequate for most if not all uses (DCMI Usage Board 2006), we are troubled by the lack of authority control when completing its fields. Without some control over uniform titles, authors and subjects, accessing the items in the future will be very problematic. However, this could be solved at an institutional policy level, with guidelines for submission and librarians or faculty having roles in the “workflow” overseeing metadata. While there is no scope in this paper for a discussion of the necessity of controlled vocabulary, we will stress that this necessity does not just apply to paper documents, but to digital ones as well.

Comment: Steven focuses primarily on published literature, so the analysis is a little out of date in places.

The point on lack of visible commercial support is an interesting one; I know a couple of small to medium-sized companies who might offer technical support for DSpace, and possibly repository setup support as well. Perhaps now is the time the community should be helping these companies to promote themselves?

Corinne Mist has posted some thoughts on DSpace. Excerpt:

It seems that while a goal of shared access to scholarly materials is met by DSpace, it is not successfully integrated with the resources already in place. Rather than creating a new platform in which to find an entirely different range of intellectual output, it seems we might consider integrating these materials with the current library system, search engine and catalogue in place.

Comment: This is a fair comment. Integration with existing information systems is an active area of R&D in the DSpace community, for example in the establishment of the spir@l IR at Imperial College, London.

DSpace analysis at Toronto

November 14, 2006

John Ellis in Toronto has posted some Conclusions on DSpace:

My main concerns about the use of DSpace in our library are copyright issues and the initial setup costs.

All things considered, the DSpace software would be beneficial to our university community.

Open Scholarship Presentations

November 13, 2006

The Open Scholarship 2006 presentations are now available online. In all the excitement of the technical review, I forgot to mention here that my update on DSpace is also available at DSpace@Cambridge.

Which raises an interesting point – shouldn’t all those presentations be on their respective IRs and referenced from Glasgow’s 😉 ?