An extremely interesting post by Jim King (PDF architect and Senior Principal Scientist at Adobe), explaining why getting text back out of PDF is a pain. In his words: –

To extract text from PDF documents is a rather difficult and a highly technical task…

We’ve had plenty of problems with PDF in our data mining efforts. Leaving aside the basic problem that most of the data we care about can’t be encoded in PDF at all, extracting text has given us plenty of trouble too. Jim explains some of the reasons why and concludes: –

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

PDF is not designed primarily for text. In PDF, typography trumps text.

I’m going to blog some more substantial notes on last Friday’s RepoCamp as and when time permits. In the meantime, a cool idea and a plea for collaborators.

The RepoCamp involved the announcement of not one, but two developer challenges in the style of the one at Open Repositories 2008. The first is a general challenge (for which I can’t easily find a reference: help please, WoCRIG!) to do something cool involving interoperating systems. The second challenge is specific to the OAI-ORE specification, and involves creating a prototype that makes the usefulness of ORE visible to end-users.

I’ve got a cool idea for this, but I’m going to need to collaborate to get it done in time, so I’m blogging it in the hope that someone with a bit of time on their hands will get in touch.

The idea: a JavaScript library (or userscript) that follows all the links on a page; if a link is an ORE Resource Map, or if a Resource Map can be auto-discovered from it, the link gets decorated with an ORE icon. Clicking the ORE icon pops up a display of the contents of the ORE aggregation, à la Stacks in OS X 10.5.
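
To make the idea concrete, here’s a rough sketch of the discovery logic. It’s in Python purely for brevity (the real thing would run client-side in JavaScript), it leans on the <link rel="resourcemap"> convention from the ORE discovery guidance, and the function names are made up – treat it as an illustration rather than a design.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    # Serialisations we'll treat as Resource Maps in their own right.
    RESOURCE_MAP_TYPES = {'application/atom+xml', 'application/rdf+xml'}

    class LinkCollector(HTMLParser):
        """Collects <a href> targets and any <link rel="resourcemap"> href."""

        def __init__(self):
            super().__init__()
            self.anchors = []
            self.resource_map = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'a' and attrs.get('href'):
                self.anchors.append(attrs['href'])
            elif tag == 'link' and 'resourcemap' in (attrs.get('rel') or '').split():
                self.resource_map = attrs.get('href')

    def resource_map_for(url):
        """Return a Resource Map URI for url, or None if none can be discovered."""
        resp = urlopen(url, timeout=10)
        content_type = resp.headers.get_content_type()
        if content_type in RESOURCE_MAP_TYPES:
            return url                    # the link itself is a Resource Map
        if content_type == 'text/html':   # try auto-discovery from the page
            parser = LinkCollector()
            parser.feed(resp.read().decode('utf-8', errors='replace'))
            if parser.resource_map:
                return urljoin(url, parser.resource_map)
        return None

    def decoratable_links(page_url):
        """Yield (link, resource_map) pairs for links that deserve an ORE icon."""
        parser = LinkCollector()
        parser.feed(urlopen(page_url, timeout=10).read().decode('utf-8', errors='replace'))
        for href in parser.anchors:
            rem = resource_map_for(urljoin(page_url, href))
            if rem:
                yield href, rem

In the userscript version the same per-link check would be made with asynchronous requests from the browser, which is where the bandwidth question below comes in.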

There are some fun bells and whistles in there, including making the interface super shiny and minimizing bandwidth.

Anyone want to help out? I was planning to use John Resig’s jQuery and HTML parsing libraries and possibly processing.js.

Since the doco doesn’t give quite enough help to be fully useful, here’s some additional guidance. To add custom reports to trac you’ll need to monkey with the database – the report queries are stored there.

  • Start by backing the db up.
  • If you’re using sqlite (the default) and you don’t want to stop the trac server to use one of the client tools, you’ll probably want to access the db through the python api, as per the advice here.
  • You need to execute an insert statement to store your report. You’ll need to do the triple quote dance, since you’re adding a SQL query as a value in a SQL query, inside a string parameter to the cursor.execute() method (there’s a sketch after this list). The doco on creating reports has advice on the schema.
  • Remember to call db.commit(), since sqlite seems to lock reads when there are outstanding write transactions open.
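
Putting those steps together, here’s a minimal sketch of the whole dance. It assumes a Trac 0.11-era install and the report table columns described in the reports doco (author, title, query, description); the environment path, report title and the report query itself are all made up, so adjust to taste.

    from trac.env import Environment

    env = Environment('/srv/trac/myproject')   # hypothetical environment path
    db = env.get_db_cnx()
    cursor = db.cursor()

    # The report is itself a SQL query, stored as a value in the report table,
    # hence the triple-quoted string.
    report_query = """SELECT id AS ticket, summary, owner
    FROM ticket
    WHERE status <> 'closed'
    ORDER BY owner"""

    cursor.execute(
        "INSERT INTO report (author, title, query, description) "
        "VALUES (%s, %s, %s, %s)",
        ('admin', 'Open tickets by owner', report_query,
         'Hypothetical example: open tickets grouped by owner'))

    # sqlite holds its write lock until the transaction is committed.
    db.commit()

Once the row is committed, the new report should show up in trac’s report list.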

Free advice, and worth what you paid.

Last Thursday I attended a JISC workshop on repository architectures. It was a thought-provoking day, and I learnt a lot. Firstly, I learnt that I need to pay more attention to context when quoting people on twitter (sorry, @paulwalk).

Paul Walk kicked off the day by presenting his work on a next-generation architecture for repositories. His presentation opened with a set of starting principles and moved on to some diagrams illustrating a specific architecture based on them. As Paul mentions in that blog post, his diagrams and the principles behind them were “robustly challenged”. As far as I remember, the diagrams were challenged more robustly than the principles.

To cut a longish story short, the discussion and the workshop exercises brought up some interesting ideas, but did relatively little either to validate Paul’s architecture diagrams or to provide a working alternative. Chatting about it over lunch and later over a pint, I was persuaded that we were looking for an abstraction that doesn’t exist, and that the desire for a single generic repository architecture might have led us down the garden path.

Software engineering, being a field that values pragmatic epistemology, has a couple of empirically derived laws that might help to explain why. Firstly, Larry Tesler’s law of the Conservation of Complexity states (in a nutshell) that complexity can’t be removed, just moved around. A natural way to manage this complexity is to find abstractions that hide some of it. This, fundamentally, is what the repositories architecture is trying to do – reduce the multiplicity of interests, politics and data-borne activities of HE into a single abstract architecture.

A second empirical law, The Law of Leaky Abstractions, states that all non-trivial abstractions leak. Some of the complexity cannot be hidden behind the abstraction and leaks through. It feels to me that this is what’s happening with repositories at the moment. Our abstraction (centralisation, services provided at the point of storage, etc.) fails to cope with real, current complexities. The problem itself is extremely complex, and if anyone really has their head around it, they’ve still got the hard task of communicating it to the whole community so that a good shared abstraction can be developed.

I found myself going back to Paul’s starting principles, and concluded that they were a much more constructive framework for thinking about repository issues than the concrete architectures in the diagrams. Paraphrasing the principles: –

  • Move necessary activity to the point of incentive
  • [Terms of reference for IRs]
  • Pass by reference, not by copy
  • Move complexity towards the point of specialisation
  • Expect and accept increasing complexity on the local side of the repository with more sophisticated workflow integration.

With the exception of the point on IRs, they are all forms of guidance on complexity, either where to move it (“Move [metadata] complexity towards the point of specialisation”), or which trade-offs to make (“Pass by reference, not by copy” => “Prefer to deal with the complexities of references than the complexities of duplication”). The reason I like this approach is that different disciplines, institutions and activities (e.g. REF, publication, administration) all have different complexities and different drivers. Perhaps we need a number of different architecture abstractions based on constraints and drivers. Perhaps the idea of an architecture abstraction is premature in this community and we should focus on local solutions (in the sense of ‘minima’ rather than geography). This needn’t end in technical balkanization; the repositories architecture is driven by business models, and focusing on interoperability and the web architecture allows more of the technical discussion to happen in parallel.

To get the ball rolling, I’d like to add a caveat to Paul’s “Move [metadata] complexity towards the point of specialisation”: “… unless it’s there already and it’s harder to recreate than maintain”. Any more?

Update

Andrew McGregor has posted extensive minutes and notes from the meeting.