Our Technical Approach: Aggregation Vs. Federation

(Image: It’s Our Flag © IWM (Art.IWM PST 12254))

Making a Choice

In my last post, I discussed some search API technologies, and a harvesting one, plus the possibility of being given offline metadata by some content providers.

At this point, the contrast between searching other people’s live APIs and harvesting everyone’s data to re-serve it via our own API made us move ahead to point 6 in our preliminary strategy, and stop to think a little more deeply about what we are trying to do on the project. It would be possible to harvest everyone’s metadata and serve it all back up under one search interface, and certainly this is what aggregators like Europeana and Culture Grid have had great success with already (in the case of Culture Grid, with the help of people like K-int and their open data aggregator software). However, there is another option which would meet the goals of our project: federation.

In the aggregation scenario, everyone wanting to do something with the whole range of metadata which is out there has to grab what they want to use, and then build their application on top of it. If it ever changes at source, then a re-grab and reload is required.

In the federation scenario, live, up-to-date data is obtained on-the-fly from every source that maintains an open API to the world. On the other hand, if a source goes down for whatever reason, all of its data suddenly vanishes from your application, and querying several remote systems for every search runs the risk of performing less well than searching a copy of everything held on your own system.
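To make that trade-off a little more concrete, here is a minimal sketch of a federated search in Python. It is illustrative only and not the project's own code: the function name `federated_search`, the `sources` mapping and the five-second default timeout are all assumptions made for the example. Each source is just a callable that takes a query and returns a list of records, so it could wrap any live API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, sources, timeout=5.0):
    """Send `query` to every source at once and gather whatever comes back.

    `sources` maps a source name to a callable that takes the query string
    and returns a list of records (hypothetical interface for this sketch).
    Returns (results, failed): results maps each source that answered to
    its records, failed lists the sources that errored or ran out of time.
    """
    results, failed = {}, []
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {name: pool.submit(search, query) for name, search in sources.items()}

    # One shared deadline, so a slow source cannot stretch the overall wait.
    deadline = time.monotonic() + timeout
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except Exception:
            # A source that is down or too slow simply drops out of the results.
            failed.append(name)

    pool.shutdown(wait=False)  # do not block on sources that are still hanging
    return results, failed
```

The `failed` list is the federation risk in miniature: when a source is unavailable, its records are missing from that search, whereas an aggregated copy would still return them (possibly out of date).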

When we scrutinised the ethos of our project, we felt it came down to the fact that whilst discovery seeks to demonstrate the benefits of people making their data available in general, our project’s specific contribution to the programme is to demonstrate the benefits of people signing up to the discovery statement and offering up their metadata via an API. This implies that we are going to re-use their APIs in place, which really implies federation and carries the interesting challenge of making use of all the relevant API offerings in one end product. Additionally, we were aware of so many good examples of aggregation in the space that we felt federation would be an interesting thing to look at as a counterpoint!

…and then Aggregating Anyway!

In some respects, we will still have to aggregate to a certain extent, because some providers who do not already have APIs have agreed to hand over their metadata for us to make use of in our application. However, our aim with this metadata is to make each set API-enabled and to document the process and the decisions we made. It will then be possible for providers to pick up what we have done and replicate it themselves or, alternatively, for us to act as some sort of partitioned API service provider for people without the infrastructure or desire to get into hosting their own API. From an approach perspective, this means we are still using federation; it is just that some of the sources are ones that we are ‘coincidentally’ hosting in house. To find out more about how we brought this aspect of the project to life, I have written a more detailed technical article about our approach to helping institutions towards APIs.

Coming back to point 4 from our preliminary strategy, we looked back to the API requirements gathered from phase one (restructured and presented below). Search is right there at number one, so acting as some kind of federated harvester is out. Solr is so common in the field, and its search input and result output formats are so standard, that it became clear we would be able to act as a Solr federator quite quickly. Solr’s CSV import also makes it a really good fit for the metadata we have obtained from API-less providers: we can load it into Solr instances which we house ourselves, and then plug those straight into the Solr federator as well. Once we had something up, we could then look into expanding it to cope with SRU and OpenSearch queries and responses too.
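As a rough illustration of what acting as a Solr federator might involve, here is a short Python sketch. It is not the project's implementation: the helper names and the example base URLs are hypothetical. It relies only on Solr's standard `/select` handler with the `q`, `rows` and `wt=json` parameters, and on the usual JSON response shape in which the documents sit under `response.docs`. Results are merged round-robin, on the assumption that relevance scores from separate Solr instances are not directly comparable.

```python
import json
from itertools import zip_longest
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base_url, query, rows=10):
    """Build a query URL for a Solr core's standard /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/select?{params}"


def fetch_docs(base_url, query, rows=10, timeout=5.0):
    """Run the query against one Solr core and return its list of documents."""
    with urlopen(solr_select_url(base_url, query, rows), timeout=timeout) as resp:
        return json.load(resp)["response"]["docs"]


def interleave(doc_lists):
    """Merge per-source result lists round-robin, tagging each doc with its source.

    `doc_lists` maps a source name to that source's list of Solr documents.
    """
    merged = []
    names = list(doc_lists)
    for row in zip_longest(*(doc_lists[name] for name in names)):
        for name, doc in zip(names, row):
            if doc is not None:  # shorter lists are padded with None
                merged.append({"source": name, **doc})
    return merged


if __name__ == "__main__":
    # Hypothetical endpoints: one remote provider and one core hosted in house.
    cores = {
        "provider": "http://solr.example.org/solr/collection",
        "in_house": "http://localhost:8983/solr/museum",
    }
    hits = interleave({name: fetch_docs(url, "flag") for name, url in cores.items()})
    print(hits[:5])
```

Because the in-house cores and the remote ones are queried in exactly the same way, the federator does not need to know which sources we happen to be hosting ourselves.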

All in all, this was a strategy which looked quite realisable, and so I set out to produce a proof of concept which, if successful, would give us enough confidence to commit to it as an approach and put out to tender for a couple of developers to work with our API and make something pretty! You can read more about the proof of concept API in my API Syntax post, while my next article, on setting up Solr APIs, goes into more detail about how we made metadata from institutions searchable so that we could include it in our API.