Here Today, Gone Tomorrow, 10/20
Posted by Valerie on October 21, 2008
James Jacobs (Stanford), Molly Bragg (Internet Archive – IA)
MB: [presentation slides]
overview of IA. (interesting – designated as a library by the state of CA in 2007)
3 ways to partner: national libraries – domain crawls; curated crawls for large collections – iraq war, elections; archive-it (automated harvesting)
have done congressional harvests for LC since 107th.
for nara: EOT (end of term) 2004; congressional election harvests, 2006 and 2008
collaborative EOT this year
overview of archive-it
average life of a web page – 44 days. (where does this come from?)
see NC archives site for more info on their partnership
James Jacobs – Stanford’s experience [presentation slides]
archive-it collection – archive.org/home/ssrg
part of admin interface – edit metadata for collections
demo of crs reports site as part of stanford’s archive-it – search at –
collections are also part of the archive-it search
why are we doing this? (part of an overall digital strategy that we’re trying at stanford, and hopefully the library community)
stanford is also harvesting materials from agency e-foia reading rooms
lockss & internet archive are in talks to collaborate
IADeposit – delicious project to tag items for IA’s govdocs collection (refer to James’ presentation from…Annual?)
what’s next?
would like to add more robust DC metadata to seeds
build new collections (send ideas to James)
Farmington Plan redux (ARL plan? to collaboratively collect int’l pubs) – concept could work really well for digital environment
- better access – ingesting into opacs (stanford is using vufind); google site maps; more/better metadata
- better open-source digital mgmt tools (mentioned web-at-risk as an example of a non-archive-it tool)
- open metadata standards and repositories
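To make the "more robust DC metadata" idea concrete, here is a minimal sketch of what a Dublin Core record for a single seed might look like. This is only an illustration: the `seed_to_dc` helper, the `record` wrapper element, and the example.gov seed are all hypothetical, and nothing here reflects how archive-it actually stores seed metadata. Only the DC element names and namespace come from the Dublin Core element set itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def seed_to_dc(seed_url, fields):
    """Build a simple Dublin Core XML record for one seed URL.

    `fields` maps DC element names (e.g. "title", "subject") to either a
    single string or a list of strings. Hypothetical helper for illustration.
    """
    # "record" is an assumed wrapper element, not part of Dublin Core.
    record = ET.Element("record")
    identifier = ET.SubElement(record, f"{{{DC_NS}}}identifier")
    identifier.text = seed_url
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        # Repeatable elements (e.g. several subjects) get one tag per value.
        for value in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical seed and field values.
example = seed_to_dc(
    "http://example.gov/reports/",
    {
        "title": "Example agency reports",
        "subject": ["government documents", "reports"],
        "type": "Collection",
    },
)
print(example)
```

A record like this could, in principle, be handed to an opac or a shared metadata repository, which is where the "open metadata standards" point comes in.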
q’s:
q: cost of archive-it, learning curve
a: various subscription levels, ranging from 11K to 16K; they do work w/individual budget needs
learning curve – designed to be user-friendly; training is provided at the beginning. you don’t need to know what an api is. really designed for non-tech savvy people.
crawler can be very broad or focused, depending on the need. there are some pre-scoping features in the works. certain kinds of content can be blocked. may be easier to block certain files
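As a rough illustration of what "broad vs. focused" scoping and file blocking mean, here is a small sketch of a scope check a crawler might apply to each discovered URL. It is not archive-it's implementation or configuration; the `in_scope` function, its parameters, and the example.gov URLs are assumptions made up for this example.

```python
import posixpath
from urllib.parse import urlparse


def in_scope(url, seed_hosts, blocked_extensions=(), path_prefix="/"):
    """Decide whether a discovered URL should be crawled.

    Hypothetical scoping rule: the host must be one of the seed hosts,
    the path must sit under `path_prefix`, and the file extension must
    not be in `blocked_extensions`.
    """
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host not in seed_hosts:
        return False
    path = parts.path or "/"  # treat a bare host as the root path
    if not path.startswith(path_prefix):
        return False
    extension = posixpath.splitext(path)[1].lower()
    return extension not in blocked_extensions


hosts = {"example.gov"}

# Broad crawl: anything on the seed host.
broad = in_scope("http://example.gov/about.html", hosts)

# Focused crawl: only the (hypothetical) /foia/ reading room, skipping zip files.
focused_page = in_scope(
    "http://example.gov/foia/report.pdf", hosts, {".zip"}, "/foia/"
)
focused_blocked = in_scope(
    "http://example.gov/foia/data.ZIP", hosts, {".zip"}, "/foia/"
)
off_path = in_scope("http://example.gov/about.html", hosts, {".zip"}, "/foia/")
print(broad, focused_page, focused_blocked, off_path)
```

The same rule covers both ends of the range: leaving the defaults gives a broad crawl of the whole host, while a path prefix plus a block list narrows it.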
q: re: intellectual property. how are you dealing with that?
a: James: my work deals with public domain material, and there’s an educational aspect to that. if it’s a gov’t web site, it’s in the public domain. with .orgs, etc., james takes a liberal view of intellectual property. hasn’t been approached by an org wanting something removed.
q: re: gao legislative histories:
a: carl malamud is trying to obtain the data; when that happens, it’ll be part of the lockss project to preserve his publicresource.org site