Robin Haun-Mohamed (RHM)
overview of web harvesting at gpo [slides]
assumptions:
- gpo will continue to participate in web harvesting efforts to obtain in-scope material for the fdlp and the cataloging and indexing program as required under 44 USC
- gpo is bound by congressional appropriations for the S&E funding requirements for the FDLP and C&I programs
- all materials identified for inclusion in the fdlp must be brought under bibliographic control directed by the C&I program
- gpo does not have the authority to either give funding or gifts or to receive them. all partnerships must represent a contribution of an equal exchange between all parties.
- automated web harvesting initiatives will become systematic as part of release 2 of fdsys
- materials harvested under the epa pilot project are being made available as staff time and processing permit. completion of the processing of this material will necessarily require an automated metadata extraction process that does not yet exist.
harvesting efforts:
- focus is on pdf files; focus is on publications. re: databases: we would prefer to partner with agencies producing databases, rather than try to capture
semi-manual harvesting efforts:
- use a tool to schedule content capture and re-harvest
sample of results:
18% were already cataloged
3% previously distributed in tangible format
2% not within scope
62% new publications
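note: the four categories above add up to 85%, so about 15% of the sample falls into categories not captured in these notes (presumably shown on the slides). a quick tally as a sketch; the dictionary labels are just shorthand for the lines above, not gpo terminology:

```python
# Percentages exactly as recorded in the notes above.
# The labels are shorthand for this sketch, not official GPO category names.
sample_results = {
    "already cataloged": 18,
    "previously distributed in tangible format": 3,
    "not within scope": 2,
    "new publications": 62,
}

accounted = sum(sample_results.values())
unaccounted = 100 - accounted  # share of the sample not captured in these notes

print(f"accounted for in notes: {accounted}%")
print(f"not captured in notes: {unaccounted}%")
```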
processing times breakdown:
scope determination: 17 minutes
conser standard record creation: 2.5 hours
[others on slides]
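to get a feel for what these per-item times mean for a backlog, here is a rough sketch using only the two figures recorded above. the batch sizes in the example are hypothetical, and the other processing steps from the slides are not included, so this understates the real total:

```python
# Per-item times as recorded in the notes; the remaining steps are on the
# slides and are NOT included here.
SCOPE_MINUTES = 17           # scope determination
CONSER_MINUTES = 2.5 * 60    # CONSER standard record creation (2.5 hours)


def staff_hours(n_scope: int, n_conser: int) -> float:
    """Rough staff hours for n_scope scope determinations and n_conser CONSER records."""
    total_minutes = n_scope * SCOPE_MINUTES + n_conser * CONSER_MINUTES
    return total_minutes / 60


# Hypothetical batch (not from the presentation): 100 items reviewed for scope,
# 10 of which need a CONSER record.
print(round(staff_hours(100, 10), 1))  # 53.3
```

even for a small hypothetical batch, record creation takes nearly as much time as scope determination, which fits the later concern that additional harvests create a large backlog.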
also, outline of workflow document
need to find web harvesting page for notes, numbers, etc.
scope determination takes a lot of time
list of some questions considered:
- is it published by a us gov’t agency
- is the information covered by copyright
- who is the copyright holder?
- is the primary source of funding for it gov’t money?
- does it contain data that may violate a citizen’s privacy?
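the questions above could be recorded per item as a simple checklist. this is only a sketch: the class and field names are made up for illustration, and it deliberately does not decide whether an item is in scope, since the notes do not say how the answers are weighed and that judgment stays with staff:

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScopeChecklist:
    """Answers to the scope questions listed above; None means not yet answered.

    Hypothetical structure for illustration, not a GPO system.
    """
    published_by_us_gov_agency: Optional[bool] = None
    covered_by_copyright: Optional[bool] = None
    copyright_holder: Optional[str] = None
    primarily_gov_funded: Optional[bool] = None
    may_violate_privacy: Optional[bool] = None

    def unanswered(self) -> List[str]:
        """Return the names of questions still needing an answer."""
        missing = []
        for f in fields(self):
            if getattr(self, f.name) is not None:
                continue
            # The copyright holder only matters if the item is under copyright.
            if f.name == "copyright_holder" and self.covered_by_copyright is False:
                continue
            missing.append(f.name)
        return missing


item = ScopeChecklist(published_by_us_gov_agency=True, covered_by_copyright=False)
print(item.unanswered())  # ['primarily_gov_funded', 'may_violate_privacy']
```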
acquisitions staff are being trained to create brief bib records for monographs in ILS
cataloging librarians are creating CONSER records
special materials:
- news releases, transmittals, forms, announcements
- included in list of special materials in the monthly catalog
- investigating methods to provide bibliographic access to this material
trying to figure out what to do with this – a LOT on epa site
cataloging guidelines are going to be updated this year
examples of special material – screen capture on slide
step-by-step overview – what happens to monographs and serials
[note to self: are we adding new items to our marcive profile???]
related projects:
metadata extraction:
- odu project. using 1K pubs from the epa pilot project
- currently developing software rules and designing templates
special material:
- demo project to examine depository participation in harvesting activities
- assist gpo in creating brief bib records for items in the special materials category
- opportunity for 5 depository librarians
- more information on fdlp-l as plans develop
- develop basic criteria for metadata, what would be useful; look at workflow for dealing with special material
partially harvested publications:
- demo project to examine depository participation in harvesting activities
- locate and harvest all the parts of partially harvested publications from epa pilot project
- complete the harvest of 150 publications
- 3 month project
- opportunity for 5 dep libns
- must have time to devote from june to beginning of september
- must have ftp ability
- call for volunteers will be posted to fdlp-l
Q&A:
Ken Wiggin (KW): appears to be a pilot that didn’t work. where do you plan to go now that you know how difficult it is? (What is lesson learned?) shouldn’t you start with metadata extraction and then harvest?
Ric Davis (RD): didn’t turn out the way we’d hoped. defined parameters and ‘threw technology at it.’ this was a beta test to further define requirements for fdsys, so that we don’t procure a harvesting tool that gives us this messy data.
RHM: it’s not that it didn’t work – we didn’t know enough to make specifications very clear. some metadata was separate, some was matched with pub. we learned a lot from this, and we know that we need to apply these lessons to the fdsys, but we need to be more careful in applying our requirements. we’ve also been able to pursue the discussion of automated metadata extraction. one caution: additional harvests create a huge backlog of materials that will require time to go through.
Laurie Hall (LH): benefits to harvest – lots of new documents that we didn’t have before. leveraged a lot of the learning experience to help our internal workflow. opportunities to train staff on new tasks such as creating brief bibs.
Peter Hemphill (PH): given the results, web harvesting is most effective on well-structured sites with well-structured content. what’s the ROI? working thru this vs. working with fdsys/epa/api – giving them tools to submit content. (as opposed to trying the web harvest approach)
RHM: ROI is pretty costly for what we’ve already gathered, but we can’t put it aside. as far as working with agencies, i hope that’s what fdsys will do for us. changing nature of the web prohibits an absolute. while we want to deal with these things now, we are looking at a development of templates, metadata extraction, etc., that we hope will deal with the bulk of the material.
Chris Greet (CG): this was remarkably successful in terms of positive yield. 62% new publications is a good thing! peter’s point is well-taken – this suggests that there is a lot out there, but there’s so much that the manual approach is too labor-intensive. a fundamentally different approach is necessary, but the technology at this stage is inadequate.
Mark Sandler: hear parallel discussions going on in ARL libraries re: scholarly communication. new types of communications, data, etc. and concern is that it only becomes more and more – finished publication is one thing, but the process/communication is happening ‘further and further upstream.’ huge backlog is only going to grow – and is going to get worse when dealing with political and gov’t information as well. continuing to explore this issue is important.
Tim Byrne: when council made web harvesting one of our priorities, it was because materials disappear every day. if you aren’t going to start harvesting as much as you can immediately, we’re going to lose it.
KW: there are harvesting tools out there. we need to learn quickly from this project, how we better do this. retrieved a lot of pieces of information. some human intervention is required. having depository libraries volunteer is not a long-term solution.
John Shuler: sign of hope, involving depository libraries. being able to distribute this burden throughout all dep libraries is a good thing (?????????) begins talking about GIO.
RHM: want to reiterate that we do continue to work with agencies on harvesting methods and best practices. are a part of the CENDI group on web harvesting, etc.
CG: at what point does this become completely unrealistic?
RHM: what are our options for the future? is the development of a bib record for a publication an approach that we continue to take, or do we have to look for a broader approach? use the archivists’ model?
RD: tie back to tim’s point. we don’t have the luxury of giving up. a lot of work went into defining requirements for the vendors. early proponents of harvesting that mentioned they could do this decided not to bid. we can’t have all of this dumped in our lap & look to the library community – we need to further develop technology to aid this.
Katrina: Robin, you suggested earlier that you have to catalog it to comply with title 44.
RHM: cataloging & indexing requirements would have to be changed in title 44. if we create a brief bib record now, that is bib control. what is a cataloging record?
tim: you just have to make it better than the old monthly catalog.
Geoff Swindells (GS): go back to scholarly communication. libraries aren’t trying to sit back and see what rolls in the door (or doesn’t); they are being partners with faculty, involved in understanding how communication is changing, etc. one approach to take is working with agencies – setting standards for publication & organization, and tools and processes, etc.
audience Q&A:
what has epa done to assist you in this work?
RHM: epa allowed us to harvest their sites and databases, behind firewalls, etc. they made no commitment; there has been no follow-up. they are participating in cendi group and have strong discussions on ROI of dealing with this material.
kathy hale: this is a federal and state problem; thanks for putting templates, etc., out there.
rich gause: old problem with fugitive docs. still need to put teeth in t44.
RD: allows us to take the discussion out of the theoretical level, and present actual numbers.
Joanne Beezley (Pittsburg State): how can i get these records in my catalog? cgp and not oclc
LH: talk to linda resler re: Z39.50 access from your catalog to the cgp. that record set can be pulled out. will be a session tomorrow.
arlene weible: exact same issue on state level. these are not publications as they have traditionally been. need to look at them differently. we’ve been looking at cataloging process – what do we still need to do in online environment? classification, etc. so what are you talking about in terms of your process?
LH: brief bib is pretty brief. but still doing sudoc and item numbers. would we have anarchy if we removed them? sudoc and item number scheme is very restrictive, but aren’t sure that we can remove.
arlene: these are tools we’ve used in a published publication environment. do we need the same tools in a different environment? still need human intervention re: scope. automated systems can improve this, but still need judgment.
mary martin: web harvesting is the answer, but what’s the question? we’re not necessarily interested in providing access to this information. do we know what kinds of documents are more likely to be used? can we narrow the scope to this?
RHM: is part of the discussion we still need to have. t 44 still says wide net. that’s what we go for.
Jane Kelsey: another state library person. reality: it does say throw the wide net. as a historical society person, if we get too selective, what are we going to lose that we’re going to need? something that needs to be taken into consideration.
sandee mcaninch: some regional libraries would like to provide access to a good portion of the wide net, so i would encourage that technique.
question: are we not ever going to see the brief bibs for monographs in oclc? that’s the cost-effective way for us as a regional to gather these records.
LH: not going to answer that right now. is an additional step in the process. needs to be considered in terms of resources, etc. won’t say yes or no yet.
Peter Kraus: 2 years ago utah passed legislation that required agencies to make information available in a preformatted way to the state library.
GS: what kind of compliance are you getting? MO has a similar law.
PK: in utah, the legislature is more powerful than the governor. [did not give figure]
Jeff Bullington: agrees that we need to cast a relatively wide net. reason: changing ways of communication, how information is packaged, etc.
Barbie Selby: following on sandee’s comment re: brief records: how can they be identified in cgp – in bulk?
LH: have been talking w/oclc re: batch loading. still waiting for info from oclc.
GS: difference between a broad net and everything. take lessons from archivists.
Bill Olbrich: for years we have had a choice re: items we can select and not select. have also had mocat filled w/nondepository items. don’t worry about batch-loading into oclc; give us the titles and let us have the choice of whether to add them or not.