‘Bama Docs

A look at government information from the Yellowhammer State.

Archive for the 'web harvesting' Category


DLC: Web Harvesting, 3/31 p.m.

Posted by Valerie on April 1, 2008

Robin Haun-Mohammed (RHM)

overview of web harvesting at gpo [slides]

assumptions:

gpo will continue to participate in web harvesting efforts to obtain in-scope material for the fdlp and the cataloging and indexing program as required under 44 USC

- gpo is bound by congressional appropriations for the S&E funding requirements for the FDLP and C&I programs

- all materials identified for inclusion in the fdlp must be brought under bibliographic control directed by the C&I program

- gpo does not have the authority to either give funding or gifts or to receive them. all partnerships must represent a contribution of an equal exchange between all parties.

- automated web harvesting initiatives will become systematic as part of release 2 of fdsys

- materials harvested under the epa pilot project are being made available as staff time and processing permit. completion of the processing of this material will necessarily require an automated metadata extraction process that does not yet exist.

harvesting efforts:

- focus is on pdf files; focus is on publications. re: databases: we would prefer to partner with agencies producing databases, rather than try to capture

semi-manual harvesting efforts:

- use a tool to schedule content capture and re-harvest

sample of results:

18% were already cataloged

3% previously distributed in tangible format

2% not within scope

62% new publications

processing times breakdown:

scope determination takes 17 minutes

conser standard record creation: 2.5 hours

[others on slides]
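To get a feel for what these per-item times mean for a backlog, here is a rough sketch in Python. It uses only the two figures noted above (17 minutes for scope determination, 2.5 hours for a CONSER record). The function name and the example counts are made up for illustration, and the other steps from the slides are not included.

```python
# Per-item times from the notes above; the other steps on the slides are left out.
SCOPE_MINUTES = 17
CONSER_HOURS = 2.5


def estimate_staff_hours(items_to_scope: int, conser_records: int) -> float:
    """Rough staff hours for scope determination plus CONSER record creation.

    Both counts are hypothetical inputs, not numbers from the presentation.
    """
    scope_hours = items_to_scope * SCOPE_MINUTES / 60
    record_hours = conser_records * CONSER_HOURS
    return scope_hours + record_hours


if __name__ == "__main__":
    # Illustrative only: 100 items reviewed for scope, 10 of them needing CONSER records.
    print(f"{estimate_staff_hours(100, 10):.1f} staff hours")
```

Even with small made-up counts, the record creation step dominates, which fits the later caution about additional harvests creating a backlog.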

also, outline of workflow document

need to find web harvesting page for notes, numbers, etc.

scope determination takes a lot of time

list of some questions considered:

- is it published by a us gov’t agency

- is the information covered by copyright

- who is the copyright holder?

- is the primary source of funding for it gov’t money?

- does it contain data that may violate a citizen’s privacy?
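One way to picture the scope questions above is as a checklist record that a reviewer fills in per publication. This is only a sketch in Python: the class and field names are invented here, it is not GPO's workflow or tool, and it deliberately does not decide whether something is in scope, since that still takes human judgment.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScopeReview:
    """Answers to the scope questions listed above; None means not yet answered."""

    published_by_us_agency: Optional[bool] = None
    covered_by_copyright: Optional[bool] = None
    copyright_holder: Optional[str] = None
    primarily_government_funded: Optional[bool] = None
    may_violate_privacy: Optional[bool] = None

    def unanswered(self) -> List[str]:
        """Names of the questions a reviewer still has to answer."""
        open_items = [f.name for f in fields(self) if getattr(self, f.name) is None]
        # The holder question only applies when the item is under copyright.
        if self.covered_by_copyright is False and "copyright_holder" in open_items:
            open_items.remove("copyright_holder")
        return open_items


if __name__ == "__main__":
    review = ScopeReview(published_by_us_agency=True, covered_by_copyright=False)
    print(review.unanswered())
```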

acquisitions staff are being trained to create brief bib records for monographs in ILS

cataloging librarians are creating CONSER records

special materials:

- news releases, transmittals, forms, announcements

- included in list of special materials in the monthly catalog

- investigating methods to provide bibliographic access to this material

trying to figure out what to do with this - a LOT on epa site

cataloging guidelines are going to be updated this year

examples of special material - screen capture on slide

step-by-step overview - what happens to monographs and serials

[note to self: are we adding new items to our marcive profile???]

related projects:

metadata extraction:

- odu project. using 1K pubs from the epa pilot project

- currently developing software rules and designing templates

special material:

- demo project to examine depository participation in harvesting activities

- assist gpo in creating brief bib records for items in the special materials category

- opportunity for 5 depository librarians

- more information on fdlp-l as plans develop

- develop basic criteria for metadata, what would be useful; look at workflow for dealing with special material

partially harvested publications:

- demo project to examine depository participation in harvesting activities

- locate and harvest all the parts of partially harvested publications from epa pilot project

- complete the harvest of 150 publications

- 3 month project

- opportunity for 5 dep libns

- must have time to devote from june to beginning of september

- must have ftp ability

- call for volunteers will be posted to fdlp-l

Q&A:

Ken Wiggin (KW): appears to be a pilot that didn’t work. where do you plan to go now that you know how difficult it is? (What is lesson learned?) shouldn’t you start with metadata extraction and then harvest?

Ric Davis (RD): didn’t turn out the way we’d hoped. defined parameters and ‘threw technology at it.’ this was a beta test to further define requirements for fdsys, so that we don’t procure a harvesting tool that gives us this messy data.

RHM: it’s not that it didn’t work - we didn’t know enough to make specifications very clear. some metadata was separate, some was matched with pub. we learned a lot from this, and we know that we need to apply these lessons to the fdsys, but we need to be more careful in applying our requirements. we’ve also been able to pursue the discussion of automated metadata extraction. one caution: additional harvests create a huge backlog of materials that will require time to go through.

Laurie Hall (LH): benefits to harvest - lots of new documents that we didn’t have before. leveraged a lot of the learning experience to help our internal workflow. opportunities to train staff on new tasks such as creating brief bibs.

Peter Hemphill (PH): given the results, web harvesting is most effective on well-structured sites with well-structured content. what’s the ROI? working thru this vs. working with fdsys/epa/api - giving them tools to submit content. (as opposed to trying the web harvest approach)

RHM: ROI is pretty costly for what we’ve already gathered, but we can’t put it aside. as far as working with agencies, i hope that’s what fdsys will do for us. changing nature of the web prohibits an absolute. while we want to deal with these things now, we are looking at a development of templates, metadata extraction, etc., that we hope will deal with the bulk of the material.

Chris Greet (CG): this was remarkably successful in terms of positive yield. 62% new publications is a good thing! peter’s point is well-taken - this suggests that there is a lot out there, but there’s so much that the manual approach is too labor-intensive. a fundamentally different approach is necessary, but the technology at this stage is inadequate.

Mark Sandler: hear parallel discussions going on in ARL libraries re: scholarly communication. new types of communications, data, etc. and concern is that it only becomes more and more - finished publication is one thing, but the process/communication is happening ‘further and further upstream.’ huge backlog is only going to grow - and is going to get worse when dealing with political and gov’t information as well. continuing to explore this issue is important.

Tim Byrne: when council made web harvesting one of our priorities, it was because every day materials disappear. if you aren’t going to start harvesting immediately as much as you can, we’re going to lose it.

KW: there are harvesting tools out there. we need to learn quickly from this project, how we better do this. retrieved a lot of pieces of information. some human intervention is required. having depository libraries volunteer is not a long-term solution.

John Shuler: sign of hope, involving depository libraries. being able to distribute this burden throughout all dep libraries is a good thing (?????????) begins talking about GIO.

RHM: want to reiterate that we do continue to work with agencies on harvesting methods and best practices. are a part of the CENDI group on web harvesting, etc.

CG: at what point does this become completely unrealistic?

RHM: what are our options for the future? is the development of a bib record for a publication an approach that we continue to take, or do we have to look for a broader approach? use the archivists’ model?

RD: tie back to tim’s point. we don’t have the luxury of giving up. a lot of work went into defining requirements for the vendors. early proponents of harvesting that mentioned they could do this decided not to bid. we can’t have all of this dumped in our lap & look to the library community - we need to further develop technology to aid this.

Katrina: Robin, you suggested earlier that you have to catalog it to comply with title 44.

RHM: cataloging & indexing requirements would have to be changed in title 44. if we create a brief bib record now, that is bib control. what is a cataloging record?

tim: you just have to make it better than the old monthly catalog.

Geoff Swindells (GS): go back to scholarly communication. libraries aren’t trying to sit back and see what rolls in the door (or doesn’t); are being partners with faculty, involved in understanding how communication is changing, etc. one approach to take is working with agencies - setting standards for publication & organization, and tools and processes, etc.

audience Q&A:

what has epa done to assist you in this work?

rhm: epa allowed us to harvest their sites and databases, behind firewalls, etc. they made no commitment, has been no follow-up. they are participating in cendi group and have strong discussions on ROI of dealing with this material.

kathy hale: this is a federal and state problem; thanks for putting templates, etc., out there.

rich gause: old problem with fugitive docs. still need to put teeth in t44.

RD: allows us to take the discussion out of the theoretical level, and present actual numbers.

Joanne Beezley (Pittsburg State): how can i get these records in my catalog? cgp and not oclc

LH: talk to linda resler re: z39.50 access from your catalog to the cgp. that record set can be pulled out. will be a session tomorrow.

arlene weible: exact same issue on state level. these are not publications as they have traditionally been. need to look at differently. we’ve been looking at cataloging process - what do we still need to do in online environment? classification, etc. so what are you talking about in terms of your process?

LH: brief bib is pretty brief. but still doing sudoc and item numbers. would we have anarchy if we removed them? sudoc and item number scheme is very restrictive, but aren’t sure that we can remove.

arlene: these are tools we’ve used in a published publication environment. do we need the same tools in a different environment? still need human intervention re: scope. automated systems can improve this, but still need judgment.

mary martin:  web harvesting is the answer, but what’s the question? we’re not necessarily interested in providing access to this information. do we know what kinds of documents are more likely to be used? can we narrow the scope to this?

RHM: is part of the discussion we still need to have. t 44 still says wide net. that’s what we go for.

Jane Kelsey:  another state library person. reality: it does say throw the wide net. as a historical society person, if we get too selective, what are we going to lose that we’re going to need? something that needs to be taken into consideration.

sandee mcaninch: some regional libraries would like to provide access to a good portion of the wide net, so i would encourage that technique.

question: are we not ever going to see the brief bibs for monographs in oclc? that’s the cost-effective way for us as a regional to gather these records.

LH: not going to answer that right now. is an additional step in the process. needs to be considered in terms of resources, etc. won’t say yes or no yet.

Peter Kraus: 2 years ago utah passed legislation that required agencies to make information available in a preformatted way to the state library.

GS: what kind of compliance are you getting? MO has a similar law.

PK: in utah, the legislature is more powerful than the governor. [did not give figure]

Jeff Bullington: agrees that we need to cast a relatively wide net. reason: changing ways of communication, how information is packaged, etc.

Barbie Selby:  following on sandee’s comment re: brief records: how can they be identified in cgp - in bulk?

LH: have been talking w/oclc re: batch loading. still waiting for info from oclc.

GS: difference between a broad net and everything. take lessons from archivists.

Bill Olbrich:  for years we have had a choice re: items we can select and not select. have also had mocat filled w/nondepository items. don’t worry about batch-loading into oclc; give us the titles and let us have the choice of whether to add them or not.

Posted in DLC, FDLP, web harvesting

Alabama State Publications

Posted by Valerie on February 22, 2008

The Alabama Department of Archives and History (ADAH) is working on a variety of projects designed to make Alabama state publications more accessible to the citizens of this state. I am most excited about their electronic collection of Alabama State Publications. This collection contains “the State of Alabama Comprehensive Annual Financial Report and various publications from the Office of the Governor. Also included are annual reports, monographs, and periodicals from various state agencies.” All of these publications are born-digital in nature (either received in an email from the agency, or harvested from an agency web site).

Another approach that ADAH is taking to preserve web-published state documents is Archive-It, a service of the Internet Archive. This tool attempts to capture an entire agency web site, not just their publications. Visit the Archive-It collections page to see all of the different Alabama sites that are available.

(For documents librarians this is quite exciting - the state of Alabama hasn’t had much of a state publications program in the print environment, so it’s nice to see folks trying to preserve what’s published digitally!)

Posted in Alabama, state docs, web harvesting

DLC: Web Harvesting

Posted by Valerie on April 27, 2007

Web Harvesting [Matt Landgraf/Kathy Brazee - GPO]

major issues:

  • assignment of purls or successor system [handles]
  • sudocs policies
  • cooperative cataloging
  • harvesting complete publications

PURLs/handles

  • publishing agencies prefer the purl be directed to the live copy on their web sites, which increases the visibility of their web sites
  • current policy leads to much more work in terms of purl maintenance

SuDocs Policies:

cooperative cataloging:

  • exploring the use of cooperative cataloging partnerships as an additional method for completing bib records for web harvested content. Procedures & quality control mechanisms need to be in place
  • also need to test the z39.50 gateway to allow for easy transfer of bib records

harvesting complete pubs:

  • at least 25% of the in-scope content represents only a section or a portion of a complete publication

ongoing technology discovery

  • continuing to develop more fully automated harvesting tools & methodologies in preparation for full implementation under the FDSys

related issues:

  • grouping of portions of docs into one
  • cataloging
  • determine in or out of scope
  • SOD 304 policy statement states that agency permission is needed
  • provides guidance and instruction for harvesting of publications
  • changes may be made as related policies are reviewed & developed that affect the management of the harvesting process as well as the files themselves

brainstorming related policies & procedures

  • scope
    • fdlp
    • cataloging & indexing
    • online access to pubs
    • publishing agency guidance
      • omb circ a-130
      • e-gov’t initiatives
  • cataloging priorities (in cat guidelines)
  • collection development

See also GPO’s Web Publication Harvesting White Paper

Posted in DLC, FDLP, web harvesting