'Bama Docs

A look at government information from the Yellowhammer State.

Archive for April 1st, 2008

DLC: Web Harvesting, 3/31 p.m.

Posted by Valerie on April 1, 2008

Robin Haun-Mohamed (RHM)

overview of web harvesting at gpo [slides]

assumptions:

- gpo will continue to participate in web harvesting efforts to obtain in-scope material for the fdlp and the cataloging and indexing program as required under 44 USC

- gpo is bound by congressional appropriations for the S&E funding requirements for the FDLP and C&I programs

- all materials identified for inclusion in the fdlp must be brought under bibliographic control by the C&I program

- gpo does not have the authority to either give funding or gifts or to receive them. all partnerships must represent an equal exchange between all parties.

- automated web harvesting initiatives will become systematic as part of release 2 of fdsys

- materials harvested under the epa pilot project are being made available as staff time and processing permit. completion of the processing of this material will necessarily require an automated metadata extraction process that does not yet exist.

harvesting efforts:

- focus is on pdf files; focus is on publications. re: databases: we would prefer to partner with agencies producing databases, rather than try to capture them

semi-manual harvesting efforts:

- use a tool to schedule content capture and re-harvest

sample of results:

18% were already cataloged

3% previously distributed in tangible format

2% not within scope

62% new publications
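
A quick tally of the sample figures above, as a sanity check. The category labels are shorthand from these notes, and the "not recorded" remainder is simply what is left after adding the four shares; the slides may break it down further.

```python
# Shares of the harvested sample, as recorded in the notes above (percent).
sample_shares = {
    "already cataloged": 18,
    "previously distributed in tangible format": 3,
    "not within scope": 2,
    "new publications": 62,
}

accounted = sum(sample_shares.values())
unaccounted = 100 - accounted  # remainder is not explained in these notes

for label, pct in sample_shares.items():
    print(f"{pct:>3}%  {label}")
print(f"{accounted:>3}%  total recorded")
print(f"{unaccounted:>3}%  not recorded in these notes")
```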

processing times breakdown:

scope determination takes 17 minutes

conser standard record creation: 2.5 hours

[others on slides]
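
A rough sketch of what those per-item times could mean in staff hours. Only the 17-minute and 2.5-hour figures come from the slides as noted here; the batch of 150 items with 20 serials is a made-up example, and assuming CONSER records are created only for the serials is my own simplification.

```python
SCOPE_MINUTES_PER_ITEM = 17      # scope determination, from the slides
CONSER_HOURS_PER_RECORD = 2.5    # CONSER standard record creation, from the slides


def staff_hours(items: int, serials: int) -> dict[str, float]:
    """Estimate hours for scope determination on all items and
    CONSER record creation on the serials among them."""
    scope_hours = items * SCOPE_MINUTES_PER_ITEM / 60
    conser_hours = serials * CONSER_HOURS_PER_RECORD
    return {
        "scope": scope_hours,
        "conser": conser_hours,
        "total": scope_hours + conser_hours,
    }


# Hypothetical batch: 150 items, 20 of them serials.
estimate = staff_hours(items=150, serials=20)
for step, hours in estimate.items():
    print(f"{step}: {hours:.1f} hours")
```

The other processing steps on the slides are not included, so this understates the real total.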

also, outline of workflow document

need to find web harvesting page for notes, numbers, etc.

scope determination takes a lot of time

list of some questions considered:

- is it published by a us gov’t agency?

- is the information covered by copyright?

- who is the copyright holder?

- is the primary source of funding for it gov’t money?

- does it contain data that may violate a citizen’s privacy?
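
One way to picture the checklist is as a small record of answers per item. This is only an illustrative sketch: the field names are made up for this post, and the rules for when an item gets flagged are my assumption, not GPO's actual procedure. Scope still takes human judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScopeAnswers:
    """Answers to the scope questions listed above (field names are hypothetical)."""
    published_by_us_agency: bool
    covered_by_copyright: bool
    copyright_holder: Optional[str]
    primarily_government_funded: bool
    may_violate_privacy: bool


def review_flags(a: ScopeAnswers) -> list[str]:
    """Return the reasons an item might need a closer look by a person.

    The flagging rules are assumptions for illustration only.
    """
    flags = []
    if not a.published_by_us_agency:
        flags.append("not published by a US gov't agency")
    if a.covered_by_copyright:
        holder = a.copyright_holder or "unknown"
        flags.append(f"covered by copyright (holder: {holder})")
    if not a.primarily_government_funded:
        flags.append("primary funding is not gov't money")
    if a.may_violate_privacy:
        flags.append("may contain data that violates a citizen's privacy")
    return flags


# Hypothetical item: agency-published, but copyright status needs checking.
item = ScopeAnswers(
    published_by_us_agency=True,
    covered_by_copyright=True,
    copyright_holder=None,
    primarily_government_funded=True,
    may_violate_privacy=False,
)
print(review_flags(item))
```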

acquisitions staff are being trained to create brief bib records for monographs in ILS

cataloging librarians are creating CONSER records

special materials:

- news releases, transmittals, forms, announcements

- included in list of special materials in the monthly catalog

- investigating methods to provide bibliographic access to this material

trying to figure out what to do with this – a LOT on epa site

cataloging guidelines are going to be updated this year

examples of special material – screen capture on slide

step-by-step overview – what happens to monographs and serials

[note to self: are we adding new items to our marcive profile???]

related projects:

metadata extraction:

- odu project. using 1K pubs from the epa pilot project

- currently developing software rules and designing templates

special material:

- demo project to examine depository participation in harvesting activities

- assist gpo in creating brief bib records for items in the special materials category

- opportunity for 5 depository librarians

- more information on fdlp-l as plans develop

- develop basic criteria for metadata, what would be useful; look at workflow for dealing with special material

partially harvested publications:

- demo project to examine depository participation in harvesting activities

- locate and harvest all the parts of partially harvested publications from epa pilot project

- complete the harvest of 150 publications

- 3 month project

- opportunity for 5 dep libns

- must have time to devote from june to beginning of september

- must have ftp ability

- call for volunteers will be posted to fdlp-l

Q&A:

Ken Wiggin (KW): appears to be a pilot that didn’t work. where do you plan to go now that you know how difficult it is? (What is lesson learned?) shouldn’t you start with metadata extraction and then harvest?

Ric Davis (RD): didn’t turn out the way we’d hoped. defined parameters and ‘threw technology at it.’ this was a beta test to further define requirements for fdsys, so that we don’t procure a harvesting tool that gives us this messy data.

RHM: it’s not that it didn’t work – we didn’t know enough to make specifications very clear. some metadata was separate, some was matched with pub. we learned a lot from this, and we know that we need to apply these lessons to the fdsys, but we need to be more careful in applying our requirements. we’ve also been able to pursue the discussion of automated metadata extraction. one caution: additional harvests create a huge backlog of materials that will require time to go through.

Laurie Hall (LH): benefits to harvest – lots of new documents that we didn’t have before. leveraged a lot of the learning experience to help our internal workflow. opportunities to train staff on new tasks such as creating brief bibs.

Peter Hemphill (PH): given the results, web harvesting is most effective on well-structured sites with well-structured content. what’s the ROI? working thru this vs. working with fdsys/epa/api – giving them tools to submit content. (as opposed to trying the web harvest approach)

RHM: ROI is pretty costly for what we’ve already gathered, but we can’t put it aside. as far as working with agencies, i hope that’s what fdsys will do for us. changing nature of the web prohibits an absolute. while we want to deal with these things now, we are looking at a development of templates, metadata extraction, etc., that we hope will deal with the bulk of the material.

Chris Greer (CG): this was remarkably successful in terms of positive yield. 62% new publications is a good thing! peter’s point is well-taken – this suggests that there is a lot out there, but there’s so much that the manual approach is too labor-intensive. a fundamentally different approach is necessary, but the technology at this stage is inadequate.

Mark Sandler: hear parallel discussions going on in ARL libraries re: scholarly communication. new types of communications, data, etc. and concern is that it only becomes more and more – finished publication is one thing, but the process/communication is happening ‘further and further upstream.’ huge backlog is only going to grow – and is going to get worse when dealing with political and gov’t information as well. continuing to explore this issue is important.

Tim Byrne: when council made web harvesting one of our priorities, it was because materials disappear every day. if you aren’t going to start harvesting immediately as much as you can, we’re going to lose it.

KW: there are harvesting tools out there. we need to learn quickly from this project, how we better do this. retrieved a lot of pieces of information. some human intervention is required. having depository libraries volunteer is not a long-term solution.

John Shuler: sign of hope, involving depository libraries. being able to distribute this burden throughout all dep libraries is a good thing (?????????) begins talking about GIO.

RHM: want to reiterate that we do continue to work with agencies on harvesting methods and best practices. are a part of the CENDI group on web harvesting, etc.

CG: at what point does this become completely unrealistic?

RHM: what are our options for the future? is the development of a bib record for a publication an approach that we continue to take, or do we have to look for a broader approach? use the archivists’ model?

RD: tie back to tim’s point. we don’t have the luxury of giving up. a lot of work went into defining requirements for the vendors. early proponents of harvesting that mentioned they could do this decided not to bid. we can’t have all of this dumped in our lap & look to the library community – we need to further develop technology to aid this.

Katrina: Robin, you suggested earlier that you have to catalog it to comply with title 44.

RHM: cataloging & indexing requirements would have to be changed in title 44. if we create a brief bib record now, that is bib control. what is a cataloging record?

tim: you just have to make it better than the old monthly catalog.

Geoff Swindells (GS): go back to scholarly communication. libraries aren’t trying to sit back and see what rolls in the door (or doesn’t); are being partners with faculty, involved in understanding how communication is changing, etc. one approach to take is working with agencies – setting standards for publication & organization, and tools and processes, etc.

audience Q&A:

what has epa done to assist you in this work?

rhm: epa allowed us to harvest their sites and databases, behind firewalls, etc. they made no commitment, has been no follow-up. they are participating in cendi group and have strong discussions on ROI of dealing with this material.

kathy hale: this is a federal and state problem; thanks for putting templates, etc., out there.

rich gause: old problem with fugitive docs. still need to put teeth in t44.

RD: allows us to take the discussion out of the theoretical level, and present actual numbers.

Joanne Beezley (Pittsburg State): how can i get these records in my catalog? cgp and not oclc

LH: talk to linda resler re: Z39.50 access from your catalog to the cgp. that record set can be pulled out. will be a session tomorrow.

arlene weible: exact same issue on state level. these are not publications in the traditional sense. need to look at them differently. we’ve been looking at cataloging process – what do we still need to do in online environment? classification, etc. so what are you talking about in terms of your process?

LH: brief bib is pretty brief. but still doing sudoc and item numbers. would we have anarchy if we removed them? sudoc and item number scheme is very restrictive, but aren’t sure that we can remove.

arlene: these are tools we’ve used in a published publication environment. do we need the same tools in a different environment? still need human intervention re: scope. automated systems can improve this, but still need judgment.

mary martin:  web harvesting is the answer, but what’s the question? we’re not necessarily interested in providing access to this information. do we know what kinds of documents are more likely to be used? can we narrow the scope to this?

RHM: is part of the discussion we still need to have. t 44 still says wide net. that’s what we go for.

Jane Kelsey:  another state library person. reality: it does say throw the wide net. as a historical society person, if we get too selective, what are we going to lose that we’re going to need? something that needs to be taken into consideration.

sandee mcaninch: some regional libraries would like to provide access to a good portion of the wide net, so i would encourage that technique.

question: are we not ever going to see the brief bibs for monographs in oclc? that’s the cost-effective way for us as a regional to gather these records.

LH: not going to answer that right now. is an additional step in the process. needs to be considered in terms of resources, etc. won’t say yes or no yet.

Peter Kraus: 2 years ago utah passed legislation that required agencies to make information available in a preformatted way to the state library.

GS: what kind of compliance are you getting? MO has a similar law.

PK: in utah, the legislature is more powerful than the governor. [did not give figure]

Jeff Bullington: agrees that we need to cast a relatively wide net. reason: changing ways of communication, how information is packaged, etc.

Barbie Selby:  following on sandee’s comment re: brief records: how can they be identified in cgp – in bulk?

LH: have been talking w/oclc re: batch loading. still waiting for info from oclc.

GS: difference between a broad net and everything. take lessons from archivists.

Bill Olbrich:  for years we have had a choice re: items we can select and not select. have also had mocat filled w/nondepository items. don’t worry about batch-loading into oclc; give us the titles and let us have the choice of whether to add them or not.

Posted in DLC, FDLP, web harvesting

DLC: FDSys, 3/31 p.m.

Posted by Valerie on April 1, 2008

Mike Wash (MW), GPO CIO [slides]

what changed? – gpo reached agreement with harris to restructure the contract in mid-feb 2008. gpo has assumed overall program responsibilities and redefined Harris’ role.

why change? system design and development progress was falling behind expectations. gpo felt that their program mgmt experience was greater than that of harris. felt like it’d be a lower risk to take over more responsibility.

now what? gpo has overall program mgmt responsibility. has contracted subject matter experts to assist in system design activities and program support tasks. harris is providing software development resources. restructured team is proceeding to deliver the first release.

what subject matter expertise? (hired on contract); subjects include searching, what else? [see questions/answers below]

Selene Dalecky (SD) – program manager for fdsys

release development:

last summer – release 1b. prototype in break room

release 1c – first phase targeted for 2008 (3 phases altogether); subsequent releases every 6 months. 1c is first public release.

3rd phase – year later. allows electronic content submission

releases 2&3 will complete functionality. additional search, submission enhancements. style tools for content creation will be available.

harvesting and preservation will kick in

release 1c overview:

includes functionality of a public release – scaling the system infrastructure, enabling the submission of content to the system, building a digital repository that conforms to the oais reference model and enables the mgmt of content and metadata, providing modern search and access tools.

users will be able to select content collections, basic and advanced search. trying to make it comparable to existing, easy sites.

authorized gpo users will be able to monitor and refine search functionality. will allow indexing by search engines.

release 1c, first phase:

manage content and metadata in content packages

exchange descriptive metadata between fdsys and ils [will be automated? sync between cataloging and metadata? hmm]

begin replacement of gpo access system and migration of existing content collections [top 25 collections on gpo access]; systems will be parallel until full functionality exists on fdsys. then gpo access [wais] will go away. will there be any other backup??

authentication will continue.

2nd phase (mid-2009):

enhance search and access functionality

provide congressional submission of content and jobs

assign persistent names (handles) to content
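
For anyone unfamiliar with handles: a handle is a persistent identifier of the form prefix/suffix, and the public Handle System proxy at hdl.handle.net redirects it to wherever the content currently lives. A minimal sketch of turning a handle into a resolvable URL; the prefix and suffix below are made-up placeholders, not identifiers that GPO has assigned.

```python
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net/"  # public Handle System proxy


def handle_url(prefix: str, suffix: str) -> str:
    """Build a proxy URL for a handle of the form prefix/suffix."""
    if not prefix or not suffix:
        raise ValueError("a handle needs both a prefix and a suffix")
    # Percent-encode characters in the suffix that are not URL-safe.
    return f"{HANDLE_PROXY}{prefix}/{quote(suffix, safe='/')}"


# Hypothetical handle, for illustration only.
print(handle_url("12345", "example-document 1"))
```

The point of the indirection is that the link a library puts in its catalog stays the same even if the file moves on the server side.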

3rd phase: late 2009

- provide government agency submission of content and jobs

- provide a documented interface (api) to allow search by non-gpo systems

- continued access enhancements (enable navigation of relationships between publications, rss/email notifications)

detailed design review scheduled for mid-june 2008

developing key milestones and dates for release 1c. will work with SMEs on plan to engage stakeholders in the areas of usability, testing, and training

updated documentation on fdsys web site

Carrie Gibb:

communications update:

proof-of-concept demo given to more than 20 groups of stakeholders; participation in events held by industry leaders

also make demos available on fdsys site: http://www.gpo.gov/projects/fdsys_status.htm

plan to begin using the fdsys blog more – in order to exchange ideas with stakeholders, etc.

email pmo@gpo.gov if you’re interested in participating in fdsys activities [focus groups, etc.]

questions from council:

- change in integrator function is pretty significant. has there been a substantial change in staffing? process of software acceptance – how are you doing that review? is it worth having an independent evaluator? what’s your recourse if …?

MW: have several different layers of software acceptance. on the development side, harris is responsible for designing own test plans and design validation testing. gpo is responsible for overall systems testing, user acceptance testing, and basic design testing ?(?); IT&S staff within gpo has a test organization that is responsible for writing overall systems test cases. program mgmt office will be writing user acceptance specs and beta testing. independent validation and verification testing will be auditing results of the test to make sure that all aspects have high integrity.

recourse: overall system responsibility is GPO’s. manage configuration & change mgmt w/in system, so that GPO has complete visibility into change requirements.

Chris Greer (CG): api: that piece will determine overall success of venture and partners – most important piece of venture. said it has search capability, but would imagine that various partners would want to mesh their deposition with your acquisition systems, that would use tools way beyond search. what are goals of api layer? role of partners (agencies and fdls) in process?

SD: refers to Lisa LaPlant, lead planner for access portion of FDsys. very involved with api.

LL: something slated for 3rd phase b/c we want to have the system foundation in place, other tools integrated before building out apis and connections. want to work with community to flesh out more api goals on public and agency sides. we started discussions w/some folks last summer in the library community and will continue to discuss.

CG: encourage careful discussion of goals and would like to further discuss this.

Peter Hemphill (PH): outreach efforts by gpo to various agencies, methods of communications. have you been able to gauge the level of commitment on behalf of agencies, willing to send information to gpo?

Kirk Knoll (KK) – handles submission process. working very closely with interagency council for digital content submission. gone through system, sys requirements, asked what they want, if they’d be willing to use it. they’re pretty excited, and have been looking forward to this for a while.

PH: what about other agencies?

KK: carrie gibb has done more outreach to agencies & gotten positive feedback.

Carrie Gibb: met with agencies across the U.S. disappointed that it’s still in progress; very excited about

PH: able to provide them advance materials so that they’re prepared to send?

Carrie Gibb: have provided overall information, done some beta testing.

SD: early on in Release 1C is congressional submission, so the clerk of the house and sec of senate have been working with gpo. feedback and work from congress is allowing folks to learn more and apply lessons to fed agency partners.

PH: concerned about schedule if detailed design is [couldn't hear]

SD: in stage leading to detailed design review. move out of analysis and into design and development. have architecture and supporting materials necessary to move into phase 5. still very close to where we need to be to have a 2008 release. did have to change some releases of functionalities…

PH: concern that overall functionality –

GS: what necessitated change with harris?

MW: rfp process worked really well. other agencies have cited fdsys rfp as best practice. what prompted the change was the domain expertise required for this kind of system – better suited for internal program office. have also been watching other programs within federal gov’t that started out with master integrator approach – other agencies are starting to take a similar look – is this the right approach?

RFP – the more the gov’t can do to specify what they want to accomplish, the higher quality the responses. careful monitoring of gpo staff on progress of program led to conclusion that change was necessary.

John Shuler: will you be able to describe how this will impact the day-to-day operations of docs librarians?

SD: reference day-in-the-life presentations from previous conferences. update to that would be beneficial at this point – perhaps can put together. possibly something through the OPAL service, in order to get it out sooner than fall dlc.

CG: told us that changes don’t significantly affect timeline. what does this do to cost?

MW: we don’t believe that it will impact schedule because: started doing parallel design activities last fall, when beginning to question harris [fall-back approach]; allowed gpo to quickly get up to speed if we were to change tracks.

cost: believe that we’ll be in a favorable position on that. more custom code was part of the master integrator solution, different than the original approach gpo had counted on. custom code = more expensive. gpo went back to off-the-shelf vision with parallel programming, and took on more responsibility, so it looks like it might cost less.

Mary Alice Baish: outreach: haven’t mentioned the courts. what are your plans to communicate with them?

What kinds of subject experts have been contracted?: those familiar with some of these products: FAST search, usability, documentum, disaster recovery stuff, mitre organization [fed gov't company, programming, etc.] didn’t catch who else.

Barbie Selby: open access – how will this happen? api

me: exchange of metadata between fdsys and ils – how will that work?

Gil Baldwin: there is a workflow [high-level] in the requirements description. release 1c will have a two-way exchange of data, as opposed to the one-way ? in 1b. gil can show me exact location if i want.

[lost rest of q/a]

Posted in DLC, FDLP