Digital archive from scratch, Solomon Blaylock #InternetArchive @SolomonBlaylock

Presentation at http://conferences.infotoday.com/documents/319/A205_Blaylock.pptx

Until recently at Middlebury Institute of International Studies, Monterey

When he started at the school, he made a point of asking people, “What are you working on?”

A student, retired from special operations, is working on a special-operations research database going back to World War II.

File types: videos, PDFs, Word docs, images, streaming links. Proof-of-concept site on WordPress.

Challenges:
* Standardization, naming conventions
* Site organization
* Improved searchability
* Scalability

* WordPress update
* Project plan: interviews, task list
* Resources: no time from I.T. staff; he had to do most of it himself and use his network of contacts. UCLA’s intro to digital humanities; the Getty’s intro to metadata; Dublin Core guides.
* Workflow
* Guide, so he could hand it off to the project team. Data input standards. How to upload to Omeka. How to upload videos to YouTube and have it generate auto-transcriptions.

Installed Omeka on a server. I.T. wanted to vet any plug-ins they wanted to use.

Documentation at: https://library.woodbury.edu/c.php?g=878987


Crawled & Collected, now what? Access & discovery in web archives #InternetLibrarian @IndustryDocs @StanfordLibs @UCDavisLibrary @archiveitorg

Slides at Google Docs

Jillian Lohndorf, Internet Archive

Largest web archive in existence. Web archives aim to collect as much of the content/code as possible so it looks as close as possible to the original experience.

Topical collections.

Web history for a specific institution: records retention, FOIA laws, historical record. (Also national libraries collecting web sites from their countries.)

* Capture: Heritrix
* Storage: WARC is industry standard, redundant storage
* Access: A playback mechanism is necessary; Wayback has its own.
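
To give a concrete sense of the access layer, here is a minimal sketch (mine, not from the talk) that reads a WARC file with the open-source warcio library and lists the captured URLs; the file name is a placeholder.

```python
# Minimal sketch: iterate over a WARC file and list captured URLs.
# Requires the open-source warcio library (pip install warcio).
# "example.warc.gz" is a placeholder file name.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the archived HTTP responses.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            ctype = record.http_headers.get_header("Content-Type")
            print(url, ctype)
```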

Additional considerations:
* Search (Archive-It)
* Metadata

Integrations:
* Catalog
* Web site
* WorldCat

Can create derivative files: metadata, visualization

Kris Kasianovitz, Stanford

Archive of state and local government web sites: the *.ca.gov web space. Partners: California Digital Library, State Library, State Archives, U. of California, Stanford.

Using Archive-It.

700+ seed URLs.

Realized they were missing metadata. Collection-level records on WorldCat. Did a “metadata sprint” with a call for volunteers. Used Dublin Core fields: coverage, subject, languages, etc.
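
As an illustration only (the values are invented, not from the talk), a collection-level record using the Dublin Core fields they mentioned might look like this:

```python
# Hypothetical collection-level Dublin Core record. The field names are
# standard DC elements; the values are made up for illustration.
collection_record = {
    "title": "California State Agency Web Sites",
    "coverage": "California",
    "subject": ["State government", "Public administration"],
    "language": ["eng", "spa"],
    "description": "Captures of *.ca.gov web sites collected with Archive-It.",
    "type": "Web archive",
}
```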

Agencies go away, so they take info from the “About” page.

Public libraries could call for suggestions for “seed” pages in their communities.

Kevin Miller, UC Davis

https://archive-it.org/collections/5778

Using Archive-It.

Archives sites related to the campus and Davis community. Including UC Davis individuals, such as prominent faculty. A way to capture their work (such as on blogs) before they retire. Using ORCiD to find UC Davis faculty and their publications.

Automated taxonomy pilot for Archive-It: a script determines the “aboutness” of a website.
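
The script itself wasn’t shown, but a toy version of an “aboutness” pass might rank a site’s most distinctive terms with TF-IDF, something like the sketch below (the page texts are placeholders; a real pipeline would first extract text from the crawled pages).

```python
# Toy "aboutness" sketch: rank distinctive terms across a site's pages
# using TF-IDF (scikit-learn). Page texts here are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

pages = [
    "UC Davis viticulture and enology research updates ...",
    "Faculty blog post on grapevine genetics and field trials ...",
]

vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
tfidf = vectorizer.fit_transform(pages)

# Sum scores across pages and print the top terms for the whole site.
scores = tfidf.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:10]
print(top_terms)
```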

Integration with:
* Library catalog
* Finding aids

Rachel Taketa, UCSF

https://idl.ucsf.edu

Industry documents from companies “that negatively impact public health” and have strategies to mislead the public. For example: tobacco, chemicals (e.g., Monsanto), pharmaceuticals.

Collect documents that are produced during lawsuits, documents from whistleblowers.

The tobacco industry site has about 15 million documents.

Coming soon: sugar industry.

E-cigarette advertising. Campaigns against cigarette taxes. (Campaign web sites go down the day after the election.)

Access through Archive-It and on their site.

Using a spreadsheet to cross-load metadata from Archive-It to IDL.
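
A sketch of that kind of cross-load (all column and field names here are invented; the real Archive-It export and IDL fields may differ):

```python
# Sketch: map columns from an Archive-It metadata export (CSV) to the
# field names a local system expects, writing a new CSV for loading.
# Column names on both sides are hypothetical.
import csv

FIELD_MAP = {
    "Title": "title",
    "Seed URL": "source_url",
    "Crawl Date": "date_archived",
}

with open("archiveit_export.csv", newline="") as src, \
     open("idl_load.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(FIELD_MAP.values()))
    writer.writeheader()
    for row in reader:
        writer.writerow({new: row.get(old, "") for old, new in FIELD_MAP.items()})
```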

Question re: copyright: Internet Archive leaves it up to their partners. UCSF doesn’t worry about it: the materials are commercial and electoral. Stanford: state and local government sites can be copyrighted. They depend on robots.txt files to alert them; if they run into that, they contact the agency. UC Davis: alert faculty and let them opt out (in itself an opportunity for outreach as well).

Updated to add links.

Digitizing and Archiving, Susie Kopecky #InternetLibrarian

Allan Hancock College, Santa Maria

Have an archive from the Hancock family, whose estate is the site of the college. Also have an archive devoted to the history of the college.

Hancock was a wealthy oilman in Southern California in the early 20th century.

They have approximately 60 large, flat archival boxes, 28 wider boxes, photo boxes, newspaper clippings, correspondence, etc. Previous librarians cleaned the items and moved them to acid-free containers. None were trained as archivists.

Sorted:
1. Cleaned and entered into an Access database.
2. Cleaned but not entered into the database.
3. Neither cleaned nor entered into the database.

A storage container suffered water damage.

Metadata: decide what info to enter and how to label items with unique identifiers. Former librarians volunteered.

* Year object created
* Accession no.
* Title
* Author
* Brief description
* Part of another collection?
* Subject
* Cross-listed events and individuals
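
One lightweight way to capture the fields above, sketched here with an invented sample row, is a plain CSV that a later system (Airtable, ArchivesSpace, etc.) could import:

```python
# Sketch: write the metadata fields listed above to a CSV. The column
# names are paraphrased from the list; the sample row is invented.
import csv

FIELDS = ["year_created", "accession_no", "title", "author",
          "description", "parent_collection", "subject", "cross_listed"]

sample = {
    "year_created": "1928",
    "accession_no": "AHC-0001",
    "title": "Telegram to Captain Hancock",
    "author": "Unknown",
    "description": "Telegram relating to one of Hancock's pursuits.",
    "parent_collection": "Hancock family papers",
    "subject": "Correspondence",
    "cross_listed": "G. Allan Hancock",
}

with open("hancock_items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(sample)
```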

Got a scanner.

Some of the first scans: telegrams and evidence of Captain Hancock’s pursuits.

Challenges:

* Scanner went missing
* Having limited time
* Not currently having volunteers
* MS Access is kind of clunky
* Scanning large amounts of info in a timely fashion. (Want to have something ready for the college’s 100th anniversary in 2020.)
* Want a cloud-based system to share collection

Hosting possibility: Airtable. Currently being used by a performing arts archive.

Software possibility: ArchivesSpace (successor to Archivist’s Toolkit)

Crowdsourcing ideas for the above. (History students as interns, California Digital Library)

Another problem: nitrate films are flammable. May not be able to keep them.

Suggestion: LC site on personal archiving. (Possibly http://www.digitalpreservation.gov/personalarchiving/)

Suggestion: Anything you do yourself takes lots of babying. ContentDM is expensive, but very nice.

Open source: Omeka S, Islandora

Brainstorming a content management program, Jaye Lapachet @JayeLapachet #InternetLibrarian

Jaye Lapachet, J8 Consulting

Slides here: http://www.jayelapachet.com/2018/10/17/internet-librarian-2018/

San Bruno fire: a PG&E gas line blew up, and PG&E had to go through pallets of documents. (SF Chronicle, 3/5/2011)

Companies come to her when they are about to do an IPO and need to find documents for the SEC.

Start somewhere: it can be paper or digital.

* Culture
* People
* Process
* Systems
* Audit & control

These categories are vague, but you can make them work for your organization; it has to work for your organization.

Culture: Try to disrupt ongoing business as little as possible.

People: Listen to ideas. Try to do things upfront that are quick wins. Findability can be one. People need to know that the way they do their work and find their information are being considered. They have to know that they’re being heard.

Process: Identify silos. Don’t segregate by format; when someone goes to look for information, it’s all in one place. Consider where content is needed. Information governance. Review taxonomies, but allow personal terms that may show up only for an individual user. Taxonomies need “care and feeding” (updating, etc.).

Systems: Not just buying tools. Inventory your systems, which could be tools or software, but could also be processes: what you have and how it’s being used. Expand those, and merge them when possible.

Audit & control: regulations, etc.

Get a champion. Have succession planning in place (for yourself as the content manager).

Team collaboration spaces (e.g., Microsoft Teams, Google documents).

Find people’s hidden talents. You can make a database from that.

Think outside the box. Can blockchain help with content management? It’s good for situations where people don’t trust each other. Walmart is testing it with food products. Maybe a QR code could keep track of who opened a document and changed it.
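
The talk only floated the idea, but the underlying notion (a tamper-evident log of who opened or changed a document, with each entry chained to the previous one by a hash) can be sketched in plain Python; this is not a blockchain, and the names are invented.

```python
# Minimal hash-chained audit log: each entry stores the hash of the
# previous entry, so altering history is detectable. A sketch of the
# idea only, not a real blockchain or product.
import hashlib
import json
import time

log = []

def record_event(user, action, doc_id):
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"user": user, "action": action, "doc": doc_id,
             "time": time.time(), "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)

record_event("alice", "opened", "contract-42")
record_event("bob", "edited", "contract-42")
print(json.dumps(log, indent=2))
```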

Question about products to share documents with specific people: Lucidea has a product called Presto.

Question about Google Drive and privacy: Google is going to know what you put up there. Make sure you use their business products. I don’t think you have any privacy with Google, but read your contract. Dropbox and Box might be better, since they are meant for business.

When you work for a company, any work you do is for the company. But if people are concerned about privacy, you can anonymize things.

You can read the contract. You can ask for changes.

Her web site is: www.jayelapachet.com.

Edited to add links.

Culture in Transit: Digitizing and Democratizing NYC’s Cultural Heritage #InternetLibrarian @AnneKZ

Anne Karle-Zenith

Slides here: http://conferences.infotoday.com/documents/259/E302_Karle-Zenith.pdf

Scanning program for METRO (New York City + Westchester)

Digital Culture of New York

(Everything gets harvested by the NY State digital library, then Digital Public Library of America)

Switched from ContentDM (an OCLC product) to Islandora for the database.

Blog: Culture in Transit

Toolkit coming soon.

Institutional scanning:

Had a goal to get to 10-15 institutions — small, but interesting libraries that didn’t have time/money/staff to do it themselves — in a year.  They got to 10 and scanned 1,600 items.  One person would go to the institution with portable scanning gear, spend about two weeks scanning, then another two weeks back at the office doing processing, metadata, etc.

Community Scanning:

Went to Brooklyn and Queens public libraries, scanned people’s materials and talked to them about what they were.  Returned digital copies to donors on a thumb drive.  Not just libraries: also schools, churches, cemeteries, bars.  Three to four staff who knew all phases of the project would go to a site.  Encouraged community groups (for example, Filipinos of Queens).

Computational Text Analysis #InternetLibrarian

Cody Hennessy, UC Berkeley

Slides are available here: http://conferences.infotoday.com/documents/259/A204_Hennesy(1).pptx

Not exactly my line of work, but interesting.

Group of people at UC Berkeley who do or are interested in text analysis/text mining/distant reading (as opposed to close reading). Hennessy attends so he can learn and advise (for example, not to download the whole ProQuest database, because it’s copyrighted and that would be a violation of the university’s license agreement).

The Congressional Record is a favorite source, because it’s in the public domain and includes both spoken and written text.
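
As a tiny illustration of the kind of analysis involved (mine, not something Hennessy showed), term frequencies in a public-domain plain-text file can be counted with nothing but the standard library; the file path is a placeholder.

```python
# Tiny distant-reading sketch: term frequencies in a plain-text file.
# "congressional_record.txt" is a placeholder path.
import re
from collections import Counter

with open("congressional_record.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

stopwords = {"the", "and", "of", "to", "in", "a", "that", "is", "for", "be"}
counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(20))
```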

Another blog post on this session: http://www.libconf.com/2016/10/19/computational-text-analysis-k-text-mining/

Digitizing #InternetLibrarian @CybrarianViews

Charlotte Spinner and Christine Rasmussen, AARP

Presentation here: http://conferences.infotoday.com/documents/259/A203_Spinner.pptx

Staff needed to be able to find articles in back issues of AARP: The Magazine, which goes back to 2003. The library decided to take it on. They wanted to use XML. Got approval from management and money from the publications department.

Different versions of the magazine for different age groups. Sometimes tiny variations in an article. Regional variations. A third of the database turned out to be content variations.
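
Their schema wasn’t shown, but one way edition and regional variations might be encoded in XML (element and attribute names are invented for illustration) is sketched below with the standard library:

```python
# Hypothetical sketch of an article record with edition/region variants.
# Element and attribute names are invented; the real AARP schema was
# not shown in the talk.
import xml.etree.ElementTree as ET

article = ET.Element("article", id="2005-03-017")
ET.SubElement(article, "title").text = "Sample Article Title"
variants = ET.SubElement(article, "variants")
ET.SubElement(variants, "variant", edition="50-59", region="national")
ET.SubElement(variants, "variant", edition="60plus", region="West")

print(ET.tostring(article, encoding="unicode"))
```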

A quarter of the issues were not available electronically at all. The rest had missing pages, etc.

Spawned new digitization projects, including Modern Maturity, which was published 1958-2003.  Also digitizing the founder’s papers.

Increased the visibility of the library.  Contracts with Ebsco and Gale, which will bring in money for the association.

1. It’s always harder than you think.

2. It always takes longer than you think.

3. It always costs more than you think. (actually under budget)

4. Pave the way.

5. Have solutions ready for the naysayers.

6. Roll up your sleeves.

7. (Gently) push, and push some more.

Richard Hulser, Natural History Museum of Los Angeles County

Scanning books for the Biodiversity Heritage Library.  Old books with odd fonts, smudges, ink bleed-through, and foxing don’t do well with OCR.

Used crowdsourced games to get the general public to fix OCR errors.  You can work on a word or phrase at a time.  The games are Beanstalk and Smorball.

Lessons learned: they didn’t select the game designer in advance; the designer then spent too long designing, which didn’t leave enough time to collect data within the grant period.  But they did determine that games are a viable way to improve OCR.  The games are open source and could be used by others.

Question about AARP’s XML conversion: Hired a company to do that part.

AARP database: Cuadra Star.

Another blog post about this session: http://www.libconf.com/2016/10/18/digitizing/