Crawled & Collected, now what? Access & discovery in web archives #InternetLibrarian @IndustryDocs @StanfordLibs @UCDavisLibrary @archiveitorg

Slides at Google docs

Jillian Lohndorf, Internet Archive

Largest web archive in existence. Web archives aim to collect as much of the content/code as possible so it looks as close as possible to the original experience.

Topical collections.

Web history for a specific institution: records retention, FOIA laws, historical record. (Also national libraries collecting web sites from their countries.)

* Capture: Heritrix
* Storage: WARC is industry standard, redundant storage
* Access: Playback mechanism is necessary. Wayback has its own.
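As a rough illustration of the access step: the public Wayback Machine addresses a capture with a 14-digit timestamp followed by the original URL, and redirects to the nearest capture it holds. The sketch below only builds such a URL. The helper name `wayback_url` and the example date are assumptions for illustration, and Archive-It collections use their own playback host, which is not shown here.

```python
from datetime import datetime

# Public Wayback Machine playback prefix.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a playback URL asking for the capture closest to `when`."""
    # Wayback timestamps are YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Example date chosen arbitrarily for illustration.
print(wayback_url("http://www.ca.gov/", datetime(2017, 10, 17, 12, 0, 0)))
```

Whether a capture actually exists near that date depends on what was crawled; the URL itself does not guarantee one.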

Additional considerations:
* Search (Archive-It)
* Metadata

Integrations:
* Catalog
* Web site
* WorldCat

Can create derivative files: metadata, visualization

Kris Kasianovitz, Stanford

Archive of state and local government web sites. *ca.gov web space. California Digital Library, State Library, State Archives, University of California, Stanford.

Using Archive-It.

700+ seed URLs.

Realized they were missing metadata. Collection-level records on WorldCat. Did a “metadata sprint,” call for volunteers. Used Dublin Core fields: coverage, subject, languages, etc.
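To make the metadata sprint a little more concrete, here is a minimal sketch of what one seed-level record might look like using the Dublin Core elements mentioned (coverage, subject, language). The record values, the wrapping `record` element, and the helper name `dublin_core_record` are all hypothetical; the notes do not say how the records were actually stored or exported.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 core Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, list[str]]) -> ET.Element:
    """Wrap repeatable Dublin Core fields in a single <record> element."""
    record = ET.Element("record")  # wrapper name is an assumption
    for name, values in fields.items():
        for value in values:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return record


# Hypothetical values a volunteer might enter for one seed URL.
seed_metadata = {
    "title": ["Example City Planning Department"],
    "coverage": ["California"],
    "subject": ["City planning", "Local government"],
    "language": ["en"],
}

record = dublin_core_record(seed_metadata)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Each field takes a list because Dublin Core elements are repeatable, which suits sites that cover more than one subject.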

Agencies go away. Take info from “About” page.

Public libraries could call for suggestions for “seed” pages in their communities.

Kevin Miller, UC Davis

https://archive-it.org/collections/5778

Using Archive-It.

Archives sites related to the campus and Davis community. Including UC Davis individuals, such as prominent faculty. A way to capture their work (such as on blogs) before they retire. Using ORCiD to find UC Davis faculty and their publications.

Automated taxonomy pilot for Archive-It. Script determines the “aboutness” of a website.
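These notes do not record how the pilot script works, so the following is not the UC Davis method. It is only a naive baseline for the general idea of guessing "aboutness": count the most frequent non-trivial words in a page's text. The stop-word list, the function name `top_terms`, and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "from", "are"}


def top_terms(text: str, n: int = 3) -> list[str]:
    """Return the n most frequent words, ignoring stop words and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(n)]


# Hypothetical page text.
sample_text = (
    "Viticulture research at the vineyard. "
    "Vineyard irrigation and viticulture extension. "
    "Viticulture courses."
)
print(top_terms(sample_text, 2))
```

A production approach would more likely map terms onto a controlled vocabulary so that results line up with catalog subject headings.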

Integration with:
* Library catalog
* Finding aids

Rachel Taketa, UCSF

https://idl.ucsf.edu

Industry documents from companies “that negatively impact public health.” Where they have strategies to mislead the public. For example, tobacco, chemicals (e.g., Monsanto), pharmaceuticals.

Collect documents that are produced during lawsuits, documents from whistleblowers.

Tobacco industry site has about 15 million documents.

Coming soon: sugar industry.

E-cigarette advertising. Campaigns against cigarette taxes. (Campaign web sites go down the day after the election.)

Access through Archive-It and on their site.

Using spreadsheet to cross load metadata from Archive-It to IDL.
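A spreadsheet-based cross-load like this usually comes down to renaming columns from one system's export to the other's import format. The sketch below shows that idea with Python's `csv` module. The column names on both sides, the `FIELD_MAP` mapping, and the sample row are hypothetical, since the notes do not give the real field names used by Archive-It or IDL.

```python
import csv
import io

# Hypothetical mapping: source spreadsheet column -> target field name.
FIELD_MAP = {
    "Seed URL": "url",
    "Title": "title",
    "Description": "description",
}


def remap_rows(source_csv: str) -> list[dict[str, str]]:
    """Read CSV text and rename its columns according to FIELD_MAP."""
    reader = csv.DictReader(io.StringIO(source_csv))
    return [
        {target: row.get(source, "") for source, target in FIELD_MAP.items()}
        for row in reader
    ]


# Hypothetical one-row export.
sample = (
    "Seed URL,Title,Description\n"
    "http://example.org/,Example campaign site,Hypothetical row\n"
)
rows = remap_rows(sample)
print(rows)
```

Columns missing from the source come through as empty strings, which makes gaps easy to spot before loading.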

Question re. copyright: Internet Archive leaves it up to their partners. UCSF doesn’t worry about it: commercial and electoral materials. Stanford: state and local government sites can be copyrighted. They depend on robots.txt files to alert them. If they run into that, they contact the agency. UC Davis: alert faculty and let them opt out (in itself an opportunity for outreach as well).
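For readers unfamiliar with how a robots.txt file signals that a site owner does not want something crawled, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the agency host name, and the crawler name `example-crawler` are made up for the example; it does not describe how Archive-It itself handles these files.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for an agency site.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

# "example-crawler" is a placeholder user agent, not a real crawler name.
print(is_allowed(sample_robots, "example-crawler",
                 "http://agency.example.gov/reports/"))
print(is_allowed(sample_robots, "example-crawler",
                 "http://agency.example.gov/private/memo.html"))
```

A disallowed path is a prompt to contact the site owner, which matches the workflow described for the government collection.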

Updated to add links.
