Slides at Google Docs
Jillian Lohndorf, Internet Archive
Largest web archive in existence. Web archives aim to collect as much of the content/code as possible so it looks as close as possible to the original experience.
Web history for a specific institution: records retention, FOIA laws, historical record. (Also national libraries collecting web sites from their countries.)
* Capture: Heritrix
* Storage: WARC is industry standard, redundant storage
* Access: Playback mechanism is necessary. Wayback has its own.
* Search (Archive-It)
* Web site
Can create derivative files: metadata, visualization
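To make the storage step more concrete, here is a minimal sketch of what a single WARC record looks like on disk, assuming the WARC/1.0 layout (a version line, named header fields, a blank line, the payload, then two CRLFs). The function name and the sample URL are hypothetical. Real crawlers such as Heritrix write many record types and usually compress them, so treat this only as an illustration of the format.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper, uncompressed)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts only the bytes of the record block (the payload here).
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record ends with two CRLFs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_resource_record("http://example.org/", b"hello", "text/plain")
    print(record.decode("utf-8"))
```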
Kris Kasianovitz, Stanford
Archive of state and local government web sites. *ca.gov web space. California Digital Library, State Library, State Archives, U. of California, Stanford.
700+ seed URLs.
Realized they were missing metadata. Collection-level records on WorldCat. Did a “metadata sprint,” call for volunteers. Used Dublin Core fields: coverage, subject, languages, etc.
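A rough sketch of what a seed-level record with those Dublin Core fields might look like as XML. The namespace URI is the standard Dublin Core elements namespace. The wrapper element, the function name, and the sample values are made up for illustration and are not the project's actual records. Note that the Dublin Core element is named `language` (singular).

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, list[str]]) -> str:
    """Build a simple XML record; each field may repeat (hypothetical helper)."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, values in fields.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample values are invented for illustration.
    print(dublin_core_record({
        "title": ["Example Agency"],
        "coverage": ["California"],
        "subject": ["Local government"],
        "language": ["English", "Spanish"],
    }))
```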
Agencies go away. Take info from “About” page.
Public libraries could call for suggestions for “seed” pages in their communities.
Kevin Miller, UC Davis
Archives sites related to the campus and Davis community. Including UC Davis individuals, such as prominent faculty. A way to capture their work (such as on blogs) before they retire. Using ORCiD to find UC Davis faculty and their publications.
Automated taxonomy pilot for Archive-It. Script determines the “aboutness” of a website.
* Library catalog
* Finding aids
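The notes do not say how the "aboutness" script works. As a stand-in, here is a very simple sketch that guesses a site's topics by counting the most frequent non-trivial words in its text. The stopword list, function name, and sample text are all assumptions, and the actual pilot may use a different technique entirely.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real script would use a fuller one.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "was"}


def top_terms(text: str, n: int = 3) -> list[str]:
    """Return the n most frequent words, ignoring stopwords and very short words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(n)]


if __name__ == "__main__":
    sample = "Water quality reports. Water board meetings and water permits for the county."
    print(top_terms(sample, 1))
```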
Rachel Taketa, UCSF
Industry documents from companies “that negatively impact public health.” Where they have strategies to mislead the public. For example, tobacco, chemicals (e.g., Monsanto), pharmaceuticals.
Collect documents that are produced during lawsuits, documents from whistleblowers.
Tobacco industry site has about 15 million documents.
Coming soon: sugar industry.
E-cigarette advertising. Campaigns against cigarette taxes. (Campaign web sites go down the day after the election.)
Access through Archive-It and on their site.
Using a spreadsheet to cross-load metadata from Archive-It to IDL.
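A spreadsheet crosswalk like this can be sketched as a column-renaming step over a CSV export. The column names on both sides are hypothetical. The actual Archive-It export and IDL fields were not given in the talk.

```python
import csv
import io

# Hypothetical mapping: source spreadsheet column -> target field name.
FIELD_MAP = {"Seed URL": "url", "Title": "title", "Subject": "topic"}


def crosswalk(source_csv: str) -> list[dict[str, str]]:
    """Rename columns per FIELD_MAP; missing source columns become empty strings."""
    reader = csv.DictReader(io.StringIO(source_csv))
    return [
        {target: row.get(source, "") for source, target in FIELD_MAP.items()}
        for row in reader
    ]


if __name__ == "__main__":
    sample = "Seed URL,Title,Subject\nhttp://example.org,Example campaign,tobacco\n"
    print(crosswalk(sample))
```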
Question re. copyright: Internet Archive leaves it up to their partners. UCSF doesn’t worry about it, since its materials are commercial and electoral. Stanford: state and local government sites can be copyrighted, so they depend on robots.txt files to alert them; if they run into that, they contact the agency. UC Davis: alert faculty and let them opt out (in itself an opportunity for outreach as well).
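For readers unfamiliar with robots.txt, here is a small sketch of how a crawler can check whether a site asks not to be crawled, using Python's standard `urllib.robotparser`. The rules, crawler name, and URLs are invented for illustration. This is not how Heritrix or Archive-It implement the check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules: everything under /private/ is off limits to all crawlers.
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "example-crawler", "https://agency.example/about"))
    print(is_allowed(rules, "example-crawler", "https://agency.example/private/report.pdf"))
```

A blocked URL here is the kind of signal that would prompt contacting the agency.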
Updated to add links.