Tools and utilities for writing, reading, inspecting and managing WARC files.
An add-on module (plug-in) for the web crawler Heritrix. It offers a means to reduce the amount of duplicate data collected in a series of snapshot crawls.
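The underlying idea is easy to sketch: hash each response payload and, when a digest repeats, store a lightweight "revisit" record pointing back at the first capture instead of the full payload. Below is a minimal illustration using the separate warcio library (an assumption for illustration only; the Heritrix module itself is a Java component and works differently inside the crawler):

```python
# Hypothetical digest-based deduplication sketch using warcio -- not the
# Heritrix module's actual code. A repeated payload digest is replaced by
# a WARC "revisit" record that refers back to the first capture.
from warcio.warcwriter import WARCWriter

seen = {}  # payload digest -> (uri, date) of the first capture

def write_deduplicated(writer: WARCWriter, record):
    digest = record.rec_headers.get_header('WARC-Payload-Digest')
    uri = record.rec_headers.get_header('WARC-Target-URI')
    date = record.rec_headers.get_header('WARC-Date')
    if digest and digest in seen:
        first_uri, first_date = seen[digest]
        # Revisit records carry headers only, so duplicates cost almost nothing.
        writer.write_record(
            writer.create_revisit_record(uri, digest, first_uri, first_date))
    else:
        seen[digest] = (uri, date)
        writer.write_record(record)
```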
Viewer for browsing the contents of a WARC file.
Scripts to bundle Archive Team uploads and upload them to Archive.org.
CommonCrawl WARC/WET/WAT examples and processing code.
Python script to create CDX index files of WARC data.
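For a rough idea of what such indexing involves, here is a hedged sketch built on the warcio library (an assumption; the script above may be implemented quite differently) that emits a simplified subset of the usual CDX fields:

```python
# Simplified CDX-style indexer sketch (assumes the warcio package).
import re
import sys

from warcio.archiveiterator import ArchiveIterator

def cdx_lines(path):
    """Yield simplified CDX-style lines: url, 14-digit timestamp, mime, status."""
    with open(path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            # Collapse ISO-8601 WARC-Date (2024-01-01T00:00:00Z) to 20240101000000.
            ts = re.sub(r'[^0-9]', '',
                        record.rec_headers.get_header('WARC-Date', ''))[:14]
            if record.http_headers:
                status = record.http_headers.get_statuscode()
                mime = record.http_headers.get_header('Content-Type', '-').split(';')[0]
            else:
                status, mime = '-', '-'
            # Real CDX lines lead with a canonicalized (SURT) URL key and add
            # digest/offset/filename fields; omitted here for brevity.
            yield f'{uri} {ts} {mime} {status}'

if __name__ == '__main__':
    for line in cdx_lines(sys.argv[1]):
        print(line)
```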
A library for writing Heritrix output directly to Cassandra.
Non-destructive WARC-in-TAR to WARC conversion.
Simple Python wrapper around the Heritrix API.
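As a rough idea of what such a wrapper does, the sketch below drives the Heritrix 3 REST engine with plain requests calls (assumptions: engine at https://localhost:8443, digest authentication with default credentials, and the stock self-signed certificate):

```python
# Tiny hypothetical client for the Heritrix 3 REST API (not the wrapper itself).
import requests
from requests.auth import HTTPDigestAuth

class Heritrix:
    def __init__(self, base='https://localhost:8443/engine',
                 user='admin', password='admin'):
        self.base = base
        self.auth = HTTPDigestAuth(user, password)

    def job_action(self, job, action):
        # Heritrix accepts actions such as build, launch, pause, unpause,
        # terminate, and teardown as a form-encoded POST to the job resource.
        return requests.post(f'{self.base}/job/{job}',
                             data={'action': action},
                             auth=self.auth,
                             headers={'Accept': 'application/xml'},
                             verify=False)  # stock install uses a self-signed cert

crawler = Heritrix()
crawler.job_action('myjob', 'build')
crawler.job_action('myjob', 'launch')
```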
WARC and WET support for Hadoop's MapReduce API.
Miscellaneous tools for processing WARC files from the CommonCrawl.
Lets you download a mirror copy of a website while running a web crawl with the Python web crawler Scrapy.
HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.
Saves proxied HTTP traffic to a WARC file.
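The WARC-writing half of such a proxy is straightforward with the warcio library (an assumption; these proxies may use different plumbing): wrap the captured status line, headers, and body into a response record.

```python
# Minimal sketch: write one captured HTTP response into a gzipped WARC.
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

payload = b'<html>hello</html>'  # hypothetical captured body

with open('traffic.warc.gz', 'wb') as out:
    writer = WARCWriter(out, gzip=True)
    # Reassemble the HTTP response exactly as the client saw it.
    http_headers = StatusAndHeaders(
        '200 OK',
        [('Content-Type', 'text/html'), ('Content-Length', str(len(payload)))],
        protocol='HTTP/1.1')
    record = writer.create_warc_record('http://example.com/', 'response',
                                       payload=BytesIO(payload),
                                       http_headers=http_headers)
    writer.write_record(record)
```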
UI to view and manage .warc and .warc.gz files.
An HTTP-based warc-to-zip converter.
Wget-compatible web downloader and crawler.
The Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Landing site for open source Wayback development.
A package to read and validate WARC, ARC and GZip files.
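A rough Python analogue of the structural-validation step, using warcio (an assumption; the package described here exposes its own API), is simply to iterate every record: ArchiveIterator reads both WARC and ARC, gzipped or not, and raises on malformed input.

```python
# Sketch of structural validation via warcio, not this package's API.
import sys

from warcio.archiveiterator import ArchiveIterator
from warcio.exceptions import ArchiveLoadFailed

def validate(path):
    """Walk every record; warcio raises ArchiveLoadFailed on malformed input."""
    try:
        with open(path, 'rb') as stream:
            count = sum(1 for _ in ArchiveIterator(stream))
    except ArchiveLoadFailed as exc:
        print(f'{path}: INVALID ({exc})')
    else:
        print(f'{path}: OK, {count} records')

if __name__ == '__main__':
    for name in sys.argv[1:]:
        validate(name)
```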
A complete web archiving package whose primary function is to plan, schedule, and run web harvests of parts of the Internet. It is built around the Heritrix web crawler.
A transactional archiving tool: it selectively captures and stores the transactions that take place between a web client (browser) and a web server.
Python tool and library for handling Web ARChive (WARC) files.
A database-backed web application that indexes a collection of WARC data and provides a browsing and search interface over it.
A browser extension that lets a user create a Web ARChive (WARC) file from any browsable web page. The resulting files can then be used with other tools like the Internet Archive's open-source Wayback Machine.
A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages.
