Curlie - Computers: Data Formats: Archive: WARC

WARC

The WARC (Web ARChive) file format is a successor to the ARC format. It specifies a method for combining multiple digital resources into an aggregate archival file together with related information.
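As a rough illustration of that aggregation, the sketch below writes two simplified records into one in-memory archive and reads them back. It assumes the basic record layout of a version line, named header fields, a blank line, and a content block whose size is given by Content-Length. The helper names build_record and read_records are hypothetical and not taken from any listed library. Real records also carry mandatory fields such as WARC-Record-ID and WARC-Date, which are omitted here for brevity.

```python
import io


def build_record(record_type, target_uri, payload):
    """Serialize one simplified WARC-style record (illustrative only)."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    # Header block, blank line, content block, then two CRLFs end the record.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (version, headers, body) for each record in a binary stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is read by length, so it may contain any bytes.
        body = stream.read(int(headers["Content-Length"]))
        yield version.strip().decode("ascii"), headers, body


archive = (
    build_record("resource", "http://example.org/a", b"hello")
    + build_record("resource", "http://example.org/b", b"line1\r\nline2")
)
records = list(read_records(io.BytesIO(archive)))
for version, headers, body in records:
    print(version, headers["WARC-Target-URI"], len(body))
```

For real archives, one of the libraries listed under Sites is a better starting point than this hand-rolled parser, which ignores compression, header continuation lines, and validation.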
Subcategories 1

Software

Sites (16)

WARC File Format Specifications

Collection of drafts prepared as the WARC format developed.

Common Crawl data set

Description of the data set.

Digital Preservation Coalition: Web-Archiving

Report intended for those with an interest in, or responsibility for, setting up a web archive, particularly new practitioners or senior managers wishing to develop a holistic understanding of the issues and options available.

Example ARC and WARC files

Short examples of the ARC and WARC files that are generated by the Internet Archive's crawlers.

GitHub: example-warc-java

Java and Clojure examples for processing Common Crawl WARC files.

GitHub: pylibwarc

A Python library for dealing with Web ARChive (WARC) files.

GitHub: webarchive-commons

Common web archive utility code.

International Internet Preservation Consortium: Tools and Software

Perspectives on setting up a web archiving chain, with tools recommended and used by members of the IIPC.

Internetarchive/warc

Python library for reading and writing WARC files and WARC headers.

The WARC Ecosystem

Wiki with resources about the WARC format and the tools that support it.

WARC Implementation Guidelines v.1

Gathers advice and best practices to help institutions design and create WARC files for collection management, access, preservation, and interoperability with collections from different institutions.

WARC, Web ARChive file format

Format description, ISO 28500:2009. Used by archival institutions to store content harvested by web crawls, for example via use of the Heritrix harvesting tool.

Web Archive Transformation (WAT) Specification, Utilities, and Usage Overview

Utilities to extract metadata from WARC files and create data analysis reports. Terminology, using WAT and Pig for data analysis.

Web Data Commons

The project extracts structured data from the Common Crawl and provides it for public download.

Wget with WARC output

About the development version of Wget, which can save WARC files.

WSDK

A lightweight Erlang library for writing web archiving software. Overview, requirements, quick start, tutorial, support services, bug reports, license and third-party libraries.

Last update: September 2, 2021 at 5:25:03 UTC
