Warc Extractor

The most popular thing I have ever built is my warc-extractor. I built it while working on a university project that pulled various datasets from the internet (Twitter, the Internet Archive, conventional scraping, etc.) and experimented with ways to visualize the data. As it was a university project, I was expected to upload everything I had built to a public repository, in this case GitHub.

At the time, there were basically no tools available for dealing with warc files. The only one I could find was the official warc[efn_note]https://github.com/internetarchive/warc[/efn_note] project, which to this day remains completely abandoned. My main goal was to create a script that could extract all the text from a warc file; however, that evolved into a general utility that acted as an “unzip” script for warc files. Once the project finished, I uploaded everything to GitHub[efn_note]https://github.com/recrm/ArchiveTools[/efn_note] as an archive and expected nothing more to happen.

Interest in the project grew organically; I’ve done nothing to promote it, yet it has maintained a small but remarkably steady volume of traffic for years now: about half a dozen unique visitors or downloads every week for at least seven years. From my perspective, that is a truly enormous number of people. I am genuinely glad that there is a small community out there that finds my tool useful.

I am committed to fixing any bugs that get reported (when I find time) and to keeping the tool as accessible and up to date as possible. However, I don’t intend to add any more features. There are a lot more warc tools floating around these days, and I would recommend that anyone needing more functionality try them out.

Just this last month I uploaded the warc-extractor to PyPI[efn_note]https://pypi.org/project/warc-extractor/[/efn_note], separate from the rest of ArchiveTools, which remains an archive of the original project. It can now be installed using pip:

python3 -m pip install warc-extractor

Once installed, the script can be run much as it was before. To dump all warc files in the current directory, just type:

warc-extractor -dump content

Additional help can be found via the built-in --help flag as well as at the repository.
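
To give a rough idea of what "unzipping" a warc file involves, here is a minimal Python sketch. It is not the warc-extractor source and does not reflect how the tool is implemented. It assumes an uncompressed warc file laid out in the standard way (a version line, header lines, a blank line, then Content-Length bytes of record content), and the function names read_warc_records and http_payload are made up for illustration.

```python
import io


def read_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed warc stream.

    Illustrative only: real files are often gzip-compressed and need
    more careful error handling than this sketch provides.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Header lines look like "Name: value" and end at the first blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The record content is exactly Content-Length bytes long.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def http_payload(block):
    """Return the body of an HTTP response stored in a record block."""
    _, _, body = block.partition(b"\r\n\r\n")
    return body


# A tiny hand-written record, used only to exercise the sketch.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hello</p>"
sample = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http)).encode() + b"\r\n",
    b"\r\n",
    http,
    b"\r\n\r\n",
])

for headers, block in read_warc_records(io.BytesIO(sample)):
    print(headers["warc-target-uri"], http_payload(block))
```

A real extractor also has to deal with compressed files, records that are not HTTP responses, and malformed input, which is where most of the actual work lies.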

I appreciate all the interest and hope that you will continue to find this simple script useful in the future.

Published by

ryan

This is the personal blog of Ryan Chartier. I post all of my long form content here.
