Saving the Internet Archive

with Jason Anthony Guy

Monday, September 30 2024

In “The Internet Archive’s Fight to Save Itself”, Kate Knibbs at Wired writes:

It is no exaggeration to say that digital archiving as we know it would not exist without the Internet Archive--and that, as the world's knowledge repositories increasingly go online, archiving as we know it would not be as functional. Its most famous project, the Wayback Machine, is a repository of web pages that functions as an unparalleled record of the internet. Zoomed out, the Internet Archive is one of the most important historical-preservation organizations in the world. The Wayback Machine has assumed a default position as a safety valve against digital oblivion. The rhapsodic regard the Internet Archive inspires is earned--without it, the world would lose its best public resource on internet history.

I, too, am rhapsodic about the Internet Archive. I use it regularly to find previous versions of websites, or content not otherwise available. Preserving our digital history is a noble and worthy effort that should be applauded. Sadly, but unsurprisingly, some would prefer to sue them out of existence:

Since 2020, it's been mired in legal battles. In Hachette v. Internet Archive, book publishers complained that the nonprofit infringed on copyright by loaning out digitized versions of physical books. In UMG Recordings v. Internet Archive, music labels have alleged that the Internet Archive infringed on copyright by digitizing recordings.

The book lending was a decade-old program in which the Archive bought (or was given) a physical copy of a book, scanned it, and loaned the scan to one person at a time, much as a library lends a physical book. The program was expanded during the pandemic:

In March 2020, as schools and libraries abruptly shut down, they faced a dilemma. Demand for ebooks far outstripped their ability to loan them out under restrictive licensing deals, and they had no way of lending out books that existed only in physical form. In response, the Internet Archive made a bold decision: It allowed multiple people to check out digital versions of the same book simultaneously. It called this program the National Emergency Library. “We acted at the request of librarians and educators and writers,” says Chris Freeland.

Here’s what the Internet Archive wrote when they announced the National Emergency Library:

To address our unprecedented global and immediate need for access to reading and research materials, as of today, March 24, 2020, the Internet Archive will suspend waitlists for the 1.4 million (and growing) books in our lending library by creating a National Emergency Library to serve the nation’s displaced learners. This suspension will run through June 30, 2020, or the end of the US national emergency, whichever is later.

During the waitlist suspension, users will be able to borrow books from the National Emergency Library without joining a waitlist, ensuring that students will have access to assigned readings and library materials that the Internet Archive has digitized for the remainder of the US academic calendar, and that people who cannot physically access their local libraries because of closure or self-quarantine can continue to read and thrive during this time of crisis, keeping themselves and others safe.

Students and libraries didn’t have easy access to books during the pandemic, and the Internet Archive tried to help, at no cost to readers. Instead of supporting the effort, or providing access to ebooks themselves, book publishers and authors sued. It’s unclear how much money the book lending cost these publishers and authors; I’m guessing it’s far less than the lawsuit amount. I doubt a significant percentage of those book loans would have been purchases.

The recordings at issue in the UMG suit were digitized from records in the “obsolete” 78 rpm format:

In 2023, several major record labels, including Universal Music Group, Sony, and Capitol, sued the Internet Archive over its Great 78 Project, a digital archive of a niche collection of recordings of albums in the obsolete record format known as 78s, which was used from the 1890s to the late 1950s. The complaint alleges that the project "undermines the value of music." It lists 2,749 recordings as infringed, which means damages could potentially be over $400 million.

I’m guessing these record companies weren’t making any money from these 78s, certainly not $400 million worth. I’d bet they haven’t made that much combined since those records were first sold. They’re suing because it’s the only way for them to make money on works that otherwise make them nothing. It’s rent-seeking in the form of copyright infringement lawsuits, a transfer of wealth from a nonprofit to a very-much-for-profit.

As a nonprofit, the Internet Archive is supported by some very large foundations (and individual donations), with reported revenue around $30 million and expenses of nearly $26 million, yet I’d be surprised if that’s sufficient to continue archiving the ever-growing digital world and to defend itself from lawsuits. The potential damages in the UMG case are more than thirteen times the Internet Archive’s reported revenue, and may be enough to put it out of business.

The BBC’s Chris Stokel-Walker writes about the potential impact of losing our digital history:

38% of web pages that Pew tried to access that existed in 2013 no longer function. But it's also an issue for more recent publications. Some 8% of web pages published at some point in 2023 were gone by October that same year.

This isn't just a concern for history buffs and internet obsessives. According to the study, one in five government websites contains at least one broken link. Pew found more than half of Wikipedia articles have a broken link in their references section, meaning the evidence backing up the online encyclopaedia's information is slowly disintegrating.

Stokel-Walker goes on to note that:

[…] thanks to the work of the Internet Archive, not all those dead links are totally inaccessible. For decades, the Archive's Wayback Machine project has sent armies of robots to crawl through the cascading labyrinths of the internet. These systems download functional copies of websites as they change over time – often capturing the same pages multiple times in a single day – and make them available to public free of charge.

“When we then went and looked at how many of those URLs were available in the Wayback Machine, we found that two-thirds of those were available in a way," [Mark Graham, director of the Internet Archive's Wayback Machine] says. In that sense, the Internet Archive is doing what it set out to do – it's saving records of online society for posterity.

Wikipedia gets a lot of attention as the world’s store of knowledge, but many of the “verifiable facts” that support Wikipedia articles are “backed” by the Internet Archive. Does Wikipedia pay anything to the Internet Archive for making their service more trustworthy?

(Worth noting: Wikipedia had 2023 revenue of $180 million and expenses of $168 million, roughly six times the Internet Archive’s figures.)

Stokel-Walker, again:

One thing is clear, though, [Mar Hicks, a historian of technology at the University of Virginia] says, we should all pay up to support the fight for preservation. "From a very pragmatic perspective, if you do not pay these people and make sure that these archives are funded, they will not exist into the future, they will break down and then the whole point of collecting them will have gone out the window," says Hicks. "Because the whole point of the archive is not that it just gets collected, but that it persists indefinitely into the future."

If companies don’t want to maintain archives of their content themselves, rather than suing, why not partner with the Internet Archive to handle the archiving?

Just this September, Google and the Internet Archive announced a partnership to allow people to see previous versions of websites surfaced through Google Search by linking to the Wayback Machine. Google previously offered its own cached historical websites; now it leans on a small nonprofit.

It’s unclear how much—if anything—Google is actually paying for this partnership, though. Perhaps they donate, then take a tax deduction, saving themselves potentially millions of dollars while offloading the technical—and legal—burden?

I donate to the Internet Archive (and Wikipedia), but foundational aspects of the internet (see also open source projects) should not rely on the largess of individuals—or even massive foundations—to sustain them.

We also need to address the “single point of failure” nature of the Internet Archive. These recent lawsuits—or future ones—could very well kill the nonprofit, and with it, petabytes of valuable archives.

Perhaps every content company and publisher over a certain valuation should be encouraged (required?) to pay into a fund to ensure their content is archived for posterity, along the lines of FRAND licensing. Or they can maintain archives themselves, as long as they agree to make those archives available to the public in perpetuity.

Or perhaps indemnify the Internet Archive (and other nonprofits with similar goals) from these types of lawsuits. They aren’t selling access to this content, and there are no ads on the site. It’s not a money-making venture.

Perhaps such an organization needs to be certified, or adhere to specific behaviors, to be indemnified.

Or perhaps the copyright laws need to be changed to allow for the explicit right to archive content and make it available online in some form.

(I’m not anti-copyright, unlike some critics of these lawsuits. I believe authors and publishers deserve the right to control the use of their content (especially in this AI-driven environment). That fundamental right needs to be balanced with the important goals of preservation and access.)

I’m not sure what the right answer is here, only that we need to preserve our books, movies, TV shows, music, and the rest of our human creativity.

I wrote at the top that I’m a big fan of the Internet Archive. I really do appreciate their work. For example, it enabled me to see the earliest versions of my first technology consulting company’s website. (Cringe.)

A perhaps more useful example: As a cocktail enthusiast, I enjoy drinking out of “Nick & Nora” glasses, named for the main characters in The Thin Man movies. But I’d never seen the movie, and it was challenging to find it to purchase or stream.

But the Internet Archive had a copy, and I was able to finally watch and enjoy this absolutely delightful movie.

(It’s now available almost everywhere, from Amazon to Apple TV+ to YouTube. Progress, I suppose, but what happens when the studio, or the streaming service, decides to pull it? This is also why I buy movies I care about on Blu-ray, and rip/archive them myself.)

We need to ensure gems like these aren’t lost.
