A Thank You to Journalists Supporting the Wayback Machine


As publishers block the Internet Archive’s Wayback Machine over unfounded concerns about AI scraping, hundreds of journalists have signed a public letter supporting the Wayback Machine and the importance of preserving the online historical record. Below, Mark Graham, director of the Wayback Machine, shares a message of thanks to the journalism community for standing up for web preservation, accountability, and access to the public record.

Journalists who would like to add their names can sign the letter here, and members of the public can sign the broader support letter here.


Dear colleagues,

On behalf of all of us at the Internet Archive, I want to thank you.

Your support for the Wayback Machine sends a clear message: preserving the record matters.

For thirty years, the Wayback Machine has worked in the background, preserving more than 1 trillion web pages so that reporting doesn’t simply vanish with the next site redesign or corporate decision. Today, more than 100 news articles every month reference, cite, or rely on material preserved by the Wayback Machine to verify claims, recover deleted information, or provide historical context.

Where previous generations could walk into a newsroom morgue or a local library archive, today’s journalists increasingly rely on digital preservation to trace accountability and verify claims that might otherwise be lost. When a source disappears, when a statement is rewritten, when a page is taken down, the ability to recover that record is not a luxury. 

The stakes are not hypothetical. A Pew Research Center study from 2024 found that 38% of webpages from a decade ago are no longer accessible, and about 25% of pages sampled across the decade have disappeared entirely. But that’s not the whole story. New analysis by Internet Archive data scientist Sawood Alam found that the Wayback Machine has rescued roughly 15% of those otherwise lost pages, preserving reporting and historical evidence that would simply no longer exist online.

We are especially grateful that you recognized the care with which we approach this work. We are your partners in preservation. We build systems designed for people, not bulk extraction; we monitor our services to manage abusive access; and we actively collaborate with publishers and newsrooms to ensure their work is preserved with integrity.

Importantly, recent reporting has also underscored a key reality of this debate. As journalist Andrew Deck reported in Marketplace Tech, many publishers blocking the Wayback Machine appear to be acting preemptively out of concern over AI scraping rather than evidence of misuse. “None of the publishers were able to point to a particular AI company or other kinds of direct evidence that their content had already been scraped by the Wayback Machine,” Deck wrote.

At a time when the pressures on journalism are mounting—from economic shifts to the rapid evolution of AI—your support sends a clear message: preserving the public record is not optional. It is essential infrastructure for a functioning democracy.

We remain committed to the important task of preserving the web. And we are deeply encouraged to know that so many of you stand with us in defending that work.

With gratitude,

Mark Graham
Director, Wayback Machine
Internet Archive

Celebrating Thirty Years of the Internet Archive with the ‘Class of 1996’

Before feeds, before algorithms, there was the Class of 1996: websites & organizations founded (or expanded) in 1996, like the Internet Archive.

On the occasion of the Internet Archive’s 30th anniversary, we’re opening the internet’s yearbook to celebrate the sites, services & scrappy experiments that helped shape the web as we know it. From class leaders like Center for Democracy and Technology to cultural icons like The Onion to the archivists making sure none of it disappears, this is a reunion worth attending.

Some are still thriving. Some have changed beyond recognition. Some are already gone. All of them remind us: the early web wasn’t just built, it was lived in.

THE MORE YOU KNOW: Did you know that some publishers are blocking the Wayback Machine from archiving their sites, putting decades of reporting and cultural history at risk of disappearing from the public record? If the web’s past matters (and the Class of 1996 reminds us that it does), now is the time to speak up. Add your name to the petition calling on publishers to stop blocking the Wayback Machine and help ensure the internet’s history remains accessible for future generations.


Class of 1996

Class President — Center for Democracy and Technology

The Center for Democracy and Technology didn’t just show up—they helped write the rules of the internet. And 30 years later, they’re still fighting to keep it open.


Go Wayback to 1996: https://web.archive.org/web/19961022174718/https://cdt.org/


Most Likely to Fix Your Computer — CNET

Before YouTube & TikTok tutorials, there was CNET, walking you through every crash, install & “have you tried turning it off and on again?”


Go Wayback to 1996: https://web.archive.org/web/19961221064020/http://www.cnet.com/


Best Dressed — eBay

eBay—Where the outfit and the backstory come with it. Vintage, rare, unforgettable…just like the early web.


Go Wayback to 1999: https://web.archive.org/web/19990117033159/http://pages.ebay.com/aw/index.html


Most Popular (Or Knows Who Is) — Alexa Internet

Before “trending,” there were rankings, and Alexa told us who ruled the web. (RIP to a real one.)


Go Wayback to 1997: https://web.archive.org/web/19970530104435/http://www.alexa.com/


Most Changed Since Freshman Year — Google

From a dorm room experiment to organizing the world’s information. Some people really did peak after high school.


Go Wayback to 1998: https://web.archive.org/web/19981111183552/http://google.stanford.edu/


Most Helpful — Ask Jeeves

Ask a question. Get an answer. Preferably in complete sentences. The internet had a butler once & he was awesome.


Go Wayback to 1996: https://web.archive.org/web/19961219064854/http://www.askjeeves.com/


Class Clown — The Onion

Making us laugh at the news online since 1996 & occasionally making it feel a little too real.


Go Wayback to 1996: https://web.archive.org/web/19961219015005/http://theonion.com/


Best Hair — Unofficial Spice Girls Fan Site

Before social media, fandom lived here: glitter text, tiled backgrounds & serious ‘Wannabe’ hair.


Go Wayback to 1996: https://web.archive.org/web/19961229144915/http://spicegirls.com/


Cutest Couple — World Wide Web Consortium & Cascading Style Sheets

Structure meets style. The web’s ultimate power couple & still going strong.


Go Wayback to 1996: https://web.archive.org/web/19961227091242/https://www.w3.org/


Most Athletic — 1996 Summer Olympics Website

One of the first times the whole world followed the games online. Faster, higher, more digital.


Go Wayback to 1996: https://web.archive.org/web/19961223003700/http://www.atlanta.olympic.org/


Most Talkative — ICQ & Hotmail

The beginning of being always reachable…for better or worse.


Go Wayback to 1997: https://web.archive.org/web/19971210072826/http://www.icq.com/

https://web.archive.org/web/19971210171246/http://hotmail.com


Most Likely to Save Everything — Internet Archive

Because the web isn’t forever, unless someone saves it.


Go Wayback to 1996: https://web.archive.org/web/19970126045828/http://www.archive.org/


Most Likely to LAN Party — Quake

Before Twitch streams there were cables, pizza & Quake. You had to be there (literally).


Go Wayback to 1996: https://web.archive.org/web/19961220085409/http://www.idsoftware.com/


Most Quotable — Salon

Smart, sharp & written to be shared.


Go Wayback to 1998: https://web.archive.org/web/19981212032509/http://www.salon1999.com/

Internet Archive Switzerland: Expanding a Global Mission to Preserve Knowledge


Thirty years ago, Brewster Kahle founded the Internet Archive with an ambitious goal: Universal Access to All Knowledge. Today, that mission continues to grow with an exciting new chapter: the launch of Internet Archive Switzerland, a non-profit foundation based in St. Gallen.

Internet Archive Switzerland, online at https://internetarchive.ch/, is a newly formed Swiss non-profit foundation that will operate independently within its national context. Its efforts will initially focus on preserving endangered archives from around the world and collecting the generative AI wave that is currently upon us all. With a UNESCO conference planned for November 2026 in Paris, Internet Archive Switzerland is taking a concrete step toward exploring how endangered archives can be protected.

In parallel, the Swiss foundation is working in partnership with the School of Computer Science at the University of St. Gallen on the Gen AI Archive project, led by Prof. Dr. Damian Borth. Together, they aim to begin archiving AI models, an emerging frontier for preservation.

The choice of St. Gallen is no coincidence. With a thousand-year tradition of archiving and scholarship, the city offers a fitting home for this next phase of digital preservation. Its strong academic environment—including collaboration with the University of St. Gallen—makes it an ideal place to establish a 21st century memory organization.

“St. Gallen is a very suitable place to take the preservation of our universal knowledge a step further. Stability and innovation go hand in hand here and are embedded in a deep understanding of the importance of cultural heritage,” said Roman Griesfelder, the executive director of Internet Archive Switzerland.

Internet Archive Switzerland joins a growing group of mission-aligned organizations, alongside Internet Archive, Internet Archive Canada, and Internet Archive Europe. Together, these independent libraries strengthen a shared vision: building a distributed, resilient digital library for the world.

Contact Internet Archive Switzerland
Roman Griesfelder, executive director
office@internetarchive.ch

Wayback Machine Director: We Are ‘Collateral Damage’ in the Fight Between AI Companies and Publishers

In the latest episode of the Future Knowledge podcast, “Preserving the Web in the Age of AI,” Wayback Machine director Mark Graham, tech policy expert Mike Masnick, and media lawyer Kendra Albert discuss the reports that some news publishers are blocking the Wayback Machine from archiving their websites due to unfounded concerns over AI scraping.

For Graham, it’s an issue of supporting journalism and the historical record. The Wayback Machine, he says, has become “collateral damage caught up in the conflict between AI companies and publishers.”

As Graham recounts encounters with reporters and researchers, a clear pattern emerges: even the most well-resourced institutions cannot fully preserve their own digital history. The Wayback Machine has become an indispensable backstop, ensuring that the public record remains accessible even when original sources disappear.

“I was in the offices of The New York Times just a few weeks ago,” said Graham, noting that The New York Times has blocked the Wayback Machine from archiving its website, “and a senior researcher came up to me and said, ‘Oh my God, Mark, thank you so much for the Wayback Machine. We use you all the time. There is material available that we’ve used from the Wayback Machine that we can’t even find in our own archives.’ I get those stories all the time.”

For Masnick, blocking the Wayback Machine “will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn’t really a problem.”

When considering blocking all bot activity over fears of AI scraping, Albert cautions that websites, “whether they’re news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, ‘Well, actually some of these entities are behaving badly or doing things that we don’t like with our content, whether they’re entitled to legally or not, and therefore we’re just going to take a broad stance across the board.'”

Listen to the full episode on the Future Knowledge podcast:

Full transcript:

Chris Freeland (00:05):
Welcome to Future Knowledge, a podcast about knowledge, creativity, and policy brought to you by the Internet Archive and Authors Alliance.

(00:14):
We tend to think of the web as a living archive, this vast searchable record of who we were, what we knew, and how we understood the world at any given moment in time. But that assumption is starting to crack. As publishers move to block AI scraping and restrict access to their content, the tools that quietly preserve our online history, like the Internet Archive’s Wayback Machine, are getting caught in the crossfire. What started as a fight over AI training data is quickly becoming something bigger, a question about whether the web itself will be able to be archived in the future. Because if preserving the web becomes a threat, what happens to our memory when the past can’t be saved? Hi, everyone. I’m Chris Freeland. I’m a librarian at the Internet Archive. I want to welcome you to today’s discussion. So we have assembled an excellent panel of experts to discuss how efforts to limit AI access are reshaping the boundaries of preservation and what’s at stake if those boundaries continue to close.

(01:17):
So today we’re joined by Mike Masnick, the founder of TechDirt, Mark Graham, the director of The Wayback Machine, and Kendra Albert, a tech and media policy expert at Albert Sellers LLP. Here to introduce our speakers and to set the stage for today’s conversation is Dave Hansen, the Executive Director of Authors Alliance.

Dave Hansen (01:37):
Thanks, Chris. Hi, everyone. So this is a little bit different for us today. Usually we’re doing book talks, and we thought that this was such an important issue and such a fast-moving issue. No one has yet written the book on what is happening with the crisis in web archiving and preserving the web. But it’s a really important issue. Authors Alliance, from our perspective, we care about this because we’re quite fond of the internet, and being able to research adequately what has happened over time across the web is so important for any sort of journalistic writing, for history. I refer to the Wayback Machine weekly at least for writing, when I’m looking at back versions of documents and things like that. And I think we really do face a real crisis at this moment. This has never been an easy task, and I think we’ll hear from Mark about how the Wayback Machine takes a lot of work, and other web preservation efforts take a lot of work.

(02:32):
It’s never been easy, but in the current moment, it is particularly challenging when we have news publishers and other platforms online making it not just legally or technically complicated, but in some cases, almost impossible to really engage in preserving content in an automated way. So we’re here to talk about that. And I think the three perspectives that we’ve assembled here are, I hope, going to fill in some of the pieces of what’s happening on the ground with web preservation, what’s happening in the broader policy sphere that’s driving some of this. And also what does the law have to say about this? Because often what we see happening is sort of a reflection or a shadow of the legal rights that exist. So with that, I’m going to turn it over to our speakers here. And Mark, how about we start off with you to just talk a little bit about what do you see as the major challenge right now in terms of preserving the web in the age of AI?

Mark Graham (03:32):
Sure. Well, I mean, first of all, for close to 30 years, the Internet Archive’s Wayback Machine has been archiving much of the public web, including journalism, and making that material available to people. I should note that a large percentage of this material is no longer available on the live web; indeed, thousands and thousands of news sites that we have archived over the decades are no longer available. And what has been going on recently, as reported by Andrew Deck of Harvard’s Nieman Journalism Lab and others, is that some news organizations and other platforms, most notably the New York Times and Reddit, have begun blocking, preventing the Internet Archive’s Wayback Machine from producing archives of their material and making that available. Indeed, it has been suggested that we are victims, if you will, collateral damage caught up in the conflict between AI companies and publishers.

Dave Hansen (04:35):
Thanks, Mark. Mike, how about let’s hear from you and take this whatever direction you want, but I’m particularly interested in your take on sort of the broader what’s happening in the policy sphere around this.

Mike Masnick (04:46):
Yeah. I mean, it’s a really tricky space, because I think that a lot of people certainly recognize the value and importance of preserving culture and understanding culture, building institutions like libraries and related organizations. And yet there’s been this kind of struggle, in large part because of the rise of AI, which Mark sort of hinted at in his opening. Until just a few years ago, most people considered the archiving of the web and of other resources as something akin to the role of the library, which makes sense. But with the rise of AI tools, there is this interesting challenge in that all of the major frontier LLM models have been trained on huge corpuses of data, and they’re always looking for more. And the question is where and how. And there are all sorts of other related discussions on whether training is fair use and things like that, which I don’t think we need to get into here, but that is the backdrop behind all of this.

(05:55):
And so companies, especially the media companies, are certainly very concerned about the way that the AI companies have gotten access to their data for training purposes, and they feel that they are uncompensated and that they need to be compensated. Some of them have been working out deals. You mentioned the New York Times and Reddit; the New York Times is suing OpenAI and has cut deals with others, and Reddit has cut deals with Google and some others. And there’s all sorts of back and forth and negotiations. And all of this debate then becomes collateral damage to that, because the fear is, and I think it’s an overblown and misguided fear, that because you have organizations like the Internet Archive building the Wayback Machine, which again, they’ve built for decades, and which I think most people recognize is just a generally useful tool for the preservation of culture and for researchers and for the journalists at some of the news organizations who are complaining, they feel that something like the Wayback Machine offers a way to go around these negotiations and undercut the negotiations in some form or another. Which, I think, at some point, in retrospect, will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn’t really a problem. But in this sort of rush to deal with this concern of, oh my gosh, the AI companies are taking over everything, they’re looking to plug any hole and block any opportunity for the AI companies to train on their material.

(07:34):
And the collateral damage of that is that some of them are now blocking the Wayback Machine. And I think there are other efforts then to see, either on the legal side, which Kendra can talk about, or just through technical measures, will there be ways to put up effectively toll booths on the internet: if you want to archive this, or if you want to make use of the larger corpus of data that various news organizations have put together, do you first have to pay a toll? Do the AI scraping companies have to effectively pay for the right to read that content? And that leads to a whole bunch of other downstream issues, but I’ll cut it off there and give Kendra a chance to talk as well.

Kendra Albert (08:16):
First of all, I just want to say I’m really excited for this conversation. And before I get to the law, I want to obligatorily state that my experience with web archiving is sort of a little bit unique, in the sense that I was on the founding team for Perma.cc, which is basically another web archiving service, aimed at providing sort of permanent links for citations in scholarly work, court filings, et cetera, a project that owes a lot to the Wayback Machine. And so I think this is a topic that’s sort of near and dear to my heart, even outside of the sort of intellectually interesting parts of the legal analysis. So I think there’s sort of two sets of legal issues that you can kind of think about when you’re thinking about web archiving. The one that everyone thinks of, and I realize probably most people do not have an instinctive legal reaction to a conversation about web archiving, is basically copyright law.

(09:03):
The question of whether you can make a copy of someone’s work and save it, even if the purpose you’re saving it for is quite different than the purpose it was originally created for. And there’s some good case law, primarily actually from the early 2000s, suggesting that making cached copies of websites, even a full website, by Google is fair use; use of images for search results, fair use. So basically, fair use is a limitation on copyright law that allows people to make use of copyrighted works without permission from the original owner. And I think that’s how many folks think about many of these large-scale web archiving projects: they’re under fair use. And oftentimes fair use asks questions like, “Hey, are you harming the market for the original copyrighted works? What are you really doing with this use of the copyrighted works?” The test asks questions about what you’re doing.

(09:52):
And I think in the context of web archiving, especially for sort of memory institutions, for the kinds of criticism, accountability, journalism that we’re going to talk about, I think, probably a little later, there’s some really strong fair use arguments, although, like most things in the space, it’s not like we have a Supreme Court case on point about this specifically. That’s, in some ways, the easier question. And when fair use is the easier question, you’re in a bad situation from a legal perspective. The much harder question having to do with web scraping is sort of the process of scraping material online itself. And this sort of falls under a separate set of legal regimes, including things like the Computer Fraud and Abuse Act. If you’re a little confused as to why the federal anti-hacking statute applies to certain kinds of web scraping: originally the theory had more to do with the fact that terms of service for websites would prohibit web scraping, and thus those terms of service could be used to argue that there was a CFAA violation.

(10:51):
Now, as we’re thinking more about the sort of technical restrictions that folks are placing on accessing websites, or even things like robots.txt, which I’m sure we’ll talk about, but which sort of is meant to convey signals about whether websites want themselves to be scraped, courts do have to take up the question of whether violating those signals constitutes breaking the law. And this is where it gets even more tricky, because in the fair use context, you get to talk about things like, “Hey, it’s really good for general knowledge that folks can access archived versions of websites. This isn’t harming the market.” In many conversations around web scraping, whether it’s under the CFAA or some other legal theory, you’re often much less focused on the question of why you are doing this. And this, I think, gets to Mike’s point about the broader environment and the sort of good-guy archival institutions being collateral damage in much of what we’re seeing in the backlash around AI training and web scraping, and in legitimate concerns about just bandwidth use.
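For readers who haven’t worked with it, robots.txt is just a plain-text file of per-crawler directives served at a site’s root. Here is a minimal sketch, using only Python’s standard library, of how a well-behaved crawler consults those signals; the directives and URLs below are illustrative, not any real publisher’s policy.

```python
# Minimal sketch: a well-behaved crawler checking robots.txt signals.
# The directives below are illustrative only, not any publisher's policy.
import urllib.robotparser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ia_archiver
Allow: /

User-agent: *
Crawl-delay: 10
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# An AI-training crawler is asked to stay out entirely...
print(parser.can_fetch("GPTBot", "https://example.com/news/story"))       # False
# ...while an archival crawler is explicitly allowed.
print(parser.can_fetch("ia_archiver", "https://example.com/news/story"))  # True
```

As Kendra notes, the file itself is only a request; whether ignoring these signals breaks the law is exactly what courts are being asked to decide.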

(11:50):
I’m sure Mark can speak to this more than I can, but I think it’s really important to say part of the reason people are concerned about web scraping is just because they are paying for companies to access their website at a scale that is not feasible. So when we think about the law, it’s not an area where we have super clear answers as to the legality. A lot of it depends on the particulars, and it has been defended, I think, for a long time by the fact that most folks doing web archiving have been good actors, like the Internet Archive, like Perma, folks who are responsive to requests to delist material from the web or from their online archives, who are thoughtful about their engagement with people who want to have a conversation about, “Oh God, are we spamming your website? Let’s not do that.”

(12:33):
And so I think that’s meant that actually we haven’t had a ton of litigation over, okay, exactly how is this legal under copyright. In the web scraping context, there has been much more direct litigation, including a fair amount having to do with scraping of LinkedIn, especially by commercial providers. And so when we think about the law of web scraping in particular, you’re thinking less about why you are doing it, well, there are some slight exceptions, and more just about: are you trying to get around technological barriers? Are you trespassing on a website? The kinds of questions that aren’t typically how we think about access to things on the internet.

Dave Hansen (13:12):
Thanks, Kendra. The CFAA piece of this is just bewildering to me: to think that if you follow that path to its logical conclusion, we’ve potentially criminalized being an archivist online. And it’s just wild to me that that’s the world that we’re now in. So I want to talk a little bit about the motivation for blocking a bit more. I mean, we’ve gotten into AI as, I guess, ostensibly the driver here, but then not everybody, not every news organization, not every website has seen this pot of gold and said, “We must protect it at all costs.” They haven’t shut this down across the board. And so I wanted to probe that a little bit. What’s going on there, and why is there significant variation, at least at this point, across policies from … I guess we can focus on news. I know there are other websites as well.

Mark Graham (14:02):
Well, I think, first of all, it should be noted that very few news organizations have actually taken these kinds of measures like the New York Times has. The vast majority of the news organizations in the world are very happy for their resources to be archived. Indeed, if they hadn’t been over the decades, we would not have access to them today. Examples would include Gawker Media or MTV News, nearly half a million articles, in the US; or maybe in Hong Kong, where news organizations like Apple Daily or The Stand were shut down for political reasons, and indeed, editors are in jail today. The only way one can access that material is from the Wayback Machine. In addition, we partner with Bard College and PEN America on a project, the Russian Independent Media Archive, focusing on archiving Russian-language journalism in exile and other places in the world where journalism is at risk.

(15:02):
I also note that Andrew Deck, in his reporting, could not find any examples where any publisher found evidence that material from the Wayback Machine was in fact being exploited by AI companies. So there’s, I think, a great deal at risk here, and frankly, very little, if any, evidence whatsoever of a threat to these news organizations. And at the same time, I want to emphasize that the Internet Archive has been working collaboratively and supportively with journalists for decades; journalism often is based on references to other journalism. I was in the offices of The New York Times just a few weeks ago, and a senior researcher came up to me and said, “Oh my God, Mark, thank you so much for the Wayback Machine. We use you all the time. There is material available that we’ve used from the Wayback Machine that we can’t even find in our own archives.” I get those stories all the time.

(15:59):
And I want to emphasize that we’re not static. We don’t just do our thing and then nothing ever changes. The web is constantly changing, business environments are changing, et cetera, and we change as well. We have implemented a whole variety of mechanisms over the last few years, especially with the rise of AI company scraping, to make it such that, by and large, the Wayback Machine is limited to use by humans. The system is optimized for use by humans. We’ve taken specific measures to reduce, if not eliminate, bulk access to materials, especially from certain news organizations: limiting functionality in the Wayback Machine’s UI, collaborating with entities like Cloudflare, putting in place rate limiting mechanisms, and a whole variety of measures. Some of these we’ve taken in collaboration with news organizations who have expressed specific legitimate concerns. So the conversation is very much open. We welcome it as we look for ways to continue to provide the vital service that we provide: to archive and make available what is considered by many the first rough draft of history.
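One of the mechanisms Graham mentions, rate limiting, can be pictured as a token bucket sitting in front of a replay service. The sketch below is a generic illustration with made-up parameters, not the Archive’s actual implementation.

```python
# Generic token-bucket rate limiter of the kind commonly placed in front
# of a web service. Parameters are illustrative, not the Archive's own.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would typically respond with HTTP 429

# e.g. one bucket per client: sustained 1 request/second, bursts of 5
bucket = TokenBucket(rate=1.0, capacity=5.0)
```

A human paging through snapshots never notices a limiter like this; a bulk scraper hits the ceiling almost immediately, which is the point.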

(17:11):
And by definition, that rough draft needs to be available to be able to be examined, to be interrogated, to be reviewed, to be cited and referenced. There’s just one more thought there, the citation side of it. Today, there are millions of URLs from news sites in Wikipedia articles, and a large percentage of them are only available because they’re in the Wayback Machine, because the news sites they came from just don’t exist anymore.

Mike Masnick (17:40):
The point I’ll add in terms of why news organizations are doing this is that I think it’s all part of a negotiation, right? I mean, if you look at the ones who are sort of at the forefront of trying to do this blocking, the New York Times and Gannett mainly being some of the big ones, their own business model has changed quite a lot in the last few decades, and certainly in the last few years as well. And they’re very, very focused on trying to figure out how they’re going to continue to make money. And lately that has been through negotiating with large tech AI players and trying to cut deals. And the concern, which again, I would argue is misplaced, is that anything that might undercut the negotiations to make a deal and sort of prop up their business model is seen as a threat to that.

(18:33):
And so the few of them that are going around and saying that the internet archive is a problem or needs to be blocked from scraping their content, for the most part, they’re using that just to help them in their negotiations out of the fear that, oh, if the AI companies have a back door into getting our content, then the negotiation with us over a deal is a different proposition. I think this is a mistake on multiple levels, but that is kind of where their thinking seems to be.

Kendra Albert (19:04):
I also think, to that point, that there can be this sort of way of thinking about … I’m reminded of the famous dril tweet: there’s no difference between good and bad things. Sort of this idea that in order to take a stance about bot access on your platform, you have to block all of them. It doesn’t matter what they’re there for, doesn’t matter whether they’re well-behaved in terms of bandwidth use or crawling. I don’t know, I haven’t had conversations with folks, and some of it may be lawyer brain. I think there is a world in which lawyers looking at their legal positions with regard to scraping that might be occurring through AI sites might say, “Well, actually, this is simpler if we don’t have to explain that we allow it for these folks, because we actually think that they’re okay or we think that the uses may be fair or whatever, but we don’t allow it for these folks.”

(19:51):
I can imagine that making an argument more complicated, even if I sort of agree with Mike that I don’t think it’s a particularly good way to do things. And I also think websites generally, whether they’re news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, “Well, actually some of these entities are behaving badly or doing things that we don’t like with our content, whether they’re entitled to legally or not, and therefore we’re just going to take a broad stance across the board.” The other thing I can imagine, and I don’t think this is true for the New York Times or Gannett, and Mark can correct me if I’m wrong, is that I think there are some circumstances where there are smaller sites that actually may not have the specific technical expertise to really understand what’s fully going on, where you have a site that just knows their bandwidth costs have gone crazy, and they’re aware that folks are sort of scraping the web for AI training.

(20:39):
They’re not necessarily in a position to go through and actually distinguish different sorts of bots or actors, different sorts of folks who are accessing content, and so may take a uniform approach. But I mean, I think part of this has to be a conversation about, hey, independent of what you think about the training data copyright fight, which I’m not going to get into, archival uses are really important. Right now there’s a bot on Bluesky and Mastodon, which I think works via RSS feeds because it’s headlines, that looks at changes to New York Times headlines from the first post to 20 minutes later. And you can see the sort of diff between those headlines. And that provides valuable media criticism, frankly. And we’re not even talking about 20 years from now. We’re talking about, within the day of posting, people being able to see how stories have changed.

(21:28):
So I don’t think folks are doing it. I don’t think the New York Times is doing it because they don’t want people seeing how their headlines have changed or that they’re stealth correcting things in the text, although they certainly do do that. But I think that some of it comes from this sort of general framing of enclosure, as Mike was talking about, or can come from a lack of going through the details to understand the differences between different types of actors who may be using somewhat similar technologies.
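Mechanically, the headline-diffing bot Albert describes is a small polling loop. The sketch below shows the general idea using only Python’s standard library; the feed URL is a hypothetical stand-in, and the real bot’s internals may differ.

```python
# Rough sketch of the headline-diffing idea: poll an RSS feed, remember
# each story's title, and print a diff when a title changes.
import time
import difflib
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/rss"  # hypothetical feed, not a real endpoint
seen: dict[str, str] = {}             # story link -> last title seen

def poll(feed_url: str) -> None:
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.parse(resp).getroot()
    for item in root.iter("item"):  # standard RSS <item> elements
        link = item.findtext("link", default="")
        title = item.findtext("title", default="")
        old = seen.get(link)
        if old is not None and old != title:
            # Show exactly what changed between the two headlines.
            print("\n".join(difflib.unified_diff([old], [title], lineterm="")))
        seen[link] = title

while True:
    poll(FEED_URL)
    time.sleep(20 * 60)  # re-check every 20 minutes, per the example above
```

Note that this kind of accountability tooling needs no bulk access at all: one feed, one request every 20 minutes.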

Mike Masnick (21:53):
Yeah. Can I just add something to that? One of the things that I think is important, that overlays a whole bunch of this, is the general feeling of many people, somewhat reasonably, about the entire AI space right now: that there’s a large sort of backlash. There was this study recently that ICE has a higher approval rating than AI technology right now. There is a general sort of conceptual backlash to this technology. Some of it based on perhaps good reasons, some of it based on perhaps not good reasons, none of that matters. Culturally, there is this general backlash, and especially for smaller, less sophisticated sites that don’t want to go through the process of having to deal with that and the nuance, it comes down to just saying, “I want to opt out, if I can, of this technology that I feel is problematic and bad.” And if they don’t have a clear and easy way to do that, one response might be, “Well, I’m going to block any and all scraping, because I vaguely know that that is being used to allow these companies that I hate to do something with my content.” And therefore, for some of them, it is not a well-thought-out “I am taking a stand against archives.”

(23:08):
They’re not thinking that far. They’re just saying like, “AI, bad. I have no control over this situation. The only thing I can do is someone has made it easy for me to block archiving or scraping, and therefore I have to do that as a stand against this technology.”

Mark Graham (23:24):
That’s true. And at the same time, I want to note that there are many other news organizations that take the opposite approach. Indeed, specifically, for example, the Poynter Institute and the organization behind the Investigative Reporters and Editors conference have partnered with the Internet Archive on a project called Today’s News for Tomorrow. And what we are doing specifically is providing free archival services to more than 300 local newsrooms across the United States to help them archive their material. They have chosen to participate in this project because they value and appreciate the importance of the archiving. And at the same time, I note that more than 200 journalists have recently signed a letter endorsing the work of the Wayback Machine, celebrating it. In fact, Rachel Matto and others are on record supporting this and have signed the letter of support. So we are focusing here on some of the pushback from a very small number of influential and well-known news and other sites.

(24:27):
But I want to put across the point here that, generally speaking, we are able to continue to provide the service that we have for decades with the active support of, first of all, the patrons of the Internet Archive, the folks that are curious enough to want to learn, from journalists writ large, and from media platforms.

Mike Masnick (24:46):
Yeah. And I signed that letter, and I completely agree with that thinking. I’m just sort of explaining some of the thinking. And I would even go a little bit further, in that beyond just the importance of archiving and being able to use these tools that journalists use for research, I do worry a little bit, even if we’re talking about the AI technologies as well, that when you have major publications like the New York Times trying to block any and every possible way in which their writing might be read by AI tools, that actually has problematic downstream consequences as well, where what’s left is the more problematic publications that are out there rather than the ones that have done more careful reporting. The New York Times sometimes does careful reporting, not always, I would say, but you want to have good reporting in these archives, and in the AI tools as well, as people are using them, so that they’re not overrun by more problematic content.

Mark Graham (25:44):
It does. And if I could build on this just a bit, it sets a very bad precedent, and that then bleeds into other areas of publishing. For example, the US government, the world’s largest publisher, uses large commercial platforms for much of its publishing. The US Agency for Global Media, the folks behind Radio Free Europe, et cetera, use YouTube to publish videos, millions of videos, thousands of which have been taken down since this new Trump administration. A couple of months ago, the State Department said that they were going to remove all of the social media posts from prior to the Trump administration. And as we were racing to archive more than two million social media posts, we were watching accounts from embassies, ambassadors, and others over the years literally disappear from our screens as we were trying to archive them. So I think this just sets a dangerous precedent, and it is something that we should be paying attention to in all dimensions of how we are working to preserve the materials that are published, and to never trust a publisher to do the job of a library.

Dave Hansen (26:49):
As you’re talking, I’m really thinking here about some of the business model stuff that underlies so much of these concerns. And I was recalling, it was like three or four years ago, I guess, at this point, working with a library that was doing a licensing deal with a rather large newspaper. And I mean, the numbers that they showed me, they’re talking about six-figure data licenses for access to the newspaper data. And we’ve had people talk about this before on here. Sarah Lamdan did a talk about her book, Data Cartels, a lot of which focuses on Reed Elsevier, the academic publisher, and there’s this real disconnect with how authors and contributors and journalists think of those outlets and what those outlets actually are from a business perspective. And I think the New York Times at this point is as much a data and analytics company as it is a newspaper.

(27:40):
Reed Elsevier specifically calls itself a data analytics company, even though it is on paper an academic publisher. And I think it doesn’t really help solve the situation, but it at least explains a little bit more to me why they are making the moves that they are around restricting access to this content, if that’s the core of your business. I still don’t like it, but that explains a little bit. So I do want to talk about some other companies, and outside of news, I guess, is where I’d like to go. So Reddit has been pretty public about blocking access. They have a lawsuit right now against Anthropic. That’s been a kind of interesting one to watch. And there are lots of other commercial platforms, social media platforms, for instance, that are restricting access for web scraping and preservation. So what’s going on there? Kendra, maybe we can start with you to just talk a little bit about what’s happening in litigation with some of these other platforms.

Kendra Albert (28:37):
And to Mark’s point and your point about these sorts of platforms, I think it’s really valuable in some ways to think about the actual rights to the content, or your sort of legal right to use the content, as a functionally separate question from the scraping itself. And I think specifically with Reddit, Reddit doesn’t have the right to sue someone for copyright infringement for copying Reddit posts. I haven’t read the terms of service recently, but I’m pretty sure you’re not allowing Reddit to sue on your behalf for copyright infringement. But oftentimes the way this litigation is framed is around access to the platform, circumventing technological measures, so that’s the anti-circumvention part of the copyright statute, Section 1201, or through things like trespass to chattels, a tort that historically arose from things like touching someone’s car without permission, and which has been used in some contexts for web scraping, although usually you need to show that there’s some form of harm to the sort of infrastructure in order to bring it.

(29:41):
We talked about the CFAA; there’s trade secret; there’s all kinds of other sorts of legal claims. So in some ways, when you’re thinking about how some of these platforms are choosing to back up their business model goals: Reddit has done licensing deals with AI companies, I forget which one off the top of my head, but there is a very real conversation about, “Hey, why should we pay you for this data if we could scrape it for much less money?” Now, of course, the version that you’re going to get from Reddit, if you pay them for it, is probably going to have other advantages just in terms of the metadata, the infrastructure, being able to ask Reddit questions about how the data works, all that kind of stuff. But when we’re thinking about the legal reality behind these decisions, I think part of it has to do with the idea of the business model.

(30:29):
And part of it has to do with, I think, some degree to which some of these platforms may be genuinely responding to their own users being upset. And LinkedIn scraping, from before the current generative AI days, is actually a really good example of this, because LinkedIn brought a lot of scraping litigation against primarily business competitors that were using LinkedIn data in order to run a recruiting tool or do other things that one might want to do with professional information. And to some extent, that was protective of their business model. These were effectively their competitors, or they would roll out a product that was competing with whatever that company was doing. But also, legitimately, sometimes folks had real privacy concerns about the fact that, “Hey, I shared this data on LinkedIn. I didn’t assume that it was going to go everywhere. Now it’s gone everywhere.” I think that is different than the web archiving context.

(31:22):
And I’m not saying, “Oh, this is the same thing.” But I think why I bring it up is to say that you have this sort of circumstance under which there’s a variety of different incentives for limiting access to data, and it’s impossible to disentangle them. It’s impossible to say, “Oh, this is only because of business models. Oh, this is only because people have privacy or usage concerns where this goes outside where it was supposed to be.” And oftentimes tech companies — LinkedIn has long said, actually, that their primary reason for a lot of their anti-scraping tooling is to protect users’ privacy. Now, I think that’s a hard position to defend given the sort of business model stuff, that that’s the only reason, but I don’t think it’s not part of it. So I think that when we think about the moves by companies like Reddit to restrict all kinds of access, including the Internet Archive and the Wayback Machine, you can’t just pin it to one thing.

(32:16):
And it’s not always based on one specific legal theory because oftentimes they’re trying a bunch of different stuff simultaneously, of which copyright might be one of the tools, but often is actually not the most useful if you’re talking about really significant amounts of web scraping. I hope that sort of answered your question, Dave.

Mike Masnick (32:34):
The one thing I was going to add in the Reddit context is that it is an example of where this can lead in terms of starting to test out questionable or extreme legal theories. So one of the cases that Reddit has is against this company called SerpApi (I don’t know how they pronounce their name). And you can argue that this is perhaps not a good company, but basically what they do is they scrape Google results and create an API so that you can programmatically make use of Google results. Google is also suing them, but that’s a separate case. But you have Reddit suing this company over copyrights that Reddit doesn’t own, as Kendra noted. It’s the users, in most cases, if there’s any copyright interest at all. And they’re suing this company for scraping Google’s results, which again is not Reddit, and claiming a DMCA 1201 anti-circumvention violation over a technological protection measure that Reddit itself hasn’t set up.

(33:35):
The only thing that they’ve done is cut a $40 million deal with Google. And so you get these sorts of stacking legal theories and questionable things. While you can see, okay, Reddit is upset that perhaps AI companies are routing around doing a deal with Reddit or with Google, because they can use a company like SerpApi to get Google results that include Reddit content, because Google has a deal with Reddit, it leads to really questionable places in terms of other types of scraping or other uses that are important and useful culturally. But because everybody’s sort of trying to figure out how do we do these things and how do we cut these deals, you see these sorts of somewhat stretched legal definitions, I think, or attempts at questionable cases.

Kendra Albert (34:22):
And can I just say one more thing about Mike’s point real fast, which is I think that that’s entirely true. And I think the other thing to point out there is, as much as I like to distinguish between good things and bad things, I’ll go on the record as being in favor of that, I think when we’re talking about making case law, oftentimes the decisions judges make don’t say, “Okay, well, I don’t like this company because I think their business model’s bad, and so I’m going to find that they violated the CFAA because of that, but for the good guys, it’s not a CFAA violation.” That’s not usually how that part of the law works. We actually get to do that way more in fair use. Because of that, in my current job, we often work with researchers who scrape internet platforms to look for things like bias, discrimination, to understand how platforms work, that kind of thing.

(35:07):
And those folks are subject to all the same bodies of law that get made by, well, Reddit is pissed off that you can get Reddit results from Google at this company, or Reddit feels like they’re channeling their users’ outrage that the users’ data is being used for purposes they didn’t intend. So I think it is really important to note that archiving, research, all of these kinds of uses often basically require exactly the same tools, just like the Wayback Machine does: using bots to view webpages and archive them. Mark, I’m wildly dumbing down the complexity of what you do, but researchers are using the same tools to scrape data and to sort of understand how tech works. So I think it’s not actually easy to just be like, “Okay, great. This technology, this way of doing it, is good or bad, and we should just make a rule generally.”

Mark Graham (36:00):
I want to explain a little bit, too, about what’s at risk here beyond just news. The Internet Archive archives more than a billion URLs a day. And one of the signals that we follow is links added to Wikipedia articles, for example, all of them. And as a result of that, we have been able to identify and fix, that is, edit and replace, otherwise broken URLs that would return a 404, with archives of those references that human beings had added to Wikipedia articles over the years. More than 30 million links have been fixed in this way. Pew Research, for example, found, for a collection of URLs they looked at that were 10 years old, that 38% of them were no longer available on the live web. So what does that mean if we can’t have access to this material anymore? A variety of things. Hundreds of times a year, the Wayback Machine team produces an affidavit to attest to the veracity of our web archives for use by lawyers in courts.

(37:03):
And often these are cases of product liability, maybe a misrepresentation by a company, et cetera. And this material is often the critical evidence that is used to determine the outcome of the case. So there are any number of applications of web archives beyond just news that are vital to our society, to be able to hold those in power accountable and to be able to help those curious enough to learn to inform themselves.
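The link-fixing workflow Graham describes can be approximated with the Wayback Machine’s public availability endpoint: if a cited URL now 404s, look up the closest archived snapshot and substitute it. A simplified sketch follows; the Archive’s production pipeline is certainly more involved than this.

```python
# Simplified sketch of dead-link repair: if a URL 404s, fall back to the
# nearest snapshot from the public Wayback availability API.
import json
import urllib.error
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str = "") -> str | None:
    """Return the nearest archived copy of `url`, if one exists."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

def repair(url: str) -> str:
    """Return `url` if it still resolves, else an archived copy if any."""
    try:
        urllib.request.urlopen(url)  # raises HTTPError on 4xx/5xx
        return url
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return closest_snapshot(url) or url
        raise
```

Run across the millions of citation URLs in Wikipedia, this one small substitution is what "more than 30 million links fixed" amounts to in aggregate.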

Mike Masnick (37:30):
I do think that that is important to just remember the concept of the open web itself and sort of how we got here in the first place. I think it gets very easy. I mean, I even sort of got bogged down immediately on the AI aspect of all this, but the open web has been around for more than three decades at this point. And I think many of us are here because we believe in the promise of the open web and what it enabled in terms of community and culture and sharing of information and meeting people and everything. So much of what we rely on today was built on this open web. And the concept of the open web is this idea that it’s not controlled by any one entity. And it is not locked down and limited, but that we can build on it and do more with it and we can share with each other and build culture.

(38:22):
Culture is about multiple people understanding the same concepts. And that is built very much on the open web these days. And so much of where this unfortunately potentially leads to is a locking down of the open web just because of concerns about how it might be used in one particular way. And so just as I know we’re sort of getting to the Q&A part, I felt like we should emphasize that aspect of why we’re all here.

Chris Freeland (38:52):
Thank you, Mike, for acknowledging that. I’d say long live the open web. I 100% agree with everything you said. The open web is an important part of our culture, and I hope that it remains that way. And Mark, I think it may be helpful if you can explain: how does the Wayback Machine make data available in bulk, and what kinds of protections are in place to prevent some of the abuses that have been mentioned here?

Mark Graham (39:17):
Sure. Generally speaking, we don’t make material available in bulk. The underlying files behind the Wayback Machine are generally not publicly accessible. We do provide an ability to play back, to replay, individual web pages through what I refer to as the thin straw of the Wayback Machine. For those of you who have used the service, you understand what I mean; it’s pretty slow. There are certain features where one can list large numbers of URLs for a given site, for example. At the request of some publishers, including the New York Times, we’ve disabled that capability for those particular sites. We do some archiving of material that is generally considered to be publicly available, in particular material from governments. We participate with many others, including Kendra with Perma.cc at Harvard, on doing a deep dive on material from the US government. And we do package that material up, and we do make bulk access to that particular collection of web archives available to researchers and others.

(40:23):
And also, as I noted, that’s on the playback side, how we serve material out to the world. On the archiving side as well, there are a variety of mechanisms that we put in place to do limiting, to detect and deter access to the service that is not human originated.
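The “list large numbers of URLs for a given site” feature Graham refers to is, in general, exposed through the Wayback Machine’s public CDX index API (as he notes, it has been disabled for particular publishers’ sites at their request). A small, deliberately modest query against a generic example domain looks something like this.

```python
# Sketch: listing a handful of captures for a site via the public CDX API.
# Kept deliberately small ("thin straw"): a low limit, a single request.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "example.com/*",  # any capture under this host
    "output": "json",
    "limit": "10",           # stay polite: ask for only a few rows
})
endpoint = f"https://web.archive.org/cdx/search/cdx?{params}"
with urllib.request.urlopen(endpoint) as resp:
    rows = json.load(resp)

if rows:
    header, captures = rows[0], rows[1:]  # the first row names the fields
    for capture in captures:
        record = dict(zip(header, capture))
        print(record["timestamp"], record["original"], record["statuscode"])
```

The design tension Graham describes is visible right in the parameters: the same index that lets a researcher find one lost page could, without limits, enumerate a whole site, which is why per-site restrictions and rate limiting sit in front of it.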

Chris Freeland (40:41):
Very helpful. Thank you. Question for everyone. If the Wayback Machine and other archival institutions get blocked, people are probably still going to do some archiving, but they’re going to do so in maybe less legitimate ways, with screenshots and other things. And so I’d be interested in your thoughts on this issue of maybe the non-legitimate archives, or the preservation by organizations that are outside of the traditional library sphere. What does that mean for the historical record?

Kendra Albert (41:07):
I’m going to just leave “non-legitimate archives” over there. Well, so I think there’s a couple things to think about there. One is, I think, yes, certainly screenshots are not as good as a more interactive page component, but I think ultimately having something of it is better than having nothing at all. One area I work on a lot is video game preservation, where we encounter a lot of somewhat similar challenges in terms of technological complexity, the sort of challenges with permissions from rights holders, that kind of thing. And one thing that I think about a lot there is, in some ways, when you make it really hard for institutions to legitimately preserve things, for institutions that are big, public, who are very clear about what they do and how they do it, you do in some ways cede ground to smaller institutions that may have different practices.

(41:55):
And some of those institutions are often really good at what they do, and they’re just quiet about it. And that’s great. And some of those institutions, I think we maybe all followed, there was a whole kerfuffle about, I think, archive.is, which was sort of a tool that people used for archiving webpages, often getting around paywalls, that was allegedly running a fake CAPTCHA that was DDoSing a critic of the site. I think that’s a really good example of one of the potential downsides of some of the more aggressive attempts to limit automated access, because folks were not going to that site because they necessarily would’ve preferred that site. They were going to that site because they could view content there that they weren’t able to view elsewhere, or they could access an archived page that they couldn’t access elsewhere. And so I think there is a real risk in a lot of these spaces of making it very hard for institutions that want to do the right thing to effectively preserve or save works.

(42:50):
And then it’s sort of causing challenges for both the historical record and for who’s left.

Mike Masnick (42:57):
Yeah. I mean, I think there are good actors in this space. And obviously the Wayback Machine, the Internet Archive are a very clear example of a good actor. And if you continue to make life difficult for them, it is only going to push people to those who maybe are less good actors and there is other kinds of collateral damage that comes along with that.

Chris Freeland (43:18):
Leaving the non-legitimate archives on the floor, but something of a related question. So should preservation institutions be treated differently from AI companies in law or policy? And are there then proactive policies that libraries need to be able to continue doing this work in the digital age?

Kendra Albert (43:37):
I mean, in some ways they already are. Section 107, the reason I kept saying you actually get to talk about what people are doing, Section 107, which is fair use within the US, does actually care about what you’re doing with the content. Section 108 of the Copyright Act is specific to libraries and certain kinds of archival and preservation institutions, and allows them to do things that other institutions can’t do. So it’s not a question of “should we treat them differently?”; we already do. It then becomes a question of “hey, should we treat them differently anywhere else?”, which is maybe the sort of question I’m asking. And I think it’s really hard in the existing scraping law context to see how that would quite work. Although I think we did see some of that in a case called Sandvig v. DOJ, where some researchers sued the DOJ over the Computer Fraud and Abuse Act’s criminal provisions making it harder to do First Amendment-protected kinds of research.

(44:24):
So I think there are some inklings of that, and it would be fantastic, I think, to see more engagement with this question of what are the actual uses we think are good and important and how do we promote those, versus sort of, okay, just get rid of the whole thing.

Mark Graham (44:38):
Yeah. I’m going to add, first of all, that I’m not a lawyer, and I do recognize the existing copyright and fair use allowances that substantiate and support the work of the Wayback Machine. But at the same time, there was the Vanderbilt clause, added to carve out specific, explicit protections in the area of television news archiving. I should note that the Internet Archive has a very robust television news archiving program as well. But I want to flip it around a little bit and say that news is a very special category of online material. It plays a vital role in our democracy. Indeed, it’s been referred to as the fourth estate, and various measures of privilege are given to news and news organizations. And I might suggest that with those privileges and rights come certain responsibilities related to access and availability. We’re living in a world that’s awash with mis- and disinformation. The Internet Archive recently co-published a paper that suggests that up to a third of new websites and webpages appearing on the public web today are at least partially AI generated.

(45:50):
And so this is a time of rapid change. In fact, if we’re paywalling and making quality journalism generally unavailable to people unless they have a subscription, which only a teeny, teeny percentage of the population has, then we’re going to end up more and more with a world where the truth, the quality journalism, is paywalled and therefore generally inaccessible to people, but the lies will proliferate, and they will become, as they are in many cases, the dominant presence in the conversation. When I was growing up, I had a library, a physical library, and I had access to the New York Times and other magazines that I was able to read. If that library hadn’t had access to that material, I simply wouldn’t have had access.

Chris Freeland (46:35):
Hat tip to Nathan J. Robinson at Current Affairs: the truth is paywalled, but the lies are free. I want to close with our final question here for each of the panelists. What can anyone who’s listening here today do to help change this trajectory?

Mike Masnick (46:49):
I mean, speak about it, talk about it. Obviously use the tools well and intelligently, and explain to others how you’re using these tools and why they matter. Certainly when it comes to things like potential policy or legislation, be aware of what’s happening and be willing to speak out and make sure that nothing gets in the way of important cultural institutions like the Internet Archive. But really, just be a part of the conversation. I think a lot of people don’t understand where this is leading and the impact on organizations like the Internet Archive and tools like the Wayback Machine. And so making sure that more people are aware is, I think, the most important thing you can do at an individual level. Obviously, at an institutional level, if you work for a news organization that is blocking access to the Internet Archive, maybe try to convince people that that is a bad idea and will have downstream cultural impacts that are not good for society, but that depends more on where people are situated.

Kendra Albert (47:54):
Mike stole one of the things I was going to say, which is that for folks who have institutional affiliations, make sure that, A, you can still access the Internet Archive and that it is still able to access pages from your institution. And then, if it’s not, make the case internally that, hey, this is why it’s important for my work, for the things that I do, for the things that I care about, which I think is going to be much more powerful coming from folks who are internal to an institution than necessarily coming from those of us who are sort of out here being like, “Doom is coming, archiving is stopping.” So to the extent that folks have an institutional role where they can bring attention to these issues, I think that’s really valuable.

Chris Freeland (48:33):
Mark, how about you?

Mark Graham (48:35):
Just a few things. First of all, use our service. We’re a public library, and we love it when people are able to benefit from the resources that are available from our library and give us feedback about how we can do a better job of providing those services. Subscribe to our newsletters, follow us on the socials. If you’re a journalist or know a journalist, I’d recommend that you check out the Fight for the Future letter that Chris shared here. And then if you’re in the Bay Area, come visit us. We host more than a hundred events a year at our facility in San Francisco, and every Friday, except for, I think, Thanksgiving and Christmas, at one o’clock we host a tour so you can get an in-depth and personal look at what we do and how we do it.

Chris Freeland (49:23):
Thank you for that, Mark. Thank you to Mark and to Mike and to Kendra for such a fascinating conversation today and to Dave Hansen and Authors Alliance as always for facilitating and co-hosting this session. Thanks everyone. Have a great day. Thanks for joining us on this journey into the Future of Knowledge. Be sure to follow the show. New episodes drop every other Wednesday with bold ideas, fresh insights, and the voices shaping tomorrow.

On World Press Freedom Day, a Call to Keep the News Preserved


For nearly 30 years, the Internet Archive’s Wayback Machine has worked alongside journalists, researchers, and the public to ensure that the web—and the news it carries—remains part of our shared historical record. Today, on World Press Freedom Day, that mission faces a new and urgent challenge.


Some news organizations, including The New York Times, The Atlantic, and USA Today, are blocking their sites from being preserved in the Wayback Machine over unfounded concerns about AI scraping. As Andrew Deck from Nieman Lab noted in Marketplace, “None of the publishers were able to point to a particular AI company or other kinds of direct evidence that their content had already been scraped by the Wayback Machine.” As a result, important journalism is at risk of disappearing from the public record. More than 200 journalists have added their support to keeping the news in the Wayback Machine.

In response, Fight for the Future has launched a public petition calling on news leaders to work with the Internet Archive to ensure their reporting remains accessible for generations to come.

Take action

On this World Press Freedom Day, we invite you to stand with journalists and with the future of the historical record. Add your name to the public petition and join the call for news organizations to work with the Internet Archive to keep the news in the Wayback Machine.

Public Libraries Jump On Board the Our Future Memory Movement


The Our Future Memory movement was already building momentum with flagship library organizations like IFLA, ALA, and SPARC. But now, local and regional library systems from across the United States are leading the way in their own communities. From Wellesley Free Library in Massachusetts to St. Mary’s County Library in Maryland, all the way to the Minnesota Library Association, new signatories to the “Statement on Digital Rights” are demonstrating that the daily practices of libraries and other memory institutions need long-overdue legal protections. As the Statement lays out, those protections include four basic rights:

  1. to COLLECT MATERIALS IN DIGITAL FORM;
  2. to PRESERVE DIGITAL MATERIALS;
  3. to PROVIDE CONTROLLED ACCESS TO DIGITAL MATERIALS; and
  4. to COOPERATE WITH OTHER MEMORY INSTITUTIONS.

Wellesley Free Library (WFL) became the first town library in the United States to endorse these rights and join the Our Future Memory movement. It first opened in the Fall of 1883 in the building that is now Wellesley Town Hall. Now, in addition to serving more than 18,000 card-holding patrons from its main library and two branches, WFL pursues its community-oriented mission in great part by collaborating with libraries in neighboring and nearby communities. Director Jamie Jurgensen serves on the Board of the Minuteman Library Network, a local consortium of over 40 libraries, whose leaders recently voted unanimously to encourage other member libraries to join the Our Future Memory movement.

“The Wellesley Free Library is proud to sign the Statement on Digital Rights for Protecting Memory Institutions Online,” said Jurgensen and WFL Trustee Ann Howley. “In doing so, we continue to take a leadership role in raising awareness within our community and among peer libraries, in helping people to navigate the digital landscape and in advocating for equitable digital rights and access. Town libraries have always been spaces for learning, creativity, and exploration of new ideas. Digital literacy has become a huge part of our future. Through Wellesley Free Library’s endorsement, we are actively participating in conversations regarding the future of digital literacy, as well as reaffirming our commitment to stated values including promoting universal access to knowledge and ideas.”

Taking the Our Future Memory movement from Massachusetts to Maryland, the St. Mary’s County Library serves its own local community through three branches in Leonardtown, Lexington Park, and Charlotte Hall—not to mention its mobile library for senior facilities, day care centers, and other remote communities. Last year, it celebrated 75 years of service, but looking ahead, Director Michael Blackwell has shown national and global leadership by stewarding and contributing regularly to Readers First, an organization of over 300 libraries advocating better practices and collaborative partnerships between libraries and their e-content providers.

“St. Mary’s County Library is delighted to join the Our Future Memory coalition,” Blackwell remarked. “Libraries of all sizes and types have every reason to join this coalition, with small to medium sized libraries perhaps the most important reasons of all.”

“First, we too, perhaps in a small way, serve as ‘memory institutions.’ We are charged with preserving the history and heritage of our county, and we wish to make our materials available digitally as much as we can, without hindrance, and to ensure their access and preservation.

“Second, the budgets and staffing of smaller libraries make us utterly dependent upon a large community of digital providers to ensure free and fair access for our patrons. We do not have the wherewithal to navigate paywalls or create archives beyond our immediate purview. If memory institutions are unable to join together to ensure digital access to human heritage, we face being shut out of access to and participation in a larger whole. Just as ‘no man is an island,’ no library can now exist except as ‘part of the main.’ And increasingly, digital access and preservation are what make up that greater whole. They are the essence of library work today and into the future. Ownership of our physical materials in the digital world and continued access to those materials are fundamental to our mission.  Without these rights, we ultimately cannot engage in our work.”

In the Midwest, the Minnesota Library Association (MLA) undertakes its own impressive advocacy and public-service work. Now a chapter of the American Library Association, it has a history dating all the way back to December 29, 1891, when a small group of librarians met to organize a State Library Association for Minnesota. Early on, MLA played a major role in building support for the legislative bill establishing the first State Library Commission, and it continued that legislative advocacy throughout the 20th century. In 2022, it incorporated the professional organization of Information and Technology Educators of Minnesota (ITEM) as a division within its ranks. Most recently, MLA has been spearheading efforts to pass state legislation to keep publishers and commercial vendors from gouging tax-funded libraries with costly, short-term eBook licenses—which can be priced at levels three to five times higher than those facing other consumers.

MLA President Liza Shafto explained the decision to sign the Statement and join the Our Future Memory movement not only as a natural extension of traditional library work, but as encouragement for these ambitious legislative efforts: “For libraries across Minnesota,” she said, “the four digital rights reflect core functions carried out every day. Libraries must be able to collect digital materials, preserve them, provide appropriate online access, and work with other institutions to ensure ongoing availability. These responsibilities extend long-standing library practices into the digital environment.

“These rights also reinforce the Minnesota Library Association’s advocacy for fair and sustainable access to digital content, including efforts to ensure that Minnesota libraries can provide reliable and equitable access to ebooks for all residents.

“The Minnesota Library Association supports these rights because they enable libraries to continue serving their communities with dependable and equitable access to information.”

Ready to Join?

The process is simple, and we encourage memory institutions and their allies to sign the Statement and join the movement. Just go to the Our Future Memory website, download and sign the statement, and send that copy back to campaigns@internetarchive.eu.

Looking for Other Ways to Participate?

If you’re going to the Rare Book and Manuscript Section conference in Milwaukee, be sure to sign up for our workshop, “Protect Our Future Memory: Developing Digital Rights for Special Collections.”

Information Stewardship Forum 2026: Creating Community and Purpose Around US Government Information 

As soon as people started walking in the door, I breathed a sigh of relief. After months of careful (some might say obsessive) planning, we were kicking off the inaugural Information Stewardship Forum 2026. Over three days in March, we opened the doors of the Internet Archive to 120 people who work tirelessly to preserve and give access to government information in the United States. They traveled to San Francisco representing different vital facets of this work: libraries, archives, journalism, research, policy, nonprofits, funding, and technology. Participants also reflected different parts of the government information ecosystem (which includes federal, state, and local stakeholders).


The Forum was constructed as a space for participants to share tools, workflows, and lessons learned from digital and physical preservation efforts, and to support practical knowledge exchange across domains and disciplines to ensure that government data remains accessible, trustworthy, and resilient in a rapidly changing information landscape. The preservation of government information has long been carried out by libraries and archives, but in the current moment this work carries particular weight and importance.

Internet Archive was a natural host for this event, having long supported preservation of and access to government-created information and publications: through web archiving efforts including Archive-It; by participating in the End of Term Crawl; through digitizing government-produced publications; and by serving as a depository library in the Federal Depository Library Program (FDLP). In 2022, the Internet Archive launched Democracy’s Library, built on a straightforward but urgent premise: governments have created an abundance of information and put it in the public domain, but the public can’t easily access it.

Of course, the Internet Archive is but one of many stakeholders; scaffolding has been established by institutions including the U.S. Government Publishing Office (buttressed by more than 1,000 FDLP depository libraries), the National Archives and Records Administration, and the Library of Congress, alongside countless state and local agencies and archives, and thousands of government information librarians and other specialized data stewards.


Held under the Chatham House Rule, the forum created space for candid discussion across plenaries, lightning talks, Birds of a Feather sessions, and closing conversations. It also created ample room for people working on related problems to compare notes, test ideas, and begin seeing the field less as a set of isolated projects and more as a community of collaborators. One theme surfaced again and again: preservation is not a solo endeavor. Preserving public information is not only a technical challenge; it is also an organizational, legal, financial, and civic one. Some high-level themes emerged across our three days spent in community.

  • Not only is government information being lost, but the stewardship ecosystem itself remains fragmented and under-described. Mapping this space – who is doing what – and identifying key information assets will be a critical part of this work going forward.
  • Recognition that stewardship is more than storage. Saving material is only the beginning. Continuity, trust, and usability are also important components of preservation.
  • The emergency-response or triage mode that has animated many recent efforts is not sustainable; this work has been indispensable, but fragile. Building on work done on an emergency basis, without feeling confined by it, is a challenge for moving this work forward.
  • Public records laws and access frameworks exist, yet information is still removed, obscured, or made difficult to use over time. Data may remain technically public while being trapped in formats or interfaces that frustrate long-term use. Worse, government information may be inappropriately constricted behind paywalls.
  • A related concern was the growing use of web harvesting restrictions in response to concerns about use by AI companies. While understandable in many contexts, those restrictions can also impede public interest archiving and make preservation harder precisely when long term capture is most needed. 
  • Local government information was identified as especially vulnerable, as were climate data, health data, and disaggregated data that allows communities to see themselves in the record.
  • Advocacy matters; the more people who can be drawn in to understand what is at stake and can participate in stewardship, even in modest ways, the more resilient the system becomes. One attendee framed this practical challenge as “How can we have easy to use tools that will allow others to invest in this work?” Tools for participation!

An important concrete outcome from this convening is Preservation of Government Information: A Call to Action. Shared as a draft during the Forum, this text serves as a manifesto of sorts and gives broader language to themes that surfaced throughout the event: that public access to government information cannot be left to chance, that archive-ready publication should become a norm, and that preserving public information must be treated as a civic obligation. Individuals and organizations are urged to sign and express their support for the document.

So why did I breathe that sigh of relief when we opened the doors at 300 Funston Avenue? I could see the positive body language – recognition, surprise, delight, handshakes, hugs and exclamations that come when people are in community with those they recognize as their people. Across the three days, this emerging community had the opportunity to coalesce, to learn together, and to recognize that they are part of a broader stewardship ecosystem, one that will need stronger coordination, communication, and community. There are still many challenges in this space, but there is a firm resolve to ensure that access to government information remains open and accessible to the public. 


The Information Stewardship Forum 2026 was designed to surface the shared problem space and to facilitate connections, and it was rewarding to see it unfold into a gathering where people were actively identifying concrete collaborations, naming shared principles, and discussing infrastructure and standards. Fortunately, attendees did not need to wonder how to keep the energy alive after going home: they were able to join an online community space established by the Internet Archive that ensures the conversation and community can continue.

Graphic recording by Jasmin Pamukcu, Cusp Consulting.

Introducing Vanishing Culture: A New Book on the Loss of Our Digital Memory

From disappearing news articles to lost films, music, and websites, a new book from the Internet Archive reveals how our shared digital record is eroding, and what it will take to preserve it.


What does it mean to live in an era where culture can simply… disappear?

Vanishing Culture: A Report on Our Fragile Cultural Record—a new book from the Internet Archive—brings together essays, research, and case studies that document a growing crisis: the erosion of access to the knowledge, media, and history that shape our collective memory. From journalism and government information to music, film, and the web itself, the shift from ownership to access—and from physical to digital—has made culture more vulnerable than many realize.

This isn’t just about nostalgia. It’s about accountability, scholarship, and the public’s right to access information. When news articles are altered or removed, when public information is taken offline, or when creative works are locked behind shifting licenses, the historical record becomes incomplete. What disappears is not just content, but context.

DOWNLOAD & READ Vanishing Culture for free at the Internet Archive. PURCHASE A PRINT COPY from Better World Books, or your local bookstore.

Recent efforts by some publishers to block web archiving services like the Wayback Machine underscore how fragile access to digital history has become. When large portions of the web are intentionally excluded from preservation, gaps in our shared record are structural, not accidental.

At the same time, libraries, archivists, and preservationists are working to push back against this loss. The Internet Archive and its partners continue to build a digital library for the web: capturing, preserving, and providing access to materials that might otherwise vanish.

Vanishing Culture is both a warning and a call to action. It invites readers to reconsider what it means to preserve culture in a digital age, and to recognize that without intentional effort, much of what we create today may not be available tomorrow.

Read the book, explore the essays, and join us in the work of preserving our digital past before more of it disappears.

Gone but Not Forgotten: Recovering the Dead Web

TL;DR: A Pew Research Center study found that 38% of webpages from a decade ago, and about 25% of pages sampled across the decade, are now inaccessible; our analysis shows that the Wayback Machine has rescued roughly 15% of those otherwise dead pages.

In 2024, the Pew Research Center published a link-rot study, “When Online Content Disappears”. They stated, “38% of webpages that existed in 2013 are no longer accessible a decade later”. They further noted, “a quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible”. This is not an isolated report quantifying the rate of loss of online information; numerous other link-rot studies in the last two decades have reported similar numbers or worse, depending on the context and samples. For example, Ahrefs, an SEO company, reported in the same year, “At Least 66.5% of Links to Sites in the Last 9 Years Are Dead”. In 2021, Jonathan Zittrain published an article in The Atlantic, “The Internet Is Rotting”, in which his team analyzed about 2 million external links from New York Times (NYTimes) articles and reported that 25% of deep links had rotted. They further noted that 72% of the older links, from 1998, were dead. A recent longitudinal study on link rot from Old Dominion University (ODU), “Some URLs Are Immortal, Most Are Ephemeral”, analyzed 27.3 million URL samples from the Wayback Machine since 1996 and reported that about 65% of the sampled URLs were found dead on the live Web when checked in 2023. Brewster Kahle, the founder of the Internet Archive, has been citing numbers from the early days of the Web, putting the average life of a webpage anywhere from 40 to 100 days. A 2026 book, “Vanishing Culture: A Report on Our Fragile Cultural Record”, by Messarra et al., highlights underlying causes of numerous recent cultural digital losses while emphasizing the critical roles libraries and archives must play to maintain our cultural history for the future. Different studies have looked at the problem from different perspectives and contexts, hence it is often difficult to compare them side by side, but they all agree that an increasing number of links are rotting with the passage of time. However, some of these studies (not all) have failed to acknowledge the existence of Web archives, such as the Wayback Machine, where a portion of the dead Web might be preserved and can be used as a fallback when a reference leads to a broken link.

In this post we go through some of the link-rot studies and look at them from the perspective of the Wayback Machine to see how much of the dead Web can be rescued. Table 1 shows the status of the dead and rescued Web at a glance as sampled by a few different studies.

Study          Year  Period     Samples  Dead  Rescued
Pew (All)      2024  2013-2023  5.4M     26%   16%
Pew (General)  2024  2013-2023  1M       27%   13%
Zittrain NYT*  2021  2013-2013  88K      40%   38%
ODU NYPW       2024  1996-2021  27.3M    65%   65%
Table 1: Dead links from various link-rot studies rescued by the Wayback Machine.
* The NYT numbers are based on our recreated dataset.

Let us begin by looking at the study from the Pew Research Center. They generously shared their dataset with us, so it was rather trivial for us (after performing some transformations and extractions, as the original dataset was stored in Parquet files) to check the URLs against the Wayback Machine to see if and when each of them was archived for the first time. Their dataset contains 5.4 million unique URLs in the general, news, government, and Wikipedia-references categories, sampled from the Common Crawl archive and Wikipedia pages. They also reported on Tweets in their post, but that dataset was not shared with us due to restrictions posed by the usage policies.
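For readers who want to reproduce this kind of lookup, here is a minimal sketch against the Wayback Machine’s public CDX API. The endpoint and parameters are real; the Parquet file name and column name are illustrative assumptions, not the Pew dataset’s actual schema, and this is not our production pipeline.

```python
# Minimal sketch: find the earliest Wayback Machine capture of each URL
# via the public CDX API. The file and column names are hypothetical.
import pandas as pd
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def first_capture(url: str) -> str | None:
    """Return the timestamp (YYYYMMDDhhmmss) of the earliest capture, if any."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp",
        "filter": "statuscode:200",
        "limit": 1,  # CDX results are time-ordered, so this is the earliest
    }
    resp = requests.get(CDX_API, params=params, timeout=30)
    rows = resp.json() if resp.text.strip() else []
    # Row 0 is the header; a second row means at least one capture exists.
    return rows[1][0] if len(rows) > 1 else None

urls = pd.read_parquet("pew_sample.parquet")["url"]  # hypothetical names
for url in urls.head(5):
    print(url, first_capture(url) or "never archived")
```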

Before we dive into our findings, below are brief descriptions of some terms that we will use frequently (a small illustrative sketch of how they combine follows the list):

  • Alive: URLs that return 200 OK HTTP status code when resolved
  • Dead: URLs that return HTTP error status codes, TCP connection errors, or DNS failures when resolved
  • Preserved: URLs that are Alive on the live Web as well as present in a Web archive
  • Rescued: URLs that are Dead on the live Web, but are present in a Web archive
  • Endangered: URLs that are Alive on the live Web, but are not present in any Web archive
  • Vanished: URLs that are Dead on the live Web and also not present in any Web archive
  • Archived: Preserved + Rescued
  • Accessible: Preserved + Rescued + Endangered
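Put another way, the categories reduce to two signals per URL: whether it is alive on the live Web and whether it is archived. Here is a compact sketch of that mapping; how `alive` and `archived` are actually determined (an HTTP probe, a CDX or availability lookup) is abstracted away as an assumption.

```python
# Sketch: map the two signals defined above onto the four mutually
# exclusive categories used throughout this post.
def classify(alive: bool, archived: bool) -> str:
    if archived:
        return "Preserved" if alive else "Rescued"
    return "Endangered" if alive else "Vanished"

# Derived groupings from the list above:
#   Archived   = Preserved + Rescued
#   Accessible = Preserved + Rescued + Endangered
assert classify(alive=False, archived=True) == "Rescued"
```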

When we do not take any Web archives into account, about a quarter of all the 5.4 million sampled URLs would be considered inaccessible or dead, as illustrated in Figure 1. However, when we leverage the Wayback Machine to access otherwise dead URLs, the fraction of inaccessible or vanished URLs drops from one in every four down to only one in every ten. The Wayback Machine has about 72% of the entire dataset archived, of which 56% are preserved from URLs that are still alive on the live Web and 16% are rescued from the dead. There are 18% of the URLs from the sample that are still alive but have not been archived in the Wayback Machine yet, which we call endangered, as they may become vanished if they ever cease to exist on the live Web. It is worth noting that we did not account for any captures of these URLs that might be present in any of the many smaller Web archives other than the Wayback Machine, which, if accounted for, might increase the percentage of accessible URLs a little more. Moreover, we relied on HTTP status codes and did not look into the contents of the pages to check for any soft-404s (i.e., error pages that wrongly return a 200 OK HTTP status code) or other irrelevant content, which might change the numbers further.
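The arithmetic behind that "one in four down to one in ten" drop, using the rounded percentages quoted above:

```python
# Bookkeeping behind Figure 1 (rounded percentages of the 5.4M sampled URLs):
preserved, rescued, endangered = 56, 16, 18
archived   = preserved + rescued               # 72% held by the Wayback Machine
accessible = preserved + rescued + endangered  # 90% reachable somewhere
vanished   = 100 - accessible                  # 10%: one in ten, not one in four
dead       = rescued + vanished                # 26%: the "about a quarter" dead
print(archived, accessible, vanished, dead)
```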

Figure 1: Archiving status of all the URLs from the Pew dataset in the Wayback Machine.

A subset of about 1 million URLs from the Pew dataset is a sample of general webpages from the last decade, spanning 11 years from 2013 to 2023. They noted that about a quarter of the URLs from this subset were dead in 2023, with older URLs having a greater percentage of loss, all the way up to 38% for links from 2013. We recreated their yearly graph in Figure 2 in orange, with an overlay of the URLs rescued by the Wayback Machine in green. We found that about 38% of the 38% dead URLs from 2013 (i.e., about 15% of the total) are rescued by the Wayback Machine. Moreover, of the roughly one quarter of the cumulative URLs of the general sample that were considered dead, about half were rescued by the Wayback Machine. It is worth noting that the last three years in Figure 2 seem to be rescued almost completely, but this is a side effect of the ingestion of Common Crawl data from recent years into the Wayback Machine, Common Crawl being the source of the sample of the Pew dataset.

Figure 2: Yearly archiving status of URLs from the general sample of the Pew dataset in the Wayback Machine.

We tried getting access to the dataset of about 2 million URLs from Zittrain’s NYTimes outlinks study, but we have not received it yet. However, in the interim we created our own dataset by downloading all the NYTimes pages published in 2013 that are present in the Wayback Machine, extracting all the outlinks from them, and excluding all the links to pages from NYTimes itself. We were able to collect about 88 thousand such URLs this way. Then we checked the live Web status of each of the URLs (after following up to 5 redirects, if any) and also checked for their presence in the Wayback Machine. We found that 40% of the external links from NYTimes pages from 2013 were dead on the live Web, but 96% of those URLs are archived in the Wayback Machine. This means only about 2% of the URLs from this sample have vanished. However, this impressive number needs to be taken with a grain of salt, because we do not have the original URL sample, and our own sample is derived from pages present in the Wayback Machine, which carries an inherent bias: outlinks from those pages are more likely to be archived than the outlinks of pages that are not present in the Wayback Machine. That said, we will be keen to revisit these numbers if and when we get access to the original sample of URLs used in Zittrain’s study.
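As a rough sketch of that pipeline (simplified, and not the exact scripts we ran): the CDX query parameters and the `id_` playback modifier are real Wayback Machine features, while the limits, parsing, and filtering details below are illustrative.

```python
# Rough sketch of the outlink pipeline described above.
# Requires `requests` and `beautifulsoup4`.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

CDX_API = "https://web.archive.org/cdx/search/cdx"
session = requests.Session()
session.max_redirects = 5  # follow up to 5 redirects, as described above

def nyt_2013_captures(limit=100):
    """(timestamp, original URL) pairs for NYTimes pages captured in 2013."""
    params = {
        "url": "nytimes.com/*", "from": "2013", "to": "2013",
        "filter": "statuscode:200", "collapse": "urlkey",
        "fl": "timestamp,original", "output": "json", "limit": limit,
    }
    resp = requests.get(CDX_API, params=params, timeout=60)
    rows = resp.json() if resp.text.strip() else []
    return rows[1:]  # skip the header row

def external_outlinks(timestamp, original):
    """Extract non-NYTimes links from an archived page."""
    # The `id_` modifier asks for the original HTML without the Wayback
    # Machine's playback link rewriting.
    playback = f"https://web.archive.org/web/{timestamp}id_/{original}"
    soup = BeautifulSoup(requests.get(playback, timeout=60).text, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        url = urljoin(original, anchor["href"])
        host = urlparse(url).netloc
        if host and not host.endswith("nytimes.com"):
            links.add(url)
    return links

def is_dead(url):
    """Treat non-2xx/3xx responses, connection and DNS errors as dead."""
    try:
        return session.head(url, allow_redirects=True, timeout=30).status_code >= 400
    except requests.RequestException:
        return True
```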

A recent, and perhaps the most comprehensive, longitudinal link-rot study from ODU, on which we are a collaborator, analyzed 27.3 million URLs sampled from the index of the Wayback Machine, spanning more than two and a half decades. They reported that about 65% of the sampled URLs from 1996 to 2021 were found dead in 2023. A significant number of these samples were not even resolving in DNS, indicating that many of those domain names were no longer registered. They found that most URLs die rapidly in the first few years of their existence, but some of the longest-living sites are not dead yet. Luckily, all of the dead URLs in this sample are rescued by the Wayback Machine, by virtue of it being the source of the sample in the first place. This also means the ODU study cannot tell the percentages of endangered or vanished URLs, because its dataset contains no URLs that were never archived.

In summary, all of the link-rot studies, with varying numbers, indicate that the Web is brittle and that an increasing number of Web resources die with the passage of time. We found that Web archives like the Wayback Machine play an increasingly important role in rescuing the dead Web and minimizing the fracturing of the knowledge graph of the Web, but there is a lot more to do. For example, the Turn All References Blue (TARB) project has fixed more than 30 million broken links (and counting) on hundreds of wikis with the help of the InternetArchiveBot, the WaybackMedic bot, and the Wayback Machine.

While there is not a lot that can be done to resurrect the vanished Web, other than attempting to find alternate locations where the content might have moved (via projects like FABLE), we are determined to minimize the percentage of endangered URLs. However, there are some internal and external factors that limit our ability to make it ZERO, such as resource limitations, JavaScript-heavy pages, bot blocking, login walls, paywalls, the deep web, lack of timely discovery, etc. We strive to narrow down the potential loss of our cultural heritage via different means, such as ingesting feeds from MediaCloud, GDELT, and the Wikipedia EventStream, and, more recently, becoming part of the IndexNow initiative for link discovery soon after the corresponding page is created or updated on the Web. Moreover, we have the Save Page Now (SPN) service, and we urge that when you “See Something, Save Something!”. Your continued support will help us preserve the Web more and better.
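Triggering SPN from code can be as simple as the sketch below; the unauthenticated GET endpoint mirrors the “Save Page Now” button on web.archive.org, while high-volume or automated use should go through the authenticated SPN2 API instead.

```python
# Minimal sketch: ask Save Page Now to capture a URL. This mirrors the
# public "Save Page Now" button; the authenticated SPN2 API offers more
# control (capturing outlinks, screenshots, status polling, etc.).
import requests

def save_page_now(url: str) -> str | None:
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    # On success the request typically resolves to the fresh capture,
    # e.g. https://web.archive.org/web/<timestamp>/<url>
    return resp.url if resp.ok else None

print(save_page_now("https://example.com/"))
```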

NOTE: This work was presented at the IIPC WAC 2025, with the talk recording available on YouTube and slides hosted in the UNT Digital Library. It was also presented at the WADL 2025.

ACKNOWLEDGEMENTS: We thank our friends at the Pew Research Center and the Old Dominion University and our colleagues Jake LaFountain, Stephen Balbach, Chris Freeland, and Mark Graham for their help and support in this work.


Dr. Sawood Alam
Research Lead, Wayback Machine
Internet Archive

U.S. Supreme Court Records and Briefs: The Arguments That Shaped America, Now Freely Available

Thanks to a generous gift of materials from the Wolf Law Library at the William & Mary Law School, and in keeping with the Internet Archive’s mission to digitize and provide universal access to knowledge, we are pleased to share more than 125,000 U.S. Supreme Court records and briefs. These materials, which span nearly two centuries of American law, are now freely accessible online.

Why This Matters


Most people are familiar with U.S. Supreme Court opinions as public documents. But the opinions are only part of the story. Behind every landmark ruling lies a vast archive of briefs, petitions, appendices, and supporting records; these are the arguments, evidence, and voices that shaped each decision. The Supreme Court may receive 7,000-8,000 petitions each year, but it grants a writ of certiorari to hear only about 80 cases. This collection includes records and briefs received by the court, both for cases granted certiorari and for those denied it; the latter category is much more voluminous than the former. Until now, these important public documents have been available only in limited ways: in print form in a small number of law libraries, and in other formats in other libraries, but not generally available for all people to freely access.

That has now changed. As part of Democracy’s Library, the Internet Archive’s large-scale effort to preserve and open government information, this collection includes records and briefs spanning cases from 1830 through 2019, making it one of the most comprehensive archives of freely available Supreme Court materials ever assembled in one place.

What’s Now Available

The collection covers three kinds of materials:

  • The first is the official records from the lower court(s): the trial transcripts, evidence, and procedural documents that travel with each case up through the federal judiciary. 
  • The second is the briefs: the petitions, responses, amicus filings, and supporting appendices submitted by the litigants themselves and by interested third parties. These briefs are the raw material of American constitutional argument. They capture the perspectives of individuals, corporations, civil society organizations, and government agencies pressing their cases before the nation’s highest court. 
  • The third category is the opinions (for cases that are heard by the Supreme Court): the ultimate decisions reached by the highest court in the United States, demonstrating the logic and reasoning of the court.

Taken together, they form a detailed documentary record of how legal arguments, social concerns, and political priorities have evolved over nearly two hundred years of American life.

Three Cases, Three Windows Into History

To understand what this collection makes possible, consider three landmark cases — one celebrated, one lesser-known, and one that provides a relevant window into the importance of public access — that together show why access to the full record matters.

Brown v. Board of Education, 347 U.S. 483 (1954)

Linda Brown Smith, Ethel Louise Belton Brown, Harry Briggs, Jr., and Spottswood Bolling, Jr. — all plaintiffs in Brown v. Board of Education during a 1964 press conference. Image from the Library of Congress.

The U.S. Supreme Court’s unanimous ruling that racial segregation in public schools was unconstitutional is one of the most studied decisions in American history. But the briefs filed in Brown reveal dimensions that the opinion itself does not capture. Thurgood Marshall and the NAACP Legal Defense Fund assembled social-science evidence, testimony from psychologists, and firsthand accounts from families to argue that separation was inherently unequal — not just legally, but psychologically. The record shows the Kansas district court’s own finding that segregation harmed Black children psychologically and that the practice was widely understood as a statement of racial inferiority. Testimony provided by sociologist Dr. Wilbur Brookover framed this issue:

In American society we consistently present to the child a model of democratic equality of opportunity… At the same time, in a segregated school situation he is presented a contradictory or inharmonious model. He is presented a school situation in which it is obvious that he is a subordinate, inferior kind of a citizen… the segregated schools perpetuate this conflict in expectancies, condemns the negro child to an ineffective role as a citizen and member of society.

— Dr. Wilbur Brookover, expert witness, trial testimony (1951).

The U.S. government filed an amicus brief urging desegregation — a striking signal of the federal government’s position at a pivotal moment in the civil rights era. 

Loving v. Virginia, 388 U.S. 1 (1967)

Less well-known than Brown, but no less significant, Loving v. Virginia is the case that struck down laws banning interracial marriage. Mildred and Richard Loving — a Black woman and a white man from rural Virginia — married in Washington, D.C. in June 1958 and returned to live as husband and wife in Caroline County, Virginia. Warrants were issued for their arrest the following month, and they were charged with the felony of having married across racial lines and returned to the state. A judge suspended their one-year prison sentences on the condition that they leave Virginia for twenty-five years.

The briefs make clear what was at stake beyond the criminal charge: the voiding of their marriage under Virginia law, the potential illegitimacy of their children, and the loss of inheritance rights, Social Security benefits, and other protections contingent on a legally recognized union. Language from the briefs illustrates the pain of separation and disruption that these laws caused for couples like the Lovings who were “prohibited from establishing a family abode and raising their children in places where they and their family have often been long established and where many blood relatives still reside.” Brief for Appellants, Loving v. Virginia (1967). To understand the legal and social world those briefs were arguing against, it helps to read the 1965 written opinion of Judge Leon Bazile from the Circuit Court of Caroline County, Virginia denying the Lovings’ earlier appeal — a document accessible in this collection. In it, he explained his reasoning:

“Almighty God created the races white, black, yellow, malay and red, and he placed them on separate continents. And but for the interference with his arrangement there would be no cause for such marriages. The fact that he separated the races shows that he did not intend for the races to mix.” These views were rejected by the U.S. Supreme Court two years later.

The U.S. Supreme Court unanimously reversed the Lovings’ conviction in 1967, holding that “the freedom to marry has long been recognized as one of the vital personal rights essential to the orderly pursuit of happiness by free men.” The briefs that built that argument — and the lower-court records that documented what the Lovings were up against — are now open to all.

Richmond Newspapers, Inc. v. Virginia, 448 U.S. 555 (1980)


In September 1978, a judge in Hanover County, Virginia cleared a courtroom during a murder trial, expelling reporters from Richmond Newspapers along with all other members of the public — the first time in the courthouse’s 243-year history that such a closure had occurred. A Richmond newspaper challenged the closure, and the case produced the U.S. Supreme Court’s first explicit ruling that the First Amendment guarantees the public and press a right to attend criminal trials. An excerpt from the documents makes it clear why access to trials is so important:

“Imagine an America in which secret trials had been held in the prosecutions of Aaron Burr, John Peter Zenger, or John Thomas Scopes; of John Wilkes Booth, James Earl Ray, or Sirhan Sirhan; of the Chicago Eight, the Watergate Seven, or the Wilmington Ten.”  — Brief for Appellants, Richmond Newspapers, Inc. v. Virginia (1979).

The briefs filed by the newspaper’s attorneys make an open access argument that extends naturally to legal records of every kind:

“The right of the individual to attend and observe any criminal trial… is of constitutional dimension because its derogation would undermine the logic of the constitutional scheme — a logic that relies crucially upon the publicity and openness of the state’s ultimate confrontations with its citizens.” — Brief for Appellants, Richmond Newspapers, Inc. v. Virginia (1979).

If the public has a constitutional stake in observing civil and criminal proceedings, it has an equal stake in reading the arguments that shaped those proceedings. This collection makes that possible for the first time at scale — not just for scholars with institutional access, but for anyone.

Who Benefits

Opening this collection to everyone matters. Legal scholars and historians can now trace the evolution of constitutional doctrine without traveling to law libraries or paying for access to expensive databases. Journalists investigating civil rights, criminal justice, or government power can dig into the primary record. Law students can study cases that never made it before the court, and can discover not just how the U.S. Supreme Court ruled, but how the best advocates in the country made their cases. And curious citizens — anyone who wants to understand how the American judicial system works through the highest court in the land — can read these documents for themselves.

A basic democratic principle is also at stake. Public confidence in legal institutions depends on free public access to legal records. When significant portions of the constitutional archive exist only behind paywalls or in specialized collections, the historical record becomes available only to the few. Records and briefs are not peripheral materials — they are essential ingredients in the judicial decisions comprising our nation’s law. Making them freely available is a matter of civic accountability, not just scholarly convenience.

What’s Next

The collection is available now through the Internet Archive, fully searchable and freely downloadable. Whether you’re tracing the history of a constitutional doctrine, researching a case that affected your community, or simply curious about the arguments behind a ruling you’ve heard about, we invite you to explore. This archive of American constitutional argument is now truly open to everyone. The collection falls under the auspices of Democracy’s Library, which is built on a straightforward but urgent premise: governments have created an abundance of information and put it in the public domain, but the public can’t easily access it.
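For programmatic exploration, one possible starting point is the `internetarchive` Python client (`pip install internetarchive`). The sketch below uses a placeholder collection identifier; substitute the actual identifier shown on the collection’s page at archive.org.

```python
# Sketch: search and download items with the `internetarchive` client.
# COLLECTION_ID is a placeholder, not the collection's real identifier.
from internetarchive import download, search_items

COLLECTION_ID = "YOUR_COLLECTION_ID"

for result in search_items(f'collection:{COLLECTION_ID} AND "Loving v. Virginia"'):
    item_id = result["identifier"]
    print(item_id)
    # Fetch only the PDFs for this item into ./<item_id>/
    download(item_id, glob_pattern="*.pdf", verbose=True)
```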

Thank you to Leslie Street, Director of the Wolf Law Library at the William & Mary Law School, for helping make this possible. Special thanks to the Free Law Project and CourtListener for providing metadata to enrich records in this collection.