In the latest episode of the Future Knowledge podcast, “Preserving the Web in the Age of AI,” Wayback Machine director Mark Graham, tech policy expert Mike Masnick, and media lawyer Kendra Albert discuss the reports that some news publishers are blocking the Wayback Machine from archiving their websites due to unfounded concerns over AI scraping.
For Graham, it’s an issue of supporting journalism and the historical record. The Wayback Machine has become “collateral damage caught up in the conflict between AI companies and publishers.”
As Graham recounts encounters with reporters and researchers, a clear pattern emerges: even the most well-resourced institutions cannot fully preserve their own digital history. The Wayback Machine has become an indispensable backstop, ensuring that the public record remains accessible even when original sources disappear.
“I was in the offices of The New York Times just a few weeks ago,” said Graham, noting that The New York Times has blocked the Wayback Machine from archiving its website, “and a senior researcher came up to me and said, ‘Oh my God, Mark, thank you so much for the Wayback Machine. We use you all the time. There is material available that we’ve used from the Wayback Machine that we can’t even find in our own archives.’ I get those stories all the time.”
For Masnick, blocking the Wayback Machine “will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn’t really a problem.”
When considering blocking all bot activity over fears of AI scraping, Albert cautions that websites, “whether they’re news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, ‘Well, actually some of these entities are behaving badly or doing things that we don’t like with our content, whether they’re entitled to legally or not, and therefore we’re just going to take a broad stance across the board.'”
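As background on the mechanics the panel discusses: blocking an archival crawler is often done through a site’s robots.txt file. The snippet below is an illustrative sketch only; `ia_archiver` is the user-agent string historically associated with the Internet Archive’s crawler, and whether any particular publisher uses exactly this directive is an assumption, not something reported in the episode.

```text
# Illustrative robots.txt excerpt (hypothetical)
# Asks the crawler identifying itself as "ia_archiver" to skip the
# entire site, while leaving the default open for other crawlers.
User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /
```

Note that robots.txt is purely advisory: crawlers comply voluntarily, which is part of why the harder questions in the episode turn on law and policy rather than on the protocol itself.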
Listen to the full episode on the Future Knowledge podcast:
Full transcript:
Chris Freeland (00:05):
Welcome to Future Knowledge, a podcast about knowledge, creativity, and policy brought to you by the Internet Archive and Authors Alliance.
(00:14):
We tend to think of the web as a living archive, this vast, searchable record of who we were, what we knew, and how we understood the world at any given moment in time. But that assumption is starting to crack. As publishers move to block AI scraping and restrict access to their content, the tools that quietly preserve our online history, like the Internet Archive’s Wayback Machine, are getting caught in the crossfire. What started as a fight over AI training data is quickly becoming something bigger: a question about whether the web itself can be archived in the future. Because if preserving the web comes to be seen as a threat, what happens to our memory when the past can’t be saved? Hi, everyone. I’m Chris Freeland. I’m a librarian at the Internet Archive. I want to welcome you to today’s discussion. We have assembled an excellent panel of experts to discuss how efforts to limit AI access are reshaping the boundaries of preservation, and what’s at stake if those boundaries continue to close.
(01:17):
So today we’re joined by Mike Masnick, the founder of TechDirt, Mark Graham, the director of The Wayback Machine, and Kendra Albert, a tech and media policy expert at Albert Sellers LLP. Here to introduce our speakers and to set the stage for today’s conversation is Dave Hansen, the Executive Director of Authors Alliance.
Dave Hansen (01:37):
Thanks, Chris. Hi, everyone. So this is a little bit different for us today. Usually we’re doing book talks, but we thought that this was such an important and fast-moving issue, and no one has yet written the book on what is happening with the crisis in web archiving and preserving the web. But it’s a really important issue. From the Authors Alliance perspective, we care about this because we’re quite fond of the internet, and being able to research adequately what has happened over time across the web is so important for any sort of journalistic writing, for history. I refer to the Wayback Machine weekly, at least, when I’m looking at back versions of documents and things like that for my writing. And I think we really do face a real crisis at this moment. This has never been an easy task, and I think we’ll hear from Mark that the Wayback Machine takes a lot of work, and other web preservation efforts take a lot of work.
(02:32):
It’s never been easy, but in the current moment, it is particularly challenging when we have news publishers and other platforms online making it not just legally or technically complicated, but in some cases, almost impossible to really engage in preserving content in an automated way. So we’re here to talk about that. And I think the three perspectives that we’ve assembled here are, I hope, going to fill in some of the pieces of what’s happening on the ground with web preservation, what’s happening in the broader policy sphere that’s driving some of this. And also what does the law have to say about this? Because often what we see happening is sort of a reflection or a shadow of the legal rights that exist. So with that, I’m going to turn it over to our speakers here. And Mark, how about we start off with you to just talk a little bit about what do you see as the major challenge right now in terms of preserving the web in the age of AI?
Mark Graham (03:32):
Sure. Well, I mean, first of all, for close to 30 years, the Internet Archive’s Wayback Machine has been archiving much of the public web, including journalism, and making that material available to people. I should note that a large percentage of this material is no longer available on the live web; indeed, thousands and thousands of news sites that we have archived over the decades are no longer available. And what has been going on recently, as reported by Andrew Deck of Harvard’s Nieman Lab and others, is that some news organizations and other platforms, most notably the New York Times and Reddit, have begun blocking, preventing the Internet Archive’s Wayback Machine from producing archives of their material and making that available. Indeed, it has been suggested that we are victims, if you will, collateral damage caught up in the conflict between AI companies and publishers.
Dave Hansen (04:35):
Thanks, Mark. Mike, how about let’s hear from you and take this whatever direction you want, but I’m particularly interested in your take on sort of the broader what’s happening in the policy sphere around this.
Mike Masnick (04:46):
Yeah. I mean, it’s a really tricky space, because I think that a lot of people certainly recognize the value and importance of preserving culture and understanding culture, building institutions like libraries and related organizations. And yet there’s been this kind of struggle, in large part because of the rise of AI, which Mark sort of hinted at in his opening. Until just a few years ago, most people considered the archiving of the web and of other resources as something akin to the role of the library, which makes sense. But with the rise of AI tools, there is this interesting challenge in that all of the major frontier LLM models have been trained on huge corpuses of data, and they’re always looking for more. And the question is where and how. And there are all sorts of other related discussions on whether training is fair use and things like that, which I don’t think we need to get into here, but that is the backdrop behind all of this.
(05:55):
And so companies, especially the media companies, are certainly very concerned about the way that the AI companies have gotten access to their data for training purposes, and they feel that they are uncompensated and that they need to be compensated. Some of them have been working out deals. You mentioned the New York Times and Reddit; the New York Times is suing OpenAI and has cut deals with others, and Reddit has cut deals with Google and some others. And there’s all sorts of back and forth and negotiations. And all of this debate then becomes collateral damage to that, because the fear is, and I think it’s an overblown and misguided fear, that because you have organizations like the Internet Archive building the Wayback Machine, which again, they’ve built for decades, and which I think most people recognize is just a generally useful tool for the preservation of culture, for researchers, and for the journalists at some of the news organizations who are complaining, they feel that something like the Wayback Machine offers a way to go around these negotiations and undercut them in some form or another. Which, I think, at some point in retrospect will be seen as a huge mistake by those media organizations, an overreaction to a problem that probably isn’t really a problem. But in this sort of rush to deal with this concern of, oh my gosh, the AI companies are taking over everything, they’re looking to plug any hole and block any opportunity for the AI companies to train on their material.
(07:34):
And the collateral damage of that is that some of them are now blocking the Wayback Machine. And I think there are other efforts then to see, either on the legal side, which Kendra can talk about, or just through technical measures, whether there will be ways to put up effectively toll booths on the internet: if you want to archive this, or if you want to make use of the larger corpus of data that various news organizations have put together, do you first have to pay a toll? Do the AI scraping companies have to effectively pay for the right to read that content? And that leads to a whole bunch of other downstream issues, but I’ll cut it off there and give Kendra a chance to talk as well.
Kendra Albert (08:16):
First of all, I’m really excited for this conversation. And before I get to the law, I want to obligatorily state that my experience with web archiving is a little bit unique, in the sense that I was on the founding team for Perma.cc, which is basically another web archiving service, aimed at providing sort of permanent links for citations in scholarly work, court filings, et cetera, a project that owes a lot to the Wayback Machine. And so I think this is a topic that’s near and dear to my heart, even outside of the intellectually interesting parts of the legal analysis. So I think there are sort of two sets of legal issues that you can think about when you’re thinking about web archiving. The one that everyone thinks of, and I realize probably most people do not have an instinctive legal reaction to a conversation about web archiving, is basically copyright law.
(09:03):
The question of whether you can make a copy of someone’s work and save it, even if the purpose you’re saving it for is quite different than the purpose it was originally created for. And there’s some good case law, primarily actually from the early 2000s, suggesting that making cached copies of websites, even the full website, by Google is fair use, and that use of images for search results is fair use. So basically, fair use is a limitation on copyright law that allows people to make use of copyrighted works without permission from the original owner. And I think that’s how many folks think about many of these large-scale web archiving projects: they’re under fair use. And oftentimes fair use asks questions like, “Hey, are you harming the market for the original copyrighted works? What are you really doing with this use of the copyrighted works?” The test asks questions about what you’re doing.
(09:52):
And I think in the context of web archiving, especially for sort of memory institutions, for the kinds of criticism, accountability, and journalism that we’re going to talk about probably a little later, there are some really strong fair use arguments, although, like most things in this space, it’s not like we have a Supreme Court case on point about this specifically. That’s, in some ways, the easier question. And when fair use is the easier question, you’re in a bad situation from a legal perspective. The much harder question having to do with web scraping is the process of scraping material online itself. And this falls under a separate set of legal regimes, including things like the Computer Fraud and Abuse Act. If you’re a little confused as to why the federal anti-hacking statute applies to certain kinds of web scraping: originally the theory had more to do with the fact that terms of service for websites would prohibit web scraping, and thus those terms of service could be used to argue that there was a CFAA violation.
(10:51):
Now, as we’re thinking more about the sort of technical restriction that folks are placing on accessing websites, or even things like robots.text, which I’m sure we’ll talk about, but which sort of is meant to convey signals about whether websites want themselves to be scraped, courts do have to take up the question of whether violating those signals constitutes breaking the law. And this is where it gets even more tricky because in the fairy’s context, you get to talk about things like, “Hey, it’s really good for general knowledge that folks can access archived version of websites. This isn’t harming the market.” In many conversations around web scraping, whether it’s under the CFAA or some other legal theory, you’re often much less focused on the question of why are you doing this? And this I think gets to Mike’s point about the broader environment and the sort of good guy archival institutions as being collateral damage, much of what we’re seeing in the backlash around AI training, web scraping, and in legitimate concerns about just bandwidth use.
(11:50):
I’m sure Mark can speak to this more than I can, but I think it’s really important to say part of the reason people are concerned about web scraping is just because they are paying for companies to access their website at a scale that is not feasible. So when we think about the law, it’s not an area where we have super clear answers as to the legality. And a lot of it depends on the particulars and has been defended, I think, for a long time by the fact that most folks doing web archiving have been good actors like the internet archive, like PERMA, other folks who are responsive to requests to fee list material from the web or from their online archives who are thoughtful about their engagement with people who want to have a conversation about, “Oh God, are we spamming your website? Let’s not do that.
(12:33):
” And so I think that’s meant that actually we haven’t had a ton of litigation over, okay, exactly, how is this legal under copyright? In the web scraping context, there has been much more direct litigation, including a fair amount having to do with scraping of LinkedIn, especially by commercial providers. And so when we think about the law of web scraping in particular, you’re thinking less about why are you doing it? Well, there’s some slight exceptions and more just about are you trying to get around technological barriers? Are you trespassing on a website? The kinds of questions that aren’t typically how we think about access to things on the internet.
Dave Hansen (13:12):
Thanks, Kendra. The CFAA piece of this is just bewildering to me to think that if you follow that path to its logical conclusion, we’ve criminalized being an archivist potentially online. And it’s just wild to me that that’s the world that we’re now in. So I want to talk a little bit about the motivation for blocking a bit more. I mean, we’ve gotten into AI as, I guess, ostensibly the driver here, but then not everybody, not every news organization, not every website has seen this pot of gold and said, “We must protect it at all costs.” They haven’t shut this down across the board. And so I wanted to probe that a little bit. What’s going on there and why is there significant variation, at least at this point, across policies from … I guess we can focus on news. I know there are other websites as well.
Mark Graham (14:02):
Well, I think first of all, it should be noted that very few news organizations have actually taken these kinds of measures like the New York Times. The vast majority of the news organizations in the world are very happy for their resources to be archived. Indeed, if they hadn’t been over the decades, we would not have access to them today. Examples would include Gawker Media or MTV News, nearly half a million articles, from the US, or maybe in Hong Kong, where news organizations like Apple Daily or Stand News were shut down for political reasons. And indeed, editors are in jail today. The only way one can access that material is from the Wayback Machine. In addition, we partner with Bard College and PEN America on a project, the Russian Independent Media Archive, focused on archiving Russian-language journalism in exile, and on other places in the world where journalism is at risk.
(15:02):
I also note that Andrew Deck, in his reporting, could not find any examples where any publisher found evidence that material from the Wayback Machine was in fact being exploited by AI companies. So there’s, I think, a great deal at risk here, and frankly, very little, if any, evidence whatsoever of a threat to these news organizations. And at the same time, I want to emphasize that the Internet Archive has been working collaboratively and supportively with journalists for decades; journalism often is based on references to other journalism. I was in the offices of The New York Times just a few weeks ago, and a senior researcher came up to me and said, “Oh my God, Mark, thank you so much for the Wayback Machine. We use you all the time. There is material available that we’ve used from the Wayback Machine that we can’t even find in our own archives.” I get those stories all the time.
(15:59):
And I want to emphasize that we’re not static. We don’t just do our thing and then nothing ever changes. The web is constantly changing, business environments are changing, et cetera, and we change as well. We have implemented a whole variety of mechanisms over the last few years, especially with the rise of AI company scraping, to make it such that, by and large, the Wayback Machine is optimized for use by humans. We’ve taken specific measures to reduce, if not eliminate, bulk access to materials, especially from certain news organizations: limiting functionality in the Wayback Machine’s UI, collaborating with entities like Cloudflare, putting in place rate-limiting mechanisms, and a whole variety of other measures. Some of these we’ve taken in collaboration with news organizations who have expressed specific, legitimate concerns. So the conversation is very much open. We welcome it as we look for ways to continue to provide the vital service that we provide: to archive and make available what is considered by many the first rough draft of history.
(17:11):
And by definition, that rough draft needs to be available to be examined, to be interrogated, to be reviewed, to be cited and referenced. Just one more thought there, on the citation side of it. Today, there are millions of URLs from news sites in Wikipedia articles, and a large percentage of them are only available because they’re in the Wayback Machine, because the news sites they came from just don’t exist anymore.
Mike Masnick (17:40):
The point I’ll add in terms of why our news organizations doing this is that I think it’s all part of a negotiation, right? I mean, if you look at the ones who are sort of at the forefront of trying to do this blocking, the New York Times and Gannett mainly being some of the big ones, their own business model has changed quite a lot in the last few decades and certainly in the last few years as well. And they’re very, very focused on trying to figure out how they’re going to continue to make money. And lately that has been through negotiating with large tech AI players and trying to cut deals. And the concern, which again, I would argue is misplaced, is that anything that might undercut the negotiations to make a deal and sort of prop up their business model is seen as a threat to that.
(18:33):
And so the few of them that are going around and saying that the Internet Archive is a problem or needs to be blocked from scraping their content, for the most part, they’re using that just to help them in their negotiations, out of the fear that, oh, if the AI companies have a back door into getting our content, then the negotiation with us over a deal is a different proposition. I think this is a mistake on multiple levels, but that is kind of where their thinking seems to be.
Kendra Albert (19:04):
I also think, to that point, that there can be this sort of way of thinking about … I’m reminded of the famous dril tweet: there’s no difference between good and bad things. Sort of this idea that in order to take a stance about bot access on your platform, you have to block all of them. It doesn’t matter what they’re there for, doesn’t matter whether they’re well-behaved in terms of bandwidth use or crawling. I don’t know; I haven’t had conversations with folks, and some of it may be lawyer brain. I think there is a world in which lawyers looking at their legal positions with regard to scraping that might be occurring through AI sites might say, “Well, actually, this is simpler if we don’t have to explain that we allow it for these folks, because we actually think that they’re okay or we think that the uses may be fair or whatever, but we don’t allow it for those folks.”
(19:51):
I can imagine that making an argument more complicated, even if I sort of agree with Mike that I don’t think it’s a particularly good way to do things. And I also think websites generally, whether they’re news publishers or not, should be careful about the degree to which we throw the baby out with the bathwater and say, “Well, actually some of these entities are behaving badly or doing things that we don’t like with our content, whether they’re entitled to legally or not, and therefore we’re just going to take a broad stance across the board.” The other thing I can imagine, and I don’t think this is true for the New York Times or Gannett, and Mark can correct me if I’m wrong, is that there are some circumstances where there are smaller sites that actually may not have the specific technical expertise to really understand what’s fully going on, where you have a site that just knows their bandwidth costs have gone crazy, and they’re aware that folks are scraping the web for AI training.
(20:39):
They’re not necessarily in a position to go through and actually distinguish different sort of bots or actors, different sort of folks who are accessing content as may take a uniform approach. But I mean, I think part of this has to be a conversation about, hey, independent of what you think about the training data copyright fight, which I’m not going to get any then saying that, that archival uses are really important. Right now, I think it does this via RSS feeds because it’s for headlines, but there’s a bot on BlueSky and Mastodon that looks at changes to New York Times headlines from the first post to 20 minutes later. And you can see the sort of diff between those headlines. And that provides valuable media criticism, frankly. And we’re not even talking about 20 years from now. We’re talking about it within the day of people posting to be able to see how stories have changed.
(21:28):
So I don’t think folks are doing it. I don’t think the New York Times is doing it because they don’t want people seeing how their headlines have changed or that they’re stealth correcting things in the text, although they certainly do do that. But I think that some of it comes from this sort of general framing of enclosure, as Mike was talking about, or can come from a lack of going through the details to understand the differences between different types of actors who may be using somewhat similar technologies.
Mike Masnick (21:53):
Yeah. Can I just add something to that? One of the things that I think is important, that overlays a whole bunch of this, is the general feeling of many people, somewhat reasonably, about the entire AI space right now: that there’s a large sort of backlash. There was this study recently showing that ICE has a higher approval rating than AI technology right now. There is a general sort of conceptual backlash to this technology. Some of it is based on perhaps good reasons, some of it on perhaps not good reasons; none of that matters. Culturally, there is this general backlash, and especially smaller, less sophisticated sites don’t want to go through the process of having to deal with that and the nuance. They’re just saying, “I want to opt out, if I can, of this technology that I feel is problematic and bad.” And if they don’t have a clear and easy way to do that, their reaction might be, “Well, I’m going to block any and all scraping, because I vaguely know that that is being used to allow these companies that I hate to do something with my content.” And therefore, for some of them, it is not a well-thought-out “I am taking a stand against archives.”
(23:08):
They’re not thinking that far. They’re just saying like, “AI, bad. I have no control over this situation. The only thing I can do is someone has made it easy for me to block archiving or scraping, and therefore I have to do that as a stand against this technology.”
Mark Graham (23:24):
That’s true. And at the same time, I want to note that there are many other news organizations that take the opposite approach. Indeed, specifically, for example, the Pointer Institute and with the organization behind the Investigative Reporters and Editors Conference has partnered with the internet archive on a project called Today’s News for Tomorrow. And what we are doing specifically is providing free archival services to more than 300 local newsrooms across the United States to help them archive their material. They have chosen to participate in this project because they value and appreciate the importance of the archiving. And at the same time, I note that more than 200 journalists have recently signed a letter endorsing the work of The Way Back Machine, celebrating it. In fact, Rachel Matto and others are on record supporting this and has signed the letter of support. So we are focusing here on some of the pushback from a very small number, but influential and well-known news and other sites.
(24:27):
But I want to put across the point here that, generally speaking, we are able to continue to provide the service that we have for decades with the active support of, first of all, the patrons of the Internet Archive, the folks that are curious enough to want to learn, and of journalists writ large and media platforms.
Mike Masnick (24:46):
Yeah. And I signed that letter and I completely agree with that thinking. I’m just sort of explaining some of the thinking. And I would even go a little bit further, in that beyond just the importance of archiving and being able to use these tools that journalists use for research, I do worry a little bit, even if we’re talking about the AI technologies as well, that when you have major publications like the New York Times trying to block any and every possible way in which their writing might be read by AI tools, that actually has problematic downstream consequences as well. There are more problematic publications out there, and you want the ones that have done more careful reporting represented. The New York Times sometimes does careful reporting, not always, I would say, but you want to have good reporting in these archives and in the AI tools as well, as people are using them, so that they’re not overrun by more problematic content.
Mark Graham (25:44):
It does. And if I could build on this just a bit, it sets a very bad precedent, and that then bleeds into other areas of publishing. For example, the US government, the world’s largest publisher, uses large commercial platforms for much of its publishing. The US Agency for Global Media, the folks behind Radio Free Europe, et cetera, use YouTube to publish videos, millions of videos, thousands of which have been taken down since this new Trump administration. A couple of months ago, the State Department said that they were going to remove all of their social media posts from prior to the Trump administration. And as we were racing to archive more than two million social media posts, we were watching accounts from embassies, ambassadors and others over the years literally disappear from our screens as we were trying to archive them. So I think this sets a dangerous precedent, and it is something that we should be paying attention to in all dimensions of how we are working to preserve the materials that are published, and to never trust a publisher to do the job of a library.
Dave Hansen (26:49):
As you’re talking, I’m really thinking here about some of the business model stuff that underlies so much of these concerns. And I was recalling, it was like three or four years ago, I guess, at this point working with a library that was doing a licensing deal with a rather large newspaper. And I mean, the numbers that they showed me, they’re talking about six-figure data licenses for access to the newspaper data. And we’ve had people talk about this before on here. Sarah Lambden did a talk about her book, Data Cartels, where a lot of it focuses on Read Elsevier, academic publisher, and there’s this real disconnect with how authors and contributors and journalists think of those outlets and what those outlets actually are from a business perspective. And I think the New York Times at this point is as much a data and analytics company as it is a newspaper.
(27:40):
Reed Elsevier specifically calls themselves a data analytics company, even though they are on paper an academic publisher. And I think it doesn’t really help solve the situation, but it at least explains a little bit more to me why they are making the moves that they are around restricting access to this content, if that’s the core of your business. I still don’t like it, but that explains a little bit. So I do want to talk about some other companies, outside of news, I guess, is where I’d like to go. So Reddit has been pretty public about blocking access. They have a lawsuit right now against Anthropic. That’s been a kind of interesting one to watch. And there are lots of other commercial platforms, social media platforms, for instance, that are restricting access for web scraping and preservation. So what’s going on there? Kendra, maybe we can start with you, to talk a little bit about what’s happening in litigation with some of these other platforms.
Kendra Albert (28:37):
And I think Mark’s point and your point about these sort of platforms is I think it’s really valuable in some ways to think about the actual rights to the content or your sort of legal right to use the content as an actually functionally totally separate question of how scraping it. And I think specifically with Reddit, Reddit doesn’t have the right to sue someone for copyright infringement for copying Reddit posts. I haven’t read the terms of service recently, but I’m pretty sure you’re not allowing Reddit to sue on your behalf for copyright infringement. But oftentimes the way this litigation is framed is around access to the platform, circumventing technological measures. So that’s the anti-circumvention part of the copyright statute, section 1201, or through things like there’s this fantastic, I’m not sure how I mean that word, but I’m not entirely positive tort that used to be because you touched someone’s car without permission called trespass to channels, which also has to do with what has been historically used in some context for web scraping, although usually you need to show that there’s some form of harm to the sort of infrastructure in order to bring it.
(29:41):
We talked about the CFAA; there’s trade secret; there’s all kinds of other legal claims. So in some ways, when you’re thinking about how some of these platforms are choosing to back up their business model goals: Reddit has done licensing deals with AI companies, I forget which ones off the top of my head, but there is a very real conversation of, “Hey, why should we pay you for this data if we could scrape it for much less money?” Now, of course, the version that you get from Reddit, if you pay them for it, is probably going to have other advantages in terms of the metadata, the infrastructure, being able to ask Reddit questions about how the data works, all that kind of stuff. But when we’re thinking about the legal reality behind these decisions, I think part of it has to do with the business model.
(30:29):
And part of it has to do with, I think, the degree to which some of these platforms may be genuinely responding to their own users being upset. And LinkedIn’s scraping litigation from before the current generative AI days is actually a really good example of this, because LinkedIn brought a lot of scraping litigation against primarily business competitors that were using LinkedIn data to run a recruiting tool or do other things that one might want to do with professional information. And to some extent, that was protective of their business model: these were effectively their competitors, or LinkedIn would roll out a product that was competing with whatever that company was doing. But also, legitimately, sometimes folks had real privacy concerns about the fact that, “Hey, I shared this data on LinkedIn. I didn’t assume that it was going to go everywhere. Now it’s gone everywhere.” I think that is different than the web archiving context.
(31:22):
And I’m not saying, “Oh, this is the same thing.” But why I bring it up is to say that you have this circumstance under which there’s a variety of different incentives for limiting access to data, and it’s impossible to disentangle them. It’s impossible to say, “Oh, this is only because of business models,” or, “Oh, this is only because people have privacy or usage concerns when their data goes outside where it was supposed to be.” And oftentimes tech companies take both positions. LinkedIn has long said that the primary reason for a lot of their anti-scraping tooling is to protect users’ privacy. Now, I think that’s a hard position to defend given the business model stuff, that that’s the only reason, but I don’t think it’s not part of it. So when we think about the moves by companies like Reddit to restrict all kinds of access, including the Internet Archive and the Wayback Machine, you can’t just pin it to one thing.
(32:16):
And it’s not always based on one specific legal theory because oftentimes they’re trying a bunch of different stuff simultaneously, of which copyright might be one of the tools, but often is actually not the most useful if you’re talking about really significant amounts of web scraping. I hope that sort of answered your question, Dave.
Mike Masnick (32:34):
The one thing I was going to add in the Reddit context is that it is an example of where this can lead in terms of starting to test out questionable or extreme legal theories. So one of the cases that Reddit has is against this company called SERP API, or SerpApi, I don’t know how they pronounce their name. And you can argue that this is perhaps not a good company, but basically what they do is scrape Google results and create an API so that you can programmatically make use of Google results. Google is also suing them, but that’s a separate case. But you have Reddit suing this company over copyrights that Reddit doesn’t own; as Kendra noted, it’s the users who hold any copyright interest in most cases, if there’s any at all. And they’re suing this company for scraping Google’s results, which, again, is not Reddit, and claiming that it’s a DMCA Section 1201 anti-circumvention violation over a technological protection measure that Reddit itself hasn’t even set up.
(33:35):
The only thing that they’ve done is cut a $40 million deal with Google. And so you get these stacking legal theories and questionable claims. You can see why: Reddit is upset that perhaps AI companies are routing around doing a deal with Reddit or with Google, because they can use a company like SerpApi to get Google results that include Reddit content, since Google has a deal with Reddit. But it leads to really questionable places in terms of other types of scraping, or other uses that are important and useful culturally. Because everybody is trying to figure out how do we do these things and how do we cut these deals, you see these somewhat stretched legal definitions, I think, or attempts at questionable cases.
Kendra Albert (34:22):
And can I just say one more thing about Mike’s point real fast? I think that that’s entirely true. And the other thing to point out is, as much as I like to distinguish between good things and bad things (I’ll go on the record as being in favor of that), when we’re talking about making case law, oftentimes the decisions judges make don’t say, “Okay, well, I don’t like this company because I think their business model’s bad, and so I’m going to find that they violated the CFAA because of that, but for the good guys, it’s not a CFAA violation.” That’s usually not how that part of the law works. We actually get to do that way more in fair use. In my current job, we often work with researchers who scrape internet platforms to look for things like bias and discrimination, to understand how platforms work, that kind of thing.
(35:07):
And those folks are subject to all the same bodies of law that get made when, well, Reddit is pissed off that you can get Reddit results from Google via this company, or Reddit feels like they’re channeling their users’ outrage that users’ data is being used for purposes they didn’t intend. So I think it is really important to note that archiving, research, all of these kinds of uses often require exactly the same tools. The Wayback Machine uses bots to view webpages and archive them (Mark, I’m wildly dumbing down the complexity of what you do), and researchers are using the same tools to scrape data and to understand how tech works. So I think it’s not actually easy to just say, “Okay, great, this technology, this way of doing it, is good or bad, and we should just make a rule generally.”
Mark Graham (36:00):
I have to explain a little bit, too, about what’s at risk here beyond just news. The Internet Archive archives more than a billion URLs a day. And one of the signals that we follow is links added to Wikipedia articles, for example, all of them. As a result, we have been able to identify and fix (that is, edit and replace) otherwise broken URLs that would return a 404, substituting archives of those references that human beings had added to Wikipedia articles over the years. More than 30 million links have been fixed in this way. Pew Research, for example, found that for a collection of URLs they looked at that were 10 years old, 38% of them were no longer available on the live web. So what does it mean if we can’t have access to this material anymore? A variety of things. Hundreds of times a year, the Wayback Machine team produces an affidavit to attest to the veracity of our web archives for use by lawyers in courts.
(37:03):
And often these are cases of product liability, maybe a misrepresentation by a company, et cetera. This material is often the critical evidence used to determine the outcome of the case. So there are any number of applications of web archives beyond just news that are vital to our society: to hold those in power accountable, and to help those curious enough to learn to inform themselves.
Mike Masnick (37:30):
I do think it is important to just remember the concept of the open web itself, and how we got here in the first place. It can be very easy to lose sight of that; I mean, I even got bogged down immediately on the AI aspect of all this, but the open web has been around for more than three decades at this point. And I think many of us are here because we believe in the promise of the open web and what it enabled in terms of community and culture and sharing of information and meeting people and everything. So much of what we rely on today was built on this open web. And the concept of the open web is this idea that it’s not controlled by any one entity, and it is not locked down and limited, but that we can build on it and do more with it and we can share with each other and build culture.
(38:22):
Culture is about multiple people understanding the same concepts. And that is built very much on the open web these days. So much of where this unfortunately could lead is a locking down of the open web, just because of concerns about how it might be used in one particular way. And since I know we’re getting to the Q&A part, I felt like we should emphasize that aspect of why we’re all here.
Chris Freeland (38:52):
Thank you, Mike, for acknowledging that. I’d say long live the open web; I 100% agree with everything you said. The open web is an important part of our culture, and I hope that it remains that way. And Mark, I think it may be helpful if you can explain: how does the Wayback Machine make data available in bulk, and what kinds of protections are in place to prevent some of the abuses that have been mentioned here?
Mark Graham (39:17):
Sure. Generally speaking, we don’t make material available in bulk. The underlying files behind the Wayback Machine are generally not publicly accessible. We do provide the ability to play back, to replay, individual web pages through what I refer to as the thin straw of the Wayback Machine. For those of you who have used the service, you understand what I mean; it’s pretty slow. There are certain features where one can list large numbers of URLs for a given site, for example. At the request of some publishers, including The New York Times, we’ve disabled that capability for those particular sites. We do some archiving of material that is generally considered to be publicly available, in particular material from governments. We participate with many others, including Kendra with Perma.cc at Harvard, on doing a deep dive on material from the US government. And we do package that material up, and we do make bulk access to that particular collection of web archives available to researchers and others.
(40:23):
And also, as I noted, that’s on the playback side, how we serve material out to the world. On the archiving side, there are a variety of mechanisms that we put in place to do limiting, to detect and deter access to the service that is not human originated.
Chris Freeland (40:41):
Very helpful. Thank you. A question for everyone. If the Wayback Machine and other archival institutions get blocked, people are probably still going to do some archiving, but they’re going to do so in maybe less legitimate ways, with screenshots and other things. So I’d be interested in your thoughts on this issue of maybe the non-legitimate archives, or preservation by organizations that are outside the traditional library sphere. What does that mean for the historical record?
Kendra Albert (41:07):
I’m going to just leave “non-legitimate archives” over there. Well, I think there are a couple of things to think about. One is, yes, certainly screenshots are not as good as a more interactive page capture, but ultimately having something is better than having nothing at all. One area I work on a lot is video game preservation, where we encounter a lot of somewhat similar challenges in terms of technological complexity, challenges with permissions from rights holders, that kind of thing. And one thing I think about a lot there is that when you make it really hard for institutions to legitimately preserve things, for institutions that are big and public and very clear about what they do and how they do it, you do in some ways cede ground to smaller institutions that may have different practices.
(41:55):
And some of those institutions are often really good at what they do, and they’re just quiet about it. And that’s great. And some of those institutions... I think we maybe all followed the whole kerfuffle about archive.is, which is a tool people use for archiving webpages, often getting around paywalls, that was allegedly running a fake CAPTCHA that was DDoSing a critic of the site. I think that’s a really good example of one of the potential downsides of some of the more aggressive attempts to limit automated access, because folks were not going to that site because they necessarily preferred it. They were going to that site because they could view content there that they weren’t able to view elsewhere, or they could access an archived page that they couldn’t access elsewhere. And so I think there is a real risk in a lot of these spaces of making it very hard for institutions that want to do the right thing to effectively preserve or save works.
(42:50):
And then it’s sort of causing challenges for both the historical record and for who’s left.
Mike Masnick (42:57):
Yeah. I mean, I think there are good actors in this space. And obviously the Wayback Machine and the Internet Archive are very clear examples of good actors. If you continue to make life difficult for them, it is only going to push people to those who maybe are less good actors, and there are other kinds of collateral damage that come along with that.
Chris Freeland (43:18):
Leaving the non-legitimate archives on the floor, a related question: should preservation institutions be treated differently from AI companies in law or policy? And are there proactive policies that libraries need in order to continue doing this work in the digital age?
Kendra Albert (43:37):
I mean, in some ways they already are. Section 107 (the reason I kept saying that you actually get to talk about what people do with the content), which is fair use within the US, does actually care about what you’re doing with the content. Section 108 of the Copyright Act is specific to libraries and certain kinds of archival and preservation institutions, and allows them to do things that other institutions can’t do. So it’s not a question of “should we treat them differently?” We already do. It then becomes a question of, “Hey, should we treat them differently anywhere else?” is maybe the question I’m asking. And I think it’s really hard in the existing scraping law context to see how that would quite work. Although we did see some of that in a case called Sandvig v. DOJ, where some researchers sued the DOJ over the Computer Fraud and Abuse Act’s criminal components making it harder to do First Amendment-protected kinds of research.
(44:24):
So I think there are some inklings of that, and it would be fantastic to see more engagement with this question of what are the actual uses we think are good and important, and how do we promote those, versus, okay, just get rid of the whole thing.
Mark Graham (44:38):
Yeah. I’ll add, first of all, that I’m not a lawyer, but I do recognize the existing copyright and fair use allowances that substantiate and support the work of the Wayback Machine. At the same time, there was the Vanderbilt clause, added to carve out specific, explicit protections in the area of television news archiving. I should note that the Internet Archive has a very robust television news archiving program as well. But I want to flip it around a little bit and say that news is a very special category of online material. It plays a vital role in our democracy. Indeed, it’s been referred to as the fourth estate, and various measures of privilege are given to news and news organizations. And I might suggest that with those privileges and rights come certain responsibilities related to access and availability. We’re living in a world that’s awash with mis- and disinformation. The Internet Archive recently co-published a paper suggesting that up to a third of new websites and webpages appearing on the public web today are at least partially AI generated.
(45:50):
And so this is a time of rapid change. If we’re paywalling quality journalism and making it generally unavailable to people unless they have a subscription, which only a teeny, teeny percentage of the population does, then we’re going to end up with a world where, more and more, the truth, the quality journalism, is paywalled and therefore generally inaccessible to people, but the lies will proliferate, and they will become, as they are in many cases, the dominant presence in the conversation. When I was growing up, I had a library, a physical library, and through it I had access to The New York Times and other magazines that I was able to read. If that library hadn’t had access to that material, I simply wouldn’t have had access.
Chris Freeland (46:35):
Hat tip to Nathan J. Robinson and Current Affairs: “The truth is paywalled, but the lies are free.” I want to close with our final question for each of the panelists. What can anyone who’s listening here today do to help change this trajectory?
Mike Masnick (46:49):
I mean, speak about it, talk about it. Obviously, use the tools well and intelligently, and explain to others how you’re using these tools and why they matter. Certainly when it comes to things like potential policy or legislation, be aware of what’s happening and be willing to speak out, to make sure that there is nothing that will get in the way of important cultural institutions like the Internet Archive. But really, just be a part of the conversation. I think a lot of people don’t understand where this is leading and the impact on organizations like the Internet Archive and tools like the Wayback Machine. So making sure that more people are aware is, I think, the most important thing you can do at an individual level. At the institutional level, if you work for a news organization that is blocking access to the Internet Archive, maybe try to convince people that that is a bad idea and will have downstream cultural impacts that are not good for society; but that depends more on where people are situated.
Kendra Albert (47:54):
Mike stole one of the things I was going to say, which is that, for folks who have institutional affiliations, make sure that, A, you can still access the Internet Archive, and that the Internet Archive is still able to access pages from your institution. And then, if it’s not, make the case internally that, hey, this is why it’s important for my work, for the things that I do, for the things that I care about. That’s going to be much more powerful coming from folks who are internal to an institution than from those of us who are out here saying, “Doom is coming, archiving is stopping.” So to the extent that folks have an institutional role where they can bring attention to these issues, I think that’s really valuable.
Chris Freeland (48:33):
Mark, how about you?
Mark Graham (48:35):
Just a few things. First of all, use our service. We’re a public library, and we love it when people are able to benefit from the resources available from our library, and when they give us feedback about how we can do a better job of providing those services. Subscribe to our newsletters; follow us on the socials. If you’re a journalist or know a journalist, I’d recommend that you check out the Fight for the Future letter that Chris shared here. And if you’re in the Bay Area, come visit us. We host more than a hundred events a year at our facility in San Francisco, and every Friday, except for I think Thanksgiving and Christmas, at one o’clock we host a tour, so you can get an in-depth and personal look at what we do and how we do it.
Chris Freeland (49:23):
Thank you for that, Mark. Thank you to Mark and to Mike and to Kendra for such a fascinating conversation today and to Dave Hansen and Authors Alliance as always for facilitating and co-hosting this session. Thanks everyone. Have a great day. Thanks for joining us on this journey into the Future of Knowledge. Be sure to follow the show. New episodes drop every other Wednesday with bold ideas, fresh insights, and the voices shaping tomorrow.