Prevent patents by allowing crawlers


#1

One way to help protect against software patents is to make sure posts are stored in the Internet Archive, thus giving them a reliable publish date. I just attempted to do that with a post, and it didn’t work because the Internet Archive won’t save anything whose robots.txt prohibits web crawlers.

Would it be possible to modify robots.txt?
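
If robots.txt really were the blocker, a minimal sketch of an exception might look like this. `archive.org_bot` and `ia_archiver` are user-agent tokens the Internet Archive has used; which token its crawler currently sends, and the disallowed path, are assumptions here:

```
# Hypothetical robots.txt sketch: allow Internet Archive crawlers
# while keeping a general disallow for everyone else.
# An empty Disallow value means "allow everything" for that agent.
User-agent: archive.org_bot
Disallow:

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /some-private-path/
```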


#2

The robots.txt looks fine to me; IA’s crawler should be able to discover and archive any topic pages it likes. It looks like IA’s crawler just hasn’t decided to archive very many topic pages, for whatever reason, but there are some. Here’s an example.
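
If anyone wants to verify that locally, here’s a small Python sketch using the standard library’s robots.txt parser. The hostname and topic URL are placeholders, and `archive.org_bot` / `ia_archiver` are assumed to be the relevant IA user-agent tokens:

```python
# Check whether a site's robots.txt permits given crawler user agents.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://forum.example.com/robots.txt")  # placeholder host
rp.read()

# "*" covers the generic fallback rules that apply to unnamed crawlers.
for agent in ("archive.org_bot", "ia_archiver", "*"):
    ok = rp.can_fetch(agent, "https://forum.example.com/t/example-topic/123")
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```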

If someone representing the website could email info@archive.org, maybe they could adjust some configuration to make their crawler more likely to archive all the topics here.

Edit: I tried requesting that IA archive a topic page through their web UI, and IA did archive it (link), but the server didn’t give it the actual content of the topic; instead it returned “Oops! That page doesn’t exist or is private.” Might be a bug in Discourse? Or it could be some intentional bot-blocking code within Discourse, possibly with a rate limit that IA’s crawler sometimes exceeds.
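
One way to probe that hypothesis is to fetch the same topic with a browser-like user agent and with a crawler user agent and compare what comes back. A sketch, with a placeholder URL and `archive.org_bot` assumed as IA’s token:

```python
# Compare what the server returns for two different User-Agent headers.
# If Discourse's crawler detection or rate limiting is involved, the
# bodies (or status codes) may differ between the two requests.
import urllib.error
import urllib.request

URL = "https://forum.example.com/t/example-topic/123"  # placeholder

for ua in ("Mozilla/5.0 (X11; Linux x86_64)",  # browser-like
           "archive.org_bot"):                 # assumed IA crawler token
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # True means we got the "Oops!" error page instead of the topic.
            print(ua, resp.status, "Oops!" in body)
    except urllib.error.HTTPError as e:
        # A rate limit would typically surface as HTTP 429 here.
        print(ua, e.code)
```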


#3

Interesting. On one request I got a message about robots.txt, but on several other attempts I got the same message you did.