Prevent patents by allowing crawlers

DennisPeterson · November 17, 2018, 2:41pm

One way to help protect against software patents is to make sure posts are stored in Internet Archive, thus giving a reliable publish date. I just attempted to do that with a post, and it didn’t work because they won’t save anything that has a robots.txt prohibiting web crawlers.

Would it be possible to modify robots.txt?

dlubarov · November 17, 2018, 8:40pm

The robots.txt looks fine to me; IA’s crawler should be able to discover and archive any topic pages it likes. It looks like IA’s crawler just hasn’t decided to archive very many topic pages, for whatever reason, but there are some. Here’s an example.

If someone representing the website could email [email protected], maybe they could adjust some configuration to make their crawler more likely to archive all the topics here.

Edit: I tried requesting that IA archive a topic page through their web UI, and IA did archive it (link), but the server didn’t give it the actual content of the topic; instead it returned “Oops! That page doesn’t exist or is private.” Might be a bug in Discourse? Or it could be some intentional bot blocking code within Discourse, possibly with a rate limit that IA’s crawler sometimes exceeds.
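For anyone who wants to check a robots.txt against a particular crawler themselves, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `forum.example` URLs, and the `ia_archiver` user-agent string are all assumptions for illustration, not this forum's actual file or IA's confirmed current crawler name. It also only covers robots.txt rules; it would not detect server-side bot blocking or rate limiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
# This is NOT the real robots.txt of this forum.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /users/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ia_archiver" is assumed here as the crawler's user-agent token.
topic_allowed = can_crawl(
    SAMPLE_ROBOTS_TXT, "ia_archiver", "https://forum.example/t/some-topic/123"
)
admin_allowed = can_crawl(
    SAMPLE_ROBOTS_TXT, "ia_archiver", "https://forum.example/admin/"
)

print("topic page allowed:", topic_allowed)
print("admin page allowed:", admin_allowed)
```

To test a live site, the same parser can load the real file with `set_url(...)` followed by `read()` instead of `parse(...)`.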

DennisPeterson · November 17, 2018, 10:05pm

Interesting. On one request I got a message about robots.txt but on several other attempts I got the same message you did.

DZack · December 20, 2018, 5:47pm

I can think of another place to store posts for future “proof of publish date”

(or hashes of posts, anyway)

DZack · December 21, 2018, 8:01pm

…but actually tho, if we could just get the posts in a standard plaintext format, say once a week, then hashing them, storing the hash on Eth, and hosting the content (on IPFS, or even just a few redundant copies somewhere) could be a neat project, and a nice illustration of an easy use case.
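A rough sketch of the hashing step, assuming each post is available as an author, a timestamp string, and a plaintext body. The function names `post_digest` and `batch_digest`, the JSON canonical form, and the choice of SHA-256 are illustrative assumptions. Publishing the resulting digest on-chain and hosting the content are left out.

```python
import hashlib
import json


def post_digest(author: str, timestamp: str, body: str) -> str:
    """SHA-256 hex digest of one post in a fixed canonical form."""
    # Canonical JSON (sorted keys, no extra whitespace) so the same post
    # always serializes to the same bytes.
    canonical = json.dumps(
        {"author": author, "timestamp": timestamp, "body": body.strip()},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def batch_digest(post_digests: list[str]) -> str:
    """Single digest covering a batch of posts, e.g. one week's worth."""
    # Sort first so the result does not depend on the export order.
    joined = "\n".join(sorted(post_digests))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical sample posts, loosely based on this thread.
weekly_posts = [
    ("DennisPeterson", "2018-11-17T14:41", "Would it be possible to modify robots.txt?"),
    ("dlubarov", "2018-11-17T20:40", "The robots.txt looks fine to me."),
]
digests = [post_digest(*post) for post in weekly_posts]
weekly_hash = batch_digest(digests)
print(weekly_hash)
```

Only `weekly_hash` would need to be stored on-chain. Anyone holding the archived posts can recompute it and compare, which shows the content existed no later than the time the hash was recorded.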

virgil · December 21, 2018, 10:06pm

I will ask my colleagues at archive.org to look at this.