Hi everyone, I’m seeking advice and opinions. I’m building a web-based RSS reader/search engine/discovery tool. Like any RSS reader, my app fetches content from feeds and displays it to subscribers. Often, blog authors include only a short summary in the RSS feed, and the user has to visit the blog’s website to read the full content.
My app also attempts to scrape the full webpage of each blog post for search indexing (respecting <code>robots.txt</code>, of course). It also saves the HTML content for archiving, like the Internet Archive does (if the author disallows the <code>ia_archiver</code> user agent, I honour that too and don’t archive).
So, since the app might already store the full content, my dilemma is whether it’s OK (ethical) to show the full article in my reader. This view is never public, so only registered users who subscribe to the blog can see it. Still, it feels wrong, because it’s not even like a browser’s “reader mode”: the user does not visit the original page at all.
Not ok because:
- Authors who only include a short summary in the RSS do so precisely because they want readers to visit their website.
- Visiting the original blog is a much more personal experience than reading all blogs from the same UI of the reader app; bloggers craft their digital gardens for visitors!
- Some blogs include styles, math, scripts, etc., which aren’t rendered correctly elsewhere after scraping.
Ok because:
- It’s a nicer UX for the reader?
Curious what others think.
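For anyone curious, the two crawler checks described above (<code>robots.txt</code> for indexing, the <code>ia_archiver</code> rule for archiving) can be sketched roughly like this. It is only an illustration using Python's standard `urllib.robotparser`; the `ExampleReaderBot` user agent, the `crawl_policy` helper, and the sample rules are made up for the example and are not the app's actual code.

```python
from urllib.robotparser import RobotFileParser

READER_AGENT = "ExampleReaderBot"  # hypothetical user agent name for the reader's crawler
ARCHIVE_AGENT = "ia_archiver"      # user agent that authors disallow to opt out of archiving


def crawl_policy(robots_lines, url):
    """Return whether a URL may be indexed and/or archived under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse already-fetched robots.txt content
    return {
        "index": parser.can_fetch(READER_AGENT, url),
        "archive": parser.can_fetch(ARCHIVE_AGENT, url),
    }


# Hypothetical robots.txt: archiving is disallowed, /private/ is off limits to everyone.
sample = """
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

policy = crawl_policy(sample, "https://blog.example/posts/hello")
print(policy)
```

With these sample rules the post may be indexed but not archived. Note that this only covers what the crawler is allowed to fetch and store; it says nothing about whether showing the stored full text to subscribers is OK, which is the actual question.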
The original content creator relies on advertising, click-throughs and maybe merchandise sales that you would be denying them by scraping their content. This is the entire argument against Google doing what they’ve been doing for the past decade. The value of Google, and by extension your RSS reader, is generated by other people’s content. It has little inherent value of its own: without content, it is useless. Drain the income of content creators for long enough and you no longer have content creators, so now you need something else to generate content. Enter generative AI.
And thus was the internet of 2024 forged, through stolen content and seeing no value in the creations of people, only desiring more content at any cost, as long as that cost to the platform is zero.