• mindbleach@sh.itjust.works
    12 hours ago

    And perhaps most recently, when a person who publishes their work under a free license discovers that work has been used by tech mega-giants to train extractive, exploitative large language models? Wait, no, not like that.

    … permissively-licensed media is about the only thing that’s not extractive or exploitative to train on. It’s the least questionable source of data. Its authors should be annoyed there’s training on anything else.

    Models could use the public domain if we fucking had one.

    [Creative Commons says] AI training should be considered non-infringing by default from a copyright perspective.

    Pretty much, yeah. It’s transformative. You can’t take one word from every book ever written and say publishers are owed ten zillion dollars. Similarly, you can’t distill the whole internet into a gigabyte of numbers and call that mere piracy. If the model contains more than a snippet of any particular work, you’ve built it wrong. Copying even one article is more infringement than training on a whole archive.

    Irresponsible AI companies are already imposing huge loads on Wikimedia infrastructure

    Wait, how? Aren’t there torrents of the whole corpus? Do these idiots re-download everything, every time?