To avoid misunderstandings, here's my stance on copyright issues in training and using LLMs.
The basic premise: If you use copyright-protected data for your business, that use must not conflict with the copyright holder's own business.
Examples:
- A copyright holder describes in their book how to work with a specific technology based on their decades of experience in the field. The LLM must not be trained on this book and then provide this information to LLM users, bypassing the original. This is acceptable only when a licensing agreement is signed with the copyright owner and the creator is fairly rewarded on a per-use basis, similar to selling a copy of the book.
- A copyright holder is an expert sharing their expertise online without charging for it, but the visitors of their website could then contact the expert for a paid service. The LLM must reference the expert and their service every time content based on their article is served to an LLM user.
The fact that LLM technology cannot do that by design isn't an excuse to let it violate the basic premise. The technology must be improved to be fair to creators' interests and only then commercialized.
What I consider fair use of copyrighted content: any other use in which LLM users benefit from the LLM's capabilities without copyright holders losing money, directly or indirectly.
For example:
- The copyrighted content is used to give the LLM conversational abilities that are then used for solving tasks the copyrighted content cannot be used for. For example, machine translation, paraphrasing/summarization of the original content provided by the user, searching for errors in a document, transformation from one format into another, or answering questions about information in the public domain, like Wikipedia or books in the public domain.
- The copyrighted content is used to give the LLM abilities for downstream fine-tuning to solve business problems the documents in the training data are not concerned with. For example, fine-tuning an LLM to serve as a classifier based on business-specific taxonomy, or to output information in a business-specific JSON schema to be consumed by a machine. Or, fine-tuning it on a proprietary legal corpus to be used as a legal assistant when legal-domain books and other copyright-protected legal documents have been excluded from the pretraining dataset.
The list of fair use cases can be expanded further, as long as it doesn't contradict the basic premise stated at the beginning.
In our latest Staff Stream, Peter and Andrew from the Creator Analytics team will be talking about new capabilities to help optimize in-experience economy and also doing a live Q&A. Register now and tune in July 24th at 10:00AM PDT. #RobloxDev https://t.co/R3Kldpl7Rm
When we started working on creator analytics at Roblox, it was in rough shape.
In less than a year, we delivered:
- Similar experience benchmarks
- Audience demographics
- Real-time metrics
- Error reports
- Avatar analytics
- Sales & revenue for in-game products
- Acquisition sources
- Insights
Thanks to the Roblox community for giving us feedback along the way. We are always listening and couldn't have done this without you.
#RobloxDev