I have software that parses and renders LaTeX formulas, and I want to test its accuracy on a large dataset of equations extracted from arXiv papers. arXiv offers an API to download papers in bulk, so it is technically feasible to extract the LaTeX code of all equations from all arXiv papers (or at least a very large portion of them).
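To make the extraction step concrete, here is a minimal sketch of how display equations might be pulled out of a paper's LaTeX source once it has been downloaded. It is only an illustration under stated assumptions: the function name `extract_equations`, the list `MATH_ENVS`, and the choice of which environments count as "equations" are all hypothetical, and a regex-based approach like this will miss custom macros, inline `$...$` math, and other edge cases.

```python
import re

# Environments treated as "equations" in this sketch (an assumption, not a standard list).
MATH_ENVS = ("equation", "align", "gather", "multline", "eqnarray")

_PATTERN = re.compile(
    # \begin{env} ... \end{env}, including starred variants such as equation*
    r"\\begin\{(?P<env>(?:" + "|".join(MATH_ENVS) + r")\*?)\}(?P<body>.*?)\\end\{(?P=env)\}"
    # \[ ... \], skipping the "\\[4pt]" line-break form
    r"|(?<!\\)\\\[(?P<bracket>.*?)\\\]"
    # $$ ... $$
    r"|\$\$(?P<dollars>.*?)\$\$",
    re.DOTALL,
)


def extract_equations(source: str) -> list[str]:
    """Return the bodies of display equations found in a LaTeX source string."""
    # Drop LaTeX comments: an unescaped % up to the end of the line.
    source = re.sub(r"(?<!\\)%.*", "", source)
    equations = []
    for match in _PATTERN.finditer(source):
        body = match.group("body") or match.group("bracket") or match.group("dollars") or ""
        body = body.strip()
        if body:
            equations.append(body)
    return equations


if __name__ == "__main__":
    sample = r"""
    Some text.
    \begin{equation}
      E = mc^2
    \end{equation}
    More text \[ a^2 + b^2 = c^2 \] and $$ x $$.
    % \[ commented out \]
    """
    for eq in extract_equations(sample):
        print(eq)
```

Each extracted string would then be one entry in the dataset; keeping the arXiv identifier of the source paper next to each entry would be one way to support attribution later.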
Is it legal to make publicly available a dataset of LaTeX equations generated in this way, as part of the test suite of my MIT-licensed software? Although it would be very inconvenient, I could include attributions to the many authors in some form. In case the jurisdiction matters, I'm based in France, but the papers of course come from all over the globe.
What I (think I) learned from browsing the Internet and Legal SE:
- arXiv holds a distribution license, not the copyright.
- Equations are not copyrightable, but their specific presentation may be. That would probably include the LaTeX code for the equation.
- There is a notion of fair use, although I'm not sure whether it applies here.
Some references:
- Technical Standards and Papers - Can the equations be copyrighted: https://law.stackexchange.com/a/62393/68725
- When can you use images from arxiv papers for commercial purposes?: https://law.stackexchange.com/a/51905/68725