I've created a mypy plugin that flags "impure" user-defined functions, meaning those with side effects. I used an LLM to help write it, particularly the AST visitor part. However, I'm concerned that copyrighted code may have been part of the datasets the LLM was trained on. I initially released the project under the MIT license, but I'm open to other options, including public domain, since it's mainly an experiment. I've also included a disclaimer about the use of LLMs in the project. Here are my main questions:
- Should I avoid using LLMs entirely, or should I inquire about their data sources? I'm located in Europe, specifically Spain.
- Are there any policies on PyPI regarding LLM-generated code?
- Could these factors hinder the adoption of my library?
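
For readers unfamiliar with the approach, here is a minimal sketch of what an AST-visitor purity check can look like. It is illustrative only and is not the plugin's actual code: it uses the standard-library `ast` module rather than mypy's plugin API, and `IMPURE_CALLS`, `ImpurityVisitor`, and `find_impure_functions` are hypothetical names chosen for the example.

```python
import ast

# Assumption: a tiny, hand-picked set of built-ins treated as side-effecting.
# A real checker would need a much richer model of impurity.
IMPURE_CALLS = {"print", "open", "input"}


class ImpurityVisitor(ast.NodeVisitor):
    """Collects reasons why each function definition might be impure."""

    def __init__(self):
        self.impure = {}   # function name -> list of reasons
        self._stack = []   # names of the functions currently being visited

    def visit_FunctionDef(self, node):
        self._stack.append(node.name)
        self.impure.setdefault(node.name, [])
        self.generic_visit(node)
        self._stack.pop()

    # Treat async functions the same way as regular ones.
    visit_AsyncFunctionDef = visit_FunctionDef

    def _flag(self, reason):
        # Only record findings that occur inside a function body.
        if self._stack:
            self.impure[self._stack[-1]].append(reason)

    def visit_Global(self, node):
        self._flag("global " + ", ".join(node.names))

    def visit_Nonlocal(self, node):
        self._flag("nonlocal " + ", ".join(node.names))

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id in IMPURE_CALLS:
            self._flag(f"call to {node.func.id}()")
        self.generic_visit(node)


def find_impure_functions(source):
    """Return {function name: reasons} for functions flagged as impure."""
    visitor = ImpurityVisitor()
    visitor.visit(ast.parse(source))
    return {name: reasons for name, reasons in visitor.impure.items() if reasons}
```

Calling `find_impure_functions` on a module's source returns only the functions with at least one flagged construct. The check is purely syntactic, so it will miss side effects hidden behind method calls or imported helpers.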
2 Answers
Go with the MIT license and just move forward with your project. It’s a great way to encourage others to use your work without much hassle, and it won’t complicate things for you.
It sounds like you're really overthinking this! Just remember to include an attribution notice for any code you took from places like Stack Overflow. You might also want to note that certain symbols are common across many projects. That's standard practice.

But remember, the difference with Stack Overflow is that everything posted there comes under a known license (CC BY-SA). With LLM code generation, you have no idea where the code came from. If the output happens to reproduce GPL-licensed code, you might inadvertently bind your users to that license, which could be problematic.