The GitHub Copilot Lawsuit Threatens Open Source and Human Progress

Background

Matthew Butterick, "writer, designer, pro­gram­mer, and law­yer" has teamed up with the class action plaintiff's law firm Joseph Saveri Law Firm to sue GitHub and Microsoft over GitHub Copilot.

Over at githubcopilotinvestigation.com, Butterick describes his reasoning. He says:

When I first wrote about Copilot, I said “I’m not worried about its effects on open source.” In the short term, I’m still not worried.

That's good to know. But he then goes on to wax nostalgic about his specific experience in open source and how GitHub Copilot is something new and different.

But as I reflected on my own journey through open source—nearly 25 years—I realized that I was missing the bigger picture. After all, open source isn’t a fixed group of people. It’s an ever-growing, ever-changing collective intelligence, continually being renewed by fresh minds.... Amidst this grand alchemy, Copilot interlopes.

It reads like someone threatened by innovative technologies, gatekeeping because this isn't how he did open source over the last 25 years. Butterick throws in some vague, hand-wavy anti-Microsoft rhetoric, also outdated and trite (but sure to appeal to the FSF crowd), for good measure:

We needn’t delve into Microsoft’s very checkered history with open source...

Almost none of the people mentioned in the article he linked to, which is about Microsoft in the late 1990s and early 2000s, are still at Microsoft. Since that time, Microsoft has made a profound pivot toward open source software. The CEO, Satya Nadella, has not been implicated in any of the anti-FOSS activities at Microsoft during that era, has embraced, even promoted, the pivot toward open source, and, for what it's worth, came to Microsoft from Sun Microsystems.

Most of the product managers, engineers, and decision-makers at Microsoft these days were barely out of high school in the early 2000s. In tech, it's ancient history.

Butterick even knocks Bill Gates for his open letter to computer hobbyists, which contained the then-radical idea that developers should be able to define the terms on which their software is distributed. That idea is the fundamental basis of modern free and open source software licensing. Ironically, if there were no rules governing software code, he would have no basis for his lawsuit.

Sadly, the Copilot case also seems prepared to bring software patent claims into open source, something free and open source software advocates largely solved with the GPLv3 and Red Hat's Open Invention Network efforts.

I thought we were all against software patents, particularly in open source. I guess not.

Why F/OSS Advocates Should Support GitHub Copilot

The legal case for GitHub Copilot rests on two basic principles:

  1. Fair use.
  2. The de minimis exception (US) or incidental inclusion (UK and elsewhere).

Fair use protection is broad under US copyright law. The EU's copyright directives codify a narrower set of enumerated exceptions, with adoption varying at the member state level. Many other countries have similar exceptions, though US fair use is the broadest to my knowledge.

Fair use is a doctrine that allows the limited use of copyrighted material without obtaining prior permission of the copyright owner.

Fair use - Wikipedia

Since 2016, US courts have held that automated scanning, indexing, and minor reproduction of copyrighted works, specifically Google's indexing of books for Google Books, is protected fair use.

Authors Guild, Inc. v. Google, Inc. - Wikipedia

Fair use most recently came into play in open source software with Google v. Oracle, in which the Supreme Court held that Google's reimplementation of the Java API in Android was protected fair use.

Google LLC v. Oracle America, Inc. - Wikipedia

Fair use protects the open source reimplementation of APIs, such as the Win32 API in ReactOS or WINE, as well as the reverse engineering of proprietary applications to create open source alternatives. Importantly, as in the Google Books case, it also protects scanning copyrighted datasets to develop indices, ML models, and other derivative works.
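To illustrate the first of these protections: in an API reimplementation, the public interface (name, signature, documented behavior) matches the original, while the body is written independently. A toy sketch, using a hypothetical legacy_strcmp API:

    # Toy illustration of API reimplementation: the declaring interface matches
    # the original's documented contract, but the body is independently written.
    # "legacy_strcmp" is a hypothetical API, used only for illustration.
    def legacy_strcmp(a: str, b: str) -> int:
        """Return a negative, zero, or positive int, per the original's contract."""
        if a == b:
            return 0
        return -1 if a < b else 1

WINE and ReactOS do this at scale, across thousands of Win32 entry points.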

Like the book scanning in the Google Books case, GitHub's scanning of the source code available on its servers to develop an AI model is protected fair use.

Incidental inclusion is a legal term of art from UK copyright law. It is a distinct carve-out that protects the accidental inclusion of small amounts of copyrighted material:

A typical example of this would be a case where someone filming inadvertently captured part of a copyright work, such as some background music, or a poster that just happened to be on a wall in the background.

This specific carve-out is needed in the UK and other non-US countries where fair use protections are not as broad.

In the US, accidentally including small bits of copyrighted material is protected under the umbrella of broad fair use protections; there it is referred to as the de minimis exception.

Under US fair use doctrine, the intent behind a use, the amount of material copied, and the effect of the copying determine whether an infringement is protected. Where the intent, amount, and effect are all minimal, the infringement is covered by the de minimis exception.

GitHub Copilot is not intended to violate GitHub contributors' copyrights.

While there have been a handful of viral examples of verbatim reproduction of code by Copilot, GitHub has produced reports that state the actual rate of verbatim reproduction is exceedingly low. There is reason to believe the model will continue to improve and that rate will go down.

Finally, the effect of that verbatim reproduction is also minimal. GitHub Copilot is not currently capable of reproducing whole software projects, undermining other companies, or destroying open source communities.

It is an AI-assisted pair programmer that is great at filling in boilerplate code we all use and borrow from each other in free and open source software, protected by FOSS licenses and fair use.
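To make that concrete, here is the kind of routine scaffolding such assistants are good at completing. This is an illustrative sketch, not actual Copilot output; the CLI and its flags are hypothetical:

    import argparse

    # Typical CLI boilerplate: given the function name or a short descriptive
    # comment, an assistant can plausibly fill in the rest of the parser setup.
    def main() -> None:
        parser = argparse.ArgumentParser(description="Convert a CSV file to JSON")
        parser.add_argument("input", help="path to the input CSV file")
        parser.add_argument("-o", "--output", default="out.json",
                            help="path to write the JSON output")
        parser.add_argument("-v", "--verbose", action="store_true",
                            help="print progress while converting")
        args = parser.parse_args()
        # ... conversion logic would go here ...

    if __name__ == "__main__":
        main()

Nearly identical parser blocks appear in thousands of projects; there is no meaningful "original" to infringe.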

While this is a general overview of the legal basis of GitHub Copilot, there are several valuable in-depth analyses that go into further detail:

GitHub Copilot is not infringing your copyright
Analyzing the Legal Implications of GitHub Copilot - FOSSA

It is also worth pointing out that organizations like the Free Software Foundation do not actually dispute the legality of GitHub Copilot; they just raise similarly vague concerns about it and throw in anti-Microsoft rhetoric for good measure, to appease their base. They must fundraise, after all.

Sure, GitHub’s AI-assisted Copilot writes code for you, but is it legal or ethical?

What Could Happen If GitHub Loses

What are some of the potential outcomes of the GitHub Copilot litigation?

  • Fair use and "incidental inclusion" in open source software become more restrictive.

Ever copy and paste code snippets from Stack Overflow? Did you remember to properly cite them and add the relevant Creative Commons license to your LICENSE.md file for that code? How about borrowing some sample code from a blog, or looking at a GitHub Gist and reproducing it in your code?

We all know we should apply that attribution/license, but do we always? How much of that code is running in production in your company or open source community right now?
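For reference, proper attribution for a CC BY-SA snippet borrowed from Stack Overflow might look something like the comment header below. The URL, author, and function are placeholders, not a real answer:

    # Adapted from a Stack Overflow answer: https://stackoverflow.com/a/0000000
    # (hypothetical link). Original author: example_user. License: CC BY-SA 4.0.
    # Changes: renamed variables and added type hints.
    def chunked(items: list, size: int) -> list:
        """Split a list into consecutive chunks of at most `size` items."""
        return [items[i:i + size] for i in range(0, len(items), size)]

Few of us are this diligent for every borrowed snippet, which is exactly the point.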

Thankfully, that kind of usage is likely protected under fair use. If that protection goes away, copying code like this could expose free and open source developers to additional liability, expensive lawsuits, and even troll lawsuits.

We could see a whole industry crop up of open source software copyright trolls, going after open source projects for minor infringements of licenses.

  • Training ML datasets on copyrighted materials becomes more restrictive.

The ability of ML developers to train models on copyrighted datasets under fair use is dramatically accelerating AI/ML. The progress in open source AI/ML built on otherwise-copyrighted datasets is unprecedented. In just the last 12 months, models and advances in model development that used to take years have arrived in weeks, sometimes days.

If training ML models on copyrighted datasets becomes more restrictive, AI/ML development will slow.

For example, I know of one AI/ML project (PDF) that scraped publicly accessible webcams during COVID lockdowns to measure social distancing. Those webcam images were copyrighted and, if fair use did not apply, could not be used without obtaining written permission from thousands of webcam owners.
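A minimal sketch of such a collection pipeline might look like the following, assuming a hypothetical public webcam endpoint (the real project's methodology is described in the linked paper):

    import time
    from pathlib import Path

    import requests  # third-party: pip install requests

    # Hypothetical public webcam feed; the real project pulled many such feeds.
    WEBCAM_URL = "https://example.com/webcam/latest.jpg"
    DATASET_DIR = Path("webcam_frames")

    def collect_frames(count: int, interval_s: float = 60.0) -> None:
        """Periodically fetch the latest frame and store it for later training."""
        DATASET_DIR.mkdir(exist_ok=True)
        for i in range(count):
            response = requests.get(WEBCAM_URL, timeout=10)
            response.raise_for_status()
            (DATASET_DIR / f"frame_{i:05d}.jpg").write_bytes(response.content)
            time.sleep(interval_s)

Requiring written permission for every frame source would make even this trivial pipeline legally impractical for an individual researcher.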

Such restrictions would have profound impacts on medical research, science, models that improve accessibility for users, and other practical applications of AI/ML that improve human lives and benefit our planet.

They would also mean more lawyers involved in model training, which would then become more expensive and slower.

Restrictions would also likely take ML model training out of the hands of hobbyists, open source developers, and individual researchers, limiting it to big corporations that can afford the compliance costs.

Individual ML developers, like individual open source developers, will suddenly face much more legal ambiguity and exposure if we do not defend fair use.

Conclusion

tl;dr Based on squeamish feelings that GitHub Copilot is something new and different, plus gripes about Microsoft from 20 years ago, a tech lawyer has teamed up with a class action plaintiffs' firm to sue GitHub over an incredibly helpful tool that improves open source quality and output. The potential outcomes of the case could include:

  • Making free and open source software harder to share
  • Making it harder to re-implement proprietary applications, hardware, and protocols as free and open software
  • Making training AI/ML models more expensive, taking it out of the hands of hobbyists and researchers, limiting it only to big corporations with huge legal departments
  • Slowing development of real-world applications of AI/ML models that will improve human life and longevity
  • Upending the current détente in the free software and open source communities over software patents

You do not have to love Microsoft or GitHub, or 'back them' in this case. But free and open source advocates who have concerns about GitHub Copilot should be just as skeptical of the GitHub Copilot plaintiffs, given what is at risk here.