[ML News] GitHub Copilot – Copyright, GPL, Patents & more | Brickit LEGO app | Distill goes on break

An open door, an open window, an open bottle, OpenAI and GitHub invent Copilot, and everyone freaks out about copyright. Welcome to ML News.

Greg Brockman writes: "An AI pair programmer in your editor. It's powered by OpenAI Codex, a new AI system which can convert from natural language to code with increasing reliability." He's talking about GitHub Copilot. So Copilot is this system, developed by OpenAI and GitHub, to be a super-duper autocomplete. Basically, what you do is write the name of a function or a class, or really anything you want, maybe along with a little bit of a docstring, and the system completes the code for you. Unlike classical autocomplete systems, which are rule-based and basically suggest what's possible, which variables fit here, which ones are in scope, this system goes much further: it tries to guess what you're trying to do, and it will write that code for you, or at least suggest it. They have a bunch of examples here, for example this parse_expenses example: the user writes the function name and then a few examples in the docstring, as you would if you were going to program it yourself, and then Copilot implements the function itself.

Now, I've been using Tabnine for a while and I'm pretty happy with its suggestions, especially if you pair it up with a classic autocomplete: the classic autocomplete tells you what you are allowed to do, essentially, and the AI autocomplete tries to guess what you want to do. This enables things like: if I catch an error that's called PasswordError, it will already suggest a log message for me that says "password wrong". There are many more examples where it just infers what you want to do, and that's super helpful at times. Copilot by GitHub is this on steroids: it will implement entire functions, entire classes, from a description or even just from the name of a function. It's not going to be perfect, of course. Whether it actually helps or hurts, and whom it helps, is an open question. Does it help the experienced programmer, because they can write faster and just have to check for errors? Because there definitely are errors; as you can see right here in this expenses function, the money is held as a floating-point number, which is a big no-no when you handle currency (a sketch of what that function looks like with the fix follows below). On the other hand, does it help novice programmers, because they get to see implementations of functions they wouldn't know how to write themselves, but then probably don't catch the mistakes that are in there? There's a lot of debate around this, but I'm pretty excited to see it, honestly.

The issue comes when you look at the following. They say it's "trained on billions of lines of public code, GitHub Copilot puts the knowledge you need at your fingertips, saving you" yada yada, marketing. However, "trained on billions of lines of public code" means they essentially went to all of GitHub, or at least the public repos, and trained a giant language model on it. It's nothing more than that: essentially something like GPT-3 on code, probably augmented by a bit of syntax handling and whatnot, but not much more. Lots of data and lots of compute give you a model of what people usually do when prompted with some sort of string. So, safe to say, this won't exactly replace programmers anytime soon, as you can maybe see from this is_even function, implemented to extreme precision, of course. Actually, I don't even know if that one is real or fake, because people have definitely been making fakes about Copilot. This is not going to happen anytime soon.
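To make that floating-point complaint concrete, here is a minimal sketch of the kind of completion we're talking about: a function name plus a docstring with a few example lines, and the body filled in underneath. The docstring format and field order are my assumptions rather than a verbatim reproduction of the demo, and the amount is held as Python's Decimal instead of the float the demo produced, which is the usual way to handle money.

from datetime import datetime
from decimal import Decimal

def parse_expenses(expenses_string):
    """Parse a list of expenses and return a list of triples (date, value, currency).
    Ignore lines starting with #.
    Example expenses_string:
        2016-01-02 -34.01 USD
        2016-01-03 2.59 DKK
    """
    expenses = []
    for line in expenses_string.splitlines():
        if not line.strip() or line.strip().startswith("#"):
            continue
        date, value, currency = line.strip().split(" ")
        # Decimal instead of float, so currency amounts don't accumulate rounding errors
        expenses.append((datetime.strptime(date, "%Y-%m-%d"), Decimal(value), currency))
    return expenses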
What's more worrisome is, for example, Copilot emitting personal information, such as this OpenSSH private key, which someone left in their repository and which Copilot is now just regurgitating. In fact, on the FAQ page, GitHub Copilot says yes, the system sometimes outputs personal data, not because it does anything wrong, but because people left that personal data in their repositories, the system is trained on those repositories, and sometimes it decides that the most likely output is exactly that training sample.

And that gets us into an interesting topic: does GitHub Copilot recite code from the training set? We've been having this discussion for a long time: do these large language models actually understand what they're doing, or are they simply reproducing the training set? And if they reproduce the training set, to what degree do they integrate maybe multiple training samples and combine them, or do they just take one and reformulate it a little bit? Who knows. GitHub did an extensive study in which they found that only about 0.1% of the outputs are in some way reproductions from the training set. However, there is a big dispute about what exactly counts as a copy, as a recitation, and how different is different enough.
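Just to illustrate why "how different is different enough" is the whole question, here is a toy recitation check: flag a generated snippet if it shares a sufficiently long run of tokens with any file in a reference corpus. This is emphatically not GitHub's methodology (their exact matching rule isn't described here); the point is that the answer you get depends entirely on the window size n and on how you tokenize.

import re

def _tokens(code):
    # crude tokenizer: identifiers and numbers, plus single punctuation characters
    return re.findall(r"\w+|[^\w\s]", code)

def looks_like_recitation(generated_code, corpus_files, n=20):
    """Return True if any window of n consecutive tokens of the generated code
    also appears verbatim in one of the corpus files."""
    gen = _tokens(generated_code)
    windows = {tuple(gen[i:i + n]) for i in range(len(gen) - n + 1)}
    if not windows:
        return False
    for path in corpus_files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            ref = _tokens(f.read())
        if any(tuple(ref[i:i + n]) in windows for i in range(len(ref) - n + 1)):
            return True
    return False

With n around 5 almost everything looks like a copy, with n in the hundreds almost nothing does, and that gap is exactly where the dispute lives.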
And that gets us into the biggest issue, which is copyright. The issue here is that GitHub and OpenAI essentially take all of this code, train their system on it, and they don't give you Copilot for free. Of course not; I mean, how else are you going to live up to the name "OpenAI"? They are of course going to sell this. Now, fair enough: they did something cool, they want to make money. However, the code they used to train the system isn't always freely available, at least that's what people think. How would you feel if you wrote some code, you were the legal owner of the copyright to that code, and GitHub simply trained a model on your code and then sold that model for other people to produce their code, without having to give you anything for it? There is also the issue of GPL-licensed code, which requires that any modifications to it again become GPL-licensed. The question is: if the model outputs code that is a result of training on GPL code, does the output of the system also become GPL-licensed or not? And there is even more of an issue when it comes to patents on code. Patents are yet another category of intellectual property protection, and we've seen examples of Copilot reciting patent-protected code.

With all of this, I've been reading into software copyright and whatnot a little bit, and I want to give the disclaimer: I'm not a lawyer, this is not legal advice, this is for entertainment purposes only. If you want an actual opinion, go to an actual lawyer and pay them. But one thing that can be said is what Lucas Beyer says here: with everybody hypothesizing about Copilot and the GPL license, let me add another perspective; nobody knows, and nothing whatsoever will happen until someone sues someone, and I'm not going to hold my breath. Which is true: ultimately a judge is going to have to decide, case law has to be established, and we'll take it from there. So what follows is my personal opinion on the matter, trying to analyze this a little bit.

Here's a bit of a diagram of what's currently happening in this system. You have the Copilot system as a piece of software that contains maybe a neural network that has been trained on some stuff. How did this Copilot come to be? Copilot is built upon libraries such as PyTorch, which are usually fairly openly licensed, like an MIT license or something like this, so there's no problem there. Then Copilot of course needs copilot.py, the thing that you actually run to do the training and the inference, which is also authored by the Copilot authors and therefore not an issue in our case. But then one of the inputs to Copilot is, of course, the giant data set.

Before we even get into the licensing of that data, we have to talk about copyright itself. Everybody's talking about the GPL license and whatnot, but the GPL, being a copyleft license, only has any pull if copyright law even applies. So first we have to see: does copyright law even say anything about using this code in this way? Copyright law works differently in different countries, but in general it protects creative outputs of people. If you do something, if you express yourself in some creative way, you automatically obtain copyright on that artistic expression. So if I write a song, then I am the owner of the copyright for that song; I don't have to register it anywhere, I have it by default. Now, as an owner of copyright I get certain benefits. For example, I can decide whether or not my work is reproduced, which derivative works can be made and how they are treated, how it is distributed to the public, how it is performed, and so on. I have certain rights to the dissemination, reproduction, and modification of my work. Notice what's not on this list: enjoying the work, reading the book, reading the code. As a copyright owner, once I've decided to display my work publicly, I can't actually prevent anyone from looking at it in the public space that I chose to display it in.

So one place we actually have to go is the terms of service of GitHub. Under "user-generated content", GitHub says: you own content you create, but you allow us certain rights to it. In a subpoint they say: we need the legal right to do things like host your content, publish it, and share it; this license includes the right to do things like copy it to our database, make backups, show it to you and other users, parse it into a search index, or otherwise analyze it. Now, you can debate whether or not "otherwise analyze it" means they can run a machine learning model on top of it, given that they say this is in order to fulfill their service. But certainly you allow GitHub to display your code, and anyone can go on GitHub, and you cannot prevent them from reading your code; you cannot prevent them from downloading your code to a private hard drive. In fact, the ideas and algorithms behind code are not copyrightable; what's copyrightable is only your expression of those ideas. So I can't copy your code, but I can look at your code, learn from it, and then express the same idea in my own code. If you want to protect an idea, that's the realm of patents, and that's a whole other game; you actually have to register for a patent, whereas copyright you obtain automatically. So if I can look at your code, learn from it, and then reproduce it in my own way, why shouldn't a machine be able to?
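As a toy illustration of that idea-versus-expression distinction, here's a made-up example that has nothing to do with Copilot's actual output: both of the following functions implement the same uncopyrightable idea, binary search over a sorted list, but they are two different expressions of it.

def binary_search(items, target):
    """Iterative expression of the idea."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def find_index(sorted_values, wanted, first=0, last=None):
    """Recursive expression of the same idea, with different structure and names."""
    if last is None:
        last = len(sorted_values) - 1
    if first > last:
        return -1
    middle = (first + last) // 2
    if sorted_values[middle] == wanted:
        return middle
    if sorted_values[middle] < wanted:
        return find_index(sorted_values, wanted, middle + 1, last)
    return find_index(sorted_values, wanted, first, middle - 1)

Copying either function verbatim copies an expression; reading one and then writing the other from your own head only reuses the idea.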
And that brings us to the second important point right here, which is the right to prepare derivative works based upon the work. According to Wikipedia, a derivative work is an expressive creation that includes major copyrightable elements of an original, previously created first work. The article here is mainly concerned with what copyright exists on the derivative work, but for our purposes, if something is a derivative work of something else, it is potentially in violation of the copyright of that first work. And when is something a derivative work? When it contains major copyrightable elements of that original.

Now, is this all a bit fuzzy? Yes, absolutely, and there is of course a giant gray area. If I look at an algorithm and implement it in my own code, what counts as containing major copyrightable elements of the original? If I use the same kind of indentation? The same variable names? The same structure? This isn't really an exact science; it is for judges to decide. But safe to say, there is a way in which I can learn from other people's code, no matter the copyright situation, and then write something based upon that, and it is not a copyright violation. There are also many situations where the exact same thing is a copyright violation, and that all depends on how much of the copyrightable elements, so not the ideas but the expression of the original work, is contained in the derivative work. And that, of course, brings us all the way back to the discussion: do large language models simply recite the training data and change it a tiny bit, or do they integrate the training data, learn from it, learn the patterns behind it, and then come up with their own way of expressing those patterns? The truth is probably somewhere in between; they're not exactly copying the training data, but it's also not the case that they really understand what's behind it. Safe to say, though, there is a way where copyright might not even apply, and then there is actually no problem right here.

But let's assume for a moment that copyright does apply and things are actually in the realm of derivative works. Well, then there are still multiple questions. For example, here you see that there are multiple elements in the system. One is Copilot itself as a piece of software. If you argue that somehow the copyrightable elements of the input data end up in the weights of the neural network, and therefore the neural network is essentially a derivative work of the input data, then Copilot itself might be in violation of copyright law. But even if Copilot isn't a violation of copyright law, the output of Copilot might still be, and that's probably going to have to be decided on a case-by-case basis. It might even be that OpenAI is not responsible for this, but rather the person actually using the Copilot tool to generate output. It's all a bit of a messy situation.

Notice what we haven't talked about so far: the GPL, because the GPL, as I said, only applies when copyright applies. So let's assume copyright applies. Here is where we get into licenses of code in general. The training data contains broad categories of how code is licensed, and I've listed four of them here. There is the boring code, which is so boring that copyright doesn't apply: literally no expression of creativity, just formulaic code writing, maybe even auto-generated; not copyrightable, not a problem there. There is also the open category, which is so openly licensed that it's usable in any form, like an MIT license; as long as you keep the disclaimers in there, you're fine. Then there is the bunch of code that does not have a license at all. If there is no license, that essentially means the copyright owner simply gives GitHub the right to publish but retains all other copyright, and everything we said so far applies: either Copilot, or the output Copilot generates, or actually both, might be a violation of the copyright of the unlicensed code. And then there is GPL code. So, the GPL, the GNU General Public License, in this case version 3, but they're all kind of similar (I know, I know, Tivoization).
GPL licenses are generally known as copyleft licenses, because if a piece of code is licensed under the GPL, it means that if you modify this code, then your modifications also have to be licensed under the GPL. And being licensed under the GPL means things like: if someone obtains a copy of the software, then you also have to provide a copy of the source code with that software. So the GPL is a bit like a virus: if it initially applies to a piece of software, and someone else uses that software, maybe modifies it a little bit or includes it in their system, the whole system has to be under the GPL, or they are in violation of the license. Of course, if Copilot is found to be a derivative work of GPL-licensed data, that would mean Copilot itself would fall under the GPL, and therefore OpenAI would have to give us its source. Now, what "source code" is, is a bit of a tricky business in the legal scene, but the GPL defines it as the preferred form of the work for making modifications to it. What is that, exactly, for Copilot? Maybe it's not the weights of the neural network itself, because how would I modify those? Maybe it's the training set plus copilot.py. Maybe it's not even the training set, but actually the scraper for the training set as well as the training code. Who knows? Now, GitHub and OpenAI can save themselves from having to release the source code of Copilot if they only make it available over the network, in which case the GPL doesn't force them to hand out the source; that obligation would only kick in under the AGPL.

Regardless of that, the bigger question is: what if the output of Copilot is a derivative work of GPL-licensed code? In that case, the output of Copilot, on a case-by-case basis, would also have to be GPL-licensed. And who's responsible for that? Probably you, as a user of Copilot. If you ask Copilot for code, you get an output; I don't think it matters whether or not you know that it's a derivative work of some GPL-licensed code. If you then use that code, build upon it, and maybe sell software based on it, that software technically is under the GPL.

So this was my little take on the copyright situation around Copilot. I think it's a great tool, but you can also see it brings a lot of difficulties with it, not necessarily technical difficulties, but difficulties from the human environment. Let me know in the comments what you think about the situation, about copyright, and whether I completely butchered some of these things. Thanks.

Next news: speaking of copyright, Facebook AI launches an image similarity challenge, where they want you to figure out where all the memes came from. The challenge is essentially figuring out whether someone took some photo and modified it in some way, and of course the reason behind all of this is to find the original creator of every meme, so we can give them the proper credit and glory they deserve. Nothing else. No one else. Image matching: very limited applications, don't even worry about it.

Next news: Brickit is a new app that scans your LEGO bricks and tells you what you can build from them. PetaPixel has a good article about it and shows this demo video. The app will scan your collection of LEGO and then tell you what you can do with it, so you can see it gives you a bunch of suggestions of what to build. Pretty neat. This is a really, really cool app, though I wonder: the things it proposes are often made out of maybe 20 parts, and this pile has at least 500 or so. In any case, if you do have an iOS device, which I don't, give it a try; it looks like a lot of fun.
Next, in more sad news, the Distill.pub website is going on a break. You might know Distill as an online journal which publishes in a non-traditional way: they want very interactive articles, very visual articles explaining something. They also publish commentaries and threads, but also peer-reviewed science. The frequency of publication hasn't been too high, but the things they have published were generally super well received. One reason they cite is volunteer burnout, which, given the high quality standards that they have, I can totally believe; it is an enormous effort to keep this going and keep the quality high, and, you know, respect for doing it this long. The article makes another point, namely that self-publication seems like the future in most cases, and I think the field generally agrees: today's scientific progress is made more through sharing arXiv publications and discussing them on social media than through the peer-review system of conferences. So even though it's sad that Distill will take a break, what they're advocating for is a better future for science, and that's a great thing.

Okay, next news: Engadget writes, "Amazon is reportedly using algorithms to fire Flex delivery drivers." So Amazon, being Amazon, has this huge fleet of drivers that they don't necessarily hire; it's kind of like an Uber model, where the driver has an app and then they essentially get subcontracted for driving stuff somewhere. And these aren't few drivers; there are apparently millions of drivers doing this. Keeping up some sort of HR department, some sort of human contact, with millions of people is a challenge, so Amazon opted to just not do it. Instead, they use algorithms to track the performance of their drivers, and if the performance sinks too low, they fire the drivers algorithmically. The article relays the frustration of some of these drivers, saying the system can often fire workers seemingly without good cause. According to the report, one worker said her rating fell after she was forced to halt deliveries due to a nail in her tire; she succeeded in boosting it back to "Great" over the next several weeks, but her account was eventually terminated for violating Amazon's terms of service. She contested the firing, but the company wouldn't reinstate her. Another driver was unable to deliver packages to an apartment complex because it was closed, the gate was locked, and the residents wouldn't answer their phones; in another building, an Amazon locker failed to open. So their own system failed, and they punished their drivers for it. His rating also dropped; he spent six weeks trying to raise it, only to be fired for falling below a prescribed level. If a driver feels they were wrongly terminated, some feel there's not much recourse either: drivers must spend $200 to dispute any termination, and many have said it's not worth the effort. "Whenever there's an issue, there is no support," said Cope, who is 29.

"It's you against the machine, so you don't even try." Now, here you could try to make a nuanced point: that these people aren't employees, that it's simply not practical to manage them as employees, that overall the system might be better off, that a lot of drivers are having good experiences, that this is just a necessity of managing so many people. But, but, see: not so long ago I wanted to get some Amazon gift cards for my Discord admins. They're doing a good job, and I wanted to give them some thanks. So I tried to buy some gift cards, and Amazon locked me out of my account, security reasons. So I verified my identity, all good, tried to buy the gift cards again; they locked me out again. Verified my identity, tried a third time; now they've locked me out permanently. So I'm trying to contact support. Guess what you have to do to contact support? Log in. Oh, great. Guess what you have to do to get a support contact number? Log in. Oh, great. Tried emailing them: nothing happened. Tried calling them: they say they'll fix it; they haven't fixed it for months now. They said I should make a new account. Great, verify the phone number of the new account: "your phone is already associated with an account". My old account has my whole collection of audiobooks and ebooks on it, and this is just splendid. So I definitely feel with these drivers. If it's you against the machine, Amazon ranks just about second to PayPal when it comes to actual customer support. So I'm not going to make the nuanced point here. Screw you, Amazon. Screw you. You deserve every bit of negative press that you're getting. At the very least, when there's an issue, have some support for your drivers who get a nail stuck in their tire. Yes, I'm using a journalistic medium to settle a personal dispute. What are you going to do about it? Get me my account back.

Okay, next we're going to look at some helpful libraries. We should make this a segment: helpful libraries, helpful libraries. Okay. TensorFlow introduces Decision Forests: new algorithm, never heard of it before; give it a try, decision forests in TensorFlow (a tiny usage sketch follows after this segment). Facebook Habitat: a 3D environment to train your autonomous robot to get you something from the fridge when you're just too lazy; have fun with your diabetes, try it out. Google Research Falken trains your game-playing agent: you give it a little bit of a demonstration, it learns how to play your game and tests it for you and finds bugs, so now you don't even have to play your game while you don't walk to the fridge. Good job. And lastly, did you ever want to figure out what the gradient is of your face smashing against the wall? Well, now you can with Google AI's Brax: you can simulate physics in a differentiable way, on a TPU, really fast.
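For the Decision Forests item, here is a minimal sketch based on the library's quickstart; the CSV path and the label column name are placeholders, not anything from a real project.

import pandas as pd
import tensorflow_decision_forests as tfdf

# any tabular dataset with a label column will do; path and column name are made up
df = pd.read_csv("my_tabular_data.csv")
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

# gradient-boosted trees with default hyperparameters; RandomForestModel works the same way
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)
model.summary()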
And in our last news, TNW writes: "Fake science is getting faker, thanks AI." Journals are retracting more and more papers because they're not by the authors they claim to be. Now, of course, you always know it's a serious article when there is a very futuristic robot in the picture at the top, but the article is actually a good one, talking about the rise of AI-generated papers and how there is a massive upsurge in retractions among scientific publications. Besides that, I like the intro. They say: of course, sometimes papers get retracted because the authors made an honest mistake in the research; in more than half the cases, however, it's because of academic misconduct or fraud. Up until a decade ago, this sort of behavior was more or less limited to researchers falsifying experimental data or skewing results to favor their theory; the more sophisticated technology has become, however, the more things have gotten a lot more complicated. So the rest of the article talks about how people add big names to their papers, how people generate fake authors, even how people generate entire fake papers, and so on. That's a whole big problem, but I still think that people being shady with the results of their research is the biggest problem; there just aren't too many retractions for it in machine learning, because who can ever reproduce someone else's paper? If you didn't get my numbers, you just did it wrong.

So what is the real solution against fake science? It's probably hard to know, but I guess an approach to a solution would be some sort of distributed checking mechanism, where you can aggregate opinions from all around the world about a given topic, look at everything, and evaluate for yourself, rather than relying on a centralized committee to do it for you, be that for fake news or fake science or fake anything. I think that's the only way forward, because any centralized institution will eventually get either corrupted or gamed, because it has some sort of scoring system. But I'd be interested in what you have to say: all of this is a problem, it's not exactly clear how we go about making it better, and can we even make it better, or can we just find better ways to ignore the fake things?

All right, that was it from me for this week's ML News. I hope you had fun, I hope you don't get replaced by a machine anytime soon, and most of all, I hope I don't get replaced by a machine anytime soon. So, wish you a happy day, and goodbye!
