Even months after ChatGPT and Copilot have had a chance to take hold in the development world, there’s still a lot of frenzied excitement… but much less discussion of how using LLMs to help write code actually affects developers.
I wanted to go beyond the surface-level garbage I was finding, so I tried GitHub Copilot for myself for four weeks and recorded my thoughts.
My goal is to share how I felt personally and how using Copilot affected me as a developer, as well as explore what elements I think will change or remain the same.
I’ll be upfront: this is a snapshot in time from July-August 2023, and I fully expect many aspects of the tech will change as companies like GitHub learn and build.
Setting the stage: my situation
For context, I’ve been programming for over a decade, and VSCode is my go-to IDE. The only change for me was installing the Copilot extension. I make heavy use of Intellisense autocomplete and type-checking in my IDE and can crank out code decently fast.
While I’ve been programming for a long time, I’ve only more recently started using TypeScript, which is what I was using for my two primary projects during the experiment: a custom web app and a small test project for experimenting with different database/ORM libraries.
I was very excited to test out Copilot because A) I was writing in a language that I was not an expert in and B) I desperately wanted to speed up my development so I could finish projects faster.
So I really wanted Copilot to work, but I also wasn’t expecting the silver bullet promised by the hype on social media.
Starting off (on the wrong foot)
I started off in “don’t suggest large blocks of other people’s code” mode, mostly because it seemed like the right thing to do, but also because I wanted to see if it made a difference in how Copilot performed.
This decision ended up making my first week the worst of the four in terms of effectiveness, and the experience was really underwhelming until I understood what I could do to improve the suggestions. (The other big improvements came from writing very specific prompt comments and getting a feel for when to request multiple completions rather than just taking the first autocompletion.)
The results got better when I turned off blocking suggestions that match public code, which is just a sign of where the tech is right now and something I expect will get better. But for now this was a pretty big disappointment.
What Copilot is really like
My initial testing looked similar to the GIF below, where Copilot suggests a few wrong things: just returning a path string, or a download function that doesn’t exist… but with some cajoling it spits out what you expect:
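To give a concrete sense of it, here’s a hand-written reconstruction of the kind of exchange the capture showed (not the exact code; the function and variable names here are hypothetical):

```typescript
import { writeFileSync } from "node:fs";

async function downloadFile(url: string, destPath: string): Promise<void> {
  // First suggestion: just return a path string, with no download at all:
  //   return "/tmp/" + url.split("/").pop();
  // Another suggestion: call a `download()` helper that existed nowhere:
  //   await download(url);

  // What I actually wanted, which only showed up after rewording the comment:
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  writeFileSync(destPath, Buffer.from(await res.arrayBuffer()));
}
```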
One of the things you learn from using Copilot is that it doesn’t do well in situations like this where it can’t infer what to do from other parts of the codebase (the screen capture above was also using the “block public code” setting).
What does the impact of a bad suggestion look like? One time I lost at least 15 minutes debugging an issue just to find that Copilot had suggested a variable name that looked very similar to a variable that existed, but it wasn’t actually the right one -_-
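As a hypothetical reconstruction of that failure mode (not my actual code; the names are invented), the trap looks something like this:

```typescript
declare const session: { userId: string };
declare const legacyRecord: { uuid: string };
declare function getOrdersFor(id: string): Promise<string[]>;

// Two lookalike identifiers, both legitimately in scope:
const userId = session.userId;      // the id I actually wanted
const userUuid = legacyRecord.uuid; // a near-miss left over from older code

// The suggested line type-checks fine, and does the wrong thing:
const orders = await getOrdersFor(userUuid); // should have been userId
```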
I consider myself lucky to have been using a dev environment with solid as-you-go type checking, otherwise I would have gone mad and rage-quit the experiment within the first few days.
When wrong names are right
On a positive note, one time Copilot suggested a name for a variable that didn’t exist instead of the one that did… and I realized that the naming scheme that Copilot was suggesting was actually more consistent than the one that I had originally written.
This is actually an unexpected upside. “Novelty” in code is often a bad thing because it requires more from the reader to figure out what’s going on (principle of least surprise at work).
Good naming consistency shows when someone familiar with a codebase can guess the name of a function they’ve never seen, and using an LLM could reinforce this.
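A tiny invented illustration of what I mean (these aren’t my real function names):

```typescript
// What I had written, inconsistently:
function getUserById(id: string) { /* ... */ }
function fetchOrderForId(id: string) { /* ... */ }

// The name Copilot suggested followed the dominant pattern,
// and was better than the one-off name I was about to write:
function getOrderById(id: string) { /* ... */ }
```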
Copilot is really good at certain things
The second week I spent more time playing around with different TypeScript ORMs to see how they were different and which one I might want to use in my other projects.
This is when Copilot got good. For trying out new libraries or modules, I think LLMs can be super helpful.
Especially for things like ORMs and database stuff, which tend to be heavily template-oriented. LLMs pick up on the patterns quite well and speed up the process of writing typical DB access functions.
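To show the flavor of patterned code I mean, here’s a sketch using Prisma (just one example of this category of ORM; the schema and model names are assumptions for illustration):

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Once one access function like this exists, Copilot reliably
// autocompletes the siblings (create, update, delete, list...)
// because the shape is so predictable.
export async function getUserByEmail(email: string) {
  return prisma.user.findUnique({ where: { email } });
}

export async function createUser(email: string, name: string) {
  return prisma.user.create({ data: { email, name } });
}
```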
Not that this is a surprise; it’s the core promise of this technology, though some academic research suggests the productivity gains show up on some measures without being a definitive overall improvement.
My experience was that I still had to do a fair bit of fixing up and sometimes I’d get suggestions that would use deprecated patterns that I’d still have to look up. But overall, writing DB/ORM code felt like less of a lift with Copilot.
Copilot, flow and other psychological factors
Also within the first week, I noticed the suggestions were having an impact on my flow state while developing.
I would set out to write a function, knowing how I was going to do it, and then partway through I would get a suggestion that would make me stop and think about whether I should accept the suggestion.
And then I’d realize I’d lost my train of thought.
Even when the suggestions were good, I found this very disruptive.
This could be due to my own experience and being able to get into flow quickly in the first place, but I didn’t like being interrupted and feeling like I had to think about someone else’s idea of what should be written next.
Driving development vs letting Copilot lead
The constant stream of auto-suggestions seems to push the development experience to a feeling of being “pulled along”, where I’d start by writing a good docstring or comments… then wait to see what the suggestion was and decide whether to accept it… then move to the next suggestion and repeat.
This workflow is very different from my typical development cadence, and to me it felt unsatisfying and lazy.
Programming by accepting suggestions felt like deferring active control of how a function should be written… which felt similar to over-reliance on type-checking, where you just keep writing/fixing a function until the warnings go away instead of making sure the function handles all corner cases correctly.
I understand that inline suggestions are something you can turn off, but beyond being the default, these suggestions are what make Copilot feel “magic”. So I think it’s fair to represent this as a potential downside of the intended experience.
Can you trust Copilot? Can you trust yourself?
This idea of deferring responsibility and trusting suggested code actually screwed with my mind a lot more than I thought it would.
When the suggestions were bad, I felt I had to be on the defensive, but I still accepted some of them, and those became part of my code.
But since there’s no indication of what I wrote vs. what I accepted, I later started to feel suspicious of the code I actually wrote myself, worried that it was wrong because of a Copilot hallucination!
This was a very unsettling feeling; my first programming experience that made me question my own sanity.
Perhaps even weirder (but less psychologically troubling) was when I noticed that after getting a lot of bad autocompletions from Copilot, I felt myself distrusting autocompletions in completely different applications!
This was disconcerting but also just plain unhelpful since many apps only auto-suggest based on substring matches that are pretty much never wrong.
Weird how the human brain can connect things that look similar even when it knows they are different.
So yeah, using Copilot resulted in the most disturbing cognitive effects I’ve ever experienced from a programming tool.
Sometimes it feels like magic, sometimes betrayal
When Copilot’s suggestions were on-point, it sometimes felt like it was reading my mind, and it did save me a few trips to Google… I understand how this feeling has turned some folks into AI advocates.
But I still had to go back and fix silly stuff.
Just like when you look stuff up on the web, you can get results that are out-of-date, slightly incorrect, or flat-out wrong.
But somehow it feels worse with Copilot; it feels more personal.
Maybe it’s because I don’t have the satisfaction of blaming that jerk hypnoToad97 from the forums or some other rando from the Internet.
Or maybe it’s because when you look at social resources like Reddit or StackOverflow, you have the benefit of humans downvoting suggestions that don’t work.
Either way, to me it felt like working with a human pair-programmer; you learn their strengths and weaknesses and you develop a feeling for whether you can trust them…
The risk of learning the wrong lesson
By the end of the second week I’d grown accustomed to accepting more suggestions, but this reliance was only made workable by strict type-checking; otherwise it would have been too easy for bugs to slip in.
One specific example that surprised me was when I got a suggestion that looked right but type-checking told me was wrong. It turned out to be a function where I had changed the interface and Copilot re-suggested the old interface that no longer worked.
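Here’s a hypothetical reconstruction of that moment (the interface and names are invented, but the failure mode is the same):

```typescript
// After my refactor, the input type looked something like this:
interface CreateOrderInput {
  userId: string;
  lineItems: { sku: string; quantity: number }[];
}

declare function createOrder(input: CreateOrderInput): Promise<void>;

// Copilot kept re-suggesting the pre-refactor shape it had seen elsewhere:
createOrder({ userId: "u1", items: ["sku-1"] });
// TS error: 'items' does not exist in type 'CreateOrderInput'
// (and 'lineItems' is missing).
```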
This was an unexpected downside, but really just the other side of the coin I mentioned earlier, where LLMs tend toward consistency. Since LLMs suggest patterns they’ve seen before, if there’s an incorrect usage in one place, using something like Copilot can spread it like an infection across your codebase.
I got lucky because in my dev setup the IDE warned me almost as soon as the mistake happened, but this is definitely something to be wary of.
When Copilot suggests vulnerable code
In the third week of the experiment, I finally witnessed firsthand Copilot suggesting vulnerable patterns in the form of SQL injections via string interpolation.
Who needs to use the ORM when you can just jam strings into raw queries? The suggestion I received looked very similar to this:
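Reconstructed from memory, with the table and variable names as placeholders, and using a node-postgres-style query call for illustration:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// What Copilot suggested: raw SQL with string interpolation (injectable!)
async function findUserUnsafe(email: string) {
  return pool.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// What it should look like: a parameterized query (or just use the ORM)
async function findUserSafe(email: string) {
  return pool.query("SELECT * FROM users WHERE email = $1", [email]);
}
```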
As a security-minded person, I was waiting for this, but I didn’t go out of my way at all to force it. I had other examples of proper database access that weren’t vulnerable, so it surprised me when this popped up.
The good news is that if you’re already working in a codebase that has a lot of safe patterns, it should be less likely that you get vulnerable patterns suggested… but this is still concerning because it’s part of a larger pattern.
If you’re interested, I recommend the following paper that investigated Copilot’s performance across a variety of vulnerability types: Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions.
My guess is that string interpolation and variable substitution will continue to cause problems for LLMs, because this is such a general pattern in programming that requires deeper understanding to avoid pitfalls.
When you think about it, what Copilot and similar tools really enable is the use of snippets and functions without fully understanding what they do… which is the essence of its magical speed, but also sounds like a recipe for bugs.
Is Copilot worth it?
In my final week of using Copilot I was writing less library-wrapping duct-tape code and more custom stuff, and I was reminded again of the issues with flow interruption.
Overall the suggestions were good, but not great… and that’s just not good enough if you’re going to put text right in front of my cursor.
In general, it did feel like working with a pair programmer, but one that I didn’t feel like I could fully trust.
But what about speed?
I tried to gauge whether I was actually more productive using Copilot. Autocomplete can throw down a lot of text, but it also cost me time assessing the suggestions, or requesting several alternatives and reading through them to see whether any worked and which was best.
I think this would be an interesting thing to try to measure, but you’d have to be pretty rigorous to control and measure for specific things, like how much time people spend on debugging when things go wrong.
It would also be really easy to design such an experiment to make Copilot look really good or really bad, because it definitely is good at some things (writing repetitive error-handling blocks) but bad at others (writing a function with no clear precedent in your codebase or in public code).
From where I stand, I can’t say that my development felt noticeably more productive with Copilot.
But I can say I holistically felt better developing without it.
GitHub Copilot Pros and Cons
Overall, Copilot is a powerful tool with benefits and drawbacks that I don’t think should be taken lightly:
Pros:
- Really good at repetitive tasks like autocompleting common exception handling or logging blocks
- Good for encouraging consistency within projects
- Speeds up developing with a new library or language (compared to looking things up on Stack Overflow)… at least until you have to debug something that isn’t working
- Feels like magic when it’s right
Cons:
- Super annoying when autocomplete suggestions are wrong; seeing suggestions that vary from what you’re intending to write can take you out of flow
- May change how you program: from driving the ship to getting pulled along because you stop to wait and see what suggestions come up
- Introduces risk in terms of insecure code, or by not being fully engaged in handling everything that could happen in suggested code, or by spreading bad patterns through a codebase
- GitHub seems to be somewhat cagey about saying what a “prompt” is in plain English (weirdly suspicious if you ask me), but the prompt material sent out to the cloud includes your code, as clearly stated here. While they (currently) claim it isn’t stored or used for training, at least in some cases, sending prompt snippets is arguably a smaller exposure than hosting all your code on GitHub in the first place… but you still might have feelings about this aspect.
Also, LLMs present a new paradigm with potential legal ramifications that have yet to play out… I’ve avoided this topic because I’m not a lawyer and it isn’t unique to Copilot, but the entry in Copilot’s FAQ worries me. It kinda sounds like “we are almost positive someone will get sued for using Copilot”.
Despite GitHub’s vested interest in winning any such cases, my personal (non-lawyer) impression from the fine print in the linked doc is that they’re not going to help at all if you don’t have the “duplicate detection” feature set to “block”.
Back to human craftsmanship
After uninstalling Copilot, I initially felt a little slower because I wasn’t autocompleting as much, and it took me a minute to get used to plain Intellisense completions again. Despite this (and the reminder that Intellisense isn’t perfect either), I had better clarity of thought and felt less “pulled along”.
I prefer to develop this way, but then again I’ve been coding for a long time and am very comfortable doing it. I’m sure more powerful AI/LLM tools will come along, and I’ll do my best to keep an open mind about them.
But for now, I’m happy flying solo in my IDE.
If you’ve tried out these tools and had any similar experiences (or any other weird or disconcerting stuff), I’d be really interested to hear about them! Find me on your favorite social and drop me a line.