Breaking Down The Latest AI Developer Benchmark From CodeSignal

CodeSignal, which makes skills assessment and AI-powered learning tools, recently released an interesting new benchmark study comparing the performance of AI coding assistants against human developers. The big headline is that many models now outperform the average developer and are starting to catch up to the top developers. That irresistible, clickbait-ready claim, however, is backed by much more than a headline, and there are some very tangible takeaways.

Before we get into the results, let's talk about the methodology. CodeSignal's skills tests are widely used in hiring, so if you were asked to take a skills assessment during an interview process, it may well have come from CodeSignal. These tests often require you to actually write code (say, 40 to 60 lines) and to answer questions about the development process. CodeSignal now has a dataset covering 500,000 developers who have taken its assessments, and this rich body of data gives the company a well-calibrated picture of developer skill levels, areas of competency and the like.

In this case, CodeSignal gave a set of LLMs the same test to see how they compared to humans. What made the study especially interesting is that CodeSignal also measured LLM efficacy using different numbers of worked examples in the prompt. In the prompt engineering world, this is called "few-shot" prompting (as opposed to "zero-shot," with no examples, or "many-shot," with a large number), and it is a valuable way to get more precise results from a model. To some extent, working a coding task from just a few examples also mimics how developers learn: when they get stuck, they seek examples from peers or from Google.
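To make the few-shot idea concrete, here is a minimal sketch in Python of how a prompt with a configurable number of worked examples might be assembled. The toy tasks, the EXAMPLES list, and the build_prompt helper are my own illustrative assumptions; none of this is CodeSignal's actual evaluation harness.

# A minimal sketch of few-shot prompt construction. The tasks, examples,
# and helper below are hypothetical illustrations, not CodeSignal's harness.

EXAMPLES = [
    ("Return the sum of a list of integers.",
     "def solve(nums):\n    return sum(nums)"),
    ("Return the largest value in a list of integers.",
     "def solve(nums):\n    return max(nums)"),
    ("Return the list sorted in ascending order.",
     "def solve(nums):\n    return sorted(nums)"),
]

def build_prompt(task: str, n_shots: int = 3) -> str:
    """Prepend n_shots worked examples to the task (n_shots=0 is zero-shot)."""
    parts = []
    for i, (example_task, solution) in enumerate(EXAMPLES[:n_shots], start=1):
        parts.append(f"Example {i}:\nTask: {example_task}\nSolution:\n{solution}\n")
    parts.append(f"Now solve this task:\nTask: {task}\nSolution:")
    return "\n".join(parts)

# Sweeping the shot count is how a benchmark like this one can compare
# zero-shot, one-shot, and three-shot performance on the same task.
for n in (0, 1, 3):
    print(f"--- {n}-shot prompt ---")
    print(build_prompt("Count the even numbers in a list.", n_shots=n))

In a real harness, each generated prompt would be sent to the model and the returned code graded; the sketch stops at prompt construction, since that is the part the shot count actually changes.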

This is a particularly intriguing test given that it is (a) not vendor-sponsored, and (b) built on a huge set of control data cultivated over years.

The results suggest that three shots (examples) yielded the best results for the LLMs. There is no directly comparable shot count for humans, since they could consult as many examples as they wanted and were constrained only by an overall time limit. The study then sets the three-shot LLM data against its human benchmarks.

Going through the results -- and drawing from my own long experience leading software development efforts -- has led me to a few conclusions.

When I saw the headline for the CodeSignal article, I knew I had to read it, and I was pleasantly surprised by both the methodology and the results. The protocol made sense to me and paints a picture of how far we have come, while also suggesting how far we can still go as we learn to harness AI. It is also a reminder that while the big stewards of AI have a distinct capital advantage, the only way AI will become real in both the consumer and the enterprise space is through efforts -- like those from CodeSignal -- that establish best practices and pragmatic approaches.
