There's an often overlooked role in the smart contract security industry. It's shapeshifted throughout the years, but it's always been around.
Nowadays, with the rise of crowdsourced security, no doubt it's taken the spotlight.
I'm talking about bug triagers. Also known as "judges" in contest platforms, they're the ones in charge of deciding the severity of reported bugs.
After some first-hand experiences and asking around, I can't stop thinking their work could be 10x better. Let's think together about how.
In the traditional form of security assessments ("audits"), the reported severity is not crucial. The security reviewers triage bugs themselves, evaluating things like impact and likelihood of exploit. The resulting classification (low, medium, high, etc.) is just a way to suggest an order of priorities for developers to fix issues. In turn, developers do their own triaging. They come up with their own order of priorities, and likely challenge some of the issues raised in the security report. If they turn a critical into a low, or whatever, they may argue a bit with the reviewers. But ultimately, the cost of the engagement is tied to neither the number nor the severity of issues reported. So after some compromises on their egos, security reviewers and developers will agree on severities and move on with their lives.
Bug bounty and contest platforms are a different game for triaging. The severity that judges determine is fundamental in deciding the actual payout for a bug disclosure. No surprise that judges have been put in the public spotlight more than once when they've failed.
Judges face many challenges today. They must withstand the mentally draining effort of going through hundreds and hundreds of poorly written, unclear, confusing, deceptive, spammy issues, only to find and validate the worthwhile ones manually, or with custom hacky tooling. And when they do, they must somehow not be influenced by the reputation of reporters, nor by their tricks to exaggerate severities. And when they do, they must classify all issues consistently without any standard methodology. And when they do, they must stay consistent with how similar issues were judged in the past. All of this, sometimes working as individual contractors, without anybody standing behind their work. Sometimes receiving only breadcrumbs of the award pots.
No surprise we see these kinds of comments from judges:
And it's not only the individuals who get criticized. The reputation of bounty and contest platforms also takes a hit when judging goes wrong.
Wouldn't they be glad to outsource the process to anyone who can do it consistently well? They'd delegate most of the responsibility, focus instead on their core businesses, and hopefully alleviate this bottleneck in their pipelines. I mean, nobody wants to wait 4+ weeks (with luck) to be paid after a contest.
Today, the whole triaging process is slow, inefficient, opaque, subjective, gameable, poorly compensated, not reproducible, under-specified, and lacking common evaluation criteria and methodologies across the ecosystem.
Isn't there room to make this much much better? I'm of the idea that individual efforts won't solve judging. So what do I envision?
The professionalization of judges. They must unite and become a real service provider in the smart contract security industry.
Could a security team specialize in judging / triaging, and sell it as a B2B service to bug bounty and contest platforms, as well as to projects with self-managed bounty programs?
I wonder if it'd be profitable enough for a standalone business, or if it'd be best built as a new capability in existing security firms.
Whoever provides this service must offer reassurance and objectivity. It should be able to build (or already have) a reputation of indisputable independence. That'd give all parties confidence that whenever it handles the judging, all sides will be treated fairly. As a dedicated team, it should also be harder to influence than any individual. They could also negotiate more favorable terms with the security platforms (fixed fees, a % of awards, fees based on number of issues and escalations, etc.).
As they professionalize and scale, triaging teams could create reliable open-source infrastructure and tooling for faster and more accurate classification, spam filtering, duplicate detection, and comparison with similar issues in other public reports. It should be possible to build open-source, auditable, automated triaging pipelines to improve efficiency and effectiveness, reducing errors and therefore the number of escalations. Also, a team structure would allow for specializations (think sub-teams of bridge specialists, L2 specialists, DeFi specialists, etc.), division of tasks, and escalation tiers in their manual work.
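To make the duplicate-detection piece concrete, here's a minimal sketch of how such a pipeline stage might flag near-identical reports. All names here are hypothetical, and a production pipeline would likely use embeddings or MinHash at scale; this toy version just compares word-shingle overlap between report texts.

```python
# Toy duplicate detector for triaging pipelines (illustrative sketch only).
# Assumes reports arrive as plain-text descriptions.

def shingles(text: str, n: int = 3) -> set:
    """Break a report into overlapping n-word shingles."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def likely_duplicates(reports: list, threshold: float = 0.5) -> list:
    """Return index pairs of reports whose similarity crosses the threshold."""
    sigs = [shingles(r) for r in reports]
    return [(i, j)
            for i in range(len(sigs))
            for j in range(i + 1, len(sigs))
            if jaccard(sigs[i], sigs[j]) >= threshold]
```

Two rewordings of the same reentrancy finding would be paired up for a human judge to confirm, while an unrelated overflow report would pass through untouched. The threshold is the knob that trades false merges against missed duplicates.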
I'd like to see dedicated professional teams willing to excel at judging. When these teams grow influential enough, and work across multiple platforms in the space, they could position themselves to standardize judging criteria in this industry, and produce evaluation methodologies and scoring systems that become a standard in all platforms. They could even become the de-facto agnostic court for any dispute.
The standardization of judging criteria seems difficult, particularly if different judging groups don't share enough information, experience, or personal criteria. It feels early to ask for STANDARDS. But I'm of the idea that common criteria may emerge and become de-facto guidelines once some badass professional teams, fully dedicated to judging, become widely recognized and accepted.
In any case, now that I'm at it, let me take the whole idea a step further.
Because I also want to complement, rather than replace, money-based leaderboards that say little about the quality, context and importance of the issues reported.
Wouldn't a recognized triaging team be able to rate hunters and competitors, and build a transparent and credible reputation system across the ecosystem?
I mean, if they find ways to standardize the scoring and criteria of reported issues, then they could produce metrics to rate whoever is reporting them. Thinking of a platform-agnostic reputation score for security researchers (similar to the Elo-style ratings used in chess, online games, etc.). Too far-fetched?
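As a rough illustration of what an Elo-style researcher score could look like, here's a sketch using the standard Elo update formula. Everything beyond that formula is an assumption: in this hypothetical setup, each validated finding is treated as a "match" between a researcher and a difficulty rating that the triaging team assigns to the issue, and the outcome is 1.0 for a confirmed finding or 0.0 for a rejected one.

```python
# Hypothetical Elo-style reputation scoring for security researchers.
# The standard Elo formulas are real; applying them to bug reports is
# this article's speculation, not an existing platform feature.

def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of rating r_a against rating r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_rating(researcher: float, issue_difficulty: float,
                  outcome: float, k: float = 32.0) -> float:
    """Move the researcher's rating toward their actual outcome.

    outcome: 1.0 if the finding was validated, 0.0 if rejected
    (partial credit in between is possible, e.g. for downgraded severity).
    """
    expected = elo_expected(researcher, issue_difficulty)
    return researcher + k * (outcome - expected)

# A 1200-rated researcher lands a valid finding on a hard (1400-rated) issue:
# their rating rises more than it would for an easy one.
```

The appeal of an Elo-like scheme is that it is self-correcting and platform-agnostic: a rating only needs a stream of judged outcomes, which is exactly what a professional triaging team would produce.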
Closing the feedback loop, this scoring for hunters and competitors could be used by the same crowdsourced security platforms! They could host contests tailored for specific people based on their score. That would also allow projects to better allocate money, based on the rank of researchers they want to attract.
What do you think?