Friday, January 23, 2026

Are AI agents ready for the workplace? A new benchmark raises doubts.


It’s been almost two years since Microsoft CEO Satya Nadella predicted AI would replace knowledge work: the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT staff, and others.

But despite the enormous progress made by foundation models, the change in knowledge work has been slow to arrive. Models have mastered in-depth research and agentic planning, but for whatever reason, most white-collar work has been relatively unaffected.

It’s one of the biggest mysteries in AI, and thanks to new research from the training-data giant Mercor, we’re finally getting some answers.

The new research looks at how leading AI models hold up when doing actual white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called APEX-Agents, and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the model came back with a wrong answer or no answer at all.

According to Mercor CEO Brendan Foody, who worked on the paper, the models’ biggest stumbling point was tracking down information across multiple domains, something that’s integral to much of the knowledge work humans perform.

“One of the big changes in this benchmark is that we built out the entire environment, modeled after how real professional services work,” Foody told TechCrunch. “The way we do our jobs isn’t with one person giving us all the context in one place. In real life, you’re working across Slack and Google Drive and all these other tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit and miss.


The scenarios were all drawn from actual professionals on Mercor’s expert marketplace, who both laid out the queries and set the standard for a successful response. Looking through the questions, which are posted publicly on Hugging Face, gives a sense of how complex the tasks can get.


One question in the “Law” section reads:

During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor….Under Northstar’s own policies, can it reasonably treat the one or two log exports as consistent with Article 49?

The correct answer is yes, but getting there requires an in-depth analysis of the company’s own policies as well as the relevant EU privacy laws.

That would stump even a well-informed human, but the researchers were trying to model the work performed by professionals in the field. If an LLM can reliably answer these questions, it could effectively replace many of the lawyers working today. “I think this is probably the most important topic in the economy,” Foody told TechCrunch. “The benchmark is very reflective of the real work that these people do.”

OpenAI has also tried to measure professional skills with its GDPval benchmark, but the APEX-Agents test differs in important ways. Where GDPval tests general knowledge across a wide range of professions, the APEX-Agents benchmark measures a system’s ability to perform sustained tasks in a narrow set of high-value professions. The result is harder for models, but also more closely tied to whether those jobs can actually be automated.

While none of the models proved ready to take over as investment bankers, some were clearly closer to the mark. Gemini 3 Flash performed the best of the group with 24% one-shot accuracy, followed closely by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro, and GPT-5 all scored roughly 18%.

While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, it’s an open challenge for AI labs that believe they can do better, something Foody fully expects to see in the months to come.

“It’s improving really quickly,” he told TechCrunch. “Right now it’s fair to say it’s like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right 5 or 10% of the time. That kind of improvement year after year can have an impact so quickly.”

