IndustryBench exposes safety gaps in LLM industrial procurement—top model scores 2.083/3
A new benchmark for industrial procurement QA reveals that even top models struggle with standards compliance and introduce safety violations when they reason at greater length, reshuffling the leaderboard once source-grounded checks are applied.

IndustryBench, a 2,049-item benchmark for industrial procurement question-answering, exposes critical safety gaps in how large language models handle real-world industrial standards. Developed by researchers including Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, and Yuyang Sheng, the benchmark grounds evaluation in Chinese national standards (GB/T) and structured product records, spanning seven capability dimensions, ten industry categories, and three difficulty tiers. The dataset includes aligned English, Russian, and Vietnamese translations. During construction, the team rejected 70.3 percent of LLM-generated candidates at an external-verification stage—a stark reminder that industrial QA remains unreliable even after LLM-only filtering.
The evaluation separates raw correctness from safety violations. A Qwen3-Max judge, validated at κ_w = 0.798 against domain experts, scored correctness on a 0–3 rubric. Across 17 models tested in Chinese, and an eight-model subset evaluated in all four languages, the best system reached only 2.083 out of 3, leaving substantial headroom. Standards and terminology emerged as the most persistent weakness across all languages. More troubling: extended reasoning lowered safety-adjusted scores for 12 of 13 models by introducing unsupported safety-critical details into longer answers. In industrial procurement, partial correctness masks safety violations that standard benchmarks miss: an answer is useful only if recommended materials match operating conditions, every parameter respects regulated thresholds, and no procedure contradicts safety clauses.
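The reported judge agreement, κ_w = 0.798, is a weighted Cohen's kappa. As a minimal sketch, assuming quadratic weights over the 0–3 rubric (the paper's exact weighting scheme is not stated here) and hypothetical label data, the statistic can be computed directly from paired judge and expert scores:

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels=4):
    """Weighted Cohen's kappa with quadratic disagreement weights.

    rater_a, rater_b: equal-length lists of integer labels in
    [0, num_levels). Returns 1.0 for perfect agreement, ~0 for
    chance-level agreement.
    """
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Marginal totals, used for the chance-expected matrix.
    row = [sum(observed[i]) for i in range(num_levels)]
    col = [sum(observed[i][j] for i in range(num_levels))
           for j in range(num_levels)]
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic penalty: distant disagreements cost more.
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    return 1.0 - num / den

# Hypothetical judge-vs-expert labels on the 0-3 rubric (illustration only).
judge  = [3, 2, 3, 1, 0, 2, 3, 1]
expert = [3, 2, 2, 1, 0, 3, 3, 1]
print(round(quadratic_weighted_kappa(judge, expert), 3))  # 0.887
```

A κ_w near 0.8, as reported, indicates substantial agreement; since kappa discounts chance agreement from the marginal label distribution, it is a stricter check than raw accuracy between judge and experts.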
Safety-violation rates reshuffled the leaderboard dramatically. GPT-5.4 climbed from rank 6 to rank 3 after adjustment, while Kimi-k2.5-1T-A32B dropped seven positions. The researchers argue that industrial LLM evaluation requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. The benchmark, released with all prompts, scoring scripts, and dataset documentation, sets a foundation for future work—but the persistence of safety violations under extended reasoning suggests that industrial deployments may need to constrain generation length or mandate source-citation layers before any answer reaches a procurement decision.
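The reranking effect can be sketched with a toy leaderboard. Everything below is hypothetical: the model names, scores, and violation rates are invented, and the multiplicative penalty is an assumed adjustment, not the paper's actual formula, which is not specified here. The point is only that a model with a high raw score but a high violation rate can fall behind a safer, slightly less accurate one:

```python
# Hypothetical leaderboard: (name, raw correctness on 0-3, safety-violation rate).
models = [
    ("model_A", 2.05, 0.02),
    ("model_B", 2.00, 0.15),  # accurate but frequently unsafe
    ("model_C", 1.90, 0.01),  # slightly less accurate, rarely unsafe
]

def safety_adjusted(raw, violation_rate):
    # Assumed penalty: scale the raw score by the violation-free fraction.
    return raw * (1.0 - violation_rate)

raw_order = [name for name, raw, _ in
             sorted(models, key=lambda m: m[1], reverse=True)]
adj_order = [name for name, raw, v in
             sorted(models, key=lambda m: safety_adjusted(m[1], m[2]),
                    reverse=True)]

print(raw_order)  # ['model_A', 'model_B', 'model_C']
print(adj_order)  # ['model_A', 'model_C', 'model_B']
```

Under the assumed penalty, model_B's 0.15 violation rate pulls its adjusted score (1.70) below model_C's (1.88), swapping their ranks: the same mechanism the authors describe when safety adjustment moved GPT-5.4 up three positions and dropped Kimi-k2.5-1T-A32B by seven.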