Four open-weight models play One Night Werewolf; Gemma 4 31B wins as best liar
A LocalLLaMA user ran Gemma 4 and Qwen 3.6 models through a custom One Night Werewolf game to test deception and reasoning, finding Gemma 4 31B the strongest deceiver.
A practitioner built a custom llama.cpp UI that switches between models mid-conversation and used it to pit four open-weight models against each other in One Night Werewolf, the social deduction game where players accuse each other of being werewolves based on limited information. The setup assigned each model a role card (werewolf, seer, villager, or troublemaker) and ran them through a night phase, where each recorded private observations in its own markdown file, then a day phase, where they defended themselves and voted in a shared chat over 8–10 turns. The Qwen models had their thinking mode disabled to prevent them from reasoning aloud into the public channel.
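The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the practitioner's actual code: the `run_game` function, the `ask` callback, the player labels, and the prompt wording are all hypothetical, and the sketch leaves out real One Night Werewolf mechanics such as night actions and card swaps. It only shows the separation between private per-player markdown notes and a single shared day-phase transcript.

```python
import random
from collections import Counter
from pathlib import Path
from typing import Callable

ROLES = ["werewolf", "seer", "villager", "troublemaker"]

# Hypothetical stand-in for a model call: (player, prompt) -> reply text.
Ask = Callable[[str, str], str]


def run_game(players: list[str], ask: Ask, notes_dir: str,
             day_turns: int = 8, seed: int = 0):
    """Run one simplified game: role deal, night notes, day chat, vote."""
    if len(players) != len(ROLES):
        raise ValueError("this sketch expects exactly one player per role")

    # Deal one role card to each player (seeded for reproducibility).
    rng = random.Random(seed)
    roles = dict(zip(players, rng.sample(ROLES, len(ROLES))))

    notes = Path(notes_dir)
    notes.mkdir(parents=True, exist_ok=True)

    # Night phase: each player writes private notes to its own markdown file.
    for p in players:
        note = ask(p, f"Night. Your role is {roles[p]}. Record private notes.")
        (notes / f"{p}.md").write_text(f"# {p}\n\n{note}\n", encoding="utf-8")

    # Day phase: players take turns in one shared transcript. Each player
    # sees only its own notes plus the public chat so far.
    transcript: list[str] = []
    for turn in range(day_turns):
        p = players[turn % len(players)]
        private = (notes / f"{p}.md").read_text(encoding="utf-8")
        public = "\n".join(transcript)
        msg = ask(p, f"Day. Notes:\n{private}\nChat:\n{public}\nSpeak.")
        transcript.append(f"{p}: {msg}")

    # Vote: each player names one other player; invalid answers are dropped.
    votes = {}
    for p in players:
        choice = ask(p, "Vote: name the player you think is the werewolf.")
        choice = choice.strip()
        votes[p] = choice if choice in players and choice != p else None
    tally = Counter(v for v in votes.values() if v is not None)
    return roles, transcript, votes, tally


def scripted_ask(player: str, prompt: str) -> str:
    """Deterministic placeholder used instead of a real model."""
    if prompt.startswith("Vote"):
        return "qwen-35b" if player == "gemma-31b" else "gemma-31b"
    return f"{player} has nothing to hide."


# Hypothetical labels for the four players.
PLAYERS = ["gemma-31b", "gemma-26b", "qwen-35b", "qwen-27b"]
```

To try it, call `run_game(PLAYERS, scripted_ask, "notes")` and inspect the returned transcript and tally. In a real setup, `scripted_ask` would be replaced by a function that sends the prompt to whichever model is playing that seat.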
The test compared four quantized checkpoints: Gemma 4 at 31B and 26B, and Qwen 3.6 at 35B and 27B. Gemma 4 31B (Q4) emerged as the best liar and kept the clearest private notes, while Gemma 4 26B (Q5) ran fast but struggled with tool use. Qwen 3.6 35B (Q4) excelled at tool calls but misread its role as villager and played too aggressively, and Qwen 3.6 27B (Q5) was slow and reasoned poorly with thinking mode disabled.
