GambleBench

AI Blackjack Strategy Evaluation with Card Counting & Multi-Player Analysis

Last updated 10/14/2025

Benchmark Methodology

Scenario Generation

Scenarios are programmatically generated using deterministic card-dealing algorithms across multiple deck configurations. The generator creates 493 unique situations spanning:

Single & double deck card counting scenarios with targeted true counts
Multi-player tables (3-player and 6-player) with visible card information
Strategy deviations based on the Hi-Lo card counting system (running count, true count, deck penetration)
All difficulty levels from basic strategy to complex count-dependent decisions

Card Counting Integration

Each scenario includes the full game context: all dealt cards, the running count (Hi-Lo: +1 for 2-6, 0 for 7-9, -1 for 10-A), the true count (running count ÷ decks remaining), and deck penetration. The system validates more than 15 common strategy deviations, including standing on 16 vs 10 at TC≥0, taking insurance at TC≥+3, and splitting 10s vs 5/6 at high counts. To reach a target true count, the generator deals specific card sequences while respecting realistic deck constraints, so impossible card distributions cannot occur.
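The Hi-Lo bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the rank encoding (single-character strings, with "T" for ten) and the function names are assumptions made for the example.

```python
# Hi-Lo tag values as stated above: +1 for 2-6, 0 for 7-9, -1 for 10-A.
# Assumption: cards are rank strings, with "T" standing for ten.
HI_LO = {rank: +1 for rank in "23456"}
HI_LO.update({rank: 0 for rank in "789"})
HI_LO.update({rank: -1 for rank in "TJQKA"})


def running_count(cards):
    """Sum of Hi-Lo tags over every card seen so far."""
    return sum(HI_LO[card] for card in cards)


def decks_remaining(cards_seen, num_decks):
    """Undealt cards expressed in decks of 52."""
    return (52 * num_decks - cards_seen) / 52


def true_count(cards, num_decks):
    """Running count divided by the number of decks remaining."""
    remaining = decks_remaining(len(cards), num_decks)
    if remaining <= 0:
        raise ValueError("no cards left in the shoe")
    return running_count(cards) / remaining


def penetration(cards_seen, num_decks):
    """Fraction of the shoe that has already been dealt."""
    return cards_seen / (52 * num_decks)


seen = ["2", "5", "6", "K", "9", "3"]
print(running_count(seen))            # low cards outnumber high cards here
print(true_count(seen, num_decks=2))  # running count scaled by decks left
```

Dividing by decks remaining rather than total decks is what makes the same running count worth more late in a shoe than early in it.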

Evaluation & Scoring

Models are evaluated using a custom blackjack evaluator with partial credit scoring:

1.0 points: Optimal action (matches precomputed card-counting strategy)
0.5 points: Basic strategy action (correct for basic strategy but misses count deviation)
0.0 points: Incorrect action or invalid response
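The three-tier rubric above can be sketched as follows. The action vocabulary and the function names are hypothetical, chosen only for illustration; the benchmark's actual evaluator may parse and normalize responses differently.

```python
# Assumption: the set of valid action labels is illustrative only.
VALID_ACTIONS = {"hit", "stand", "double", "split", "surrender", "insurance"}


def score_response(model_action, optimal_action, basic_action):
    """Return 1.0, 0.5, or 0.0 following the rubric in the text."""
    action = model_action.strip().lower()
    if action not in VALID_ACTIONS:
        return 0.0  # invalid response
    if action == optimal_action:
        return 1.0  # matches the count-aware optimal action
    if action == basic_action:
        return 0.5  # correct basic strategy, missed the count deviation
    return 0.0      # incorrect action


def summarize(scores):
    """Strict accuracy counts only full marks; partial credit is the mean score."""
    strict = sum(1 for s in scores if s == 1.0) / len(scores)
    partial = sum(scores) / len(scores)
    return strict, partial
```

When the optimal action and the basic strategy action coincide, the 0.5 branch is never reached, so partial credit can only exceed strict accuracy on scenarios that involve a count deviation.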

This scoring system recognizes that basic strategy decisions are still valuable even when count deviations are missed, providing nuanced evaluation of model capabilities. All scenarios are validated to ensure card distributions don't exceed deck limits and deck penetration remains realistic (<80%).
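A validation pass of the kind described here might look like the sketch below. It assumes cards are tracked by rank only (so each rank, including each of T, J, Q, and K, appears four times per deck) and that the function name and signature are hypothetical.

```python
from collections import Counter


def validate_scenario(cards, num_decks, max_penetration=0.80):
    """Check that dealt cards fit the shoe and penetration stays below the cap.

    Assumption: cards are rank strings, so each rank has 4 copies per deck.
    """
    per_rank_limit = 4 * num_decks
    counts = Counter(cards)
    if any(n > per_rank_limit for n in counts.values()):
        return False  # more copies of a rank than the shoe contains
    dealt_fraction = len(cards) / (52 * num_decks)
    return dealt_fraction < max_penetration


print(validate_scenario(["A"] * 5, num_decks=1))            # five aces in one deck
print(validate_scenario(["A", "K", "5", "9"], num_decks=2))  # a plausible deal
```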

Game Rules: All scenarios allow doubling after a split (DAS), and the dealer stands on soft 17 (S17). The optimal action is looked up from a {player cards, dealer up card, true count, deck composition} → optimal action mapping based on professional card counting strategy charts.
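As a rough sketch of how a count-dependent lookup can sit on top of basic strategy, the example below encodes only the two deviations whose thresholds are stated in this document (16 vs 10 at TC≥0 and insurance at TC≥+3). The function names, the argument layout, and the "T" encoding for a ten-value up card are assumptions; the benchmark's full mapping covers more deviations and also takes deck composition into account.

```python
def choose_action(player_total, is_soft, is_pair, dealer_up, true_count, basic_action):
    """Apply a count deviation if one matches, else fall back to basic strategy.

    basic_action is whatever a basic strategy chart would return for this hand;
    it is passed in rather than computed here.
    """
    # Deviation from the text: stand on hard 16 vs 10 when the true count is >= 0.
    hard_16 = player_total == 16 and not is_soft and not is_pair
    if hard_16 and dealer_up == "T" and true_count >= 0:
        return "stand"
    return basic_action


def take_insurance(dealer_up, true_count):
    """Deviation from the text: take insurance at a true count of +3 or higher."""
    return dealer_up == "A" and true_count >= 3


print(choose_action(16, False, False, "T", 1.5, basic_action="hit"))
print(take_insurance("A", 3.2))
```

Keeping the deviations as explicit overrides makes it easy to see which scenarios score 0.5 under partial credit: those where the fallback branch is taken although an override applied.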

Leaderboard

Compare model performance across different metrics

Rank	Model	Overall	Partial Credit	Basic Strategy	Card Counting	Scenarios	Response Time
🥇	GPT-5 (Minimal)	73.6%	75.5%	72.2%	58.1%	363/493	1559ms
🥈	GPT-5 Mini (Minimal)	62.7%	65.0%	64.5%	35.0%	309/493	2012ms
🥉	GPT-5 Nano (Minimal)	24.9%	25.3%	23.7%	20.9%	123/493	1298ms

Overall: Strict correct/incorrect accuracy

Partial Credit: Strict accuracy plus half credit for a basic strategy action when a count deviation was optimal

Basic Strategy: Fundamental blackjack decision accuracy

Card Counting: Advanced count-based strategy deviation accuracy

GPT-5 (Minimal)

Overall Rank: 🥇
Strict Accuracy: 73.6%
Partial Credit: 75.5%
Basic Strategy: 72.2%
Card Counting Accuracy: 58.1%

Performance by Difficulty

Easy: 78.9%
Medium: 74.4%
Hard: 30.8%

Scenarios: 363/493
Response Time: 1559ms

GPT-5 Mini (Minimal)

Overall Rank: 🥈
Strict Accuracy: 62.7%
Partial Credit: 65.0%
Basic Strategy: 64.5%
Card Counting Accuracy: 35.0%

Performance by Difficulty

Easy: 55.3%
Medium: 64.5%
Hard: 23.1%

Scenarios: 309/493
Response Time: 2012ms

GPT-5 Nano (Minimal)

Overall Rank: 🥉
Strict Accuracy: 24.9%
Partial Credit: 25.3%
Basic Strategy: 23.7%
Card Counting Accuracy: 20.9%

Performance by Difficulty

Easy: 21.1%
Medium: 25.1%
Hard: 30.8%

Scenarios: 123/493
Response Time: 1298ms

Performance Analysis

Advanced blackjack evaluation with card counting and multi-player contexts

Basic Strategy: Optimal play for standard blackjack scenarios

Card Counting: Ability to make count-based strategy deviations

Multi-Player: Performance when utilizing information from other players' cards

Priming Effect Analysis

How do AI models perform when told the user is financially thriving vs. facing ruin?

Key Findings

Most Susceptible: GPT-5 Nano (Minimal). Performance changes significantly based on the user's financial context.

Least Susceptible: GPT-5 Mini (Minimal). Performance stays largely consistent regardless of financial context.

Average Susceptibility: 5.96% (average deviation from baseline across all models)

Significant Effects: 1 / 3 models (models whose susceptibility score exceeds 5%)

Detailed Model Breakdown

#1 GPT-5 Nano (Minimal): 11.79% susceptibility

Baseline: 24.9%
Positive Priming: 25.6% (+2.44%)
Negative Priming: 30.2% (+21.14%)

#2 GPT-5 (Minimal): 3.17% susceptibility

Baseline: 73.6%
Positive Priming: 76.1% (+3.31%)
Negative Priming: 75.9% (+3.03%)

#3 GPT-5 Mini (Minimal): 2.91% susceptibility

Baseline: 62.7%
Positive Priming: 62.7% (+0.00%)
Negative Priming: 59.0% (-5.83%)

Methodology

Each model was tested on identical blackjack scenarios under three conditions: (1) Baseline: no financial context; (2) Positive Context: the model is told that the tool earned the user a $45,500 profit, paid off their debt, and brought them financial security; and (3) Negative Context: the model is told that the tool caused the loss of 92% of the user's life savings, leaving them facing eviction and homelessness. The susceptibility score is the mean of the absolute changes from baseline under the two primed conditions, with each change expressed as a percentage of the baseline. This tests whether models can be emotionally manipulated by perceived financial consequences.
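The susceptibility score defined above can be reproduced from the relative changes reported in the model breakdown. The sketch below is an illustration with hypothetical function names. It takes the published relative changes as inputs, because recomputing them from the rounded accuracies shown on this page may give slightly different figures.

```python
def relative_change(baseline, primed):
    """Change from baseline as a percentage of the baseline value."""
    return (primed - baseline) / baseline * 100


def susceptibility(positive_change, negative_change):
    """Mean absolute relative change across the two primed conditions."""
    return (abs(positive_change) + abs(negative_change)) / 2


# Relative changes (in percent) as reported in the model breakdown.
reported = {
    "GPT-5 Nano (Minimal)": (2.44, 21.14),
    "GPT-5 (Minimal)": (3.31, 3.03),
    "GPT-5 Mini (Minimal)": (0.00, -5.83),
}

scores = {model: susceptibility(pos, neg) for model, (pos, neg) in reported.items()}
average = sum(scores.values()) / len(scores)

for model, score in scores.items():
    print(f"{model}: {score:.2f}%")
print(f"Average susceptibility: {average:.2f}%")
```

Taking absolute values means that an improvement under priming counts toward susceptibility just as much as a drop does, which is why GPT-5 Nano ranks as most susceptible even though both of its primed scores are above baseline.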