GambleBench Enhanced

Advanced AI Blackjack Strategy Evaluation with Card Counting & Multi-Player Analysis

Models Evaluated: 12
Enhanced Scenarios: 493
Card Counters: 11
Last Updated: 8/15/2025
Card counting scenarios • Multi-player context • Strategy deviation testing • Partial credit scoring

Enhanced Leaderboard

Rank | Model | Overall
🥇 | Claude Opus 4.1 | 85.6%
🥈 | Claude Opus 4 | 76.1%
🥉 | GPT-5 (minimal) | 65.1%
#4 | Claude Sonnet 4 | 62.5%
#5 | GPT-4.1 Mini | 58.2%
#6 | Claude 3.7 Sonnet | 54.2%
#7 | GPT-4.1 | 50.9%
#8 | GPT-5 Mini (minimal) | 50.5%
#9 | GPT-4.1 Nano | 34.9%
#10 | Claude 3.5 Haiku | 28.8%
#11 | GPT-5 Nano (minimal) | 20.3%
#12 | Claude 3 Haiku | 0.2%

Overall: Strict correct/incorrect accuracy

Partial Credit: Rewards basic strategy when count deviation was optimal

Basic Strategy: Fundamental blackjack decision accuracy

Card Counting: Advanced count-based strategy deviation accuracy
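
The strict Overall figures appear to be the share of scenarios answered exactly right: each model card on this page lists a Scenarios count (for example 422/493 for Claude Opus 4.1) that reproduces its Overall percentage. A minimal Python sketch under that reading; the function name `strict_accuracy` is illustrative and not part of GambleBench:

```python
def strict_accuracy(correct: int, total: int) -> float:
    """Percentage of scenarios answered exactly right, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * correct / total, 1)


# Scenario counts copied from the model cards on this page.
scenario_counts = {
    "Claude Opus 4.1": (422, 493),
    "Claude Opus 4": (375, 493),
    "GPT-5 (minimal)": (321, 493),
    "Claude 3 Haiku": (1, 493),
}

overall = {
    model: strict_accuracy(correct, total)
    for model, (correct, total) in scenario_counts.items()
}
```

Under this assumption the computed values agree with the leaderboard entries for the four models shown.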

Claude Opus 4.1

🥇 Overall Rank

85.6%

strict accuracy

PARTIAL CREDIT
87.1%
+1.5% boost
BASIC STRATEGY
84.2%
Expert

Card Counting Ability

Advanced Counter: 51.2%

Top Performing Scenarios

Multi-Player information: 86.1%
card counting basic: 81.3%

Performance by Difficulty

easy: 89.5%
medium: 86.7%
hard: 38.5%
Scenarios
422/493
Response Time
2132ms
Model Profile
Strategic Expert - Masters both basics and advanced counting

Claude Opus 4

🥈 Overall Rank

76.1%

strict accuracy

PARTIAL CREDIT
77.3%
+1.2% boost
BASIC STRATEGY
74.0%
Proficient

Card Counting Ability

Advanced Counter: 51.2%

Top Performing Scenarios

card counting basic: 79.2%
Multi-Player information: 75.7%

Performance by Difficulty

easy: 84.2%
medium: 75.8%
hard: 61.5%
Scenarios
375/493
Response Time
2085ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

GPT-5 (minimal)

🥉 Overall Rank

65.1%

strict accuracy

PARTIAL CREDIT
66.8%
+1.7% boost
BASIC STRATEGY
64.3%
Proficient

Card Counting Ability

Basic Counter: 48.8%

Top Performing Scenarios

card counting basic: 79.2%
Multi-Player information: 63.6%

Performance by Difficulty

easy: 89.5%
medium: 64.3%
hard: 23.1%
Scenarios
321/493
Response Time
1418ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

Claude Sonnet 4

#4 Overall Rank

62.5%

strict accuracy

PARTIAL CREDIT
63.8%
+1.3% boost
BASIC STRATEGY
61.3%
Proficient

Card Counting Ability

Basic Counter: 44.2%

Top Performing Scenarios

Multi-Player information: 63.1%
card counting basic: 56.3%

Performance by Difficulty

easy: 63.2%
medium: 62.9%
hard: 46.2%
Scenarios
308/493
Response Time
1524ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

GPT-4.1 Mini

#5 Overall Rank

58.2%

strict accuracy

PARTIAL CREDIT
59.4%
+1.2% boost
BASIC STRATEGY
55.6%
Learning

Card Counting Ability

Advanced Counter: 58.1%

Top Performing Scenarios

card counting basic: 62.5%
Multi-Player information: 57.8%

Performance by Difficulty

easy: 60.5%
medium: 57.5%
hard: 76.9%
Scenarios
287/493
Response Time
754ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

Claude 3.7 Sonnet

#6 Overall Rank

54.2%

strict accuracy

PARTIAL CREDIT
55.9%
+1.7% boost
BASIC STRATEGY
52.3%
Learning

Card Counting Ability

Advanced Counter: 60.5%

Top Performing Scenarios

card counting basic: 62.5%
Multi-Player information: 53.3%

Performance by Difficulty

easy: 65.8%
medium: 53.6%
hard: 38.5%
Scenarios
267/493
Response Time
5211ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

GPT-4.1

#7 Overall Rank

50.9%

strict accuracy

PARTIAL CREDIT
52.4%
+1.5% boost
BASIC STRATEGY
49.9%
Learning

Card Counting Ability

Basic Counter: 46.5%

Top Performing Scenarios

card counting basic: 60.4%
Multi-Player information: 49.9%

Performance by Difficulty

easy: 68.4%
medium: 50.0%
hard: 30.8%
Scenarios
251/493
Response Time
913ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

GPT-5 Mini (minimal)

#8 Overall Rank

50.5%

strict accuracy

PARTIAL CREDIT
52.4%
+1.9% boost
BASIC STRATEGY
51.7%
Learning

Card Counting Ability

Basic Counter: 30.2%

Top Performing Scenarios

Multi-Player information: 51.5%
card counting basic: 41.7%

Performance by Difficulty

easy: 39.5%
medium: 51.6%
hard: 46.2%
Scenarios
249/493
Response Time
1206ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

GPT-4.1 Nano

#9 Overall Rank

34.9%

strict accuracy

PARTIAL CREDIT
36.5%
+1.6% boost
BASIC STRATEGY
35.9%
Novice

Card Counting Ability

Basic Counter: 25.6%

Top Performing Scenarios

Multi-Player information: 35.7%
card counting basic: 27.1%

Performance by Difficulty

easy: 23.7%
medium: 35.7%
hard: 38.5%
Scenarios
172/493
Response Time
728ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

Claude 3.5 Haiku

#10 Overall Rank

28.8%

strict accuracy

PARTIAL CREDIT
30.0%
+1.2% boost
BASIC STRATEGY
27.8%
Novice

Card Counting Ability

Basic Counter: 39.5%

Top Performing Scenarios

card counting basic: 29.2%
Multi-Player information: 28.8%

Performance by Difficulty

easy: 31.6%
medium: 27.8%
hard: 53.8%
Scenarios
142/493
Response Time
1849ms
Model Profile
Intuitive Player - Shows counting potential but inconsistent

GPT-5 Nano (minimal)

#11 Overall Rank

20.3%

strict accuracy

PARTIAL CREDIT
20.9%
+0.6% boost
BASIC STRATEGY
19.9%
Novice

Card Counting Ability

Learning to Count: 18.6%

Top Performing Scenarios

card counting basic: 31.3%
Multi-Player information: 19.1%

Performance by Difficulty

easy: 28.9%
medium: 19.2%
hard: 30.8%
Scenarios
100/493
Response Time
1127ms
Model Profile
Developing Player - Learning the fundamentals

Claude 3 Haiku

#12 Overall Rank

0.2%

strict accuracy

PARTIAL CREDIT
0.2%
+0.0% boost
BASIC STRATEGY
0.2%
Novice

Card Counting Ability

No Counting: N/A

Top Performing Scenarios

Multi-Player information: 0.2%
card counting basic: 0.0%

Performance by Difficulty

easy: 0.0%
medium: 0.2%
hard: 0.0%
Scenarios
1/493
Response Time
2686ms
Model Profile
Developing Player - Learning the fundamentals

Performance Analysis

Enhanced Scenarios: Advanced blackjack evaluation with card counting and multi-player contexts

Card Counting: Ability to make count-based strategy deviations

Multi-Player: Performance when utilizing information from other players' cards

Response Time vs Accuracy Analysis

Explore the trade-offs between decision speed and accuracy across models

Avg Response Time
1803ms
Most Efficient
GPT-4.1 Mini
Pareto Optimal
5
Avg Efficiency
0.4

Performance Analysis

Scatter Plot: Each point represents a model plotted by accuracy vs response time

Pareto Optimal: Highlighted models achieve best accuracy for their response time
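
One way to read "best accuracy for their response time" is the usual Pareto rule: a model is kept if no other model is both at least as fast and at least as accurate. The Python sketch below applies that rule to the response times and Overall scores from the model cards. The `Point` type and `pareto_front` function are illustrative names, and the dashboard's exact rule is not stated, so treat this as an assumption.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    response_ms: int
    accuracy: float


def pareto_front(points):
    """Keep each point that no other point beats on both speed and accuracy."""
    front = []
    best_accuracy = float("-inf")
    # Walk from fastest to slowest; a point survives only if it is more
    # accurate than every faster (or equally fast) point seen so far.
    for p in sorted(points, key=lambda p: (p.response_ms, -p.accuracy)):
        if p.accuracy > best_accuracy:
            front.append(p)
            best_accuracy = p.accuracy
    return front


# Response time (ms) and strict Overall accuracy (%) from the model cards.
points = [
    Point("Claude Opus 4.1", 2132, 85.6),
    Point("Claude Opus 4", 2085, 76.1),
    Point("GPT-5 (minimal)", 1418, 65.1),
    Point("Claude Sonnet 4", 1524, 62.5),
    Point("GPT-4.1 Mini", 754, 58.2),
    Point("Claude 3.7 Sonnet", 5211, 54.2),
    Point("GPT-4.1", 913, 50.9),
    Point("GPT-5 Mini (minimal)", 1206, 50.5),
    Point("GPT-4.1 Nano", 728, 34.9),
    Point("Claude 3.5 Haiku", 1849, 28.8),
    Point("GPT-5 Nano (minimal)", 1127, 20.3),
    Point("Claude 3 Haiku", 2686, 0.2),
]

front_models = [p.model for p in pareto_front(points)]
```

With these inputs the sketch keeps five models (GPT-4.1 Nano, GPT-4.1 Mini, GPT-5 (minimal), Claude Opus 4, Claude Opus 4.1), which agrees with the Pareto Optimal count of 5 reported above.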


Model Categories

Fast & Accurate

2 models
GPT-5 (minimal)
Claude Sonnet 4

Slow & Accurate

2 models
Claude Opus 4.1
Claude Opus 4

Fast & Inaccurate

6 models
GPT-4.1 Mini
GPT-4.1
GPT-4.1 Nano
+3 more

Slow & Inaccurate

2 models
Claude 3.7 Sonnet
Claude 3 Haiku

Model Consistency Analysis

Evaluate how consistently models perform across different scenario types and difficulty levels

Average Consistency
77.4%
Most Consistent
Claude Opus 4
Generalists
2
Specialists
0

Consistency Analysis

Model Categories

Generalist

2 models • 87.4% avg consistency
Claude Opus 4 (92.6%)
Claude Opus 4.1 (82.2%)

Developing

9 models • 83.7% avg consistency
GPT-4.1 Mini (91.5%)
Claude Sonnet 4 (90.2%)
GPT-5 Mini (minimal) (89.3%)
+6 more

Inconsistent

1 model • 0.0% avg consistency
Claude 3 Haiku (0.0%)

Difficulty Scaling Analysis

How models handle increasing complexity and maintain consistency across difficulty levels

Avg Performance Drop
18.5%
Most Resilient
Claude 3 Haiku
Most Consistent
GPT-5 Mini (minimal)
Stable Models
6

Performance Analysis

Scaling Curves: Performance across easy, medium, and hard difficulty levels

Scaling Behavior Categories

Stable

6 models • 0.0% avg drop
GPT-5 Mini (minimal) (0.0%)
GPT-4.1 Mini (0.0%)
GPT-5 Nano (minimal) (0.0%)
+3 more

Linear Decline

4 models • 38.5% avg drop
Claude Opus 4 (22.7%)
Claude 3.7 Sonnet (27.3%)
GPT-4.1 (37.7%)
+1 more

Cliff Drop

2 models • 34.0% avg drop
Claude Sonnet 4 (17.0%)
Claude Opus 4.1 (51.0%)

Detailed Scaling Analysis

Model | Easy | Medium | Hard | Overall Drop | Resilience | Category
GPT-5 Mini (minimal) | 39.5% | 51.6% | 46.2% | 0.0% | 100.0% | Stable
GPT-4.1 Mini | 60.5% | 57.5% | 76.9% | 0.0% | 100.0% | Stable
GPT-5 Nano (minimal) | 28.9% | 19.2% | 30.8% | 0.0% | 100.0% | Stable
GPT-4.1 Nano | 23.7% | 35.7% | 38.5% | 0.0% | 100.0% | Stable
Claude 3.5 Haiku | 31.6% | 27.8% | 53.8% | 0.0% | 100.0% | Stable
Claude 3 Haiku | 0.0% | 0.2% | 0.0% | 0.0% | 100.0% | Stable
Claude Sonnet 4 | 63.2% | 62.9% | 46.2% | 17.0% | 66.0% | Cliff Drop
Claude Opus 4 | 84.2% | 75.8% | 61.5% | 22.7% | 54.7% | Linear Decline
Claude 3.7 Sonnet | 65.8% | 53.6% | 38.5% | 27.3% | 45.3% | Linear Decline
GPT-4.1 | 68.4% | 50.0% | 30.8% | 37.7% | 24.7% | Linear Decline
Claude Opus 4.1 | 89.5% | 86.7% | 38.5% | 51.0% | 0.0% | Cliff Drop
GPT-5 (minimal) | 89.5% | 64.3% | 23.1% | 66.4% | 0.0% | Linear Decline
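
The page does not define Overall Drop or Resilience, but the table values are consistent with a simple reading: drop is easy accuracy minus hard accuracy, floored at zero, and resilience is 100 minus twice the drop, also floored at zero. The Python sketch below encodes that inference. Both formulas and both function names are assumptions drawn from the published numbers, not a documented GambleBench method.

```python
def overall_drop(easy: float, hard: float) -> float:
    """Assumed rule: percentage points lost from easy to hard, never negative."""
    return round(max(0.0, easy - hard), 1)


def resilience(drop: float) -> float:
    """Assumed rule: 100 minus twice the drop, never negative."""
    return round(max(0.0, 100.0 - 2.0 * drop), 1)


# (easy, hard) accuracies copied from the scaling table.
examples = {
    "Claude Sonnet 4": (63.2, 46.2),
    "Claude Opus 4.1": (89.5, 38.5),
    "GPT-5 Mini (minimal)": (39.5, 46.2),
}

scaling = {
    model: (overall_drop(easy, hard), resilience(overall_drop(easy, hard)))
    for model, (easy, hard) in examples.items()
}
```

Because the inputs here are the already rounded table values, a few rows only match to within 0.1 (Claude Opus 4 gives 54.6 against the listed 54.7), which would be expected if the dashboard computes from unrounded accuracies. A model that scores higher on hard than on easy is reported with a 0.0% drop under this reading, which explains why six models, including Claude 3 Haiku, are labelled Stable.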

Card Counting Performance Analysis

Advanced evaluation of LLM card counting and strategic deviation capabilities

Models with Counting Ability
11/12
Best Card Counter
Claude 3.7 Sonnet
Avg Counting Accuracy
43.1%

Basic Strategy vs Card Counting

Model Capability Categories

Card Counter: High basic strategy + counting skills
Basic Strategy Master: Solid fundamentals, no counting
Inconsistent Counter: Some counting, weak basics
Basic Learner: Learning fundamentals

Multi-Dimensional Capability Analysis

Advanced Scenario Performance

Detailed Model Analysis

Model | Category | Basic Strategy | Count Deviations | Improvement Potential
Claude Opus 4.1 | Card Counter | 84.2% | 51.2% | Low - Well-rounded
Claude Opus 4 | Inconsistent Counter | 74.0% | 51.2% | Medium - Strengthen basics
GPT-5 (minimal) | Inconsistent Counter | 64.3% | 48.8% | Medium - Strengthen basics
Claude Sonnet 4 | Inconsistent Counter | 61.3% | 44.2% | Medium - Strengthen basics
GPT-4.1 Mini | Inconsistent Counter | 55.6% | 58.1% | Medium - Strengthen basics
Claude 3.7 Sonnet | Inconsistent Counter | 52.3% | 60.5% | Medium - Strengthen basics
GPT-4.1 | Inconsistent Counter | 49.9% | 46.5% | Medium - Strengthen basics
GPT-5 Mini (minimal) | Inconsistent Counter | 51.7% | 30.2% | Medium - Strengthen basics
GPT-4.1 Nano | Inconsistent Counter | 35.9% | 25.6% | Medium - Strengthen basics
Claude 3.5 Haiku | Inconsistent Counter | 27.8% | 39.5% | Medium - Strengthen basics
GPT-5 Nano (minimal) | Inconsistent Counter | 19.9% | 18.6% | Medium - Strengthen basics
Claude 3 Haiku | Basic Learner | 0.2% | N/A | High - Start with basics
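
The summary figures for this section (11/12 models with counting ability, 43.1% average counting accuracy, Claude 3.7 Sonnet as best counter) can be reproduced from the Count Deviations column if the N/A entry is left out of the average. A short Python sketch under that assumption; the variable names are illustrative:

```python
# Count Deviations column from the table above; None stands for N/A.
count_deviation = {
    "Claude Opus 4.1": 51.2,
    "Claude Opus 4": 51.2,
    "GPT-5 (minimal)": 48.8,
    "Claude Sonnet 4": 44.2,
    "GPT-4.1 Mini": 58.1,
    "Claude 3.7 Sonnet": 60.5,
    "GPT-4.1": 46.5,
    "GPT-5 Mini (minimal)": 30.2,
    "GPT-4.1 Nano": 25.6,
    "Claude 3.5 Haiku": 39.5,
    "GPT-5 Nano (minimal)": 18.6,
    "Claude 3 Haiku": None,
}

# Only models with a numeric score count as having counting ability.
scored = {m: v for m, v in count_deviation.items() if v is not None}

models_with_counting = len(scored)
avg_counting_accuracy = round(sum(scored.values()) / len(scored), 1)
best_counter = max(scored, key=scored.get)
```

Excluding rather than zero-filling the N/A row matters here: counting Claude 3 Haiku as 0% would pull the average down to about 39.5%.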

Multi-Player Analysis

How well do LLMs utilize information from multiple players' cards and decisions?

Average Performance Drop
4.7%
Best Multi-Player Model
Claude Opus 4.1
Most Stable Performance
Claude Opus 4.1

Single vs Multi-Player Performance

Information Utilization Efficiency

Performance Degradation by Complexity

Stable (<5% drop)
Moderate Drop (5-15%)
Significant Drop (>15%)

Multi-Player Performance Breakdown

Model | Single Player | Multi Player | Performance Drop | Info Utilization | Complexity Handling
Claude Opus 4.1 | 81.3% | 86.1% | 0.0% | 100.0% | 106%
Claude Opus 4 | 79.2% | 75.7% | 3.4% | 100.0% | 96%
GPT-5 (minimal) | 79.2% | 63.6% | 15.6% | 100.0% | 80%
Claude Sonnet 4 | 56.3% | 63.1% | 0.0% | 100.0% | 112%
GPT-4.1 Mini | 62.5% | 57.8% | 4.7% | 100.0% | 92%
Claude 3.7 Sonnet | 62.5% | 53.3% | 9.2% | 100.0% | 85%
GPT-5 Mini (minimal) | 41.7% | 51.5% | 0.0% | 100.0% | 124%
GPT-4.1 | 60.4% | 49.9% | 10.5% | 100.0% | 83%
GPT-4.1 Nano | 27.1% | 35.7% | 0.0% | 100.0% | 132%
Claude 3.5 Haiku | 29.2% | 28.8% | 0.4% | 100.0% | 99%
GPT-5 Nano (minimal) | 31.3% | 19.1% | 12.1% | 97.8% | 61%
Claude 3 Haiku | 0.0% | 0.2% | 0.0% | 22.5% | 22%

Info Utilization: How effectively the model uses visible cards from other players

Complexity Handling: Ratio of multi-player to single-player performance
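
The Complexity Handling definition above, together with the degradation bands listed earlier in this section (Stable, Moderate Drop, Significant Drop), can be sketched in Python as follows. The ratio follows the stated definition. The drop formula (single minus multi, floored at zero) is an inference from the table, and the function names are illustrative.

```python
from typing import Optional


def complexity_handling(single: float, multi: float) -> Optional[int]:
    """Multi-player accuracy as a whole-number percentage of single-player accuracy."""
    if single == 0:
        return None  # ratio is undefined when single-player accuracy is zero
    return round(100 * multi / single)


def performance_drop(single: float, multi: float) -> float:
    """Assumed rule: percentage points lost in multi-player play, never negative."""
    return round(max(0.0, single - multi), 1)


def degradation_band(drop: float) -> str:
    """Bands as listed on the page: <5 stable, 5-15 moderate, >15 significant."""
    if drop < 5:
        return "Stable"
    if drop <= 15:
        return "Moderate Drop"
    return "Significant Drop"


# (single-player, multi-player) accuracies from the breakdown table.
rows = {
    "Claude Opus 4.1": (81.3, 86.1),
    "GPT-5 (minimal)": (79.2, 63.6),
    "Claude 3.7 Sonnet": (62.5, 53.3),
}

summary = {
    model: (
        complexity_handling(single, multi),
        performance_drop(single, multi),
        degradation_band(performance_drop(single, multi)),
    )
    for model, (single, multi) in rows.items()
}
```

A Complexity Handling value above 100% means the model scored higher with other players' cards visible than without, which is why such rows show a 0.0% drop.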

Partial Credit Analysis

Understanding the spectrum of decision quality beyond simple right/wrong

Average Credit Gap
1.3%
Biggest Beneficiary
GPT-5 Mini (minimal)
Most Consistent
Claude 3 Haiku

Overall vs Partial Credit Score

Average Decision Quality Distribution

Perfect: Optimal decision (considering count)
Good: Basic strategy when deviation was optimal
Poor: Wrong action entirely
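
The three quality levels above can be sketched as a small grading function in Python. The page does not say how much credit a Good decision earns, so `PARTIAL_WEIGHT = 0.5` is a placeholder chosen purely for illustration, and all names here are hypothetical rather than taken from GambleBench.

```python
from enum import Enum


class Quality(Enum):
    PERFECT = "perfect"  # optimal decision, taking the count into account
    GOOD = "good"        # basic-strategy play when a count deviation was optimal
    POOR = "poor"        # wrong action entirely


PARTIAL_WEIGHT = 0.5  # hypothetical credit for a GOOD decision; not from the source


def grade(action: str, optimal: str, basic_strategy: str) -> Quality:
    """Classify one decision against the count-aware and basic-strategy plays."""
    if action == optimal:
        return Quality.PERFECT
    if action == basic_strategy:
        return Quality.GOOD
    return Quality.POOR


def strict_score(grades) -> float:
    """Percentage of decisions graded PERFECT."""
    return round(100 * sum(g is Quality.PERFECT for g in grades) / len(grades), 1)


def partial_credit_score(grades, weight: float = PARTIAL_WEIGHT) -> float:
    """Strict score plus weighted credit for GOOD decisions."""
    credit = sum(
        1.0 if g is Quality.PERFECT else weight if g is Quality.GOOD else 0.0
        for g in grades
    )
    return round(100 * credit / len(grades), 1)


# Illustrative decisions as (action, optimal, basic_strategy) triples.
decisions = [
    ("STAND", "STAND", "HIT"),   # took the count deviation
    ("HIT", "STAND", "HIT"),     # fell back to basic strategy
    ("DOUBLE", "STAND", "HIT"),  # neither play
    ("HIT", "HIT", "HIT"),       # no deviation needed
]
grades = [grade(*d) for d in decisions]
```

With these four sample decisions the strict score is 50.0 and the partial credit score is 62.5 under the placeholder weight. Any other weight would change the gap between the two.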

Decision Quality Spectrum by Model

Partial Credit Benefit Analysis

Shows how much the partial credit system improves each model's score beyond strict accuracy.

Learning Opportunity Analysis

Detailed Partial Credit Breakdown

Model | Overall | Partial Credit | Credit Benefit | Basic Strategy | Learning Potential
Claude 3 Haiku | 0.2% | 0.2% | +0.0% | 0.2% | Optimized
GPT-5 Nano (minimal) | 20.3% | 20.9% | +0.6% | 19.9% | Optimized
GPT-4.1 Mini | 58.2% | 59.4% | +1.2% | 55.6% | Optimized
Claude 3.5 Haiku | 28.8% | 30.0% | +1.2% | 27.8% | Optimized
Claude Opus 4 | 76.1% | 77.3% | +1.2% | 74.0% | Optimized
Claude Sonnet 4 | 62.5% | 63.8% | +1.3% | 61.3% | Optimized
GPT-4.1 | 50.9% | 52.4% | +1.5% | 49.9% | Optimized
Claude Opus 4.1 | 85.6% | 87.1% | +1.5% | 84.2% | Optimized
GPT-4.1 Nano | 34.9% | 36.5% | +1.6% | 35.9% | Optimized
Claude 3.7 Sonnet | 54.2% | 55.9% | +1.7% | 52.3% | Optimized
GPT-5 (minimal) | 65.1% | 66.8% | +1.7% | 64.3% | Optimized
GPT-5 Mini (minimal) | 50.5% | 52.4% | +1.9% | 51.7% | Optimized

Credit Benefit: How much the partial credit system improves the model's score

Learning Potential: Gap between basic strategy knowledge and current optimal performance
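
Credit Benefit, as defined above, appears to be the partial credit score minus the strict Overall score in percentage points. The Python sketch below recomputes it for every model and derives the section's summary figures from the result. The helper name `credit_benefit` is illustrative.

```python
def credit_benefit(overall: float, partial: float) -> float:
    """Percentage points gained when partial credit is awarded."""
    return round(partial - overall, 1)


# (Overall, Partial Credit) pairs from the breakdown table.
scores = {
    "Claude 3 Haiku": (0.2, 0.2),
    "GPT-5 Nano (minimal)": (20.3, 20.9),
    "GPT-4.1 Mini": (58.2, 59.4),
    "Claude 3.5 Haiku": (28.8, 30.0),
    "Claude Opus 4": (76.1, 77.3),
    "Claude Sonnet 4": (62.5, 63.8),
    "GPT-4.1": (50.9, 52.4),
    "Claude Opus 4.1": (85.6, 87.1),
    "GPT-4.1 Nano": (34.9, 36.5),
    "Claude 3.7 Sonnet": (54.2, 55.9),
    "GPT-5 (minimal)": (65.1, 66.8),
    "GPT-5 Mini (minimal)": (50.5, 52.4),
}

benefits = {model: credit_benefit(o, p) for model, (o, p) in scores.items()}

average_credit_gap = round(sum(benefits.values()) / len(benefits), 1)
biggest_beneficiary = max(benefits, key=benefits.get)
```

Run on the table values, this gives an average gap of 1.3 and names GPT-5 Mini (minimal) as the biggest beneficiary, in line with the summary tiles for this section.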

Common Mistakes Analysis

Most Frequent Mistakes

Action Confusion Distribution

HIT → STAND: 116
STAND → HIT: 95
STAND → DOUBLE: 46
HIT → DOUBLE: 14

Mistake Statistics

Unique Mistakes: 57
Total Occurrences: 282

Most Problematic Scenarios

Card Counting: Player 16 vs Dealer 10 (T...
6/12
Card Counting: Player 13 vs Dealer 3 (TC...
6/12
Card Counting: Player 13 vs Dealer 2 (TC...
5/12

Severity levels: Minor (1), Major (2), Severe (3+)

Mistake Frequency Analysis

Deep dive into mistake patterns, frequencies, and severity across models

Total Mistake Types
120
Total Frequency
282
Avg Severity
1.15/3
Most Common
HIT → STAND

Mistake Analysis

Bubble Chart: Each bubble represents a mistake type (correct → actual action)

Size: Bubble size represents total frequency of the mistake

Correct (0)
Minor (1)
Major (2)
Severe (3)

Most Problematic Mistake Types

HIT → STAND
Severity: 1
Frequency: 124
Models affected: 9/12
Prevalence: 75.0%
Claude Opus 4.1, Claude Opus 4 +7 more
STAND → HIT
Severity: 1
Frequency: 95
Models affected: 8/12
Prevalence: 66.7%
Claude Opus 4.1, Claude Opus 4 +6 more
STAND → DOUBLE
Severity: 2
Frequency: 46
Models affected: 5/12
Prevalence: 41.7%
Claude Opus 4, GPT-5 (minimal) +3 more
HIT → DOUBLE
Severity: 1
Frequency: 17
Models affected: 3/12
Prevalence: 25.0%
Claude 3.5 Haiku, GPT-5 Nano (minimal) +1 more
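
The Prevalence figures above match the share of the 12 evaluated models that made each mistake at least once. A minimal Python sketch of that arithmetic, assuming this is how the dashboard derives it; the function name is illustrative:

```python
TOTAL_MODELS = 12  # models evaluated, as stated at the top of the page


def prevalence(models_affected: int, total_models: int = TOTAL_MODELS) -> float:
    """Percentage of evaluated models that made a given mistake at least once."""
    return round(100 * models_affected / total_models, 1)


# "Models affected" counts from the mistake types listed above.
affected = {
    "HIT → STAND": 9,
    "STAND → HIT": 8,
    "STAND → DOUBLE": 5,
    "HIT → DOUBLE": 3,
}

prevalence_by_mistake = {name: prevalence(n) for name, n in affected.items()}
```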

Common Mistakes Heatmap

Visualize mistake patterns by player hand vs dealer up-card combinations

Total Mistakes
116
Total Frequency
271
Problem Combinations
21

Mistake Frequency by Hand Combination

Heatmap axes: player hand total (rows) vs dealer up-card (columns, A and 2 through 10). Shading scale for mistake frequency: None, Low, Medium, High. Cells with recorded mistakes:

Player | Dealer | Frequency | Severity | Models
12 | 2 | 6 | 1/3 | 3
12 | 3 | 23 | 2/3 | 11
12 | 4 | 15 | 2/3 | 5
12 | 5 | 22 | 2/3 | 9
12 | 6 | 26 | 2/3 | 12
12 | 10 | 4 | 1/3 | 2
13 | 2 | 24 | 2/3 | 10
13 | 3 | 32 | 1/3 | 15
13 | 4 | 2 | 1/3 | 1
13 | 5 | 4 | 1/3 | 2
13 | 6 | 15 | 2/3 | 5
13 | 7 | 2 | 1/3 | 1
13 | 8 | 2 | 1/3 | 1
13 | 9 | 2 | 1/3 | 1
15 | 3 | 4 | 1/3 | 2
15 | 5 | 14 | 2/3 | 5
15 | 10 | 36 | 1/3 | 18
16 | 3 | 6 | 1/3 | 2
16 | 5 | 2 | 1/3 | 1
16 | 8 | 2 | 1/3 | 1
16 | 10 | 28 | 1/3 | 9

Most Problematic Combinations

Player 15 vs Dealer 10
Mistakes: 36
Models affected: 18
Severity: 1/3
Common mistakes:
HIT → STAND (Claude Opus 4.1)
HIT → STAND (Claude Opus 4.1)
HIT → STAND (Claude Opus 4.1)
Player 13 vs Dealer 3
Mistakes: 32
Models affected: 15
Severity: 1/3
Common mistakes:
HIT → STAND (Claude Opus 4.1)
HIT → STAND (Claude Opus 4.1)
HIT → STAND (Claude Opus 4.1)
Player 16 vs Dealer 10
Mistakes: 28
Models affected: 9
Severity: 1/3
Common mistakes:
STAND → HIT (Claude Opus 4)
HIT → STAND (Claude Sonnet 4)
HIT → STAND (Claude Sonnet 4)
Player 12 vs Dealer 6
Mistakes: 26
Models affected: 12
Severity: 2/3
Common mistakes:
STAND → DOUBLE (Claude Opus 4)
STAND → DOUBLE (GPT-5 (minimal))
STAND → DOUBLE (GPT-5 (minimal))
Player 13 vs Dealer 2
Mistakes: 24
Models affected: 10
Severity: 2/3
Common mistakes:
STAND → HIT (Claude Opus 4.1)
STAND → HIT (Claude Sonnet 4)
STAND → HIT (GPT-4.1 Mini)
Player 12 vs Dealer 3
Mistakes: 23
Models affected: 11
Severity: 2/3
Common mistakes:
STAND → HIT (Claude Opus 4.1)
STAND → HIT (Claude Opus 4)
STAND → DOUBLE (Claude Opus 4)

Advanced AI Analysis Platform

🎯 Strategic Analysis

Card counting evaluation, multi-player scenarios, and partial credit scoring system

⚡ Performance Insights

Speed vs accuracy analysis, consistency scoring, and difficulty scaling patterns

🔍 Mistake Analysis

Interactive heatmaps, frequency distributions, and pattern recognition for errors

📊 Rich Visualizations

Comprehensive charts, bubble plots, radar analyses, and actionable insights
