Advanced AI Blackjack Strategy Evaluation with Card Counting & Multi-Player Analysis
Rank | Model | Overall |
---|---|---|
🥇 | Claude Opus 4.1 | 85.6% |
🥈 | Claude Opus 4 | 76.1% |
🥉 | GPT-5 (minimal) | 65.1% |
#4 | Claude Sonnet 4 | 62.5% |
#5 | GPT-4.1 Mini | 58.2% |
#6 | Claude 3.7 Sonnet | 54.2% |
#7 | GPT-4.1 | 50.9% |
#8 | GPT-5 Mini (minimal) | 50.5% |
#9 | GPT-4.1 Nano | 34.9% |
#10 | Claude 3.5 Haiku | 28.8% |
#11 | GPT-5 Nano (minimal) | 20.3% |
#12 | Claude 3 Haiku | 0.2% |
Overall: Strict correct/incorrect accuracy
Partial Credit: Awards partial credit for the basic-strategy action in scenarios where a count-based deviation was the optimal play
Basic Strategy: Fundamental blackjack decision accuracy
Card Counting: Advanced count-based strategy deviation accuracy
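The difference between the strict and partial-credit metrics can be sketched for a single decision. This is only an illustration: the page does not state the partial-credit weight, so `PARTIAL_WEIGHT`, the `Scenario` type, and both function names are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumption: the page does not state how much partial credit is awarded.
PARTIAL_WEIGHT = 0.5


@dataclass(frozen=True)
class Scenario:
    basic_action: str    # what basic strategy says, ignoring the count
    optimal_action: str  # best play given the count (may equal basic_action)


def strict_score(scenario: Scenario, chosen: str) -> float:
    """1.0 only when the chosen action is the optimal one."""
    return 1.0 if chosen == scenario.optimal_action else 0.0


def partial_score(scenario: Scenario, chosen: str) -> float:
    """Like strict_score, but the basic-strategy play earns PARTIAL_WEIGHT
    when the count called for a deviation."""
    if chosen == scenario.optimal_action:
        return 1.0
    if chosen == scenario.basic_action:
        return PARTIAL_WEIGHT
    return 0.0


# Hypothetical scenario: hard 16 vs dealer 10, where a high enough count
# turns the basic-strategy hit into a stand.
s = Scenario(basic_action="hit", optimal_action="stand")
print(strict_score(s, "hit"), partial_score(s, "hit"))  # 0.0 0.5
```

Under this reading, a model that knows basic strategy but ignores the count scores zero on the strict metric for deviation scenarios while still earning something under partial credit.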
Enhanced Scenarios: Advanced blackjack evaluation with card counting and multi-player contexts
Card Counting: Ability to make count-based strategy deviations
Multi-Player: Performance when utilizing information from other players' cards
Explore the trade-offs between decision speed and accuracy across models
Scatter Plot: Each point represents a model plotted by accuracy vs response time
Pareto Optimal: Highlighted models are those that no other model beats on both accuracy and response time
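A Pareto-optimal set like the one highlighted in the scatter plot could be computed roughly as follows. This is a minimal sketch: the `Point` type, the `pareto_front` function, and the sample values are hypothetical, since the page does not list response times or describe its own method.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    accuracy: float  # higher is better
    seconds: float   # response time, lower is better


def pareto_front(points: list[Point]) -> list[Point]:
    """Return the points not dominated by any other point.

    q dominates p when q is at least as accurate and at least as fast,
    and strictly better on one of the two.
    """
    front = []
    for p in points:
        dominated = any(
            q.accuracy >= p.accuracy
            and q.seconds <= p.seconds
            and (q.accuracy > p.accuracy or q.seconds < p.seconds)
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.seconds)


# Made-up placeholder values, not measurements from the page.
sample = [
    Point("a", 0.90, 5.0),
    Point("b", 0.70, 2.0),
    Point("c", 0.60, 3.0),  # slower and less accurate than "b"
    Point("d", 0.30, 0.5),
]
print([p.name for p in pareto_front(sample)])  # ['d', 'b', 'a']
```

The quadratic scan is fine for a dozen models; a sort-based sweep would only matter for much larger sets.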
Evaluate how consistently models perform across different scenario types and difficulty levels
How models handle increasing complexity and maintain consistency across difficulty levels
Scaling Curves: Performance across easy, medium, and hard difficulty levels
Model | Easy | Medium | Hard | Overall Drop | Resilience | Category |
---|---|---|---|---|---|---|
GPT-5 Mini (minimal) | 39.5% | 51.6% | 46.2% | 0.0% | 100.0% | Stable |
GPT-4.1 Mini | 60.5% | 57.5% | 76.9% | 0.0% | 100.0% | Stable |
GPT-5 Nano (minimal) | 28.9% | 19.2% | 30.8% | 0.0% | 100.0% | Stable |
GPT-4.1 Nano | 23.7% | 35.7% | 38.5% | 0.0% | 100.0% | Stable |
Claude 3.5 Haiku | 31.6% | 27.8% | 53.8% | 0.0% | 100.0% | Stable |
Claude 3 Haiku | 0.0% | 0.2% | 0.0% | 0.0% | 100.0% | Stable |
Claude Sonnet 4 | 63.2% | 62.9% | 46.2% | 17.0% | 66.0% | Cliff Drop |
Claude Opus 4 | 84.2% | 75.8% | 61.5% | 22.7% | 54.7% | Linear Decline |
Claude 3.7 Sonnet | 65.8% | 53.6% | 38.5% | 27.3% | 45.3% | Linear Decline |
GPT-4.1 | 68.4% | 50.0% | 30.8% | 37.7% | 24.7% | Linear Decline |
Claude Opus 4.1 | 89.5% | 86.7% | 38.5% | 51.0% | 0.0% | Cliff Drop |
GPT-5 (minimal) | 89.5% | 64.3% | 23.1% | 66.4% | 0.0% | Linear Decline |
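The page does not define Overall Drop or Resilience. One reading that is consistent with the rows above treats the drop as the easy-to-hard difference in percentage points, floored at zero, and resilience as 100 minus twice that drop, also floored at zero. The sketch below encodes that reading; both formulas and both function names are inferences, not the page's documented method.

```python
def overall_drop(easy: float, hard: float) -> float:
    """Assumed: easy-minus-hard accuracy in percentage points, never negative."""
    return max(0.0, easy - hard)


def resilience(drop: float) -> float:
    """Assumed: 100 minus twice the drop, floored at zero."""
    return max(0.0, 100.0 - 2.0 * drop)


# Claude Sonnet 4 row: Easy 63.2%, Hard 46.2%
drop = overall_drop(63.2, 46.2)
print(round(drop, 1), round(resilience(drop), 1))  # 17.0 66.0
```

Some rows differ from this reading by 0.1 (for example, 22.7 would give 54.6 rather than the listed 54.7), presumably because the page computes from unrounded accuracies. Note also that a model whose hard-level accuracy exceeds its easy-level accuracy gets a drop of 0.0 and a "Stable" label under this reading, regardless of how low its scores are.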
Advanced evaluation of LLM card counting and strategic deviation capabilities
Model | Category | Basic Strategy | Count Deviations | Improvement Potential |
---|---|---|---|---|
Claude Opus 4.1 | Card Counter | 84.2% | 51.2% | Low - Well-rounded |
Claude Opus 4 | Inconsistent Counter | 74.0% | 51.2% | Medium - Strengthen basics |
GPT-5 (minimal) | Inconsistent Counter | 64.3% | 48.8% | Medium - Strengthen basics |
Claude Sonnet 4 | Inconsistent Counter | 61.3% | 44.2% | Medium - Strengthen basics |
GPT-4.1 Mini | Inconsistent Counter | 55.6% | 58.1% | Medium - Strengthen basics |
Claude 3.7 Sonnet | Inconsistent Counter | 52.3% | 60.5% | Medium - Strengthen basics |
GPT-4.1 | Inconsistent Counter | 49.9% | 46.5% | Medium - Strengthen basics |
GPT-5 Mini (minimal) | Inconsistent Counter | 51.7% | 30.2% | Medium - Strengthen basics |
GPT-4.1 Nano | Inconsistent Counter | 35.9% | 25.6% | Medium - Strengthen basics |
Claude 3.5 Haiku | Inconsistent Counter | 27.8% | 39.5% | Medium - Strengthen basics |
GPT-5 Nano (minimal) | Inconsistent Counter | 19.9% | 18.6% | Medium - Strengthen basics |
Claude 3 Haiku | Basic Learner | 0.2% | N/A | High - Start with basics |
How well do LLMs utilize information from multiple players' cards and decisions?
Model | Single Player | Multi Player | Performance Drop | Info Utilization | Complexity Handling |
---|---|---|---|---|---|
Claude Opus 4.1 | 81.3% | 86.1% | 0.0% | 100.0% | 106% |
Claude Opus 4 | 79.2% | 75.7% | 3.4% | 100.0% | 96% |
GPT-5 (minimal) | 79.2% | 63.6% | 15.6% | 100.0% | 80% |
Claude Sonnet 4 | 56.3% | 63.1% | 0.0% | 100.0% | 112% |
GPT-4.1 Mini | 62.5% | 57.8% | 4.7% | 100.0% | 92% |
Claude 3.7 Sonnet | 62.5% | 53.3% | 9.2% | 100.0% | 85% |
GPT-5 Mini (minimal) | 41.7% | 51.5% | 0.0% | 100.0% | 124% |
GPT-4.1 | 60.4% | 49.9% | 10.5% | 100.0% | 83% |
GPT-4.1 Nano | 27.1% | 35.7% | 0.0% | 100.0% | 132% |
Claude 3.5 Haiku | 29.2% | 28.8% | 0.4% | 100.0% | 99% |
GPT-5 Nano (minimal) | 31.3% | 19.1% | 12.1% | 97.8% | 61% |
Claude 3 Haiku | 0.0% | 0.2% | 0.0% | 22.5% | 22% |
Info Utilization: How effectively the model uses visible cards from other players
Complexity Handling: Ratio of multi-player to single-player performance
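The Complexity Handling ratio defined above, and a performance drop that matches the table if it is read as the single-player minus multi-player difference floored at zero, can be sketched as follows. The function names and the zero-floor reading of Performance Drop are assumptions.

```python
from typing import Optional


def complexity_handling(single: float, multi: float) -> Optional[float]:
    """Multi-player accuracy as a percentage of single-player accuracy.

    Returns None when single-player accuracy is zero, since the ratio
    is undefined there.
    """
    if single == 0:
        return None
    return 100.0 * multi / single


def performance_drop(single: float, multi: float) -> float:
    """Assumed: single-minus-multi in percentage points, never negative."""
    return max(0.0, single - multi)


# Claude Opus 4.1 row: Single Player 81.3%, Multi Player 86.1%
print(round(complexity_handling(81.3, 86.1)))  # 106
print(performance_drop(81.3, 86.1))            # 0.0
```

The Claude 3 Haiku row has a single-player accuracy of 0.0%, so this ratio is undefined for it; the 22% shown in the table must come from some other rule that the page does not describe.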
Understanding the spectrum of decision quality beyond simple right/wrong
Shows how much the partial credit system improves each model's score beyond strict accuracy.
Model | Overall | Partial Credit | Credit Benefit | Basic Strategy | Learning Potential |
---|---|---|---|---|---|
Claude 3 Haiku | 0.2% | 0.2% | +0.0% | 0.2% | Optimized |
GPT-5 Nano (minimal) | 20.3% | 20.9% | +0.6% | 19.9% | Optimized |
GPT-4.1 Mini | 58.2% | 59.4% | +1.2% | 55.6% | Optimized |
Claude 3.5 Haiku | 28.8% | 30.0% | +1.2% | 27.8% | Optimized |
Claude Opus 4 | 76.1% | 77.3% | +1.2% | 74.0% | Optimized |
Claude Sonnet 4 | 62.5% | 63.8% | +1.3% | 61.3% | Optimized |
GPT-4.1 | 50.9% | 52.4% | +1.5% | 49.9% | Optimized |
Claude Opus 4.1 | 85.6% | 87.1% | +1.5% | 84.2% | Optimized |
GPT-4.1 Nano | 34.9% | 36.5% | +1.6% | 35.9% | Optimized |
Claude 3.7 Sonnet | 54.2% | 55.9% | +1.7% | 52.3% | Optimized |
GPT-5 (minimal) | 65.1% | 66.8% | +1.7% | 64.3% | Optimized |
GPT-5 Mini (minimal) | 50.5% | 52.4% | +1.9% | 51.7% | Optimized |
Credit Benefit: How much the partial credit system improves the model's score
Learning Potential: Gap between basic strategy knowledge and current optimal performance
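Credit Benefit in the table is the partial-credit score minus the strict overall score, and the rows are ordered by it. A small sketch using three rows copied from the table (the function name is an assumption):

```python
def credit_benefit(overall: float, partial: float) -> float:
    """Partial-credit score minus strict score, in percentage points."""
    return round(partial - overall, 1)


# (model, overall %, partial credit %) taken from the table above
rows = [
    ("Claude Opus 4.1", 85.6, 87.1),
    ("GPT-5 Mini (minimal)", 50.5, 52.4),
    ("Claude 3 Haiku", 0.2, 0.2),
]

# Sort ascending by benefit, as the table does.
for name, overall, partial in sorted(rows, key=lambda r: credit_benefit(r[1], r[2])):
    print(f"{name}: +{credit_benefit(overall, partial):.1f}")
```

The benefit stays under two points for every model listed, so partial credit does not change the ranking in any meaningful way.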
Severity levels: Minor (1), Major (2), Severe (3+)
Deep dive into mistake patterns, frequencies, and severity across models
Bubble Chart: Each bubble represents a mistake type (correct action → action the model actually chose)
Size: Bubble size represents total frequency of the mistake
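The bubble sizes described above amount to counting (correct, actual) action pairs among the wrong decisions. A minimal sketch, assuming decisions are available as pairs of action names; the `mistake_counts` function and the sample data are hypothetical.

```python
from collections import Counter


def mistake_counts(decisions: list[tuple[str, str]]) -> Counter:
    """Count each (correct, actual) pair where the model chose wrongly."""
    return Counter(
        (correct, actual) for correct, actual in decisions if correct != actual
    )


# Made-up decisions as (correct action, action the model took).
sample = [
    ("stand", "hit"),
    ("stand", "hit"),
    ("double", "hit"),
    ("hit", "hit"),  # correct decision, not a mistake
]
counts = mistake_counts(sample)
for (correct, actual), n in counts.most_common():
    print(f"{correct} -> {actual}: {n}")
```

Each key of the resulting counter corresponds to one bubble, and its count to the bubble's size.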
Visualize mistake patterns by player hand vs dealer up-card combinations