AI4Games: A General Strategy Discovery Framework for Evolutionary Games
Hongyu Wang¹, Long Wang¹∗
(¹ Center for Systems and Control, Peking University, Beijing 100871, China)
Abstract: Discovering strategies with long-term evolutionary advantages in multi-agent systems is a fundamental problem at the intersection of evolutionary game theory, complex systems science, and artificial intelligence. This paper presents a general strategy discovery framework, AI4Games, which systematically transforms strategy design into a reinforcement learning–driven optimization task. The framework abstracts five generalizable components—strategy representation, interaction design, reward construction, optimization, and evaluation—forming a reusable pipeline applicable to diverse game-theoretic and behavioral modeling settings.
To validate the framework, we apply AI4Games to the evolutionary Iterated Prisoner's Dilemma and successfully uncover a memory-two bilateral reciprocity strategy (MTBR) that emerges naturally from training. MTBR exhibits interpretable behavioral rules, robust performance across heterogeneous opponents, and strong evolutionary stability. Its emergence as a non-predefined outcome highlights the framework's capability in navigating high-dimensional strategy spaces and discovering effective behavioral patterns.
AI4Games advances strategy modeling beyond hand-crafted heuristics and exemplifies the methodological contribution of AI for Science (AI4S) in the game-theoretic domain. It provides both a theoretical foundation and a practical tool for cross-disciplinary modeling under the "AI+" national initiative.
Keywords: evolutionary game; reinforcement learning; strategy discovery; multi-agent system; AI for science
Corresponding author: Long Wang, E-mail: longwang@pku.edu.cn
1 Introduction
In recent years, artificial intelligence (AI) has profoundly reshaped social structures, research paradigms, and economic logic, becoming a core driving force behind a new round of scientific and technological revolution. Its influence has expanded from auxiliary computing and information processing to fundamental innovation in scientific research methodology, propelling the scientific system from an "experience-driven" to an "intelligence-driven" paradigm. In August 2025, the State Council officially released the "Opinions on Deepening the Implementation of the 'AI Plus' Action," calling for accelerated exploration of "new AI-driven research paradigms" and explicit support for AI to empower the entire process of technology development, engineering implementation, and scientific discovery. The document emphasizes promoting AI-led interdisciplinary traction and systematic collaborative innovation, aiming to achieve original scientific breakthroughs from "0 to 1," accelerate the cultivation of new quality productive forces, and build intelligent scientific research infrastructure. This national strategy aligns closely with the emerging "AI for Science (AI4S)" paradigm in the international academic community.
In 2021, multiple domestic and international AI researchers proposed that AI should serve as a crucial driving force for scientific research, assisting or even reconstructing traditional research paradigms [1]. In 2023, several renowned scholars and technology organization representatives jointly published an article in Nature that systematically reviewed the pathways through which AI contributes to scientific discovery, highlighting AI's significant advantages in data-driven hypothesis generation, experimental design optimization, and complex system modeling, and positioning it as a new bridge connecting theory and practice [2]. It has become recognized that AI is no longer merely a technical tool for improving research efficiency but is becoming an ontological force participating in scientific discovery itself. Domestic scholars have also proposed top-level research layouts for "AI4S," calling for the construction of an intelligent scientific research system covering multiple foundational disciplines [3]. This series of works marks a paradigm shift in scientific research from "human-dominated" to "human-machine collaboration," providing powerful theoretical support and a technical engine for Chinese-style modernization and independent scientific innovation.
Currently, the deep integration of AI across multiple scientific research directions continues to demonstrate disruptive potential. For example, in biomedicine, AI has been widely used for drug screening, protein structure prediction, and clinical pathway optimization. In 2025, domestic scholars noted in Nature Medicine that AI is poised to reshape the entire new drug development process, significantly reducing development cycles and failure rates [4]. In physics, from quantum computing and particle simulation to materials discovery, AI tools are assisting physicists in solving complex modeling problems that are difficult to enumerate using traditional methods [5]. The practice of AI for Science has already covered multiple foundational and interdisciplinary fields, gradually shifting from an auxiliary tool to a key methodology and becoming a core engine for driving original innovation.
However, despite the breakthrough progress of the "AI for Science" research model in fields such as biology, materials science, and physics, it remains relatively weak in social behavior modeling and decision-making mechanism research, particularly lacking systematic AI research frameworks for game theory. In fact, game models widely exist in many critical issues in the real world, such as resource allocation in epidemic prevention and control, cooperation incentive mechanisms in social platforms, attack-defense games in cybersecurity, and even multi-agent coordination and adversarial games among AI systems themselves. These problems can all be abstracted as repeated interactive choices made by individuals under bounded rationality and feedback—precisely the core scenario focused on by evolutionary game theory.
In this context, discovering game strategies that can persist in dynamic environments and possess evolutionary stability is not only of theoretical value but also directly relates to understanding and guiding mechanisms of cooperation, trust, punishment, and reward in social systems. Traditional strategies mostly rely on manual construction and heuristic design, which, while interpretable, struggle to systematically cover complex strategy spaces and adapt to dynamically changing game structures. Therefore, there is an urgent need to establish an "AI4Games" framework with interpretability, transferability, and scalability, enabling AI to not only drive scientific discovery in natural sciences like biology and physics but also uncover decision-making mechanisms with practical guiding significance in social behavior and agent systems. This direction not only fills the theoretical gap of AI in game modeling but also aligns with the interdisciplinary research paradigm innovation emphasized in the national "AI Plus" strategy.
2 Strategy Discovery in Repeated Games
The evolutionary mechanisms of cooperation and defection are core concerns across complex systems science, economics, and behavioral ecology. Among numerous game models, repeated games provide a fundamental framework for revealing how individuals form stable behavioral patterns through long-term interactions [6-9]. In reality, individuals rarely interact only once; instead, they gradually build trust and develop strategies through sustained interaction, making behavioral evolution dependent not only on immediate payoffs but also driven by long-term adaptability. Evolutionary game theory takes strategies as the basic unit of population evolution, studying which behavioral patterns can survive and spread under mechanisms such as natural selection, learning, and imitation. From this perspective, strategies include not only those that can sustain reciprocal cooperation but also mechanisms capable of punishment and defense when facing defection. Consequently, long-term advantageous strategies often do not simply encourage cooperation but can flexibly respond to different types of opponents in complex, dynamic environments to achieve the goal of maximizing average payoffs. Thus, systematically discovering strategies with long-term evolutionary advantages in complex multi-agent environments has become a key challenge for advancing evolutionary game theory toward automation and generalization. This process concerns not only strategy generation itself but also adaptive evaluation of behavior evolution and mechanism analysis, representing a crucial step toward achieving "AI-driven behavioral modeling."
Existing research has proposed various classical strategies (such as TFT, GTFT, WSLS) [10-12] that perform well in pairwise interactions within specific environments. These strategies rely on human expert knowledge for construction and typically possess good intuitive interpretability, yet human expertise struggles to systematically cover vast strategy spaces or adapt to changing strategic demands in complex environments [13-14]. Therefore, constructing a general strategy discovery framework to automatically identify strategies with long-term evolutionary advantages is crucial for advancing research in evolutionary games. Beyond strategy design itself, numerous studies have examined how game environments and population structures influence cooperation evolution, such as the role of feedback mechanisms in cooperation formation [15-16] and how complex network structures regulate cooperation maintenance and propagation [17-18]. Additionally, research has shown that reputation-based partner selection mechanisms can effectively promote cooperative behavior in social networks [19]; other work has explored transition mechanisms of game rules themselves during evolution, demonstrating how strategies can spread and stabilize in dynamic game contexts [20]; recent analyses of strategy evolution on higher-order network structures have further revealed the important role of complex topologies in cooperation evolution [21]. These works underscore the complexity and structural dependency of strategy discovery problems, further highlighting the necessity of developing general strategy search methods.
In recent years, reinforcement learning research and applications have gradually expanded. Particularly in scenarios with high-dimensional strategy spaces and complex behavioral feedback, reinforcement learning demonstrates superior adaptability and discovery capabilities compared to traditional analytical methods [22-27]. Against this backdrop, we propose and implement the AI4Games framework, aiming to provide a universal and systematic strategy search methodology for evolutionary games. Centered on reinforcement learning and combined with custom payoff functions, the framework enables efficient search and optimization in vast strategy spaces. Unlike strategy design relying on human experience, AI4Games proposes a set of general strategy search principles for transforming specific game tasks into forms solvable by reinforcement learning.
The general principles of the AI4Games framework are as follows:
- Strategy Representation and Encoding Design: First, abstract strategic behavior in evolutionary games into a state-action mapping structure processable by reinforcement learning. By setting memory length and action sets, formally encode candidate strategies to construct the strategy space.
- Behavioral Interaction and Experience Collection: Purposefully design training environments containing representative opponent strategies, enabling agents to accumulate experience through representative interactive feedback, undergo adaptive evaluation, and obtain effective information to drive strategy improvement.
- Objective Function and Feedback Mechanism: Design and construct specific reward structures according to task requirements to guide strategies toward desired behavioral patterns, such as improving payoffs, resisting exploitation, or maintaining cooperation.
- Strategy Exploration and Optimization Mechanism: Utilize reinforcement learning methods for continuous optimization in the strategy space, adjusting the balance between exploration and exploitation during learning to improve strategy performance.
- Strategy Evaluation and Selection Mechanism: Construct systematic evaluation criteria to identify outstanding strategies from candidates and verify their long-term dominance in evolutionary dynamics.
These five steps constitute the general skeleton of the AI4Games framework, applicable to various evolutionary game tasks. For specific problems, simply transform them into the required construction items for these five steps, and AI4Games can be employed for strategy discovery. Figure 1 [FIGURE:1] illustrates the logical relationships and functional divisions among the five modules of the AI4Games framework, with each component detailed in subsequent sections.
To validate AI4Games' effectiveness, we showcase a typical output in evolutionary repeated games—the Memory-Two Bilateral Reciprocity (MTBR) strategy. This strategy emerges naturally during training, possesses simple and interpretable response logic, demonstrates high payoffs across multiple adversarial environments, and can significantly improve overall average payoffs in both non-evolutionary and evolutionary simulations while dominating evolutionary populations. This result proves that AI4Games can successfully mine evolutionarily advantageous strategies from complex strategy spaces.
As an exploratory attempt of the AI4S paradigm in game theory, AI4Games is not merely an algorithmic implementation but a systematic and generalizable research framework for strategy discovery. Its proposal expands the theoretical boundaries of AI in complex behavioral modeling. This paper proceeds from the general design of AI4Games, elaborates its specific implementation in evolutionary games, further demonstrates how the framework promotes spontaneous emergence and selection of strategies in multi-agent games, and finally showcases its search capability through MTBR's behavioral characteristics and evolutionary performance. Research shows that AI4Games, as a strategy mining platform, possesses strong extensibility and generality, providing new tools and research pathways for future exploration of high-dimensional memory strategies and more complex game structures.
3 The AI4Games Framework: Strategy Discovery via Reinforcement Learning
This section systematically introduces the intelligent strategy search framework AI4Games, a unified methodological platform for evolutionary game tasks. The framework is dedicated to systematizing and formalizing the strategy discovery process, with reinforcement learning as its core technological driver, while preserving high interpretability together with broad generality and scalability. Its design serves not only specific game problems but can also be viewed as a universal tool for AI engagement in behavioral modeling research. Unlike traditional methods that construct strategies based on human experience, AI4Games systematically transforms the strategy search problem into a reinforcement learning task. By constructing structured training and evaluation mechanisms, it automatically mines dominant strategies with evolutionary advantages from complex strategy spaces. The AI4Games framework follows the five general principles proposed in the previous section. Below, we introduce its specific implementation for the problem of mining dominant strategies in evolutionary repeated Prisoner's Dilemma environments. Mining dominant strategies in repeated games faces two core challenges: first, the combinatorial explosion of the strategy space (especially when considering memory strategies), and second, the multi-round game interaction and feedback required for strategy evaluation. Multi-agent Q-learning provides a "no human construction needed, self-evolvable" approach to strategy discovery in complex game systems. On the one hand, its state-action value encoding can systematically express strategy rules under finite memory. On the other hand, its experience replay and reward update mechanisms allow agents to gradually optimize their response strategies through multi-round games, thereby adapting to diverse opponents and achieving stable convergence. Compared to manual design relying on expert knowledge, multi-agent Q-learning methods are more easily systematized and algorithmically implemented, and are better suited to embedding into interdisciplinary "AI-driven scientific modeling" platforms.
3.1 Strategy Representation and Encoding Design
Each agent's strategy is represented as a Q-table, which records the mapping between all possible historical states (i.e., action combinations from past rounds of interaction) and current actions (for the Prisoner's Dilemma, cooperation or defection). For a memory-two scenario, the state space includes action combinations from both agents over the previous two rounds, and the strategy selects the action with the highest Q-value for each state.
For a two-player repeated game with M possible actions and memory length ℓ, the number of possible states is
N_{state} = \frac{M^{2\ell+2} - M^2}{M^2 - 1}
Specifically, in our Iterated Prisoner's Dilemma we set agents to have two-step memory (M = 2, ℓ = 2), which yields 2^4 + 2^2 = 20 states.
In this paper, a state refers to the interaction history an agent uses for decision-making. For a memory-two agent, decisions are based only on the actions taken by that agent and its opponent in the previous two rounds. Specifically, the relevant information includes the opponent's action two rounds ago, the agent's own action two rounds ago, the opponent's action in the previous round, and the agent's own action in the previous round. We represent the state as a four-element tuple: (opponent's action two rounds ago, agent's action two rounds ago, opponent's previous action, agent's previous action). Note that because interaction history is limited at the beginning of a game and the agent cannot yet accumulate a complete ℓ-step memory, the formula for Nstate includes the additional states required.
Each agent is assigned an independent Nstate × M Q-table, where the entry in row i and column j reflects the agent's expected long-term cumulative reward for choosing action j in state s_i (see Figure 2 [FIGURE:2]).
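To make the encoding concrete, the following Python sketch (illustrative only, not the authors' implementation) enumerates the 20 states of the memory-two Prisoner's Dilemma, including the shorter histories available at the start of a game, and allocates an empty Q-table over them. The action coding (0 = cooperate, 1 = defect) and all names are assumptions for illustration.

```python
# Minimal sketch: enumerating memory-two states and building a Q-table.
# Actions: 0 = cooperate (C), 1 = defect (D).
import numpy as np

M = 2          # number of actions
MEMORY = 2     # memory length (two rounds)

def enumerate_states(m=M, memory=MEMORY):
    """List all histories of length 1..MEMORY, each remembered round
    contributing a pair (opponent action, own action)."""
    states = []
    for length in range(1, memory + 1):
        for code in range(m ** (2 * length)):
            digits, c = [], code
            for _ in range(2 * length):
                digits.append(c % m)
                c //= m
            states.append(tuple(reversed(digits)))
    return states

STATES = enumerate_states()
STATE_INDEX = {s: i for i, s in enumerate(STATES)}

# One independent Q-table per agent: rows = states, columns = actions.
q_table = np.zeros((len(STATES), M))

print(len(STATES))  # 20 for M = 2, MEMORY = 2 (2^2 + 2^4)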
3.2 Behavioral Interaction and Experience Collection
Our training environment contains 98 individuals, including 49 agents with empty Q-tables and 49 sparring individuals with preset strategies. During training, reinforcement learning agents engage in repeated games with sparring individuals. These sparring individuals include equal numbers of TFT, GTFT, WSLS, Hold-a-Grudge, Fool-Me-Once, GradualTFT, and OmegaTFT.
In the Iterated Prisoner's Dilemma used for training, each round's payoff depends on the action combination of both parties, specifically configured as follows:
- If both cooperate, each receives R = 2
- If one cooperates and one defects, the cooperator receives S = 0 and the defector receives T = 3
- If both defect, each receives P = 0.1
This parameter combination intentionally reduces the payoff for mutual defection, thereby weakening the attractiveness of defection and providing a more favorable environment for the emergence of cooperative strategies.
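As a minimal illustration, the stage-game payoffs used during training can be written as a small helper; the function name and action coding are illustrative assumptions, while the numerical values R = 2, S = 0, T = 3, P = 0.1 follow the text above.

```python
# Sketch of the training-stage payoffs (R = 2, S = 0, T = 3, P = 0.1).
# Actions: 0 = cooperate, 1 = defect. Names are illustrative.
R, S, T, P = 2.0, 0.0, 3.0, 0.1

def stage_payoffs(action_a, action_b):
    """Return (payoff_a, payoff_b) for one round of the training Prisoner's Dilemma."""
    if action_a == 0 and action_b == 0:
        return R, R          # mutual cooperation
    if action_a == 0 and action_b == 1:
        return S, T          # a is exploited
    if action_a == 1 and action_b == 0:
        return T, S          # a exploits
    return P, P              # mutual defection
```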
3.3 Objective Function and Feedback Mechanism
Figure 2: Schematic diagram of dominant strategy discovery in evolutionary repeated games using the AI4Games framework. We consider a population of N_a = 49 agents and N_m = 49 mentors. Each agent is equipped with an independent N_state × M Q-table, where N_state represents the number of possible states and M represents the number of available actions. Each mentor carries a preset artificial strategy, with seven categories established. In each iteration, two individuals p1 and p2 are randomly selected from the agent and mentor pools to play an L-round Iterated Prisoner's Dilemma. In round t, agent p1 consults its Q-table based on the current state s_{p1,t} (i.e., the combination of both parties' actions in the most recent ℓ interactions) and selects an action. Action selection uses an ϵ-greedy policy to balance exploration and exploitation. Mentors always act according to their preset fixed strategies. After completing the L-round game, the weighted payoff W_{p1}(s_{p1,t}, a_{p1,t}) is calculated for each action step based on the agent's historical payoffs and interaction results. The agent then updates its Q-table using the Bellman formula (see Equation 3).
In Q-learning, agents optimize their strategies by maximizing expected payoff. To more effectively encourage long-term cooperation in repeated games and guide strategies to resist exploitation by other individuals, we design an objective function for updating the Q-table.
We denote agent p_X's state at time step t as s_{p_X,t}, its action as a_{p_X,t}, and its payoff in that round as U_{p_X,t}. We further denote the agent's average payoff in this repeated game as \bar{U}_{p_X} and its opponent p_Y's average payoff as \bar{U}_{p_Y}. The objective function is defined as:
W_{p_X}(s_{p_X,t}, a_{p_X,t}) =
\begin{cases}
\theta U_{p_X,t} + (1 - \theta)\,\bar{U}_{p_X} & \text{if } \bar{U}_{p_X} \geq \bar{U}_{p_Y} \\
\theta U_{p_X,t} & \text{if } \bar{U}_{p_X} < \bar{U}_{p_Y}
\end{cases}
where θ ∈ [0, 1] is a parameter that adjusts the weight between immediate payoff and long-term cooperative payoff.
The average payoff is defined as:
\bar{U}_{p_X} = \frac{\sum_{t=1}^{L} U_{p_X,t}}{L}
where L is the number of rounds in the repeated game.
The Q-table is then updated as follows:
NewQ_{p_X}(s_{p_X,t}, a_{p_X,t}) = Q_{p_X}(s_{p_X,t}, a_{p_X,t}) + \alpha \left[ W_{p_X}(s_{p_X,t}, a_{p_X,t}) + \gamma \max_{a'} Q_{p_X}(s', a') - Q_{p_X}(s_{p_X,t}, a_{p_X,t}) \right]
where α is the learning rate, γ is the discount factor, and s' is the next state. We believe this design effectively balances short-term instantaneous payoffs with stable long-term benefits from cooperation, thereby helping agents better survive in evolving populations and improving the overall cooperation level of the evolutionary population.
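A minimal Python sketch of the weighted payoff W and the resulting Q-table update might look as follows. It assumes integer state and action indices into a NumPy Q-table (as in the earlier encoding sketch) and uses the parameter values reported in Section 3.4; all function names are illustrative assumptions.

```python
# Sketch of the weighted payoff W and the Q-table update defined above.
import numpy as np

THETA, ALPHA, GAMMA = 0.8, 0.2, 0.5   # values from Section 3.4

def weighted_payoff(u_t, mean_u_self, mean_u_opp, theta=THETA):
    """W(s_t, a_t): blend in the agent's own average payoff only when it is
    not being out-earned by its opponent (reward cooperation, resist exploitation)."""
    if mean_u_self >= mean_u_opp:
        return theta * u_t + (1.0 - theta) * mean_u_self
    return theta * u_t

def q_update(q_table, s, a, s_next, w, alpha=ALPHA, gamma=GAMMA):
    """One Bellman-style update of the Q-table entry (s, a), cf. the formula above.
    s, a, s_next are integer indices into an (n_states x n_actions) array."""
    td_target = w + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])
```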
3.4 Strategy Exploration and Optimization Mechanism
In each training round, an agent plays a 20-round repeated game with an opponent and updates its Q-table based on the payoffs. The Q-learning parameters are as follows: learning rate α = 0.2, discount factor γ = 0.5, and preference parameter in the weighted payoff θ = 0.8. This parameter combination showed good stability and balance in experiments, jointly accounting for the convergence speed of reinforcement learning, the weight placed on future payoffs, and the tendency toward long-term cooperation. Training uses an ϵ-greedy policy for exploration, with ϵ gradually decaying to a small value as training proceeds to achieve a transition from exploration to convergence.
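The exploration mechanism can be sketched as below. The ϵ-greedy rule follows the description above, while the specific decay schedule (exponential decay toward a small floor) is an assumption for illustration, since the text only states that ϵ decays gradually.

```python
# Sketch of epsilon-greedy action selection with a decaying epsilon.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_row, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Illustrative exploration schedule: exponential decay with a floor."""
    return max(eps_min, eps_start * decay ** episode)
```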
3.5 Strategy Evaluation and Selection Mechanism
After training, we evaluate the agent strategies obtained through the AI4Games framework to assess their stability and cooperation level in different game environments. If a strategy demonstrates high average payoffs against multiple types of opponents and can persist in evolutionary simulations, it is considered to have the potential to become a dominant strategy. In practice, we conduct multiple independent training runs and evaluate the results to filter out strategies that are behaviorally stable, structurally simple, and outstanding in performance. Among numerous strategies, we ultimately extract a strategy with clear structure and interpretable behavioral rules—the Memory-Two Bilateral Reciprocity (MTBR) strategy described below.
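A simplified version of this evaluation step is sketched below: a frozen candidate strategy is scored by its average per-round payoff over repeated games against a pool of fixed opponents. The helper names and interfaces (strategies as callables mapping histories to actions, and a payoff function such as the one sketched in Section 3.2) are assumptions for illustration, not the authors' code.

```python
# Sketch of evaluating a frozen candidate strategy against fixed opponents.
def play_repeated_game(strategy_a, strategy_b, payoff_fn, rounds=20):
    """Play a repeated game between two strategy callables.
    Each strategy maps (own_history, opponent_history) -> action (0 = C, 1 = D)."""
    hist_a, hist_b = [], []
    total_a = total_b = 0.0
    for _ in range(rounds):
        act_a = strategy_a(hist_a, hist_b)
        act_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = payoff_fn(act_a, act_b)
        total_a += pay_a
        total_b += pay_b
        hist_a.append(act_a)
        hist_b.append(act_b)
    return total_a / rounds, total_b / rounds

def evaluate_candidate(candidate, opponents, payoff_fn, rounds=20, repeats=100):
    """Mean per-round payoff of the candidate over all opponents and repetitions."""
    scores = []
    for opponent in opponents:
        for _ in range(repeats):
            score, _ = play_repeated_game(candidate, opponent, payoff_fn, rounds)
            scores.append(score)
    return sum(scores) / len(scores)
```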
4 Analysis of AI4Games Output: Decision Logic and Evolutionary Advantages of the Memory-Two Bilateral Reciprocity Strategy
To validate the strategy mining capability of the AI4Games framework in complex game tasks, this section presents its representative output in the classic Iterated Prisoner's Dilemma environment. This task is not only widely used in evolutionary studies of cooperative behavior but has also become a standard benchmark for strategy evolution algorithms due to its simple structure and flexible parameter control. We select this task as a typical application scenario to demonstrate how AI4Games can automatically elicit evolutionarily advantageous strategies from high-dimensional strategy spaces.
Although the AI4Games framework itself is applicable to repeated games with arbitrary memory lengths and action set sizes, to more clearly characterize the structural features of its output strategies, we focus on a challenging experimental setting: a finite-horizon (20-round) Iterated Prisoner's Dilemma. In this setting, the temptation of short-term payoffs is stronger, posing greater resistance to long-term reciprocal behavior. We adopt the classic parameter configuration: R = 3, S = 0, T = 5, P = 1, which exhibits typical "defection advantage" and thus places higher demands on a strategy's fault tolerance and recovery mechanisms.
4.1 The Memory-Two Bilateral Reciprocity Strategy
Through analysis and evaluation of the trained agent strategies, we obtained a high-performance agent strategy. This strategy makes decisions based on two rounds of memory and demonstrates good cooperation capability and payoff advantages when interacting with various sparring partners. We term it the "Memory-Two Bilateral Reciprocity" (MTBR) strategy.
The core decision logic of MTBR is as follows:
- When the opponent defects in the first round while the agent cooperates, MTBR continues to cooperate in the second round. This choice exhibits fault tolerance and can encourage the opponent to reciprocate cooperation.
- When both parties have defected in the past two rounds, MTBR proactively switches to cooperation, attempting to break the persistent defection pattern.
- In all other cases, MTBR mimics the opponent's previous-round action, i.e., executing the "TFT" strategy.
These rules are not manually designed through mathematical analysis or inspired by biology but are automatically formed by the agent through multi-round reinforcement learning. In terms of strategy structure, MTBR possesses both the ability to recognize cooperative trends and mechanisms for forgiveness and retaliation, enabling flexible adjustment of responses when facing different types of strategies.
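Read literally, the three rules above can be written as a short decision function. The sketch below is an illustration only: the action coding (0 = cooperate, 1 = defect) and the first-round action (assumed to be cooperation) are assumptions not fixed by the rules themselves.

```python
# Illustrative sketch of the MTBR decision rule described above.
def mtbr(my_history, opp_history):
    """Memory-two bilateral reciprocity: decide from the last two rounds."""
    if not my_history:                       # first round: cooperation assumed
        return 0
    if len(my_history) == 1:                 # second round
        # Rule 1: if I cooperated and the opponent defected, keep cooperating.
        if my_history[-1] == 0 and opp_history[-1] == 1:
            return 0
        return opp_history[-1]               # otherwise behave like TFT
    # Rule 2: after two rounds of mutual defection, switch back to cooperation.
    if (my_history[-1] == 1 and opp_history[-1] == 1 and
            my_history[-2] == 1 and opp_history[-2] == 1):
        return 0
    # Rule 3: otherwise mimic the opponent's previous action (TFT).
    return opp_history[-1]
```

Under this reading, two MTBR copies whose first-round actions differ return to mutual cooperation from the second round onward, consistent with the behavior shown in Figure 3a.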
To further demonstrate its behavioral characteristics, we conducted a visual analysis of the game process between two MTBR strategies. As shown in Figure 3 [FIGURE:3], when initial states are inconsistent (e.g., one cooperates, one defects), MTBR can quickly recover and establish stable reciprocal cooperation; under the same scenario, the TFT strategy falls into a repeated "cooperate-defect" cycle, while GradualTFT requires many rounds to restore cooperation.
Figure 3b shows the scenario where both strategies defect in the first round. Both MTBR and GradualTFT can restore cooperation within a short time, while TFT remains trapped in continuous mutual defection.
These behavioral differences indicate that MTBR possesses stronger fault tolerance and recovery capabilities, able to rapidly guide cooperation formation without complex structures. This capability is particularly important in noise-free repeated games and lays the foundation for its subsequent evolutionary stability.
Figure 3: Interaction schematic of MTBR, GradualTFT, and TFT strategies in the Iterated Prisoner's Dilemma. The ability of a strategy to quickly establish reciprocal cooperation after initial defection is one of the key factors for success. Yellow circles in the figure indicate states where individuals transition from defection to reciprocal cooperation. Red letters indicate that other strategies (compared to MTBR) exhibit more defection, leading to lower overall payoffs. (a) Two individuals using the same strategy randomly choose "cooperate" and "defect" in the first round. The MTBR strategy can quickly establish reciprocal cooperation, while the TFT strategy falls into an alternating "cooperate-defect" cycle. (b) Both individuals defect in the first round. Both MTBR and GradualTFT can recover to reciprocal cooperation, while TFT remains trapped in continuous mutual defection.
4.2 MTBR Strategy Enhances Population Payoffs
This section examines the impact of introducing the MTBR strategy on individual payoffs and average population payoff in populations with fixed strategy compositions. We constructed two different strategic combination environments to evaluate its ability to promote cooperation by comparing game payoffs before and after introducing MTBR.
First, consider Strategy Set 1, containing seven classical strategies: GradualTFT, OmegaTFT, TFT, GTFT (tolerance 0.3), Fool-Me-Once, WSLS, and Hold-a-Grudge. In Figure 4a [FIGURE:4], purple bars show the average payoffs when strategies within this set play against each other. After introducing the MTBR strategy (blue bars), the average payoff of each strategy in the set improves significantly, and the entire population's average payoff also increases substantially.
Notably, the MTBR strategy achieves the second-highest payoff in this environment, only slightly lower than GradualTFT, with minimal difference between them. GradualTFT is the most challenging opponent for MTBR in this environment, yet both can stably maintain cooperation during games. This result demonstrates that MTBR not only performs well itself but also enhances the overall payoff level of the entire population.
We then introduce Strategy Set 2, which adds eight typical zero-determinant (ZD) strategies to Set 1. ZD strategies are characterized by their ability to control opponents' payoffs and are strongly exploitative in various scenarios. Figure 4b shows that without MTBR, the addition of ZD strategies reduces the population's average payoff from 2.15 to 1.93, indicating that ZD strategies (exploitative strategies) significantly damage population cooperation. However, introducing MTBR into this environment raises the population's average payoff again, and MTBR itself achieves payoffs far above the average when interacting with other strategies. These results indicate that MTBR is not only highly competitive in repeated games but can also improve the overall cooperation level of the population under non-evolutionary conditions.
4.3 Evolutionary Performance of MTBR in Mixed Populations
The previous results demonstrate MTBR's payoff advantages in fixed strategy sets. This section further examines its long-term evolutionary capability in evolving mixed populations. We constructed an evolutionary system where individuals initially randomly choose either MTBR or any strategy from Strategy Set 2, then achieve strategy spread through repeated games and strategy imitation processes.
In each generation, all individuals in the population are randomly paired to play 20-round Iterated Prisoner's Dilemma games. The game matrix remains consistent with previous sections: R = 3, S = 0, T = 5, P = 1. After completing the games, all individuals obtain average payoffs, after which an individual in the population may imitate another individual's strategy. The imitation probability is determined by the payoff difference between the two individuals through the following formula:
p_{i→j} = \frac{1}{1 + \exp\left[-\delta(\bar{U}_j - \bar{U}_i)\right]}
where δ represents the selection intensity, controlling how payoff differences affect imitation probability. When δ approaches 0, selection becomes random; when δ is large, strategy spread depends more on payoff differences.
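For concreteness, this pairwise imitation step can be sketched as follows; the default δ = 3.0 matches the selection intensity used in the experiments (see Figure 5), while the function names are illustrative.

```python
# Sketch of the pairwise imitation (Fermi) rule above: individual i adopts
# j's strategy with a probability determined by their payoff difference.
import math, random

def imitation_probability(payoff_i, payoff_j, delta=3.0):
    """Probability that i imitates j; delta is the selection intensity."""
    return 1.0 / (1.0 + math.exp(-delta * (payoff_j - payoff_i)))

def maybe_imitate(strategy_i, strategy_j, payoff_i, payoff_j, delta=3.0):
    """Return the strategy that individual i carries into the next generation."""
    if random.random() < imitation_probability(payoff_i, payoff_j, delta):
        return strategy_j
    return strategy_i
```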
Figure 5 [FIGURE:5] shows comparative results for two evolutionary scenarios. If the initial population does not include MTBR, the evolutionarily stable state of the population consists of GTFT0.3, GradualTFT, and ZDGTFT2, with an average population payoff of approximately 2.900 (Figure 5a, 5c blue line). After introducing MTBR (Figure 5b), its frequency rises rapidly and gradually replaces other strategies, eventually dominating the entire population, with average payoff increasing to 2.938 (Figure 5c red line).
Further analysis reveals that GradualTFT and MTBR obtain identical payoffs when interacting with each other. However, compared to GradualTFT, MTBR achieves higher payoffs when interacting with itself, thus gradually replacing the former during evolution. Meanwhile, in mixed populations containing multiple exploitative strategies, MTBR demonstrates stable ability to resist exploitation and can continuously guide the population toward higher cooperation levels.
This result not only validates MTBR's evolutionary advantages but also demonstrates AI4Games' capability to automatically generate interpretable strategies, reflecting its effectiveness and generality as a strategy search framework.
5 Conclusion and Outlook
This paper proposes and implements a unified framework for evolutionary game strategy discovery—AI4Games—which systematically transforms the traditional process of human-experience-based strategy construction into a reinforcement learning–driven optimization task. The framework abstracts five general steps: strategy representation, behavioral interaction, reward construction, strategy optimization, and strategy evaluation, forming a reusable and extensible strategy discovery pipeline applicable to diverse game modeling and evolutionary mechanism research tasks.
To validate the framework's effectiveness, we applied it to the evolutionary Iterated Prisoner's Dilemma environment and successfully discovered a bilateral reciprocity strategy with a memory-two structure (MTBR). This strategy is not preset but emerges spontaneously from complex strategy spaces, possessing clear, interpretable behavioral rules and good evolutionary stability, demonstrating AI4Games' ability to automatically produce high-quality strategies without human intervention. This result not only validates AI4Games' functional effectiveness but also highlights its potential as an intelligent scientific research tool.
From a broader perspective, the proposal of AI4Games represents a leap of AI in game modeling and behavioral strategy research from "auxiliary tool" to "problem framework." It is not only a strategy discovery method but also a new research paradigm connecting game theory, evolutionary dynamics, and intelligent decision-making, providing a general tool platform for complex system modeling in the AI4S era. Its methodological thinking also aligns closely with the cross-disciplinary collaborative innovation, high-dimensional problem modeling, and intelligent mechanism design emphasized in the national "AI Plus" action plan.
Figure 4: MTBR can effectively improve population cooperation levels. We consider two strategy sets: Strategy Set 1 (including GradualTFT, OmegaTFT, TFT, GTFT0.3, Fool-Me-Once, WSLS, and Hold-a-Grudge) and Strategy Set 2 (adding eight zero-determinant strategies to Set 1). (a) Interaction results between a population composed of Strategy Set 1 (purple bars) and a population with MTBR added to Set 1 (blue bars)—showing each strategy's average payoff when interacting with other strategies in the set. (b) Interaction results between a population composed solely of Strategy Set 2 (purple bars) and a population with MTBR added to Set 2 (blue bars). It can be seen that introducing MTBR improves the average payoff of the entire population in both cases. MTBR itself achieves high average payoffs in both scenarios. Each interaction is a 20-round Iterated Prisoner's Dilemma, with all results averaged over 10,000 independent experiments.
In summary, AI4Games not only provides solutions for specific game tasks but also drives a paradigm shift in strategy evolution research from rule design to intelligent modeling. It will serve as an important foundation for future "AI + Game Theory" interdisciplinary research, providing a replicable and generalizable new pathway for promoting intelligent scientific modeling, understanding evolutionary social mechanisms, and designing complex behavioral systems.
Figure 5: The memory-two bilateral reciprocity strategy dominates during evolution, driving the population to a higher-payoff state. We consider two evolutionary populations: one composed of Strategy Set 2 (see Methods) without MTBR, and another with MTBR introduced into Strategy Set 2. (a) Strategy frequency trajectories for the evolutionary population without MTBR. (b) Strategy frequency trajectories for the evolutionary population with MTBR. (c) Changes in population average payoff during evolution, where the blue curve represents the scenario without MTBR and the red curve represents the scenario with MTBR. Solid lines show evolutionary trajectories predicted by replicator dynamics equations, while dashed lines show average results from 50 repeated experiments. Parameters: selection intensity δ = 3.0 in a, b, and c; population size N = 7000 for the population without MTBR and N = 7500 for the population with MTBR.
Author Contributions: Hongyu Wang and Long Wang: conceived the research idea, designed the study, conducted experiments, drafted the manuscript, and revised the manuscript.
References
[1] XU Y, LIU X, CAO X, et al. Artificial intelligence: A powerful paradigm for scientific research[J]. The Innovation, 2021, 2(4).
[2] WANG H, FU T, DU Y, et al. Scientific discovery in the age of artificial intelligence[J]. Nature, 2023, 620(7972): 47-60.
[3] LI G. AI4S Milestone Major Achievements Review[J]. Computing, 2025, 1(4): 6-15.
[4] ZHANG K, YANG X, WANG Y, et al. Artificial intelligence in drug development[J]. Nature Medicine, 2025, 31(1): 45-59.
[5] JIAO L, SONG X, YOU C, et al. AI meets physics: a comprehensive survey[J]. Artificial Intelligence Review, 2024, 57(9): 256.
[6] NOWAK M A. Five Rules for the Evolution of Cooperation[J]. Science, 2006, 314(5805): 1560-1563.
[7] WANG L, FU F, CHEN X, et al. Evolutionary Game and Self-Organizing Cooperation[J]. Journal of Systems Science and Mathematical Sciences, 2007(03): 330-343.
[8] WANG L, FU F, CHEN X, et al. Evolutionary Games on Complex Networks[J]. CAAI Transactions on Intelligent Systems, 2007(02): 1-10.
[9] HILBE C, CHATTERJEE K, NOWAK M A. Partners and rivals in direct reciprocity[J]. Nature Human Behaviour, 2018, 2(7): 469-477.
[10] AXELROD R, HAMILTON W D. The Evolution of Cooperation[J]. Science, 1981, 211(4489): 1390-1396.
[11] NOWAK M, SIGMUND K. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner’s Dilemma game[J]. Nature, 1993, 364(6432): 56-58.
[12] PRESS W H, DYSON F J. Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent[J]. Proceedings of the National Academy of Sciences, 2012, 109(26): 10409-10413.
[13] GRÜNE-YANOFF T. Evolutionary game theory, interpersonal comparisons and natural selection: a dilemma[J]. Biology & Philosophy, 2011, 26(5): 637-654.
[14] ADAMI C, SCHOSSAU J, HINTZE A. Evolutionary game theory using agent-based methods[J]. Physics of Life Reviews, 2016, 19: 1-26.
[15] WANG L, WU T, ZHANG Y. Feedback Mechanisms in Co-evolutionary Games[J]. Control Theory & Applications, 2014(07): 823-836.
[16] WANG L, CONG R, LI K. Feedback Mechanisms in Cooperation Evolution[J]. Scientia Sinica: Informationis, 2014(12): 1495-1514.
[17] WANG L, FU F, CHEN X, et al. Group Decision-Making on Complex Networks[J]. CAAI Transactions on Intelligent Systems, 2008(02): 95-108.
[18] WANG L, WU B, DU J, et al. Analysis of Spreading Behavior on Complex Networks[J]. Scientia Sinica: Informationis, 2020, 50(11):
[19] FU F, HAUERT C, NOWAK M A, et al. Reputation-based partner choice promotes cooperation in social networks[J]. Physical Review E, 2008, 78: 026117.
[20] SU Q, MCAVOY A, WANG L, et al. Evolutionary dynamics with game transitions[J]. Proceedings of the National Academy of Sciences, 2019, 116(51): 25398-25404.
[21] SHENG A, SU Q, WANG L, et al. Strategy evolution on higher-order networks[J]. Nature Computational Science, 2024, 4(4): 274-284.
[22] LITTMAN M L. Markov games as a framework for multi-agent reinforcement learning[C]//Machine Learning Proceedings 1994. San Francisco, CA: Morgan Kaufmann, 1994: 157-163.
[23] HU J, WELLMAN M P. Multiagent reinforcement learning: theoretical framework and an algorithm[C]//Proceedings of the International Conference on Machine Learning: vol. 98. 1998: 242-250.
[24] CHALKIADAKIS G, BOUTILIER C. Coordination in multiagent reinforcement learning: a Bayesian approach[C]//Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS ’03). New York, NY, USA: Association for Computing Machinery, 2003: 709-716.
[25] MATIGNON L, LAURENT G J, FORT-PIAT N L. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems[J]. The Knowledge Engineering Review, 2012, 27(1): 1-31.
[26] NOWÉ A, VRANCX P, HAUWERE Y M D. Game Theory and Multi-agent Reinforcement Learning[M]. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012: 441-470.
[27] HARPER M, KNIGHT V, JONES M, et al. Reinforcement Learning Produces Dominant Strategies for the Iterated Prisoner’s Dilemma[J]. PLOS ONE, 2017, 12(12): e0188046.