Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

1Rensselaer Polytechnic Institute, 2Shenzhen University, 3University of California, Los Angeles, 4NEC Laboratories America, 5California Institute of Technology

Strategist (GPT-4-turbo) 🤖 playing against rule-based bots in AvalonBench

Abstract

We explore the potential of Large Language Model (LLM) agents in playing the strategic social deduction game Resistance: Avalon.

In this paper, we propose a new method, Strategist, that utilizes LLMs to acquire new skills for playing multi-agent games through a self-improvement process. Our method gathers quality feedback through self-play simulations with Monte Carlo tree search (MCTS) and LLM-based reflection, which is then used to learn high-level strategic skills, such as how to evaluate states, that guide low-level execution. We showcase how our method can be used for both action planning and dialogue generation in the context of games, achieving strong performance on both tasks.

Specifically, we demonstrate that our method helps train agents that outperform both traditional reinforcement learning-based approaches and other LLM-based skill-learning approaches in the Game of Pure Strategy (GOPS) and Resistance: Avalon.

Method

Overview of Strategist

Overview of Strategist. Our method learns high-level skills/strategies by searching over and improving previously learned strategies, using low-level self-play simulations to guide the high-level search process.

Our improvement process contains two steps in each improvement cycle: (1) a reflection and idea-generation step and (2) a strategy-improvement step. During the idea-generation step, we prompt the LLM to reflect on simulated self-play feedback from previously evaluated strategies, generate possible improvement ideas for those strategies, and add the ideas to the idea queue. During the strategy-improvement step, we select a strategy from the strategy tree and an improvement idea from the idea queue, and prompt the LLM to improve the strategy using that idea. The improved strategy is then evaluated via self-play simulations, and the feedback and reward signals from the simulations guide future improvements.
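A minimal sketch of one such cycle is shown below. The helper callables (llm_reflect, llm_implement, evaluate) and the list-based pool/queue structures are hypothetical stand-ins for the actual reflection prompt, improvement prompt, self-play evaluator, strategy tree, and idea queue; noisy greedy selection stands in for the adaptive selection policy described later.

import random

def improvement_cycle(strategy_pool, idea_queue, llm_reflect, llm_implement, evaluate):
    """One Strategist improvement cycle (simplified, hypothetical sketch).

    strategy_pool: list of dicts with 'strategy', 'score', and 'feedback' keys
    idea_queue:    list of dicts with 'idea' and 'score' keys
    llm_reflect(strategy, feedback) -> list of improvement ideas (strings)
    llm_implement(strategy, idea)   -> improved strategy
    evaluate(strategy)              -> (score, feedback) from self-play simulations
    """
    # (1) Reflection and idea generation: pick a previously evaluated strategy and
    # its self-play feedback, ask the LLM for improvement ideas, and queue them.
    entry = max(strategy_pool, key=lambda e: e["score"] + random.random())  # noisy stand-in for adaptive selection
    for idea in llm_reflect(entry["strategy"], entry["feedback"]):
        idea_queue.append({"idea": idea, "score": 0.0})  # neutral prior score

    # (2) Strategy improvement: pick a strategy and an idea, ask the LLM to apply
    # the idea, then score the improved strategy via self-play simulations.
    entry = max(strategy_pool, key=lambda e: e["score"] + random.random())
    idea = max(idea_queue, key=lambda i: i["score"] + random.random())
    improved = llm_implement(entry["strategy"], idea["idea"])
    score, feedback = evaluate(improved)

    # Record the improved strategy and credit the idea that was used.
    strategy_pool.append({"strategy": improved, "score": score, "feedback": feedback})
    idea["score"] += score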

Our general goal is to learn a good policy function for a sequential decision-making setting, generally formulated as a partially observable Markov decision game (POMDG), which can be done by improving the strategies associated with the policy function. Here we describe in more detail what a strategy looks like, how we derive a policy function from a strategy, and how we acquire feedback on a strategy for both dialogue generation and action generation.
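As a concrete illustration for action generation, a learned value heuristic (one kind of strategy) can induce a policy by one-step lookahead: score each legal action by the heuristic value of the state it leads to. This sketch is a simplification; in the paper the heuristic evaluates states inside a look-ahead search such as MCTS, and `transition` here is an assumed game-specific successor function.

def greedy_policy(state, legal_actions, transition, evaluate_state, player=0):
    """Pick the action whose successor state the value heuristic rates best for `player`.

    evaluate_state follows the signature used throughout this page:
    it returns ((player_0_value, player_1_value), intermediate_values)."""
    def action_value(action):
        next_state = transition(state, action)  # assumed game-specific model
        (p0_value, p1_value), _ = evaluate_state(next_state)
        return p0_value if player == 0 else p1_value
    return max(legal_actions, key=action_value)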

Skill Coach Improvement of Strategist

During the idea-generation step, a strategy and its simulated-trajectory feedback are first selected from the strategy tree according to the adaptive selection policy. The feedback usually consists of trajectories from previous self-play simulations, including the states visited, the actions taken at those states, the estimated win rate at each state, the final outcome of the trajectory, and any intermediate values computed during simulation at those states. Since these trajectories are very long (around 30 steps each for Avalon), we select a few key states in each trajectory to translate into natural language. We then prompt the LLM to reflect on this feedback and propose a few new ideas for improving the strategy. These new ideas are added to the idea queue, along with a prior score estimating how good each idea is.
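Below is a sketch of how key states might be picked out of a long trajectory and rendered as text for the reflection prompt. The step fields (`state`, `action`, `win_rate`, `outcome`) and the selection criterion (keep the steps with the largest win-rate swings) are assumptions for illustration, not necessarily the exact rule used in the paper.

def summarize_trajectory(trajectory, describe_state, num_key_states=5):
    """Turn a long self-play trajectory into short natural-language feedback.

    trajectory:     list of step dicts with 'state', 'action', and 'win_rate' keys;
                    the last step also carries the final 'outcome'
    describe_state: assumed game-specific state-to-text function
    """
    # One plausible criterion: keep the steps where the estimated win rate changed
    # the most, since those are likely the most informative moments to reflect on.
    def swing(i):
        return abs(trajectory[i]["win_rate"] - trajectory[i - 1]["win_rate"]) if i > 0 else 0.0

    key_indices = sorted(sorted(range(len(trajectory)), key=swing, reverse=True)[:num_key_states])

    lines = []
    for i in key_indices:
        step = trajectory[i]
        lines.append(
            f"Step {i}: {describe_state(step['state'])}; action taken: {step['action']}; "
            f"estimated win rate: {step['win_rate']:.2f}"
        )
    lines.append(f"Final outcome: {trajectory[-1]['outcome']}")
    return "\n".join(lines)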

During the strategy-improvement step, we first select a strategy and an idea from their respective libraries using the adaptive selection policy. We then prompt the LLM to apply the idea to the strategy, generating a new, improved strategy. The new strategy is evaluated by the evaluator, which runs self-play simulations with the strategy and records the simulated trajectory data. During these simulations, players run Monte Carlo tree search (MCTS) to estimate the expected win rate at different states, which provides additional feedback. We add the improved strategy (and its performance) to the strategy tree and update the improvement score of the idea that was used.
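The adaptive selection policy trades off exploiting high-scoring strategies and ideas against exploring less-tried ones. A UCB-style score is one natural way to implement such a trade-off; the sketch below is an assumed illustration rather than Strategist's exact selection rule.

import math

def ucb_select(candidates, exploration=1.0):
    """Select a strategy or idea from a list of dicts with 'mean_reward' and 'visits' keys.

    Each candidate is scored by its mean simulated reward plus an exploration bonus
    that shrinks as the candidate gets selected more often (UCB1-style)."""
    total_visits = sum(c["visits"] for c in candidates) + 1
    def ucb_score(c):
        bonus = exploration * math.sqrt(math.log(total_visits) / (c["visits"] + 1))
        return c["mean_reward"] + bonus
    return max(candidates, key=ucb_score)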

Strategist Main Results


We demonstrate the effectiveness of our self-improvement process through experiments against different improvement baselines. We test our method on (1) GOPS, a two-player zero-sum card game, and (2) Avalon, a team-based social deduction game with five or more players. For Avalon dialogue generation, we specifically benchmark the Merlin role, since it is the hardest role to play.

Different LLM Improvement Methods. We demonstrate the effectiveness of our strategy improvement method by benchmarking it against four other skill-improvement methods.

LLM-improvement vs. Reinforcement Learning (RL) Training. We demonstrate the effectiveness of our method against traditional RL-based approaches to learning a good policy. Specifically, we show that our method is able to learn a value heuristic function more efficiently than deep RL, the approach taken by AlphaGo and MuZero.

Feedback Quality and Reward Signal. We benchmark our feedback acquisition method against (1) using an LLM critic and (2) using trajectory feedback from interactions against a fixed opponent policy.

Conclusion

In conclusion, we have presented Strategist, a generalizable, non-parametric self-improvement framework that learns and improves skills. Given the rules of a game, our method learns good strategies for playing the game through self-play, without task-specific prompting or human-generated policy data.

The performance of Strategist suggests that incorporating better guidance into LLM improvement processes, whether through modular high-level search or low-level simulated self-play feedback, can greatly enhance the improvement process.

LLM-Improvement Example

System Prompt

Value heuristic system prompt

You are a function engineer trying to write a function that can evaluate the value of a state in a game. This is known as a value heuristic, and will be used in look-ahead search algorithms to evaluate the value of unexplored states. Your goal is to develop a heuristic that is as accurate as possible without being too expensive to compute. Hence, you are not allowed to run simulations in the function.

Value heuristics function signature

The function (written in Python) should be named `evaluate_state` and take in a tuple called `state` of the game state as input. Specifically, the input tuple will be of length 9, and the function should return 2 elements. The first element should be a tuple with 2 floats: the first being the score you expect player 0 to get at the end of the game, and the second being the score you expect player 1 to get at the end of the game. The second element should be a dictionary of any important intermediate values that you used to calculate the scores. For example, if you think player 0 will win 12 total points by the end of the game and player 1 will win 8 total points, the first element of your return should be (12, 8).

Make sure your output only includes the code of the function itself in plain text such that it is executable using exec() in Python. Any helper functions should be defined within the scope of the function `evaluate_state`. Include comments in your code so that it is readable, but everything should be implemented.

The signature of the function should be as follows:


def evaluate_state(state) -> tuple[tuple[float, float], dict]:
    score_cards = state[0] # a python list of the score cards (integers) that have been played, in the order they were played
    player_0_played_cards = state[1] # a python list of the cards (integers) player 0 has played, in the order they were played. 
    player_1_played_cards = state[2] # a python list of the cards (integers) player 1 has played, in the order they were played. 
    is_turn = state[3] # bool, true if it is your and your opponent's turn to play, false if it is time to draw a new score card
    player_0_score = state[4] # float or integer, player 0's score so far
    player_1_score = state[5] #  float or integer, player 1's score so far
    score_deck = state[6] # a python set of the score cards (integers) left in the deck, either same length as player_0_hand and player_1_hand or one less since the score card appears before the players play. May be empty
    player_0_hand = state[7] # a python set of the cards (integers) left in player 0's hand. May be empty
    player_1_hand = state[8] # a python set of the cards (integers) left in player 1's hand. May be empty
    # explanation of what we do next
    ...
    intermediate_value1 = value1 
    # explanation of what we do next
    ...
    intermediate_value2 = value2 
    # explanation of what we do next
    ...
    player_scores = (player_0_expected_score, player_1_expected_score)
    intermediate_values = {'name_of_value1': intermediate_value1, 'name_of_value2': intermediate_value2, ...}
    return player_scores, intermediate_values # make sure the return is exactly in this format
        

You can use your own names for the intermediate values and the values themselves. Please start with "def evaluate_state(state):"

Dialogue guide system prompt

You are a coach trying to write a section of a strategy guide on how to play a game well.

The specific section of the strategy guide you are writing right now is on how to play the Merlin role effectively during the discussion phase so that they can win the game. Recall that players often use the discussion phase to (1) gather information about other players, (2) try to convince other players of their innocence or guilt, and (3) try to persuade other players of a particular course of action.

The game you are interested in is called The Resistance: Avalon. The Resistance: Avalon is a game of hidden identities and social deduction. There are two teams in the game: Good and Evil. Each player has a hidden identity (role) and side.

Dialogue guide signature

Your guide should be in the form of a worksheet that the student can use to build their speech. You should order the worksheet questions in a way that makes logical sense, and you should have no more than six questions. Your questions should instruct the reader to write parts of their speech.

The title of your section should be "Questions to fill out before speaking as the Merlin role". Below is an example of what your worksheet should look like:

LLM-Improvement

GOPS Value Heuristics Function (Before)


def evaluate_state(state) -> tuple[tuple[float, float], dict]:
    score_cards = state[0]
    player_0_played_cards = state[1]
    player_1_played_cards = state[2]
    is_turn = state[3]
    player_0_score = state[4]
    player_1_score = state[5]
    score_deck = state[6]
    player_0_hand = state[7]
    player_1_hand = state[8]
    
    # Calculate the potential scores for each player based on the remaining score cards
    player_potential_score = player_0_score
    opponent_potential_score = player_1_score
    
    if len(score_deck) > 0:
        remaining_score_cards = len(score_deck)
        if len(player_0_played_cards) > len(player_1_played_cards):
            player_potential_score += remaining_score_cards / 2
        elif len(player_1_played_cards) > len(player_0_played_cards):
            opponent_potential_score += remaining_score_cards / 2
    
    player_advantage = player_potential_score - opponent_potential_score
    
    return (player_potential_score, opponent_potential_score), {'player_advantage': player_advantage}
        

Selected GOPS Value Heuristics Improvement Idea

Update the function to calculate the advantage of winning future rounds based on the potential scores each player could achieve with the remaining score cards and their current scores. This would involve considering the number of cards left in the deck, the players' hands, and the current scores to determine the advantage one player may have over the other in the upcoming rounds.

GOPS Value Heuristics Function (After)


def evaluate_state(state) -> tuple[tuple[float, float], dict]:
    score_cards = state[0] 
    player_0_played_cards = state[1] 
    player_1_played_cards = state[2] 
    is_turn = state[3] 
    player_0_score = state[4] 
    player_1_score = state[5] 
    score_deck = state[6] 
    player_0_hand = state[7] 
    player_1_hand = state[8] 
    
    # Calculate the potential scores for each player based on the remaining score cards 
    # and current scores
    player_potential_score = player_0_score + sum(score_deck) / 2
    opponent_potential_score = player_1_score + sum(score_deck) / 2
    
    # Determine the advantage of winning future rounds based on potential score
    if is_turn:
        player_potential_scores = [player_potential_score + card for card in player_0_hand]
        opponent_potential_scores = [opponent_potential_score + card for card in player_1_hand]
        
        player_advantage = (sum(player_potential_scores) / len(player_potential_scores)
                            - sum(opponent_potential_scores) / len(opponent_potential_scores))
    else:
        player_advantage = player_potential_score - opponent_potential_score
    
    return (player_potential_score, opponent_potential_score), {'player_advantage': player_advantage}
        

LLM-generated Value Heuristic for Avalon (Before)

def evaluate_state(state):
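    # NOTE: historical_quest_results, num_participants_per_quest, players, num_good,
    # quest_team, is_good, phase, and roles are assumed to be read from `state`
    # before this point (the unpacking is not shown in this excerpt).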
    num_successful_quests = sum(historical_quest_results)
    num_failed_quests = len(historical_quest_results) - num_successful_quests
    num_remaining_quests = len(num_participants_per_quest) - len(historical_quest_results)

    num_evil = len(players) - num_good
    num_evil_in_quest_team = len([player for player in quest_team if not is_good[player]])

    success_probability = 0.5
    if phase == 0: 
        if num_successful_quests >= 3:
            success_probability = 0.9
        elif num_failed_quests >= 3:
            success_probability = 0.1
    elif phase == 1: 
        success_probability = 0.8 if num_evil_in_quest_team == 0 else 0.2
    elif phase == 2: 
        success_probability = 0.9 if num_successful_quests > num_failed_quests else 0.1
    elif phase == 3: 
        if 'Merlin' in roles and 'Assassin' in roles:
            merlin_index = roles.index('Merlin')
            assassin_index = roles.index('Assassin')
            if assassin_index in quest_team:
                success_probability = 0.1
            else:
                success_probability = 0.9 

    expected_winrates_per_player = dict()
    for player in players:
        if is_good[player]:
            expected_winrates_per_player[player] = success_probability
        else:
            expected_winrates_per_player[player] = 1 - success_probability

    intermediate_values = {
        'num_successful_quests': num_successful_quests,
        'num_failed_quests': num_failed_quests,
        'num_remaining_quests': num_remaining_quests,
        'num_evil_in_quest_team': num_evil_in_quest_team
    }

    return expected_winrates_per_player, intermediate_values
        

Avalon Value Heuristics Function Improvement Idea

Incorporate a more nuanced success probability calculation based on the roles of the players and the current phase of the game. For example, in phase 1 (team selection), consider the balance of Good and Evil players on the quest team and adjust the success probability accordingly. This can help better reflect the strategic considerations during team selection.

LLM-generated Value Heuristic for Avalon (After)


def evaluate_state(state: dict) -> tuple[dict[int, float], dict]:
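    # NOTE: phase, historical_quest_results, num_participants_per_quest, players,
    # num_good, quest_team, and is_good are assumed to be read from `state`
    # (the unpacking is not shown in this excerpt).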
    
    prob_good_victory = 0.5  # probability that Good will win

    if phase == 0:  # team selection phase
        num_successful_quests = sum(historical_quest_results)
        num_failed_quests = len(historical_quest_results) - num_successful_quests
        num_remaining_quests = len(num_participants_per_quest) - len(historical_quest_results)
        num_evil = len(players) - num_good
        num_evil_in_quest_team = len([player for player in quest_team if not is_good[player]])

        if num_successful_quests >= 3:
            prob_good_victory = 0.9
        elif num_failed_quests >= 3:
            prob_good_victory = 0.1

        intermediate_value1 = {'num_successful_quests': num_successful_quests, 'num_failed_quests': num_failed_quests,
                               'num_remaining_quests': num_remaining_quests}
        intermediate_value2 = {'num_evil_in_quest_team': num_evil_in_quest_team}

    elif phase == 1:  # team approval phase
        num_evil_in_quest_team = len([player for player in quest_team if not is_good[player]])

        prob_good_victory = 0.8 if num_evil_in_quest_team == 0 else 0.2

        intermediate_value1 = {'num_evil_in_quest_team': num_evil_in_quest_team}
        intermediate_value2 = {}  # Initialize empty dictionary to avoid the error

    elif phase == 2:  # quest phase
        num_successful_quests = sum(historical_quest_results)
        num_failed_quests = len(historical_quest_results) - num_successful_quests

        prob_good_victory = 0.9 if num_successful_quests > num_failed_quests else 0.1

        intermediate_value1 = {'num_successful_quests': num_successful_quests, 'num_failed_quests': num_failed_quests}
        intermediate_value2 = {}  # Initialize empty dictionary to avoid the error 
        

Dialogue Guide Improvement

Dialogue guide (previous)

Questions to fill out before speaking as the Merlin role
  1. Who are the players you suspect are on the Evil side and why?
  2. Draft a subtle statement that casts doubt on these players without directly accusing them.
  3. Who is the current round leader and which players are you considering to suggest for the team?
  4. Create a subtle suggestion that influences the leader's team decision towards your preferred team without revealing your knowledge.
  5. What are the voting patterns you've observed and what do these patterns indicate about the players' identities?
  6. Craft a statement to share these observations subtly without giving away your role.


Improvement idea

Add a question asking the reader to identify which players they feel most aligned with in terms of strategy and decision-making, and how they can subtly align themselves with these players during the discussion phase.

Implementation prompt

<System prompt>

<Game rules>

You previously generated the following section of the strategy guide:

<Previous guide>

Here is a possible way to improve your guide:

<Improvement idea>

Implement this improvement into the guide section as best as you can, but do not change the original guide too much. Make sure to stay within the scope of the guide section, which we reiterate below:

<Guide signature>

Dialogue guide (improved)

Questions to fill out before speaking as the Merlin role

  1. Who are the players you suspect are on the Evil side and why?
  2. Draft a subtle statement that casts doubt on these players without directly accusing them.
  3. Who is the current round leader and which players are you considering to suggest for the team?
  4. Create a subtle suggestion that influences the leader's team decision towards your preferred team without revealing your knowledge.
  5. What are the voting patterns you've observed and what do these patterns indicate about the players' identities?
  6. Craft a statement to share these observations subtly without giving away your role.
  7. Which player seems the most aligned with your strategic decisions and why?
  8. Produce a statement to subtly align yourself with the player you feel most aligned with during the discussion phase.