---
library_name: stable-baselines3
tags:
- PandaReachDense-v3
- deep-reinforcement-learning
- reinforcement-learning
- stable-baselines3
model-index:
- name: A2C
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: PandaReachDense-v3
      type: PandaReachDense-v3
    metrics:
    - type: mean_reward
      value: -0.24 +/- 0.13
      name: mean_reward
      verified: false
---

# A2C Agent for PandaReachDense-v3

## Model Description

This repository contains a trained Advantage Actor-Critic (A2C) reinforcement learning agent for the PandaReachDense-v3 environment from panda-gym, a PyBullet-based robotics suite. The agent was trained with the stable-baselines3 library to perform reaching tasks with a simulated Franka Emika Panda robot arm.

### Model Details

- **Algorithm**: A2C (Advantage Actor-Critic)
- **Environment**: PandaReachDense-v3 (panda-gym / PyBullet)
- **Framework**: Stable-Baselines3
- **Task Type**: Continuous control
- **Action Space**: Continuous, 3-dimensional end-effector displacement (the environment's default control mode)
- **Observation Space**: Dict observation containing the end-effector position and velocity, the achieved goal (current end-effector position), and the desired goal (target position)

### Environment Overview

PandaReachDense-v3 is a robotic manipulation task where:

- **Objective**: Control a 7-DOF Franka Panda robotic arm so that its end-effector reaches randomly placed target positions
- **Reward Structure**: Dense reward based on the distance between the end-effector and the target; rewards approach zero as the gripper gets closer
- **Difficulty**: Goal-conditioned continuous control requiring precise positioning across varied targets

## Performance

The trained A2C agent achieves the following performance metrics:

- **Mean Reward**: -0.24 ± 0.13
- **Performance Context**: This represents strong performance for this environment, where typical untrained baselines often score around -3.5
- **Training Stability**: The relatively low standard deviation indicates consistent performance across evaluation episodes

### Performance Analysis

The achieved mean reward of -0.24 demonstrates a significant improvement over random baselines. In PandaReachDense-v3, rewards are negative and approach zero as the agent becomes more proficient at reaching targets. The substantial improvement over the approximately -3.5 baseline indicates the agent has learned to:

- Move the robotic arm efficiently toward target positions
- Reach targets quickly, since the dense reward penalizes the remaining distance at every step
- Achieve consistent reaching behavior across varied target locations
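The reported figure can be reproduced with Stable-Baselines3's built-in `evaluate_policy` helper. The sketch below is illustrative: it assumes the repo id and filename used elsewhere in this card, and the number of evaluation episodes is an arbitrary choice (installation steps are listed in the Usage section below).

```python
import gymnasium as gym
import panda_gym  # registers the Panda environments with Gymnasium
from huggingface_sb3 import load_from_hub
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Download the checkpoint from the Hub and load it
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Monitor records per-episode returns, which evaluate_policy aggregates
env = Monitor(gym.make("PandaReachDense-v3"))
mean_reward, std_reward = evaluate_policy(
    model, env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
env.close()
```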
## Usage

### Installation Requirements

```bash
pip install stable-baselines3[extra]
pip install huggingface-sb3
pip install panda-gym  # provides PandaReachDense-v3 (PyBullet-based)
```

### Loading and Using the Model

```python
import gymnasium as gym
import panda_gym  # registers the Panda environments
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hub and load it
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create the environment (render_mode="human" opens the PyBullet viewer)
env = gym.make("PandaReachDense-v3", render_mode="human")

# Run the trained policy
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```

### Advanced Usage: Fine-tuning

```python
import gymnasium as gym
import panda_gym
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub

# Load the pre-trained model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)

# Create the environment for fine-tuning
env = gym.make("PandaReachDense-v3")

# Continue training (fine-tuning)
model.set_env(env)
model.learn(total_timesteps=100_000)

# Save the fine-tuned model
model.save("fine_tuned_a2c_panda")
```

### Evaluation Script

```python
import gymnasium as gym
import numpy as np
import panda_gym
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub


def evaluate_model(model, env, num_episodes=10):
    """Evaluate the model's performance over multiple episodes."""
    episode_rewards = []

    for episode in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0.0
        done = False

        while not done:
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated

        episode_rewards.append(episode_reward)
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}")

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    print("\nEvaluation Results:")
    print(f"Mean Reward: {mean_reward:.2f} ± {std_reward:.2f}")
    return episode_rewards


# Load and evaluate the model
checkpoint = load_from_hub(
    repo_id="Adilbai/a2c-PandaReachDense-v3",
    filename="a2c-PandaReachDense-v3.zip",
)
model = A2C.load(checkpoint)
env = gym.make("PandaReachDense-v3")
rewards = evaluate_model(model, env, num_episodes=20)
env.close()
```

## Training Information

### Hyperparameters

The model was trained using A2C with the following key characteristics (an illustrative training sketch follows the Training Environment subsection below):

- **Policy**: MLP actor and critic networks (Stable-Baselines3's `MultiInputPolicy`, required by the environment's Dict observation space)
- **Environment**: PandaReachDense-v3 with dense reward shaping
- **Training Framework**: Stable-Baselines3

### Training Environment

- **Observation Space**: Dict observation containing:
  - End-effector position and velocity (`observation`)
  - Current end-effector position (`achieved_goal`)
  - Target position (`desired_goal`)
- **Action Space**: 3-dimensional continuous control of the end-effector displacement (default control mode)
- **Reward Function**: Dense reward proportional to the negative distance between the end-effector and the target
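The exact hyperparameters used for this checkpoint are not recorded in this card. As a hedged sketch of how a comparable agent could be trained from scratch, the snippet below uses default A2C settings; the number of parallel environments and the timestep budget are illustrative assumptions, not the values used for this model.

```python
import panda_gym  # registers the Panda environments with Gymnasium
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# A2C is on-policy; parallel environments speed up rollout collection
# (n_envs=4 is an assumption for illustration)
env = make_vec_env("PandaReachDense-v3", n_envs=4)

# MultiInputPolicy handles the Dict observation space
# (observation / achieved_goal / desired_goal)
model = A2C("MultiInputPolicy", env, verbose=1)

# Timestep budget is illustrative, not the value used for this checkpoint
model.learn(total_timesteps=1_000_000)
model.save("a2c-PandaReachDense-v3")
```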
## Limitations and Considerations

- **Environment Specificity**: The model is trained specifically for PandaReachDense-v3 and may not generalize to other robotic tasks
- **Simulation Gap**: Trained entirely in simulation; real-world deployment would require domain adaptation
- **Deterministic Evaluation**: Performance metrics are based on deterministic policy evaluation
- **Hardware Requirements**: Real-time inference requires only modest computational resources

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{a2c_panda_reach_2024,
  title={A2C Agent for PandaReachDense-v3},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Adilbai/a2c-PandaReachDense-v3}}
}
```

## License

This model is distributed under the MIT License. See the repository for full license details.