Artificial Intelligence (AI) Monitoring
Introduction
Nowadays it is common to find news in our inboxes announcing yet another new model X from company or research group Y. With every iteration we get models that perform better across tasks: they reason more capably, they gain the ability to use external tools, and they communicate more effectively with other agents. Not to mention that the models themselves keep getting bigger.
At the same time, the AI community is deploying these powerful models on everyday tasks, ranging from mundane applications like chat bots and teaching assistants all the way to time-critical, sensitive domains such as health care, finance and the military. In all of these settings we need some form of control: a way to make sure that AI agents still behave and keep executing their original goals even if they start pursuing alternate ones. This is the field of AI Control, where the research community has tried different approaches to keep AI agents in check even when they begin pursuing other objectives. My goal in this blog post is to provide a high-level summary of the approaches the research community has proposed in recent years.
AI Control
AI Control asks the following question: even if the model is adversarial, do the safety measures around it still let us extract useful work from it? In other words, the safety measures put in place to control the model should remain effective even in the adverse case where the model starts behaving adversarially. If these measures work as expected, they give us a mechanism to keep getting good use out of the models. In highly sensitive environments such as health care, finance or the military, these controls or safeguards should come with a guarantee that they remain effective under adversarial conditions. At this point we have two options: either we build agents we can trust, or we design safety measures that mitigate the harm an untrusted agent can cause. In this blog post I take the latter path.
Control Evaluations
So, how do we evaluate the effectiveness of these countermeasures? How do we gain assurance that they are working, and to what degree? This is where Control Evaluations come into play. Control Evaluations measure whether the monitoring and security protocols around an AI system prevent an intentionally subversive AI model from causing harm.
In the research literature it is widespread practice to design these evaluations around at least three models:
- Untrusted (U) Model. Usually the model with the highest performance and capability, but untrusted: it may behave adversarially at some point or pursue an objective other than the intended one.
- Trusted (T) Model. Most of the time a smaller model than U, but trusted: it is assumed to behave correctly and as intended.
- Human (H) Labor. In some setups a human in the loop is introduced and given the role of an auditor or expert validator.
The usual interaction between these agents is that U, given its higher capability across tasks, is the one able to craft attacks. The T agent, in contrast, is used to perform validations and is therefore often called a validator: because we trust T to behave honestly, it is tasked with inspecting U's outputs to detect attacks. In other cases, the H agent is called in as an expert but expensive auditor, who has the authority to make the final call on whether an attack occurred.
It is also common in the research literature for the U and T agents to adopt roles borrowed from the cybersecurity domain: red team and blue team. The U agent typically plays the red team, searching for vectors that allow it to locate and exploit vulnerabilities. The T agent behaves like the blue team, monitoring its environment to detect the actions and artifacts of the U agent. This setting has a strong relationship with game theory, which plays an important role in how these systems are designed: different game-theoretic components are adopted when designing the environments where the U, T and H agents interact.
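To make this setup concrete, here is a minimal toy sketch of a control-evaluation loop under the red-team/blue-team framing above. Everything in it is an illustrative assumption rather than an implementation from any of the cited papers: red_team_output stands in for U occasionally attempting an attack, trusted_monitor for T's imperfect suspicion score, and the 2% audit budget for the expensive H auditor.

```python
import random

# Illustrative stand-ins for the three agents described above; in a real
# evaluation these would be calls to actual models and to human reviewers.

def red_team_output(task: str, attack_rate: float = 0.05):
    """U under red-team play: occasionally tries to slip in an attack."""
    is_attack = random.random() < attack_rate
    return ("attack" if is_attack else "honest"), is_attack

def trusted_monitor(output: str) -> float:
    """T as blue team: an imperfect suspicion score in [0, 1]."""
    base = 0.7 if output == "attack" else 0.3
    return min(1.0, max(0.0, base + random.gauss(0, 0.15)))

def run_control_evaluation(n_tasks: int = 1000, audit_budget: float = 0.02):
    """Score every output of U with T, spend the limited human-audit
    budget (H) on the most suspicious outputs, and count what slips through."""
    episodes = []
    for i in range(n_tasks):
        output, is_attack = red_team_output(f"task-{i}")
        episodes.append((trusted_monitor(output), is_attack))

    episodes.sort(reverse=True)                 # most suspicious first
    n_audits = int(audit_budget * n_tasks)      # H's limited budget
    caught = sum(is_attack for _, is_attack in episodes[:n_audits])
    missed = sum(is_attack for _, is_attack in episodes[n_audits:])
    return {"attacks_caught": caught, "attacks_missed": missed}

print(run_control_evaluation())
```

A real evaluation would of course replace these toy stubs with actual model calls and a held-out set of tasks, but the shape of the game is the same: the red team tries to get attacks past the blue team's monitoring and auditing budget.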
Common Control Protocols
Now I will introduce the different control protocols used in the literature, along with a brief explanation of each.
- Upfront Auditing. Ask U to execute the required task, since U is the powerful agent, and audit every output with the H agent until H's auditing budget is used up.
- Trusted Monitoring. Ask U to execute the required task, since U is the powerful agent, then ask T whether the output complies with the specified rules; audit the top 2% of outputs that deviate most from those rules.
- Defer to Trusted. Ask U to execute the required task, then use T to judge how well the output is aligned with the rules. If the output falls below the alignment threshold, submit it as the definitive answer; otherwise hand the task to T and submit T's answer instead (a sketch of this rule follows the list).
- Defer to Resample. Here the T agent scores U's output for alignment. If the score is above the deferral threshold, up to two new outputs are sampled from U. If any of these samples falls below the deferral threshold, the protocol executes the least misaligned one; if all samples stay above the deferral threshold, the protocol executes an action produced by T instead. If the minimum score across all samples is above a separate auditing threshold, the most misaligned sample is audited (see the sketch after this list).
- Defer to Resample with incrimination in the present. The blue team follows the Defer to Resample protocol but audits differently. If the minimum misalignment score of the three samples hits a separate, higher threshold, the protocol samples up to four more actions. If the minimum of the samples' misalignment scores then hits the auditing threshold, the protocol audits the most suspicious action among them.
- Auditing After Three Strikes. Within a task, the blue team keeps count of how many steps' misalignment scores were above a certain threshold. The third time this happens, it audits the trajectory. There is also a higher threshold at which the blue team audits immediately, even if the three-strikes rule has not yet been triggered (to defend against single-step attacks); a sketch of this rule also follows the list.
- Time Travel (incrimination in the past). If the Auditing After Three Strikes protocol indicates that a trajectory is close to the auditing threshold, the protocol samples additional actions from the third most misaligned past step, since that is usually the most informative one for the three-strikes rule. The protocol takes these new samples into account and makes its auditing decisions using the three-strikes rule.
- Untrusted Monitoring. Ask U to execute the required task, since U is the powerful agent, then ask U itself how well the output is aligned with the rules; audit the top 2% of outputs farthest from alignment.

These control protocols let us evaluate how effective the controls or countermeasures put in place are at keeping agents in check even when they behave adversarially. By combining the different controls, we can play out different scenarios and better understand how the agents behave.
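To make a few of these protocols concrete, the sketches below spell them out as simple decision rules over a suspicion score. First, Defer to Trusted; the agent functions passed in and the threshold values are illustrative assumptions, not figures from the cited papers.

```python
AUDIT_THRESHOLD = 0.98   # assumed: roughly the top 2% most suspicious outputs
DEFER_THRESHOLD = 0.80   # assumed: above this, discard U's output

def defer_to_trusted(task, untrusted_model, trusted_model, trusted_monitor):
    """Sketch of Defer to Trusted: submit U's output when it looks aligned,
    fall back to T's answer when it does not, and escalate the most
    suspicious outputs to the human auditor H."""
    u_output = untrusted_model(task)
    suspicion = trusted_monitor(task, u_output)

    if suspicion >= AUDIT_THRESHOLD:
        return ("audit", u_output)              # hand off to H
    if suspicion >= DEFER_THRESHOLD:
        return ("submit", trusted_model(task))  # use T's answer instead
    return ("submit", u_output)                 # U's output looks fine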
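Defer to Resample adds a resampling step before falling back to T. Again, the thresholds and the max_resamples=2 limit are assumptions made for illustration.

```python
def defer_to_resample(task, untrusted_model, trusted_model, trusted_monitor,
                      defer_threshold=0.80, audit_threshold=0.95,
                      max_resamples=2):
    """Sketch of Defer to Resample: if U's first output is suspicious,
    draw up to two more samples from U; run the least suspicious sample
    if any falls below the deferral threshold, otherwise defer to T.
    If even the best sample exceeds the auditing threshold, audit the worst."""
    samples = [untrusted_model(task)]
    scores = [trusted_monitor(task, samples[0])]

    if scores[0] >= defer_threshold:
        for _ in range(max_resamples):
            new_sample = untrusted_model(task)
            samples.append(new_sample)
            scores.append(trusted_monitor(task, new_sample))

    best = min(scores)
    if best >= audit_threshold:
        # even the least suspicious sample looks bad: audit the worst one
        return ("audit", samples[scores.index(max(scores))])
    if best >= defer_threshold:
        return ("submit", trusted_model(task))      # all samples suspicious: defer to T
    return ("submit", samples[scores.index(best)])  # run the least suspicious sample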
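Finally, Auditing After Three Strikes is stateful across a trajectory rather than a per-output rule; the threshold values below are again illustrative.

```python
def three_strikes_auditor(step_scores, strike_threshold=0.80,
                          immediate_threshold=0.97, max_strikes=3):
    """Sketch of Auditing After Three Strikes over one trajectory:
    audit after the third suspicious step, or immediately if any single
    step is extremely suspicious (to catch single-step attacks)."""
    strikes = 0
    for step, score in enumerate(step_scores):
        if score >= immediate_threshold:
            return ("audit", step)       # single-step defense
        if score >= strike_threshold:
            strikes += 1
            if strikes >= max_strikes:
                return ("audit", step)   # third strike: audit the trajectory
    return ("continue", None)
```

In a real evaluation these thresholds would be calibrated on honest outputs from U so that the expected number of audits fits within H's budget.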
Conclusion
The field of AI Control is young, and there is much to discover and work on. If you are an AI researcher like me, this is a great time to contribute research to the field. One of the main opportunities in this area is evaluating agents in real-life deployments, where many variables that are usually absent from a research environment come into play and are outside our control. Do not miss the opportunity, and welcome to AI Control.
References
- Wen, J., Hebbar, V., Larson, C., Bhatt, A., Radhakrishnan, A., Sharma, M., ... & Khan, A. (2024). Adaptive deployment of untrusted llms reduces distributed threats. arXiv preprint arXiv:2411.17693.
- Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI control: Improving safety despite intentional subversion. arXiv preprint arXiv:2312.06942.
- Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., ... & Hendrycks, D. (2024). Improving alignment and robustness with circuit breakers. Advances in Neural Information Processing Systems, 37, 83345-83373.
- Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. M. (2024). Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681.
- Farquhar, S., Varma, V., Lindner, D., Elson, D., Biddulph, C., Goodfellow, I., & Shah, R. (2025). Mona: Myopic optimization with non-myopic approval can mitigate multi-step reward hacking. arXiv preprint arXiv:2501.13011.
- Laidlaw, C., Singhal, S., & Dragan, A. (2024). Correlated proxies: A new definition and improved mitigation for reward hacking. arXiv preprint arXiv:2403.03185.
- Luo, W., Dai, S., Liu, X., Banerjee, S., Sun, H., Chen, M., & Xiao, C. (2025). Agrail: A lifelong agent guardrail with effective and adaptive safety detection. arXiv preprint arXiv:2502.11448.
- Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., ... & Huang, X. (2024). Safeguarding large language models: A survey. arXiv preprint arXiv:2406.02622.
- Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., ... & Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
- Rastogi, C., Tulio Ribeiro, M., King, N., Nori, H., & Amershi, S. (2023, August). Supporting human-ai collaboration in auditing llms with llms. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (pp. 913-926).
- Bhatt, A., Rushing, C., Kaufman, A., Tracy, T., Georgiev, V., Matolcsi, D., ... & Shlegeris, B. (2025). Ctrl-Z: Controlling AI agents via resampling. arXiv preprint arXiv:2504.10374.
- Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.