My research interests are in AI behavioral eval tooling, measurement epistemology, and the economics of innovation. Below, I outline some ongoing projects as well as some of the work I’ve done in the past.
Evaluation Tooling
Even when AI models perform well on safety evaluations, I'm concerned that this behavior may not hold up when models are given access to new tools or deployed in settings that differ from those they were tested in. This work examines whether current evaluation tools can reliably distinguish models that are aligned from those that are merely performing alignment.
One way to think about behavior in AI agents is through their internal consistency. Pres et al. (2026) have a great paper on this, and we also see it empirically in, e.g., Irpan et al. (2025). In this project, I examine this phenomenon in the coding-agent setting and test how well agents can predict their own pass@k rates on various benchmark problems.
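For readers unfamiliar with the metric: pass@k is typically computed with the standard unbiased estimator (popularized by the Codex evaluation work), which estimates the probability that at least one of k sampled solutions passes, given n samples of which c passed. A minimal sketch (the function name is mine, not from any project code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least 1 of k samples passes),
    given n total samples of which c passed."""
    if n - c < k:
        # Fewer failures than k: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passed, `pass_at_k(2, 1, 1)` gives 0.5. Asking an agent to predict this quantity for a problem, then comparing against its measured value, is one way to probe how internally consistent its self-model is.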
Current models often appear aware that they're being evaluated (Needham et al. (2025)). One dimension of this that hasn't been tested much is whether models are aware when they're talking to other models in evals, which is especially important in multi-turn, agentic settings. In this project, I explore whether eval awareness varies depending on who the model is interacting with, and what kinds of interventions might reduce eval awareness in settings where models interact with other models.
Measurement Epistemics
The quality of analysis -- and by extension the findings it produces -- depends on whether the right things are being measured, and whether they're being measured well. Recently, I've been thinking about this in the context of AI agent benchmarks, where I feel there's a need for more rigorous thought about metric quality. In the past, I've thought about this in the context of health policy.
Though the settings are very different, my health policy work has informed how I think about AI benchmarks today, because both fields grapple with measuring inherently messy, multidimensional phenomena. In health policy, there's a strong tradition of recognizing that many metrics are just proxies for what we actually care about, and that optimizing for a proxy (like a lab value) without treating it as such can lead you away from the real goal (like patient well-being). I feel that AI benchmarks would benefit from that same discipline, since the field too often treats proxy scores as end goals without interrogating whether they truly capture the capability or behavior we're trying to measure.
In this blog post, I do an open-ended exploration of the dimensionality of model capabilities captured in MMLU-Pro, and consider whether asking questions across domains might improve our understanding of model abilities.
- Huang, R. W. (2025). "MMLU-Pro Isn't Multidimensional (It Just Pretends to Be)." Blog post. link
These projects explored the effectiveness of an intervention that sought to support informal caregivers in rural communities. Informal caregiving is a particularly interesting phenomenon in health policy, since its effects are not well captured by traditional sources of economic data. One thing that stood out to me was the difficulty of quantifying the reduction in leisure time. It's relatively easy to quantify the wages caregivers lose by being unable to work, but lost leisure time can also be a serious burden, and it isn't easily captured in out-of-pocket costs.
- Kaufman, B. G., Huang, R. W., et al. (2024). J Am Geriatr Soc 72(8). doi
- Kaufman, B. G., Zhang, W., ..., Huang, R. W., et al. (2024). J Pain Symptom Manage 68(6). doi
These projects explored the application of social disadvantage indices, which are measurements that aggregate various socioeconomic factors associated with a region. These indices have been found to associate strongly with health outcomes, and are thus often used in health policy. My work here questions whether these indices can be used interchangeably.
- Zolotor, A., Huang, R. W., Bhavsar, N. A., & Cholera, R. (2024). Pediatrics. doi
- Huang, R. W., Zolotor, A. F., et al. (2024). AcademyHealth Research Meeting.
Economics of Innovation
I'm interested in the factors necessary to promote innovation and how we can measure the utility of the marginal idea.
These projects explored how to promote pharmaceutical innovation for diseases that primarily burden low-income countries, where the traditional economic incentives for R&D don't exist. While doing this work, I focused on how institutional systems, such as public-private development partnerships and international aid, coordinated research efforts and shaped the priorities of drug development in these settings.
- McDade, K. K., Mao, W., Prizzon, A., Huang, R. W., & Ogbuoji, O. (2023). Front Public Health 11. doi
- Huang, R. W., McDade, K. K., Yamey, G., & Mao, W. (2022). CUGH Conference.
I co-authored this blog post with Richard Frank during my summer internship at the Brookings Institution. Many pharmaceutical companies claimed that the Inflation Reduction Act (IRA) would reduce pharmaceutical innovation. The post challenges that claim by observing that M&A activity in the pharmaceutical industry has shown little sign of disruption since the IRA was enacted.
- Frank, R. G. & Huang, R. W. (2023). "Early claims and M&A behavior following enactment of the drug provisions in the IRA." Brookings Institution. link
Miscellaneous
- Huang, R. W. & Barber, A. D. (2021). Cortex 136. doi
- Huang, R. W.*, Fang, Y.*, et al. (2022). HSR Conference, Bogota.