Anthropic is seeking a Research Engineer to join its Interpretability team in San Francisco. The team focuses on reverse-engineering how trained models work in order to make advanced systems safe through mechanistic understanding. The role involves implementing and analyzing research experiments, optimizing research workflows, and building tools that support rapid experimentation and improve model safety.
The Research Engineer will collaborate with teams across Anthropic, such as Alignment Science and Societal Impacts, to apply interpretability findings to improving model safety. They will also contribute to the Interpretability Architectures project in collaboration with the Pretraining team.
Responsibilities include:
Requirements:
Anthropic offers: