Anthropic - Core Views on AI Safety: When, Why, What, and How

https://www.anthropic.com/index/core-views-on-ai-safety

Unfortunately, I find this awfully unpersuasive on AI risk.

  • There’s really no discussion anywhere here of anyone’s incentives. Incentives of the founders, of the funders, of individual employees, of ex-employees, of the org-as-entity. This is troubling because, in every case, these entities will face enormous net incentives to accelerate. If the claim is that they will resist those incentives, that claim needs extended justification.
  • There’s no discussion of the ecosystem: how will Anthropic’s actions affect (or fail to affect) other labs? If they publish safety work, will other labs pay any attention? Why/how? How will Anthropic relate to governments?
  • Their two-part model of safety risks seems to completely ignore the category of “bad people doing bad things”, which is at the moment much more acutely concerning to me than risks of technical alignment failures.
  • They mention a research program focused on societal impacts and evaluations, but I notice that I have zero faith in this. Maybe I just don’t know the appropriate people at Anthropic, but my impression is that they’re not connected to the professional spheres they need for this, at all.
  • Systematically, why should we trust Anthropic? Why shouldn’t we advocate for a complete halt to their work? (another way of looking at the penalties point you make…)
  • They’re really excited about mechanistic interpretability. I find the results there super interesting personally, but given that direction’s pace to date, I notice that I’m awfully pessimistic that it will help much “in time”, particularly for their second safety category of “massive disruption”. Maybe I’m being too pessimistic.

Per MN: they made a terrible, awful call in trusting FTX/SBF, and this document utterly fails to grapple with what that says about their judgment. From Oliver at Lightcone:

I feel quite worried that the alignment plan of Anthropic currently basically boils down to "we are the good guys, and by doing a lot of capabilities research we will have a seat at the table when AI gets really dangerous, and then we will just be better/more-careful/more-reasonable than the existing people, and that will somehow make the difference between AI going well and going badly". That plan isn't inherently doomed, but man does it rely on trusting Anthropic's leadership, and I genuinely only have marginally better ability to distinguish the moral character of Anthropic's leadership from the moral character of FTX's leadership, and in the absence of that trust the only thing we are doing with Anthropic is adding another player to an AI arms race.

Last updated 2023-04-07.