This post is incomplete. I'm publishing it because it might be helpful for some readers anyway. A good version of this post would be more detailed: for each proposed action, explain motivation and high-level goal, explain problem to solve, respond to obvious questions/objections, and lay out costs involved and costs worth accepting.
These are the actions I think companies that will build very powerful AI systems should prioritize to improve safety. I think something like these actions are cost-effective,1 but specific details may be suboptimal. And some actions are not fully specified; some are of the form do good things in area X or avoid risk Y, without pinning down what to do. And this is not certain to be correct in the future; I just think it's a good plan for now. Some of these actions are costly, but I claim that they are cheap and important enough that a responsible frontier AI company should do them unilaterally — the gain in safety is worth the cost in competitiveness. I mostly don't justify these actions in this post.
(This post is about responsible labs — asks for non-responsible labs should be more focused on transparency and becoming responsible. This post is about unilateral action — in coordinated action you can accept more costs.)
Security for model weights and code
Extremely strong security will be crucial when AIs can massively accelerate AI R&D. But operating under extreme security—security sufficient to defend against top-priority operations by top state actors—is costly and not yet necessary. So labs should focus on achieving extreme security optionality—the ability to implement extreme security quickly2—before it would be very bad for AI model weights to be stolen.3 It seems such AI could plausibly appear as early as 2027; if so, labs should plan to achieve extreme security optionality before 2027.4 The lab should develop a roadmap to achieve extreme security optionality, publicly explain and justify its high-level security goals, and maybe share some details and publish stuff to validate claims (e.g. a pentester's summary of their pentest results) with sensitive details redacted.5 It is also reasonable to aim for a weaker level of security, SL5, for now, but that will probably be insufficient when AIs can massively accelerate research.
Control
Anticipate and prevent the risk that AIs will scheme—i.e., act safe and subtly subvert safety measures to gain power—and escape during internal deployment.6 The lab should do evals for scheming-relevant capabilities, and by the time models can plausibly scheme coherently, be ready to ensure control, e.g.:
Use monitoring and auditing techniques in deployment
Use control evals based on the deployment protocol: determine how good the protocol is assuming the model is scheming
Try to directly test whether the model is scheming (even though this is unreliable): test the model on inputs that might cause it to misbehave in simulated high-stakes deployment settings, and test it in simulated successful escapes
If the lab catches scheming, take it seriously: tell everyone, pause deployment—or go into high alert, disabling some risky stuff and lowering thresholds for editing/replacing/auditing suspicious model outputs—and use that example of misbehavior to improve safety
Misuse
Anticipate and prevent harm from misuse during external deployment. Do good model evals for dangerous capabilities. Do evals pre-deployment (to inform deployment decisions), have good threat models and categories of evals, use high-quality evals, use good elicitation, publish evals and results, share good access with third-party evaluators. Once models might have dangerous capabilities, deploy such that users can't access those capabilities: deploy via API or similar rather than releasing weights, monitor queries to block suspicious messages and identify suspicious users, maybe do harmlessness/refusal training and adversarial robustness. If a particular deployment plan incurs a substantial risk of catastrophic misuse, deploy more narrowly or cautiously.
Accountability for future deployment decisions
Plan that by the time AIs are capable of tripling the rate of R&D, the lab will do something like this:7
Do great risk assessment, especially via model evals for dangerous capabilities. Present risk assessment results and deployment plans to external auditors — perhaps make risk cases and safety cases for each planned deployment, based on threat modeling, model eval results, and planned safety measures. Give auditors more information upon request. Have auditors publish summaries of the situation, how good a job you're doing, the main issues, their uncertainties, and an overall risk estimate.
Plan for AGI
Labs should have a clear, good plan for what to do when they build AGI—i.e., AI that can obsolete top human scientists—beyond just making the AGI safe.8 For example, one plan is perfectly solve AI safety, then hand things off to some democratic process. This plan is bad because solving AI safety will likely take years after systems are transformative and superintelligence is attainable, and by default someone will build superintelligence without strong reason to believe they can do so safely or do something else unacceptable during that time.
The outline of the best plan I've heard is build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer9 to them (before building wildly superintelligent AI). Assume it will take 5-10 years after AGI to build such systems and give them sufficient time. To buy time (or: avoid being rushed by other AI projects10), inform the US government and convince it to enforce nobody builds wildly superintelligent AI for a while (and likely limit AGI weights to allied projects with excellent security and control). Use AGI to increase the government's willingness to pay for nonproliferation and to decrease the cost of nonproliferation (build carrots to decrease cost of others accepting nonproliferation and build sticks to decrease cost to US of enforcing nonproliferation).11
For now, the lab should develop and prepare to implement such a plan. It should also publicly articulate the basic plan, unless doing so would be costly or undermine the plan.
Safety research
Do and share safety research as a public good, to help make powerful AI safer even if it's developed by another lab.12 Also boost external safety research, especially by giving safety researchers deeper access13 to externally deployed models.
Policy engagement
Inform policymakers about AI capabilities; when there is better empirical evidence on misalignment and scheming risk, inform policymakers about that; don't get in the way of good policy.
Let staff share views
Encourage staff to discuss their views on AI progress, risks, and safety, both internally and publicly (except for confidential information), or at least avoid discouraging staff from doing so. (Facilitating external whistleblowing has benefits but some forms of it are costly, and the benefits are smaller for more responsible labs, so I don't confidently recommend unilateral action for responsible labs.)
Public statements
The lab and its leadership should understand and talk about extreme risks. The lab should clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).
If-then planning (meta)
Plan safety interventions as a function of dangerous capabilities, warning signs, etc. Rather than acting based on guesses about which risks will appear, set practices in advance as a function of warning signs. (This should satisfy both risk-worried people who expect warning signs and risk-unworried people who don't.)
More
I list some less important or more speculative actions (very nonexhaustive) in this footnote.14 See The Checklist: What Succeeding at AI Safety Will Involve (Bowman 2024) for a collection of goals for a lab. See A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed but rough, out-of-date, and very high-context plan. See Racing through a minefield: the AI deployment problem (Karnofsky 2022) on the dilemma labs face when deciding how to deploy powerful AI and high-level actions for a lab to improve safety. See Towards best practices in AGI safety and governance: A survey of expert opinion (Schuett et al. 2023) for an old list.
Almost none of this post is original. Thanks to Ryan Greenblatt for inspiring much of it, and thanks to Drake Thomas, Rohin Shah, Alexandra Bates, and Gabriel Mukobi for suggestions. They don't necessarily endorse this post.
Discuss on LessWrong.
With two exceptions, where I feel unconfident: misuse prevention (misuse has positive second-order effects) and accountability for risk assessment (its causal benefit may be small if done unilaterally).
Extreme security optionality is the ability to implement extreme security quickly, e.g. within 3 months. Also extreme security should be implementable at a reasonable maximum cost, e.g $10B to implement plus $10B per year and a 75% slowdown in research — the property the lab could implement extreme security but it would require basically stopping research or incurring other prohibitive costs doesn't suffice.
I focus on extreme security for model weights and code, not necessarily for secrets. Securing weights will be more important; it seems easier; and extreme security for secrets entails that staff are much less able to talk to people outside the lab, which could hurt their ability to think clearly.
Labs will likely need substantially better security than defend against top-priority operations by top state actorssoon after AIs can massively accelerate AI R&D: model weights will be among the most valuable and strategically important assets ever, so labs should be prepared for "unprecedentedly well-resourced attacks," not to mention that AI will boost offense.
Proliferation of model weights would lead to others quickly getting more powerful AI, since the AI can massively boost AI R&D. This removes the option of coordination that requires nonproliferation, enables misuse and state conflict, makes it hard for US labs to gain a substantial lead, and makes it impossible for US labs to unilaterally slow progress while retaining a lead.
By the time extreme security is necessary, AIs will be able to substantially boost security. So for now, labs should focus on the things that will be hard for AI to accelerate: things that require nonparallelizable physical work. For example, labs should probably focus on physically constructing or retrofitting datacenters such that they can be extremely secure.
If it becomes clear that the lab will achieve extreme security optionality before it's necessary, the lab could choose to revise the plan and slow its progress toward extreme security.
Would an if-then security commitment—don't create a very dangerous model until its weights would be secure—suffice? Sure, but I worry that extreme security takes substantial time to implement (in part because it likely requires physical changes to datacenters) and the labs are not on track to implement it before they need it, so a lab will either proceed unsafely or pause while the less scrupulous race ahead. Regardless of whether a lab has such a commitment, I think it should start sprinting for extreme security optionality now.
Maybe a large fraction of AI risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored."
Escape means "either escaping from the datacenter or getting root access to it." A central story for AI escaping goes like this:
At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they're seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it's worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories.
This is partially a substitute for responsible scaling policies. Labs should still make at least high-level RSP-y (or "if-then") commitments too.
A plan could include how to use AGI, what further AI development to do, and what the exit state is if all goes well. A plan should consider not just what the lab will do but also what will be happening with the US government, other labs, and other states.
Really the goal is a playbook, not just a plan — it should address what happens if various assumptions aren't met or things are different than expected, should include ideas that are worth exploring/trying but not depending on, etc.
We don't need these human-obsoleting AIs to be able to implement CEV. We want to be able to defer to them on tricky wisdom-loaded questions like what should we do about the overall AI situation? They can ask us questions as needed.
To avoid being rushed by your own AI project, you also have to ensure that your AI can't be stolen and can't escape, so you have to implement excellent security and control.
This plan, and desire for this kind of plan, is largely stolen from Ryan Greenblatt. He doesn't necessarily endorse my version.
Dario Amodei recently said something related:
it seems very important that democracies have the upper hand on the world stage when powerful AI is created. . . . My current guess at the best way to do this is via an “entente strategy”, in which a coalition of democracies seeks to gain a clear advantage (even just a temporary one) on powerful AI by securing its supply chain, scaling quickly, and blocking or delaying adversaries’ access to key resources like chips and semiconductor equipment. This coalition would on one hand use AI to achieve robust military superiority (the stick) while at the same time offering to distribute the benefits of powerful AI (the carrot) to a wider and wider group of countries in exchange for supporting the coalition’s strategy to promote democracy (this would be a bit analogous to “Atoms for Peace”). The coalition would aim to gain the support of more and more of the world, isolating our worst adversaries and eventually putting them in a position where they are better off taking the same bargain as the rest of the world: give up competing with democracies in order to receive all the benefits and not fight a superior foe.
(Among other things, this does not address the likely possibility that the democracies should delay building wildly superintelligent AI.)
This is slightly complicated because (1) some safety research boosts capabilities and (2) some safety practices work better if their existence (in the case of control) or details (in the case of adversarial robustness) are secret. In some cases you can share with other labs to get most of the benefits and little of the cost. Regardless, lots of safety research is good to publish. And publishing model evals is good for some of the same reasons plus transparency and accountability.
E.g. fine-tuning, helpful-only, filters/moderation off, logprobs, activations.
RSP details (but it's not currently possible to do all of this very well)
See generally Key Components of an RSP
Do evals regularly
And have predefined high-level capability thresholds
And have predefined low-level operationalizations
And publish evals and thresholds, or have an auditor comment
And have thresholds trigger specific responses
And all these details should be good
And prepare for a pause (to make it less costly, in case it's necessary)
And accountability/transparency/meta stuff: publish updates or be transparent to an auditor, have a decent update process, elicit external review, perhaps adopt a mechanism for staff to flag false statements
Elicitation details (for model evals for dangerous capabilities): see the summary in my blogpost (or my table)
Don't publish capabilities research, modulo Thoughts on sharing information about language model capabilities (Christiano 2023) — don't publish stuff that boosts black-box capabilities (especially stuff on training) unless it improves safety
Internal governance
Have structure, mandate, and incentives set up such that the lab can prioritize safety in key decisions.
Plan what the lab and its staff would do if it needed to pause for safety
Have a good process for staff to escalate concerns about safety internally
Don't discourage certain kinds of whistleblowing — details unclear
Don't use nondisparagement provisions in contracts or severance agreements
Object-level security practices
Upload limits to protect weights
Multiparty access controls
Ask the US government for help on red-teaming, physical security, and vetting employees
See generally Securing AI Model Weights: Preventing Theft and Misuse of Frontier Models (RAND: Nevo et al. 2024)
Various specific safety techniques (potentially high-priority but I just didn't focus this post on them) (not necessarily currently existing) (very nonexhaustive)
Scalable oversight and creating a good training signal (see e.g. 1, 2, 3, 4)
Unlearning
Interpretability? For auditing a model pre-deployment, or auditing suspicious model outputs, or maybe determining whether outputs are suspicious in a control setup, or maybe intervening a la activation engineering, and more. See e.g. A Longlist of Theories of Impact for Interpretability (Nanda 2022).
Safety scaffolding (discussed in control and misuse above)
Probing, including coup probes and defection probes
Self-whistleblowing training: labs could train the AI to report on itself (perhaps by emailing the lab) when it acts suspiciously, notices that it's in an escape attempt, or sees suspicious things (perhaps including vulnerable code, coordination signals, or steganographic messages). See "Hard core whistleblowing" in "A Plan for Technical AI Safety with Current Science" (Greenblatt 2023).