BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery Paper • 2501.01540 • Published 5 days ago
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Paper • 2303.02536 • Published Mar 5, 2023
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca Paper • 2305.08809 • Published May 15, 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks Paper • 2310.17230 • Published Oct 26, 2023
Generating Language Corrections for Teaching Physical Control Tasks Paper • 2306.07012 • Published Jun 12, 2023
Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness Paper • 2306.10015 • Published Jun 16, 2023
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought Paper • 2306.12672 • Published Jun 22, 2023
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments Paper • 2401.12631 • Published Jan 23, 2024
Hypothesis Search: Inductive Reasoning with Language Models Paper • 2309.05660 • Published Sep 11, 2023
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions Paper • 2403.07809 • Published Mar 12, 2024
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking Paper • 2403.09629 • Published Mar 14, 2024
Stream of Search (SoS): Learning to Search in Language Paper • 2404.03683 • Published Apr 1, 2024
Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models Paper • 2209.08141 • Published Sep 16, 2022