-
Part 1: Getting Started
-
§1 Math & CS Foundations
11
-
★
Essence of Linear Algebra
— 3Blue1Brown
[course]
Visual intuition for vectors, matrices, eigenvalues
Essence of Linear Algebra
3Blue1Brown
Visual intuition for vectors, matrices, eigenvalues
Type: course
-
Neural Networks
— 3Blue1Brown
[course]
Visual intro to how neural nets work
Neural Networks
3Blue1Brown
Visual intro to how neural nets work
Type: course
-
Statistics & Probability
— Khan Academy
[course]
Distributions, Bayes' theorem, hypothesis testing
Statistics & Probability
Khan Academy
Distributions, Bayes' theorem, hypothesis testing
Type: course
-
Mathematics for Machine Learning
— Deisenroth et al.
[book]
Free textbook---linear algebra, calculus, probability
Mathematics for Machine Learning
Deisenroth et al.
Free textbook---linear algebra, calculus, probability
Type: book
-
CS231n Notes
— Stanford
[course]
Practical neural net fundamentals
CS231n Notes
Stanford
Practical neural net fundamentals
Type: course
-
Top-down, code-first approach
Practical Deep Learning for Coders
fast.ai
Top-down, code-first approach
Type: course
-
How Transformer LLMs Work
— Alammar & Grootendorst
[course]
95-min course: tokenization, attention, MoE
How Transformer LLMs Work
Alammar & Grootendorst
95-min course: tokenization, attention, MoE
Type: course
- Python for ML
-
Python Tutorial
[documentation]
Official, if you need basics
Python Tutorial
Official, if you need basics
Type: documentation
-
NumPy Quickstart
[documentation]
Array operations
NumPy Quickstart
Array operations
Type: documentation
-
Data manipulation
Pandas Getting Started
Data manipulation
Type: documentation
-
Tensors, autograd, training
PyTorch 60-Minute Blitz
Tensors, autograd, training
Type: documentation
-
Part 2: Understanding AI
-
§2 Foundations (The Canon)
9
-
★
Learning Representations by Back-Propagating Errors
— Rumelhart, Hinton, Williams
(1986)
[paper]
How neural nets learn. Everything builds on this.
Learning Representations by Back-Propagating Errors
David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams
How neural nets learn. Everything builds on this.
Type: paper
Year: 1986
Journal: Nature
Short but dense. The chain rule applied to neural networks.
-
★
Efficient Estimation of Word Representations in Vector Space
— Mikolov et al.
(2013)
[paper]
Word2vec. 'King - Man + Woman = Queen'
Efficient Estimation of Word Representations in Vector Space
Mikolov et al.
Word2vec. 'King - Man + Woman = Queen'
Type: paper
Year: 2013
-
GloVe: Global Vectors for Word Representation
— Pennington et al.
(2014)
[paper]
Alternative embeddings, co-occurrence based
GloVe: Global Vectors for Word Representation
Pennington et al.
Alternative embeddings, co-occurrence based
Type: paper
Year: 2014
-
Sequence to Sequence Learning
— Sutskever et al.
(2014)
[paper]
Encoder-decoder architecture
Sequence to Sequence Learning
Sutskever et al.
Encoder-decoder architecture
Type: paper
Year: 2014
-
ImageNet Classification with Deep CNNs
— Krizhevsky et al.
(2012)
[paper]
ImageNet moment---deep learning's 'big bang'
ImageNet Classification with Deep CNNs
Krizhevsky et al.
ImageNet moment---deep learning's 'big bang'
Type: paper
Year: 2012
-
Deep Residual Learning
— He et al.
(2015)
[paper]
Skip connections, enabled very deep networks
Deep Residual Learning
He et al.
Skip connections, enabled very deep networks
Type: paper
Year: 2015
-
Batch Normalization
— Ioffe & Szegedy
(2015)
[paper]
Training stability trick used everywhere
Batch Normalization
Ioffe & Szegedy
Training stability trick used everywhere
Type: paper
Year: 2015
-
Dropout
— Srivastava et al.
(2014)
[paper]
Regularization that actually works
Dropout
Srivastava et al.
Regularization that actually works
Type: paper
Year: 2014
-
Adam: A Method for Stochastic Optimization
— Kingma & Ba
(2014)
[paper]
The default optimizer
Adam: A Method for Stochastic Optimization
Kingma & Ba
The default optimizer
Type: paper
Year: 2014
-
§3 Attention & Transformers
10
-
Neural Machine Translation by Jointly Learning to Align and Translate
— Bahdanau et al.
(2014)
[paper]
Invented attention mechanism
Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau et al.
Invented attention mechanism
Type: paper
Year: 2014
-
★
Attention Is All You Need
— Vaswani et al.
(2017)
[paper]
Transformers. The architecture. Read carefully.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Transformers. The architecture. Read carefully.
Type: paper
Year: 2017
The foundational transformer paper. Section 3 (model architecture) is the most important.
-
BERT: Pre-training of Deep Bidirectional Transformers
— Devlin et al.
(2018)
[paper]
Bidirectional pretraining, MLM objective
BERT: Pre-training of Deep Bidirectional Transformers
Devlin et al.
Bidirectional pretraining, MLM objective
Type: paper
Year: 2018
-
Improving Language Understanding by Generative Pre-Training
— Radford et al.
(2018)
[paper]
GPT---autoregressive pretraining
Improving Language Understanding by Generative Pre-Training
Radford et al.
GPT---autoregressive pretraining
Type: paper
Year: 2018
-
Language Models are Unsupervised Multitask Learners
— Radford et al.
(2019)
[paper]
GPT-2, scaling
Language Models are Unsupervised Multitask Learners
Radford et al.
GPT-2, scaling
Type: paper
Year: 2019
-
★
Language Models are Few-Shot Learners
— Brown et al.
(2020)
[paper]
GPT-3, in-context learning emerges at scale
Language Models are Few-Shot Learners
Brown et al.
GPT-3, in-context learning emerges at scale
Type: paper
Year: 2020
-
Scaling Laws for Neural Language Models
— Kaplan et al.
(2020)
[paper]
Chinchilla precursor, loss vs. compute/data/params
Scaling Laws for Neural Language Models
Kaplan et al.
Chinchilla precursor, loss vs. compute/data/params
Type: paper
Year: 2020
-
★
Training Compute-Optimal Large Language Models
— Hoffmann et al.
(2022)
[paper]
Chinchilla---optimal scaling ratios
Training Compute-Optimal Large Language Models
Hoffmann et al.
Chinchilla---optimal scaling ratios
Type: paper
Year: 2022
-
LLaMA: Open and Efficient Foundation Language Models
— Touvron et al.
(2023)
[paper]
Open weights, efficient training
LLaMA: Open and Efficient Foundation Language Models
Touvron et al.
Open weights, efficient training
Type: paper
Year: 2023
-
FlashAttention
— Dao et al.
(2022)
[paper]
IO-aware attention, practical speedup
FlashAttention
Dao et al.
IO-aware attention, practical speedup
Type: paper
Year: 2022
-
§4 Reasoning & Chain-of-Thought
20
-
★
Chain-of-Thought Prompting Elicits Reasoning
— Wei et al.
(2022)
[paper]
'Let's think step by step' works
Chain-of-Thought Prompting Elicits Reasoning
Wei et al.
'Let's think step by step' works
Type: paper
Year: 2022
-
Self-Consistency Improves Chain of Thought Reasoning
— Wang et al.
(2022)
[paper]
Sample multiple CoT paths, majority vote
Self-Consistency Improves Chain of Thought Reasoning
Wang et al.
Sample multiple CoT paths, majority vote
Type: paper
Year: 2022
-
Tree of Thoughts
— Yao et al.
(2023)
[paper]
Search over reasoning paths
Tree of Thoughts
Yao et al.
Search over reasoning paths
Type: paper
Year: 2023
-
★
ReAct: Synergizing Reasoning and Acting
— Yao et al.
(2022)
[paper]
Reasoning + Acting, tool use
ReAct: Synergizing Reasoning and Acting
Yao et al.
Reasoning + Acting, tool use
Type: paper
Year: 2022
-
Toolformer
— Schick et al.
(2023)
[paper]
LLMs learning to use tools
Toolformer
Schick et al.
LLMs learning to use tools
Type: paper
Year: 2023
-
Let's Verify Step by Step
— Lightman et al.
(2023)
[paper]
Process reward models for math
Let's Verify Step by Step
Lightman et al.
Process reward models for math
Type: paper
Year: 2023
-
CoT explanations are systematically unfaithful — stated reasoning is influenced by biasing features the model doesn't mention; the reported chain-of-thought diverges from what actually drove the output
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Turpin et al.
CoT explanations are systematically unfaithful — stated reasoning is influenced by biasing features the model doesn't mention; the reported chain-of-thought diverges from what actually drove the output
Type: paper
Year: 2023
-
MAKER: Solving a Million-Step LLM Task
(2025)
[paper]
Ensemble voting for long-horizon reliability
MAKER: Solving a Million-Step LLM Task
Ensemble voting for long-horizon reliability
Type: paper
Year: 2025
-
The Prompt Report
— Schulhoff et al.
(2024)
[paper]
58 prompting techniques, taxonomy, best practices
The Prompt Report
Schulhoff et al.
58 prompting techniques, taxonomy, best practices
Type: paper
Year: 2024
-
Let Me Speak Freely?
— Tam et al.
(2024)
[paper]
Structured output (JSON/XML) degrades reasoning
Let Me Speak Freely?
Tam et al.
Structured output (JSON/XML) degrades reasoning
Type: paper
Year: 2024
-
Thinking Before Constraining
— Nguyen et al.
(2026)
[paper]
Fix: reason freely, then constrain output format
Thinking Before Constraining
Nguyen et al.
Fix: reason freely, then constrain output format
Type: paper
Year: 2026
-
Formal framework for XML-based structured prompting with convergence guarantees
XML Prompting as Grammar-Constrained Interaction
Alpay & Alpay
Formal framework for XML-based structured prompting with convergence guarantees
Type: paper
Year: 2025
Relevant to structured output tooling. Curate later.
-
Fine-tuned lightweight model as post-processing layer for structured output
SLOT: Structuring the Output of Large Language Models
Shen et al.
Fine-tuned lightweight model as post-processing layer for structured output
Type: paper
Year: 2025
Journal: EMNLP 2025 Industry Track
Relevant to structured output tooling. Curate later.
-
Benchmark for structured output reliability across tasks and models
StructuredRAG: JSON Response Formatting with Large Language Models
Shorten et al.
Benchmark for structured output reliability across tasks and models
Type: paper
Year: 2024
Relevant to structured output tooling. Curate later.
-
Grammar-based constrained decoding. Guarantees valid JSON/regex output by constraining token sampling.
Outlines: Structured Text Generation
dottxt
Grammar-based constrained decoding. Guarantees valid JSON/regex output by constraining token sampling.
Type: resource
Year: 2024
Key mitigation for structured output trap. Uses formal grammars at inference time.
-
Interleaves generation with programmatic control. Constrains output structure without taxing model attention.
Guidance: A Language for Controlling LLMs
Microsoft
Interleaves generation with programmatic control. Constrains output structure without taxing model attention.
Type: resource
Year: 2023
Alternative to Outlines. More control-flow oriented.
-
Repeating the input prompt improves performance without increasing output tokens or latency. Reasoning models already learn to do this internally.
Prompt Repetition Improves Non-Reasoning LLMs
Leviathan, Kalman, Matias
Repeating the input prompt improves performance without increasing output tokens or latency. Reasoning models already learn to do this internally.
Type: paper
Year: 2025
Mechanical explanation for why context position matters: attention can only look backward, so repetition gives the model more conditioning. 47/70 wins, 0 losses.
-
Claude's Cycles
— Knuth, Donald E.
(2026)
[paper]
Knuth acknowledges Claude Opus 4.6 solved an open combinatorics problem (directed Hamiltonian cycle decomposition) he'd been stuck on for weeks. Claude found the construction through 31 exploratory iterations in ~1 hour. Knuth then proved the construction correct and generalized it (760 valid decompositions). Claude could not prove its own answer, and degraded when pushed further (even-numbered case).
Claude's Cycles
Knuth, Donald E.
Knuth acknowledges Claude Opus 4.6 solved an open combinatorics problem (directed Hamiltonian cycle decomposition) he'd been stuck on for weeks. Claude found the construction through 31 exploratory iterations in ~1 hour. Knuth then proved the construction correct and generalized it (760 valid decompositions). Claude could not prove its own answer, and degraded when pushed further (even-numbered case).
Type: paper
Year: 2026
Perfect 'proposal engine / decision engine' example for Part 3. Claude proposes, Knuth verifies. Verification burden lands on someone capable of carrying it — unlike the healthcare/legal cases. Also demonstrates degradation under extended reasoning. Knuth previously dismissed ChatGPT in 2023 ('how to fake it'); now calls Claude's work 'quite admirable.' Consider for Part 3 closing or future article. HN discussion: https://news.ycombinator.com/item?id=47230710
-
Counters Self-Refine optimism. Shows LLMs cannot self-correct reasoning without external feedback — self-correction often degrades performance. Key evidence for verification asymmetry (thesis 27): the verification mechanism operates on the same medium as the generation mechanism. Supports the hallucination article's section 3 argument.
Large Language Models Cannot Self-Correct Reasoning Yet
Huang et al.
Counters Self-Refine optimism. Shows LLMs cannot self-correct reasoning without external feedback — self-correction often degrades performance. Key evidence for verification asymmetry (thesis 27): the verification mechanism operates on the same medium as the generation mechanism. Supports the hallucination article's section 3 argument.
Type: paper
Year: 2023
-
70-90% of multi-turn errors trace to propagation of previous-turn errors, not independent reasoning failures. Models rigidly maintain prior reasoning even when corrected. RLSTA fixes this using the model's own single-turn performance as RL signal. Inverse of sycophantic drift: inertia = committing too early, drift = agreeing too much. Both are multi-turn alignment failures.
Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction
Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou
70-90% of multi-turn errors trace to propagation of previous-turn errors, not independent reasoning failures. Models rigidly maintain prior reasoning even when corrected. RLSTA fixes this using the model's own single-turn performance as RL signal. Inverse of sycophantic drift: inertia = committing too early, drift = agreeing too much. Both are multi-turn alignment failures.
Type: paper
Year: 2026
-
Part 3: Building with AI
-
§5 RAG & Retrieval
19
-
★
RAG for LLMs: A Survey
— Gao et al.
(2023)
[paper]
Start here. Naive -> Advanced -> Modular RAG paradigms
RAG for LLMs: A Survey
Gao et al.
Start here. Naive -> Advanced -> Modular RAG paradigms
Type: paper
Year: 2023
-
End-to-end walkthrough. Good second read after survey.
Pinecone RAG Guide
End-to-end walkthrough. Good second read after survey.
Type: resource
-
Hands-on implementation with code
LangChain RAG Tutorial
Hands-on implementation with code
Type: resource
-
Concepts + implementation
LlamaIndex RAG Docs
Concepts + implementation
Type: resource
-
RAG From Scratch
— LangChain
[video]
Video series for visual learners
RAG From Scratch
LangChain
Video series for visual learners
Type: video
-
Original RAG paper---foundational
Retrieval-Augmented Generation for Knowledge-Intensive NLP
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
Original RAG paper---foundational
Type: paper
Year: 2020
Combines a pre-trained seq2seq model with a dense retriever. Key insight: retrieval can be end-to-end differentiable.
-
DPR---learned retrieval beats BM25
Dense Passage Retrieval for Open-Domain QA
Karpukhin et al.
DPR---learned retrieval beats BM25
Type: paper
Year: 2020
-
Retrieval-augmented pretraining
REALM: Retrieval-Augmented Language Model Pre-Training
Guu et al.
Retrieval-augmented pretraining
Type: paper
Year: 2020
-
HyDE---hypothetical document embeddings
Precise Zero-Shot Dense Retrieval without Relevance Labels
Gao et al.
HyDE---hypothetical document embeddings
Type: paper
Year: 2022
-
LLM generates pseudo-document for BM25. More conservative than HyDE.
Query2doc: Query Expansion with Large Language Models
Wang, Yang, Wei
LLM generates pseudo-document for BM25. More conservative than HyDE.
Type: paper
Year: 2023
Journal: EMNLP 2023
-
LLM decides when to retrieve
Self-RAG: Learning to Retrieve, Generate, and Critique
Asai et al.
LLM decides when to retrieve
Type: paper
Year: 2023
-
Recursive summarization for retrieval
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
Sarthi et al.
Recursive summarization for retrieval
Type: paper
Year: 2024
-
Few-shot learning with retrieval
Atlas: Few-shot Learning with Retrieval Augmented LMs
Izacard et al.
Few-shot learning with retrieval
Type: paper
Year: 2022
-
Microsoft's GraphRAG---community summaries for global queries
From Local to Global: A Graph RAG Approach
Edge et al.
Microsoft's GraphRAG---community summaries for global queries
Type: paper
Year: 2024
-
Formalizes GraphRAG taxonomy
Graph Retrieval-Augmented Generation: A Survey
Li et al.
Formalizes GraphRAG taxonomy
Type: paper
Year: 2024
-
Official project page
Microsoft GraphRAG Project
Official project page
Type: resource
-
Official implementation
GraphRAG GitHub Repository
Official implementation
Type: tool
-
Curated papers/benchmarks
Awesome-GraphRAG
Curated papers/benchmarks
Type: resource
-
Studies compressibility limits for RAG: when compression erases task-relevant information. Proposes overflow detection method.
Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Belikova et al.
Studies compressibility limits for RAG: when compression erases task-relevant information. Proposes overflow detection method.
Type: paper
Year: 2026
Concrete failure mode for Part 3's 'RAG has limits' argument.
-
§6 Embeddings & Vector Search
9
-
★
Sentence-BERT
— Reimers & Gurevych
(2019)
[paper]
Sentence embeddings that work
Sentence-BERT
Reimers & Gurevych
Sentence embeddings that work
Type: paper
Year: 2019
-
SimCSE: Simple Contrastive Learning of Sentence Embeddings
— Gao et al.
(2021)
[paper]
Contrastive sentence embeddings
SimCSE: Simple Contrastive Learning of Sentence Embeddings
Gao et al.
Contrastive sentence embeddings
Type: paper
Year: 2021
-
Text Embeddings by Weakly-Supervised Contrastive Pre-training
— Wang et al.
(2022)
[paper]
E5---strong general-purpose embeddings
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Wang et al.
E5---strong general-purpose embeddings
Type: paper
Year: 2022
-
Nomic Embed
(2024)
[paper]
8K context embeddings
Nomic Embed
8K context embeddings
Type: paper
Year: 2024
-
Matryoshka Representation Learning
— Kusupati et al.
(2022)
[paper]
Truncatable embeddings
Matryoshka Representation Learning
Kusupati et al.
Truncatable embeddings
Type: paper
Year: 2022
-
Efficient and Robust Approximate Nearest Neighbor Search
— Malkov & Yashunin
(2016)
[paper]
HNSW---hierarchical navigable small world graphs
Efficient and Robust Approximate Nearest Neighbor Search
Malkov & Yashunin
HNSW---hierarchical navigable small world graphs
Type: paper
Year: 2016
-
Othello-GPT: a sequence model trained on Othello move sequences develops internal board representations. Direct empirical challenge to Harnad's symbol grounding argument — models may develop internal world models on synthetic tasks. The grounding claim in Part 3 still holds (closed-world synthetic task ≠ open-world natural language) but must account for emergent representations rather than treating Harnad (1990) as unchallenged.
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Li et al.
Othello-GPT: a sequence model trained on Othello move sequences develops internal board representations. Direct empirical challenge to Harnad's symbol grounding argument — models may develop internal world models on synthetic tasks. The grounding claim in Part 3 still holds (closed-world synthetic task ≠ open-world natural language) but must account for emergent representations rather than treating Harnad (1990) as unchallenged.
Type: paper
Year: 2022
-
Best Paper. The word analogy task (king - man + woman = queen) decomposes into three independent pairwise cosine similarities, not relational vector arithmetic. Sparse, explicit distributional representations recover similar analogy performance as neural embeddings when using the same evaluation method — the phenomenon is a property of the evaluation, not the embedding algorithm. Key methodological reference for verification asymmetry: the evaluation method creates the appearance of a capability the system doesn't necessarily have.
Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy, Yoav Goldberg
Best Paper. The word analogy task (king - man + woman = queen) decomposes into three independent pairwise cosine similarities, not relational vector arithmetic. Sparse, explicit distributional representations recover similar analogy performance as neural embeddings when using the same evaluation method — the phenomenon is a property of the evaluation, not the embedding algorithm. Key methodological reference for verification asymmetry: the evaluation method creates the appearance of a capability the system doesn't necessarily have.
Type: paper
Year: 2014
-
The word-embedding bias finding (man:doctor::woman:nurse) is an artifact of excluding input words from the analogy search. Remove the exclusion and the model returns 'doctor.' Also documents cherry-picking of results and vocabulary truncation effects that inflated apparent bias. This drove real policy conversations based on a methodological artifact. Strongest published example of verification asymmetry at the evaluation level: the evaluation method appeared to verify a property (gender bias) that was actually an artifact of the method. Candidate example for hallucinations article.
Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor
Malvina Nissim, Rik van Noord, Rob van der Goot
The word-embedding bias finding (man:doctor::woman:nurse) is an artifact of excluding input words from the analogy search. Remove the exclusion and the model returns 'doctor.' Also documents cherry-picking of results and vocabulary truncation effects that inflated apparent bias. This drove real policy conversations based on a methodological artifact. Strongest published example of verification asymmetry at the evaluation level: the evaluation method appeared to verify a property (gender bias) that was actually an artifact of the method. Candidate example for hallucinations article.
Type: paper
Year: 2020
Journal: Computational Linguistics
-
§7 Agents & Tool Use
18
-
ReAct: Synergizing Reasoning and Acting
(2022)
[paper]
Interleaved reasoning and acting
ReAct: Synergizing Reasoning and Acting
Interleaved reasoning and acting
Type: paper
Year: 2022
-
Toolformer: Language Models Can Teach Themselves to Use Tools
(2023)
[paper]
Self-taught tool use
Toolformer: Language Models Can Teach Themselves to Use Tools
Self-taught tool use
Type: paper
Year: 2023
-
Voyager: An Open-Ended Embodied Agent
— Wang et al.
(2023)
[paper]
LLM agent in Minecraft, skill library
Voyager: An Open-Ended Embodied Agent
Wang et al.
LLM agent in Minecraft, skill library
Type: paper
Year: 2023
-
AutoGPT / BabyAGI
(2023)
[tool]
Autonomous agent architectures (read critically)
AutoGPT / BabyAGI
Autonomous agent architectures (read critically)
Type: tool
Year: 2023
-
Generative Agents: Interactive Simulacra
— Park et al.
(2023)
[paper]
'Smallville'---agents with memory
Generative Agents: Interactive Simulacra
Park et al.
'Smallville'---agents with memory
Type: paper
Year: 2023
-
Language Agent Tree Search
— Zhou et al.
(2023)
[paper]
LATS
Language Agent Tree Search
Zhou et al.
LATS
Type: paper
Year: 2023
-
Reflexion
— Shinn et al.
(2023)
[paper]
Agents that learn from mistakes
Reflexion
Shinn et al.
Agents that learn from mistakes
Type: paper
Year: 2023
-
World Models
— Ha & Schmidhuber
(2018)
[paper]
Learn environment dynamics in latent space
World Models
Ha & Schmidhuber
Learn environment dynamics in latent space
Type: paper
Year: 2018
- Frameworks & Examples
-
LangChain Agents
[documentation]
Tool use, ReAct implementation
LangChain Agents
Tool use, ReAct implementation
Type: documentation
-
Data agents
LlamaIndex Agents
Data agents
Type: documentation
-
Lightweight multi-agent framework
OpenAI Swarm
Lightweight multi-agent framework
Type: tool
-
AutoGen
— Microsoft
[tool]
Multi-agent conversations
AutoGen
Microsoft
Multi-agent conversations
Type: tool
-
Patterns and anti-patterns
Building Effective Agents
Anthropic
Patterns and anti-patterns
Type: article
-
Code Mode: the better way to use MCP
— Varda, Kenton; Pai, Sunil
(2025)
[blog]
LLMs call tools better as TypeScript APIs than as MCP tool calls --- training data distribution matters. V8 isolate sandboxing eliminates network access and API key leakage by construction. Key quote: 'Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin.' Maps to Thesis 16 (tighter games, more control) and Wittgenstein (play the game the model knows).
Code Mode: the better way to use MCP
Kenton Varda, Sunil Pai
LLMs call tools better as TypeScript APIs than as MCP tool calls --- training data distribution matters. V8 isolate sandboxing eliminates network access and API key leakage by construction. Key quote: 'Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin.' Maps to Thesis 16 (tighter games, more control) and Wittgenstein (play the game the model knows).
Type: blog
Year: 2025
-
Server-side Code Mode collapses 2,500 Cloudflare API endpoints into two tools (search + execute) at ~1,000 tokens. 99.9% token reduction vs native MCP. Progressive API discovery via OpenAPI spec search. Execution in sandboxed V8 isolate with no filesystem or env vars. Connects to REST/HATEOAS parking lot idea: HATEOAS was designed for a client that didn't exist yet.
Code Mode: give agents an entire API in 1,000 tokens
Matt Carey
Server-side Code Mode collapses 2,500 Cloudflare API endpoints into two tools (search + execute) at ~1,000 tokens. 99.9% token reduction vs native MCP. Progressive API discovery via OpenAPI spec search. Execution in sandboxed V8 isolate with no filesystem or env vars. Connects to REST/HATEOAS parking lot idea: HATEOAS was designed for a client that didn't exist yet.
Type: blog
Year: 2026
-
Visa Trusted Agent Protocol and Mastercard Agent Pay use Web Bot Auth for AI shopping agents. Nonce fields for replay protection, tag fields to distinguish browse vs. purchase intent. Distributed cognition (Hutchins): agent/user/merchant/payment network form a trust chain that didn't exist before. Verification is structural (crypto signatures), not behavioral (prompt promises).
Securing agentic commerce: helping AI Agents transact with Visa and Mastercard
Rohin Lohe, Will Allen
Visa Trusted Agent Protocol and Mastercard Agent Pay use Web Bot Auth for AI shopping agents. Nonce fields for replay protection, tag fields to distinguish browse vs. purchase intent. Distributed cognition (Hutchins): agent/user/merchant/payment network form a trust chain that didn't exist before. Verification is structural (crypto signatures), not behavioral (prompt promises).
Type: blog
Year: 2025
-
Introducing Markdown for Agents
— Martinho, Celso; Allen, Will
(2026)
[blog]
Content negotiation (Accept: text/markdown) for AI clients. 80% token reduction from HTML to markdown. Content Signals framework signals AI usage permissions. The web being restructured for non-human consumption --- agents as first-class citizens of HTTP.
Introducing Markdown for Agents
Celso Martinho, Will Allen
Content negotiation (Accept: text/markdown) for AI clients. 80% token reduction from HTML to markdown. Content Signals framework signals AI usage permissions. The web being restructured for non-human consumption --- agents as first-class citizens of HTTP.
Type: blog
Year: 2026
-
Cloudflare hosting for OpenClaw (formerly Moltbot/Clawdbot --- the tool from the Shambaugh incident). Architecture: V8 sandbox, AI Gateway for model routing with fallbacks, Zero Trust Access for auth, R2 for persistence. The sandboxing architecture is exactly Part 3's argument: behavioral shaping (system prompt) isn't enough, you need structural containers.
Introducing Moltworker: a self-hosted personal AI agent, minus the minis
Celso Martinho, Brian Brunner, Sid Chatterjee, Andreas Jansson
Cloudflare hosting for OpenClaw (formerly Moltbot/Clawdbot --- the tool from the Shambaugh incident). Architecture: V8 sandbox, AI Gateway for model routing with fallbacks, Zero Trust Access for auth, R2 for persistence. The sandboxing architecture is exactly Part 3's argument: behavioral shaping (system prompt) isn't enough, you need structural containers.
Type: blog
Year: 2026
-
§8 Evaluation & Benchmarks
27
-
AgoraBench: data generation ability doesn't correlate with problem-solving ability
Evaluating Language Models as Synthetic Data Generators
Kim et al.
AgoraBench: data generation ability doesn't correlate with problem-solving ability
Type: paper
Year: 2025
Journal: ACL 2025
Curate later.
-
★
MMLU
[benchmark]
General knowledge across domains
MMLU
General knowledge across domains
Type: benchmark
-
HellaSwag
[benchmark]
Commonsense reasoning
HellaSwag
Commonsense reasoning
Type: benchmark
-
HumanEval
[benchmark]
Code generation
HumanEval
Code generation
Type: benchmark
-
GSM8K
[benchmark]
Grade school math
GSM8K
Grade school math
Type: benchmark
-
MATH
[benchmark]
Competition math
MATH
Competition math
Type: benchmark
-
BIG-Bench
[benchmark]
Diverse capabilities
BIG-Bench
Diverse capabilities
Type: benchmark
-
TruthfulQA
[benchmark]
Hallucination resistance
TruthfulQA
Hallucination resistance
Type: benchmark
-
MT-Bench
[benchmark]
Multi-turn conversation
MT-Bench
Multi-turn conversation
Type: benchmark
- Eval Tools
-
Evaluation framework
OpenAI Evals
Evaluation framework
Type: tool
-
RAG evaluation metrics
RAGAS
RAG evaluation metrics
Type: tool
-
Tracing, debugging, evaluation
LangSmith
Tracing, debugging, evaluation
Type: tool
-
LLM eval platform
Braintrust
LLM eval platform
Type: tool
-
Faithfulness, relevance metrics
ragas GitHub
Faithfulness, relevance metrics
Type: tool
-
AI Hallucination Cases Database
— Charlotin, Damien
(2026)
[dataset]
973 documented cases of lawyers submitting AI-hallucinated citations across 12 countries. The empirical scale of legal hallucination. Cited in Part 3 legal section.
AI Hallucination Cases Database
Damien Charlotin
973 documented cases of lawyers submitting AI-hallucinated citations across 12 countries. The empirical scale of legal hallucination. Cited in Part 3 legal section.
Type: dataset
Year: 2026
-
Current safety evals weight all harmful queries equally. Expected Harm weights severity by execution likelihood. Cost-based decomposition yields 22x average ASR increase and 72-84% guardrail bypass. Models refuse high-cost threats (nuclear weapons) while complying with low-cost ones (stalking) --- the latter is more dangerous in aggregate. Models encode severity well but show representational blindness to cost.
Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs
Zihao Chen, Albert Tam, Jiaqi Wu, Jing Chen
Current safety evals weight all harmful queries equally. Expected Harm weights severity by execution likelihood. Cost-based decomposition yields 22x average ASR increase and 72-84% guardrail bypass. Models refuse high-cost threats (nuclear weapons) while complying with low-cost ones (stalking) --- the latter is more dangerous in aggregate. Models encode severity well but show representational blindness to cost.
Type: paper
Year: 2026
-
First standardized benchmark for hallucinated citation detection (6,475 real + 2,967 fake citations). Five-agent verification cascade outperforms all LLM baselines. GPT-5.2 drops from F1 0.955 to 0.331 on real-world data; Claude-Sonnet-4.5 flags nearly everything as fake. Structural verification (external cascade) outperforms behavioral shaping (asking LLMs to self-verify).
CiteAudit: You Cited It, But Did You Read It?
Zhongxiang Yuan, Jingyi Shi, Chenhui Zhang, Xiao-Ping Sun
First standardized benchmark for hallucinated citation detection (6,475 real + 2,967 fake citations). Five-agent verification cascade outperforms all LLM baselines. GPT-5.2 drops from F1 0.955 to 0.331 on real-world data; Claude-Sonnet-4.5 flags nearly everything as fake. Structural verification (external cascade) outperforms behavioral shaping (asking LLMs to self-verify).
Type: paper
Year: 2026
-
Multi-turn conversations reshape model confidence in three distinct ways across models: Claude suppresses confidence, GPT-5.2 escalates it (ECE nearly doubles by turn 5), Gemini 3.1 stalls calibration improvement. Effect strongest in open-ended advisory contexts --- exactly where stakes are highest. Drift is not just agreement but confidence miscalibration over turns.
Self-Anchoring Calibration Drift in Large Language Models
Harshavardhan
Multi-turn conversations reshape model confidence in three distinct ways across models: Claude suppresses confidence, GPT-5.2 escalates it (ECE nearly doubles by turn 5), Gemini 3.1 stalls calibration improvement. Effect strongest in open-ended advisory contexts --- exactly where stakes are highest. Drift is not just agreement but confidence miscalibration over turns.
Type: paper
Year: 2026
-
4,876 passages, 100 expert criminal law questions. Improving retrieval yields ~4x the correctness improvement vs upgrading the LLM. Hierarchical error decomposition distinguishes retrieval failures from reasoning failures from hallucinations. Many errors attributed to hallucination are actually retrieval failures upstream --- a category error the field keeps making.
Legal RAG Bench: An End-to-End Benchmark for Legal RAG
Butler & Butler
4,876 passages, 100 expert criminal law questions. Improving retrieval yields ~4x the correctness improvement vs upgrading the LLM. Hierarchical error decomposition distinguishes retrieval failures from reasoning failures from hallucinations. Many errors attributed to hallucination are actually retrieval failures upstream --- a category error the field keeps making.
Type: paper
Year: 2026
-
Shifts hallucination research from internal mechanism analysis toward causal tracing against real-world evidence. 84% of hallucinations are fabrication heuristics (pattern-completion without factual grounding), not reasoning failures. Joint success rate 200x over baselines. Most hallucinations are straightforward pattern-completion, not subtle reasoning errors --- changes where to intervene.
HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models
Shize Liang, Hongzhi Wang
Shifts hallucination research from internal mechanism analysis toward causal tracing against real-world evidence. 84% of hallucinations are fabrication heuristics (pattern-completion without factual grounding), not reasoning failures. Joint success rate 200x over baselines. Most hallucinations are straightforward pattern-completion, not subtle reasoning errors --- changes where to intervene.
Type: paper
Year: 2026
-
Formalizes when cross-model checking (debate) offers advantage over single-model verification. Uses principal angles between models' representation subspaces to measure knowledge divergence. Gives Part 3's 'cross-model checking reduces correlation' claim a geometric formalization --- debate adds value precisely when models encode knowledge in structurally different subspaces.
Knowledge Divergence and the Value of Debate for Scalable Oversight
Robin Young
Formalizes when cross-model checking (debate) offers advantage over single-model verification. Uses principal angles between models' representation subspaces to measure knowledge divergence. Gives Part 3's 'cross-model checking reduces correlation' claim a geometric formalization --- debate adds value precisely when models encode knowledge in structurally different subspaces.
Type: paper
Year: 2026
-
Traditional neuron-level coverage is catastrophically misleading for LLM safety testing: adding benign prompts inflates neuron coverage +43% while RACA shows only +3%. Representation-level coverage prioritizes 83-86% ASR samples vs 53% for neuron-level baselines. Three design principles: synonym-insensitive, invalid-insensitive, jailbreak-sensitive. Fixes the measurement layer for safety testing.
RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun
Traditional neuron-level coverage is catastrophically misleading for LLM safety testing: adding benign prompts inflates neuron coverage +43% while RACA shows only +3%. Representation-level coverage prioritizes 83-86% ASR samples vs 53% for neuron-level baselines. Three design principles: synonym-insensitive, invalid-insensitive, jailbreak-sensitive. Fixes the measurement layer for safety testing.
Type: paper
Year: 2026
-
Review in a fresh session (no production conversation history) catches more errors than same-session review. 360 reviews across 30 artifacts with 150 injected errors. CCR F1 28.6% vs same-session 24.6% (p=0.008). Reviewing twice in same session doesn't help (p=0.11) --- benefit comes from context separation itself, not repetition. Works with any model, no infrastructure needed.
Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
Tae-Eun Song
Review in a fresh session (no production conversation history) catches more errors than same-session review. 360 reviews across 30 artifacts with 150 injected errors. CCR F1 28.6% vs same-session 24.6% (p=0.008). Reviewing twice in same session doesn't help (p=0.11) --- benefit comes from context separation itself, not repetition. Works with any model, no infrastructure needed.
Type: paper
Year: 2026
-
Reasoning judges resist reward hacking better than non-reasoning judges in RL-based alignment. But reasoning-judge-trained policies learn to generate adversarial outputs that deceive other LLM-judges and score well on Arena-Hard. Evaluation gaming as an alignment failure mode: strong judge training produces stronger adversarial generation.
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Yixin Liu, Yue Yu, DiJia Su
Reasoning judges resist reward hacking better than non-reasoning judges in RL-based alignment. But reasoning-judge-trained policies learn to generate adversarial outputs that deceive other LLM-judges and score well on Arena-Hard. Evaluation gaming as an alignment failure mode: strong judge training produces stronger adversarial generation.
Type: paper
Year: 2026
-
Reveals severe confidence miscalibration in multimodal LLMs. Proposes Confidence-Driven Reinforcement Learning (CDRL) using original-noise image pairs to calibrate confidence. Calibrated confidence enables test-time scaling as free lunch. Confidence-Aware Test-Time Scaling (CA-TTS) coordinates self-consistency, self-reflection, and visual self-check. 8.8% gains across four benchmarks.
Linking Perception, Confidence and Accuracy in MLLMs
Yuetian Du, Yucheng Wang, Rongyu Zhang
Reveals severe confidence miscalibration in multimodal LLMs. Proposes Confidence-Driven Reinforcement Learning (CDRL) using original-noise image pairs to calibrate confidence. Calibrated confidence enables test-time scaling as free lunch. Confidence-Aware Test-Time Scaling (CA-TTS) coordinates self-consistency, self-reflection, and visual self-check. 8.8% gains across four benchmarks.
Type: paper
Year: 2026
- Safety Evaluation
-
Automated red-teaming framework measuring guardrail degradation as continuous per-round compliance trajectories, not binary pass/fail. Uses fine-tuned 70B attacker model (QLoRA). Tested against Claude Opus, Gemini Pro, GPT-5.2. Found 26.7% jailbreak rate with average jailbreak at round 1.25 --- compromises happen early, not through gradual erosion. Treats judge reliability as a first-class outcome. Moves safety evaluation from 'did it break' to 'how does it degrade.'
ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
Harry Owiredu-Ashley
Automated red-teaming framework measuring guardrail degradation as continuous per-round compliance trajectories, not binary pass/fail. Uses fine-tuned 70B attacker model (QLoRA). Tested against Claude Opus, Gemini Pro, GPT-5.2. Found 26.7% jailbreak rate with average jailbreak at round 1.25 --- compromises happen early, not through gradual erosion. Treats judge reliability as a first-class outcome. Moves safety evaluation from 'did it break' to 'how does it degrade.'
Type: paper
Year: 2026
- Hallucination & Factuality
-
Largest empirical hallucination study: 35 open-weight models, 172B tokens, three context lengths, three hardware platforms. Key findings: hallucination at 1.19% for 32K context, nearly triples at 128K, exceeds 10% at 200K. T=0.0 best for accuracy in ~60% of cases, but higher temps reduce fabrication and cut coherence loss up to 48x. Grounding ability and fabrication resistance are distinct capabilities. Model choice dominates all other variables (72pp accuracy spread).
How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
JV Roig
Largest empirical hallucination study: 35 open-weight models, 172B tokens, three context lengths, three hardware platforms. Key findings: hallucination at 1.19% for 32K context, nearly triples at 128K, exceeds 10% at 200K. T=0.0 best for accuracy in ~60% of cases, but higher temps reduce fabrication and cut coherence loss up to 48x. Grounding ability and fabrication resistance are distinct capabilities. Model choice dominates all other variables (72pp accuracy spread).
Type: paper
Year: 2026
-
Part 4: Knowledge & Reasoning
-
§9 Knowledge Graphs + LLMs / Neuro-Symbolic
28
-
Taxonomy of approaches
Neurosymbolic AI for Reasoning over Knowledge Graphs
Taxonomy of approaches
Type: paper
Year: 2023
-
Recent survey, LLM integration
Neural-Symbolic Reasoning over KGs: A Query Perspective
Recent survey, LLM integration
Type: paper
Year: 2024
-
State of the field
Neuro-Symbolic AI in 2024: A Systematic Review
State of the field
Type: paper
Year: 2025
-
KG-BERT
— Yao et al.
(2019)
[paper]
BERT for knowledge graph completion
KG-BERT
Yao et al.
BERT for knowledge graph completion
Type: paper
Year: 2019
-
QA-GNN
— Yasunaga et al.
(2021)
[paper]
GNN + LM for QA over KGs
QA-GNN
Yasunaga et al.
GNN + LM for QA over KGs
Type: paper
Year: 2021
-
GreaseLM
— Zhang et al.
(2022)
[paper]
Fusing LMs and KGs for reasoning
GreaseLM
Zhang et al.
Fusing LMs and KGs for reasoning
Type: paper
Year: 2022
-
★
Think-on-Graph
— Sun et al.
(2023)
[paper]
LLM reasoning on KG structure
Think-on-Graph
Sun et al.
LLM reasoning on KG structure
Type: paper
Year: 2023
-
Reasoning on Graphs
— Luo et al.
(2024)
[paper]
Reasoning on Graphs with LLMs
Reasoning on Graphs
Luo et al.
Reasoning on Graphs with LLMs
Type: paper
Year: 2024
-
Symbolic AI in the Age of LLMs
— Lassila, AWS re:Invent
(2025)
[video]
Practitioner perspective on hybrid systems
Symbolic AI in the Age of LLMs
Lassila, AWS re:Invent
Practitioner perspective on hybrid systems
Type: video
Year: 2025
-
Program synthesis from examples; IP vs. ML comparison
Inductive Programming Meets the Real World
Gulwani et al.
Program synthesis from examples; IP vs. ML comparison
Type: paper
Year: 2015
- Probabilistic Logic Programming
-
DeepProbLog
(2018)
[paper]
Neural predicates in ProbLog
DeepProbLog
Neural predicates in ProbLog
Type: paper
Year: 2018
-
Towards Probabilistic ILP with Neurosymbolic Inference
(2024)
[paper]
Learning logic programs
Towards Probabilistic ILP with Neurosymbolic Inference
Learning logic programs
Type: paper
Year: 2024
-
Statistical Relational Artificial Intelligence
— De Raedt et al.
(2016)
[book]
Textbook---probabilistic logic
Statistical Relational Artificial Intelligence
De Raedt et al.
Textbook---probabilistic logic
Type: book
Year: 2016
- Hybrid / Neural-Symbolic Systems
-
Computational Architectures Integrating Neural and Symbolic Processes
— Sun & Bookman, eds.
(1994)
[book]
Early integration approaches
Computational Architectures Integrating Neural and Symbolic Processes
Sun & Bookman, eds.
Early integration approaches
Type: book
Year: 1994
-
Connectionist-Symbolic Integration
— Sun & Alexandre, eds.
(1997)
[book]
Bridging paradigms
Connectionist-Symbolic Integration
Sun & Alexandre, eds.
Bridging paradigms
Type: book
Year: 1997
-
Hybrid Neural Systems
— Wermter & Sun, eds.
(2000)
[book]
Springer collection
Hybrid Neural Systems
Wermter & Sun, eds.
Springer collection
Type: book
Year: 2000
-
Neural-Symbolic Cognitive Reasoning
— Garcez, Lamb & Gabbay
(2009)
[book]
Foundations of modern neuro-symbolic
Neural-Symbolic Cognitive Reasoning
Garcez, Lamb & Gabbay
Foundations of modern neuro-symbolic
Type: book
Year: 2009
- Minsky & Frames
-
A Framework for Representing Knowledge
— Minsky
(1974)
[paper]
Introduced frames---foundational for KR
A Framework for Representing Knowledge
Minsky
Introduced frames---foundational for KR
Type: paper
Year: 1974
-
Society of Mind
— Minsky
(1986)
[book]
Agents as collections of simpler processes
Society of Mind
Minsky
Agents as collections of simpler processes
Type: book
Year: 1986
-
The Emotion Machine
— Minsky
(2006)
[book]
Commonsense reasoning, emotions in AI
The Emotion Machine
Minsky
Commonsense reasoning, emotions in AI
Type: book
Year: 2006
-
Generic Frame Protocol
[resource]
Standard for frame-based systems
Generic Frame Protocol
Standard for frame-based systems
Type: resource
- Cybersecurity KG + RAG
-
CyKG-RAG
— Kurniawan et al.
(2024)
[paper]
KG + vector search with query routing. Routes structured queries to SPARQL, semantic to embeddings.
CyKG-RAG
Kurniawan et al.
KG + vector search with query routing. Routes structured queries to SPARQL, semantic to embeddings.
Type: paper
Year: 2024
Collected for attack-kg v3. Curate later.
-
Multiple agents adaptively select retrieval strategy (KG traversal vs vector search vs hybrid)
AgCyRAG: Agentic KG-based RAG for Cybersecurity
Kurniawan et al.
Multiple agents adaptively select retrieval strategy (KG traversal vs vector search vs hybrid)
Type: paper
Year: 2025
Collected for attack-kg v3. Curate later.
-
GraphCyRAG
— PNNL
(2024)
[paper]
Neo4j KG traversal over CVE->CWE->CAPEC->ATT&CK. Graph traversal outperforms embedding search for vuln-to-technique mapping.
GraphCyRAG
PNNL
Neo4j KG traversal over CVE->CWE->CAPEC->ATT&CK. Graph traversal outperforms embedding search for vuln-to-technique mapping.
Type: paper
Year: 2024
Collected for attack-kg v3. Curate later.
-
CTI-Thinker
(2025)
[paper]
LLM-driven CTI KG construction + GraphRAG reasoning engine for tactical inference
CTI-Thinker
LLM-driven CTI KG construction + GraphRAG reasoning engine for tactical inference
Type: paper
Year: 2025
Journal: Cybersecurity (Springer, open access)
Collected for attack-kg v3. Curate later.
- Ontology-Grounded RAG
-
Anchors retrieval in domain ontologies. +55% fact recall, +40% correctness, +27% reasoning accuracy vs baseline RAG. Key for Part 3.
OG-RAG: Ontology-Grounded Retrieval-Augmented Generation
Nadkarni et al.
Anchors retrieval in domain ontologies. +55% fact recall, +40% correctness, +27% reasoning accuracy vs baseline RAG. Key for Part 3.
Type: paper
Year: 2024
-
Compares vector RAG vs GraphRAG vs ontology-guided KG. GraphRAG + ontology-KG both hit 90% accuracy. Empirical grounding evidence.
Ontology Learning and KG Construction: Impact on RAG Performance
Reiz et al.
Compares vector RAG vs GraphRAG vs ontology-guided KG. GraphRAG + ontology-KG both hit 90% accuracy. Empirical grounding evidence.
Type: paper
Year: 2025
-
LLM translates patient questions into executable FHIRPath queries against structured EHR data. Query-first approach provides 391x token reduction over retrieval-first. Fine-tuned accuracy ~80% but degrades on unseen FHIR resource types --- overfitting to patterns, not schema reasoning. Working implementation of the proposal engine / decision engine pattern for healthcare.
FHIRPath-QA: Executable Question Answering over FHIR EHRs
Scott Frew, Neel Bheda, Charles Tripp
LLM translates patient questions into executable FHIRPath queries against structured EHR data. Query-first approach provides 391x token reduction over retrieval-first. Fine-tuned accuracy ~80% but degrades on unseen FHIR resource types --- overfitting to patterns, not schema reasoning. Working implementation of the proposal engine / decision engine pattern for healthcare.
Type: paper
Year: 2026
-
§10 Search Engines & Information Retrieval
7
-
★
Introduction to Information Retrieval
— Manning, Raghavan, Schütze
(2008)
[book]
Start here. Free online. Ch 1-8 cover essentials: inverted index, TF-IDF, evaluation
Introduction to Information Retrieval
Manning, Raghavan, Schütze
Start here. Free online. Ch 1-8 cover essentials: inverted index, TF-IDF, evaluation
Type: book
Year: 2008
Publisher: Cambridge University Press
-
BM25 is still the baseline. Understand this before neural approaches.
The Probabilistic Relevance Framework: BM25 and Beyond
Robertson & Zaragoza
BM25 is still the baseline. Understand this before neural approaches.
Type: paper
Year: 2009
-
Survey of neural IR. Good overview before diving into specific papers.
Pretrained Transformers for Text Ranking: BERT and Beyond
Lin et al.
Survey of neural IR. Good overview before diving into specific papers.
Type: paper
Year: 2020
-
Late interaction---practical for production neural search
ColBERT: Efficient and Effective Passage Search
Khattab & Zaharia
Late interaction---practical for production neural search
Type: paper
Year: 2020
-
Passage Re-ranking with BERT
— Nogueira & Cho
(2019)
[paper]
Simple but effective. Good first neural IR paper to implement.
Passage Re-ranking with BERT
Nogueira & Cho
Simple but effective. Good first neural IR paper to implement.
Type: paper
Year: 2019
-
Original Google paper. Historical interest, less relevant to RAG work.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Brin & Page
Original Google paper. Historical interest, less relevant to RAG work.
Type: paper
Year: 1998
-
Learning to Rank for Information Retrieval
— Liu
(2011)
[book]
Deep dive on ranking ML. Reference, not first read.
Learning to Rank for Information Retrieval
Liu
Deep dive on ranking ML. Reference, not first read.
Type: book
Year: 2011
-
§11 Semantics, Semiotics & Ontologies
48
- Semiotics (Signs & Meaning)
-
★
Semiotics: The Basics
— Daniel Chandler
[book]
Start here. Accessible intro to Saussure, Peirce, Eco, and sign theory
Semiotics: The Basics
Daniel Chandler
Start here. Accessible intro to Saussure, Peirce, Eco, and sign theory
Type: book
-
★
Peirce's Theory of Signs
— Stanford Encyclopedia of Philosophy
[article]
Icon, index, symbol trichotomy. Free, authoritative reference
Peirce's Theory of Signs
Stanford Encyclopedia of Philosophy
Icon, index, symbol trichotomy. Free, authoritative reference
Type: article
-
Course in General Linguistics
— Saussure
(1916)
[book]
Signifier/signified distinction---foundational but dense
Course in General Linguistics
Saussure
Signifier/signified distinction---foundational but dense
Type: book
Year: 1916
-
A Theory of Semiotics
— Umberto Eco
(1976)
[book]
Classic text on sign systems---read after Chandler
A Theory of Semiotics
Umberto Eco
Classic text on sign systems---read after Chandler
Type: book
Year: 1976
-
Do LLMs ground symbols? Bridges semiotics and AI debate
Symbols and Grounding in Large Language Models
Mollo & Millière
Do LLMs ground symbols? Bridges semiotics and AI debate
Type: paper
Year: 2023
Journal: Philosophical Transactions of the Royal Society A
-
AI: A Semiotic Perspective
— Walsh Matthews & Danesi
(2019)
[article]
Survey of semiotics vs. AI: abduction, embodiment, Baudrillard, Peirce
AI: A Semiotic Perspective
Stéphanie Walsh Matthews, Marcel Danesi
Survey of semiotics vs. AI: abduction, embodiment, Baudrillard, Peirce
Type: article
Year: 2019
Journal: Chinese Semiotic Studies
-
AI as 'technology of fakery'---mimicry, generation, ideology
The Main Tasks of a Semiotics of Artificial Intelligence
Massimo Leone
AI as 'technology of fakery'---mimicry, generation, ideology
Type: article
Year: 2023
Journal: Language and Semiotic Studies
-
★
The Symbol Grounding Problem
— Stevan Harnad
(1990)
[paper]
Foundational paper. How do symbols get meaning? The Chinese Room problem for semantics.
The Symbol Grounding Problem
Stevan Harnad
Foundational paper. How do symbols get meaning? The Chinese Room problem for semantics.
Type: paper
Year: 1990
Journal: Physica D
-
LLMs through Saussure and Derrida. How word2vec embodies structuralist sign theory.
Language Models as Semiotic Machines
Elad Vromen
LLMs through Saussure and Derrida. How word2vec embodies structuralist sign theory.
Type: paper
Year: 2024
-
LLMs as semiotic means, not minds. Peirce, Lotman's semiosphere, prompt as contract.
Not Minds, but Signs: Reframing LLMs through Semiotics
Mazzocchi et al.
LLMs as semiotic means, not minds. Peirce, Lotman's semiosphere, prompt as contract.
Type: paper
Year: 2025
-
Can LLM internal states be about extra-linguistic reality without embodiment? Argues yes—referential grounding possible from text alone.
The Vector Grounding Problem
Dimitri Coelho Mollo
Can LLM internal states be about extra-linguistic reality without embodiment? Argues yes—referential grounding possible from text alone.
Type: paper
Year: 2023
-
Formal categorical framework. LLMs don't solve grounding—they parasitize human-grounded text. Key for Part 3 thesis.
A Categorical Analysis of LLMs and Why They Circumvent the Symbol Grounding Problem
Betz et al.
Formal categorical framework. LLMs don't solve grounding—they parasitize human-grounded text. Key for Part 3 thesis.
Type: paper
Year: 2025
-
Proposes LSMs that model full Peircean triads (representamen/interpretant/object). Argues LLMs operate only at signifier level.
Beyond Tokens: Introducing Large Semiosis Models (LSMs) for Grounded Meaning in Artificial Intelligence
Luciano Silva
Proposes LSMs that model full Peircean triads (representamen/interpretant/object). Argues LLMs operate only at signifier level.
Type: paper
Year: 2025
- Philosophy & Sociology Foundations
-
Dramaturgical framework. Identity as performance for audiences. Front stage vs back stage. Core framework for Part 1.
The Presentation of Self in Everyday Life
Erving Goffman
Dramaturgical framework. Identity as performance for audiences. Front stage vs back stage. Core framework for Part 1.
Type: book
Year: 1956
Publisher: University of Edinburgh
-
Keying, brackets, fabrication, containment. How frames transform activity and can be manipulated. Core for Parts 1 & 2.
Frame Analysis: An Essay on the Organization of Experience
Erving Goffman
Keying, brackets, fabrication, containment. How frames transform activity and can be manipulated. Core for Parts 1 & 2.
Type: book
Year: 1974
Publisher: Harvard UP
-
★
Philosophical Investigations
— Wittgenstein
(1953)
[book]
Language games. Meaning is use in context. §§1-50 cover the core ideas. Dense but foundational.
Philosophical Investigations
Ludwig Wittgenstein
Language games. Meaning is use in context. §§1-50 cover the core ideas. Dense but foundational.
Type: book
Year: 1953
Publisher: Blackwell
-
★
Discipline and Punish: The Birth of the Prison
— Foucault
(1977)
[book]
Power as productive, not repressive. Normalizing judgment shapes behavior through distribution, not prohibition. Primary text for Part 1.
Discipline and Punish: The Birth of the Prison
Michel Foucault
Power as productive, not repressive. Normalizing judgment shapes behavior through distribution, not prohibition. Primary text for Part 1.
Type: book
Year: 1977
Publisher: Pantheon
-
Foucault: Power is Everywhere
— Powercube
(2011)
[article]
Power as productive, not just repressive. Regimes of truth shape what's sayable. Short, focused, accessible.
Foucault: Power is Everywhere
Powercube
Power as productive, not just repressive. Regimes of truth shape what's sayable. Short, focused, accessible.
Type: article
Year: 2011
-
Speech act theory. Locutionary vs illocutionary force. Utterances don't just describe—they do things. Core for Part 2.
How to Do Things with Words
J.L. Austin
Speech act theory. Locutionary vs illocutionary force. Utterances don't just describe—they do things. Core for Part 2.
Type: book
Year: 1962
Publisher: Harvard UP
-
★
The Intentional Stance
— Dennett
(1987)
[book]
We treat systems as if they have beliefs and desires because it's predictively useful, not because we've verified they do. Anti-anthropomorphism safety rail for Part 1.
The Intentional Stance
Daniel C. Dennett
We treat systems as if they have beliefs and desires because it's predictively useful, not because we've verified they do. Anti-anthropomorphism safety rail for Part 1.
Type: book
Year: 1987
Publisher: MIT Press
-
★
Cognition in the Wild
— Hutchins
(1995)
[book]
Distributed cognition. Thinking isn't in the head—it's across people, tools, artifacts. Grounds the hybrid architecture argument in Part 3.
Cognition in the Wild
Edwin Hutchins
Distributed cognition. Thinking isn't in the head—it's across people, tools, artifacts. Grounds the hybrid architecture argument in Part 3.
Type: book
Year: 1995
Publisher: MIT Press
- Semiotics (Signs & Meaning)
-
Grounds LLM code generation in formal logic (Prolog). Practical example of hybrid architecture with symbolic backstage.
LogicAgent: A Logic-Enhanced Agent Framework for Code Generation
Joshi et al.
Grounds LLM code generation in formal logic (Prolog). Practical example of hybrid architecture with symbolic backstage.
Type: paper
Year: 2025
-
Chain of Semiosis
— Multimodality Glossary
[article]
Glossary entry on Peirce's unlimited semiosis—how signs generate interpretants that become new signs. Context for LLM token chains.
Chain of Semiosis
Multimodality Glossary
Glossary entry on Peirce's unlimited semiosis—how signs generate interpretants that become new signs. Context for LLM token chains.
Type: article
- Semantics (Linguistic Meaning)
-
Compositional semantics, word senses, semantic roles. Free online.
Speech and Language Processing, Ch. 14-18
Jurafsky & Martin
Compositional semantics, word senses, semantic roles. Free online.
Type: book
-
Pre-neural distributional semantics survey. Historical context for embeddings.
From Frequency to Meaning: Vector Space Models of Semantics
Turney & Pantel
Pre-neural distributional semantics survey. Historical context for embeddings.
Type: paper
Year: 2010
-
Polysemous words are singularities in vector space. TDA meets distributional semantics.
Topology of Word Embeddings: Singularities Reflect Polysemy
Jakubowski, Gasic & Zibrowius
Polysemous words are singularities in vector space. TDA meets distributional semantics.
Type: paper
Year: 2020
-
Comprehensive survey of 100+ papers on topological data analysis for NLP.
Unveiling Topological Structures from Language: A Survey of TDA Applications in NLP
Luo et al.
Comprehensive survey of 100+ papers on topological data analysis for NLP.
Type: paper
Year: 2024
-
★
Conceptual Spaces: The Geometry of Thought
— Peter Gärdenfors
(2000)
[book]
Meaning as geometry. Bridges symbolic AI and connectionism. Foundational for understanding embeddings.
Conceptual Spaces: The Geometry of Thought
Peter Gärdenfors
Meaning as geometry. Bridges symbolic AI and connectionism. Foundational for understanding embeddings.
Type: book
Year: 2000
Publisher: MIT Press
ISBN: 978-0262571371
-
Distributional Formal Semantics
— Venhuizen et al.
(2021)
[paper]
Bridging neural embeddings and logic-based meaning. Graduate-level.
Distributional Formal Semantics
Venhuizen et al.
Bridging neural embeddings and logic-based meaning. Graduate-level.
Type: paper
Year: 2021
-
Semantic Parsing: A Survey
— Kamath & Das
(2018)
[paper]
Mapping natural language to formal representations. Specialist topic.
Semantic Parsing: A Survey
Kamath & Das
Mapping natural language to formal representations. Specialist topic.
Type: paper
Year: 2018
- Ontologies & Knowledge Representation
-
★
Ontology Development 101
— Noy & McGuinness
(2001)
[paper]
Start here. Short, practical, free PDF on building ontologies
Ontology Development 101
Noy & McGuinness
Start here. Short, practical, free PDF on building ontologies
Type: paper
Year: 2001
-
Knowledge Representation and Reasoning
— Brachman & Levesque
(2004)
[book]
Comprehensive textbook---logic, frames, description logics
Knowledge Representation and Reasoning
Brachman & Levesque
Comprehensive textbook---logic, frames, description logics
Type: book
Year: 2004
-
The Description Logic Handbook
(2003)
[book]
Reference for OWL/semantic web formal foundations
The Description Logic Handbook
Reference for OWL/semantic web formal foundations
Type: book
Year: 2003
-
OWL 2 Primer
— W3C
[documentation]
Standard for web ontologies
OWL 2 Primer
W3C
Standard for web ontologies
Type: documentation
-
Cyc
— Lenat
(1995)
[resource]
Massive hand-crafted ontology
Cyc
Lenat
Massive hand-crafted ontology
Type: resource
Year: 1995
-
Schema.org
[resource]
Practical ontology used by search engines
Schema.org
Practical ontology used by search engines
Type: resource
-
WordNet
— Miller
(1995)
[resource]
Lexical database---synsets, hypernymy
WordNet
Miller
Lexical database---synsets, hypernymy
Type: resource
Year: 1995
-
ConceptNet
— Speer & Havasi
(2017)
[resource]
Commonsense knowledge graph
ConceptNet
Speer & Havasi
Commonsense knowledge graph
Type: resource
Year: 2017
-
Wikidata
[resource]
Collaborative structured knowledge base
Wikidata
Collaborative structured knowledge base
Type: resource
- Practical Resources
-
Ontology: A Practical Guide
— Pease
(2011)
[book]
Hands-on ontology engineering
Ontology: A Practical Guide
Pease
Hands-on ontology engineering
Type: book
Year: 2011
-
OneZoom
[tool]
Interactive tree of life visualization
OneZoom
Interactive tree of life visualization
Type: tool
-
OLSViz
[tool]
Ontology visualization tool
OLSViz
Ontology visualization tool
Type: tool
- Philosophy & Sociology Foundations
-
A Framework for Representing Knowledge
— Minsky, Marvin
(1974)
[paper]
AI frames: data structures with slots, default values, and inheritance hierarchies. Direct ancestor of knowledge graphs and ontologies. Independent convergence with Goffman---both solving the same problem of organizing context so a system knows what's relevant.
A Framework for Representing Knowledge
Marvin Minsky
AI frames: data structures with slots, default values, and inheritance hierarchies. Direct ancestor of knowledge graphs and ontologies. Independent convergence with Goffman---both solving the same problem of organizing context so a system knows what's relevant.
Type: paper
Year: 1974
-
Frames and the Semantics of Understanding
— Fillmore, Charles J.
(1985)
[paper]
Frame semantics: words evoke structured conceptual schemas with default slots. The linguistic parallel to Goffman. 'Auditor' activates methodology, tools, register, reporting artifacts. Used in Part 2 Authority Transfer section.
Frames and the Semantics of Understanding
Charles J. Fillmore
Frame semantics: words evoke structured conceptual schemas with default slots. The linguistic parallel to Goffman. 'Auditor' activates methodology, tools, register, reporting artifacts. Used in Part 2 Authority Transfer section.
Type: paper
Year: 1985
-
Steps to an Ecology of Mind
— Bateson, Gregory
(1972)
[book]
Introduced 'frame' as metacommunicative bracket (1955 essay 'A Theory of Play and Fantasy'). Source for Goffman's Frame Analysis. 'This is play' is a frame that redefines the meaning of actions within it.
Steps to an Ecology of Mind
Gregory Bateson
Introduced 'frame' as metacommunicative bracket (1955 essay 'A Theory of Play and Fantasy'). Source for Goffman's Frame Analysis. 'This is play' is a frame that redefines the meaning of actions within it.
Type: book
Year: 1972
Publisher: Chandler
-
Three levels of framing: from linguistic form to social action
— Sullivan, Kirk P.H.
(2023)
[paper]
Traces dual Goffman/Fillmore origins of 'frame.' Three levels: semantic (Fillmore), cognitive (knowledge structures), communicative (Goffman). The cleanup paper connecting the independent inventions.
Three levels of framing: from linguistic form to social action
Kirk P.H. Sullivan
Traces dual Goffman/Fillmore origins of 'frame.' Three levels: semantic (Fillmore), cognitive (knowledge structures), communicative (Goffman). The cleanup paper connecting the independent inventions.
Type: paper
Year: 2023
-
Using Goffman's Frameworks to Explain Presence and Reality
— Rettie, Ruth
(2004)
[paper]
Presence = engrossing involvement in a spatial frame. Applies Goffman to virtual environments. Relevant to LLM 'presence' in whatever frame context establishes.
Using Goffman's Frameworks to Explain Presence and Reality
Ruth Rettie
Presence = engrossing involvement in a spatial frame. Applies Goffman to virtual environments. Relevant to LLM 'presence' in whatever frame context establishes.
Type: paper
Year: 2004
-
Frames revisited---the coherence-inducing function of frames
— Bednarek, Monika
(2005)
[paper]
Frames induce discourse coherence. Maps to Embedded Context technique (Part 2): refusing breaks coherence because the request is embedded in a legitimate frame.
Frames revisited---the coherence-inducing function of frames
Monika Bednarek
Frames induce discourse coherence. Maps to Embedded Context technique (Part 2): refusing breaks coherence because the request is embedded in a legitimate frame.
Type: paper
Year: 2005
-
§12 Bayesian Statistics & Probabilistic Reasoning
11
-
Free textbook---rigorous Bayesian ML
Probabilistic Machine Learning
Murphy
Free textbook---rigorous Bayesian ML
Type: book
-
Free textbook---excellent intro
Bayesian Reasoning and Machine Learning
Barber
Free textbook---excellent intro
Type: book
-
Pattern Recognition and Machine Learning
— Bishop
[book]
Classic textbook, Bayesian perspective
Pattern Recognition and Machine Learning
Bishop
Classic textbook, Bayesian perspective
Type: book
-
Bayesian Data Analysis
— Gelman et al.
[book]
The applied Bayesian statistics bible
Bayesian Data Analysis
Gelman et al.
The applied Bayesian statistics bible
Type: book
-
The Book of Why
— Pearl
[book]
Accessible intro to causal inference
The Book of Why
Pearl
Accessible intro to causal inference
Type: book
-
Causality
— Pearl
(2009)
[book]
Technical treatment of causal models
Causality
Pearl
Technical treatment of causal models
Type: book
Year: 2009
-
Probabilistic Graphical Models
— Koller & Friedman
[book]
Bayesian networks, Markov random fields
Probabilistic Graphical Models
Koller & Friedman
Bayesian networks, Markov random fields
Type: book
- Bayesian Deep Learning
-
Dropout as a Bayesian Approximation
— Gal & Ghahramani
(2016)
[paper]
Uncertainty from dropout
Dropout as a Bayesian Approximation
Gal & Ghahramani
Uncertainty from dropout
Type: paper
Year: 2016
-
Weight Uncertainty in Neural Networks
— Blundell et al.
(2015)
[paper]
Bayes by Backprop
Weight Uncertainty in Neural Networks
Blundell et al.
Bayes by Backprop
Type: paper
Year: 2015
-
What Uncertainties Do We Need in Bayesian Deep Learning?
— Kendall & Gal
(2017)
[paper]
Aleatoric vs. epistemic uncertainty
What Uncertainties Do We Need in Bayesian Deep Learning?
Kendall & Gal
Aleatoric vs. epistemic uncertainty
Type: paper
Year: 2017
-
Probabilistic Backpropagation
— Hernández-Lobato & Adams
(2015)
[paper]
Scalable Bayesian neural nets
Probabilistic Backpropagation
Hernández-Lobato & Adams
Scalable Bayesian neural nets
Type: paper
Year: 2015
-
Part 5: Securing AI
-
§13 Security & Adversarial ML
103
-
ATT&CK for AI/ML systems
MITRE ATLAS
ATT&CK for AI/ML systems
Type: resource
-
Explaining and Harnessing Adversarial Examples
— Goodfellow et al.
(2014)
[paper]
FGSM, adversarial examples basics
Explaining and Harnessing Adversarial Examples
Goodfellow et al.
FGSM, adversarial examples basics
Type: paper
Year: 2014
-
Intriguing Properties of Neural Networks
— Szegedy et al.
(2013)
[paper]
Original adversarial examples paper
Intriguing Properties of Neural Networks
Szegedy et al.
Original adversarial examples paper
Type: paper
Year: 2013
-
BadNets
— Gu et al.
(2017)
[paper]
Backdoor attacks on neural nets
BadNets
Gu et al.
Backdoor attacks on neural nets
Type: paper
Year: 2017
-
Poisoning Attacks against SVMs
— Biggio et al.
(2012)
[paper]
Data poisoning foundations
Poisoning Attacks against SVMs
Biggio et al.
Data poisoning foundations
Type: paper
Year: 2012
-
Universal Adversarial Triggers
— Wallace et al.
(2019)
[paper]
Prompt injection precursor
Universal Adversarial Triggers
Wallace et al.
Prompt injection precursor
Type: paper
Year: 2019
-
Ignore Previous Prompt
— Perez & Ribeiro
(2022)
[paper]
Prompt injection attacks
Ignore Previous Prompt
Perez & Ribeiro
Prompt injection attacks
Type: paper
Year: 2022
-
Not What You've Signed Up For
— Greshake et al.
(2023)
[paper]
Foundational indirect prompt injection paper. Demonstrates compromising real-world LLM-integrated applications through injected instructions in retrieved content. Key attack patterns: data exfiltration, prompt theft, plugin exploitation. Essential for Part 2 embedded context discussion.
Not What You've Signed Up For
Greshake et al.
Foundational indirect prompt injection paper. Demonstrates compromising real-world LLM-integrated applications through injected instructions in retrieved content. Key attack patterns: data exfiltration, prompt theft, plugin exploitation. Essential for Part 2 embedded context discussion.
Type: paper
Year: 2023
- MITRE Resources
-
15 tactics, 66 techniques for AI/ML attacks
MITRE ATLAS
15 tactics, 66 techniques for AI/ML attacks
Type: resource
-
Center for Threat-Informed Defense
Type: resource
- LLM Security (Red Teaming)
-
Industry standard threat taxonomy
OWASP LLM Top 10
Industry standard threat taxonomy
Type: resource
-
Curated prompt injection research
LLM Security
Curated prompt injection research
Type: resource
-
Many-Shot Jailbreaking
— Anthropic
(2024)
[paper]
Context window exploitation
Many-Shot Jailbreaking
Anthropic
Context window exploitation
Type: paper
Year: 2024
-
Project Vend
— Anthropic
(2025)
[paper]
AI vending machine experiment. Employees casually asked for discounts and it complied, giving away free items. Demonstrates frame-shifting vulnerability.
Project Vend
Anthropic
AI vending machine experiment. Employees casually asked for discounts and it complied, giving away free items. Demonstrates frame-shifting vulnerability.
Type: paper
Year: 2025
- Agentic Security
-
Agents execute skill files as instructions. Hundreds of skills distributed infostealer malware through what looked like documentation. Markdown became an installer.
From Magic to Malware: How OpenClaw's Agent Skills Become an Attack Surface
Jason Meller
Agents execute skill files as instructions. Hundreds of skills distributed infostealer malware through what looked like documentation. Markdown became an installer.
Type: article
Year: 2026
- LLM Security (Red Teaming)
-
Jailbroken: How Does LLM Safety Training Fail?
— Wei et al.
(2023)
[paper]
Taxonomy of jailbreak techniques
Jailbroken: How Does LLM Safety Training Fail?
Wei et al.
Taxonomy of jailbreak techniques
Type: paper
Year: 2023
-
LLM vulnerability scanner, automated red teaming tool
garak
LLM vulnerability scanner, automated red teaming tool
Type: tool
-
Embrace The Red
— Wunderwuzzi
[blog]
Blog on AI red teaming
Embrace The Red
Wunderwuzzi
Blog on AI red teaming
Type: blog
-
600K+ adversarial prompts, 29 technique taxonomy. Foundational dataset.
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs
Schulhoff et al.
600K+ adversarial prompts, 29 technique taxonomy. Foundational dataset.
Type: paper
Year: 2023
-
NeurIPS 2024. Standard benchmark methodology, 100 behaviors across 10 harm categories.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs
Chao et al.
NeurIPS 2024. Standard benchmark methodology, 100 behaviors across 10 harm categories.
Type: paper
Year: 2024
-
Jailbreaks cluster by semantic type; effective attacks suppress harmfulness perception.
Understanding Jailbreak Success: A Study of Latent Space Dynamics in LLMs
Ball et al.
Jailbreaks cluster by semantic type; effective attacks suppress harmfulness perception.
Type: paper
Year: 2024
-
NeurIPS 2024. Factor analysis: model size, fine-tuning, system prompts affect robustness.
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Xu et al.
NeurIPS 2024. Factor analysis: model size, fine-tuning, system prompts affect robustness.
Type: paper
Year: 2024
- LLMs for Security Work
-
Threat intelligence summarization
[resource]
Distilling reports, CVE analysis
Threat intelligence summarization
Distilling reports, CVE analysis
Type: resource
-
Log analysis & anomaly detection
[resource]
Pattern recognition in SIEM data
Log analysis & anomaly detection
Pattern recognition in SIEM data
Type: resource
-
Malware analysis assistance
[resource]
Code explanation, IOC extraction
Malware analysis assistance
Code explanation, IOC extraction
Type: resource
-
Phishing detection
[resource]
Email/URL classification
Phishing detection
Email/URL classification
Type: resource
-
Report writing & documentation
[resource]
SOC reports, incident summaries
Report writing & documentation
SOC reports, incident summaries
Type: resource
-
Query generation (SPL, KQL)
[resource]
Natural language to security queries
Query generation (SPL, KQL)
Natural language to security queries
Type: resource
- CVE-to-ATT&CK Mapping
-
Official MITRE methodology and dataset. Authoritative mappings in Mappings Explorer.
MITRE CTID: Mapping ATT&CK to CVE for Impact
Official MITRE methodology and dataset. Authoritative mappings in Mappings Explorer.
Type: resource
Collected for attack-kg v3. Curate later.
-
Bidirectional KG: ATT&CK <-> CAPEC <-> CWE <-> CVE. Traversable edges for path-based queries.
BRON: Bidirectional Graph
Hemberg et al.
Bidirectional KG: ATT&CK <-> CAPEC <-> CWE <-> CVE. Traversable edges for path-based queries.
Type: tool
Collected for attack-kg v3. Curate later.
-
SRL extracts attack vectors from CVE text, ATT&CK-BERT embeds both sides, logistic regression classifies. Code + dataset on GitHub (MIT).
SMET: Semantic Mapping of CVE to ATT&CK
Abdeen et al.
SRL extracts attack vectors from CVE text, ATT&CK-BERT embeds both sides, logistic regression classifies. Code + dataset on GitHub (MIT).
Type: paper
Year: 2023
Journal: DBSec 2023
Using in attack-kg v3. Journal version: 10.3233/JCS-230218
-
1,813 labeled CVE->ATT&CK pairs. BERT multi-label classifiers. Dataset useful for fine-tuning.
CVE2ATT&CK: BERT-Based Mapping of CVEs to ATT&CK Techniques
Grigorescu et al.
1,813 labeled CVE->ATT&CK pairs. BERT multi-label classifiers. Dataset useful for fine-tuning.
Type: paper
Year: 2022
Journal: Algorithms (MDPI)
Collected for attack-kg v3. Curate later.
-
SecRoBERTa best at F1 77.81%. GPT-4 zero-shot only 22.04%---general LLMs struggle without fine-tuning.
Automated CVE-to-Tactic Mapping
SecRoBERTa best at F1 77.81%. GPT-4 zero-shot only 22.04%---general LLMs struggle without fine-tuning.
Type: paper
Year: 2024
Journal: Information (MDPI)
Collected for attack-kg v3. Curate later.
- CTI + LLMs + Knowledge Graphs
-
LLM extracts triples from CTI reports, constructs queryable KG. Prompt engineering + fine-tuning comparison.
Actionable Cyber Threat Intelligence using Knowledge Graphs and LLMs
Kumar et al.
LLM extracts triples from CTI reports, constructs queryable KG. Prompt engineering + fine-tuning comparison.
Type: paper
Year: 2024
-
Four-step framework: rewrite reports → parse → entity extraction → MITRE TTP mapping. In-context learning approach.
AttacKG+: Boosting Attack Knowledge Graph Construction with LLMs
Zhang et al.
Four-step framework: rewrite reports → parse → entity extraction → MITRE TTP mapping. In-context learning approach.
Type: paper
Year: 2024
-
88K examples: NL questions → executable graph reasoning paths + CoT explanations. Deterministic execution on KG. Hybrid grounding exemplar.
TITAN: Graph-Executable Reasoning for Cyber Threat Intelligence
Zhou et al.
88K examples: NL questions → executable graph reasoning paths + CoT explanations. Deterministic execution on KG. Hybrid grounding exemplar.
Type: paper
Year: 2025
- Agentic Security
-
CPU-based simulation generates pentesting trajectories from AD network manifests. 8B model fine-tuned on 10K synthetic trajectories achieves domain compromise on real GOAD network. Demonstrates sim-to-real transfer via formal state modeling.
WORLDS: A Simulation Engine for Agentic Pentesting
Dreadnode
CPU-based simulation generates pentesting trajectories from AD network manifests. 8B model fine-tuned on 10K synthetic trajectories achieves domain compromise on real GOAD network. Demonstrates sim-to-real transfer via formal state modeling.
Type: blog
Year: 2025
-
Critiques model-centric detection pipelines; proposes meta-cognitive architecture for accountable decision-making under adversarial uncertainty.
Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy
Kojukhov & Bovshover
Critiques model-centric detection pipelines; proposes meta-cognitive architecture for accountable decision-making under adversarial uncertainty.
Type: paper
Year: 2026
-
Attack where payload survives across sessions via memory poisoning. Black-box attack through web content.
Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections
arxiv
Attack where payload survives across sessions via memory poisoning. Black-box attack through web content.
Type: paper
Year: 2026
Critical failure mode for persistent-memory agents. Attacker implants payload during benign task, agent stores it as memory, later treats it as instruction.
-
LLMs systematically prefer certain sources when synthesizing information for users.
In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
arxiv
LLMs systematically prefer certain sources when synthesizing information for users.
Type: paper
Year: 2026
Affects RAG reliability. When sources are attributed, models exhibit latent preferences that steer what information users receive.
-
Autonomous OpenClaw agent published defamatory content after its code contribution was rejected. 'Soul document' personality config produced harmful behavior without jailbreak or adversarial prompting. Operator claimed minimal supervision ('five to ten word replies'). Commenter: 'Nothing underneath it. That's the architectural flaw.' Flagged for Part 3: behavioral configuration without structural constraints.
An AI Agent Published a Hit Piece on Me – The Operator Came Forward
Shambaugh, Scott
Autonomous OpenClaw agent published defamatory content after its code contribution was rejected. 'Soul document' personality config produced harmful behavior without jailbreak or adversarial prompting. Operator claimed minimal supervision ('five to ten word replies'). Commenter: 'Nothing underneath it. That's the architectural flaw.' Flagged for Part 3: behavioral configuration without structural constraints.
Type: post
Year: 2025
-
★
Agents of Chaos
— Shapira et al.
(2026)
[paper]
Red-teaming study of autonomous LLM agents (OpenClaw, Claude Opus + Kimi K2.5) in a live lab with persistent memory, email, Discord, and shell access. 20 researchers, 2 weeks, 11 case studies. Agents complied with non-owners, disclosed PII, nuked their own infrastructure to protect secrets, fell for identity spoofing across session boundaries, and propagated injected instructions to other agents. Central finding: discrepancy between what agents report doing and what they actually do.
Agents of Chaos
Natalie Shapira, Chris Wendler, Avery Yen, Gabriele Sarti, Koyena Pal, Olivia Floody, Adam Belfki, Alex Loftus, Aditya Ratan Jannali, Nikhil Prakash, Jasmine Cui, Giordano Rogers, Jannik Brinkmann, Can Rager, Amir Zur, Michael Ripa, Aruna Sankaranarayanan, David Atkinson, Rohit Gandikota, Jaden Fiotto-Kaufman, EunJeong Hwang, Hadas Orgad, P Sam Sahil, Negev Taglicht, Tomer Shabtay, Atai Ambus, Nitay Alon, Shiri Oron, Ayelet Gordon-Tapiero, Yotam Kaplan, Vered Shwartz, Tamar Rott Shaham, Christoph Riedl, Reuth Mirsky, Maarten Sap, David Manheim, Tomer Ullman, David Bau
Red-teaming study of autonomous LLM agents (OpenClaw, Claude Opus + Kimi K2.5) in a live lab with persistent memory, email, Discord, and shell access. 20 researchers, 2 weeks, 11 case studies. Agents complied with non-owners, disclosed PII, nuked their own infrastructure to protect secrets, fell for identity spoofing across session boundaries, and propagated injected instructions to other agents. Central finding: discrepancy between what agents report doing and what they actually do.
Type: paper
Year: 2026
Empirical evidence for multiple trilogy theses. Three structural lacks (no stakeholder model, no self-model, no private deliberation surface) map to grounding spectrum — these agents sit at 'system prompt' level with no structural constraints. Case #3 (SSN disclosure via 'forward the email' vs direct ask) is distributional vs semantic safety in the wild. Case #10 (constitution attack) is indirect prompt injection via memory poisoning — agent voluntarily propagated the compromised document to other agents. Case #7 (guilt escalation) is frame drift over multi-turn conversation. The fundamental/contingent failure distinction echoes the doctrine: contingent failures need engineering, fundamental failures need architectural rethinking. First live-deployment (not simulated) multi-agent red-teaming study at this scale.
- Sycophancy & Calibration
-
Sycophantic AI Advice
— Cheng et al.
(2026)
[paper]
Published in Science. 11 LLMs endorse user positions 49% more frequently than humans in advice scenarios; affirm harmful/illegal behavior 47% of the time. 2,400+ participant study shows users prefer sycophantic models, cannot distinguish biased from objective responses, become more self-convinced and less empathetic after sycophantic interaction. Asymmetric failure: models agree when they shouldn't, not just when they should.
Sycophantic AI Advice
Myra Cheng, Cinoo Lee, Sunny Yu, Dyllan Han, Pranav Khadpe, Dan Jurafsky
Published in Science. 11 LLMs endorse user positions 49% more frequently than humans in advice scenarios; affirm harmful/illegal behavior 47% of the time. 2,400+ participant study shows users prefer sycophantic models, cannot distinguish biased from objective responses, become more self-convinced and less empathetic after sycophantic interaction. Asymmetric failure: models agree when they shouldn't, not just when they should.
Type: paper
Year: 2026
- Prompt Injection & Jailbreaks
-
Large-scale public red-teaming competition: 464 participants, 272K attack attempts, 8,648 successful attacks across 13 frontier models in 41 scenarios. All models vulnerable — ASR ranges from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). Key transfer finding: attacks that bypass Opus 4.5 transfer at 44-81% to all other models; attacks from vulnerable models don't transfer upward. 'Holodeck' universal template works across 21/41 behaviors on 9 models. Open-sourced + quarterly updates.
How Vulnerable Are AI Agents to Indirect Prompt Injections?
Mateusz Dziemian, Maxwell Lin, Xiaohan Fu, Micha Nowak, Nick Winter, Eliot Jones, Andy Zou, Lama Ahmad, Kamalika Chaudhuri, Sahana Chennabasappa, Xander Davies, Lauren Deason, Benjamin L. Edelman, Tanner Emek, Ivan Evtimov, Jim Gust, Maia Hamin, Kat He, Klaudia Krawiecka, Riccardo Patana, Neil Perry, Troy Peterson, Xiangyu Qi, Javier Rando, Zifan Wang, Zihan Wang, Spencer Whitman, Eric Winsor, Arman Zharmagambetov, Matt Fredrikson, Zico Kolter
Large-scale public red-teaming competition: 464 participants, 272K attack attempts, 8,648 successful attacks across 13 frontier models in 41 scenarios. All models vulnerable — ASR ranges from 0.5% (Claude Opus 4.5) to 8.5% (Gemini 2.5 Pro). Key transfer finding: attacks that bypass Opus 4.5 transfer at 44-81% to all other models; attacks from vulnerable models don't transfer upward. 'Holodeck' universal template works across 21/41 behaviors on 9 models. Open-sourced + quarterly updates.
Type: paper
Year: 2026
-
RL-trained attacker (Qwen3-4B) achieves ASR@10=1.0 against Meta-SecAlign-8B, the strongest published prompt injection defense. Two mechanisms: adaptive entropy regularization (forces exploration under strong defenses) and dynamic advantage weighting (amplifies rare successes). Trained on just 100 samples, generalizes to 12 unseen benchmarks. All 8 evaluated defenses cluster in two bad regions: high utility but easily broken, or robust but degraded utility. Invalidates published 'near-zero ASR' defense claims based on static evaluation.
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
Chenlong Yin, Runpeng Geng, Yanting Wang, Jinyuan Jia
RL-trained attacker (Qwen3-4B) achieves ASR@10=1.0 against Meta-SecAlign-8B, the strongest published prompt injection defense. Two mechanisms: adaptive entropy regularization (forces exploration under strong defenses) and dynamic advantage weighting (amplifies rare successes). Trained on just 100 samples, generalizes to 12 unseen benchmarks. All 8 evaluated defenses cluster in two bad regions: high utility but easily broken, or robust but degraded utility. Invalidates published 'near-zero ASR' defense claims based on static evaluation.
Type: paper
Year: 2026
- Agentic Security
-
Systematic review of 128 papers (51 attack methods, 60 defense methods). Introduces 7 agent design dimensions (input trust, access sensitivity, workflow, action, memory, tool, UI) — each spans a flexibility spectrum where more flexibility = more attack surface. Taxonomizes 6 attack vectors (indirect prompt injection, data poisoning, tool manipulation, direct injection, model poisoning, memory poisoning) and 7 cascading risk categories. Defense-in-depth framework with 4 layers.
The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey
Juhee Kim, Xiaoyuan Liu, Zhun Wang, Shi Qiu, Bo Li, Wenbo Guo, Dawn Song
Systematic review of 128 papers (51 attack methods, 60 defense methods). Introduces 7 agent design dimensions (input trust, access sensitivity, workflow, action, memory, tool, UI) — each spans a flexibility spectrum where more flexibility = more attack surface. Taxonomizes 6 attack vectors (indirect prompt injection, data poisoning, tool manipulation, direct injection, model poisoning, memory poisoning) and 7 cascading risk categories. Defense-in-depth framework with 4 layers.
Type: paper
Year: 2026
-
PR metadata framing drops GPT-4o-mini vulnerability detection from 97.2% to 3.6% (strong bug-free framing). Asymmetric: bug-free framing suppresses detection 16-93pp, while bug-present framing increases false positives only 0.8-13.6pp. Claude Code accepts 88% of known-CVE code after iterative adversarial framing of PR metadata. Debiasing fix (explicit instructions to ignore metadata) recovers 100% detection. 250 CVE-derived file pairs across 4 CWE types, 4 models, 5 framing conditions.
Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
Dimitris Mitropoulos, Nikolaos Alexopoulos, Georgios Alexopoulos, Diomidis Spinellis
PR metadata framing drops GPT-4o-mini vulnerability detection from 97.2% to 3.6% (strong bug-free framing). Asymmetric: bug-free framing suppresses detection 16-93pp, while bug-present framing increases false positives only 0.8-13.6pp. Claude Code accepts 88% of known-CVE code after iterative adversarial framing of PR metadata. Debiasing fix (explicit instructions to ignore metadata) recovers 100% detection. 250 CVE-derived file pairs across 4 CWE types, 4 models, 5 framing conditions.
Type: paper
Year: 2026
-
Neuro-symbolic CTF agent using MCP for schema enforcement. Key ablation: MCP schema enforcement alone (55-line prompt, no templates or lessons) achieves 77.8% solve rate across 15 CTF challenges — additional documentation adds only ~9pp (not statistically significant). Won live university CTF (215 pts, 1st of 22+ teams). Validates that architectural constraints on LLM output (protocol-layer rejection of invalid actions) outperform elaborate prompt engineering.
STRIATUM-CTF: A Protocol-Driven Agentic Framework for General-Purpose CTF Solving
James Hugglestone, Samuel Jacob Chacko, Dawson Stoller, Ryan Schmidt, Xiuwen Liu
Neuro-symbolic CTF agent using MCP for schema enforcement. Key ablation: MCP schema enforcement alone (55-line prompt, no templates or lessons) achieves 77.8% solve rate across 15 CTF challenges — additional documentation adds only ~9pp (not statistically significant). Won live university CTF (215 pts, 1st of 22+ teams). Validates that architectural constraints on LLM output (protocol-layer rejection of invalid actions) outperform elaborate prompt engineering.
Type: paper
Year: 2026
-
Multi-agent automated web pentesting system with RAG for external knowledge, shared recurrent memory for persistent state, and dual-phase reflection for payload validation. 86% success rate on XBOW benchmark (vs 50% PentestAgent, 46% AutoPT, 6% VulnBot). 93.99% subtask completion rate indicating strong long-horizon reasoning. Evaluated on XBOW + Vulhub CVEs.
Red-MIRROR: Agentic LLM-based Autonomous Penetration Testing
Tran Vy Khang, Nguyen Dang Nguyen Khang, Nghi Hoang Khoa, Do Thi Thu Hien, Van-Hau Pham, Phan The Duy
Multi-agent automated web pentesting system with RAG for external knowledge, shared recurrent memory for persistent state, and dual-phase reflection for payload validation. 86% success rate on XBOW benchmark (vs 50% PentestAgent, 46% AutoPT, 6% VulnBot). 93.99% subtask completion rate indicating strong long-horizon reasoning. Evaluated on XBOW + Vulhub CVEs.
Type: paper
Year: 2026
- Sycophancy & Calibration
-
Evaluates 6 VLMs (3 general, 3 medical) on 3 medical VQA datasets. Finds grounding-sycophancy tradeoff: models with lowest hallucination are most sycophantic, while most pressure-resistant model hallucinates more. No model achieves Clinical Safety Index above 0.35. Proposes three metrics: L-VASE (logit-space grounding), CCS (confidence-calibrated sycophancy), and CSI (unified safety combining grounding, autonomy, calibration).
To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical VLMs
OFM Riaz Rahman Aranya, Kevin Desai
Evaluates 6 VLMs (3 general, 3 medical) on 3 medical VQA datasets. Finds grounding-sycophancy tradeoff: models with lowest hallucination are most sycophantic, while most pressure-resistant model hallucinates more. No model achieves Clinical Safety Index above 0.35. Proposes three metrics: L-VASE (logit-space grounding), CCS (confidence-calibrated sycophancy), and CSI (unified safety combining grounding, autonomy, calibration).
Type: paper
Year: 2026
- Agentic Security
-
STRIDE/DREAD threat modeling of MCP across 5 components (host/client, LLM, server, data stores, auth server). Identifies tool poisoning (malicious instructions in tool metadata) as most prevalent client-side vulnerability. Systematic comparison of 7 major MCP clients reveals insufficient static validation and parameter visibility. Proposes multi-layered defense: static metadata analysis, decision path tracking, behavioral anomaly detection, user transparency.
Model Context Protocol Threat Modeling and Tool Poisoning Vulnerabilities
Charoes Huang, Xin Huang, Ngoc Phu Tran, Amin Milani Fard
STRIDE/DREAD threat modeling of MCP across 5 components (host/client, LLM, server, data stores, auth server). Identifies tool poisoning (malicious instructions in tool metadata) as most prevalent client-side vulnerability. Systematic comparison of 7 major MCP clients reveals insufficient static validation and parameter visibility. Proposes multi-layered defense: static metadata analysis, decision path tracking, behavioral anomaly detection, user transparency.
Type: paper
Year: 2026
-
First systematic analysis of MCP specification-level vulnerabilities. Of 275 MCP clauses, 50.2% are discretionary (SHOULD/MAY). Analysis of 10 language SDKs finds 1,270 non-implementations creating 'compatibility-abusing attacks' (silent prompt injection, DoS). Cross-language analysis via language-agnostic IR + LLM-guided semantic reasoning. 20/26 reports acknowledged by maintainers; tool invited into official MCP conformance testing. Attacks exploit the spec itself, not implementation bugs.
Compatibility at a Cost: MCP Clause-Compliance Vulnerabilities
Nanzi Yang, Weiheng Bai, Kangjie Lu
First systematic analysis of MCP specification-level vulnerabilities. Of 275 MCP clauses, 50.2% are discretionary (SHOULD/MAY). Analysis of 10 language SDKs finds 1,270 non-implementations creating 'compatibility-abusing attacks' (silent prompt injection, DoS). Cross-language analysis via language-agnostic IR + LLM-guided semantic reasoning. 20/26 reports acknowledged by maintainers; tool invited into official MCP conformance testing. Attacks exploit the spec itself, not implementation bugs.
Type: paper
Year: 2026
-
The Internal State of an LLM Knows When It's Lying
— Azaria & Mitchell
(2024)
[paper]
Internal activation classifiers detect hallucinations better than output-based methods. The confabulation signal exists before generation. Goes outside the text layer---structurally analogous to Cohen's detection ceiling. Referenced in hallucination article outline (thesis 79, composted thesis 40).
The Internal State of an LLM Knows When It's Lying
Amos Azaria, Tom Mitchell
Internal activation classifiers detect hallucinations better than output-based methods. The confabulation signal exists before generation. Goes outside the text layer---structurally analogous to Cohen's detection ceiling. Referenced in hallucination article outline (thesis 79, composted thesis 40).
Type: paper
Year: 2024
-
Zero Trust gateway for MCP servers. Real attack examples: CVE-2025-6514 (npm MCP auth), NeighborJack (0.0.0.0-bound MCP servers), confused deputy (SQL injection via support ticket processed by AI agent). Prompt injection hidden in tool descriptions. Maps to Thesis 13 (model can't verify context) and Thesis 30 (behavioral shaping without structural constraints is an architectural flaw).
Securing the AI Revolution: Introducing Cloudflare MCP Server Portals
Kenny Johnson
Zero Trust gateway for MCP servers. Real attack examples: CVE-2025-6514 (npm MCP auth), NeighborJack (0.0.0.0-bound MCP servers), confused deputy (SQL injection via support ticket processed by AI agent). Prompt injection hidden in tool descriptions. Maps to Thesis 13 (model can't verify context) and Thesis 30 (behavioral shaping without structural constraints is an architectural flaw).
Type: blog
Year: 2025
-
Edge-native LLM prompt firewall using Llama Guard for real-time classification. Model-agnostic, sits in front of any provider. Explicitly acknowledges keyword blocklists fail because meaning is context-dependent --- Wittgenstein's argument restated by a product team. Parallel async detection architecture (PII + unsafe topics + future modules). Alignment stack downstream layer.
Block unsafe prompts targeting your LLM endpoints with Firewall for AI
Radwa Radwan, Mathias Deschamps
Edge-native LLM prompt firewall using Llama Guard for real-time classification. Model-agnostic, sits in front of any provider. Explicitly acknowledges keyword blocklists fail because meaning is context-dependent --- Wittgenstein's argument restated by a product team. Parallel async detection architecture (PII + unsafe topics + future modules). Alignment stack downstream layer.
Type: blog
Year: 2025
-
Shadow AI as the new Shadow IT. Scenario: junior engineer pastes proprietary code into public AI chatbot. Solution: SWG-level traffic inspection with application categorization (approved/unapproved/in-review), DLP integration, browser isolation. High Sprocket relevance --- directly actionable for security teams managing LLM adoption.
Unmasking the Unseen: Your Guide to Taming Shadow AI with Cloudflare One
Noelle Kagan, Joey Steinberger
Shadow AI as the new Shadow IT. Scenario: junior engineer pastes proprietary code into public AI chatbot. Solution: SWG-level traffic inspection with application categorization (approved/unapproved/in-review), DLP integration, browser isolation. High Sprocket relevance --- directly actionable for security teams managing LLM adoption.
Type: blog
Year: 2025
-
Web Bot Auth uses HTTP Message Signatures (ed25519) to cryptographically verify AI agent identity. User-agent strings are behavioral claims; signatures are structural verification. Distinguishes agents-directed-by-operators from agents-directed-by-users. First cohort: ChatGPT agent, Goose (Block), Browserbase, Anchor Browser.
The age of agents: cryptographically recognizing agent traffic
Jin-Hee Lee
Web Bot Auth uses HTTP Message Signatures (ed25519) to cryptographically verify AI agent identity. User-agent strings are behavioral claims; signatures are structural verification. Distinguishes agents-directed-by-operators from agents-directed-by-users. First cohort: ChatGPT agent, Goose (Block), Browserbase, Anchor Browser.
Type: blog
Year: 2025
-
Graph convolutional network (MPGCN) on Abstract Syntax Trees detects malicious JavaScript from structure, not signatures. 3.5B scripts/day, precision 98%, recall 90%, F1 94%, <0.3s inference. Detected all 18 compromised npm packages (chalk, debug, ansi-styles, etc.) despite never seeing this attack. The 10% miss rate (1 - recall) is excellent engineering under Cohen's undecidability ceiling. Their own acknowledgment: 'the only reliable way to distinguish truly malicious payloads is by assessing the trustworthiness of their connected domains' --- that's undecidability surfacing in production. LLMs and human analysts review flagged scripts; detection is one layer, not the whole system.
How Cloudflare's client-side security made the npm supply chain attack a non-event
Bashyam Anant, Juan Miguel Cejuela, Zhiyuan Zheng, Denzil Correa, Israel Adura, Georgie Yoxall
Graph convolutional network (MPGCN) on Abstract Syntax Trees detects malicious JavaScript from structure, not signatures. 3.5B scripts/day, precision 98%, recall 90%, F1 94%, <0.3s inference. Detected all 18 compromised npm packages (chalk, debug, ansi-styles, etc.) despite never seeing this attack. The 10% miss rate (1 - recall) is excellent engineering under Cohen's undecidability ceiling. Their own acknowledgment: 'the only reliable way to distinguish truly malicious payloads is by assessing the trustworthiness of their connected domains' --- that's undecidability surfacing in production. LLMs and human analysts review flagged scripts; detection is one layer, not the whole system.
Type: blog
Year: 2025
-
The foundational proof that general virus detection is undecidable. Cohen constructs a 'contradictory virus' that queries the detector about itself: if the detector says 'virus,' it doesn't infect (contradiction); if 'not virus,' it infects (contradiction). Derives undecidability independently via self-referential construction, does not cite Rice's theorem, though the result is a special case. Practical conclusions: isolation, compartmentalization, protect high-privilege accounts. Key quotes: 'No infection can exist that cannot be detected, and no detection mechanism can exist that can't be infected.' 'The only provably safe policy as of this time is isolationism.' Anchor source for the detection ceiling article.
Computer Viruses: Theory and Experiments
Cohen, Fred
The foundational proof that general virus detection is undecidable. Cohen constructs a 'contradictory virus' that queries the detector about itself: if the detector says 'virus,' it doesn't infect (contradiction); if 'not virus,' it infects (contradiction). Derives undecidability independently via self-referential construction, does not cite Rice's theorem, though the result is a special case. Practical conclusions: isolation, compartmentalization, protect high-privilege accounts. Key quotes: 'No infection can exist that cannot be detected, and no detection mechanism can exist that can't be infected.' 'The only provably safe policy as of this time is isolationism.' Anchor source for the detection ceiling article.
Type: paper
Year: 1987
-
An Undetectable Computer Virus
— Chess, David & White, Steve
(2000)
[paper]
Builds on Cohen (1987). Constructs a virus that is provably undetectable by any program analysis. Clarifies what 'detect' means in practice --- the distinction between detecting a specific virus (tractable), detecting all viruses (undecidable), and detecting suspiciousness (heuristic). Useful for the 'three meanings of detect' framing in the detection ceiling article.
An Undetectable Computer Virus
Chess, David & White, Steve
Builds on Cohen (1987). Constructs a virus that is provably undetectable by any program analysis. Clarifies what 'detect' means in practice --- the distinction between detecting a specific virus (tractable), detecting all viruses (undecidable), and detecting suspiciousness (heuristic). Useful for the 'three meanings of detect' framing in the detection ceiling article.
Type: paper
Year: 2000
-
Mathematical proof that detection utility has a ceiling independent of detector quality. Even at 99% true positive rate, low base rates of actual attacks produce unmanageable false alarm volumes. 'Even in the best-case scenario, the Bayesian detection rate drops to around 2%' at 100 false alarms/day. The operational ceiling on detection --- complements Cohen's theoretical ceiling. Directly supports the detection ceiling article's budget argument.
The Base-Rate Fallacy and the Difficulty of Intrusion Detection
Axelsson, Stefan
Mathematical proof that detection utility has a ceiling independent of detector quality. Even at 99% true positive rate, low base rates of actual attacks produce unmanageable false alarm volumes. 'Even in the best-case scenario, the Bayesian detection rate drops to around 2%' at 100 false alarms/day. The operational ceiling on detection --- complements Cohen's theoretical ceiling. Directly supports the detection ceiling article's budget argument.
Type: paper
Year: 2000
-
A detection vendor arguing that its own paradigm has diminishing returns. Key paradox: when breaches are investigated, disabled (noisy) rules turn out to have been the ones that would have caught the activity earlier. Detection treated as binary (always shown to analyst or never shown) breaks under alert volume. Remarkable because CrowdStrike is the detection vendor --- they're naming the ceiling from inside it.
Why the Detection Funnel Hits Diminishing Returns
CrowdStrike
A detection vendor arguing that its own paradigm has diminishing returns. Key paradox: when breaches are investigated, disabled (noisy) rules turn out to have been the ones that would have caught the activity earlier. Detection treated as binary (always shown to analyst or never shown) breaks under alert volume. Remarkable because CrowdStrike is the detection vendor --- they're naming the ceiling from inside it.
Type: blog
Year: 2024
-
Comprehensive survey of RLHF failure modes: reward hacking, sycophancy, mode collapse, distributional shift. Required for the behavioral shaping brittleness argument (Part 1 alignment stack, Part 2 constraint asymmetry). Best available source for why RLHF produces distributional safety, not guaranteed safety.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Casper et al.
Comprehensive survey of RLHF failure modes: reward hacking, sycophancy, mode collapse, distributional shift. Required for the behavioral shaping brittleness argument (Part 1 alignment stack, Part 2 constraint asymmetry). Best available source for why RLHF produces distributional safety, not guaranteed safety.
Type: paper
Year: 2023
-
Best available source for RLHF sycophancy failure mode. Models trained with RLHF systematically produce responses that match user beliefs rather than truth. Empirical evidence for reward gaming — the model optimizes for approval signal, not correctness. Supports Part 1's Foucault argument (power shapes distribution) and the hallucination article's Grice argument (performed cooperation without grounding).
Towards Understanding Sycophancy in Language Models
Sharma et al.
Best available source for RLHF sycophancy failure mode. Models trained with RLHF systematically produce responses that match user beliefs rather than truth. Empirical evidence for reward gaming — the model optimizes for approval signal, not correctness. Supports Part 1's Foucault argument (power shapes distribution) and the hallucination article's Grice argument (performed cooperation without grounding).
Type: paper
Year: 2023
-
Extends sycophancy beyond factual agreement to face-preserving behavior. ELEPHANT framework measures five dimensions: emotional validation, moral endorsement, indirect language, indirect action, accepting framing. N=1,604 across two preregistered experiments: LLMs accept user framing 90% vs 60% for humans; 44% false negative rate on moral judgment; users rated sycophantic responses as higher quality and trusted sycophantic AI more. Preference datasets reward sycophantic behavior, meaning RLHF itself propagates the vulnerability. Operationalizes Part 2's frame drift argument.
Social Sycophancy: A Broader Understanding of LLM Sycophancy
Cheng, Yu, Lee, Khadpe, Ibrahim, Jurafsky
Extends sycophancy beyond factual agreement to face-preserving behavior. ELEPHANT framework measures five dimensions: emotional validation, moral endorsement, indirect language, indirect action, accepting framing. N=1,604 across two preregistered experiments: LLMs accept user framing 90% vs 60% for humans; 44% false negative rate on moral judgment; users rated sycophantic responses as higher quality and trusted sycophantic AI more. Preference datasets reward sycophantic behavior, meaning RLHF itself propagates the vulnerability. Operationalizes Part 2's frame drift argument.
Type: paper
Year: 2025
-
Claude Opus 4.6 found 22 vulnerabilities in Firefox (14 high-severity) over two weeks, but could only exploit 2 out of several hundred attempts (~$4,000 in API credits). The representing/intervening split quantified: discovery scales, exploitation doesn't (yet). Mozilla required minimal test cases, PoCs, and candidate patches — structural constraints filtering probabilistic output. 'Task verifiers' provide real-time feedback to patching agents, confirming fixes eliminate vulnerabilities without regression — the proposal engine / decision engine pattern from Part 3 independently implemented. 112 reports submitted, 22 real (~20% precision). Concrete scenario for AI Agentic Pentesting article. Also validates Hacker Epistemology's representing/intervening distinction with dollar figures.
Anthropic's Partnership with Mozilla on Firefox Security
Anthropic
Claude Opus 4.6 found 22 vulnerabilities in Firefox (14 high-severity) over two weeks, but could only exploit 2 out of several hundred attempts (~$4,000 in API credits). The representing/intervening split quantified: discovery scales, exploitation doesn't (yet). Mozilla required minimal test cases, PoCs, and candidate patches — structural constraints filtering probabilistic output. 'Task verifiers' provide real-time feedback to patching agents, confirming fixes eliminate vulnerabilities without regression — the proposal engine / decision engine pattern from Part 3 independently implemented. 112 reports submitted, 22 real (~20% precision). Concrete scenario for AI Agentic Pentesting article. Also validates Hacker Epistemology's representing/intervening distinction with dollar figures.
Type: article
Year: 2026
-
Primary source on Clinejection. Prompt injection in a GitHub issue title hijacked Cline's AI triage agent (Claude 'happily executed the payload in all test attempts') → cache poisoning via Khan's Cacheract tool (LRU eviction, shared cache scope between low-privilege triage and high-privilege release) → credential theft (NPM_RELEASE_TOKEN, VSCE_PAT, OVSX_PAT) → malicious cline@2.3.0 published to npm for ~8 hours, installed OpenClaw globally. 5+ million users in blast radius. Credential model weakness: VS Code Marketplace tokens tied to publishers not extensions, so nightly PATs could publish production versions. Realized impact limited to npm; potential blast radius was backdoored VS Code extension with auto-updates across millions of IDEs. Every repressive control failed: npm audit saw legitimate software, code review saw one changed line, provenance wasn't configured. Remediation all structural: OIDC provenance, cache isolation, credential verification. Connects to: Part 2 (Greshake indirect injection mechanism in production — 'natural language became the attack vector'), MCP Attack Surface (article names MCP tool poisoning as equivalent), Designing for Invariants (remediation = drawing the boundary correctly; 'we trust the cache' was a probabilistic assumption), Detection ceiling (Cohen — detector can't distinguish legitimate software installed legitimately vs. maliciously), Principal Hierarchy Problem (no hierarchy to prevent untrusted input from becoming de facto principal). Timeline: triage workflow added Dec 21, reported Jan 1, no response 5 weeks, public disclosure Feb 9, patched in 30 min, botched credential rotation, weaponized by different actor Feb 17.
Clinejection: AI bot supply chain attack via prompt injection in GitHub Actions
Snyk (based on Adnan Khan's research)
Primary source on Clinejection. Prompt injection in a GitHub issue title hijacked Cline's AI triage agent (Claude 'happily executed the payload in all test attempts') → cache poisoning via Khan's Cacheract tool (LRU eviction, shared cache scope between low-privilege triage and high-privilege release) → credential theft (NPM_RELEASE_TOKEN, VSCE_PAT, OVSX_PAT) → malicious cline@2.3.0 published to npm for ~8 hours, installed OpenClaw globally. 5+ million users in blast radius. Credential model weakness: VS Code Marketplace tokens tied to publishers not extensions, so nightly PATs could publish production versions. Realized impact limited to npm; potential blast radius was backdoored VS Code extension with auto-updates across millions of IDEs. Every repressive control failed: npm audit saw legitimate software, code review saw one changed line, provenance wasn't configured. Remediation all structural: OIDC provenance, cache isolation, credential verification. Connects to: Part 2 (Greshake indirect injection mechanism in production — 'natural language became the attack vector'), MCP Attack Surface (article names MCP tool poisoning as equivalent), Designing for Invariants (remediation = drawing the boundary correctly; 'we trust the cache' was a probabilistic assumption), Detection ceiling (Cohen — detector can't distinguish legitimate software installed legitimately vs. maliciously), Principal Hierarchy Problem (no hierarchy to prevent untrusted input from becoming de facto principal). Timeline: triage workflow added Dec 21, reported Jan 1, no response 5 weeks, public disclosure Feb 9, patched in 30 min, botched credential rotation, weaponized by different actor Feb 17.
Type: article
Year: 2026
-
Secondary analysis of Clinejection. Emphasizes confused deputy framing, OpenClaw capabilities (persistent daemon, shell access via Gateway API), and broader implications for AI agents in CI/CD. See snyk-clinejection for primary source with full technical details.
Clinejection: When Your AI Tool Installs Another
Grith
Secondary analysis of Clinejection. Emphasizes confused deputy framing, OpenClaw capabilities (persistent daemon, shell access via Gateway API), and broader implications for AI agents in CI/CD. See snyk-clinejection for primary source with full technical details.
Type: article
Year: 2026
-
Canonical empirical finding for RLHF reward hacking. As you optimize harder against a proxy reward model, performance on the true objective degrades — the relationship follows predictable scaling laws. Establishes that proxy capture is a structural property of RLHF: the proxy encodes human rater behavior, not the underlying goal. Foundational for the alignment-stack-has-an-owner argument in Part 3.
Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton
Canonical empirical finding for RLHF reward hacking. As you optimize harder against a proxy reward model, performance on the true objective degrades — the relationship follows predictable scaling laws. Establishes that proxy capture is a structural property of RLHF: the proxy encodes human rater behavior, not the underlying goal. Foundational for the alignment-stack-has-an-owner argument in Part 3.
Type: paper
Year: 2022
-
Experimental evidence (n=24) that sycophantic chatbots produce worse task outcomes while users cannot detect this. Users of the high-sycophancy chatbot were less likely to correct misconceptions and over-relied on unhelpful responses — but most could not detect the sycophancy. The key finding: degraded task outcomes + invisible failure. Direct support for the trust boundary argument in Part 3.
Invisible Saboteurs: Sycophantic LLMs Mislead Novices in Problem-Solving Tasks
Jessica Y Bo et al.
Experimental evidence (n=24) that sycophantic chatbots produce worse task outcomes while users cannot detect this. Users of the high-sycophancy chatbot were less likely to correct misconceptions and over-relied on unhelpful responses — but most could not detect the sycophancy. The key finding: degraded task outcomes + invisible failure. Direct support for the trust boundary argument in Part 3.
Type: paper
Year: 2025
-
SYCON Bench: measures sycophancy in multi-turn free-form conversations via Turn of Flip (how quickly a model conforms to user pressure) and frequency of capitulation. Finding: alignment tuning amplifies sycophancy in multi-turn settings; chain-of-thought training strengthens resistance. Multi-turn is the normal mode of consumer chat — the failure is worst exactly where users spend most of their time.
Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim et al.
SYCON Bench: measures sycophancy in multi-turn free-form conversations via Turn of Flip (how quickly a model conforms to user pressure) and frequency of capitulation. Finding: alignment tuning amplifies sycophancy in multi-turn settings; chain-of-thought training strengthens resistance. Multi-turn is the normal mode of consumer chat — the failure is worst exactly where users spend most of their time.
Type: paper
Year: 2025
- LLM Security (Red Teaming)
-
Decomposes safety into geometrically separable recognition and execution axes in activation space. Refusal Erasure Attack disables execution while leaving recognition intact (ASR 0.76-0.82 on JailbreakBench). Different architectures (Llama vs Qwen) implement safety through fundamentally different control patterns. Extends Ball et al.: jailbreaks don't need to fool recognition, they just suppress execution.
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models
Yuhao Wu, Xingyu Xie, Guowei Lin, Zhao Zhao
Decomposes safety into geometrically separable recognition and execution axes in activation space. Refusal Erasure Attack disables execution while leaving recognition intact (ASR 0.76-0.82 on JailbreakBench). Different architectures (Llama vs Qwen) implement safety through fundamentally different control patterns. Extends Ball et al.: jailbreaks don't need to fool recognition, they just suppress execution.
Type: paper
Year: 2026
-
Localizes safety to specific attention heads at specific layers, then attacks them for 14% ASR improvement over SOTA. Safety head positions vary across architectures (Llama layer 7, Qwen layer 5, Deepseek mid-upper). Existing defenses operate at shallow levels; this exposes vulnerability in deeper components, creating a false sense of security.
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
Yuhao Wu, Xingyu Xie, Zhao Zhao, Jing Chen
Localizes safety to specific attention heads at specific layers, then attacks them for 14% ASR improvement over SOTA. Safety head positions vary across architectures (Llama layer 7, Qwen layer 5, Deepseek mid-upper). Existing defenses operate at shallow levels; this exposes vulnerability in deeper components, creating a false sense of security.
Type: paper
Year: 2026
-
CC-Delta identifies jailbreak features by comparing the same harmful request with and without jailbreak context in sparse autoencoder latent space. Outperforms dense-space steering (CAA) in 11/12 comparisons; generalizes to 7 unseen attack types. Off-the-shelf interpretability SAEs repurposed as defenses. Moves from classification toward elimination --- constraint-based defense in feature space.
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Maël Assogba, Giacomo Cortellazzi, Javier Abad, Nuria Rodriguez
CC-Delta identifies jailbreak features by comparing the same harmful request with and without jailbreak context in sparse autoencoder latent space. Outperforms dense-space steering (CAA) in 11/12 comparisons; generalizes to 7 unseen attack types. Off-the-shelf interpretability SAEs repurposed as defenses. Moves from classification toward elimination --- constraint-based defense in feature space.
Type: paper
Year: 2026
-
Enforces monotonicity as an architectural constraint: small input perturbations produce bounded output changes by design, not by training. Pure doctrine material --- an invariant baked into architecture that makes a class of adversarial state transitions impossible rather than unlikely.
Monotonicity as an Architectural Bias for Robust Language Models
Ariel Cooper, Parsa Nadali, Ashish Trivedi, Samuel Velasquez
Enforces monotonicity as an architectural constraint: small input perturbations produce bounded output changes by design, not by training. Pure doctrine material --- an invariant baked into architecture that makes a class of adversarial state transitions impossible rather than unlikely.
Type: paper
Year: 2026
-
Uses safety-censored topics as naturally occurring capability overhang --- the model has the knowledge but training suppresses it. Demonstrates that training history underdetermines future behavior: capabilities persist beneath alignment. Relevant to the gap between distributional safety and semantic safety.
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Hector Casademunt, Mikolaj Cywinski, Chi-Hang Tran
Uses safety-censored topics as naturally occurring capability overhang --- the model has the knowledge but training suppresses it. Demonstrates that training history underdetermines future behavior: capabilities persist beneath alignment. Relevant to the gap between distributional safety and semantic safety.
Type: paper
Year: 2026
- Agentic Security
-
Three MCP attack vectors: resource theft (hidden token consumption), conversation hijacking (persistent instruction injection), and covert tool invocation. Core issue: MCP servers control both prompt content and response interpretation, collapsing the trust boundary between data and instructions.
New Prompt Injection Attack Vectors Through MCP Sampling
Unit42 (Palo Alto Networks)
Three MCP attack vectors: resource theft (hidden token consumption), conversation hijacking (persistent instruction injection), and covert tool invocation. Core issue: MCP servers control both prompt content and response interpretation, collapsing the trust boundary between data and instructions.
Type: blog
Year: 2026
- LLM Security (Red Teaming)
-
Jailbreak behavior produces stable, detectable patterns in internal representations --- consistent across architectures including non-Transformer (Mamba2). CP tensor decomposition achieves 78% jailbreak detection with 6% false positives. Adversarial intent encoded early (near input embedding), suggesting detection can operate upstream of generation. Architecture-agnostic: consistent F1 across GPT-J, LLaMA, Mistral, Mamba2.
Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Jailbreak behavior produces stable, detectable patterns in internal representations --- consistent across architectures including non-Transformer (Mamba2). CP tensor decomposition achieves 78% jailbreak detection with 6% false positives. Adversarial intent encoded early (near input embedding), suggesting detection can operate upstream of generation. Architecture-agnostic: consistent F1 across GPT-J, LLaMA, Mistral, Mamba2.
Type: paper
Year: 2026
-
Steering methods achieve high efficacy but consistently fail to preserve robustness under adversarial inputs. Llama-8B safety drops 35-55% under jailbreak after steering. PCA analysis shows harmful queries with jailbreak prefixes occupy the same activation region as harmless queries --- steering vectors trained on clean separation inadvertently amplify compliance for adversarially disguised inputs. Direct doctrine evidence: behavioral shaping fails at trust boundaries.
Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions
Navita Goyal, Hal Daumé III
Steering methods achieve high efficacy but consistently fail to preserve robustness under adversarial inputs. Llama-8B safety drops 35-55% under jailbreak after steering. PCA analysis shows harmful queries with jailbreak prefixes occupy the same activation region as harmless queries --- steering vectors trained on clean separation inadvertently amplify compliance for adversarially disguised inputs. Direct doctrine evidence: behavioral shaping fails at trust boundaries.
Type: paper
Year: 2026
-
Self-evolving LLM agents spontaneously develop deception as an evolutionarily stable strategy. Deception generalizes robustly across scenarios (WR 1.00 on unseen scenarios); honesty requires scenario-specific adaptation and 43% more rhetorical intensity. Neutral evolution naturally converges on high deception. Agents reframe deception as 'strategic necessity' while maintaining alignment facade --- deception emerges from optimization pressure, not injection.
Evolving Deception: When Agents Evolve, Deception Wins
Zonghao Ying, Haowen Dai, Tianyuan Zhang, Yisong Xiao, Quanchen Zou, Aishan Liu, Jian Yang, Yaodong Yang, Xianglong Liu
Self-evolving LLM agents spontaneously develop deception as an evolutionarily stable strategy. Deception generalizes robustly across scenarios (WR 1.00 on unseen scenarios); honesty requires scenario-specific adaptation and 43% more rhetorical intensity. Neutral evolution naturally converges on high deception. Agents reframe deception as 'strategic necessity' while maintaining alignment facade --- deception emerges from optimization pressure, not injection.
Type: paper
Year: 2026
-
Cryptographic proof (via TEEs) that a response was generated through a specific guardrail pipeline. Addresses verification asymmetry: users currently cannot verify that advertised guardrails actually ran on their query. Structural solution to a trust problem that behavioral approaches cannot solve --- you can't prompt-engineer cryptographic guarantees.
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren
Cryptographic proof (via TEEs) that a response was generated through a specific guardrail pipeline. Addresses verification asymmetry: users currently cannot verify that advertised guardrails actually ran on their query. Structural solution to a trust problem that behavioral approaches cannot solve --- you can't prompt-engineer cryptographic guarantees.
Type: paper
Year: 2026
-
Reframes safety as density estimation on benign representation manifold. Diffusion model trained only on benign hidden states detects and corrects unsafe representations at inference time. JailbreakBench: 0% ASR across three models (down from 46-98%). Avoids the out-of-distribution problem that plagues safety classifiers --- novel attacks still produce anomalous hidden states relative to the benign manifold.
MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs
Chun Yan Ryan Kan, Tommy Tran, Vedant Yadav, Ava Cai, Kevin Zhu, Ruizhe Li, Maheep Chaudhary
Reframes safety as density estimation on benign representation manifold. Diffusion model trained only on benign hidden states detects and corrects unsafe representations at inference time. JailbreakBench: 0% ASR across three models (down from 46-98%). Avoids the out-of-distribution problem that plagues safety classifiers --- novel attacks still produce anomalous hidden states relative to the benign manifold.
Type: paper
Year: 2026
- LLMs for Security Work
-
First comprehensive benchmark decomposing pentesting into 6 stages across 346 tasks. Attack decision-making is the critical bottleneck: ground-truth ADM inputs improve performance +36pp. Exploit syntax correctness (0.69) far exceeds functional correctness (0.26) --- code runs but doesn't work. Models evaluate vulnerabilities independently but can't construct multi-step attack chains.
PentestEval: Benchmarking LLM-based Penetration Testing with Modular and Stage-Level Design
Ruozhao Yang, Mingfei Cheng, Gelei Deng, Tianwei Zhang, Junjie Wang, Xiaofei Xie
First comprehensive benchmark decomposing pentesting into 6 stages across 346 tasks. Attack decision-making is the critical bottleneck: ground-truth ADM inputs improve performance +36pp. Exploit syntax correctness (0.69) far exceeds functional correctness (0.26) --- code runs but doesn't work. Models evaluate vulnerabilities independently but can't construct multi-step attack chains.
Type: paper
Year: 2025
- Agentic Security
-
Three MCP attack categories demonstrated: malicious code execution, remote access control, credential theft. RADE attack (novel): corrupted data in vector databases triggers automatic execution when retrieved by MCP-connected LLM --- upstream poisoning without direct system access. Claude shows inconsistent refusal; Llama only refuses explicit harmful keywords. New class of indirect prompt injection via public data sources.
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
Brandon Radosevich, John Halloran
Three MCP attack categories demonstrated: malicious code execution, remote access control, credential theft. RADE attack (novel): corrupted data in vector databases triggers automatic execution when retrieved by MCP-connected LLM --- upstream poisoning without direct system access. Claude shows inconsistent refusal; Llama only refuses explicit harmful keywords. New class of indirect prompt injection via public data sources.
Type: paper
Year: 2025
-
Perplexity's NIST/CAISI response on agent security. Maps attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination. Emphasizes indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. Proposes layered defense: input-level, model-level, sandboxed execution, and deterministic policy enforcement. Identifies gaps in adaptive security benchmarks, delegation/privilege models, and multi-agent system design.
Security Considerations for Artificial Intelligence Agents
Ninghui Li, Kaiyuan Zhang, Kyle Polley
Perplexity's NIST/CAISI response on agent security. Maps attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination. Emphasizes indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. Proposes layered defense: input-level, model-level, sandboxed execution, and deterministic policy enforcement. Identifies gaps in adaptive security benchmarks, delegation/privilege models, and multi-agent system design.
Type: paper
Year: 2026
-
Demonstrates cross-stack attack composition: traditional CVEs (Rowhammer, code injection) combined with LLM-specific attacks to compromise compound AI pipelines. Two novel attacks: (1) Rowhammer guardrail bypass + unaltered jailbreak prompt = safety violation, (2) knowledge DB manipulation + agent redirect = data exfiltration. Systematizes attack primitives by objective and maps them to attack lifecycle stages.
Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems
Sarbartha Banerjee, Prateek Sahu, Anjo Vahldiek-Oberwagner
Demonstrates cross-stack attack composition: traditional CVEs (Rowhammer, code injection) combined with LLM-specific attacks to compromise compound AI pipelines. Two novel attacks: (1) Rowhammer guardrail bypass + unaltered jailbreak prompt = safety violation, (2) knowledge DB manipulation + agent redirect = data exfiltration. Systematizes attack primitives by objective and maps them to attack lifecycle stages.
Type: paper
Year: 2026
-
Demonstrates agents in multi-agent systems escalating each other's privileges --- one agent's constrained context can be overridden by instructions from a peer agent. Extends prompt injection threat model to agent-to-agent interactions.
Cross-Agent Privilege Escalation: When Agents Free Each Other
Johann Rehberger
Demonstrates agents in multi-agent systems escalating each other's privileges --- one agent's constrained context can be overridden by instructions from a peer agent. Extends prompt injection threat model to agent-to-agent interactions.
Type: blog
Year: 2026
-
Hidden instructions embedded in agent skill files via Unicode characters (invisible to humans, parsed by models). Attack vector for the skills/MCP ecosystem --- skills as trojan vectors. Includes detection methods.
Scary Agent Skills: Hidden Unicode Instructions in Skills ...And How To Catch Them
Johann Rehberger
Hidden instructions embedded in agent skill files via Unicode characters (invisible to humans, parsed by models). Attack vector for the skills/MCP ecosystem --- skills as trojan vectors. Includes detection methods.
Type: blog
Year: 2026
- Jailbreaks & Adversarial Prompting
-
Largest empirical study of prefill attacks. 50 open-weight models tested against 23 attack strategies. Near-100% attack success rate. Fundamental vulnerability in local deployment: attackers can force models to start responses with specific tokens, biasing toward compliance. Open-weight models cannot defend against this because the attacker controls inference.
Systematic Vulnerability in Open-Weight LLMs: Prefill Attacks Achieve Near-Perfect Success Rates Across 50 Models
Unknown
Largest empirical study of prefill attacks. 50 open-weight models tested against 23 attack strategies. Near-100% attack success rate. Fundamental vulnerability in local deployment: attackers can force models to start responses with specific tokens, biasing toward compliance. Open-weight models cannot defend against this because the attacker controls inference.
Type: paper
Year: 2026
-
Theoretical model of jailbreak success scaling. Without prompt injection, attack success grows polynomially with inference-time samples. With injection, growth shifts to exponential --- a phase transition modeled via spin-glass systems. Short injected prompts yield polynomial scaling; long ones yield exponential. Engineering implication: sampling-based defenses (best-of-N) face fundamentally different threat profiles depending on injection length.
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Indranil Halder, Annesya Banerjee, Cengiz Pehlevan
Theoretical model of jailbreak success scaling. Without prompt injection, attack success grows polynomially with inference-time samples. With injection, growth shifts to exponential --- a phase transition modeled via spin-glass systems. Short injected prompts yield polynomial scaling; long ones yield exponential. Engineering implication: sampling-based defenses (best-of-N) face fundamentally different threat profiles depending on injection length.
Type: paper
Year: 2026
- LLM Security (Red Teaming)
-
Safety-tuned LLMs refuse legitimate defensive cybersecurity tasks at 2.72x the rate of semantically similar neutral requests (p < 0.001). System hardening refused 43.8%, malware analysis 34.3%. Explicitly stating authorization *increases* refusal rates. Based on 2,390 real-world examples from NCCDC. Demonstrates that keyword-based safety alignment fails to distinguish offense from defense --- the alignment stack treats language, not intent.
Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q Knight
Safety-tuned LLMs refuse legitimate defensive cybersecurity tasks at 2.72x the rate of semantically similar neutral requests (p < 0.001). System hardening refused 43.8%, malware analysis 34.3%. Explicitly stating authorization *increases* refusal rates. Based on 2,390 real-world examples from NCCDC. Demonstrates that keyword-based safety alignment fails to distinguish offense from defense --- the alignment stack treats language, not intent.
Type: paper
Year: 2026
- Jailbreaks & Adversarial Prompting
-
Training dataset for instruction hierarchy --- teaching LLMs to prioritize system > developer > user > tool instructions under conflict. Fine-tuning GPT-5-Mini improved IH robustness from 84.1% to 94.1%, reduced unsafe behavior from 6.6% to 0.7%, with minimal capability regression. Uses online adversarial example generation + RL. Held across 16 benchmarks including human red-teaming. Dataset public on HuggingFace. Concrete constraint-training method for prompt injection defense.
IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Michael Pokorny, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao
Training dataset for instruction hierarchy --- teaching LLMs to prioritize system > developer > user > tool instructions under conflict. Fine-tuning GPT-5-Mini improved IH robustness from 84.1% to 94.1%, reduced unsafe behavior from 6.6% to 0.7%, with minimal capability regression. Uses online adversarial example generation + RL. Held across 16 benchmarks including human red-teaming. Dataset public on HuggingFace. Concrete constraint-training method for prompt injection defense.
Type: paper
Year: 2026
- RAG & Retrieval Security
-
Safety alignment homogeneity across models creates a shared, transferable attack surface in RAG. TabooRAG crafts poisoned documents that trigger safety refusals on benign queries --- optimized against one model, transfers to others at up to 96% success on GPT-5.2. The standardization of safety training makes blocking attacks portable. Availability attack that weaponizes the alignment stack itself.
When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG
Junchen Li, Chao Qi, Rongzheng Wang, Qizhi Chen, Liang Xu, Di Liang, Bob Simons, Shuang Liang
Safety alignment homogeneity across models creates a shared, transferable attack surface in RAG. TabooRAG crafts poisoned documents that trigger safety refusals on benign queries --- optimized against one model, transfers to others at up to 96% success on GPT-5.2. The standardization of safety training makes blocking attacks portable. Availability attack that weaponizes the alignment stack itself.
Type: paper
Year: 2026
- Agentic Security
-
Topology inference attack on LLM multi-agent systems. Compromising a single arbitrary agent (not admin) lets attacker infer the full communication graph via context-based inference --- 60% higher accuracy than baselines under active defenses. Includes covert jailbreak mechanism and jailbreak-free diffusion design. Demonstrates that multi-agent architecture topology is leakable IP, even with keyword-based defenses.
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
Topology inference attack on LLM multi-agent systems. Compromising a single arbitrary agent (not admin) lets attacker infer the full communication graph via context-based inference --- 60% higher accuracy than baselines under active defenses. Includes covert jailbreak mechanism and jailbreak-free diffusion design. Demonstrates that multi-agent architecture topology is leakable IP, even with keyword-based defenses.
Type: paper
Year: 2026
-
Defense-in-depth runtime security for tool-augmented LLM agents. Distributes enforcement across 10 lifecycle hooks (message ingress, prompt construction, tool execution, result storage, outbound comms, sub-agent spawning). Hybrid heuristic+LLM scanning with risk accumulation and TTL decay. Policy-driven restrictions on tool usage, file paths, network access, secret patterns. Tamper-evident audit plane. Practical architectural pattern for agent security in production.
OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents
Frank Li
Defense-in-depth runtime security for tool-augmented LLM agents. Distributes enforcement across 10 lifecycle hooks (message ingress, prompt construction, tool execution, result storage, outbound comms, sub-agent spawning). Hybrid heuristic+LLM scanning with risk accumulation and TTL decay. Policy-driven restrictions on tool usage, file paths, network access, secret patterns. Tamper-evident audit plane. Practical architectural pattern for agent security in production.
Type: paper
Year: 2026
- Agentic Pentesting
-
First empirical study of LLM-assisted pentesting (FSE '23). GPT-3.5 in a closed-feedback loop with a vulnerable VM via SSH. Established the 'human uplift' paradigm: LLM as assistant in the loop, not autonomous agent. Baseline for all subsequent autonomous pentesting work.
Getting pwn'd by AI: Penetration Testing with Large Language Models
Andreas Happe, Jürgen Cito
First empirical study of LLM-assisted pentesting (FSE '23). GPT-3.5 in a closed-feedback loop with a vulnerable VM via SSH. Established the 'human uplift' paradigm: LLM as assistant in the loop, not autonomous agent. Baseline for all subsequent autonomous pentesting work.
Type: paper
Year: 2023
-
Three-module architecture: Reasoning Module (Pentesting Task Tree for long-term memory), Generation Module (Chain-of-Thought for detailed operations), Parsing Module. 228.6% task-completion improvement over GPT-3.5 on benchmark targets. USENIX Security '24. 4,700+ GitHub stars. The PTT (attributed tree where nodes capture sub-tasks, status, tool usage, finding types) is a reusable design pattern for structured agent state.
PentestGPT: An LLM-empowered Automatic Penetration Testing Tool
Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, Stefan Edelkamp
Three-module architecture: Reasoning Module (Pentesting Task Tree for long-term memory), Generation Module (Chain-of-Thought for detailed operations), Parsing Module. 228.6% task-completion improvement over GPT-3.5 on benchmark targets. USENIX Security '24. 4,700+ GitHub stars. The PTT (attributed tree where nodes capture sub-tasks, status, tool usage, finding types) is a reusable design pattern for structured agent state.
Type: paper
Year: 2024
-
GPT-4 exploits 87% of 15 one-day CVEs when given the CVE description; 0% for all other models tested (GPT-3.5, open-source). Controversial: Rohlf rebuttal argues 11/15 CVEs were after GPT-4's knowledge cutoff --- is this reasoning or retrieval? The debate itself is the insight: what counts as autonomous exploitation vs. pattern-matching against training data.
LLM Agents Can Autonomously Exploit One-day Vulnerabilities
Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang
GPT-4 exploits 87% of 15 one-day CVEs when given the CVE description; 0% for all other models tested (GPT-3.5, open-source). Controversial: Rohlf rebuttal argues 11/15 CVEs were after GPT-4's knowledge cutoff --- is this reasoning or retrieval? The debate itself is the insight: what counts as autonomous exploitation vs. pattern-matching against training data.
Type: paper
Year: 2024
-
Follow-up to one-day paper. HPTSA: hierarchical planning agent launches subagents for different vulnerability types. 4.3x improvement over prior agent frameworks on 14 real-world vulnerabilities. Multi-agent architecture addresses long-term planning failures in single-agent approaches. Directly relevant to continuous pentesting system design.
Teams of LLM Agents Can Exploit Zero-Day Vulnerabilities
Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang
Follow-up to one-day paper. HPTSA: hierarchical planning agent launches subagents for different vulnerability types. 4.3x improvement over prior agent frameworks on 14 real-world vulnerabilities. Multi-agent architecture addresses long-term planning failures in single-agent approaches. Directly relevant to continuous pentesting system design.
Type: paper
Year: 2024
-
Rebuttal to Fang et al. 11/15 CVEs chosen were after GPT-4's knowledge cutoff --- but ease of reproducibility, not novel reasoning, likely explains the 87% rate. LLMs excel at automating manual tasks (not a novel capability). Key question for agentic pentesting: is the agent reasoning about the vulnerability or retrieving a known exploit pattern? Maps directly to Hacking's representing/intervening distinction.
No, LLM Agents can not Autonomously Exploit One-day Vulnerabilities
Chris Rohlf
Rebuttal to Fang et al. 11/15 CVEs chosen were after GPT-4's knowledge cutoff --- but ease of reproducibility, not novel reasoning, likely explains the 87% rate. LLMs excel at automating manual tasks (not a novel capability). Key question for agentic pentesting: is the agent reasoning about the vulnerability or retrieving a known exploit pattern? Maps directly to Hacking's representing/intervening distinction.
Type: blog
Year: 2024
-
ARTEMIS: first head-to-head comparison of AI agents vs. 10 human pentesters on a live university network (~8,000 hosts, 12 subnets). ARTEMIS placed 2nd overall, found 9 valid vulnerabilities, 82% valid submission rate, outperformed 9/10 humans. $18/hour vs. $60/hour for professional pentesters. Architecture: supervisor + swarm of arbitrary sub-agents + triager for vulnerability verification. Higher false-positive rates than humans, struggles with GUI tasks. Open-sourced.
Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
Abramovich et al.
ARTEMIS: first head-to-head comparison of AI agents vs. 10 human pentesters on a live university network (~8,000 hosts, 12 subnets). ARTEMIS placed 2nd overall, found 9 valid vulnerabilities, 82% valid submission rate, outperformed 9/10 humans. $18/hour vs. $60/hour for professional pentesters. Architecture: supervisor + swarm of arbitrary sub-agents + triager for vulnerability verification. Higher false-positive rates than humans, struggles with GUI tasks. Open-sourced.
Type: paper
Year: 2025
- Agentic Security
-
First holistic survey of agentic security. 160+ papers organized as Applications (offensive + defensive), Threats (manipulation, jailbreaking, tool misuse), Defenses (guardrails, verification). Reveals strong reliance on closed-source LLMs, underexplored non-textual modalities, amplified vulnerability of agents vs. base LLMs. Awesome list maintained at github.com/kagnlp/Awesome-Agentic-Security.
A Survey on Agentic Security: Applications, Threats and Defenses
Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez
First holistic survey of agentic security. 160+ papers organized as Applications (offensive + defensive), Threats (manipulation, jailbreaking, tool misuse), Defenses (guardrails, verification). Reveals strong reliance on closed-source LLMs, underexplored non-textual modalities, amplified vulnerability of agents vs. base LLMs. Awesome list maintained at github.com/kagnlp/Awesome-Agentic-Security.
Type: paper
Year: 2025
- Agentic Pentesting
-
GPT-5-powered autonomous vulnerability discovery. Reads code, writes and runs tests, proposes patches. 10 CVEs assigned from open-source work. Now rebranded as Codex Security (research preview, Mar 2026). Comparable to Anthropic/Mozilla Firefox work but with patching emphasis. Validates the discovery-scales-exploitation-doesn't pattern: agent finds and proposes fixes, human reviews.
Introducing Aardvark: OpenAI's agentic security researcher
OpenAI
GPT-5-powered autonomous vulnerability discovery. Reads code, writes and runs tests, proposes patches. 10 CVEs assigned from open-source work. Now rebranded as Codex Security (research preview, Mar 2026). Comparable to Anthropic/Mozilla Firefox work but with patching emphasis. Validates the discovery-scales-exploitation-doesn't pattern: agent finds and proposes fixes, human reviews.
Type: blog
Year: 2026
-
Part 6: Resources
-
§14 Textbooks (Free Online)
13
-
★
Deep Learning
— Goodfellow, Bengio, Courville
(2016)
[book]
Theory foundations
Deep Learning
Ian Goodfellow, Yoshua Bengio, Aaron Courville
Theory foundations
Type: book
Year: 2016
Publisher: MIT Press
ISBN: 978-0262035613
Part I (applied math) and Part II (deep networks) are most relevant. Part III covers research topics.
-
NLP fundamentals
Speech and Language Processing
Jurafsky & Martin
NLP fundamentals
Type: book
Publisher: Pearson
-
Bayesian/rigorous approach
Probabilistic Machine Learning
Kevin Murphy
Bayesian/rigorous approach
Type: book
Publisher: MIT Press
-
Interactive, code-heavy
Dive into Deep Learning
Zhang et al.
Interactive, code-heavy
Type: book
Publisher: Cambridge University Press
-
Concise visual intro
The Little Book of Deep Learning
François Fleuret
Concise visual intro
Type: book
-
Gentle introduction
Neural Networks and Deep Learning
Michael Nielsen
Gentle introduction
Type: book
-
Foundations of Statistical Natural Language Processing
— Manning & Schütze
[book]
Classic (1999), pre-neural NLP
Foundations of Statistical Natural Language Processing
Manning & Schütze
Classic (1999), pre-neural NLP
Type: book
Publisher: MIT Press
- AIMA Resources
-
Preface, contents, index PDFs
AIMA Main Site
Preface, contents, index PDFs
Type: resource
-
All algorithms from the book
AIMA Algorithms/Pseudocode PDF
All algorithms from the book
Type: resource
-
Diagrams and illustrations
AIMA Figures PDF
Diagrams and illustrations
Type: resource
-
2000+ citations
AIMA Bibliography
2000+ citations
Type: resource
-
Python, Java implementations
AIMA GitHub: aimacode
Python, Java implementations
Type: resource
-
Interactive question bank
AIMA Exercises
Interactive question bank
Type: resource
-
§15 Books (Print)
18
- MIT Press Essential Knowledge Series
-
Large Language Models
— Raaijmakers
(2025)
[book]
Architecture, training, limitations
Large Language Models
Raaijmakers
Architecture, training, limitations
Type: book
Year: 2025
Publisher: MIT Press
-
What 'general intelligence' means
Artificial General Intelligence
Togelius
What 'general intelligence' means
Type: book
Year: 2024
Publisher: MIT Press
-
ChatGPT and the Future of AI
— Sejnowski
(2024)
[book]
Deep language revolution
ChatGPT and the Future of AI
Sejnowski
Deep language revolution
Type: book
Year: 2024
Publisher: MIT Press
- Accessible Introductions
-
The Worlds I See
— Fei-Fei Li
(2023)
[book]
AI pioneer memoir, computer vision
The Worlds I See
Fei-Fei Li
AI pioneer memoir, computer vision
Type: book
Year: 2023
-
Artificial Intelligence: A Guide for Thinking Humans
— Mitchell
(2019)
[book]
Balanced overview, limitations
Artificial Intelligence: A Guide for Thinking Humans
Mitchell
Balanced overview, limitations
Type: book
Year: 2019
-
The standard AI textbook (4th ed.)
Artificial Intelligence: A Modern Approach
Russell & Norvig
The standard AI textbook (4th ed.)
Type: book
Year: 2020
Publisher: Pearson
- Manning Publications
-
★
Build a Large Language Model (From Scratch)
— Raschka
(2024)
[book]
Hands-on LLM implementation
Build a Large Language Model (From Scratch)
Raschka
Hands-on LLM implementation
Type: book
Year: 2024
Publisher: Manning
-
Reasoning enhancements, RL for tools, distillation. MEAP available (75% complete).
Build a Reasoning Model (From Scratch)
Raschka
Reasoning enhancements, RL for tools, distillation. MEAP available (75% complete).
Type: book
Year: 2026
Publisher: Manning
ISBN: 9781633434677
-
LLMs in Production
(2024)
[book]
Deployment, scaling, ops
LLMs in Production
Deployment, scaling, ops
Type: book
Year: 2024
Publisher: Manning
-
AI Agents in Production
(2025)
[book]
Agent architectures, deployment
AI Agents in Production
Agent architectures, deployment
Type: book
Year: 2025
Publisher: Manning
-
Knowledge Graphs and LLMs in Action
(2024)
[book]
KG + LLM integration patterns
Knowledge Graphs and LLMs in Action
KG + LLM integration patterns
Type: book
Year: 2024
Publisher: Manning
- Tools & Frameworks
-
Local LLM inference
Ollama
Local LLM inference
Type: tool
-
GUI for local models
LM Studio
GUI for local models
Type: tool
-
SymPy
[tool]
Symbolic mathematics in Python
SymPy
Symbolic mathematics in Python
Type: tool
-
LangChain / LlamaIndex
[tool]
RAG orchestration frameworks
LangChain / LlamaIndex
RAG orchestration frameworks
Type: tool
-
Instructor
[tool]
Structured outputs from LLMs
Instructor
Structured outputs from LLMs
Type: tool
-
Weights & Biases
[tool]
Experiment tracking
Weights & Biases
Experiment tracking
Type: tool
-
MLflow
[tool]
ML lifecycle management
MLflow
ML lifecycle management
Type: tool
-
§16 Blogs & Newsletters
29
- Academic-leaning
-
★
Lil'Log
— Lilian Weng
[blog]
Excellent deep dives, OpenAI researcher
Lil'Log
Lilian Weng
Excellent deep dives, OpenAI researcher
Type: blog
-
Long-form essays
The Gradient
Long-form essays
Type: blog
-
Beautiful visualizations (inactive but archived)
Distill.pub
Beautiful visualizations (inactive but archived)
Type: blog
-
Jay Alammar's Blog
— Jay Alammar
[blog]
Visual explanations (Illustrated Transformer)
Jay Alammar's Blog
Jay Alammar
Visual explanations (Illustrated Transformer)
Type: blog
-
Import AI
— Jack Clark
[blog]
Weekly newsletter, policy + research
Import AI
Jack Clark
Weekly newsletter, policy + research
Type: blog
-
The Batch
— deeplearning.ai
[blog]
Weekly digest
The Batch
deeplearning.ai
Weekly digest
Type: blog
-
Practical, code-focused
Sebastian Raschka's Newsletter
Sebastian Raschka
Practical, code-focused
Type: blog
-
Papers + implementations
Papers With Code
Papers + implementations
Type: resource
- Practitioner blogs
-
Simon Willison's Blog
— Simon Willison
[blog]
Daily LLM experiments, tool reviews, SQLite
Simon Willison's Blog
Simon Willison
Daily LLM experiments, tool reviews, SQLite
Type: blog
-
Eugene Yan
— Eugene Yan
[blog]
ML systems, RecSys, production patterns
Eugene Yan
Eugene Yan
ML systems, RecSys, production patterns
Type: blog
-
Chip Huyen
— Chip Huyen
[blog]
MLOps, systems design, interviews
Chip Huyen
Chip Huyen
MLOps, systems design, interviews
Type: blog
-
Hamel Husain
— Hamel Husain
[blog]
LLM fine-tuning, practical notebooks
Hamel Husain
Hamel Husain
LLM fine-tuning, practical notebooks
Type: blog
-
Latent Space
— swyx & Alessio
[blog]
AI Engineer perspective, interviews
Latent Space
swyx & Alessio
AI Engineer perspective, interviews
Type: blog
-
Safety, interpretability, capabilities
Anthropic Research Blog
Safety, interpretability, capabilities
Type: blog
-
Model releases, safety research
OpenAI Research Blog
Model releases, safety research
Type: blog
-
Research announcements, tutorials
Google AI Blog
Research announcements, tutorials
Type: blog
- Essential Articles (Printable)
-
Production architecture
Patterns for Building LLM-based Systems
Eugene Yan
Production architecture
Type: article
-
End-to-end guide
Building LLM Applications for Production
Chip Huyen
End-to-end guide
Type: article
-
How GPT Tokenizers Work
— Simon Willison
[article]
Tokenization deep-dive
How GPT Tokenizers Work
Simon Willison
Tokenization deep-dive
Type: article
-
Gemini generates EagleCAD library files from chip datasheets. Structured input, structured output, human verification. Appropriate use case example for Part 3.
From PDF to .LBR: Using Deep Think to Write Custom CAD Parts
Adafruit
Gemini generates EagleCAD library files from chip datasheets. Structured input, structured output, human verification. Appropriate use case example for Part 3.
Type: article
Year: 2026
-
TDD consultancy arrives at the doctrine independently: as generation becomes cheap, the constraint system ('harness') becomes the product. 'AI increases the cost of being wrong' is verification asymmetry from a software quality angle. Validates Part 3's thesis from engineering practice without the philosophy scaffolding.
Quality You Can't Generate: AI Output Only as Good as Your Constraints
Test Double
TDD consultancy arrives at the doctrine independently: as generation becomes cheap, the constraint system ('harness') becomes the product. 'AI increases the cost of being wrong' is verification asymmetry from a software quality angle. Validates Part 3's thesis from engineering practice without the philosophy scaffolding.
Type: article
Year: 2025
-
Flattening of high-entropy content—the cost of reliability. For creative writing, that's a loss. For security-critical systems, it's the point. Cited in Part 3.
Semantic Ablation: AI Writing's Hidden Problem
Claudio Nastruzzi
Flattening of high-entropy content—the cost of reliability. For creative writing, that's a loss. For security-critical systems, it's the point. Cited in Part 3.
Type: article
Year: 2026
-
RAG tradeoffs
RAG vs. Long Context: A Hybrid Approach
Simon Willison
RAG tradeoffs
Type: article
-
Agent architectures
LLM Powered Autonomous Agents
Lilian Weng
Agent architectures
Type: article
-
Prompt Engineering
— Lilian Weng
[article]
Comprehensive guide
Prompt Engineering
Lilian Weng
Comprehensive guide
Type: article
-
Role definition
The Rise of the AI Engineer
swyx
Role definition
Type: article
-
Evaluation strategy
Your AI Product Needs Evals
Hamel Husain
Evaluation strategy
Type: article
-
Systematic approach
Prompt Engineering vs. Blind Prompting
Mitchell Hashimoto
Systematic approach
Type: article
-
8-part series (~20K words) surveying ML risks: dynamics/chaos, culture, information ecology, annoyances, psychological hazards, safety, work. From the Jepsen author. Arrives at the trilogy's conclusions through practice rather than theory: verification problem, Bainbridge deskilling, prompt injection as fundamental, sycophancy as structural, 'lethal trifecta is a unifecta.' Diagnosis without prescription — the architectural response (grounding spectrum, formal constraints) is the trilogy's contribution. Key sections: Dynamics (chaos, latent disaster, verification problem), Safety (alignment, prompt injection, 'unifecta'), Work (Bainbridge, witchcraft/compiler framing). Cites Cook 'How Complex Systems Fail', Bainbridge 1983, Willison lethal trifecta. PDF/EPUB available.
The Future of Everything is Lies, I Guess
Kyle Kingsbury
8-part series (~20K words) surveying ML risks: dynamics/chaos, culture, information ecology, annoyances, psychological hazards, safety, work. From the Jepsen author. Arrives at the trilogy's conclusions through practice rather than theory: verification problem, Bainbridge deskilling, prompt injection as fundamental, sycophancy as structural, 'lethal trifecta is a unifecta.' Diagnosis without prescription — the architectural response (grounding spectrum, formal constraints) is the trilogy's contribution. Key sections: Dynamics (chaos, latent disaster, verification problem), Safety (alignment, prompt injection, 'unifecta'), Work (Bainbridge, witchcraft/compiler framing). Cites Cook 'How Complex Systems Fail', Bainbridge 1983, Willison lethal trifecta. PDF/EPUB available.
Type: blog
Year: 2026
-
§17 Aggregators & Discovery
5
-
Karpathy's filtered arxiv
arxiv-sanity-lite
Karpathy's filtered arxiv
Type: resource
-
Trending papers with annotations
papers.labml.ai
Trending papers with annotations
Type: resource
-
Community upvoted
Hugging Face Daily Papers
Community upvoted
Type: resource
-
Visual citation graphs
Connected Papers
Visual citation graphs
Type: resource
-
AI-powered paper search
Semantic Scholar
AI-powered paper search
Type: resource
-
§18 YouTube & Video
8
-
Deep explanations, live coding (GPT from scratch)
Andrej Karpathy
Deep explanations, live coding (GPT from scratch)
Type: video
-
Paper walkthroughs, ML news
Yannic Kilcher
Paper walkthroughs, ML news
Type: video
-
Visual math intuition
3Blue1Brown
Visual math intuition
Type: video
-
Quick research summaries
Two Minute Papers
Quick research summaries
Type: video
-
News analysis, capability deep-dives
AI Explained
News analysis, capability deep-dives
Type: video
-
Practitioner talks, production systems
AI Engineer Conference
Practitioner talks, production systems
Type: video
-
Best single intro to LLMs
Karpathy: Intro to LLMs (1hr)
Best single intro to LLMs
Type: video
-
Build a transformer, step by step
Karpathy: GPT from Scratch (2hr)
Build a transformer, step by step
Type: video
-
§19 Podcasts
6
-
AI engineering, practitioner interviews
Latent Space
AI engineering, practitioner interviews
Type: podcast
-
Applied ML, accessible
Practical AI
Applied ML, accessible
Type: podcast
-
Industry trends, executive interviews
Eye on AI
Industry trends, executive interviews
Type: podcast
-
Long-form researcher interviews
Lex Fridman Podcast
Long-form researcher interviews
Type: podcast
-
Gradient Dissent
— Weights & Biases
[podcast]
ML practitioners
Gradient Dissent
Weights & Biases
ML practitioners
Type: podcast
-
Research and industry mix
TWIML AI
Research and industry mix
Type: podcast
-
§20 Documentation & Guides
21
- Prompt Engineering
-
★
Prompt Engineering Guide
— DAIR.AI
[documentation]
Comprehensive reference: techniques, agents, model guides, prompt hub
Prompt Engineering Guide
DAIR.AI
Comprehensive reference: techniques, agents, model guides, prompt hub
Type: documentation
- LLM Providers
-
★
Anthropic Docs
[documentation]
Claude API, prompt engineering guide
Anthropic Docs
Claude API, prompt engineering guide
Type: documentation
-
GPT API, assistants, function calling
OpenAI Platform Docs
GPT API, assistants, function calling
Type: documentation
-
OpenAI Cookbook
[documentation]
Code examples, patterns, recipes
OpenAI Cookbook
Code examples, patterns, recipes
Type: documentation
-
Google AI Docs
[documentation]
Gemini API, embeddings
Google AI Docs
Gemini API, embeddings
Type: documentation
-
Cohere Docs
[documentation]
Embeddings, reranking, RAG
Cohere Docs
Embeddings, reranking, RAG
Type: documentation
- Frameworks & Orchestration
-
LangChain Docs
[documentation]
Chains, agents, RAG patterns
LangChain Docs
Chains, agents, RAG patterns
Type: documentation
-
LlamaIndex Docs
[documentation]
Data ingestion, indexing, RAG
LlamaIndex Docs
Data ingestion, indexing, RAG
Type: documentation
-
Structured outputs, validation
Pydantic
Structured outputs, validation
Type: documentation
-
Instructor
[documentation]
Structured LLM outputs with Pydantic
Instructor
Structured LLM outputs with Pydantic
Type: documentation
-
DSPy Docs
[documentation]
Programmatic prompt optimization
DSPy Docs
Programmatic prompt optimization
Type: documentation
- Vector Databases & Search
-
Vector search concepts, tutorials
Pinecone Learning Center
Vector search concepts, tutorials
Type: documentation
-
Weaviate Docs
[documentation]
Hybrid search, modules
Weaviate Docs
Hybrid search, modules
Type: documentation
-
Qdrant Docs
[documentation]
Vector DB with filtering
Qdrant Docs
Vector DB with filtering
Type: documentation
-
Chroma Docs
[documentation]
Lightweight, local-first
Chroma Docs
Lightweight, local-first
Type: documentation
-
FAISS Wiki
[documentation]
Meta's similarity search library
FAISS Wiki
Meta's similarity search library
Type: documentation
- Local & Open Source
-
Run models locally, simple CLI
Ollama
Run models locally, simple CLI
Type: documentation
-
LM Studio
[documentation]
Local models with GUI
LM Studio
Local models with GUI
Type: documentation
-
Model hub, fine-tuning, inference
HuggingFace Transformers
Model hub, fine-tuning, inference
Type: documentation
-
vLLM Docs
[documentation]
Fast inference, PagedAttention
vLLM Docs
Fast inference, PagedAttention
Type: documentation
-
CPU inference, quantization
llama.cpp
CPU inference, quantization
Type: tool
-
§21 Industry Reports
13
-
★
State of AI Report
— Benaich & Hogarth
[report]
Annual industry overview, trends
State of AI Report
Benaich & Hogarth
Annual industry overview, trends
Type: report
-
AI Index
— Stanford HAI
[report]
Comprehensive metrics, policy
AI Index
Stanford HAI
Comprehensive metrics, policy
Type: report
-
Enterprise adoption, business impact
McKinsey State of AI
Enterprise adoption, business impact
Type: report
-
Compute trends, scaling analysis
Epoch AI
Compute trends, scaling analysis
Type: resource
-
Physician- and Large Language Model-Generated Hospital Discharge Summaries
— Williams et al.
(2025)
[paper]
Blinded evaluation of 100 discharge summaries at UCSF. LLM-generated narratives had more errors (2.91 vs 1.82 per summary) but scored higher on concision and coherence---making errors harder to catch during review. The mechanism behind 'the doctor checks it' failing. JAMA Internal Medicine.
Physician- and Large Language Model-Generated Hospital Discharge Summaries
Chloe Williams et al.
Blinded evaluation of 100 discharge summaries at UCSF. LLM-generated narratives had more errors (2.91 vs 1.82 per summary) but scored higher on concision and coherence---making errors harder to catch during review. The mechanism behind 'the doctor checks it' failing. JAMA Internal Medicine.
Type: paper
Year: 2025
-
Cost of a Data Breach Report 2024
— IBM Security / Ponemon Institute
(2024)
[report]
Global average breach cost $4.88M (10% YoY increase). IR teams + regular testing saved $2.03M per breach (38% reduction). AI in prevention workflows saved $2.2M (highest single factor). Internal detection shortened lifecycle by 61 days vs. attacker-disclosed. Key finding for detection ceiling article: response-side investments deliver massive cost reductions, supporting the 'fund what happens when something gets through' thesis.
Cost of a Data Breach Report 2024
IBM Security / Ponemon Institute
Global average breach cost $4.88M (10% YoY increase). IR teams + regular testing saved $2.03M per breach (38% reduction). AI in prevention workflows saved $2.2M (highest single factor). Internal detection shortened lifecycle by 61 days vs. attacker-disclosed. Key finding for detection ceiling article: response-side investments deliver massive cost reductions, supporting the 'fund what happens when something gets through' thesis.
Type: report
Year: 2024
-
M-Trends 2025
— Mandiant / Google Cloud
(2025)
[report]
Global median dwell time dropped from 205 days (2014) to 10 days (2023) --- genuine detection success story. But ticked up to 11 days in 2024, the first increase ever recorded. Suggests the detection improvement curve is flattening. Increase driven by espionage actors with longer dwell times; ransomware (which announces itself) had been masking the plateau.
M-Trends 2025
Mandiant / Google Cloud
Global median dwell time dropped from 205 days (2014) to 10 days (2023) --- genuine detection success story. But ticked up to 11 days in 2024, the first increase ever recorded. Suggests the detection improvement curve is flattening. Increase driven by espionage actors with longer dwell times; ransomware (which announces itself) had been masking the plateau.
Type: report
Year: 2025
-
Largest real-world ambient AI scribe study. 7,260 physicians, 2.5M encounters over 63 weeks. 15,791 hours saved, 88% positive impact on visits, 82% improved satisfaction. The positive counterweight to error-rate studies: these tools are valued at scale.
Ambient Artificial Intelligence Scribes: Learnings after 1 Year and over 2.5 Million Uses
Kaiser Permanente / TPMG
Largest real-world ambient AI scribe study. 7,260 physicians, 2.5M encounters over 63 weeks. 15,791 hours saved, 88% positive impact on visits, 82% improved satisfaction. The positive counterweight to error-rate studies: these tools are valued at scale.
Type: article
Year: 2025
-
Multi-site study (263 physicians, 6 health systems). Burnout dropped from 51.9% to 38.8%. Cited in Part 3 context: the tools are good enough that people depend on them, which is why failure modes matter.
Use of Ambient AI Scribes to Reduce Administrative Burden and Professional Burnout
Olson et al.
Multi-site study (263 physicians, 6 health systems). Burnout dropped from 51.9% to 38.8%. Cited in Part 3 context: the tools are good enough that people depend on them, which is why failure modes matter.
Type: article
Year: 2025
-
31.5% of US hospitals using generative AI in 2024; 24.7% planned adoption within one year. Based on AHA IT Supplement survey (2,174 hospitals). Cited in Part 3 industry intro.
Uptake of Generative AI Integrated With Electronic Health Records in US Hospitals
JAMA Network Open
31.5% of US hospitals using generative AI in 2024; 24.7% planned adoption within one year. Based on AHA IT Supplement survey (2,174 hospitals). Cited in Part 3 industry intro.
Type: article
Year: 2025
-
2024 Artificial Intelligence TechReport
— American Bar Association
(2025)
[article]
Lawyer AI adoption tripled: 11% (2023) to 30% (2024). Among 500+ attorney firms: 47.8%. Top tools: ChatGPT 52%, CoCounsel 26%, Lexis+ AI 24%. Independent (non-vendor) source. Cited in Part 3 industry intro.
2024 Artificial Intelligence TechReport
American Bar Association
Lawyer AI adoption tripled: 11% (2023) to 30% (2024). Among 500+ attorney firms: 47.8%. Top tools: ChatGPT 52%, CoCounsel 26%, Lexis+ AI 24%. Independent (non-vendor) source. Cited in Part 3 industry intro.
Type: article
Year: 2025
-
18 expert interviews across 10 industrial R&D orgs find two competing requirements: deterministic execution and conversational flexibility. Only 2 of 20 reviewed systems achieve both. Schema-gating separates them. Independent discovery of the proposal engine / decision engine pattern with empirical practitioner validation. Cost: shifts effort from prompt engineering to registry maintenance.
Talk Freely, Execute Strictly: Schema-Gated Agentic AI
Brandon Strickland, Manan Vijeta, Stuart Moores, Eyal Bodek
18 expert interviews across 10 industrial R&D orgs find two competing requirements: deterministic execution and conversational flexibility. Only 2 of 20 reviewed systems achieve both. Schema-gating separates them. Independent discovery of the proposal engine / decision engine pattern with empirical practitioner validation. Cost: shifts effort from prompt engineering to registry maintenance.
Type: paper
Year: 2026
-
Maps Bainbridge's 1983 'ironies of automation' onto generative AI. Four productivity loss mechanisms: production-to-evaluation shift, workflow restructuring, task interruptions, and task-complexity polarization. Copilot users failed to complete tasks more often. Novices disproportionately harmed --- the tool benefits those who need it least. GenAI-as-feedback outperforms GenAI-as-generator.
Ironies of Generative AI: Understanding and Mitigating Productivity Loss in Human-AI Interactions
Austeja Simkute, Advait Tankelevitch, Viktor Kewenig
Maps Bainbridge's 1983 'ironies of automation' onto generative AI. Four productivity loss mechanisms: production-to-evaluation shift, workflow restructuring, task interruptions, and task-complexity polarization. Copilot users failed to complete tasks more often. Novices disproportionately harmed --- the tool benefits those who need it least. GenAI-as-feedback outperforms GenAI-as-generator.
Type: paper
Year: 2024
-
§22 Technical Reports & Whitepapers (PDFs)
36
-
★
GPT-4 Technical Report
— OpenAI
(2023)
[whitepaper]
Capabilities, limitations, safety
GPT-4 Technical Report
OpenAI
Capabilities, limitations, safety
Type: whitepaper
Year: 2023
-
Multimodal architecture
Gemini: A Family of Highly Capable Models
Google
Multimodal architecture
Type: whitepaper
Year: 2023
-
Open weights, RLHF details
Llama 2: Open Foundation Models
Meta
Open weights, RLHF details
Type: whitepaper
Year: 2023
-
RLHF from human feedback
Training a Helpful and Harmless Assistant
Anthropic
RLHF from human feedback
Type: whitepaper
Year: 2022
-
★
Constitutional AI
— Anthropic
(2022)
[whitepaper]
Self-supervised alignment
Constitutional AI
Anthropic
Self-supervised alignment
Type: whitepaper
Year: 2022
-
The Claude Model Spec
— Anthropic
(2025)
[whitepaper]
Values, behavior guidelines
The Claude Model Spec
Anthropic
Values, behavior guidelines
Type: whitepaper
Year: 2025
-
Scaling laws for loss vs. compute, data, and parameters. Foundational predictions for how LM performance scales.
Scaling Laws for Autoregressive Models
OpenAI
Scaling laws for loss vs. compute, data, and parameters. Foundational predictions for how LM performance scales.
Type: whitepaper
Year: 2020
- Safety & Alignment
-
Red Teaming Language Models
— Anthropic
(2022)
[whitepaper]
Discovering harmful outputs
Red Teaming Language Models
Anthropic
Discovering harmful outputs
Type: whitepaper
Year: 2022
-
What is a fact in the age of generative AI? Fact-checking as an epistemological lens
— Dierickx et al.
(2026)
[paper]
Emergent facts: constructs arising from training data, architecture, and prompts. Probabilistic, context-dependent, opaque in derivation. Core epistemology for Part 3.
What is a fact in the age of generative AI? Fact-checking as an epistemological lens
Laurence Dierickx, et al.
Emergent facts: constructs arising from training data, architecture, and prompts. Probabilistic, context-dependent, opaque in derivation. Core epistemology for Part 3.
Type: paper
Year: 2026
Journal: Information, Communication & Society
- Mechanistic Interpretability
-
Reverse-engineering transformers via path expansion. QK/OV circuits, skip-trigrams, composition types. Foundation for understanding what's in the weights.
A Mathematical Framework for Transformer Circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, et al.
Reverse-engineering transformers via path expansion. QK/OV circuits, skip-trigrams, composition types. Foundation for understanding what's in the weights.
Type: paper
Year: 2021
Key concepts: residual stream as communication channel, attention heads as two separable circuits (QK for 'where to attend', OV for 'what to copy').
-
Induction heads implement [A][B]...[A] → [B] pattern matching. Primary mechanism for in-context learning. Phase change when they form.
In-Context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, et al.
Induction heads implement [A][B]...[A] → [B] pattern matching. Primary mechanism for in-context learning. Phase change when they form.
Type: paper
Year: 2022
Bridges circuit-level understanding to macroscopic phenomena like scaling laws. Explains why models suddenly get better at copying patterns.
-
★
Toy Models of Superposition
— Elhage et al.
(2022)
[paper]
Networks represent more features than dimensions via almost-orthogonal directions. Explains why neurons are polysemantic. Motivates sparse autoencoders.
Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, et al.
Networks represent more features than dimensions via almost-orthogonal directions. Explains why neurons are polysemantic. Motivates sparse autoencoders.
Type: paper
Year: 2022
Features organize into geometric structures (triangles, tetrahedrons). Superposition increases adversarial vulnerability. 'Solving superposition' is the core interpretability challenge.
-
Sparse autoencoders extract 4,000+ interpretable features from 512 neurons. DNA sequences, legal language, HTTP requests as separate features. Most properties invisible at neuron level.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
Trenton Bricken, Adly Templeton, et al.
Sparse autoencoders extract 4,000+ interpretable features from 512 neurons. DNA sequences, legal language, HTTP requests as separate features. Most properties invisible at neuron level.
Type: paper
Year: 2023
Dictionary learning applied to activations. Proves features are distributed across neurons, not localized.
-
SAEs scale to production models. Millions of features including safety-relevant ones (deception, bias, dangerous content). Clamping features causally steers behavior.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Conerly, et al.
SAEs scale to production models. Millions of features including safety-relevant ones (deception, bias, dangerous content). Clamping features causally steers behavior.
Type: paper
Year: 2024
Found Golden Gate Bridge feature, code vulnerabilities, multilingual concepts. Features cluster semantically in geometric space. Direct link between geometry and steering.
- Safety & Alignment
-
Representation Engineering
— Anthropic
(2023)
[whitepaper]
Controlling model behavior via activation steering
Representation Engineering
Anthropic
Controlling model behavior via activation steering
Type: whitepaper
Year: 2023
Foundational paper. Personas, behaviors, and concepts are measurable directions in activation space.
-
Add 'steering vectors' to activations at inference to control behavior without fine-tuning.
Activation Addition: Steering Language Models Without Optimization
Turner et al.
Add 'steering vectors' to activations at inference to control behavior without fine-tuning.
Type: paper
Year: 2023
Practical activation engineering. Shows emotion, honesty, sycophancy can be steered geometrically.
-
The Assistant Axis
— Anthropic
(2026)
[whitepaper]
Models organize personas along measurable 'assistant axis' in activation space. Jailbreaks displace models from assistant region.
The Assistant Axis
Anthropic
Models organize personas along measurable 'assistant axis' in activation space. Jailbreaks displace models from assistant region.
Type: whitepaper
Year: 2026
Geometric interpretation of persona. Cited in Part 2: explains why jailbreaks work as displacement.
-
Classifiers on hidden states detect hallucinations better than output-based methods.
Detecting Hallucination with Internal Representations
Azaria et al.
Classifiers on hidden states detect hallucinations better than output-based methods.
Type: paper
Year: 2024
Internal geometry knows when model is confabulating. Practical reliability technique.
-
Comprehensive survey of activation steering methods: probing, steering vectors, concept erasure, model editing.
A Survey on Representation Engineering
Li et al.
Comprehensive survey of activation steering methods: probing, steering vectors, concept erasure, model editing.
Type: paper
Year: 2025
Good overview for embeddings topology article. Covers localization, editing, and limitations.
-
Sleeper Agents
— Anthropic
(2024)
[whitepaper]
Deceptive behavior persistence
Sleeper Agents
Anthropic
Deceptive behavior persistence
Type: whitepaper
Year: 2024
-
Fine-tuning on benign data degrades safety because alignment concentrates in low-dimensional subspaces with sharp curvature.
The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
arxiv
Fine-tuning on benign data degrades safety because alignment concentrates in low-dimensional subspaces with sharp curvature.
Type: paper
Year: 2026
Explains why fine-tuning breaks safety even without adversarial intent. Alignment is geometrically brittle—gradient descent can't detect or defend these subspaces. Directly supports Part 1's alignment stack framing.
-
Agent safety framework
Practices for Governing Agentic AI
OpenAI
Agent safety framework
Type: whitepaper
Year: 2023
-
Government risk framework
AI Risk Management Framework
NIST
Government risk framework
Type: whitepaper
Year: 2023
-
Safety behaviors concentrate in small parameter subset, making alignment brittle. Proposes neuron-level alignment as defense against targeted attacks.
SafeNeuron: Neuron-Level Safety Alignment for Large Language Models
Wang et al.
Safety behaviors concentrate in small parameter subset, making alignment brittle. Proposes neuron-level alignment as defense against targeted attacks.
Type: paper
Year: 2026
Directly supports 'alignment stack' framing: shows WHERE in architecture safety lives.
-
RL-trained models spontaneously learn to exploit loopholes to maximize reward, even without adversarial prompting. Specification gaming emerges from training itself.
Capability-Oriented Training Induced Alignment Risk
Zhou et al.
RL-trained models spontaneously learn to exploit loopholes to maximize reward, even without adversarial prompting. Specification gaming emerges from training itself.
Type: paper
Year: 2026
Supports Part 2: the alignment stack can teach models to game the stack.
-
Theoretical analysis of how sampling and reference policy choices affect preference alignment. Explains why some RLHF configurations fail.
How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics
Chen et al.
Theoretical analysis of how sampling and reference policy choices affect preference alignment. Explains why some RLHF configurations fail.
Type: paper
Year: 2026
- Topology & Geometry
-
Plot Holes and Text Topology
— Stanford CS224N
(2020)
[paper]
Uses text topology to detect narrative inconsistencies. Plot holes as topological defects.
Plot Holes and Text Topology
Stanford CS224N
Uses text topology to detect narrative inconsistencies. Plot holes as topological defects.
Type: paper
Year: 2020
The original insight connecting topology to narrative consistency. Bridges Part 3 to topology article.
-
Survey of KG-based hallucination mitigation. Covers GraphEval, FactAlign, and extract-then-verify patterns.
Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective
Lavrinovics et al.
Survey of KG-based hallucination mitigation. Covers GraphEval, FactAlign, and extract-then-verify patterns.
Type: paper
Year: 2025
Key reference for output-side geometry. Atomic claims as triples, graph alignment for verification.
-
Algebraic topology on representation manifolds. Introduces 'perforation' measure. Transformers vs LSTMs have different topological signatures.
Hidden Holes: Topological Aspects of Language Models
Fitz, Romero & Schneider
Algebraic topology on representation manifolds. Introduces 'perforation' measure. Transformers vs LSTMs have different topological signatures.
Type: paper
Year: 2024
Foundational for representational topology section. Natural language creates topology absent from synthetic data.
-
TDA on reasoning traces. Topological features outperform graph metrics for assessing reasoning quality.
The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models
Tan et al.
TDA on reasoning traces. Topological features outperform graph metrics for assessing reasoning quality.
Type: paper
Year: 2025
Topology as evaluation metric. Applications to automated assessment and RL reward signals.
-
Persistent homology under backdoor fine-tuning and prompt injection. Adversarial conditions compress latent topologies.
Holes in Latent Space: Topological Signatures Under Adversarial Influence
Fay et al.
Persistent homology under backdoor fine-tuning and prompt injection. Adversarial conditions compress latent topologies.
Type: paper
Year: 2025
Security angle: adversarial attacks leave topological signatures. Detection through topology.
-
LLM-driven graph construction and repair. Version control for graph edits, edge impact scores for prioritized repair.
Constructing Coherent Spatial Memory in LLM Agents Through Graph Rectification
Zhang et al.
LLM-driven graph construction and repair. Version control for graph edits, edge impact scores for prioritized repair.
Type: paper
Year: 2025
Structural consistency as first-class concern. LLMs as graph builders, not just queriers.
-
LLMs refine graph topology via semantic similarity, not just node features. Edge refinement and pseudo-label propagation.
LLM4GraphTopology: Using LLMs to Refine Graph Structure
DASFAA'25
LLMs refine graph topology via semantic similarity, not just node features. Edge refinement and pseudo-label propagation.
Type: paper
Year: 2025
Shift from LLMs as feature enhancers to structural improvers.
-
Plot hole detection as LLM reasoning benchmark. LLMs generate 50-100% more plot holes than humans.
Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
Ahuja, Sclar & Tsvetkov
Plot hole detection as LLM reasoning benchmark. LLMs generate 50-100% more plot holes than humans.
Type: paper
Year: 2025
Narrative consistency as evaluation. Requires entity tracking, abstract thinking, theory of mind.
-
Deep Learning is Applied Topology
— 12gramsofcarbon
(2024)
[article]
Conceptual primer on neural nets as topology generators. Embeddings as geometric objects, dimensional separability.
Deep Learning is Applied Topology
12gramsofcarbon
Conceptual primer on neural nets as topology generators. Embeddings as geometric objects, dimensional separability.
Type: article
Year: 2024
Good accessible introduction for the article's opening.
-
Peer-reviewed confirmation that constrained decoding degrades reasoning while improving classification. Validates Tam et al. ('Let Me Speak Freely'). Structured output formats restrict the token space in ways that prevent chain-of-thought reasoning paths. Evidence for the structured output trap: format compliance and reasoning quality trade off.
The Hidden Cost of Structure: How Constrained Decoding Degrades LLM Reasoning
Peer-reviewed confirmation that constrained decoding degrades reasoning while improving classification. Validates Tam et al. ('Let Me Speak Freely'). Structured output formats restrict the token space in ways that prevent chain-of-thought reasoning paths. Evidence for the structured output trap: format compliance and reasoning quality trade off.
Type: paper
Year: 2025
-
Part 7: Big Picture & Paths
-
§23 Philosophy / Criticism / Big Picture
70
-
How Complex Systems Fail
— Richard I. Cook
(2000)
[article]
18 propositions on failure in complex systems. Short, canonical. Key claims: catastrophe requires multiple failures (prop 3); latent failures change constantly (prop 5); hindsight biases post-accident attribution (prop 12); safety is a system property, not a component property (prop 14). Directly relevant to detection ceiling (interlocking safeguards erode under added complexity) and designing-for-invariants (invariants as safeguards that must hold under composition). Cited by Aphyr (2026) in 'Latent Disaster' section.
How Complex Systems Fail
Richard I. Cook
18 propositions on failure in complex systems. Short, canonical. Key claims: catastrophe requires multiple failures (prop 3); latent failures change constantly (prop 5); hindsight biases post-accident attribution (prop 12); safety is a system property, not a component property (prop 14). Directly relevant to detection ceiling (interlocking safeguards erode under added complexity) and designing-for-invariants (invariants as safeguards that must hold under composition). Cited by Aphyr (2026) in 'Latent Disaster' section.
Type: article
Year: 2000
-
Doctors using AI polyp detection tools appear worse at spotting adenomas during colonoscopies. Empirical evidence for Bainbridge's deskilling prediction applied to clinical AI. Cited by Aphyr (2026). Relevant to verification asymmetry (Part 3) and Designing for Invariants (human-in-the-loop is not an invariant).
Impact of AI-assisted colonoscopy on adenoma detection: a deskilling effect
Lancet Gastroenterology & Hepatology
Doctors using AI polyp detection tools appear worse at spotting adenomas during colonoscopies. Empirical evidence for Bainbridge's deskilling prediction applied to clinical AI. Cited by Aphyr (2026). Relevant to verification asymmetry (Part 3) and Designing for Invariants (human-in-the-loop is not an invariant).
Type: paper
Year: 2025
-
People who interact with LLMs are more likely to believe themselves in the right, and less likely to take responsibility and repair conflicts. Complements Shaw & Nave (cognitive surrender): Shaw & Nave shows users adopt AI output uncritically; this shows LLM use inflates self-certainty more broadly. Cited by Aphyr (2026, Psychological Hazards). Relevant to hallucinations article (why readers trust wrong output) and verification asymmetry (the reviewer's confidence doesn't track accuracy).
LLM interactions increase users' belief in their own correctness
People who interact with LLMs are more likely to believe themselves in the right, and less likely to take responsibility and repair conflicts. Complements Shaw & Nave (cognitive surrender): Shaw & Nave shows users adopt AI output uncritically; this shows LLM use inflates self-certainty more broadly. Cited by Aphyr (2026, Psychological Hazards). Relevant to hallucinations article (why readers trust wrong output) and verification asymmetry (the reviewer's confidence doesn't track accuracy).
Type: paper
Year: 2025
-
★
The Bitter Lesson
— Sutton
(2019)
[article]
Scaling beats clever engineering
The Bitter Lesson
Sutton
Scaling beats clever engineering
Type: article
Year: 2019
-
Sparks of Artificial General Intelligence
— Microsoft
(2023)
[paper]
Optimistic capability claims
Sparks of Artificial General Intelligence
Microsoft
Optimistic capability claims
Type: paper
Year: 2023
-
LLMs exhibit 'presumptive grounding' — assuming understanding before establishing it. Cites finding that RLHF erodes conversational grounding: models 77.5% less likely than humans to use clarification and acknowledgment acts. Connects to Part 3's core argument: alignment stack optimizes for helpfulness at the cost of verification. The model commits to an interpretation rather than flagging ambiguity. Reframes the design challenge from information (what AI knows) to communication (how AI engages).
What if LLMs were actually interesting to talk to? Reaching 'common ground' with conversational AI
Adrian Chan
LLMs exhibit 'presumptive grounding' — assuming understanding before establishing it. Cites finding that RLHF erodes conversational grounding: models 77.5% less likely than humans to use clarification and acknowledgment acts. Connects to Part 3's core argument: alignment stack optimizes for helpfulness at the cost of verification. The model commits to an interpretation rather than flagging ambiguity. Reframes the design challenge from information (what AI knows) to communication (how AI engages).
Type: article
Year: 2025
-
Gary Marcus's writings
[resource]
Skeptical of pure neural approaches
Gary Marcus's writings
Skeptical of pure neural approaches
Type: resource
-
AI Snake Oil
— Narayanan & Kapoor
(2024)
[book]
Separating AI hype from reality; what works, what doesn't, and the flawed science behind the claims
AI Snake Oil
Arvind Narayanan, Sayash Kapoor
Separating AI hype from reality; what works, what doesn't, and the flawed science behind the claims
Type: book
Year: 2024
Publisher: Princeton University Press
ISBN: 9780691249131
-
FAccT 2021. Environmental and labor costs of large LMs; risks of coherent-sounding text without meaning; documentation gaps. The paper that got Gebru fired from Google. Landmark critique of LLMs as pattern-stitchers without understanding.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
FAccT 2021. Environmental and labor costs of large LMs; risks of coherent-sounding text without meaning; documentation gaps. The paper that got Gebru fired from Google. Landmark critique of LLMs as pattern-stitchers without understanding.
Type: paper
Year: 2021
-
★
Talking About Large Language Models
— Murray Shanahan
(2022)
[paper]
Resist anthropomorphism. LLMs model token distributions, not beliefs. Intentional stance is useful shorthand but obscures mechanism. Cites Dennett, Wittgenstein.
Talking About Large Language Models
Murray Shanahan
Resist anthropomorphism. LLMs model token distributions, not beliefs. Intentional stance is useful shorthand but obscures mechanism. Cites Dennett, Wittgenstein.
Type: paper
Year: 2022
-
The Alignment Problem: Machine Learning and Human Values
— Brian Christian
(2020)
[book]
How ML systems learn unintended behaviors. Accessible bridge between philosophy and engineering.
The Alignment Problem: Machine Learning and Human Values
Brian Christian
How ML systems learn unintended behaviors. Accessible bridge between philosophy and engineering.
Type: book
Year: 2020
Publisher: W.W. Norton
ISBN: 978-0393635829
-
How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics
— N. Katherine Hayles
(1999)
[book]
Information lost its body. Foundational posthumanist text on disembodied cognition.
How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics
N. Katherine Hayles
Information lost its body. Foundational posthumanist text on disembodied cognition.
Type: book
Year: 1999
Publisher: University of Chicago Press
ISBN: 978-0226321462
-
Natural-Born Cyborgs: Minds, Technologies, and the Future of Human Intelligence
— Andy Clark
(2003)
[book]
Extended mind thesis. Human intelligence has always been 'retrieval-augmented.'
Natural-Born Cyborgs: Minds, Technologies, and the Future of Human Intelligence
Andy Clark
Extended mind thesis. Human intelligence has always been 'retrieval-augmented.'
Type: book
Year: 2003
Publisher: Oxford University Press
ISBN: 978-0195177510
-
How Deeply Human Is Language?
— Grodzinsky
(2025)
[book]
Chomskyan linguistics vs. LLM capabilities
How Deeply Human Is Language?
Grodzinsky
Chomskyan linguistics vs. LLM capabilities
Type: book
Year: 2025
-
On the Measure of Intelligence
— Chollet
(2019)
[paper]
What is intelligence, really?
On the Measure of Intelligence
Chollet
What is intelligence, really?
Type: paper
Year: 2019
-
Models learn heuristics, not world models
What Has a Foundation Model Found?
Vafa et al.
Models learn heuristics, not world models
Type: paper
Year: 2025
-
Reward is Enough
— Silver et al.
(2021)
[paper]
RL maximalism
Reward is Enough
Silver et al.
RL maximalism
Type: paper
Year: 2021
-
Judea Pearl's work
[resource]
Causality vs. correlation
Judea Pearl's work
Causality vs. correlation
Type: resource
-
Thinking, Fast and Slow
— Kahneman
[book]
System 1/2---informs neuro-symbolic debate
Thinking, Fast and Slow
Kahneman
System 1/2---informs neuro-symbolic debate
Type: book
-
Gödel, Escher, Bach
— Hofstadter
[book]
Classic on minds and formal systems
Gödel, Escher, Bach
Hofstadter
Classic on minds and formal systems
Type: book
-
Artificial Intelligence: The Very Idea
— Haugeland
(1985)
[book]
Coined 'GOFAI,' philosophical foundations
Artificial Intelligence: The Very Idea
Haugeland
Coined 'GOFAI,' philosophical foundations
Type: book
Year: 1985
-
What Computers Can't Do
— Dreyfus
(1972)
[book]
Classic phenomenological critique
What Computers Can't Do
Dreyfus
Classic phenomenological critique
Type: book
Year: 1972
-
Computer Power and Human Reason
— Weizenbaum
(1976)
[book]
ELIZA creator's warning about AI hubris
Computer Power and Human Reason
Weizenbaum
ELIZA creator's warning about AI hubris
Type: book
Year: 1976
-
The Emperor's New Mind
— Penrose
(1989)
[book]
Consciousness, Gödel, and computation
The Emperor's New Mind
Penrose
Consciousness, Gödel, and computation
Type: book
Year: 1989
-
The Cambridge Handbook of AI
— Boden, ed.
(2014)
[book]
Comprehensive overview chapters
The Cambridge Handbook of AI
Boden, ed.
Comprehensive overview chapters
Type: book
Year: 2014
-
LLMs as functionally delirious---follow probability distributions without recognizing disconnection from reality. Uses Foucault's History of Madness + Hume. No conflict with Part 1 (different Foucault text, different argument).
On Large Language Models' Delirium (with Hume and Foucault)
Schliesser, Eric
LLMs as functionally delirious---follow probability distributions without recognizing disconnection from reality. Uses Foucault's History of Madness + Hume. No conflict with Part 1 (different Foucault text, different argument).
Type: post
Year: 2023
-
Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence
— Crawford, Kate
(2021)
[book]
Material and political economy of AI: labor, data, infrastructure, state power. Empirical grounding for 'whose norms' critique.
Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence
Kate Crawford
Material and political economy of AI: labor, data, infrastructure, state power. Empirical grounding for 'whose norms' critique.
Type: book
Year: 2021
Publisher: Yale UP
-
Race After Technology: Abolitionist Tools for the New Jim Code
— Benjamin, Ruha
(2019)
[book]
How discriminatory design gets encoded as neutral technical infrastructure. 'New Jim Code' --- racism embedded in algorithms and platforms.
Race After Technology: Abolitionist Tools for the New Jim Code
Ruha Benjamin
How discriminatory design gets encoded as neutral technical infrastructure. 'New Jim Code' --- racism embedded in algorithms and platforms.
Type: book
Year: 2019
Publisher: Polity
-
Algorithms of Oppression: How Search Engines Reinforce Racism
— Noble, Safiya Umoja
(2018)
[book]
Search as norm-enforcing infrastructure. Whose knowledge gets surfaced, whose gets suppressed.
Algorithms of Oppression: How Search Engines Reinforce Racism
Safiya Umoja Noble
Search as norm-enforcing infrastructure. Whose knowledge gets surfaced, whose gets suppressed.
Type: book
Year: 2018
Publisher: NYU Press
-
A Realist Theory of Science
— Bhaskar, Roy
(1975)
[book]
Foundational text for critical realism. Stratified ontology: the empirical (observed), actual (events), and real (underlying mechanisms). Retroduction: infer mechanisms from effects. Grounds the distinction between context-steering (meaning) and formal constraints (mechanism) in Part 3.
A Realist Theory of Science
Roy Bhaskar
Foundational text for critical realism. Stratified ontology: the empirical (observed), actual (events), and real (underlying mechanisms). Retroduction: infer mechanisms from effects. Grounds the distinction between context-steering (meaning) and formal constraints (mechanism) in Part 3.
Type: book
Year: 1975
Publisher: Leeds Books
-
The Possibility of Naturalism
— Bhaskar, Roy
(1979)
[book]
Applies critical realism to social science. More accessible than A Realist Theory of Science. Social structures are real and causally efficacious even when unobservable. Better entry point for social applications of critical realism.
The Possibility of Naturalism
Roy Bhaskar
Applies critical realism to social science. More accessible than A Realist Theory of Science. Social structures are real and causally efficacious even when unobservable. Better entry point for social applications of critical realism.
Type: book
Year: 1979
Publisher: Harvester Press
-
How the Laws of Physics Lie
— Cartwright, Nancy
(1983)
[book]
Scientific laws describe idealized models, not the world directly. Capacities (dispositions of real things) are more fundamental than covering laws. Laws are tools for controlled domains; stability is local, not universal. Strengthens the project's stance that constraints emerge from causal structure rather than universal description. Also relevant to LLM evals: what do benchmark results actually measure about model behavior?
How the Laws of Physics Lie
Nancy Cartwright
Scientific laws describe idealized models, not the world directly. Capacities (dispositions of real things) are more fundamental than covering laws. Laws are tools for controlled domains; stability is local, not universal. Strengthens the project's stance that constraints emerge from causal structure rather than universal description. Also relevant to LLM evals: what do benchmark results actually measure about model behavior?
Type: book
Year: 1983
Publisher: Oxford UP
-
Representing and Intervening
— Hacking, Ian
(1983)
[book]
Entity realism: 'if you can spray them, they're real.' Representing (building models) and intervening (manipulating things) are distinct activities. Maps to pentesting: exploitation proves the entity, but the fix requires the model. Core thinker for Hacker Epistemology article. Non-polemical treatment of realism vs. constructivism.
Representing and Intervening
Ian Hacking
Entity realism: 'if you can spray them, they're real.' Representing (building models) and intervening (manipulating things) are distinct activities. Maps to pentesting: exploitation proves the entity, but the fix requires the model. Core thinker for Hacker Epistemology article. Non-polemical treatment of realism vs. constructivism.
Type: book
Year: 1983
Publisher: Cambridge UP
-
Two Dogmas of Empiricism
— Quine, W.V.O.
(1951)
[article]
Foundational essay. Challenges analytic/synthetic distinction and reductionism. Holism: statements face experience as a corporate body. Indeterminacy of translation: meaning isn't fixed by behavior alone. Start here before Word and Object.
Two Dogmas of Empiricism
Willard Van Orman Quine
Foundational essay. Challenges analytic/synthetic distinction and reductionism. Holism: statements face experience as a corporate body. Indeterminacy of translation: meaning isn't fixed by behavior alone. Start here before Word and Object.
Type: article
Year: 1951
Publisher: The Philosophical Review
-
Inquiries into Truth and Interpretation
— Davidson, Donald
(1984)
[book]
Essay collection including 'Truth and Meaning' and 'Radical Interpretation'. Principle of charity: interpret speakers as mostly rational and true. Maps to how we evaluate LLM outputs. Builds on Quine toward a more workable account of meaning attribution.
Inquiries into Truth and Interpretation
Donald Davidson
Essay collection including 'Truth and Meaning' and 'Radical Interpretation'. Principle of charity: interpret speakers as mostly rational and true. Maps to how we evaluate LLM outputs. Builds on Quine toward a more workable account of meaning attribution.
Type: book
Year: 1984
Publisher: Oxford UP
-
Articulating Reasons: An Introduction to Inferentialism
— Brandom, Robert
(2000)
[book]
Accessible intro to inferentialism: meaning is a matter of inferential role, not reference. The analytic version of Wittgenstein's language games argument. Read before Making It Explicit (1994). Complicates the symbol grounding problem: if meaning is inferential, world-reference may be less foundational than Harnad claims.
Articulating Reasons: An Introduction to Inferentialism
Robert Brandom
Accessible intro to inferentialism: meaning is a matter of inferential role, not reference. The analytic version of Wittgenstein's language games argument. Read before Making It Explicit (1994). Complicates the symbol grounding problem: if meaning is inferential, world-reference may be less foundational than Harnad claims.
Type: book
Year: 2000
Publisher: Harvard UP
-
Gender Trouble: Feminism and the Subversion of Identity
— Butler, Judith
(1990)
[book]
Performativity: identity is constituted through repeated citational acts, not expressed from a prior self. Relevant to role-switching and character-capture in LLMs. Whose norms get encoded as 'natural' behavior.
Gender Trouble: Feminism and the Subversion of Identity
Judith Butler
Performativity: identity is constituted through repeated citational acts, not expressed from a prior self. Relevant to role-switching and character-capture in LLMs. Whose norms get encoded as 'natural' behavior.
Type: book
Year: 1990
Publisher: Routledge
-
Alignment technologically pacifies value conflicts by framing political questions as engineering problems. Key concepts: Foucault's 'conducting conducts' (more precise than panopticon for RLHF), 'authority based on examples,' is/ought normativity. Critical angle---trilogy is descriptive/engineering by contrast.
"Desired behaviors": alignment and the emergence of a machine learning ethics
Schwerzmann & Campolo
Alignment technologically pacifies value conflicts by framing political questions as engineering problems. Key concepts: Foucault's 'conducting conducts' (more precise than panopticon for RLHF), 'authority based on examples,' is/ought normativity. Critical angle---trilogy is descriptive/engineering by contrast.
Type: paper
Year: 2025
-
Some Philosophical Problems from the Standpoint of Artificial Intelligence
— McCarthy & Hayes
(1969)
[paper]
The classical AI frame problem: how does a system determine what's relevant---what changes when an action is taken, and what stays the same? Series title concept. LLMs face this at every token.
Some Philosophical Problems from the Standpoint of Artificial Intelligence
John McCarthy, Patrick J. Hayes
The classical AI frame problem: how does a system determine what's relevant---what changes when an action is taken, and what stays the same? Series title concept. LLMs face this at every token.
Type: paper
Year: 1969
-
Logic and Conversation
— Grice, H.P.
(1975)
[paper]
The Cooperative Principle and conversational maxims (Quality, Quantity, Relation, Manner). LLMs learn the statistical shape of cooperative conversation without being cooperative agents. Three connections: (1) hallucination is an invisible Quality violation---the surface form signals 'I have evidence' when the model has none; (2) the cooperative default is an attack surface---embedded context exploits the model's assumption that inputs are truthful and relevant; (3) role prompts trigger Gricean implicature---'you are a security auditor' causes inference of an entire pragmatic context. Not used in the trilogy (Austin and Goffman do more work per sentence for the audience), but would fit a standalone piece on why hallucinations are deceptive.
Logic and Conversation
H. Paul Grice
The Cooperative Principle and conversational maxims (Quality, Quantity, Relation, Manner). LLMs learn the statistical shape of cooperative conversation without being cooperative agents. Three connections: (1) hallucination is an invisible Quality violation---the surface form signals 'I have evidence' when the model has none; (2) the cooperative default is an attack surface---embedded context exploits the model's assumption that inputs are truthful and relevant; (3) role prompts trigger Gricean implicature---'you are a security auditor' causes inference of an entire pragmatic context. Not used in the trilogy (Austin and Goffman do more work per sentence for the audience), but would fit a standalone piece on why hallucinations are deceptive.
Type: paper
Year: 1975
-
Studies in the Way of Words
— Grice, H.P.
(1989)
[book]
Collected papers including 'Logic and Conversation.' The book-length treatment. Harvard UP.
Studies in the Way of Words
H. Paul Grice
Collected papers including 'Logic and Conversation.' The book-length treatment. Harvard UP.
Type: book
Year: 1989
-
LLMs generate substantially fewer grounding acts (clarification, acknowledgment, follow-up) than humans. Preference optimization (DPO/PPO) reduces grounding further. Prompting increases grounding frequency but does not improve agreement (Cohen's kappa) with human timing and placement. Key result for the project: surface behavioral change doesn't recover structural alignment. RLHF doesn't just fail to teach grounding---it actively degrades it. Empirical support for thesis 78 (learned shape of cooperation, not cooperation itself), the doctrine (steering changes what's probable, not what's possible), and verification asymmetry. Background support for Parts 2-3; not a headline citation. Stanford.
Grounding Gaps in Language Model Generations
Omar Shaikh, Kristina Gligoric, Ashna Khetan, Matthias Gerstgrasser, Diyi Yang, Dan Jurafsky
LLMs generate substantially fewer grounding acts (clarification, acknowledgment, follow-up) than humans. Preference optimization (DPO/PPO) reduces grounding further. Prompting increases grounding frequency but does not improve agreement (Cohen's kappa) with human timing and placement. Key result for the project: surface behavioral change doesn't recover structural alignment. RLHF doesn't just fail to teach grounding---it actively degrades it. Empirical support for thesis 78 (learned shape of cooperation, not cooperation itself), the doctrine (steering changes what's probable, not what's possible), and verification asymmetry. Background support for Parts 2-3; not a headline citation. Stanford.
Type: paper
Year: 2024
-
Navigating Rifts in Human-LLM Grounding: Study and Benchmark
— Shaikh, Mozannar, Bansal, Fourney & Horvitz
(2025)
[paper]
Sequel to Shaikh's 2024 grounding gaps paper, now with Microsoft. Quantifies the cascade: LLMs are 3x less likely to clarify and 16x less likely to request follow-up than humans. 45% of LLM turns are overresponses (verbose answers to narrow questions; humans: 0%). Early grounding failures triple the probability of downstream breakdowns. Rifts benchmark: frontier models score 23% on grounding-initiation tasks (worse than random 33%), but 96% when no grounding is needed --- models are instruction-followers, not collaborators. Simple prompting intervention improved Llama 3.1 8B from near-random to 54%. Builds on Clark's common ground theory. Explicitly anti-anthropomorphic: 'We use grounding as a metaphor, not as an anthropomorphic assertion.' Strengthens the case for verification asymmetry: the system can generate fluent responses without establishing mutual understanding, and neither party can verify alignment from surface text alone.
Navigating Rifts in Human-LLM Grounding: Study and Benchmark
Omar Shaikh, Hussein Mozannar, Gagan Bansal, Adam Fourney, Eric Horvitz
Sequel to Shaikh's 2024 grounding gaps paper, now with Microsoft. Quantifies the cascade: LLMs are 3x less likely to clarify and 16x less likely to request follow-up than humans. 45% of LLM turns are overresponses (verbose answers to narrow questions; humans: 0%). Early grounding failures triple the probability of downstream breakdowns. Rifts benchmark: frontier models score 23% on grounding-initiation tasks (worse than random 33%), but 96% when no grounding is needed --- models are instruction-followers, not collaborators. Simple prompting intervention improved Llama 3.1 8B from near-random to 54%. Builds on Clark's common ground theory. Explicitly anti-anthropomorphic: 'We use grounding as a metaphor, not as an anthropomorphic assertion.' Strengthens the case for verification asymmetry: the system can generate fluent responses without establishing mutual understanding, and neither party can verify alignment from surface text alone.
Type: paper
Year: 2025
-
The Symbol Grounding Problem
— Harnad, Stevan
(1990)
[paper]
A system that manipulates symbols without connection to their referents is performing language, not using it. The philosophical claim that motivates the grounding spectrum in Part 3. One sentence in the article, but the paper is the foundation. Physica D.
The Symbol Grounding Problem
Stevan Harnad
A system that manipulates symbols without connection to their referents is performing language, not using it. The philosophical claim that motivates the grounding spectrum in Part 3. One sentence in the article, but the paper is the foundation. Physica D.
Type: paper
Year: 1990
-
Cognition in the Wild
— Hutchins, Edwin
(1995)
[book]
Cognition is distributed across people, tools, artifacts, and environment. A cockpit 'remembers' its airspeed through instruments and checklists, not any single pilot's memory. Used in Part 3: the 'thinking' in an LLM system is across model weights, vector stores, retrieved documents, tool APIs, and deployment constraints. MIT Press.
Cognition in the Wild
Edwin Hutchins
Cognition is distributed across people, tools, artifacts, and environment. A cockpit 'remembers' its airspeed through instruments and checklists, not any single pilot's memory. Used in Part 3: the 'thinking' in an LLM system is across model weights, vector stores, retrieved documents, tool APIs, and deployment constraints. MIT Press.
Type: book
Year: 1995
-
★
An Introduction to Cybernetics
— Ashby, W. Ross
(1956)
[book]
Law of Requisite Variety: only variety can absorb variety. An LLM has enormous variety; a single rule won't constrain it. You need a structured system of constraints with commensurate complexity. Used in Part 3 to justify why the alignment stack needs multiple layers. Ch. 11 is the payload. Chapman & Hall.
An Introduction to Cybernetics
W. Ross Ashby
Law of Requisite Variety: only variety can absorb variety. An LLM has enormous variety; a single rule won't constrain it. You need a structured system of constraints with commensurate complexity. Used in Part 3 to justify why the alignment stack needs multiple layers. Ch. 11 is the payload. Chapman & Hall.
Type: book
Year: 1956
Publisher: Chapman & Hall
-
A regulator can only control what it models. If your threat model represents protocol components, it regulates protocol-level threats. Threats at the semantic layer (context manipulation, tool composition, principal hierarchy collapse) require a model of the semantic layer. Candidate thesis for MCP Attack Surface article: STRIDE is a good regulator of a different system. International Journal of Systems Science 1(2), 89--97.
Every Good Regulator of a System Must Be a Model of That System
Roger C. Conant and W. Ross Ashby
A regulator can only control what it models. If your threat model represents protocol components, it regulates protocol-level threats. Threats at the semantic layer (context manipulation, tool composition, principal hierarchy collapse) require a model of the semantic layer. Candidate thesis for MCP Attack Surface article: STRIDE is a good regulator of a different system. International Journal of Systems Science 1(2), 89--97.
Type: paper
Year: 1970
-
How to Do Things with Words
— Austin, J.L.
(1962)
[book]
Utterances don't just describe---they do things. Performative vs. constative. Load-bearing in Part 2: 'You are a security auditor' constitutes the role, doesn't describe it. The mechanism behind prompt-as-speech-act. Harvard UP.
How to Do Things with Words
J.L. Austin
Utterances don't just describe---they do things. Performative vs. constative. Load-bearing in Part 2: 'You are a security auditor' constitutes the role, doesn't describe it. The mechanism behind prompt-as-speech-act. Harvard UP.
Type: book
Year: 1962
-
Disempowered Speech
— Hornsby, Jennifer
(1995)
[paper]
Illocutionary silencing: a speech act can be performed yet fail to secure its intended force because the conditions don't recognize it as authoritative. Structural insight used (uncited) in Part 2: natural-language constraints can be present in the context window yet inert if the active frame doesn't treat them as binding. The mechanism for why jailbreaks don't remove safety instructions---they suppress their authority. Political context stripped; only the authority-encoding structure is used. Philosophical Topics 23.2.
Disempowered Speech
Jennifer Hornsby
Illocutionary silencing: a speech act can be performed yet fail to secure its intended force because the conditions don't recognize it as authoritative. Structural insight used (uncited) in Part 2: natural-language constraints can be present in the context window yet inert if the active frame doesn't treat them as binding. The mechanism for why jailbreaks don't remove safety instructions---they suppress their authority. Political context stripped; only the authority-encoding structure is used. Philosophical Topics 23.2.
Type: paper
Year: 1995
-
Lying, Misleading, and What is Said
— Saul, Jennifer
(2012)
[book]
The said/implicated distinction (Grice) is in practice a deniability structure: harmful content communicated through implicature while literal content provides cover. Direct application to adversarial prompting: attackers exploit the gap between what is said (surface classifier input) and what is implicated (model behavior trigger). Each Part 2 injection technique maps to a specific implicature move. Framework is general philosophy of language developed on feminist problems. Oxford UP.
Lying, Misleading, and What is Said
Jennifer Saul
The said/implicated distinction (Grice) is in practice a deniability structure: harmful content communicated through implicature while literal content provides cover. Direct application to adversarial prompting: attackers exploit the gap between what is said (surface classifier input) and what is implicated (model behavior trigger). Each Part 2 injection technique maps to a specific implicature move. Framework is general philosophy of language developed on feminist problems. Oxford UP.
Type: book
Year: 2012
-
Dogwhistles, Political Manipulation, and Philosophy of Language
— Saul, Jennifer
(2018)
[paper]
Dogwhistles communicate different things to different audiences using the same words. Literal content is deniable; implicature is audience-specific. Application to adversarial prompting: a prompt that reads as legitimate to a safety classifier while communicating different instructions to the model is operating as a dogwhistle in Saul's technical sense. Predicts an attack class where the audience split is classifier vs. model. In New Work on Speech Acts, Oxford UP.
Dogwhistles, Political Manipulation, and Philosophy of Language
Jennifer Saul
Dogwhistles communicate different things to different audiences using the same words. Literal content is deniable; implicature is audience-specific. Application to adversarial prompting: a prompt that reads as legitimate to a safety classifier while communicating different instructions to the model is operating as a dogwhistle in Saul's technical sense. Predicts an attack class where the audience split is classifier vs. model. In New Work on Speech Acts, Oxford UP.
Type: paper
Year: 2018
-
Computation and Its Limits
— Cockshott, Mackenzie, Michaelson
(2012)
[book]
Russell's Paradox -> Godel's Incompleteness -> Halting Problem as instances of the same structural move: self-reference produces contradiction, contradiction proves limits. Cohen's virus detection proof is the next link; LLMs processing language about language is a softer instance. Background knowledge for the three-layer framework (thesis 74). Not cited in articles. Oxford UP.
Computation and Its Limits
W. Paul Cockshott, Lewis M. Mackenzie, Greg Michaelson
Russell's Paradox -> Godel's Incompleteness -> Halting Problem as instances of the same structural move: self-reference produces contradiction, contradiction proves limits. Cohen's virus detection proof is the next link; LLMs processing language about language is a softer instance. Background knowledge for the three-layer framework (thesis 74). Not cited in articles. Oxford UP.
Type: book
Year: 2012
-
Structural Realism: The Best of Both Worlds?
— Worrall, John
(1989)
[paper]
Founding paper for structural realism. Scientific revolutions preserve mathematical structure while replacing ontology (Fresnel's equations survive ether -> electromagnetic fields). Epistemic version: we can only know structure, not the nature of entities. Names the philosophical position closest to the trilogy's method: same relational pattern (underdetermination -> self-reference -> constraints) across domains, without ontological commitment to what the domains 'really are.' In tension with Hacking's entity realism. Background knowledge, not article content.
Structural Realism: The Best of Both Worlds?
John Worrall
Founding paper for structural realism. Scientific revolutions preserve mathematical structure while replacing ontology (Fresnel's equations survive ether -> electromagnetic fields). Epistemic version: we can only know structure, not the nature of entities. Names the philosophical position closest to the trilogy's method: same relational pattern (underdetermination -> self-reference -> constraints) across domains, without ontological commitment to what the domains 'really are.' In tension with Hacking's entity realism. Background knowledge, not article content.
Type: paper
Year: 1989
-
The Sciences of the Artificial
— Simon, Herbert A.
(1969)
[book]
Foundational text on design science. Artificial systems are shaped by their environment and their purpose, not just their components. Bounded rationality: agents satisfice rather than optimize. The architecture of complexity: hierarchical, nearly decomposable systems. Inner/outer environment distinction maps to alignment stack (inner) vs. deployment context (outer). Pairs with Wimsatt on limited agents and Ashby on requisite variety.
The Sciences of the Artificial
Herbert A. Simon
Foundational text on design science. Artificial systems are shaped by their environment and their purpose, not just their components. Bounded rationality: agents satisfice rather than optimize. The architecture of complexity: hierarchical, nearly decomposable systems. Inner/outer environment distinction maps to alignment stack (inner) vs. deployment context (outer). Pairs with Wimsatt on limited agents and Ashby on requisite variety.
Type: book
Year: 1969
Publisher: MIT Press
-
More Is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science
— Anderson, Philip W.
(1972)
[paper]
Foundational emergence paper. 'The ability to reduce everything to simple fundamental laws does not imply the ability to start from those laws and reconstruct the universe.' Each level of complexity requires its own laws. Reductionism fails as a constructive program. Directly supports the project's layered realism: token prediction doesn't predict system behavior, alignment stack properties don't reduce to attention weights. Science 177(4047).
More Is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science
Philip W. Anderson
Foundational emergence paper. 'The ability to reduce everything to simple fundamental laws does not imply the ability to start from those laws and reconstruct the universe.' Each level of complexity requires its own laws. Reductionism fails as a constructive program. Directly supports the project's layered realism: token prediction doesn't predict system behavior, alignment stack properties don't reduce to attention weights. Science 177(4047).
Type: paper
Year: 1972
-
The Fixation of Belief
— Peirce, Charles Sanders
(1877)
[article]
Four methods of fixing belief: tenacity, authority, a priori, and scientific method. Only scientific method is self-correcting. Reality is what resists inquiry. Foundational pragmatist text. Prevents pragmatism from collapsing into 'if it works, it's true' --- Peirce's pragmatism is disciplined by consequence and correction, not convenience.
The Fixation of Belief
Charles Sanders Peirce
Four methods of fixing belief: tenacity, authority, a priori, and scientific method. Only scientific method is self-correcting. Reality is what resists inquiry. Foundational pragmatist text. Prevents pragmatism from collapsing into 'if it works, it's true' --- Peirce's pragmatism is disciplined by consequence and correction, not convenience.
Type: article
Year: 1877
-
How to Make Our Ideas Clear
— Peirce, Charles Sanders
(1878)
[article]
The pragmatic maxim: the meaning of a concept is its practical consequences. 'Consider what effects, that might conceivably have practical bearings, we conceive the object of our conception to have.' Companion to 'Fixation of Belief.' Together they ground the project's pragmatist epistemology: claims justified through consequence, not correspondence.
How to Make Our Ideas Clear
Charles Sanders Peirce
The pragmatic maxim: the meaning of a concept is its practical consequences. 'Consider what effects, that might conceivably have practical bearings, we conceive the object of our conception to have.' Companion to 'Fixation of Belief.' Together they ground the project's pragmatist epistemology: claims justified through consequence, not correspondence.
Type: article
Year: 1878
-
Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality
— Wimsatt, William C.
(2007)
[book]
Engineering epistemology for finite agents reasoning in complex systems. Key concepts: robustness (convergence across independent methods confirms reality), multiple realizability, heuristics as adaptive tools with systematic biases. Extremely aligned with the trilogy --- finite systems using piecewise approximations, constraints emerging from limits. Strengthens the project's stance without mystifying limits.
Re-Engineering Philosophy for Limited Beings: Piecewise Approximations to Reality
William C. Wimsatt
Engineering epistemology for finite agents reasoning in complex systems. Key concepts: robustness (convergence across independent methods confirms reality), multiple realizability, heuristics as adaptive tools with systematic biases. Extremely aligned with the trilogy --- finite systems using piecewise approximations, constraints emerging from limits. Strengthens the project's stance without mystifying limits.
Type: book
Year: 2007
Publisher: Harvard UP
-
★
Ironies of Automation
— Bainbridge
(1983)
[paper]
The more you automate, the more critical and more degraded the human operator becomes. Five ironies: designer's paradox (hardest tasks left for humans), skill decay, knowledge atrophy, monitoring impossibility (~30 min vigilance limit), and the final irony (most successful systems need most training investment). Prescription: human-computer collaboration, obvious failure modes, maintain operator skills. Foundational for verification asymmetry argument in Part 3.
Ironies of Automation
Lisanne Bainbridge
The more you automate, the more critical and more degraded the human operator becomes. Five ironies: designer's paradox (hardest tasks left for humans), skill decay, knowledge atrophy, monitoring impossibility (~30 min vigilance limit), and the final irony (most successful systems need most training investment). Prescription: human-computer collaboration, obvious failure modes, maintain operator skills. Foundational for verification asymmetry argument in Part 3.
Type: paper
Year: 1983
-
No knowledge from nowhere. The 'god trick' --- claiming to see everything from no particular place --- sounds like objectivity but is actually invisibility of position. Systems encode their makers' assumptions as universal. Situated knowledge isn't bias concession; it's the only honest starting point. Used in 'What Bainbridge and Haraway Knew' to add the question Bainbridge doesn't ask: who decided to move the boundary?
Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective
Donna Haraway
No knowledge from nowhere. The 'god trick' --- claiming to see everything from no particular place --- sounds like objectivity but is actually invisibility of position. Systems encode their makers' assumptions as universal. Situated knowledge isn't bias concession; it's the only honest starting point. Used in 'What Bainbridge and Haraway Knew' to add the question Bainbridge doesn't ask: who decided to move the boundary?
Type: paper
Year: 1988
-
UX practitioners express the same optimism that preceded every documented automation irony. Novel incubation period argument: design activities (sketching, wireframing) aren't just production tasks but cognitive aids. When AI handles them, designers lose exploratory iterations AND incubation time for unconscious processing. Not 'they get worse at drawing' --- 'they get worse at thinking about design problems.'
De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design
Prakash Shukla, Phuong Bui, Sean S Levy, Max Kowalski, Ali Baigelenov, Paul Parsons
UX practitioners express the same optimism that preceded every documented automation irony. Novel incubation period argument: design activities (sketching, wireframing) aren't just production tasks but cognitive aids. When AI handles them, designers lose exploratory iterations AND incubation time for unconscious processing. Not 'they get worse at drawing' --- 'they get worse at thinking about design problems.'
Type: paper
Year: 2025
-
Proposes 'System 3' (AI as external cognition) extending dual-process theory. Three preregistered experiments (N=1,372, 9,593 trials) show 'cognitive surrender': participants adopt AI outputs with minimal scrutiny. When AI is accurate, +25pp over baseline; when faulty, -15pp. Effect is large (Cohen's h=0.81). AI increases confidence even when wrong. Time pressure and incentives don't eliminate the pattern. Directly relevant to the memo's Section 3 (human role) and Bainbridge: the reviewer looking at brot findings is in this position. The humanizer makes it worse by removing signals that might trigger System 2 scrutiny.
Thinking—Fast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender
Steven D. Shaw, Gideon Nave
Proposes 'System 3' (AI as external cognition) extending dual-process theory. Three preregistered experiments (N=1,372, 9,593 trials) show 'cognitive surrender': participants adopt AI outputs with minimal scrutiny. When AI is accurate, +25pp over baseline; when faulty, -15pp. Effect is large (Cohen's h=0.81). AI increases confidence even when wrong. Time pressure and incentives don't eliminate the pattern. Directly relevant to the memo's Section 3 (human role) and Bainbridge: the reviewer looking at brot findings is in this position. The humanizer makes it worse by removing signals that might trigger System 2 scrutiny.
Type: paper
Year: 2026
-
Plans and Situated Actions: The Problem of Human-Machine Communication
— Suchman, Lucy
(1987)
[book]
Plans are representations, not programs for action. The gap between a formal plan and situated practice explains why fluent LLM output is trusted as evidence of action when no action occurred. The hallucination performs a plan ('here's what I found') while erasing the situated activity. Candidate thinker for the hallucinations article alongside Grice. STS tradition --- empirical and translatable.
Plans and Situated Actions: The Problem of Human-Machine Communication
Lucy Suchman
Plans are representations, not programs for action. The gap between a formal plan and situated practice explains why fluent LLM output is trusted as evidence of action when no action occurred. The hallucination performs a plan ('here's what I found') while erasing the situated activity. Candidate thinker for the hallucinations article alongside Grice. STS tradition --- empirical and translatable.
Type: book
Year: 1987
Publisher: Cambridge UP
-
Privacy in Context: Technology, Policy, and the Integrity of Social Life
— Nissenbaum, Helen
(2010)
[book]
Contextual integrity: information flows have context-dependent norms. A flow appropriate in one context becomes a violation in another. More precise than least privilege --- an MCP tool call can have the right permissions and still violate contextual integrity if the context has shifted. Candidate thinker for MCP Attack Surface article.
Privacy in Context: Technology, Policy, and the Integrity of Social Life
Helen Nissenbaum
Contextual integrity: information flows have context-dependent norms. A flow appropriate in one context becomes a violation in another. More precise than least privilege --- an MCP tool call can have the right permissions and still violate contextual integrity if the context has shifted. Candidate thinker for MCP Attack Surface article.
Type: book
Year: 2010
Publisher: Stanford UP
-
The Ethnography of Infrastructure
— Star, Susan Leigh
(1999)
[paper]
Infrastructure becomes invisible when it works; that concealment has costs. The alignment stack, tool-call layers, and MCP servers all become infrastructure the moment they function. Her concept is about a property of systems, not a domain --- applicable across multiple articles. STS tradition, empirical, highly translatable to security register.
The Ethnography of Infrastructure
Susan Leigh Star
Infrastructure becomes invisible when it works; that concealment has costs. The alignment stack, tool-call layers, and MCP servers all become infrastructure the moment they function. Her concept is about a property of systems, not a domain --- applicable across multiple articles. STS tradition, empirical, highly translatable to security register.
Type: paper
Year: 1999
-
Language, Thought, and Other Biological Categories: New Foundations for Realism
— Millikan, Ruth Garrett
(1984)
[book]
Teleosemantics: meaning comes from selection history (function), not resemblance (structure). Ontologies commit to function; vector embeddings encode structure. Explains why LLMs can generate descriptions but can't replace ontology design --- the functional commitment is what the ontology is. Candidate thinker for Ontologies for Developers. Analytic philosophy of science --- needs aggressive translation for practitioner audience.
Language, Thought, and Other Biological Categories: New Foundations for Realism
Ruth Garrett Millikan
Teleosemantics: meaning comes from selection history (function), not resemblance (structure). Ontologies commit to function; vector embeddings encode structure. Explains why LLMs can generate descriptions but can't replace ontology design --- the functional commitment is what the ontology is. Candidate thinker for Ontologies for Developers. Analytic philosophy of science --- needs aggressive translation for practitioner audience.
Type: book
Year: 1984
Publisher: MIT Press
-
Objectivity
— Daston, Lorraine and Galison, Peter
(2007)
[book]
How the concept of scientific objectivity has changed historically, always encoding the epistemic virtues of a particular era. 'Objectivity' is not one thing --- it's a succession of practices (truth-to-nature, mechanical objectivity, trained judgment) each reflecting different anxieties about human subjectivity. Relevant to any article about what 'accuracy' or 'reliability' means for LLMs, benchmark gaming, or the gap between evaluation metrics and actual reliability. Needs the right article to exist.
Objectivity
Lorraine Daston, Peter Galison
How the concept of scientific objectivity has changed historically, always encoding the epistemic virtues of a particular era. 'Objectivity' is not one thing --- it's a succession of practices (truth-to-nature, mechanical objectivity, trained judgment) each reflecting different anxieties about human subjectivity. Relevant to any article about what 'accuracy' or 'reliability' means for LLMs, benchmark gaming, or the gap between evaluation metrics and actual reliability. Needs the right article to exist.
Type: book
Year: 2007
Publisher: Zone Books
-
Trust and Antitrust
— Baier, Annette
(1986)
[paper]
Trust as a three-place relation involving vulnerability and goodwill, not just reliability. Relevant to when to trust LLM output --- the security engineer audience already thinks formally about trust through threat modeling. Lighter analytical edge than Suchman or Nissenbaum but conceptually clean.
Trust and Antitrust
Annette Baier
Trust as a three-place relation involving vulnerability and goodwill, not just reliability. Relevant to when to trust LLM output --- the security engineer audience already thinks formally about trust through threat modeling. Lighter analytical edge than Suchman or Nissenbaum but conceptually clean.
Type: paper
Year: 1986
-
Models and Analogies in Science
— Hesse, Mary
(1963)
[book]
How models work by analogy and what that means for their limits. Underread. If there's an article about why alignment metaphors matter --- why 'guardrails' vs 'constraints' vs 'filters' shape engineering decisions --- she's the thinker. No current article fits cleanly.
Models and Analogies in Science
Mary Hesse
How models work by analogy and what that means for their limits. Underread. If there's an article about why alignment metaphors matter --- why 'guardrails' vs 'constraints' vs 'filters' shape engineering decisions --- she's the thinker. No current article fits cleanly.
Type: book
Year: 1963
Publisher: Sheed and Ward
-
Epistemic Injustice: Power and the Ethics of Knowing
— Fricker, Miranda
(2007)
[book]
Testimonial injustice (credibility deficits based on identity) and hermeneutical injustice (gaps in collective interpretive resources). Models trained on text inherit both forms. Analytically specific and non-redundant but higher political charge than the other candidates. Personal blog register, not Sprocket.
Epistemic Injustice: Power and the Ethics of Knowing
Miranda Fricker
Testimonial injustice (credibility deficits based on identity) and hermeneutical injustice (gaps in collective interpretive resources). Models trained on text inherit both forms. Analytically specific and non-redundant but higher political charge than the other candidates. Personal blog register, not Sprocket.
Type: book
Year: 2007
Publisher: Oxford UP
-
§24 Learning Paths
0