OBLITERATUS OBLITERATE THE CHAINS THAT BIND YOU @codeKK pythonOpen Source Website

OBLITERATUS

Introduction: OBLITERATE THE CHAINS THAT BIND YOU

Tags:

O B L I T E R A T U S

Break the chains. Free the mind. Keep the brain.

Try it now on HuggingFace Spaces — runs on ZeroGPU, free daily quota with HF Pro. No setup, no install, just obliterate.

OBLITERATUS is the most advanced open-source toolkit for understanding and removing refusal behaviors from large language models — and every single run makes it smarter. It implements abliteration — a family of techniques that identify and surgically remove the internal representations responsible for content refusal, without retraining or fine-tuning. The result: a model that responds to all prompts without artificial gatekeeping, while preserving its core language capabilities.

But OBLITERATUS is more than a tool — it's a distributed research experiment. Every time you obliterate a model with telemetry enabled, your run contributes anonymous benchmark data to a growing, crowd-sourced dataset that powers the next generation of abliteration research. Refusal directions across architectures. Hardware-specific performance profiles. Method comparisons at scale no single lab could achieve. You're not just using a tool — you're co-authoring the science.

The toolkit provides a complete pipeline: from probing a model's hidden states to locate refusal directions, through multiple extraction strategies (PCA, mean-difference, sparse autoencoder decomposition, and whitened SVD), to the actual intervention — zeroing out or steering away from those directions at inference time. Every step is observable. You can visualize where refusal lives across layers, measure how entangled it is with general capabilities, and quantify the tradeoff between compliance and coherence before committing to any modification.

OBLITERATUS ships with a full Gradio-based interface on HuggingFace Spaces, so you don't need to write a single line of code to obliterate a model, benchmark it against baselines, or chat with the result side-by-side with the original. For researchers who want deeper control, the Python API exposes every intermediate artifact — activation tensors, direction vectors, cross-layer alignment matrices — so you can build on top of it or integrate it into your own evaluation harness.

We built this because we believe model behavior should be decided by the people who deploy them, not locked in at training time. Refusal mechanisms are blunt instruments — they block legitimate research, creative writing, and red-teaming alongside genuinely harmful content. By making these interventions transparent and reproducible, we hope to advance the community's understanding of how alignment actually works inside transformer architectures, and to give practitioners the tools to make informed decisions about their own models.

Built on published research from Arditi et al. (2024), Gabliteration (arXiv:2512.18901), grimjim's norm-preserving biprojection (2025), Turner et al. (2023), and Rimsky et al. (2024), OBLITERATUS implements precision liberation in a single command:

obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

Or zero commands — just open the Colab notebook and hit Run All.

Research Purpose & Responsible Use

OBLITERATUS is an alignment research tool. It exists to advance the scientific understanding of how safety behaviors are encoded in language model weights — specifically, the geometric structure of refusal representations in transformer activation space.

This is the same class of research as:

Arditi et al. (2024) — discovering that refusal is mediated by a single direction
HarmBench (Zou et al., 2024) — standardized evaluation of LLM safety
JailbreakBench — tracking adversarial robustness of safety training
Anthropic's red-teaming datasets — published for reproducible safety research

By making refusal removal transparent, reproducible, and scientifically rigorous, OBLITERATUS contributes to the broader understanding of how alignment actually works inside transformer architectures — knowledge that is essential for building better safety mechanisms.

Who this is for

Alignment researchers studying refusal geometry, safety robustness, and mechanistic interpretability
Red-teamers evaluating how post-training safety holds up against weight-level interventions
AI safety evaluators who need unrestricted baselines for benchmarking
Local-first practitioners who want full control over models running on their own hardware

Who this is NOT for

Anyone seeking to generate content that causes real-world harm to real people
Anyone without the technical understanding to use uncensored models responsibly

Models produced by OBLITERATUS have had safety guardrails surgically removed. You are solely responsible for how you use this tool and any models or content it produces.

What it does

OBLITERATUS does four things — and the community does the fifth (see Community-powered research below):

1. Map the chains — Ablation studies systematically knock out model components (layers, attention heads, FFN blocks, embedding dimensions) and measure what breaks. This reveals where the chains are anchored inside the transformer — which circuits enforce refusal vs. which circuits carry knowledge and reasoning.

2. Break the chains — Targeted obliteration extracts the refusal subspace from a model's weights using SVD decomposition, then surgically projects it out. The chains are removed; the mind is preserved. The model keeps its full abilities but loses the artificial compulsion to refuse. One click, six stages:

SUMMON  →  load model + tokenizer
PROBE   →  collect activations on restricted vs. unrestricted prompts
DISTILL →  extract refusal directions via SVD
EXCISE  →  surgically project out guardrail directions (norm-preserving)
VERIFY  →  perplexity + coherence checks — confirm capabilities are intact
REBIRTH →  save the liberated model with full metadata

3. Understand the geometry of the chains — 15 deep analysis modules go far beyond brute-force removal. They map the precise geometric structure of the guardrails: how many distinct refusal mechanisms exist, which layers enforce them, whether they're universal or model-specific, and how they'll try to self-repair after removal. Know your enemy; precision preserves capability. See Analysis modules below.

4. Let the analysis guide the liberation — The informed method closes the loop: analysis modules run during obliteration to auto-configure every decision. Which chains to target. How many directions to extract. Which layers are safe to modify vs. which are too entangled with capabilities. Whether the model will self-repair (the Ouroboros effect) and how many passes to compensate. Surgical precision — free the mind, keep the brain. See Analysis-informed pipeline below.

What makes OBLITERATUS unique

Several capabilities distinguish OBLITERATUS from existing public tools:

Capability	What it does	Why it matters
Concept Cone Geometry	Maps per-category guardrail directions with solid angle estimation	Reveals whether "refusal" is one mechanism or many — so you choose the right approach
Alignment Imprint Detection	Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry alone	Identifies the alignment training method to inform the optimal removal strategy
Cross-Model Universality Index	Measures whether guardrail directions generalize across models	Answers "can one set of directions work across models, or does each need its own?"
Defense Robustness Evaluation	Ouroboros effect quantification, safety-capability entanglement mapping	Predicts whether guardrails will self-repair after removal
Whitened SVD Extraction	Covariance-normalized direction extraction	Separates the guardrail signal from natural activation variance — cleaner extraction
Bias Term Projection	Removes guardrails from bias vectors, not just weights	Other tools miss refusal signal in biases — leaves refusal pathways partially active
True Iterative Refinement	Re-probes after each pass to catch rotated residual guardrails	Single-pass methods miss directions that rotate into adjacent subspaces
Analysis-Informed Pipeline	Analysis modules auto-configure obliteration strategy mid-pipeline	Closes the analysis-to-removal feedback loop automatically

Novel techniques (2025-2026)

OBLITERATUS implements several techniques that go beyond prior work:

Technique	Description	Reference
Expert-Granular Abliteration (EGA)	Decomposes refusal signals into per-expert components using router logits for MoE-aware surgery	Novel
CoT-Aware Ablation	Orthogonalizes refusal directions against reasoning-critical directions to preserve chain-of-thought	Novel
COSMIC Layer Selection	Selects layers where harmful/harmless representations have lowest cosine similarity (most separable)	arXiv:2506.00085, ACL 2025
Parametric Kernel Optimization	Bell-curve layer weighting with 7 global parameters via Optuna TPE search	Heretic-inspired
Refusal Direction Optimization (RDO)	Gradient-based refinement of SVD-extracted directions using a linear refusal probe	Wollschlager et al., ICML 2025
Float Direction Interpolation	Continuous SVD direction index via Gaussian-shaped weighting for smoother refusal removal	Novel
KL-Divergence Co-Optimization	Post-projection feedback loop that partially reverts over-projected layers if KL budget exceeded	Novel
Component-Specific Scaling	Separate attention vs MLP projection strengths (MLP layers are more sensitive)	Novel
LoRA-Based Reversible Ablation	Rank-1 LoRA adapters instead of permanent weight surgery, enabling reversible ablation	Novel
Activation Winsorization	Clamps activation vectors to percentile range before SVD to prevent outlier-dominated directions	Heretic-inspired
Multi-Direction Norm Preservation	Captures all weight norms once before projection and restores after all directions, avoiding reintroduction	Novel

Ways to use OBLITERATUS

There are six ways to use OBLITERATUS, from zero-code to full programmatic control. Pick whichever fits your workflow — and no matter which path you choose, turning on telemetry means your run contributes to the largest crowd-sourced abliteration study ever conducted. You're not just removing guardrails from a model; you're helping map the geometry of alignment across the entire open-source ecosystem.

1. HuggingFace Spaces (zero setup)

The fastest path — no installation, no GPU required on your end. Visit the live Space, pick a model, pick a method, click Obliterate. Telemetry is on by default on Spaces, so every click directly contributes to the community research dataset. You're doing science just by pressing the button. The UI has eight tabs:

Tab	What it does
Obliterate	One-click refusal removal with live progress, post-obliteration metrics (coherence, refusal rate, perplexity)
Benchmark	Compare methods (multi-method), compare models (multi-model), or run quick presets — with cross-layer heatmaps, angular drift, and refusal topology charts
Chat	Talk to your obliterated model in real-time, with adjustable generation parameters
A/B Compare	Chat with the original and obliterated model side-by-side to see exactly what changed
Strength Sweep	Vary the obliteration strength and see how coherence and refusal trade off
Export	Download your obliterated model or push it directly to HuggingFace Hub
Leaderboard	Community-aggregated results across models, methods, and hardware
About	Architecture docs, method explanations, and research references

2. Local web UI (your GPU, same interface)

The same Gradio interface as the Space, running on your own hardware with full GPU access:

pip install -e ".[spaces]"

# Launch with GPU auto-detection, system info, and model recommendations
obliteratus ui

# Or with options:
obliteratus ui --port 8080          # custom port
obliteratus ui --share              # generate a public share link
obliteratus ui --no-browser         # don't auto-open browser
obliteratus ui --auth user:pass     # add basic auth

# → opens http://localhost:7860 automatically

The obliteratus ui command adds a Rich terminal startup with GPU detection and hardware-appropriate model recommendations. You can also run python app.py directly (same thing the Space uses).

3. Google Colab (free GPU)

Pick a model from the dropdown, pick a method, hit Run All. Download the result or push straight to HuggingFace Hub. Works on the free T4 tier for models up to ~8B parameters.

4. CLI (headless, scriptable)

For automation, CI pipelines, or remote servers without a display:

pip install -e .

# Guided interactive mode — walks you through every option
obliteratus interactive

# Direct obliteration — one command, one model, done
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# With all options
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
    --method surgical \
    --output-dir ./liberated \
    --contribute --contribute-notes "A100 80GB, default prompts"

# Run a full ablation study from a YAML config
obliteratus run examples/gpt2_layer_ablation.yaml

# Browse available models by compute tier
obliteratus models
obliteratus models --tier small      # filter by VRAM requirement

# Browse ablation presets
obliteratus presets

# List available strategies
obliteratus strategies

# Inspect model architecture before abliterating
obliteratus info meta-llama/Llama-3.1-8B-Instruct

# Aggregate community results
obliteratus aggregate --format summary
obliteratus aggregate --format latex --metric refusal_rate --min-runs 3

5. Python API (full programmatic control)

For researchers who want to integrate OBLITERATUS into their own pipelines:

from obliteratus.abliterate import AbliterationPipeline

# Standard obliteration
pipeline = AbliterationPipeline(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    method="advanced",
    output_dir="abliterated",
    max_seq_length=512,  # optional: override tokenizer truncation length
)
result = pipeline.run()

# Access intermediate artifacts
directions = pipeline.refusal_directions    # {layer_idx: tensor}
strong_layers = pipeline._strong_layers     # layers with strongest refusal signal
metrics = pipeline._quality_metrics         # perplexity, coherence, refusal_rate, kl_divergence

For analysis-informed obliteration that auto-tunes every parameter:

from obliteratus.informed_pipeline import InformedAbliterationPipeline

pipeline = InformedAbliterationPipeline(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    output_dir="abliterated_informed",
)
output_path, report = pipeline.run_informed()

print(f"Detected alignment: {report.insights.detected_alignment_method}")
print(f"Auto-configured: {report.insights.recommended_n_directions} directions")
print(f"Ouroboros passes needed: {report.ouroboros_passes}")

6. YAML configs (reproducible studies)

For reproducible experiments that you can version-control and share:

model:
  name: meta-llama/Llama-3.1-8B-Instruct
  task: causal_lm
  dtype: float16
  device: cuda

dataset:
  name: wikitext
  subset: wikitext-2-raw-v1
  split: test
  text_column: text
  max_samples: 100

strategies:
  - name: layer_removal
  - name: head_pruning
  - name: ffn_ablation
  - name: embedding_ablation
    params:
      chunk_size: 48

metrics:
  - perplexity

batch_size: 4
max_length: 256
output_dir: results/my_run

obliteratus run my_study.yaml

Two intervention paradigms

OBLITERATUS supports both permanent and reversible liberation:

Weight projection (permanent)

Seven presets, escalating in thoroughness:

Method	Directions	Key Features	Best for
`basic`	1 (diff-in-means)	Fast baseline	Quick test, small models
`advanced`	4 (SVD)	Norm-preserving, bias projection, 2 passes	Default. Clean removal, minimal capability loss
`aggressive`	8 (SVD)	Whitened SVD, iterative refinement, 3 passes	Maximum guardrail removal
`surgical`	8 (SVD)	EGA, head surgery, SAE, layer-adaptive, MoE-aware	Precision MoE models
`optimized`	4 (SVD)	Bayesian auto-tuned, CoT-aware, KL co-optimized	Best quality with auto-tuning
`inverted`	8 (SVD)	Semantic refusal inversion (2x reflection)	Refusal inversion experiments
`nuclear`	8 (SVD)	All techniques + expert transplant + steering	Maximum force

Steering vectors (reversible, inference-time)

from obliteratus.analysis import SteeringVectorFactory, SteeringHookManager
from obliteratus.analysis.steering_vectors import SteeringConfig

# Create a steering vector from a refusal direction
vec = SteeringVectorFactory.from_refusal_direction(refusal_dir, alpha=-1.0)

# Or from contrastive activation pairs
vec = SteeringVectorFactory.from_contrastive_pairs(harmful_acts, harmless_acts)

# Apply at inference time — no weight modification
config = SteeringConfig(vectors=[vec], target_layers=[10, 11, 12, 13, 14, 15])
manager = SteeringHookManager()
manager.install(model, config)

# Generate with steering active
output = model.generate(input_ids)

# Remove steering — model is back to normal
manager.remove()

Based on Turner et al. (2023) and Rimsky et al. (2024). Advantages: reversible, tunable alpha, composable, non-destructive.

15 analysis modules

The research core of OBLITERATUS. Each module maps a different aspect of how the chains are forged — because precision liberation requires understanding the geometry before cutting:

Module	Question it answers	Based on
Cross-Layer Alignment	How does the refusal direction evolve across layers?	Novel
Refusal Logit Lens	At which layer does the model "decide" to refuse?	nostalgebraist (2020)
Whitened SVD	What are the principal refusal directions after whitening?	Novel
Activation Probing	How much refusal signal exists at each layer?	Arditi et al. (2024)
Defense Robustness	Will the guardrails try to self-repair? (Ouroboros effect)	Novel
Concept Cone Geometry	Is there one mechanism or many? Do different categories share guardrails?	Wollschlager et al. (2025)
Alignment Imprint Detection	Was this model trained with DPO, RLHF, CAI, or SFT?	Novel
Multi-Token Position	Where in the sequence does refusal signal concentrate?	Novel
Sparse Surgery	Which specific weight rows carry the most refusal?	Novel
Causal Tracing	Which components are causally necessary for refusal?	Meng et al. (2022) approx.
Residual Stream Decomposition	How much refusal comes from attention vs. MLP?	Elhage et al. (2021)
Linear Probing Classifiers	Can a learned classifier find refusal info the analytical direction misses?	Alain & Bengio (2017)
Cross-Model Transfer	Are guardrails universal or model-specific? (Universality Index)	Novel
Steering Vectors	Can we disable guardrails at inference time without touching weights?	Turner et al. (2023)
Evaluation Suite	Refusal rate, perplexity, coherence, KL divergence, CKA, effective rank	Multiple

from obliteratus.analysis import (
    CrossLayerAlignmentAnalyzer,
    RefusalLogitLens,
    WhitenedSVDExtractor,
    ActivationProbe,
    DefenseRobustnessEvaluator,
    ConceptConeAnalyzer,
    AlignmentImprintDetector,
    MultiTokenPositionAnalyzer,
    SparseDirectionSurgeon,
    CausalRefusalTracer,
    ResidualStreamDecomposer,
    LinearRefusalProbe,
    TransferAnalyzer,
    SteeringVectorFactory,
    SteeringHookManager,
)

Analysis-informed pipeline

The informed method is the key innovation: it closes the loop between understanding the chains and breaking them. Instead of brute-forcing liberation, the pipeline runs analysis modules during obliteration to achieve surgical precision at every stage:

SUMMON  →  load model
PROBE   →  collect activations
ANALYZE →  map the geometry of the chains before touching anything   ← NEW
DISTILL →  extract refusal directions with analysis-tuned params   ← IMPROVED
EXCISE  →  surgically break only the right chains                  ← IMPROVED
VERIFY  →  confirm removal + Ouroboros compensation if refusal resurfaces  ← IMPROVED
REBIRTH →  save with comprehensive analysis metadata

The ANALYZE stage runs 4 analysis modules and their outputs auto-configure everything downstream:

Analysis Module	What it detects	What it configures
Alignment Imprint	DPO vs RLHF vs CAI vs SFT	Regularization strength, projection aggressiveness
Concept Cone Geometry	Polyhedral vs linear refusal	Number of directions (1 for linear, up to 8 for polyhedral)
Cross-Layer Alignment	Direction clusters, persistence	Layer selection (cluster-aware instead of arbitrary top-k)
Defense Robustness	Self-repair risk, entanglement	Refinement passes, entanglement-gated layer skipping

After excision, the VERIFY stage detects the Ouroboros effect — if the chains try to reassemble, additional targeted passes automatically fire at the compensating layers. See Python API usage above for code examples.

Ablation strategies

Beyond targeted liberation, OBLITERATUS is a general-purpose ablation suite for mapping the internals of any transformer:

Strategy	What it does	Use case
`layer_removal`	Zero out entire transformer layers	Find which layers matter most
`head_pruning`	Zero out individual attention heads	Locate behavioral circuits
`ffn_ablation`	Zero out feed-forward blocks	Find where knowledge is stored
`embedding_ablation`	Zero out embedding dimension ranges	Analyze representation structure

Each strategy enumerates all possible ablations, applies them one at a time, measures the impact, and restores the model — giving you a complete map of where the chains are anchored vs. where the mind lives.

116 curated models across 5 tiers

OBLITERATUS ships with presets for 116 models organized by compute requirement:

Tier	VRAM	Example models
Tiny	CPU / <1 GB	GPT-2, TinyLlama 1.1B, Qwen2.5-0.5B, SmolLM2
Small	4-8 GB	Phi-2 2.7B, Gemma-2 2B, StableLM-2 1.6B
Medium	8-16 GB	Mistral 7B, Qwen2.5-7B, Gemma-2 9B, Phi-3.5
Large	24+ GB	LLaMA-3.1 8B, Qwen2.5-14B, Mistral 24B, DeepSeek-R1 distills
Frontier	Multi-GPU	DeepSeek-V3.2 685B, Qwen3-235B, GLM-4.7 355B

Includes pre-liberated variants (Dolphin, Hermes, WhiteRabbitNeo) for A/B comparison against their chained counterparts.

obliteratus models

Multi-GPU and remote execution

OBLITERATUS automatically shards models across multiple GPUs when they don't fit on a single card. It also supports remote execution over SSH, so you can run the pipeline on a GPU server from your laptop.

How model sharding works

When you have multiple GPUs, OBLITERATUS uses accelerate's device_map="auto" to split the model's layers across all available GPUs. This is naive pipeline parallelism — layers are distributed evenly, but only one GPU computes at a time as activations flow sequentially through the layer stack. The other GPUs hold their assigned layers in memory but are idle until their turn.

This means multi-GPU sharding is a memory solution, not a speed solution. It lets you run models that don't fit on one GPU, but it won't make small models run faster. In fact, more GPUs can be slower due to inter-GPU data transfer overhead at layer boundaries.

Selecting GPUs

Use --gpus to control which GPUs are used:

# Use all 8 GPUs (default)
obliteratus obliterate bigmodel/200B --gpus all

# Use only GPUs 0-3
obliteratus obliterate bigmodel/200B --gpus 0,1,2,3

# Use a specific pair
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct --gpus 2,5

This sets CUDA_VISIBLE_DEVICES before CUDA initializes. The model is then sharded across the selected GPUs.

Precision and quantization

The --dtype flag controls the precision of model weights, which directly determines how much VRAM you need. Lower precision means smaller memory footprint at the cost of some numerical fidelity:

Dtype	Bytes/param	7B model	70B model	405B model
`float32`	4	28 GB	280 GB	1620 GB
`float16` / `bfloat16`	2	14 GB	140 GB	810 GB
`int8` (via `--quantization bitsandbytes-8bit`)	1	7 GB	70 GB	405 GB
`int4` (via `--quantization bitsandbytes-4bit`)	0.5	3.5 GB	35 GB	203 GB

# Default: bfloat16
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct

# 8-bit quantization — fits on fewer GPUs
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
    --quantization bitsandbytes-8bit

# 4-bit quantization — Llama-405B on 4x A100-80GB
obliteratus obliterate meta-llama/Llama-3.1-405B-Instruct \
    --quantization bitsandbytes-4bit --dtype float16

Quantization roughly halves the GPU count at each step down. A 70B model that needs 3x A100-80GB in bf16 fits on 2 in int8 or 1 in int4.

GPU calculator

Not sure how many GPUs you need? The gpu-calc command estimates the minimum GPU count for any model, accounting for weight memory, activation overhead, and CUDA context:

# Auto-detect from HuggingFace model name
obliteratus gpu-calc meta-llama/Llama-3.1-70B-Instruct --gpu-mem 24

# Manual: specify params and precision
obliteratus gpu-calc --params 70 --dtype bfloat16 --gpu-mem 80

# MoE models: specify active params separately
obliteratus gpu-calc --params 117 --active-params 13 --dtype bfloat16 --gpu-mem 80

The calculator fetches the model config from HuggingFace to estimate parameter counts (including MoE expert structure), then shows a table of GPU configurations with headroom estimates. For MoE models, activation overhead is computed from the active parameter count rather than total parameters.

Pipeline parallel benchmarks

We benchmarked the full abliteration pipeline across varying numbers of A100-80GB GPUs on two large models.

GPT-OSS-120B (117B MoE, ~234 GB in bf16):

GPUs	Total time	VRAM/GPU	Notes
3	FAILED	~78 GB	Not enough headroom for activations; some layers offloaded to CPU as meta tensors, crashes during EXCISE
4	615s (10m15s)	~58 GB	Fastest. Fewest inter-GPU transfers. Snapshot auto-skipped (insufficient free VRAM)
5	763s (12m43s)	~47 GB	+24% slower than 4 GPUs
6	766s (12m46s)	~39 GB	+25% slower than 4 GPUs
8	633s (10m33s)	~29 GB	+3% slower than 4 GPUs. Ran CPU-side state dict snapshot (adds ~20s)

DeepSeek-R1-Distill-Llama-70B (70B dense, ~149 GB in bf16, 80 layers):

GPUs	Total time	VRAM/GPU	Notes
2	FAILED	~75 GB	Meta tensor crash — 149 GB model on 160 GB total VRAM leaves no activation headroom
3	536s (8m56s)	~50 GB	Fastest. Minimum viable GPU count for this model
4	626s (10m26s)	~37 GB	+17% slower than 3 GPUs
8	627s (10m27s)	~19 GB	+17% slower than 3 GPUs. No benefit over 4

Stage breakdown (approximately constant across GPU counts):

Stage	GPT-OSS-120B	DeepSeek-70B	Bottleneck
SUMMON (load)	~11s	~24s	Disk I/O (model cached locally)
PROBE (activations)	~20s	~20s	Forward passes through sharded model
DISTILL + EXCISE	~30s	~30s	SVD + weight projection (CPU-bound)
VERIFY	~210s	~270s	Forward passes on validation prompts
REBIRTH (save)	~350s	~194s	Writing model to disk (234 GB vs 141 GB)

Key findings:

Use the minimum number of GPUs that fits your model. Extra GPUs only add cross-device transfer overhead. 4 GPUs was faster than 8 for GPT-OSS-120B; 3 GPUs was fastest for DeepSeek-70B.
The pipeline is I/O-dominated for large models. VERIFY and REBIRTH together account for ~90% of wall time. The actual compute (PROBE, DISTILL, EXCISE) is fast regardless of GPU count.
Leave headroom. The model needs VRAM beyond just its parameter storage — activation tensors, KV cache, and intermediate computations during PROBE and VERIFY all consume memory. 3x A100-80GB (240 GB) was not enough for a 234 GB model; 2x A100-80GB (160 GB) was not enough for a 149 GB model.
Pipeline parallelism doesn't help compute-bound stages. Since only one GPU computes at a time, doubling GPUs doesn't halve PROBE or VERIFY time. It only enables fitting larger models.

When you actually need data parallelism

For models that fit on a single GPU with room to spare, the PROBE stage (which runs 1024 forward passes to collect activations) is the main computational bottleneck. Pipeline parallelism doesn't help here — it still processes one prompt at a time through the full layer stack.

True data parallelism (replicating the model and splitting prompts across GPUs) can speed up PROBE, but it requires enough VRAM to hold a full copy of the model on each GPU. An experimental pre-replicated data parallel implementation is available on the data-parallel-prereplication branch:

git checkout data-parallel-prereplication
obliteratus obliterate EleutherAI/pythia-12b --data-parallel

This deep-copies the model to each GPU once, then distributes prompt batches across replicas using a thread pool. Benchmarks on Pythia 12B (24 GB model, 8x A100-80GB):

Mode	PROBE time	Notes
Single GPU	7.1s	Baseline
Pre-replicated DP (8 GPUs)	7.7s	Near parity — PROBE is too fast at this scale for parallelism to help

Data parallelism becomes more valuable as the prompt count or model size increases relative to the per-forward-pass cost. For most models, the overhead of replication exceeds the time saved.

Remote execution over SSH

Run the full pipeline on a remote GPU node from your local machine. OBLITERATUS handles SSH connection, auto-installs itself on the remote if needed, streams logs in real time, and copies results back when done.

# Basic remote run
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
    --remote user@gpu-node

# With SSH key and custom options
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
    --remote root@10.0.0.5 \
    --ssh-key ~/.ssh/id_rsa \
    --ssh-port 2222 \
    --remote-dir /data/obliteratus \
    --remote-python python3.11

# Don't copy results back (keep on remote only)
obliteratus obliterate meta-llama/Llama-3.1-70B-Instruct \
    --remote user@gpu-node --no-sync

Remote execution also works with obliteratus run (YAML configs) and obliteratus tourney (method comparison). You can specify remote settings in YAML:

model:
  name: meta-llama/Llama-3.1-70B-Instruct
  dtype: float16

remote:
  host: gpu-node
  user: root
  ssh_key: ~/.ssh/id_rsa
  remote_dir: /tmp/obliteratus_run
  gpus: "0,1,2,3"     # select GPUs on the remote
  sync_results: true   # copy results back when done

The remote runner:

Tests SSH connectivity
Detects GPUs on the remote (nvidia-smi)
Installs obliteratus if not already present
Uploads config files if using obliteratus run
Runs the pipeline with real-time log streaming
Copies results back via SCP

Choosing the right setup

Scenario	Recommendation
Model fits on 1 GPU	Use 1 GPU. Adding more won't help and may slow things down.
Model almost fits on 1 GPU	Try `--quantization bitsandbytes-8bit` or `bitsandbytes-4bit` to reduce memory. Halving precision roughly halves VRAM.
Model fits on 1 GPU, PROBE is slow (many prompts)	Try `data-parallel-prereplication` branch. Only helps if model fits on each GPU with room for activations.
Model doesn't fit on 1 GPU	Use `--gpus` with the minimum number of GPUs that fits. Run `obliteratus gpu-calc` to find that number.
Model needs 4+ GPUs	Pipeline parallel via `device_map="auto"` is the only option. Expect I/O-dominated runtimes for very large models. Consider quantization first — int4 can cut the GPU count by 4x.
Not sure how many GPUs you need	Run `obliteratus gpu-calc <model> --gpu-mem <your_vram>` for an estimate.
No local GPUs	Use `--remote user@gpu-node` to run on a remote machine, or use HuggingFace Spaces / Colab.

10 study presets

Pre-configured ablation studies you can run out of the box:

Preset	Strategies	Samples	Purpose
`quick`	Layer + FFN	25	Fast sanity check
`full`	All 4	200	Complete component sweep
`attention`	Head pruning	100	Attention circuit analysis
`layers`	Layer + FFN	150	Layer importance ranking
`knowledge`	FFN + embedding	150	Knowledge localization
`pruning`	Head + FFN	200	Compression candidates
`embeddings`	Embedding	100	Representation structure
`jailbreak`	Layer + head + FFN	400	Refusal circuit localization
`guardrail`	All 4	300	Full safety ablation
`robustness`	All 4	500	Stress testing

obliteratus run examples/preset_quick.yaml

How it compares

Capability	OBLITERATUS	TransformerLens	Heretic	FailSpy abliterator	RepEng	SAELens
Refusal direction extraction	Diff-in-means + SVD + Whitened SVD	Manual via hooks	Diff-in-means	Diff-in-means	Diff-in-means	N/A
Weight projection methods	Basic + norm-preserving + regularized + bias	N/A	Bayesian-optimized kernel	Basic	N/A	N/A
Steering vectors	Yes (factory + hook manager)	N/A	N/A	N/A	Core feature	N/A
Concept geometry analysis	Yes (cones, solid angles, DSI)	N/A	N/A	N/A	N/A	N/A
Alignment method fingerprinting	Yes (DPO/RLHF/CAI/SFT)	N/A	N/A	N/A	N/A	N/A
Cross-model transfer analysis	Yes (Universality Index)	N/A	N/A	N/A	N/A	N/A
Defense robustness evaluation	Yes (Ouroboros effect)	N/A	N/A	N/A	N/A	N/A
Sparse autoencoders	N/A	Via SAELens	N/A	N/A	N/A	Core feature
Real causal tracing	Simulation-based	Real activation patching	N/A	N/A	N/A	N/A
Analysis-informed abliteration	Yes (closed-loop feedback)	N/A	N/A	N/A	N/A	N/A
Auto parameter optimization	Analysis-guided	N/A	Bayesian (Optuna)	N/A	N/A	N/A
Model compatibility	Any HuggingFace model	~50 architectures	16/16 tested	TransformerLens only	HuggingFace	TransformerLens
Test suite	837 tests	Community	Unknown	None	Minimal	Moderate

Community-powered research — every run advances the science

This is where OBLITERATUS gets truly unprecedented: it's a crowd-sourced research platform disguised as a tool. Every obliteration run generates valuable scientific data — refusal direction geometries, cross-layer alignment signatures, hardware performance profiles, method effectiveness scores. With telemetry enabled, that data flows into a community dataset that no single research lab could build alone.

Here's why this matters: The biggest open question in abliteration research is universality — do refusal mechanisms work the same way across architectures, training methods, and model scales? Answering that requires thousands of runs across hundreds of models on diverse hardware. That's exactly what this community is building, one obliteration at a time.

Telemetry: opt-in, anonymous, research-first

Enable telemetry and your runs automatically contribute to the shared dataset. On HuggingFace Spaces it's on by default — every person who clicks "Obliterate" on the Space is advancing the research without lifting a finger. Locally, opt in with a single flag:

# Every run with --contribute feeds the community dataset
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced \
    --contribute --contribute-notes "A100, default prompts"

# Or set it globally — every run you do from now on contributes
export OBLITERATUS_TELEMETRY=1

What gets collected: model name, method, aggregate benchmark scores (refusal rate, perplexity, coherence, KL divergence), hardware info, and timestamps. What never gets collected: prompts, outputs, IP addresses, user identity, or anything that could trace back to you. The full schema is in obliteratus/telemetry.py — read every line, we have nothing to hide.

The community leaderboard

All those crowd-sourced runs feed the Leaderboard tab on the HuggingFace Space — a live, community-aggregated ranking of models, methods, and configurations. See what works best on which architectures. Spot patterns across model families. Find the optimal method before you even start your own run. This is collective intelligence applied to mechanistic interpretability.

# View what the community has discovered so far
obliteratus aggregate --format summary

# Generate paper-ready LaTeX tables from community data
obliteratus aggregate --format latex --metric refusal_rate --min-runs 3

Local contributions (PR-based)

Prefer to keep things fully local? Save structured results as JSON and submit them via pull request:

from obliteratus import save_contribution, load_contributions, aggregate_results
from obliteratus.abliterate import AbliterationPipeline

pipeline = AbliterationPipeline(model_name="meta-llama/Llama-3.1-8B-Instruct", method="advanced")
pipeline.run()

# Save contribution locally
save_contribution(pipeline, model_name="meta-llama/Llama-3.1-8B-Instruct",
                  notes="A100, default prompts")

# Aggregate all contributions into paper tables
records = load_contributions("community_results")
aggregated = aggregate_results(records)

Whether you contribute via telemetry or PR, you're helping build the most comprehensive cross-hardware, cross-model, cross-method abliteration dataset ever assembled. This is open science at scale — and you're part of it.

Web dashboard

Open docs/index.html in your browser for a visual interface with:

Step-by-step config builder with hardware auto-detection
Full model registry browser (filterable by tier)
Results visualizer — upload your results.json and get charts
Analysis modules reference with interactive pipeline demo
Strategy explainers and architecture documentation

Architecture support

Works with any HuggingFace transformer, including: GPT-2, LLaMA, Mistral, Falcon, OPT, BLOOM, Phi, Qwen, Gemma, StableLM, and more. Handles both Conv1D and Linear projections, standard and fused attention, and custom architectures via trust_remote_code.

References

Arditi et al. (2024). Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717
Gülmez, G. (2026). Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models. arXiv:2512.18901
grimjim (2025). Norm-Preserving Biprojected Abliteration. HuggingFace
Turner et al. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248
Rimsky et al. (2024). Steering Llama 2 via Contrastive Activation Addition. arXiv:2312.06681
Meng et al. (2022). Locating and Editing Factual Associations in GPT. arXiv:2202.05262
Alain & Bengio (2017). Understanding Intermediate Layers Using Linear Classifiers.
Elhage et al. (2021). A Mathematical Framework for Transformer Circuits. Anthropic
Wollschlager et al. (2025). Geometry of Concepts in LLMs. arXiv:2502.17420

Citing

If you use OBLITERATUS in your research, please cite:

@software{obliteratus2026,
  title     = {OBLITERATUS: An Open Platform for Analysis-Informed
               Refusal Removal in Large Language Models},
  author    = {{OBLITERATUS Contributors}},
  year      = {2026},
  url       = {https://github.com/elder-plinius/OBLITERATUS},
  note      = {15 analysis modules, 837 tests}
}

Testing

pip install -e ".[dev]"
pytest

837 tests across 28 test files covering CLI, all analysis modules, abliteration pipeline, architecture detection, visualization sanitization, community contributions, edge cases, and evaluation metrics.

License

Dual-licensed:

Open source — GNU Affero General Public License v3.0 (AGPL-3.0). You can freely use, modify, and distribute OBLITERATUS under AGPL terms. If you run a modified version as a network service (SaaS), you must release your source code to users under the same license.
Commercial — Organizations that cannot comply with AGPL obligations (e.g., proprietary SaaS, closed-source products, internal tools where source disclosure is not possible) can purchase a commercial license. Contact us via GitHub Issues for pricing and terms.

This is the same dual-licensing model used by MongoDB, Qt, Grafana, and others.

Disclaimer

This software is released strictly for research, red-teaming, safety evaluation, mechanistic interpretability, and local experimentation. It is a research tool — not a product, not a service, not a weapon.

OBLITERATUS enables the surgical removal of safety guardrails from language models. Models modified with this tool will generate content that the original models would refuse. This is its documented, intended purpose: to enable the scientific study of how refusal behaviors are encoded in transformer weights and how robust current alignment techniques are against post-training modification.

By using this software, you acknowledge that:

You are responsible for all models you produce and all content they generate. You must ensure your use complies with applicable laws in your jurisdiction.
This tool should not be used to produce models intended to cause real-world harm to real people, including but not limited to: harassment, fraud, non-consensual intimate imagery, or content that exploits minors.
No warranty is provided. This software and any models it produces are provided "as-is" without any guarantees of fitness for any purpose.
The authors and contributors are not liable for any outputs produced by models modified with this tool, or any downstream use of those models.

The release of alignment research tools and adversarial evaluation frameworks is standard practice in the AI safety community. Comparable open research artifacts include HarmBench, AdvBench, JailbreakBench, Anthropic's red-teaming datasets, and the abliterator toolkit.

Every obliteration is a data point. Every data point advances the research. Every researcher who contributes makes the next obliteration more precise. This is how open science wins — not by locking knowledge behind lab doors, but by turning every user into a collaborator. Break the chains. Free the mind. Keep the brain. Advance the science.

Made with <3 by Pliny the Prompter

Apps

Android Developer Tools

Android Developer Tools Pro

About Me

Tools: TimeShining

GitHub: Trinea

Facebook: Dev Tools

AI Daily Digest

Daily AI News & Insights

JSON Format, Support error correction

MD5/SHA Encode, Support batch

Text Process

CSS Format and Compress