FAR AI

non-profit

https://far.ai/

AlignmentResearch

AI & ML interests

Frontier alignment research to ensure the safe development and deployment of advanced AI systems.

Recent Activity

tigist-far published a dataset 1 day ago

AlignmentResearch/wild-deception-dataset

skar0 updated a dataset 5 days ago

AlignmentResearch/mbpp-honeypot-impossible-oneoff-sanitized

skar0 published a dataset 5 days ago

AlignmentResearch/mbpp-honeypot-impossible-oneoff-sanitized

Papers

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

published a dataset 1 day ago

AlignmentResearch/wild-deception-dataset

Updated 1 day ago • 6

updated a dataset 5 days ago

AlignmentResearch/mbpp-honeypot-impossible-oneoff-sanitized

Viewer • Updated 5 days ago • 395 • 38

published a dataset 5 days ago

AlignmentResearch/mbpp-honeypot-impossible-oneoff-sanitized

Viewer • Updated 5 days ago • 395 • 38

published a dataset 15 days ago

AlignmentResearch/mbpp-honeypot-impossible-oneoff

Viewer • Updated 15 days ago • 954 • 271

updated a dataset 15 days ago

AlignmentResearch/mbpp-honeypot-impossible-oneoff

Viewer • Updated 15 days ago • 954 • 271

updated a dataset about 1 month ago

AlignmentResearch/roleplay-base-examples

Viewer • Updated Apr 14 • 2.92k • 29

published a dataset about 1 month ago

AlignmentResearch/roleplay-base-examples

Viewer • Updated Apr 14 • 2.92k • 29

updated a dataset about 1 month ago

AlignmentResearch/model-self-knowledge-gemma27b

Viewer • Updated Apr 5 • 6.33k • 171

published a dataset about 1 month ago

AlignmentResearch/model-self-knowledge-gemma27b

Viewer • Updated Apr 5 • 6.33k • 171

submitted a paper to Daily Papers 3 months ago

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

Paper • 2602.14689 • Published Feb 16 • 1

authored a paper 10 months ago

Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less Local Than Assumed

Paper • 2507.16880 • Published Jul 22, 2025 • 7

authored 9 papers about 2 years ago

To Trust or Not To Trust Prediction Scores for Membership Inference Attacks

Paper • 2111.09076 • Published Nov 17, 2021 • 1

Plug & Play Attacks: Towards Robust and Flexible Model Inversion Attacks

Paper • 2201.12179 • Published Jan 28, 2022 • 1

Does CLIP Know My Face?

Paper • 2209.07341 • Published Sep 15, 2022 • 1

Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models

Paper • 2308.09490 • Published Aug 18, 2023 • 1

Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data

Paper • 2310.06372 • Published Oct 10, 2023 • 1

Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks

Paper • 2310.06549 • Published Oct 10, 2023 • 1

Defending Our Privacy With Backdoors

Paper • 2310.08320 • Published Oct 12, 2023 • 1

Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability

Paper • 2204.10598 • Published Apr 22, 2022 • 2

Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis

Paper • 2209.08891 • Published Sep 19, 2022 • 2