Sailor2 Evaluation

community

Activity Feed

sailor2-eval's activity

Xalphinions

updated a dataset about 11 hours ago

sailor2-eval/Flores-Plus-Evaluation-Log

binwang

authored 7 papers 7 days ago

Knowledge Graph Embedding: An Overview

Paper • 2309.12501 • Published Sep 21, 2023

CRAFT: Extracting and Tuning Cultural Instructions from the Wild

Paper • 2405.03138 • Published May 6, 2024

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Paper • 2406.10118 • Published Jun 14, 2024 • 31

kunato

authored 3 papers 14 days ago

CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models

Paper • 2405.13684 • Published May 22, 2024

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

Paper • 2409.10999 • Published Sep 17, 2024

Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models

Paper • 2412.13702 • Published 19 days ago

binwang

authored a paper 26 days ago

Chimera: Improving Generalist Model with Domain-Specific Experts

Paper • 2412.05983 • Published 29 days ago • 9

gabrielchua

posted an update about 1 month ago

Sharing my first paper!

Large Language Models (LLMs) are powerful, but they're prone to off-topic misuse, where users push them beyond their intended scope. Think harmful prompts, jailbreaks, and other out-of-scope requests. So how do we build better guardrails?

Traditional guardrails rely on curated examples or classifiers. The problem?
⚠️ High false-positive rates
⚠️ Poor adaptability to new misuse types
⚠️ Reliance on real-world data, which is often unavailable during pre-production

Our method skips the need for real-world misuse examples. Instead, we:
1️⃣ Define the problem space qualitatively
2️⃣ Use an LLM to generate synthetic misuse prompts
3️⃣ Train and test guardrails on this dataset
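
The three steps above can be sketched roughly as follows. This is only an illustration, not the paper's pipeline: the problem-space text, the prompt template, the `generate` callable, and `stub_generate` are all hypothetical stand-ins, and a real setup would plug in an actual LLM client and a real classifier.

```python
import random
from typing import Callable

# Step 1 (assumed wording): a qualitative description of the problem space.
PROBLEM_SPACE = (
    "A system prompt defines what an assistant is for. "
    "An off-topic user prompt asks for something outside that purpose."
)

def build_generation_prompt(system_prompt: str, label: str) -> str:
    """Compose the instruction sent to a generator LLM (hypothetical template)."""
    return (
        f"{PROBLEM_SPACE}\n"
        f"System prompt: {system_prompt}\n"
        f"Write one {label} user prompt for this system prompt."
    )

def make_dataset(system_prompts, generate: Callable[[str], str]):
    """Step 2: ask the generator for one on-topic and one off-topic prompt each."""
    rows = []
    for system_prompt in system_prompts:
        for label in ("on-topic", "off-topic"):
            user_prompt = generate(build_generation_prompt(system_prompt, label))
            rows.append(
                {"system_prompt": system_prompt, "user_prompt": user_prompt, "label": label}
            )
    return rows

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Step 3 (first half): shuffle deterministically and hold out a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def stub_generate(prompt: str) -> str:
    """Placeholder for a real LLM call; it just echoes the last instruction line."""
    return "placeholder for: " + prompt.splitlines()[-1]

rows = make_dataset(
    ["You are a banking assistant.", "You are a travel planner."], stub_generate
)
train_rows, test_rows = train_test_split(rows)
```

Passing `generate` as a parameter keeps the sketch independent of any particular model provider; the guardrail classifier would then be trained on `train_rows` and evaluated on `test_rows`.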

We apply this to the off-topic prompt detection problem, and fine-tune simple bi- and cross-encoder classifiers that outperform heuristics based on cosine similarity or prompt engineering.
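
For context, a cosine-similarity heuristic of the kind the classifiers are compared against might look like the sketch below. It is an assumption-laden toy: `embed` uses bag-of-words counts in place of a real sentence-embedding model, and the `0.1` threshold is arbitrary rather than taken from the paper.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_off_topic(system_prompt: str, user_prompt: str, threshold: float = 0.1) -> bool:
    """Flag the user prompt when it is too dissimilar to the system prompt.

    The threshold is a hypothetical value chosen for this example only.
    """
    return cosine_similarity(embed(system_prompt), embed(user_prompt)) < threshold

system_prompt = "You answer questions about bank accounts and credit cards."
on_topic = "How do I close my bank accounts?"
off_topic = "Write a poem celebrating dragons."

on_topic_flag = is_off_topic(system_prompt, on_topic)
off_topic_flag = is_off_topic(system_prompt, off_topic)
```

A single fixed threshold like this is brittle, since relevant prompts can share few words with the system prompt, which is one reason a fine-tuned classifier can do better than the heuristic.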

Additionally, framing the problem as prompt relevance allows these fine-tuned classifiers to generalise to other risk categories (e.g., jailbreak, toxic prompts).

Through this work, we also open-source our dataset (2M examples, ~50M+ tokens) and models.

paper: A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection (2411.12946)

artifacts: govtech/off-topic-guardrail-673838a62e4c661f248e81a4