from datetime import datetime
import pytz
ABOUT_TEXT = """
HREF is an evaluation benchmark that measures language models' capacity to follow human instructions. It consists of 4,258 instructions covering 11 distinct categories, including Brainstorm, Open QA, Closed QA, Extract, Generation, Rewrite, Summarize, Coding, Classify, Fact Checking or Attributed QA, Multi-Document Synthesis, and Reasoning Over Numerical Data.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64dff1ddb5cc372803af964d/dSv3U11h936t_q-aiqbkV.png)
## Why HREF
| Benchmark | Size | Evaluation Method | Baseline Model | Judge Model | Task Oriented | Contamination Resistant | Contains Human Reference |
|--------------------|-------|------------|----------------|----------------|----------|------------|-----------|
| MT-Bench | 80 | Score | --- | gpt4 | ✓ | ✗ | ✗ |
| AlpacaEval 2.0 | 805 | PWC | gpt4-turbo | gpt4-turbo | ✗ | ✗ | ✗ |
| Chatbot Arena | --- | PWC | --- | Human | ✗ | ✓ | ✗ |
| Arena-Hard | 500 | PWC | gpt4-0314 | gpt4-turbo | ✗ | ✗ | ✗ |
| WildBench | 1,024 | Score/PWC | gpt4-turbo | three models | ✗ | ✗ | ✗ |
| **HREF** | 4,258 | PWC | Llama-3.1-405B-Instruct | Llama-3.1-70B-Instruct | ✓ | ✓ | ✓ |
- **Human Reference**: HREF uses human-written answers as references to provide more reliable evaluation than previous methods.
- **Large**: HREF has the largest evaluation set among similar benchmarks, making its evaluation more reliable.
- **Contamination-resistant**: HREF's evaluation set is hidden, and both its baseline model and its judge model are public models, which makes it completely free of contamination.
- **Task Oriented**: Instead of instructions collected naturally from users, HREF contains instructions written specifically to target 8 distinct categories used in instruction tuning, which allows it to provide more insight into how to improve language models.
"""
# Get Pacific time zone (handles PST/PDT automatically)
pacific_tz = pytz.timezone('America/Los_Angeles')
current_time = datetime.now(pacific_tz).strftime("%H:%M %Z, %d %b %Y")
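# The doubled braces `{{}}` survive the f-string as a literal `{}` placeholder,
# presumably filled in later (e.g. via str.format) with the model count.
TOP_TEXT = f"""# HREF: Human Reference Guided Evaluation for Instruction Following
[Code]() | [Validation Set]() | [Human Agreement Set]() | [Results]() | [Paper]() | Total models: {{}} | * Unverified models | ⚠️ Dataset Contamination | Last restart (Pacific Time): {current_time}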
"""