Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
( task: str = None ) → Evaluator
Parameters
- task (str): The task defining which evaluator will be returned. Currently accepted tasks are:
  - "image-classification": will return an ImageClassificationEvaluator.
  - "question-answering": will return a QuestionAnsweringEvaluator.
  - "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
  - "token-classification": will return a TokenClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
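The task-to-evaluator lookup described above can be pictured as a small registry with an alias table. The following is only a sketch: the class names are taken from the task list in this section, while `SUPPORTED_TASKS`, `TASK_ALIASES` and `resolve_evaluator_name` are hypothetical names introduced for illustration and are not part of the library.

```python
# Hypothetical registry: the evaluator class names come from the documented task
# list, but this lookup table is illustrative, not the library's implementation.
SUPPORTED_TASKS = {
    "image-classification": "ImageClassificationEvaluator",
    "question-answering": "QuestionAnsweringEvaluator",
    "text-classification": "TextClassificationEvaluator",
    "token-classification": "TokenClassificationEvaluator",
}

# Aliases map an alternative task name onto its canonical task name.
TASK_ALIASES = {"sentiment-analysis": "text-classification"}


def resolve_evaluator_name(task):
    """Return the evaluator class name for a task name or one of its aliases."""
    canonical = TASK_ALIASES.get(task, task)
    if canonical not in SUPPORTED_TASKS:
        raise KeyError(f"Unknown task {task!r}; expected one of {sorted(SUPPORTED_TASKS)}")
    return SUPPORTED_TASKS[canonical]
```

Resolving the alias first keeps a single source of truth per task, so an alias never needs its own registry entry.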
The base class for all evaluator classes:
The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators. Base class implementing evaluator operations.
check_required_columns
( data: typing.Union[str, datasets.arrow_dataset.Dataset], columns_names: typing.Dict[str, str] )
Ensure the columns required for the evaluation are present in the dataset.
compute_metric
( metric: EvaluationModule, metric_inputs: typing.Dict, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, random_state: typing.Optional[int] = None )
Compute and return metrics.
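To make the difference between the 'simple' and 'bootstrap' strategies concrete, here is a minimal standard-library sketch of a bootstrap confidence interval around an accuracy score. It is an illustration only: it uses the plain percentile method, which is not necessarily the method scipy applies, and `accuracy` and `percentile_bootstrap` are hypothetical helpers rather than library functions.

```python
import random


def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def percentile_bootstrap(predictions, references, metric_fn,
                         confidence_level=0.95, n_resamples=999, random_state=None):
    """Percentile bootstrap: resample example indices with replacement,
    recompute the metric on each resample, and read the interval bounds
    off the sorted resampled scores."""
    rng = random.Random(random_state)
    n = len(references)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric_fn([predictions[i] for i in idx],
                                [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return {"score": metric_fn(predictions, references),
            "confidence_interval": (low, high)}


preds = [1, 0, 1, 1, 0, 1, 0, 0]
refs = [1, 0, 0, 1, 0, 1, 1, 0]
result = percentile_bootstrap(preds, refs, accuracy, n_resamples=200, random_state=0)
```

Fixing `random_state` makes the resampling reproducible, which is why that argument is handy for debugging.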
get_dataset_split
( data, subset = None, split = None ) → split
Infers which split to use if None is given.
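One plausible way to infer a split is to walk a preference list and take the first name the dataset actually provides. The sketch below assumes such a list; both the order in `PREFERRED_SPLIT_ORDER` and the helper `pick_split` are assumptions made for illustration, not the library's documented behavior.

```python
# Assumed preference order (illustrative): evaluation-oriented splits first,
# training splits last.
PREFERRED_SPLIT_ORDER = [
    "test", "testing", "eval", "evaluation",
    "validation", "val", "valid", "dev",
    "train", "training",
]


def pick_split(available_splits, split=None):
    """Return the user-provided split if given, else the first preferred split available."""
    if split is not None:
        return split
    for candidate in PREFERRED_SPLIT_ORDER:
        if candidate in available_splits:
            return candidate
    raise ValueError("No dataset split found; pass an explicit value for `split`.")
```

An explicit `split` (including a slice such as `test[:40]`) always wins, so inference only happens when the caller leaves it as None.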
load_data
( data: typing.Union[str, datasets.arrow_dataset.Dataset], subset: str = None, split: str = None ) → data (Dataset)
Parameters
- data (Dataset or str, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Specifies the dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
- split (str, defaults to None): User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is a str type, the best split is selected automatically via choose_split().
Returns
data (Dataset)
Loaded dataset which will be used for evaluation.
Load dataset with given subset and split.
A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
prepare_data
( data: Dataset, input_column: str, label_column: str, *args, **kwargs ) → dict
Parameters
- data (Dataset): Specifies the dataset we will run evaluation on.
- input_column (str, defaults to "text"): The name of the column containing the text feature in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
Returns
dict: metric inputs.
list: pipeline inputs.
Prepare data.
prepare_metric
( metric: typing.Union[str, evaluate.module.EvaluationModule] )
Prepare metric.
prepare_pipeline
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, device: int = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task. If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None): Can be used to overwrite the default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
Prepare pipeline.
The task specific evaluators
ImageClassificationEvaluator
class evaluate.ImageClassificationEvaluator
( task = 'image-classification', default_metric_name = None )
Image classification evaluator.
This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification.
Methods in this class assume a data format compatible with the ImageClassificationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'image', label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
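The `label_mapping` argument in the signature bridges the gap between the label strings a pipeline emits and the numeric labels stored in the dataset. A rough sketch of that step follows, assuming each example yields a list of candidate dicts with "label" and "score" keys; `map_predictions` is a hypothetical helper, not a library function.

```python
def map_predictions(pipeline_outputs, label_mapping=None):
    """Pick the top-scoring label per example and optionally map it to a number."""
    top_labels = [
        max(candidates, key=lambda c: c["score"])["label"]
        for candidates in pipeline_outputs
    ]
    if label_mapping is None:
        return top_labels
    return [label_mapping[label] for label in top_labels]


# Assumed output shape: one list of scored candidates per input example.
outputs = [
    [{"label": "healthy", "score": 0.9}, {"label": "bean_rust", "score": 0.1}],
    [{"label": "angular_leaf_spot", "score": 0.7}, {"label": "healthy", "score": 0.3}],
]
mapping = {"angular_leaf_spot": 0, "bean_rust": 1, "healthy": 2}
predictions = map_predictions(outputs, mapping)
```

After mapping, predictions and reference labels share one numeric space, which is what a metric such as accuracy needs.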
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
QuestionAnsweringEvaluator
class evaluate.QuestionAnsweringEvaluator
( task = 'question-answering', default_metric_name = None )
Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.
Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, question_column: str = 'question', context_column: str = 'context', id_column: str = 'id', label_column: str = 'answers', squad_v2_format: typing.Optional[bool] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
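For extractive question answering, the pipeline outputs have to be reshaped into the records a SQuAD-style metric consumes. The sketch below shows one way to do that, assuming the pipeline returns dicts with an "answer" key; `format_for_squad` is a hypothetical helper, and the `no_answer_probability` value it fills in is a placeholder heuristic rather than the library's logic.

```python
def format_for_squad(ids, pipeline_outputs, answers, squad_v2_format=False):
    """Build SQuAD-style predictions and references from raw pipeline outputs."""
    predictions = []
    for example_id, output in zip(ids, pipeline_outputs):
        prediction = {"id": example_id, "prediction_text": output["answer"]}
        if squad_v2_format:
            # Placeholder heuristic (assumption): an empty answer is treated
            # as a confident "no answer".
            prediction["no_answer_probability"] = 1.0 if output["answer"] == "" else 0.0
        predictions.append(prediction)
    references = [
        {"id": example_id, "answers": answer}
        for example_id, answer in zip(ids, answers)
    ]
    return {"predictions": predictions, "references": references}


ids = ["q1", "q2"]
outputs = [{"answer": "Paris", "score": 0.98}, {"answer": "", "score": 0.51}]
answers = [
    {"text": ["Paris"], "answer_start": [10]},
    {"text": [], "answer_start": []},
]
metric_inputs = format_for_squad(ids, outputs, answers, squad_v2_format=True)
```

Keeping the id on both sides lets the metric pair each prediction with its reference even if the order changes.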
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )
Datasets where the answer may be missing from the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
TextClassificationEvaluator
class evaluate.TextClassificationEvaluator
( task = 'text-classification', default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline: a single textual feature as input and a categorical label as output.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', second_input_column: typing.Optional[str] = None, label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias sentiment-analysis). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
TokenClassificationEvaluator
class evaluate.TokenClassificationEvaluator
( task = 'token-classification', default_metric_name = None )
Token classification evaluator.
This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.
Methods in this class assume a data format compatible with the TokenClassificationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: str = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: typing.Optional[int] = None, random_state: typing.Optional[int] = None, input_column: str = 'tokens', label_column: str = 'ner_tags', join_by: typing.Optional[str] = ' ' )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
For example, the following dataset format is accepted by the evaluator:
from datasets import ClassLabel, Dataset, Features, Sequence, Value

# One example: a list of words with one integer tag per word.
dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)
For example, the following dataset format is not accepted by the evaluator:
from datasets import Dataset, Features, Sequence, Value

# One example: a single string with character offsets for each entity.
dataset = Dataset.from_dict(
    mapping={
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
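The `join_by` argument in the compute signature suggests that the word list is joined into one string before it reaches the pipeline. Under that assumption, the following sketch shows how each word's character offset in the joined text can be tracked, so that predictions over the string could later be matched back to words. `join_with_offsets` is a hypothetical helper, not a library function.

```python
def join_with_offsets(tokens, join_by=" "):
    """Join words into one string and record the start offset of each word."""
    text = join_by.join(tokens)
    starts = []
    position = 0
    for token in tokens:
        starts.append(position)
        # The next word begins after this word plus one separator.
        position += len(token) + len(join_by)
    return text, starts


tokens = ["New", "York", "is", "a", "city"]
text, starts = join_with_offsets(tokens)
```

With the offsets in hand, `text[start:start + len(word)]` recovers each original word, which is the property any alignment step relies on.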
TextGenerationEvaluator
class evaluate.TextGenerationEvaluator
( task = 'text-generation', default_metric_name = None, predictions_prefix: str = 'generated' )
Text generation evaluator.
This text generation evaluator can currently be loaded from evaluator() using the default task name text-generation.
Methods in this class assume a data format compatible with the TextGenerationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Text2TextGenerationEvaluator
class evaluate.Text2TextGenerationEvaluator
( task = 'text2text-generation', default_metric_name = None )
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name text2text-generation.
Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
SummarizationEvaluator
class evaluate.SummarizationEvaluator
( task = 'summarization', default_metric_name = None )
Text summarization evaluator.
This text summarization evaluator can currently be loaded from evaluator() using the default task name summarization.
Methods in this class assume a data format compatible with the SummarizationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
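To make the "bootstrap" strategy more concrete, here is a minimal sketch of a percentile bootstrap over paired predictions and references. It is not the evaluator's implementation: the evaluator delegates to scipy's bootstrap, whose default interval method differs from the plain percentile method used here. The helper names bootstrap_confidence_interval and accuracy, and the toy data, are assumptions made for illustration only.

```python
import random


def accuracy(predictions, references):
    """Fraction of predictions that match their reference label."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def bootstrap_confidence_interval(predictions, references, metric_fn,
                                  confidence_level=0.95, n_resamples=9999,
                                  random_state=None):
    """Percentile bootstrap interval for metric_fn (hypothetical helper)."""
    rng = random.Random(random_state)  # seeded for reproducible resampling
    n = len(predictions)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric_fn([predictions[i] for i in idx],
                                [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high


# Toy data, invented for this sketch.
predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
references = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

point_estimate = accuracy(predictions, references)
ci_low, ci_high = bootstrap_confidence_interval(
    predictions, references, accuracy, n_resamples=999, random_state=0
)
```

Fixing random_state makes the resampling, and therefore the interval, repeatable across runs, which is why the parameter is described as useful for debugging.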
TranslationEvaluator
Translation evaluator.
This translation generation evaluator can currently be loaded from evaluator() using the default task name translation.
Methods in this class assume a data format compatible with the TranslationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric used by the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Overrides the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text") — The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
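The rules for the device parameter can be summarized in a few lines. The sketch below is only an illustration of the documented behavior, not code from the library: resolve_device and its cuda_available flag are hypothetical names, and it assumes that ordinal 0 refers to the first CUDA device.

```python
def resolve_device(device=None, cuda_available=False):
    """Map a pipeline device ordinal to a device name (hypothetical helper)."""
    if device is None:
        # Inferred: prefer the first CUDA device, otherwise fall back to CPU.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        # -1 explicitly selects the CPU.
        return "cpu"
    if device >= 0:
        # A non-negative ordinal selects the CUDA device with that ID.
        return f"cuda:{device}"
    raise ValueError(f"Unsupported device ordinal: {device}")


inferred_with_gpu = resolve_device(None, cuda_available=True)
inferred_without_gpu = resolve_device(None, cuda_available=False)
explicit_cpu = resolve_device(-1, cuda_available=True)
second_gpu = resolve_device(1, cuda_available=True)
```

Passing device explicitly is mainly useful when several GPUs are present or when evaluation should be forced onto the CPU for comparison.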
AutomaticSpeechRecognitionEvaluator
class evaluate.AutomaticSpeechRecognitionEvaluator
< source >( task = 'automatic-speech-recognition' default_metric_name = None )
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name automatic-speech-recognition.
Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'path' label_column: str = 'sentence' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric used by the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Overrides the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 runs the model on CPU; a non-negative integer runs the model on the CUDA device with that ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="openai/whisper-tiny.en",
...     data=data,
...     input_column="path",
...     label_column="sentence",
...     metric="wer",
... )
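For readers unfamiliar with the "wer" metric used in the example, the following is a rough sketch of word error rate for a single reference and hypothesis pair: the word-level edit distance divided by the number of reference words. It is a simplified stand-in written for illustration, not the metric implementation that the evaluator loads, and word_error_rate is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # substitution
        prev = curr
    return prev[-1] / len(ref)


perfect = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
one_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Lower values are better; a score of 0.0 means the transcription matches the reference word for word.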