Model Evaluation and Benchmarks

Explains evaluation metrics and related terms, including accuracy, recall, F-score, BLEU, and benchmark leaderboards.

01 Model Evaluation
02 Evaluation Metrics
03 Accuracy
04 Precision
05 Recall
06 F-score / F-measure
07 ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
08 PR-AUC (Precision-Recall AUC)
09 Log Loss / Cross-Entropy Loss
10 MSE (Mean Squared Error)
11 RMSE (Root Mean Squared Error)
12 MAE (Mean Absolute Error)
13 R2 Score (R-squared / Coefficient of Determination)
14 Coefficient of Determination
15 Adjusted R2 (Adjusted R-squared)
16 MAPE (Mean Absolute Percentage Error)
17 Confusion Matrix
18 Classification Report
19 Macro Average / Micro Average
20 Weighted Average
21 BLEU (Bilingual Evaluation Understudy)
22 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
23 METEOR (Metric for Evaluation of Translation with Explicit ORdering)
24 CIDEr (Consensus-based Image Description Evaluation)
25 BERTScore
26 Human Evaluation
27 Elo Rating
28 Chatbot Arena
29 MMLU (Massive Multitask Language Understanding)
30 HellaSwag
31 HumanEval
32 GSM8K (Grade School Math 8K)
33 ARC (AI2 Reasoning Challenge)
34 TruthfulQA
35 MT-Bench
36 AlpacaEval
37 LMSYS (Large Model Systems Organization)
38 SuperGLUE (Super General Language Understanding Evaluation)
39 GLUE (General Language Understanding Evaluation)
40 ImageNet (benchmark)
41 COCO (Common Objects in Context)
42 SQuAD (Stanford Question Answering Dataset)
43 MLPerf
44 Benchmark Contamination
45 Leaderboard
46 Cross-Validation
47 Bootstrap Confidence Interval
48 Statistical Significance
49 Ablation Study
50 Hyperparameter Sensitivity Analysis
51 Model Comparison
52 Baseline
53 SOTA (State of the Art)
54 Overfitting Detection
55 Learning Curve Analysis