\documentclass[
12pt, % The default document font size, options: 10pt, 11pt, 12pt
%oneside, % Two side (alternating margins) for binding by default, uncomment to switch to one side
english, % ngerman for German
% singlespacing % Single line spacing, alternatives: onehalfspacing or
%onehalfspacing
doublespacing, % Double line spacing
%draft, % Uncomment to enable draft mode (no pictures, no links, overfull hboxes indicated)
nolistspacing, % If the document is onehalfspacing or doublespacing, uncomment this to set spacing in lists to single
%liststotoc, % Uncomment to add the list of figures/tables/etc to the table of contents
toctotoc, % Uncomment to add the main table of contents to the table of contents
parskip, % Uncomment to add space between paragraphs
%nohyperref, % Uncomment to not load the hyperref package
headsepline, % Uncomment to get a line under the header
%chapterinoneline, % Uncomment to place the chapter title next to the number on one line
%consistentlayout, % Uncomment to change the layout of the declaration, abstract and acknowledgements pages to match the default layout
]{MastersDoctoralThesis} % The class file specifying the document structure
%\usepackage[round]{natbib}
\usepackage{mathtools}
\usepackage{setspace}
\usepackage{dsfont}
\usepackage{amsfonts}
\usepackage{subcaption}
\usepackage{paralist}
%\usepackage{subfig}
\usepackage{times}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{tikz}
\usepackage{url}
\usepackage{pgfplotstable}
\usepackage{titlesec}
\usepackage{color}
\usepackage{lipsum,adjustbox}
\usepackage[font={small}]{caption}
\usetikzlibrary{positioning}
\usepackage{bbm}
%template additions
\usepackage[utf8]{inputenc} % Required for inputting international characters
\usepackage[T1]{fontenc} % Output font encoding for international characters
\usepackage{palatino} % Use the Palatino font by default
\usepackage[backend=bibtex,style=authoryear,natbib=true]{biblatex} % Use the bibtex backend with the authoryear citation style (which resembles APA)
\addbibresource{propose.bib} % The filename of the bibliography
\usepackage[autostyle=true]{csquotes} % Required to generate language-dependent quotes in the bibliography
\graphicspath{{./plots/}}
\newcommand{\com}[1]{}
%\newcommand{\oa}[1]{}
%\newcommand{\lc}[1]{}
\newcommand{\oa}[1]{\footnote{\color{red}OA: #1}}
\newcommand{\lc}[1]{\footnote{\color{blue}LC: #1}}
\newcommand{\newcite}[1]{\citeauthor{#1} (\citeyear{#1})}
\newcommand{\justcite}[1]{ (\cite{#1})}
\com{
\makeatletter
\newcommand{\@BIBLABEL}{\@emptybiblabel}
\newcommand{\@emptybiblabel}[1]{}
%\makeatother
\usepackage[hidelinks]{hyperref}
}
\newenvironment{myequation}{
\begin{equation}
}{
\end{equation}
}
\newenvironment{myequation*}{
\begin{equation*}
}{
\end{equation*}
}
%----------------------------------------------------------------------------------------
% MARGIN SETTINGS
%----------------------------------------------------------------------------------------
\geometry{
paper=a4paper, % Change to letterpaper for US letter
inner=2.5cm, % Inner margin
outer=3.8cm, % Outer margin
bindingoffset=.5cm, % Binding offset
top=1.5cm, % Top margin
bottom=1.5cm, % Bottom margin
%showframe, % Uncomment to show how the type block is set on the page
}
%----------------------------------------------------------------------------------------
% THESIS INFORMATION
%----------------------------------------------------------------------------------------
\thesistitle{Conservatism and Over-conservatism in Grammatical Error Correction} % Your thesis title, this is used in the title and abstract, print it elsewhere with \ttitle
\supervisor{Dr. Omri \textsc{Abend}} % Your supervisor's name, this is used in the title page, print it elsewhere with \supname
\examiner{} % Your examiner's name, this is not currently used anywhere in the template, print it elsewhere with \examname
\degree{Master's degree} % Your degree name, this is used in the title page and abstract, print it elsewhere with \degreename
\author{Leshem \textsc{Choshen}} % Your name, this is used in the title page and abstract, print it elsewhere with \authorname
\addresses{} % Your address, this is not currently used anywhere in the template, print it elsewhere with \addressname
\subject{Natural Language Processing} % Your subject area, this is not currently used anywhere in the template, print it elsewhere with \subjectname
\keywords{} % Keywords for your thesis, this is not currently used anywhere in the template, print it elsewhere with \keywordnames
\university{\href{http://new.huji.ac.il/en}{Hebrew University}} % Your university's name and URL, this is used in the title page and abstract, print it elsewhere with \univname
\department{\href{http://en.cognitive.huji.ac.il/}{Department of Cognitive Sciences}} % Your department's name and URL, this is used in the title page and abstract, print it elsewhere with \deptname
\group{The HUJI NLP Lab} % Your research group's name and URL, this is used in the title page, print it elsewhere with \groupname
\faculty{\href{http://www.hum.huji.ac.il/english/}{Faculty of Humanities}} % Your faculty's name and URL, this is used in the title page and abstract, print it elsewhere with \facname
\AtBeginDocument{
\hypersetup{pdftitle=\ttitle} % Set the PDF's title to your title
\hypersetup{pdfauthor=\authorname} % Set the PDF's author to your name
\hypersetup{pdfkeywords=\keywordnames} % Set the PDF's keywords to your keywords
}
\begin{document}
\frontmatter
\pagestyle{plain}
%\title{Thesis: Conservatism and Over-conservatism in Grammatical Error Correction}
%\author{
% Leshem Choshen\textsuperscript{1} and Omri Abend\textsuperscript{2} \\
% \textsuperscript{1}School of Computer Science and Engineering,
% \textsuperscript{2} Department of Cognitive Sciences \\
% The Hebrew University of Jerusalem \\
% \texttt{leshem.choshen@mail.huji.ac.il, oabend@cs.huji.ac.il}\\
%}
%----------------------------------------------------------------------------------------
% TITLE PAGE
%----------------------------------------------------------------------------------------
\begin{titlepage}
\begin{center}
\vspace*{.06\textheight}
{\scshape\LARGE \univname\par}\vspace{1.5cm} % University name
\textsc{\Large Master's Thesis}\\[0.5cm] % Thesis type
\HRule \\[0.4cm] % Horizontal line
{\LARGE \bfseries \ttitle\par}\vspace{0.4cm} % Thesis title
\HRule \\[1.5cm] % Horizontal line
\begin{minipage}[t]{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
\authorname % Author name - remove the \href bracket to remove the link
\end{flushleft}
\end{minipage}
\begin{minipage}[t]{0.4\textwidth}
\begin{flushright} \large
\emph{Supervisor:} \\
\href{http://www.cs.huji.ac.il/~oabend/}{\supname} % Supervisor name - remove the \href bracket to remove the link
\end{flushright}
\end{minipage}\\[0.5cm]
\vfill
\large \textit{A thesis submitted in fulfillment of the requirements\\ for the degree of \degreename} % University requirement text
\textit{in the}\\[0.4cm]
\groupname\\\deptname\\[1cm] % Research group name and department name
\vfill
{\large \today}\\[1cm] % Date
%\includegraphics{Logo} % University/department logo - uncomment to place it
\vfill
\end{center}
\end{titlepage}
%----------------------------------------------------------------------------------------
% DECLARATION PAGE
%----------------------------------------------------------------------------------------
\begin{declaration}
\addchaptertocentry{\authorshipname} % Add the declaration to the table of contents
\noindent I, \authorname, declare that this thesis titled, \enquote{\ttitle} and the work presented in it are my own. I confirm that:
\begin{itemize}
\item This work was done wholly or mainly while in candidature for a research degree at this University.
\item Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
\item Where I have consulted the published work of others, this is always clearly attributed.
\item Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
\item I have acknowledged all main sources of help.
\item Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.\\
\end{itemize}
\com{
\noindent Signed:\\
\rule[0.5em]{25em}{0.5pt} % This prints a line for the signature
\noindent Date:\\
\rule[0.5em]{25em}{0.5pt} % This prints a line to write the date
}
\end{declaration}
\cleardoublepage
\com{
%----------------------------------------------------------------------------------------
% QUOTATION PAGE
%----------------------------------------------------------------------------------------
\vspace*{0.2\textheight}
\noindent\enquote{\itshape Thanks to my solid academic training, today I can write hundreds of words on virtually any topic without possessing a shred of information, which is how I got a good job in journalism.}\bigbreak
\hfill Dave Barry
}
%----------------------------------------------------------------------------------------
% ABSTRACT PAGE
%----------------------------------------------------------------------------------------
\begin{abstract}
\addchaptertocentry{\abstractname} % Add the abstract to the table of contents
Grammatical Error Correction systems (henceforth, {\it correctors}) aim to correct ungrammatical text,
while changing it as little as possible. However, whereas such conservatism is a virtue for correctors,
we find that state-of-the-art systems make substantially fewer changes to the source sentences than needed.
Analyzing the distribution of possible corrections for a given sentence,
we show that this over-conservatism likely stems from
the inability of a handful of reference corrections to account for the full variation of valid
corrections for a given sentence. This results in undue penalization of valid corrections,
thus disincentivizing correctors to make changes.
We also show that simply increasing the number of references is unlikely to resolve this problem,
and conclude by presenting an alternative reference-less approach based on semantic similarity.
\end{abstract}
\com{
%----------------------------------------------------------------------------------------
% ACKNOWLEDGEMENTS
%----------------------------------------------------------------------------------------
\begin{acknowledgements}
\addchaptertocentry{\acknowledgementname} % Add the acknowledgements to the table of contents
The acknowledgments and the people to thank go here, don't forget to include your project advisor\ldots
\end{acknowledgements}
}
%----------------------------------------------------------------------------------------
% LIST OF CONTENTS/FIGURES/TABLES PAGES
%----------------------------------------------------------------------------------------
\tableofcontents % Prints the main table of contents
\listoffigures % Prints the list of figures
\listoftables % Prints the list of tables
\mainmatter % Begin numeric (1,2,3...) page numbering
\pagestyle{thesis}
\chapter{Introduction}
% Error correction
% evaluation in error correction and its centrality
% faithfulness to the source meaning is important, and this has been noted but prev work, and evaluation is geared towards it
% gap in evaluation: however, steps taken to ensure conservativeness in fact push towards formal conservativism by their definition (theoretical claim about the measure)
% this may result in systems that make few changes. indeed we find that this is the case (empirical claim about systems)
%
% we pursue two approaches to overcome this bias.
%
% 1. increasing the number of references. this has been proposed before and pursued with m=2, but no assessment of its sufficiency or its added value over m=1 has been made. In order to address this gap we first characterize the distribution of possible corrections for a sentence. We leverage this characterization to characterize the distribution of the scores as a function of $m$, and consequently assess the biases introduced by taking $m=1,2$ as in previous approaches.
% We find that taking these values of $m$ dramatically under-estimates the system scores.
% We back our analysis of these biases with an analysis of the variance of these estimators.
% We analyze the two commonly used scores: the M2 score, often used for evaluation, and the accuracy score, commonly used in training.
%
% 2. we note that in fact the important factor is semantic conservativism and explore means to directly assess how semantically conservative systems here through the use of semantic annotation.
% We use the UCCA scheme as a test case, motivated by HUME.
% First question: is it well-defined on learner language. it is.
% Second question: are corrections in fact semantically conservative? to show that, we need to verify that the corrections make few (if any) semantic changes. our results indicate that this is the case: we show that the corrections are similar in (UCCA) structure to the source.
%
% conclusion (not in intro): we tried to use semantic similarity to improve systems.
% this is difficult due to semantic conservatism. we expect this will be an issue once evaluation is improved.
% future work.
% also future work: use multiple references in training (did people do that?)
%
% sections:
% 1. Introduction
% 2. Formal conservativism in GEC
% 3. First approach: Multiple References
% 3.1. A Distribution of Corrections
% 3.2. Scores (M2, accuracy index, accuracy exact)
% 3.3. Data
% 3.4. Bias of the Scores (setup + results)
% 3.5. Variance of the Scores (setup + results)
% 4. Second approach: Semantic Similarity
% 4.1. Semantic Annotation of Learner Language (prev work)
% 4.2. UCCA Scheme (see HUME)
% 4.3. Similarity Measures (including prev work of elior)
% 4.4. Empirical Validation: IAA, semantic conservativism vs. gold std
% 5. Conclusion
%
% is a challenging research field, which interfaces with many
%other areas of linguistics and NLP. The field
Grammatical Error Correction (GEC) is receiving considerable
interest recently, notably through the HOO \justcite{dale2011helping,dale2012hoo} and
CoNLL shared tasks \justcite{kao2013conll,ng2014conll}.
Within GEC, considerable effort has been placed on evaluation
\justcite{tetreault2008native,madnani2011they,felice2015towards,napoles2015ground},
a notoriously difficult challenge, in part due to the many valid corrections each learner language (LL) sentence may
have \justcite{chodorow2012problems}.
An important criterion in the evaluation of correctors at training, validation and test time
is their ability to generate corrections that are faithful to the meaning of the source.
In fact, many would prefer a somewhat cumbersome or even an occasionally ungrammatical
correction over a correction that alters the meaning of the source \justcite{brockett2006correcting}.
As a result, often when compiling gold standard corrections for the task,
annotators are instructed to be conservative in their corrections
(e.g., in the Treebank of Learner English \justcite{nicholls2003cambridge}).
There have been various attempts to formally capture this precision/recall asymmetry, such as the standardized use of $F_{0.5}$ over $F_{1}$ \justcite{dahlmeier2012better} and the choice of weights in the I-measure \justcite{felice2015towards}.
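The effect of weighting precision over recall can be illustrated with a minimal numeric sketch (the precision/recall values below are hypothetical, chosen only to show the effect of $\beta$; the function name is illustrative):

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta < 1 weights
    precision more heavily (beta = 0.5 is the standard choice in GEC)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A conservative corrector: high precision, low recall.
conservative = f_beta(precision=0.6, recall=0.2, beta=0.5)  # ~0.429
# A bolder corrector: balanced precision and recall.
bold = f_beta(precision=0.4, recall=0.4, beta=0.5)          # 0.4
# Under F_0.5 the conservative system wins; under F_1 it would lose.
```

Under these assumed values, $F_{0.5}$ prefers the conservative system while $F_1$ would prefer the bolder one, which is exactly the asymmetry the measure is designed to encode.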
However, in development and training, penalizing over-correction more harshly than under-correction
may make correctors reluctant to
make any changes (henceforth, {\it over-conservatism}).
Using one or two reference corrections, a common practice in GEC,
compounds this problem, as correctors are not only harshly penalized for making incorrect changes,
but are often penalized for making {\bf correct} changes not found in the reference.
Indeed, we show that current state-of-the-art systems suffer from over-conservatism.
Evaluating the output of 12 recent correctors, we find that all of them
substantially under-predict corrections relative to the gold standard (\S \ref{sec:formal_conservatism}).
We first assess whether the undue penalization of valid corrections can be resolved by increasing the number
of references, which we denote with $M$ (\S \ref{sec:increase-reference}).
We start by estimating the number and frequency distribution of the valid corrections per sentence,
arriving at a mean estimate of over 1000 corrections for sentences of no more than 15 tokens.
We then consider two representative reference-based measures (henceforth, {\it RBMs}) for
assessing the validity of a proposed correction relative to a set of references
and characterize the distribution of their scores as a function of $M$.
Our results show that both measures substantially under-estimate the true performance of
the correctors. Moreover, they show that increasing $M$ only partially addresses
the incurred bias, as both RBMs approach saturation for $M$ values of 10--20,
indicating that a prohibitively large $M$ may be required for reliable estimation.
Our findings echo the results of \newcite{bryant2015far}, who study the effect of $M$
on evaluation with $F$-score, the most commonly used measure for GEC. Their work focused on
obtaining a more reliable estimate of correctors' performance, and proposed to do so
by normalizing the corrector's estimated performance with the performance of a human corrector. However, while such normalization may yield more realistic performance estimates,
it has no effect on the training and tuning of correctors.
We conclude by proposing an alternative reference-less semantic evaluation approach which assesses the extent to which
a correction faithfully represents the semantics of the source, by measuring the similarity of their semantic structures (\S \ref{sec:Semantics}).
This approach can be combined with a reference-less measure of grammaticality, based on automatic error detection, as
proposed by \newcite{napoles-sakaguchi-tetreault:2016:EMNLP2016}.
Our experiments support the feasibility of the proposed approach,
by showing that (1) semantic structural annotation can be consistently applied to LL, and (2) that the proposed measure is less prone to unduly penalize valid corrections.
%
%
%We define a measure, using the UCCA scheme \justcite{abend2013universal} as a
%test case, motivated by its recent use for machine translation
%evaluation \justcite{birch2016hume}.
%We annotate a section of the NUCLE parallel corpus \justcite{dahlmeier2013building},
%
%The two approaches address the insufficiency of using too few references from
%complementary angles. The first attempts to cover more of the probability
%mass of valid corrections by taking a larger $M$,
%while the second uses semantic instead of string similarity, in order
%to abstract away from some of the formal variation between different valid corrections.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Over-Conservativism in GEC Systems}\label{sec:formal_conservatism}
%The field of GEC has always strived for conservatism in its corrections, with the prominent example of using
%$F_{0.5}$ emphasizing precision over recall \justcite{ng2014conll}. We wish to highlight the problem that
%arises from pursuing this conservatism as done today.
%We wished to be conservative and achieved that; why shouldn't we rejoice just yet? Theoretically, we might be progressing towards not correcting at all, instead of progressing towards correcting more accurately.
%
%Manual analysis showed excessive formal conservatism and under correction.
%Albeit important, manual analysis is not enough and we aimed for generating some quantitative measures.
%
We demonstrate that current correctors
suffer from over-conservatism: they tend to make too few changes to the source.
\section{Notation}
We assume each source sentence $x$ has a set of valid corrections $Correct_x$,
and a discrete distribution $\mathcal{D}_x$ over them, where $P_{\mathcal{D}_x}(y)$
for $y \in Correct_x$ is the probability a human annotator would correct $x$ as $y$.
Let $X = \left\{x_{1},\ldots, x_N\right\}$ be the evaluated set of source LL sentences, where each $x_i$ is independently sampled from some distribution $\mathcal{L}$ over LL sentences, and denote $\mathcal{D}_{i}\coloneqq \mathcal{D}_{x_i}$.
Each $x_i$ is paired with $M$ corrections $Y_i = \left\{y_{i}^{1},\ldots, y_{i}^{M}\right\}$,
which are independently sampled from $\mathcal{D}_{i}$.\footnote{Our analysis assumes $M$
is fixed across source sentences. Generalizing the analysis to sentence-dependent $M$
values is straightforward.}
We define the {\it coverage} of $M$ references for a sentence $x_i$ to be
$P(y \in Y_i|y \in Correct_i)$ for $Y_i$ of size $M$, and $y$ sampled
according to $\mathcal{D}_i$.
A corrector $C$ is a function from LL sentences to proposed corrections (strings).
An assessment measure is a function from $X$, $Y$ and $C$ to
a real number. We use the term ``true measure'' to refer to the measure's output where the references include all possible corrections, i.e., $Y_i=Correct_i$ for every $i$.
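Under the i.i.d. sampling assumptions above, the expected coverage of $M$ references has a closed form: $\sum_{y \in Correct_x} P_{\mathcal{D}_x}(y)\left(1-\left(1-P_{\mathcal{D}_x}(y)\right)^M\right)$. The sketch below evaluates this formula; the Zipf-like distribution over 1,000 corrections is hypothetical, used only to illustrate a heavy-tailed $\mathcal{D}_x$:

```python
def coverage(dist, M):
    """Expected coverage of M i.i.d. references: the probability that a
    correction y ~ D_x also appears among M references drawn from D_x,
    i.e. sum over y of p(y) * (1 - (1 - p(y)) ** M)."""
    return sum(p * (1 - (1 - p) ** M) for p in dist.values())

# Hypothetical Zipf-like distribution over 1,000 valid corrections
# (echoing the mean estimate of over 1,000 corrections per sentence).
n = 1000
z = sum(1.0 / r for r in range(1, n + 1))
zipf = {r: (1.0 / r) / z for r in range(1, n + 1)}
```

Under such an assumed heavy-tailed distribution, coverage grows slowly with $M$: even ten references leave most of the probability mass of valid corrections uncovered.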
\subsection{Experimental Setup.}\label{par:experimental_setup}
We conduct all experiments on the NUCLE dataset,
a parallel corpus of LL essays and their corrected versions,
which is the de facto standard in GEC.
The corpus contains 1,414 essays in LL, each of about 500 words.
We evaluate all participating systems in the CoNLL 2014 shared task,
in addition to two of the best performing systems on this dataset.
The participating systems and their abbreviations are: Adam Mickiewicz University (AMU),
University of Cambridge (CAMB), Columbia University and the University of Illinois at Urbana-Champaign (CUUI),
Indian Institute of Technology, Bombay (IITB), Instituto Politecnico Nacional (IPN),
National Tsing Hua University (NTHU), Peking University (PKU), Pohang University of Science and Technology (POST),
Research Institute for Artificial Intelligence, Romanian Academy (RAC), Shanghai Jiao Tong University (SJTU),
University of Franche-Comt\'{e} (UFC), University of Macau (UMC), \newcite{rozovskaya2016grammatical} (RoRo), and \newcite{xie2016neural} (char).
All are trained and tested on the NUCLE corpus.
We compare the prevalence of changes made to the source by the correctors,
relative to their prevalence in the NUCLE reference.
In order to focus on the more substantial changes, we exclude from our evaluation
all non-alphanumeric characters, both within tokens and as tokens of their own.
%
%In order to have better evaluation of the real goal of corrections we also
%compute all of the measures on ,
%based on the Cambridge First Certificate in English
%(FCE) \justcite{yannakoudakis2011new},
%a new large parallel corpus containing only ungrammatical sentences of learners
%native of different languages.\oa{I didn't understand this sentence; how do we use this corpus?}
\subsection{Measures of Conservatism.}
We consider three types of divergences between the source and the reference.
First, we measure to what extent \emph{words} were changed: altered, deleted or added.
To do so, we compute word alignment between the source and the reference, casting it
as a weighted bipartite matching problem, between the source's words and the correction's.
Edge weights are assigned to be the edit distances
between the tokens.
We note that aligning words in GEC is much simpler than in machine translation,
as most of the words are kept unchanged, deleted fully, added, or changed slightly.
Following word alignment, we define the {\sc WordChange} measure
as the number of unaligned words and aligned words that were changed in any way.
Second, we quantify word \emph{order} differences using
Spearman's $\rho$ between the order of the words in the source sentence,
and the order of their corresponding words in the correction according to the word alignment.
$\rho=0$ where the word order is uncorrelated, and $\rho=1$ where the orders exactly match. We report the average $\rho$ over all source--correction sentence pairs.
Third, we report how many source sentences were split into multiple sentences and how many were concatenated with an adjacent sentence.
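The alignment-based measures can be sketched as follows. This is a brute-force illustration, not the implementation used in our experiments, and the function names are illustrative; the exhaustive `align` is feasible only for short sentences:

```python
import itertools

def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def align(source, correction):
    """Word alignment cast as minimum-cost bipartite matching with
    edit-distance edge weights; brute force over injective assignments."""
    flip = len(source) > len(correction)
    short, long_ = (correction, source) if flip else (source, correction)
    best_cost, best = float("inf"), None
    for perm in itertools.permutations(range(len(long_)), len(short)):
        cost = sum(edit_distance(short[i], long_[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    pairs = [(j, i) if flip else (i, j) for i, j in enumerate(best)]
    return sorted(pairs)  # (source index, correction index) pairs

def word_change(source, correction):
    """WordChange: unaligned words plus aligned words changed in any way."""
    pairs = align(source, correction)
    changed = sum(source[i] != correction[j] for i, j in pairs)
    unaligned = (len(source) - len(pairs)) + (len(correction) - len(pairs))
    return changed + unaligned

def spearman_rho(pairs):
    """Spearman's rho between source word order and the order of the
    aligned words in the correction (no ties: the alignment is injective)."""
    n = len(pairs)
    if n < 2:
        return 1.0
    cor = [j for _, j in pairs]          # pairs are sorted by source index
    rank = {j: r for r, j in enumerate(sorted(cor))}
    d2 = sum((i - rank[j]) ** 2 for i, (_, j) in enumerate(pairs))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

For example, aligning "He go to school" with "He went to school" yields the identity alignment, one changed word ({\sc WordChange} $= 1$), and $\rho = 1$.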
\begin{figure}
\centering
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{words_differences_heat}
\end{subfigure}
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{aligned}
\end{subfigure}
\begin{subfigure}[]{0.4\textwidth}
\includegraphics[width = \textwidth]{spearman_ecdf}
\end{subfigure}
\caption{The prevalence of changes}{\label{fig:over-conservatism}
The prevalence of changes of different types in correctors' output and in the NUCLE references.
The top figure presents the number of sentence pairs (heat) for each number of word changes
(x-axis; measured by {\sc WordChange}) for each of the different systems and the references (y-axis).
The middle figure presents the number of source sentences (y-axis) concatenated (right bars) or split (left bars) in the references (striped column) and in the correctors' output (colored columns).
The bottom figure presents the percentage of sentence pairs (y-axis) where the
Spearman $\rho$ values do not exceed a certain threshold (x-axis).
See \S \ref{par:experimental_setup} for a legend of the correctors.
The three figures show that under all measures, the gold standard references make
substantially more changes to the source sentences than any of the correctors,
in some cases an order of magnitude more.
}
\end{figure}
\subsection{Results.}
Figure \ref{fig:over-conservatism} presents the outcome of the three measures.
%In \ref{fig:split} the amount of sentences each corrector has done is presented. In \ref{fig:words_changed} the accumulated sum of sentences by the words changed in each sentence of each of the correctors is presented. In \ref{fig:rho} the cumulative probability distribution of rho values out of all the sentences.
Results show that the reference corrections make changes to considerably more source sentences than any of the correctors; within each changed sentence, they also change more words and make more word order changes, often an order of magnitude more. For example, the reference has 36 sentences with 6 word changes, whereas no corrector produces more than 5 such sentences.
For completeness, we measured the prevalence of changes in
another corpus, the Treebank of Learner English \justcite{yannakoudakis2011new},
and the results were similar to those obtained on NUCLE, if not more extreme. That they are somewhat more extreme is unsurprising, as the two corpora differ: the Treebank consists solely of ungrammatical sentences, whereas NUCLE consists of full paragraphs, of which $89.6\%$ of the sentences require corrections.
%While $89.6\%$ of NUCLE sentences need corrections,
%The prevalence of FCE consists only of ungrammatical sentences.
%As expected, FCE is a bit less conservative than NUCLE by our measures.
%
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Multi-Reference Measures}\label{sec:increase-reference}
%
In this section we argue that the observed over-conservatism of correctors likely stems
from their being developed to optimize RBMs that suffer from low coverage.
We begin with a motivating analysis of the relation between low coverage and over-conservatism (\S \ref{subsec:motivating_analysis}). We then continue with an empirical assessment of the distribution of corrections for a given sentence (\S \ref{subsec:corrections_distribution})
and the effect of $M$ on commonly used RBMs (\S \ref{subsec:Assessment-values}).
We discuss the implications of our results, concluding that adding references may only partially address over-conservatism (\S \ref{subsec:mult_discussion}).
%
\section{Motivating Analysis}\label{subsec:motivating_analysis}
%
The relation between coverage and over-conservatism requires some explanation.
We abstract away from the details of the training procedure, and assume that correctors attempt to maximize an objective function, over some training or development data.
Before moving on, it is important to note that the emphasis here is on evaluation for development, \emph{not} for testing. At test time, one only cares whether one system is significantly better than another; since nothing can be changed at that point, the test does not affect the behaviour of the model. This question received considerable attention in recent years, and some solutions can be found in the work of \newcite{bryant2015far}. In practice, however, there is hardly any true test set today: public test sets serve as benchmarks that determine which systems get published and which are further developed or abandoned. Such sets should therefore be regarded as development sets, and hence as suffering from the problems we describe below.
Assume the corrector is faced with a phrase that it predicts to be ungrammatical. Let $p_{detect}$ be the probability that this prediction is correct, and let $p_{correct}$ be the probability that the corrector is able to predict
a valid correction for this phrase (including correctly identifying it as erroneous).
Finally, assume that the corrector is evaluated
against $M$ references for which the coverage of the phrase is $p_{coverage}$,
namely the probability that
a valid correction will be found among $M$ randomly sampled references.
We now assume that the corrector either proposes the correction it finds most likely or makes no change at all. If it chooses not to correct, its probability of being rewarded (i.e., of its output being in the reference set $Y$) is $(1-p_{detect})$. Otherwise, its probability
of being rewarded is $p_{correct} \cdot p_{coverage}$.
In cases where
\begin{myequation}
\label{eq:reward}
p_{correct} \cdot p_{coverage} < 1-p_{detect}
\end{myequation}
a corrector is disincentivized from altering the phrase.
We expect Condition \ref{eq:reward} to frequently hold in cases that
require non-trivial changes, which are characterized both by low $p_{coverage}$ (as non-trivial
changes can often be made in numerous ways), and by lower expected performance by the corrector.
Moreover, asymmetric measures (e.g., $F_{0.5}$) penalize invalidly correcting more
harshly than not correcting an ungrammatical sentence.
In these cases, Condition \ref{eq:reward} should be rephrased as
\begin{myequation*}
p_{correct} \cdot p_{coverage} - \left(1-p_{correct}p_{coverage}\right) \alpha < 1-p_{detect}
\end{myequation*}
where $\alpha$ is the ratio between the penalty for introducing a wrong correction and the reward for a valid correction.
The condition is much more likely to hold in these cases.
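To make the condition concrete, the following sketch plugs in hypothetical values for $p_{detect}$, $p_{correct}$, $p_{coverage}$ and $\alpha$ (all illustrative, not measured):

```python
# A minimal numeric sketch of the reward condition: with plausible
# (hypothetical) values, a corrector maximizing expected reward is
# pushed not to correct.

def expected_reward_correct(p_correct, p_coverage, alpha):
    """Expected reward for proposing a correction: +1 if it lands in the
    reference set, -alpha otherwise."""
    hit = p_correct * p_coverage
    return hit - (1.0 - hit) * alpha

def expected_reward_keep(p_detect):
    """Expected reward for leaving the phrase unchanged: rewarded only
    when the detection was a false alarm."""
    return 1.0 - p_detect

# Hypothetical numbers for a non-trivial change: detection is fairly
# reliable, but coverage of a small reference set is low.
p_detect, p_correct, p_coverage, alpha = 0.8, 0.7, 0.4, 2.0

print(expected_reward_correct(p_correct, p_coverage, alpha))  # about -1.16
print(expected_reward_keep(p_detect))                         # about 0.2
# -1.16 < 0.2: the corrector is better off not correcting.
```

With low coverage and an asymmetric penalty, the expected reward for correcting falls far below that for keeping the phrase, which is exactly the disincentive the condition describes.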
This analysis shows that low coverage is a possible cause of over-conservatism. It remains to assess empirically to what extent over-conservatism is indeed caused by low coverage. Doing so requires varying the coverage (through $M$) that a system is tuned against, and measuring the resulting levels of conservatism. As the largest corpus we are aware of with a substantial number of references \justcite{bryant2015far} is the NUCLE test set, we simulate small-scale training in the following manner:
using 12 references and a 100-best list,\footnote{We thank \newcite{rozovskaya2016grammatical}, the only authors who sent us their 100-best list.} we select the candidates maximizing the $F$-score for different values of $M$.
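The oracle reranking can be sketched as follows; \texttt{sentence\_f\_score} below is a stand-in exact-match scorer, whereas the actual experiment scores candidates from a 100-best list with the $M^2$ $F$-score:

```python
# A schematic sketch (not the actual experiment code) of oracle
# reranking: from an n-best list, pick the candidate with the best
# score against the first M references.

def sentence_f_score(candidate, references):
    # Stand-in scorer: exact match against any reference. The real
    # experiment uses the M^2 F-score instead.
    return 1.0 if candidate in references else 0.0

def oracle_rerank(n_best, references, m):
    """Return the candidate maximizing the score against M references."""
    refs_m = references[:m]
    return max(n_best, key=lambda c: sentence_f_score(c, refs_m))

n_best = ["He go home .", "He goes home .", "He went home ."]
refs = ["He went home .", "He goes home ."]
print(oracle_rerank(n_best, refs, m=1))  # "He went home ."
```

As $M$ grows, more candidates from the n-best list can be rewarded, so the oracle is freer to pick a candidate that actually changes the source.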
\com{Ideally, in order to validate this empirically, we should re-train correctors using multiple references, and re-examine their conservatism. However, corpora annotated with more than one correction are scarce.
As a proof of concept, we simulate a re-ranking procedure over the corpora of \newcite{bryant2015far}, which provide additional 10 references for each of sentence in the NUCLE test set. In order to abstract away from implementation details and from artefacts that may result from the small dataset available, we explore an oracle re-ranking setting, where the correction with the best Micro F-score taken from the 100-best list of the RoRo state-of-the-art corrector (see Section 2) is selected. The F-score is computed with varying numbers of references (M).}
Our results show a consistent decrease in conservatism in word changes as coverage increases, and no significant change in word order. As we only rerank existing candidate sentences, sentence-level conservatism naturally remains unchanged.
\begin{figure}
\centering
\includegraphics[width=8cm]{words_differences_hist_reranking}
\caption{Conservatism after oracle reranking}{Number of sentences changed (y-axis) by the number of words changed per sentence (x-axis) after oracle reranking for different values of $M$ (column colors).
}
\end{figure}
\section{Data}
%
Our analysis assumes that we have a reliable estimate for the distribution of corrections
$\mathcal{D}_x$ for the source sentences we evaluate.
Our experiments in the following section are run on a random sample of 52 sentences with a
maximum length of 15 from the NUCLE test data.
Through the length restriction we avoid introducing too many independent
errors that may drastically increase the number of annotation variants (as every combination of corrections for these errors is possible), thus resulting in an unreliable estimate of $\mathcal{D}_x$.
Sentences with fewer than 6 words were discarded, as they were mostly the result of sentence segmentation errors.
Crowdsourcing has proven effective in GEC evaluation \justcite{madnani2011they,napoles2015ground} and in
related tasks such as machine translation \justcite{zaidan2011crowdsourcing,post2012constructing}. We thus
use crowdsourcing for obtaining a sample from $\mathcal{D}_x$. Specifically, for each of the 52 source
sentences, we elicited 50 corrections from Amazon Mechanical Turk workers.
%allowing for a reliable estimation of the distributions.
Aiming to judge grammaticality rather than fluency, we asked the workers to
correct only when necessary, and not for style.
Four sentences were discarded, as almost half the workers judged them to require no correction.
%
\section{Estimating The Distribution of Corrections}\label{subsec:corrections_distribution}
%
We begin by estimating $\mathcal{D}_x$ for each sentence, using the crowdsourced corrections.
We use {\sc UnseenEst} \justcite{zou2015quantifying}, a non-parametric algorithm that
estimates a multinomial distribution,
in which the individual values do not matter, only the distribution of probabilities
across values.%\footnote{\href{https://github.com/borgr/unseenest}{UnseenEst}}
{\sc UnseenEst} was originally developed for assessing how many
variants a gene might have, including undiscovered ones,
and their relative frequencies.
This is a similar setting to the one tackled here.
Our manual tests of {\sc UnseenEst} with small artificially created frequencies
showed satisfactory results.\footnote{All the data we collected, along with the estimated distributions, will be made publicly available.}
According to the {\sc UnseenEst} estimates, most source sentences have a rather small number of frequent corrections, alongside a large number of low-probability corrections that jointly account for the bulk of the probability mass.
%The estimated distributions tend to have steps, with many corrections with the same (low) frequency.
Table \ref{tab:corrections_dist} presents the mean numbers of different corrections with frequency at least
$\gamma$ (for different values of $\gamma$), and their total probability mass.
For instance, on average 74.34 corrections account for 75\% of the total probability mass of the corrections, each
occurring with a frequency of 0.1\% or higher.
\begin{table}[h!]
\centering
\singlespacing
\begin{tabular}{c|c|c|c|c|}
%\cline{2-5}
& \multicolumn{4}{c|}{Frequency Threshold ($\gamma$)}\\
%\cline{2-5}
& \multicolumn{1}{c}{0} & \multicolumn{1}{c}{0.001} & \multicolumn{1}{c}{0.01} & \multicolumn{1}{c|}{0.1}
\\
\hline
Variants & 1351.24 & 74.34 & 8.72 & 1.35
\\
Mass & 1 & 0.75 & 0.58 & 0.37\\
\hline
\end{tabular}
\caption{Corrections distribution - mass and number of variants}{\label{tab:corrections_dist}
Estimating the distribution of corrections $\mathcal{D}_x$.
The table presents the mean number of corrections per sentence with probability of more than
$\gamma$ (top row), as well as their total probability mass (bottom row).
}
\end{table}
The overwhelming number of rare corrections raises the question of whether they can be regarded as noise. To check this, we elicited 3 crowdsourced judgments for each proposed correction, asking whether it is a valid correction of the original sentence. We then computed the mean inter-annotator agreement (IAA) of the corrections, binned by their frequency in the empirical distribution of proposed corrections, expecting a steep drop in agreement for the less frequent corrections if most of them are noise. The results show this is not the case: even the rarest corrections have a high mean IAA of $0.78$.
\begin{figure}
\centering
\includegraphics[width=8cm]{IAA_confirmation_frequency}
\caption{IAA of corrections}{Mean IAA of corrections by their frequency in the empirical distribution.
} \label{fig:IAA}
\end{figure}
\section{Under-estimation as a function of M} \label{subsec:Assessment-values}
In the previous section we presented an empirical assessment of the distribution of corrections for a sentence. We now turn to estimating the resulting bias, namely the under-estimation by RBMs, for different values of $M$.
We discuss two similarity measures. One is the sentence-level accuracy
(also called ``Exact Match'') and the other is the GEC $F$-score.
\subsection{Sentence-level Accuracy.}
Sentence-level accuracy (also ``Exact Match'') is the percentage of corrections that
exactly match one of the references.
Accuracy is a basic, interpretable measure, used in GEC by, e.g., \newcite{rozovskaya2010annotating}.
It is closely related to the 0-1 loss function commonly used
for training statistical correctors \justcite{chodorow2012problems,rozovskaya2013joint}.
Formally, given test sentences $X=\{x_1,\ldots,x_N\}$,
their references $Y_1,\ldots,Y_N$, and a corrector $C$,
we define $C$'s accuracy to be
\begin{myequation}\label{eq:acc_def}
Acc\left(C;X,Y\right) = \frac{1}{N} \sum_{i=1}^N \mathds{1}_{C(x_i) \in Y_i}.
\end{myequation}
Note that $C$'s accuracy is in fact an estimate of $C$'s probability to produce
a valid correction for a sentence, or $C$'s {\it true accuracy}. Formally:
\begin{myequation*}
TrueAcc\left(C\right) = P_{x\sim{L}}\left(C\left(x\right)\in Correct_x\right).
\end{myequation*}
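As a minimal sketch, the accuracy defined in Equation \ref{eq:acc_def} amounts to counting exact matches against the reference sets (the toy sentences below are illustrative):

```python
# Exact-match accuracy: the fraction of system outputs found verbatim
# in the corresponding reference set.

def exact_match_accuracy(outputs, reference_sets):
    """outputs: list of corrected sentences; reference_sets: list of
    sets of reference corrections, aligned with outputs."""
    hits = sum(1 for out, refs in zip(outputs, reference_sets) if out in refs)
    return hits / len(outputs)

outputs = ["He goes home .", "She like cats ."]
refs = [{"He goes home .", "He went home ."}, {"She likes cats ."}]
print(exact_match_accuracy(outputs, refs))  # 0.5
```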
%
%We estimate $C$s quality by sampling a set of source sentences
%$x_1,\ldots,x_N \sim \mathcal{L}$, and evaluate the quality of $C(x_1),\ldots,C(x_N)$ relative
%to the source.
The bias of $Acc\left(C;X,Y\right)$ for a sample of $N$ sentences, each paired with $M$ references
is then
\begin{flalign}
&TrueAcc\left(C\right) - \mathbb{E}_{X,Y}\left[Acc\left(C;X,Y\right)\right] = &\nonumber\\
&TrueAcc\left(C\right) - P\left(C\left(x\right) \in Y\right) = &\nonumber\\
&P\left(C\left(x\right) \in Correct_x\right) \cdot
\label{eq:bias} \left(1 - P\left(C\left(x\right) \in Y \mid C\left(x\right) \in Correct_x\right) \right)&
\end{flalign}
We observe that the bias, denoted $b_M$, is not affected by $N$, only by $M$.
As $M$ grows, $Y$ approximates $Correct_x$ better, and $b_M$ tends to 0.
In order to gain insight into the evaluation measure and the GEC task
(and not the idiosyncrasies of specific systems), we consider an idealized learner,
which, when correct, produces a valid correction with the same
distribution as a human annotator (i.e., according to $\mathcal{D}_x$).
Formally, we assume that, if $C(x) \in Correct_x$ then $C(x) \sim \mathcal{D}_x$.
Hence the bias $b_M$ (Equation \ref{eq:bias}) can be re-written as
\begin{myequation*}
\centering
P(C(x) \in Correct_x) \cdot (1 - P_{Y \sim \mathcal{D}_x^M,\, y\sim \mathcal{D}_x}(y \in Y))
\end{myequation*}
We will henceforth assume that $C$ is perfect (i.e., its true accuracy $Pr\left(C(x) \in Correct_x\right)$ is 1).
Note that assuming any other value for $C$'s true accuracy
would simply scale $b_M$ by that accuracy.
Similarly, assuming only a fraction $p$ of the sentences require correction scales $b_M$ by $p$.
%
%Denote the bias of a perfect corrector with $b_M$. To recap:
%\begin{equation*}
% b_M = 1 - P_{x \sim L, Y \in \mathcal{D}_x^M, y \sim \mathcal{D}_x}\left(y \in Y\right)
%\end{equation*}
%
%We turn to estimating $b_M$ empirically. We note that $Acc(C;X,Y)$
%is a sum of Bernoulli variables (i.e., a Poisson Binomial distribution),
%with probabilities $p_i = P_{y \sim \mathcal{D}_i}\left(y \in Y_i\right)$.
We estimate $b_M$ empirically using its empirical mean on our experimental corpus:
\begin{myequation*}
\hat{b}_M = 1 - \frac{1}{N}\sum_{i=1}^N P_{Y \sim \mathcal{D}_i^M, y \sim \mathcal{D}_i}\left(y \in Y\right)
\end{myequation*}
Using the {\sc UnseenEst} estimations of $\mathcal{D}_i$, we can compute $\hat{b}_M$
for any size of $Y_i$ (value of $M$).
However, as this is highly computationally demanding, we estimate it using
sampling. Specifically, for every $M = 1,...,20$ and $x_i$, we sample $Y_i$ 1000 times
(with replacement), and estimate $P\left(y \in Y_i\right)$ as the covered probability mass
$P_{\mathcal{D}_i}\{y: y \in Y_i\}$.
We repeated all our experiments with $Y_i$ sampled without replacement,
in order to simulate a case where the reference corrections are collected by a single
annotator and are thus never repeated. We find similar trends, with a faster increase
in accuracy, reaching above $0.47$ with $M=10$.
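The sampling procedure can be sketched as follows, with a toy distribution in place of an {\sc UnseenEst} estimate:

```python
# A sketch of the sampling estimate of the covered mass: sample M
# references from the estimated distribution D_i and measure how much
# of D_i's probability mass the distinct sampled corrections cover.
# Averaging 1 - coverage over sentences and resamples estimates b_M.

import random

def covered_mass(dist, m, rng):
    """dist: {correction: probability}; sample M references with
    replacement and return the total mass of the distinct ones."""
    corrections = list(dist)
    weights = [dist[c] for c in corrections]
    sample = set(rng.choices(corrections, weights=weights, k=m))
    return sum(dist[c] for c in sample)

rng = random.Random(0)
# Toy distribution: one frequent correction and several rare ones.
dist = {"a": 0.5, "b": 0.1, "c": 0.1, "d": 0.1, "e": 0.1, "f": 0.1}
estimates = [covered_mass(dist, m=2, rng=rng) for _ in range(1000)]
# The analytic expectation is sum_c p_c * (1 - (1 - p_c)^2) = 0.47.
print(sum(estimates) / len(estimates))
```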
%
%The resulting estimates for $p_i$
%define the estimate for the distribution of $Acc(C;X,Y)$.
%Given a set of LL sentences $x_1,...,x_N$ and their corresponding references
%$Y_1,...,Y_N$, we define the coverage of the reference set $Y_i$ for the sentence $x_i$ to be
%
%\begin{equation*}
%Cov\left(x_i,Y_i\right)=.
%\end{equation*}
%
%In order to gain insight into the accuracy measure, we need to know something about the distribution from which the given corrector chooses valid corrections. As each corrector might have its own biases, the most appealing choice would be to evaluate a corrector in which this distribution is the same as the one from which corrections for the gold standard are being drawn from. Formally, if $C\left(x_i\right) \in Correct_i$ then $C\left(x_i\right) \sim \mathcal{D}_i$.
%
%Thus, the second term in Equation \ref{eq:correction-in-gs} is $p_i = \mathbb{E}_{Y_i}[Cov(x_i,Y_i)]$.
%Therefore $Acc(C;X,Y)$ is distributed as
%a Poisson Binomial random variable (divided by $N$), with probabilities $\{p_i \cdot CP\}_{i=1}^N$. \footnote{A Poisson Binomial random
%variable is a sum of Bernoulli variables with different success probabilities.} We also assume our corrector is always
%correct (so $CP=1$), but as noted earlier any other value for $CP$ would only scale the results by $CP$.
\begin{figure}
\centering
\includegraphics[width=8cm]{noSig_repeat_1000_accuracy}
\caption{Changes in accuracy with $M$}{Accuracy and Exact Index Match values for a perfect corrector (y-axis)
as a function of the number of references $M$ (x-axis).
%Each data point is paired with a confidence interval ($p=.95$).
} \label{fig:accuracy_vals}
\end{figure}
Figure \ref{fig:accuracy_vals} presents the expected accuracy values for our perfect
corrector (i.e., 1-$\hat{b}_M$) for different values of $M$.
Results show that even for values of $M$ which are much larger than those considered in the GEC literature (e.g., $M=20$),
the expected accuracy is only about 0.5. As $M$ increases, the contribution of each additional reference diminishes, to the point where it adds little to the accuracy (the slope is about 0.004 around $M=20$).
We also experiment with a more relaxed measure, {\it Exact Index Match}, which is only sensitive
to the identity of the changed words and not to what they were changed to.
Formally, two corrections $c$ and $c'$ over a source sentence $x$ match
if for their word alignments with the source (computed as above) $a:\{1,...,\left|x\right|\} \rightarrow \{1,...,\left|c\right|,Null\}$
and $a':\{1,...,\left|x\right|\} \rightarrow \{1,...,\left|c'\right|,Null\}$, it holds that $c_{a\left(i\right)} \neq x_{i}$ iff $c'_{a'\left(i\right)} \neq x_{i}$, where $c_{Null}=c'_{Null}$.
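A sketch of Exact Index Match, assuming the word alignments are given (the toy alignments below are hypothetical; in our experiments alignments are computed as above):

```python
# Exact Index Match under a simplifying assumption: the alignments a, a'
# from source positions to correction positions are given (None marks a
# deleted source word). Two corrections match when they change exactly
# the same set of source positions.

def changed_indices(source, correction, alignment):
    """alignment[i] is the position in `correction` aligned to source
    word i, or None for a deletion."""
    changed = set()
    for i, word in enumerate(source):
        j = alignment[i]
        if j is None or correction[j] != word:
            changed.add(i)
    return changed

def exact_index_match(source, c1, a1, c2, a2):
    return changed_indices(source, c1, a1) == changed_indices(source, c2, a2)

src = ["He", "go", "home"]
c1, a1 = ["He", "goes", "home"], [0, 1, 2]
c2, a2 = ["He", "went", "home"], [0, 1, 2]
print(exact_index_match(src, c1, a1, c2, a2))  # True: both change index 1
```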
Figure \ref{fig:accuracy_vals} also presents the expected accuracy in this case
for different values of $M$, indicating that while the scores of a perfect corrector are somewhat higher,
they still only reach 0.54 with $M=10$.
As Exact Index Match can be interpreted as an accuracy measure for error detection (rather than correction),
our results indicate that error detection systems may suffer from similar difficulties.
The analytic tools we have developed support the computation of the entire distribution of the accuracy,
and not only its expected value. From Equation \ref{eq:acc_def} we see that accuracy has a Poisson Binomial distribution (i.e., it is a sum of independent Bernoulli variables with different success probabilities), whose success probabilities are $P_{y,Y \sim \mathcal{D}_i}(y \in Y)$; these can be computed, as before, using {\sc UnseenEst}'s estimate of $\mathcal{D}_i$. Estimating the density function allows for a straightforward definition of significance tests for the measure, and can be performed efficiently \justcite{hong2013computing}.\footnote{An implementation of this method and the estimated density functions will be released publicly.}
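A minimal dynamic-programming sketch of the Poisson Binomial PMF (the method of \newcite{hong2013computing} is more efficient; the recursion below merely makes the computation explicit):

```python
# Accuracy is a (scaled) Poisson Binomial variable. Its exact PMF can
# be computed with a simple O(N^2) dynamic program over per-sentence
# success probabilities p_i = P(y in Y_i).

def poisson_binomial_pmf(probs):
    """pmf[k] = P(exactly k of the N Bernoulli trials succeed)."""
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)      # this trial fails
            new[k + 1] += mass * p        # this trial succeeds
        pmf = new
    return pmf

pmf = poisson_binomial_pmf([0.5, 0.3])
print(pmf)  # approximately [0.35, 0.5, 0.15]
```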
\subsection{$F$-Score.}
While accuracy is commonly used as a loss function for training GEC systems,
the $F_\alpha$ score is standard when reporting system performance (and consequently in hyper-parameter
tuning).
Computing $F$-score for GEC is not at all straightforward.
The score is computed in terms of {\it edit} matches between a correction and the references,
where edits are sub-strings of the source that are replaced in the correction/reference.
The HOO shared tasks used an earlier version of the $F$-score, which required that proposed corrections include edits explicitly.
Later on, relieving correctors of the need to produce edits, the $F$-score was redefined optimistically, maximizing
over all possible annotations that generate the correction from the source.\footnote{Since our crowdsourced corrections
do not include an explicit annotation of edits, we produce edits heuristically.}
$M^2$ \justcite{dahlmeier2012better} was designed to compute this $F$-score and is the standard evaluation measure for GEC.
The complexity of the measure prohibits an analytic approach, and
we instead use a bootstrapping approach to estimate the bias incurred
by not being able to exhaustively enumerate the set of valid corrections.
%In short, bootstrapping methods sample with repetition from the empirical
%distribution of the observed data to estimate properties (e.g. confidence-interval)
%of the statistic (e.g. $F$-score) over the distribution.
As with accuracy,
in order to avoid confounding our results with system-specific biases,
we assume the evaluated corrector is perfect and samples its corrections from the human distribution of corrections $\mathcal{D}_x$.
This experiment is very similar to that of \newcite{bryant2015far},
who also compared the $F$-score of a human correction against an increasing number of references, although they exclusively addressed the $F$-score and did not attempt to estimate the distribution of corrections as we do here.
Concretely, given a value for $M$ and for $N$, we uniformly sample from our experimental
corpus source sentences $x_1,...,x_N$, and $M$ corrections for each $Y_1,...,Y_N$ (with replacement).
Setting a realistic value for $N$ in our experiments is important
for obtaining comparable results to those obtained on the NUCLE corpus (see below),
as the expected value of $F$-score may depend on $N$ (unlike Accuracy, it is not additive).
In accordance with the NUCLE's test set,
we set $N=1312$ and assume that 136 of the sentences require no correction.
Including the latter is important for obtaining realistic results, as such sentences reduce the overall bias in proportion to their frequency in the corpus.
The bootstrapping procedure is carried out by the
accelerated bootstrap procedure \justcite{efron1987better}, with 1000 iterations.
We also report confidence intervals ($p=.95$), computed using the same procedure.\footnote{We
use Python scikits.bootstrap implementation.}
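As a sketch of the idea (using the plain percentile bootstrap rather than the accelerated variant we actually use):

```python
# A percentile-bootstrap confidence interval for a corpus-level
# statistic such as the F-score. The experiments use the accelerated
# (BCa) variant via scikits.bootstrap; this simpler version keeps the
# core idea visible.

import random

def bootstrap_ci(samples, statistic, iterations=1000, p=0.95, seed=0):
    rng = random.Random(seed)
    stats = []
    for _ in range(iterations):
        # Resample the corpus with replacement, same size as the original.
        resample = [rng.choice(samples) for _ in samples]
        stats.append(statistic(resample))
    stats.sort()
    lo = stats[int(((1 - p) / 2) * iterations)]
    hi = stats[int((1 - (1 - p) / 2) * iterations) - 1]
    return lo, hi

# Toy per-sentence scores; the statistic here is their mean.
scores = [0.2, 0.4, 0.4, 0.6, 0.8, 1.0]
lo, hi = bootstrap_ci(scores, lambda s: sum(s) / len(s))
print(lo, hi)
```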
%
%For each sentence which had at least one error according to the NUCLE gold standard
%we sample $M$ sentences uniformly from the
%gathered empirical data to replace it. We leave sentences that do not need
%corrections untouched. This results in reference texts accounting for the
%variability in different choices of corrections while approximating the reduction of variability
%of a big $N$ by consisting of $N$ sentences overall.
% our results
Figure \ref{fig:F_Ms} presents the results of this procedure, which
further indicate the insufficiency of commonly used $M$ values for training and development (1 or 2)
for obtaining a reliable estimation of a corrector's performance.
For instance, the $F_{0.5}$ score for our perfect corrector, whose true $F$-score is 1,
is only 0.42 with $M=2$.
Moreover, the saturation effect observed for accuracy is even more pronounced with our experiments on $F$-score.
Similar results were obtained by \newcite{bryant2015far}.
%
%\subsection{}
%While our experiments focus on the accuracy and $F$-score measures, we expect
%our results to generalize to other RBMs (see Section \ref{sec:prev_work}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Significance of Real-World Correctors}\label{sec:real_world}
The bootstrapping method for computing the significance of the $F$-score can also
be useful for assessing the significance of the differences in correctors' performance
reported in the literature.
We apply the bootstrapping protocol (\S \ref{subsec:Assessment-values})
to compute the confidence intervals of different correctors on the current NUCLE
test data ($M=2$).
\begin{figure}
\centering
\includegraphics[width=8cm]{$F_{0.5}$_Ms_significance}
\caption{Changes of $F_{0.5}$ with $M$}{
$F_{0.5}$ values for a perfect corrector (y-axis) as a function of the number of references $M$ (x-axis).
Each data point is paired with a confidence interval ($p=.95$).\label{fig:F_Ms}}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=8cm]{$F_{0.5}$_significance}
\caption{Evaluation of current systems}{$F_{0.5}$ values for different correctors, including confidence interval ($p=.95$).
The left-most column (``source'') presents the $F$-score of a corrector that does not make any
changes to the source sentences.
See \S \ref{par:experimental_setup} for a legend of the correctors.\label{fig:F_correctors}}
\end{figure}
Figure \ref{fig:F_correctors} shows our results, which present a mixed picture: some
of the differences between previously reported $F$-scores are indeed significant and some are not.
For example, the best performing corrector is significantly better than the second, but the latter
is not significantly better than the third and fourth.
%Nevertheless, it seems that $M=2$ value taken in NUCLE is sufficiently high
%to generally obtain statistically significant ranking of the different correctors.
\section{Discussion}\label{subsec:mult_discussion}
% -- we saw that we have dramatic under-estimation
% -- but is this a problem? for instance, in the last section we saw we can get statistical significance between systems by increasing
% N even with a low M
% -- balancing $N$ and $M$ is important;
% other people have looked at similar things (not to re-label or not to re-label). we stress that
% while for statistical significance increasing $N$ is sufficient, this would not solve this problem:
% -- low coverage entails other problems: it incentivizes systems not to correct, even if they can perfectly predict valid corrections.
% -- mathematical argument
% -- indeed, we see RoRo > Perfect and other systems are close to Perfect. it could be that they are better taylored to
% those corrections produced by the NUCLE annotators. however in section 2 we saw that they are not.
% we hypothesize that this is the reason.
%
Our empirical results show that the number of corrections needed for reliable reference-based measures may
be prohibitively large in practice.
Results suggest that there are hundreds of valid corrections with low probability, whose total probability mass
is substantial. RBMs such as accuracy and $F$-score thus show diminishing returns from increasing the value of $M$ over values of about 10.
%
%All these findings suggest that it is too costly to increase $M$ in development data to the extent asymmetric evaluation will not lead to over-conservatism. The exact index match analysis also suggests $M=1,2$ coverage is also low for detection. When detection is separate from correction this might result in over-conservatism.
%
%about a quarter of the probability mass of valid corrections (\S \ref{subsec:Assessment-values}).
%
%The factors controlling significance are two. Variation across sentences themselves (different $D_x$)
%which is reduced with $N$ and the variation across choices of corrections which might be reduced with
%either $M$ or $N$. One can rightly deduce that large $N$ is sufficient for variation, but it will not
%solve the other problems: under-estimation of true performance,
%over-conservatism, possible issues when training systems, and might be more costly than annotating
%a larger $M$ without acquiring more sentences and also annotating them.
%Choosing how to balance is dependent on the goals of the one collecting data, and affects over
%the mean value as well, as discussed in \ref{subsec:Assessment-values}. Thus, we bring supporting data and
%leave the decision to the reader. We just point out that such balancing questions have been studied in
%various fields such as genetics \justcite{ionita2010optimal}.\oa{}
Returning to Condition \ref{eq:reward} (\S \ref{subsec:motivating_analysis}), we find that the coverage
(which equals the accuracy depicted in Figure \ref{fig:accuracy_vals})
is lower than 0.5 on average for $M=2$ (for short sentences). For non-trivial
changes we expect it to be even lower, suggesting that Condition \ref{eq:reward} often
holds in practice, incentivizing over-conservatism.
Considering the $F$-score of the best performing systems in Figure \ref{fig:F_correctors}, and
comparing them to the $F$-score of a perfect corrector with $M=2$, we find that their scores are comparable,
where RoRo in fact surpasses a perfect corrector's $F$-score.
While it is possible that these correctors outperform the perfect corrector by learning how to
correct a sentence in the same way as one of the NUCLE annotators did, we view this possibility
as unlikely as our results (\S\ref{sec:formal_conservatism}) show that
the output of these systems considerably diverges from NUCLE's references.
A more likely possibility is that these systems' high performance relative to a perfect corrector's
is due to these correctors having learned to predict when not to correct.
Two recent RBMs have been proposed.
One is {\sc I-measure} \justcite{felice2015towards},
which introduces novel features to GEC evaluation, such as distinguishing
different quality levels of ungrammatical corrections (e.g., some improve the quality of
the source, while others degrade it), and restricting edits to single words
rather than phrases. The other is GLEU \justcite{napoles2015ground} (an adaptation of BLEU), which has
been shown to correlate well with human rankings. We expect our finding that RBMs substantially under-estimate the
performance of correctors to generalize to these RBMs as well, as they all
apply string similarity measures relative to a fairly small number of references.
These measures thus address orthogonal gaps in GEC evaluation from the ones presented here.
The proposal of \newcite{sakaguchi2016reassessing} to emphasize fluency over grammaticality
in reference corrections only compounds this problem, as it results in a larger number of valid corrections.
%Also, changing from grammatical corrections to fluency ones results in more possible corrections, and consequently a larger bias.
Finally, note that addressing under-estimation by comparing to
a human expected score (in our terms, a perfect corrector) with the same $M$ \justcite{bryant2015far} does not address over-conservatism, as it only
scales the original measure. Moreover, as seen above, a human correction's score
is not necessarily an upper bound, as an over-conservative corrector may surpass a perfect corrector in performance.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Semantic Faithfulness Measure}\label{sec:Semantics}
%
%
%Conservatism is considered an important trait for a corrector, reflected for example
%in the selection of $F_{0.5}$, which emphasizes precision over recall, as the
%standard evaluation measure in GEC.
%In the previous section we followed the common approach in GEC evaluation and evaluated
%
%The thought that stands behind such emphasis is that a user
%would be understanding towards errors he did, of which he is probably
%not even aware, not being corrected, but would not be so understanding
%of corrections altering what he meant to say, in a way he perceives as wrong.
%
In this section we propose a new measure that eschews the use
of reference corrections, instead measuring the semantic faithfulness of the proposed
correction to the source.
Concretely, we propose to measure the semantic similarity of the source and the proposed correction
through the graph similarity of their representations.
Such a measure has to be complemented with an
error detection procedure, as it only captures faithfulness, the extent to which
the meaning of the source is preserved in the correction,
and not its grammaticality.
See \newcite{napoles-sakaguchi-tetreault:2016:EMNLP2016}
for a proposal of a complementary measure based
on automatic error detection.
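As an illustration of the graph-similarity idea (a hypothetical labeled-edge F-score, not the exact UCCA-based measure; the edge triples below are made up):

```python
# Semantic faithfulness via graph similarity, sketched as an F1 over
# labeled edges (parent, label, child) of the source and correction
# graphs, in the spirit of labeled-edge F-scores for semantic graphs.

def edge_f1(edges_source, edges_correction):
    if not edges_source or not edges_correction:
        return 0.0
    overlap = len(edges_source & edges_correction)
    precision = overlap / len(edges_correction)
    recall = overlap / len(edges_source)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical edge sets: only the corrected word's edge differs.
src_edges = {("scene", "A", "he"), ("scene", "P", "go"), ("scene", "A", "home")}
cor_edges = {("scene", "A", "he"), ("scene", "P", "goes"), ("scene", "A", "home")}
print(edge_f1(src_edges, cor_edges))  # about 0.67
```

A high score indicates the correction preserved the source's predicate-argument structure; grammaticality must be checked separately, as the text notes.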
A similar approach was also taken in machine translation, after multiple references were shown to improve evaluation but to be costly to obtain \justcite{albrecht-hwa:2008:WMT,turian2006evaluation}. Several ways to capture semantics (adequacy) and achieve reference-less machine translation evaluation have been proposed, and shown to be at least as reliable as the reference-based ones \justcite{reeder2006measuring,albrecht2007regression,specia2009estimating,specia2010machine}. Perhaps most similar is the work of \newcite{banchs2015adequacy}, which combines a semantic and a grammatical measure.
As a test case, we use the UCCA scheme to define semantic structures \justcite{abend2013universal},
motivated by its recent use in semantic machine translation evaluation \justcite{birch2016hume}.
We conduct some experiments supporting the feasibility of our approach.
We show that semantic annotation can be consistently applied to LL,
through inter-annotator agreement (IAA) experiments and that a perfect corrector scores high on this measure.
%As semantic structures represent an abstraction over different realizations
%of a similar meaning, In fact, $M$ plays no role in this approach, as the measure
%is defined not relative to a refernce but relative to the source sentence.
%
%as LL consists of many grammatical mistakes that makes syntactic
%analysis ill-defined for the task. We show evidence that this is the case, by having
%two annotators annotate a sub-corpus from the NUCLE dataset, and by measuring their
%inter-annotator agreement.
%%%Second, we ask whether corrections for a sentence indeed need to be faithful to the source. We seek to answer this question by measuring
%the semantic similarity between the source and the reference. We show support for an affirmative answer to this question
%by annotating the references provided for the NUCLE dataset,
%and detecting high semantic similarity between the corresponding sentences on both sides.
%
%\section{Background}
%
%Reliable assessment by a gold standard might be hard to obtain (see
%\ref{sec:increase-reference}), and human annotation for each output
%is great \justcite{madnani2011they} but costly, especially considering the
%development process. Under these conditions,
%%given a reliable semantic annotation we can enhance the reliability of our assessment. A simple way to do it is to somehow account in the assessment score for semantic changes.
%Another, more ambitious way to do that might be to decouple the meaning
%from the structure. We propose a broad idea for a reduction from grammatical
%error detection and a comparable semantics annotation to grammatical
%error correction assessment. Let's assume we have both a reliable error
%detection tool and a good way to measure semantic changes. Then, we
%can transform assessment into a three-step process.
%Step one, detect errors in the original text; assess the number of needed corrections, and the percentage of them that were changed.
%Step two, assess how much the semantics changed.
% Give a negative score for changing semantics.
%Last step, use
%the error detection again to assess how many errors exist in the correction
%output, whether left uncorrected by the corrector or new errors introduced
%by the correction process itself.
%
%This assessment was partially inspired by the WAS evaluation scheme \justcite{chodorow2012problems};
%in short, it states that the assessment should account for five types,
%not only the True\textbackslash{}False Positive\textbackslash{}Negative
%but also for the case where the annotation calls for a correction,
%and the corrector did a correction, but one that is unlike the annotation's
%one. With the proposed assessment we can measure how many of the corrections
%were corrected correctly (First + Second), and how many errors do
%we have eventually (Third) and combine them to get something similar
%to the Precision Recall that is widely used. We can also account for
%the places where the error was detected and check if it was corrected
%in a way that makes it grammatical and did not change semantics, the
%fifth type. We do that without getting a human to confirm this is
%indeed a correction.
%
%This system would be even more informative than the current one, allowing assessment of
%exactly which subtask the corrector failed to perform, and answering questions
%like: was the corrector too conservative, making too few corrections?