Abstract
Many methods for statistical data editing can be found in the literature, but only a few of them are based on robust estimates (for example BACON-EEM, epidemic algorithms (EA) and the transformed rank correlation (TRC) methods of Béguin and Hulliger). However, we can show that outlier detection is only reasonable if robust methods are applied, because the classical estimates are themselves influenced by the outliers. Nevertheless, data editing is essential to check the multivariate data for possible data problems, and it is not deterministic like traditional micro editing, where all records are extensively edited manually using certain rules/constraints. The presence of missing values is more the rule than the exception in business surveys and poses additional severe challenges to outlier detection. First we review the available multivariate outlier detection methods which can cope with incomplete data. In a simulation study, where a subset of the Austrian Structural Business Statistics is simulated, we compare several approaches. Robust methods based on the Minimum Covariance Determinant (MCD) estimator, S-estimators and the OGK estimator, as well as BACON-BEM, provide the best results in finding the outliers and in providing a low false discovery rate. Many of the discussed methods are implemented in the R package rrcovNA, which is available from the Comprehensive R Archive Network (CRAN) at http://www.CRAN.R-project.org under the GNU General Public License.
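The claim that classical estimates are themselves influenced by the outliers can be illustrated with a minimal univariate sketch. This is not one of the multivariate methods compared in the paper (MCD, S-estimators, OGK, BACON) and it does not use rrcovNA. It only contrasts a mean/standard-deviation rule with a median/MAD rule on made-up toy values. The function names, the cutoff of 3 and the data are assumptions chosen for illustration.

```python
import statistics

# Toy data (invented for illustration): eight ordinary values and two gross outliers.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 50.0, 55.0]


def classical_outliers(xs, cutoff=3.0):
    """Indices whose distance from the mean exceeds cutoff standard deviations."""
    center = statistics.mean(xs)
    scale = statistics.stdev(xs)
    return [i for i, x in enumerate(xs) if abs(x - center) / scale > cutoff]


def robust_outliers(xs, cutoff=3.0):
    """Indices whose distance from the median exceeds cutoff scaled MADs."""
    center = statistics.median(xs)
    mad = statistics.median(abs(x - center) for x in xs)
    # 1.4826 makes the MAD consistent with the standard deviation under normality.
    scale = 1.4826 * mad
    return [i for i, x in enumerate(xs) if abs(x - center) / scale > cutoff]


if __name__ == "__main__":
    print("classical rule flags:", classical_outliers(values))
    print("robust rule flags:   ", robust_outliers(values))
```

On this toy sample the two large values inflate the mean and the standard deviation so much that the classical rule flags nothing (masking), whereas the median/MAD rule flags both of them. The multivariate robust estimators discussed in the abstract address the same problem for distances computed from a location vector and a covariance matrix.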