Data Cleaning Strategies in a Multivariate Research Report (Assessment)
In multivariate quantitative research there is always a margin for error that is unrelated to the design of the analysis, its conduct, or the varied strategies implemented to prevent mistakes. Such errors slip past these safeguards and have the potential to obscure the outcomes and lead to incorrect conclusions (Cramer, 2016). Although it is impossible to eliminate every error that could potentially affect a study, there are many data-cleaning strategies that can correct these errors or minimize their negative effect. The objective of this paper is to describe several potential data-cleaning methods and strategies, highlight the most common error types, review error deletion and correction rates, and identify differences in research outcomes, using a multivariate research article as an example.
Article Summary: Multivariate Statistical Analysis of Cigarette Design Feature Influence on ISO TNCO Yields
The study I chose to analyze in this paper is a multivariate statistical analysis of cigarette designs and their effects on reducing the presence of various harmful components in the smoke that enters a person's lungs. The researchers chose a multivariate analysis design because of the number of variables they had to include in this research. These variables include not only the quality and type of tobacco used in over 50 domestic US brands, but also the rod length, filter length, circumference, overlap, draw resistance, pressure drop, and filter tip ventilation (Agnew-Heard et al., 2016). The researchers analyzed their data in several steps. The data were first transformed into a set of uncorrelated principal components before the regression analysis was performed. These components were then analyzed with a K-means clustering algorithm, which made it possible to classify the cigarettes into groups and find potential similarities in the data. The final stage of the multivariate analysis was the partial least squares method, which determined the relationship between the original ISO TNCO yields and the nine design parameters established previously.
The research found that only three components out of nine accounted for 65% of the variability in the TNCO values for each presented cluster. All three components featured a strong correlation among themselves, with coefficients between 0.5 and 0.9 (Agnew-Heard et al., 2016). Despite these three components having a more substantial influence than the rest, none of the nine tested components showed a clear dominance over the others, because the results in every cluster varied differently from the others.
Data Cleaning as a Process
The difference between data cleaning and error-prevention strategies lies in the fact that the former eliminates data-related problems after they have occurred, while the latter tries to prevent errors from happening in the first place. A standard data-cleaning technique for multivariate data involves a three-stage process of screening, diagnosing, and editing or eliminating any suspected data abnormalities. This process can be initiated at any stage of the analysis. While it is possible to correct errors on the spot as the research is being performed, it is recommended to actively and systematically screen for errors in a planned way, in order to make sure that all data is adequately scanned and analyzed. The data-cleaning process can be carried out manually or with the assistance of specialized software. Every step of the process has its own techniques and suggestions that could be implemented to analyze the data provided by the article.
During the data screening process in a multivariate study, one has to detect and distinguish five basic types of anomalous data: missing data, excessive data, outliers and inconsistencies, strange patterns and distributions, and unexpected analysis results, along with other types of interference and potential errors. Screening methods can be statistical or non-statistical. Outliers and other nonconformities are often detected by comparison with prior expectations based on experience, evidence in the literature, previous studies, or common sense (Johnson & Wichern, 2014).
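As an illustration of statistical screening, the sketch below flags two of the anomaly types named above (missing data and outliers) in a hypothetical vector of filter-length measurements. The distribution parameters, the planted errors, and the 3.5 robust z-score cutoff are assumptions for the example, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical filter-length measurements in mm (values are illustrative)
filter_length = rng.normal(loc=27.0, scale=1.5, size=100)
filter_length[10] = 80.0    # planted physically implausible value
filter_length[55] = np.nan  # planted missing value

# Missing data: flag NaNs
missing = np.isnan(filter_length)

# Outliers: flag points far from the median in robust z-score terms
vals = filter_length[~missing]
median = np.median(vals)
mad = np.median(np.abs(vals - median))  # median absolute deviation
robust_z = 0.6745 * (filter_length - median) / mad
outliers = np.abs(robust_z) > 3.5       # NaN entries compare as False

print("missing:", int(missing.sum()),
      "outlier indices:", np.flatnonzero(outliers))
```

The median and MAD are used instead of the mean and standard deviation so that the planted extreme value does not inflate the very scale used to detect it.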
Detecting erroneous inliers, or erroneous data that fall within the expected range, is harder, and such inliers often escape the screening process. In order to detect erroneous inliers in multivariate research, they need to be viewed in relation to other variables using regression analyses and consistency checks. Remeasurement is a recommended strategy for dealing with erroneous inliers, but it is not always feasible (Johnson & Wichern, 2014). Another strategy involves examining a multitude of inliers in order to approximately estimate the number of potential errors in the research.
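A minimal consistency check of the kind described above can be sketched by regressing one design variable on another and flagging points with large standardized residuals. The variables (`circumference`, `rod_mass`), their assumed linear relationship, and the planted inconsistency are hypothetical, chosen only to show why an inlier that looks normal on its own can still be caught.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical paired design variables with an assumed linear relationship
circumference = rng.normal(24.5, 0.5, size=60)            # mm
rod_mass = 0.8 * circumference + rng.normal(0, 0.05, 60)  # arbitrary units
# Planted inlier: within the normal range of rod_mass on its own,
# but inconsistent with its own circumference
rod_mass[7] = 0.8 * (circumference[7] - 1.0)

# Consistency check: fit a least-squares line, inspect standardized residuals
slope, intercept = np.polyfit(circumference, rod_mass, 1)
residuals = rod_mass - (slope * circumference + intercept)
std_resid = residuals / residuals.std()
suspect = np.abs(std_resid) > 3

print("suspect indices:", np.flatnonzero(suspect))
```

A univariate screen of `rod_mass` alone would pass point 7, since its value sits inside the marginal distribution; only the relation to the second variable exposes it.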
Aggarwal (2013) describes several screening methods of this kind that could be implemented in relation to the chosen article.
After the suspected erroneous data points have been identified in the screening phase, it is necessary to determine their nature. There are several potential outcomes of this process: the data points could be erroneous, normal (if the expectations were incorrect), true extremes, or undecided (with no apparent explanation or reason for the extremity). Some of these data points can be singled out for being logically or biologically impossible (Chu, Ilyas, & Papotti, 2013). Often, several diagnostic procedures may be required in order to determine the true nature of each troublesome data pattern (Osborne, 2013).
Depending on the number and nature of errors in a multivariate study, the researchers may be required to reconsider their expectations for the outcomes. Furthermore, quality control procedures might need to be reviewed and adjusted. This is why the diagnostic phase is considered to be a very labor-intensive and expensive procedure (Osborne, 2013). The costs of data diagnostics can be lowered if the data-cleaning process is implemented throughout the entire research effort, instead of after it has been concluded. Data diagnostic software can be used to speed up the process.
One diagnostic strategy that can be used to investigate the chosen multivariate research article involves returning to the previous stages of the data flow in order to reassess their consistency (Osborne, 2013). Once any unjustified changes at any stage are detected, the next step involves looking for information to confirm whether the dataset is erroneous or merely extreme. For example, the data regarding the length of cigarette filters could be erroneous due to a measurement error or a sampling error. In order to be effective, this strategy requires insight on several statistical and biological levels.
After the diagnostic stage is completed, all of the suspected errors in a multivariate study are identified either as actual errors, missing values, or true values. There are only three options for dealing with each: correcting, deleting, or keeping the data unchanged (Downey & Fellows, 2012). The latter is applied to true values, as removing true extremes from the research would inevitably distort the results. Values that are physically or logically impossible are never left unchanged; they should be either removed or corrected, if possible. In the case of two data points measured within a short period of time that show only small variations between each other, it is recommended to take the average of the two in order to enhance data accuracy (Downey & Fellows, 2012). Depending on the severity and number of factual errors, it may be necessary to amend the research protocol or even start anew.
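The averaging rule for two closely spaced measurements can be expressed as a small treatment helper. The function name and the tolerance value are illustrative choices, not part of the cited procedure:

```python
def reconcile(first, second, tolerance):
    """Treatment rule for two measurements taken within a short period:
    small variation -> replace the pair with its mean; otherwise flag it
    for further diagnosis instead of silently keeping either value."""
    if abs(first - second) <= tolerance:
        return (first + second) / 2, "averaged"
    return None, "flagged"

print(reconcile(26.9, 27.1, tolerance=0.5))  # close pair: averaged
print(reconcile(26.9, 31.0, tolerance=0.5))  # large gap: kept for diagnosis
```

Returning an explicit status alongside the value keeps the three treatment options (correct, delete, keep) auditable rather than overwriting the raw data in place.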
With regard to the multivariate research article chosen for this paper, it is possible to re-check the measurements of every individual component in order to determine whether any suspicious data point is accurate or erroneous. Extreme values should be kept in the research if there were no errors in the test samples or the measurements, because they represent a legitimate share of statistical variability, and removing them would distort the study. However, in order to ensure that the outcomes of the study are accurate, an additional screening should be made for any extraneous influences that may have been left unaccounted for.
Data cleaning is considered an important part of a multivariate study, since it provides the possibility to eliminate any errors made during the data collection, processing, and analysis stages that the study framework and other error-prevention measures did not account for. The data-cleaning process consists of screening, diagnostic, and treatment stages. During the screening process, all of the available information used in a multivariate study is analyzed, with suspected data designated for diagnosis. During the diagnostic stage, it is determined whether the flagged data is actually erroneous or simply extreme. During the treatment stage, erroneous data is either corrected or deleted. This three-step process, when applied multiple times, can ensure the accuracy and validity of the research.
Aggarwal, C. C. (2013). Managing and mining sensor data. New York, NY: Springer.
Agnew-Heard, K. A., Lancaster, V. A., Bravo, R., Watson, C., Walters, M. J., & Holman, M. R. (2016). Multivariate statistical analysis of cigarette design feature influence on ISO TNCO yields. Chemical Research in Toxicology, 29 (6), 1051-1063.
Chu, X., Ilyas, I. F., & Papotti, P. (2013). Holistic data cleaning: Putting violations into context. Data Engineering (ICDE), 1(1), 5-16.
Cramer, H. (2016). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Downey, R. G., & Fellows, M. R. (2012). Parameterized complexity. New York, NY: Springer.
Johnson, R. A., & Wichern, D. W. (2014). Applied multivariate statistical analysis. Duxbury, MA: Duxbury Publishing.
Osborne, J. W. (2013). Best practices in data cleaning. Thousand Oaks, CA: Sage.