Quality: About This Data Set
Data is coherent and correct. There are missing values for some attributes, which do not affect other columns or the analysis in scope.
Is the data complete?
A few columns in the dataset have missing values, for example memo_cd and memo_text. These columns hold information about the transaction memo.
For example, the details for column memo_text are as follows:
About 84% of the values in this column are missing, and the most frequent value is "*".
Imputing the values does not seem to be an option here because the column is free text. Hence, I decided not to use these columns for analysis. All other columns have complete data.
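A completeness check like the one described above could be reproduced with a few lines of Python. This is only a sketch: the helper name missing_summary and the toy rows are made up for illustration, and it assumes blanks are stored as empty strings or None, which may not match how the original file encodes them.

```python
from collections import Counter


def missing_summary(rows, column):
    """Return (fraction_missing, most_common_value) for one column.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    Empty strings and None count as missing (an assumption about encoding).
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    fraction_missing = 1 - len(present) / len(values) if values else 0.0
    most_common = Counter(present).most_common(1)
    top_value = most_common[0][0] if most_common else None
    return fraction_missing, top_value


# Toy rows for illustration only, not real FEC records.
sample_rows = [
    {"memo_text": "*"},
    {"memo_text": ""},
    {"memo_text": ""},
    {"memo_text": "*"},
    {"memo_text": "SEE BELOW"},
    {"memo_text": ""},
]
fraction, top = missing_summary(sample_rows, "memo_text")
```

Running the same helper over every column name would give a quick per-column completeness table.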
Is the data coherent?
For the most part, the data seems coherent. There are 2-3 heavy-hitter candidates with frequent transactions, followed by a long tail. This seems in line with expectations.
Below is the distribution of candidates, split using the cand_nm (candidate name) column:
The string length histogram also shows there are no outliers for names.
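The two checks on cand_nm (transaction counts per candidate and the string length histogram) can be sketched as follows. The function name candidate_profile and the placeholder names are hypothetical; real input would be the cand_nm values read from the file.

```python
from collections import Counter


def candidate_profile(names):
    """Return (candidates ranked by transaction count, name-length histogram)."""
    counts = Counter(names)
    length_hist = Counter(len(name) for name in names)
    # most_common() sorts by descending count, which exposes the
    # heavy hitters first and the long tail last.
    return counts.most_common(), sorted(length_hist.items())


# Placeholder names for illustration only.
sample_names = ["Candidate A"] * 5 + ["Candidate B"] * 3 + ["Candidate C Jr."]
ranked, lengths = candidate_profile(sample_names)
```

An unusually short or long entry in the length histogram would point to a truncated or malformed name.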
Disbursement amount, however, has a number of outliers. The column distribution and histogram below show that some transactions have very large amounts, while some are even negative.
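One common way to flag such values, not necessarily the method used for the histogram here, is the 1.5 x IQR rule. The sketch below assumes the amounts have already been parsed into numbers; the function name flag_amounts and the sample values are invented for illustration.

```python
import statistics


def flag_amounts(amounts, k=1.5):
    """Return (outliers, negatives) using the k * IQR fence rule."""
    q1, _, q3 = statistics.quantiles(amounts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = sorted(a for a in amounts if a < low or a > high)
    negatives = sorted(a for a in amounts if a < 0)
    return outliers, negatives


# Made-up amounts: mostly small payments, one refund-like negative
# value, and one very large transaction.
sample_amounts = [10, 20, 30, 40, 50, 60, 70, 80, -500, 1_000_000]
outliers, negatives = flag_amounts(sample_amounts)
```

Negative amounts are listed separately because they may be legitimate corrections or refunds rather than errors, so they deserve a manual look before being dropped.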
Is the data correct?
I could verify the names of the candidates present in the dataset, because they appear frequently in the news.
I could not, however, verify the transactions and amounts, but it is reasonable to believe the FEC would have done a good job maintaining these records.
I also verified "geography" by mapping all data points on a map using Fusion Tables, and found that all points lie within PA, with most of them in Pittsburgh and Philadelphia.
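A tabular cross-check of the same geography claim, without a map, might look like the sketch below. The column names recipient_st and recipient_city are assumptions about the file layout and should be replaced with whatever the dataset actually uses; the rows are toy data.

```python
from collections import Counter


def geography_check(rows, state_col="recipient_st",
                    city_col="recipient_city", expected_state="PA"):
    """Return (rows outside the expected state, in-state city counts)."""
    outside = []
    city_counts = Counter()
    for row in rows:
        state = (row.get(state_col) or "").strip().upper()
        if state != expected_state:
            outside.append(row)
            continue
        city = (row.get(city_col) or "").strip().upper()
        city_counts[city] += 1
    return outside, city_counts.most_common()


# Toy rows for illustration only.
sample_geo = (
    [{"recipient_st": "PA", "recipient_city": "Pittsburgh"}] * 3
    + [{"recipient_st": "PA", "recipient_city": "Philadelphia"}] * 2
    + [{"recipient_st": "PA", "recipient_city": "Harrisburg"}]
    + [{"recipient_st": "NY", "recipient_city": "New York"}]
)
outside, top_cities = geography_check(sample_geo)
```

An empty outside list would agree with the map-based finding that all points lie within PA.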
Is the data accountable?
Yes, the data comes from a credible source. See info here.