Data Mining: Cleaning a CSV That Contains Chinese Characters with Python

2022-11-24 19:52:04

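Data cleaning: this post uses Python to clean a CSV file that contains Chinese characters. The idea is to map each Chinese string through a dictionary, where the key is the Chinese string and the value is an index that simply auto-increments. Because a dictionary cannot hold duplicate keys, every distinct Chinese string ends up mapped to exactly one numeric index.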
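As a minimal illustration of the idea, assuming a short in-memory list rather than a CSV column (the function name `encode_values` and the sample values are hypothetical), each string receives a code the first time it is seen. This sketch uses `len(mapping)` as the next code, which is one way to get consecutive numbers:

```python
def encode_values(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    mapping = {}   # string -> code
    codes = []
    for value in values:
        if value not in mapping:
            # len(mapping) is the next unused code: 0, 1, 2, ...
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Hypothetical sample column containing Chinese strings
cities = ["北京", "上海", "北京", "廣州", "上海"]
codes, mapping = encode_values(cities)
print(codes)     # [0, 1, 0, 2, 1]
print(mapping)
```

Repeated strings reuse the code they were given earlier, so the dictionary doubles as a lookup table for decoding later.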

The Python code is as follows (the data is in CSV format):

import csv

# Columns that hold Chinese text, by 0-based index.
# Spreadsheet letters: C, E, Z, AA, AB, AL, AM, AO, AP, AQ, AT, AX
TEXT_COLUMNS = [2, 4, 25, 26, 27, 37, 38, 40, 41, 42, 45, 49]

# One dictionary per column: Chinese string -> numeric code
mappings = {col: {} for col in TEXT_COLUMNS}

index = 0        # counts data rows; a new string gets the current row count as its code,
                 # so codes are unique per distinct string but not consecutive
flag = False     # becomes True once the header row has been passed

with open("e:/test/real/test.csv", "w", newline="") as csv_file_write:
    writer = csv.writer(csv_file_write)
    with open("e:/test/real/b.csv", "r", newline="") as csv_file_read:
        reader = csv.reader(csv_file_read)
        for row in reader:
            if flag:
                for col in TEXT_COLUMNS:
                    mapping = mappings[col]
                    # First time this string appears in this column: assign the current index
                    if row[col] not in mapping:
                        mapping[row[col]] = index
                    # Replace the Chinese string with its numeric code
                    row[col] = mapping[row[col]]
                index = index + 1
            # Assumed placement: every row is written, so the header passes through unchanged
            writer.writerow(row)
            flag = True

# No explicit close() is needed: the with blocks close both files
print('done!')

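The example above comes from a real data-processing job: the raw data has two hundred attribute columns and thirty thousand records. It contains Chinese characters as well as missing values, and has to be cleaned step by step.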
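The missing values mentioned here need a decision before encoding. One possible approach, sketched under the assumptions that a missing value shows up as an empty or whitespace-only cell and that -1 is an acceptable placeholder code (both are assumptions, as are the names `encode_column` and `MISSING_CODE`), is to keep missing cells out of the dictionary:

```python
import csv
import io

MISSING_CODE = -1  # assumed placeholder for missing cells


def encode_column(rows, col, mapping):
    """Replace rows[i][col] with an integer code in place; blank cells get MISSING_CODE."""
    for row in rows:
        value = row[col].strip()
        if value == "":
            row[col] = MISSING_CODE
        else:
            # setdefault stores len(mapping) only when the string is new
            row[col] = mapping.setdefault(value, len(mapping))


# Hypothetical in-memory CSV standing in for the real file
sample = "id,city\n1,北京\n2,\n3,上海\n4,北京\n"
rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]

city_codes = {}
encode_column(data, 1, city_codes)
print(data)  # [['1', 0], ['2', -1], ['3', 1], ['4', 0]]
```

Keeping blanks out of the dictionary means a missing cell is never confused with a real category; whether -1 or some other marker suits the downstream analysis depends on the tools used.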

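Note: a "Permission denied" exception was raised.

Solution: the CSV file was still open elsewhere at the time, so Python did not have permission to open it in w mode. Closing the file resolves the error.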
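If the script should report this situation instead of stopping with a traceback, the error can be caught explicitly. The following is a sketch, not the code used above; the function name `write_rows` and the message text are made up for illustration. In Python 3 this failure is raised as `PermissionError`:

```python
import csv


def write_rows(path, rows):
    """Write rows to a CSV file; return False if the file cannot be opened for writing."""
    try:
        with open(path, "w", newline="") as csv_file:
            csv.writer(csv_file).writerows(rows)
    except PermissionError:
        # Seen, for example, when the file is still open in a spreadsheet program
        print(f"Cannot write to {path}: close the file in other programs and retry.")
        return False
    return True
```

A caller can check the return value and retry after the file has been closed.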