Problems

  • A vertex in the dataset can carry duplicate attributes: no deduplication was done after lemmatization
  • Anomalous "home" and "pag" keywords: inspecting the XML shows that an author's home page sometimes appears inside a title block, and "page" became "pag" because I blindly stripped the last character of every title


Download from the official site

DBLP website: https://dblp.uni-trier.de/
Dataset page: https://dblp.org/xml/
Dataset: dblp-2018-04-01.xml.gz

A closer look shows that the DBLP site folds in new records every day, so the main dump is continuously updated. Do not download that live dump. The release/ folder shows that a stable snapshot is published around the 1st–3rd of each month; download one of those releases instead. The live dump keeps being overwritten, and once it is gone that exact snapshot can never be downloaded again, while releases stay available for later retrieval. I ran into exactly this problem: the version I originally used was the live dump, and that dataset can no longer be recovered.

Since the XML for my original version can no longer be found, I decided to use a senior labmate's copy, dblp-2018-04-01.xml.gz.

1. After unpacking, parse the XML with a program built on the SNAP library

/*************************************************************************************************
Function:     SplitStr
Description:  split a string on the given delimiter
Inputs:
Outputs:
Returns:
Calls:
Called by:
History:
1. Date: 2017-11-11
   Author: 何健
   Change: file created
2. Date: 2019-01-07
   Author: 何健
   Change: stop ignoring words separated by "-" and similar characters
*************************************************************************************************/
inline void SplitStr(string& s, vector<string>& v, const string& c)
{
    string title;
    transform(s.begin(), s.end(), back_inserter(title), ::tolower); // lowercase everything
    s = "";
    for (size_t i = 0; i < title.size(); i++) {
        if (title[i] >= 'a' && title[i] <= 'z') {
            s += title[i];
        }
        else s += c; // replace every non-letter with the delimiter
    }
    string::size_type pos1, pos2;
    pos2 = s.find(c);
    pos1 = 0;
    while (string::npos != pos2) {
        string str = s.substr(pos1, pos2 - pos1);
        string word = "";
        for (size_t i = 0; i < str.size(); i++) { // drop illegal characters
            if (str[i] >= 'a' && str[i] <= 'z') word += str[i];
        }
        if (word.size() != 0) v.push_back(word); // skip empty tokens

        pos1 = pos2 + c.size();
        pos2 = s.find(c, pos1);
    }
    if (pos1 != s.length()) {
        string str = s.substr(pos1);
        string word = "";
        for (size_t i = 0; i < str.size(); i++) {
            if (str[i] >= 'a' && str[i] <= 'z') word += str[i];
        }
        if (word.size() != 0) v.push_back(word);
    }
    return;
}
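For concreteness: SplitStr lowercases the input, replaces every non-letter with the delimiter, and drops empty tokens, so a hyphenated word splits into separate keywords. A minimal hypothetical driver (not part of the original program; assumes the SplitStr above and its headers are in scope):

// Hypothetical driver for SplitStr; assumes the function above is in scope.
int main()
{
    string title = "K-Truss Community Search";
    vector<string> words;
    SplitStr(title, words, " ");
    for (const string& w : words) cout << w << endl; // k, truss, community, search
    return 0;
}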

extern map<string, int> keyword_in;

inline void InvalidKeyword()
{
    string stopwords[] = { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you're",
        "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself",
        "she", "she's", "her", "hers", "herself", "it", "it's", "its", "itself", "they", "them", "their",
        "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "that'll", "these",
        "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
        "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until",
        "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during",
        "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over",
        "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all",
        "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own",
        "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "don't", "should", "should've",
        "now", "d", "ll", "m", "o", "re", "ve", "y", "ain", "aren", "aren't", "couldn", "couldn't", "didn",
        "didn't", "doesn", "doesn't", "hadn", "hadn't", "hasn", "hasn't", "haven", "haven't", "isn", "isn't",
        "ma", "mightn", "mightn't", "mustn", "mustn't", "needn", "needn't", "shan", "shan't", "shouldn", "shouldn't",
        "wasn", "wasn't", "weren", "weren't", "won", "won't", "wouldn", "wouldn't", "without" };
    int len = sizeof(stopwords) / sizeof(string);
    for (int i = 0; i < len; i++) {
        keyword_in[stopwords[i]] = 1; // mark as a stopword
    }
    return;
}

inline bool Check(string keyword)
{
    if (keyword_in[keyword] == 1) return false; // stopword: reject
    return true;
}
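One subtlety: Check looks the word up with map::operator[], which default-inserts an entry for every non-stopword it is asked about, so keyword_in silently grows with the whole vocabulary. A side-effect-free variant with the same semantics (a sketch, assuming the same keyword_in map):

// Sketch: stopword test that does not insert new keys into keyword_in.
inline bool Check(const string& keyword)
{
    return keyword_in.find(keyword) == keyword_in.end(); // valid iff not a stopword
}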

inline void DblpGenerate(char* file_in, char* ungraph_out, char* vertex_out, char* original_out)
{
    /*
    PWgtNet Net = TWgtNet::New();
    TWgtNet DBL;
    const TStr db = "./DataSets/dblp.xml";
    DBL.LoadDblpCoAuth("ds");
    DBL.LoadDblpCoAuth(db);
    TDblpLoader *dblp = new TDblpLoader(TStr("./DataSets/dblp.xml"));
    */
    cout << "Start Process dblp Dataset....." << endl;
    TDblpLoader dblp(file_in);
    dblp.GetFPosStr();
    int authorCount = 1;
    map<string, int> author_vertex;                 // hash lookup; ids start at 1, 0 means unseen; maps author -> vertex id
    map<string, vector<string> > author_attributes; // all attributes of each author

    int cnt = 0;
    InvalidKeyword();
    FILE* fout_ungraph = fopen(ungraph_out, "w");
    vector<string> keywords;
    vector<int> nodes;
    PUNGraph graph = PUNGraph::New(); // the graph was added to deduplicate edges
    while (dblp.Next()) {
        if (cnt++ % 10000 == 0) cout << cnt << endl;
        string titleName = dblp.Title.CStr();

        // changed 2019-06-19: use the trailing period to filter out home-page records
        if (titleName.empty() || titleName[titleName.size() - 1] != '.') continue;

        titleName.pop_back(); // drop the trailing period (writing '\0' into a std::string leaves an embedded NUL instead)

        keywords.clear();
        SplitStr(titleName, keywords, " ");

        nodes.clear();
        for (int i = 0; i < dblp.AuthorV.Len(); i++) { // all authors of the current paper
            string authorName = dblp.AuthorV[i].CStr();
            if (author_vertex[authorName] == 0) { // first time we see this author: assign a vertex id
                graph->AddNode(authorCount);
                author_vertex[authorName] = authorCount++;
            }
            nodes.push_back(author_vertex[authorName]);
            for (string keyword : keywords) {
                if (keyword != " " && keyword != "") {
                    if (Check(keyword)) {
                        author_attributes[authorName].push_back(keyword); // no dedup here: this is where problem #1 originates
                    }
                }
            }
        }
        if (nodes.size() > 1) {
            for (int i : nodes) {
                for (int j : nodes) {
                    if (i != j) { // skip self-loops
                        //if (!graph->IsEdge(i, j)) { // edge dedup (moved to a separate pass, step 2)
                        fprintf(fout_ungraph, "%d\t%d\n", i, j);
                        //    graph->AddEdge(i, j);
                        //}
                    }
                }
            }
        }
    }
    graph.Clr();
    cout << "Number of authors = " << author_vertex.size() << endl;
    cout << author_attributes.size() << endl;
    cout << "dblp_author_ungraph.txt over!" << endl;

    FILE* fout = fopen(vertex_out, "w");
    map<string, int>::iterator it = author_vertex.begin();
    for (; it != author_vertex.end(); it++) {
        fprintf(fout, "%s\t%d\n", it->first.data(), it->second);
    }
    cout << "dblp_author_vertex.txt over!" << endl;
    /*
    map<string, int> topAttributes;
    map<string, vector<string> > author_attrs; // dataset of the best 20 attributes per author
    for (map<string, vector<string> >::iterator it = author_attributes.begin(); it != author_attributes.end(); it++) {
        //if (it->second.size() <= 20) { // may contain duplicates
        //    author_attrs[it->first].assign(it->second.begin(), it->second.end());
        //    continue;
        //}
        topAttributes.clear();
        for (string attr : it->second) {
            topAttributes[attr]++;
        }
        // copy the map into a vector
        vector<PAIR> topAttributes_vec(topAttributes.begin(), topAttributes.end());

        // sort the vector
        sort(topAttributes_vec.begin(), topAttributes_vec.end(), CmpByValue());
        for (int i = 0; i < topAttributes_vec.size() && i < 20; i++) { // top 20 only; pointless without lemmatizing first
            author_attrs[it->first].push_back(topAttributes_vec[i].first);
        }
        topAttributes_vec.clear();
    }
    */
    // revised from the commented-out block above: lemmatize first, then rank by frequency,
    // so all attributes are written out here, which makes this file large
    FILE* fout_attrs = fopen(original_out, "w");
    map<string, vector<string> >::iterator itor = author_attributes.begin();
    for (; itor != author_attributes.end(); itor++) {
        string authorName = itor->first;
        fprintf(fout_attrs, "%d\t", author_vertex[authorName]);
        bool flag = false;
        for (string str : itor->second) {
            if (!flag) {
                flag = true;
                fprintf(fout_attrs, "%s", str.data());
            }
            else fprintf(fout_attrs, ",%s", str.data());
        }
        fprintf(fout_attrs, "\n");
    }
    cout << "dblp_author_attr_original.txt over!" << endl;
    fclose(fout);
    fclose(fout_ungraph);
    fclose(fout_attrs);
    return;
}
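The duplicate-attribute problem listed at the top starts in this function: every title word of every paper is appended to author_attributes without deduplication, so a word that repeats within one title is pushed twice for the same author. A hedged sketch of per-paper deduplication (keeping one occurrence per paper, so the later frequency ranking counts papers rather than raw repeats; AddPaperKeywords is a hypothetical helper, not the original code):

// Sketch: attach one paper's keywords to an author, deduplicated per paper.
// Assumes the same Check() and author_attributes as in DblpGenerate above.
inline void AddPaperKeywords(const string& authorName,
                             const vector<string>& keywords,
                             map<string, vector<string> >& author_attributes)
{
    set<string> uniq(keywords.begin(), keywords.end()); // one occurrence per paper
    for (const string& keyword : uniq) {
        if (!keyword.empty() && Check(keyword)) {
            author_attributes[authorName].push_back(keyword);
        }
    }
}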
6160001
6170001
Number of authors = 2081308
2081295
dblp_author_ungraph.txt over!
dblp_author_vertex.txt over!
dblp_author_attr_original.txt over!
Generate Time:892.714000 s

Press any key to continue . . .

dblp_attr_original.txt 759M
dblp_ungraph.txt 504M
dblp_vertex.txt 47.3M

2. Remove duplicate edges and self-loops

#include <bits/stdc++.h>
using namespace std;

int main()
{
    const char* input = "dblp_ungraph.txt";
    const char* output = "dblp100%_vertices.txt";
    FILE* fin = fopen(input, "r");
    FILE* fout = fopen(output, "w");

    cout << "input = " << input << endl;
    cout << "output = " << output << endl;
    cout << "Process Data...." << endl << endl;

    set<pair<int, int> > edges;
    int maxn_vertex = -1;
    int u, v, cnt = 0;
    while (fscanf(fin, "%d\t%d\n", &u, &v) == 2) { // safer than feof(), which lags one read behind
        cnt++;
        if (u > v) swap(u, v);             // store edges with the smaller endpoint first
        if (u != v) {                      // drop self-loops
            edges.insert(make_pair(u, v)); // the set drops duplicate edges
            if (v > maxn_vertex) maxn_vertex = v;
        }
        if (cnt % 1000000 == 0) cout << "processed " << cnt << endl;
    }
    for (auto edge : edges) {
        fprintf(fout, "%d\t%d\n", edge.first, edge.second);
    }
    cout << "edges in the original graph = " << cnt << endl;
    cout << "edges after processing = " << edges.size() << endl;
    cout << "largest vertex id = " << maxn_vertex << endl;
    fclose(fin);
    fclose(fout);
    return 0;
}

As the step title says, the effect is obvious: the edge file shrinks from 504M to 141M.

processed 34000000
processed 35000000
edges in the original graph = 35205510
edges after processing = 9735108
largest vertex id = 2081308

3. Lemmatization

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 9 22:02:33 2018

@author: HanKin

[Python nltk.WordNetLemmatizer() Examples](https://www.programcreek.com/python/example/81649/nltk.WordNetLemmatizer)
Stemming and lemmatization
[Online lemmatizer](http://text-processing.com/demo/stem/)
[NLTK tutorial](http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization)
https://www.cnblogs.com/itdyb/p/5914467.html
"""

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import time

def crawl_lemmas(vocab):
    """Add WordNet lemmas as definitions."""
    spwords = stopwords.words('english')
    lemmatizer = nltk.WordNetLemmatizer()
    ret = []
    for word in vocab:
        if len(word) < 2:  # drop one-letter attribute words
            continue
        definitions = []
        word = word.lower()  # lowercase
        try:
            for part_of_speech in ['a', 's', 'r', 'n', 'v']:
                lemma = lemmatizer.lemmatize(word, part_of_speech)
                if lemma != word and lemma not in definitions:
                    if lemma not in spwords:
                        definitions.append(lemma)
            if len(definitions) == 0:
                definitions.append(word)  # no lemma found: keep the word itself
        except:
            print('lemmatizer crashed')
        for num in definitions:
            ret.append(num)
    return ret

def file_open(filename):
    with open(filename, 'r') as file_to_read:
        print(file_to_read.name)  # file name
        while True:
            line = file_to_read.readline()
            if not line:
                break
            content = line.split('\t')
            print(content)
        # the with statement handles exceptions and closes the file handle automatically

def dblp_process(file_in, file_out):
    with open(file_out, 'w') as fout:
        with open(file_in, 'r') as fin:
            print('fin: ' + fin.name)
            print('fout: ' + fout.name)
            print('start lemmas......')
            schedule = 0
            while True:
                line = fin.readline()
                if not line:
                    break
                content = line.split('\t')
                fout.write(content[0] + '\t')  # content[0] is the vertex id
                content[1] = content[1][:-1]   # strip the trailing newline
                content = content[1].split(',')
                content = crawl_lemmas(content)

                dic = {}
                for attr in content:
                    if attr in dic:
                        dic[attr] += 1
                    else:
                        dic[attr] = 1
                dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)

                flag = True  # avoid a trailing comma
                cnt = 0
                for key, val in dic:
                    if cnt == 20:  # keep only the 20 most frequent attributes
                        break
                    cnt += 1
                    if flag:
                        fout.write(key)
                        flag = False
                    else:
                        fout.write(',' + key)
                fout.write('\n')
                schedule += 1
                if schedule % 10000 == 0:
                    print(schedule)  # progress

'''
2019-01-10: added a progress counter;
crawl_lemmas now drops attribute words of length 1
'''
if __name__ == '__main__':
    '''
    lis = ['a', 'b', 'a', 'c', 'd', 'c', 'a']
    dic = {}
    for elem in lis:
        if elem in dic:
            dic[elem] += 1
        else:
            dic[elem] = 1
    print(sorted(dic.items(), key=lambda x: x[1], reverse=True))
    dic = sorted(dic.items(), key=lambda x: x[1], reverse=True)
    for key, val in dic:
        print(key + ' ' + str(val))
    '''

    sentence = 'Stemming is funnier than a bummer says the sushi loving computer scientist doing nets are crying parsing problem affix grammars best better does were'
    words = sentence.split(' ')
    #crawl_lemmas(words)

    # NLTK POS tag meanings: https://blog.csdn.net/john159151/article/details/50255101
    #print(nltk.help.upenn_tagset())

    start_time = time.time()
    dblp_process('dblp_attr_original.txt', 'dblp_nltk_attributes.txt')
    #dblp_process('fin.txt', 'fout.txt')
    print('total time = %lf s' % (time.time() - start_time))

#nltk.WordNetLemmatizer().lemmatize('mapping')
2060000
2070000
2080000
total time = 2029.313343 s

Checking the lemmatization results

This step includes a self-built dictionary to fix words that the Python lemmatizer left unnormalized, such as "subfigures".

#include <bits/stdc++.h>
using namespace std;

void SplitString(const string& s, vector<string>& v, const string& c)
{
    string::size_type pos1, pos2;
    pos2 = s.find(c);
    pos1 = 0;
    while (string::npos != pos2) {
        v.push_back(s.substr(pos1, pos2 - pos1));
        pos1 = pos2 + c.size();
        pos2 = s.find(c, pos1);
    }
    if (pos1 != s.length()) v.push_back(s.substr(pos1));
    return;
}

// Write out every attribute ending in s/ed/ing, for manual inspection.
void GetPostfix(const char* fileName)
{
    int index = 1;
    FILE *attributeF = fopen(fileName, "r");
    FILE *out = fopen("out.txt", "w");
    while (!feof(attributeF)) {
        index++;
        if (index % 10000 == 0) cout << index << endl;

        int node;
        char attr[20005];
        fscanf(attributeF, "%d\t%s\n", &node, attr); // %s expects a char*, not a char(*)[20005]
        fprintf(out, "%d\t", node);

        vector<string> vec;
        string str = string(attr);
        if (str.size() == 0) continue;
        SplitString(str, vec, ",");
        for (size_t i = 0; i < vec.size(); i++) {
            int len = vec[i].size();
            if (vec[i][len - 1] == 's') {
                fprintf(out, "%s,", vec[i].data());
            }
            else if (len > 1 && vec[i][len - 1] == 'd' && vec[i][len - 2] == 'e') {
                fprintf(out, "%s,", vec[i].data());
            }
            else if (len > 2 && vec[i][len - 1] == 'g' && vec[i][len - 2] == 'n' && vec[i][len - 3] == 'i') {
                fprintf(out, "%s,", vec[i].data());
            }
        }
        fprintf(out, "\n");
    }
    fclose(attributeF);
    fclose(out);
    return;
}

map<string, int> attrCnt;

// Self-built dictionary pass: normalize s/ed/ing suffixes by global frequency
// and deduplicate each vertex's attributes with a set.
void Vertex_Attribute(const char* fileName, const char* outFileName)
{
    int index = 1;
    FILE *attributeF = fopen(fileName, "r");
    while (!feof(attributeF)) {
        index++;
        if (index % 10000 == 0) cout << index << endl;

        int node;
        char attr[20005];
        fscanf(attributeF, "%d\t%s\n", &node, attr);

        vector<string> vec;
        string str = string(attr);
        if (str.size() == 0) continue;
        SplitString(str, vec, ",");
        for (size_t i = 0; i < vec.size(); i++) {
            attrCnt[vec[i]]++; // global count
        }
    }

    index = 1;
    FILE *in = fopen(fileName, "r");
    FILE *out = fopen(outFileName, "w");
    while (!feof(in)) {
        index++;
        if (index % 10000 == 0) cout << "progress=" << index << endl;

        int node;
        char attr[20005];
        fscanf(in, "%d\t%s\n", &node, attr);
        fprintf(out, "%d\t", node);

        vector<string> vec;
        string str = string(attr);
        if (str.size() == 0) continue;
        SplitString(str, vec, ",");

        set<string> ss;
        for (size_t i = 0; i < vec.size(); i++) {
            int len = vec[i].size();
            string tmp = vec[i];
            if (vec[i][len - 1] == 's') { // self-built dictionary: strip the 's' if twice the count of the stripped form exceeds the count with 's'
                string s = vec[i].substr(0, len - 1);
                if (attrCnt[s] * 2 > attrCnt[vec[i]]) tmp = s;
            }
            else if (len > 1 && vec[i][len - 1] == 'd' && vec[i][len - 2] == 'e') {
                string s = vec[i].substr(0, len - 2);
                if (attrCnt[s] * 2 > attrCnt[vec[i]]) tmp = s;
            }
            else if (len > 2 && vec[i][len - 1] == 'g' && vec[i][len - 2] == 'n' && vec[i][len - 3] == 'i') {
                string s = vec[i].substr(0, len - 3);
                if (attrCnt[s] * 2 > attrCnt[vec[i]]) tmp = s;
            }
            ss.insert(tmp);
        }
        bool flag = true;
        for (string i : ss) {
            if (flag) {
                flag = false;
                fprintf(out, "%s", i.data());
            }
            else fprintf(out, ",%s", i.data());
        }
        fprintf(out, "\n");
    }
    fclose(in);
    fclose(out);
    fclose(attributeF);
    return;
}
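The rule above is purely frequency-driven: a suffix is stripped only when the stripped form is common enough on its own, i.e. when twice the global count of the stem exceeds the count of the suffixed word. For example, with hypothetical counts attrCnt["graphs"] = 100 and attrCnt["graph"] = 60, 60 * 2 > 100 holds and "graphs" becomes "graph"; with attrCnt["graph"] = 40 it would be left alone. The same rule factored into a standalone helper (a sketch, not the original code; NormalizeSuffix is a hypothetical name):

// Sketch (hypothetical helper): the 2x-frequency suffix rule from Vertex_Attribute.
string NormalizeSuffix(const string& word, const map<string, int>& counts)
{
    static const string suffixes[] = { "ing", "ed", "s" }; // mutually exclusive endings
    for (const string& suf : suffixes) {
        if (word.size() > suf.size() &&
            word.compare(word.size() - suf.size(), suf.size(), suf) == 0) {
            string stem = word.substr(0, word.size() - suf.size());
            map<string, int>::const_iterator s = counts.find(stem);
            map<string, int>::const_iterator w = counts.find(word);
            int stemCnt = (s == counts.end()) ? 0 : s->second;
            int wordCnt = (w == counts.end()) ? 0 : w->second;
            if (stemCnt * 2 > wordCnt) return stem; // stem is frequent enough: strip
            break; // at most one suffix is considered, as in the original
        }
    }
    return word;
}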

void ComputeFrequency(const char* fileName, const char* outFileName)
{
    int sum = 0;
    double tmp;
    int index = 0;
    attrCnt.clear(); // attrCnt still holds counts from Vertex_Attribute's first pass; start fresh
    FILE *attributeF = fopen(fileName, "r");
    while (!feof(attributeF)) {
        index++;
        if (index % 10000 == 0) cout << index << endl;

        int node;
        char attr[20005];
        fscanf(attributeF, "%d\t%s\n", &node, attr);

        vector<string> vec;
        string str = string(attr);
        if (str.size() == 0) continue;
        SplitString(str, vec, ",");
        int len = vec.size();
        sum += len;

        // ??? a single vertex has duplicate attributes
        map<string, int> flag;
        for (int i = 0; i < len; i++) {
            attrCnt[vec[i]]++; // count
            flag[vec[i]]++;
            if (flag[vec[i]] >= 2) {
                cout << "vertex with duplicate attributes: " << index << endl;
            }
        }
    }

    set<pair<int, string>, greater<pair<int, string> > > SET;
    for (map<string, int>::iterator it = attrCnt.begin(); it != attrCnt.end(); it++) {
        SET.emplace(make_pair(it->second, it->first));
    }
    int cnt = SET.size();
    FILE *out = fopen(outFileName, "w");
    tmp = sum * 1.0 / index;
    cout << tmp << endl;
    fprintf(out, "vertices=%d\ttotal attributes=%d\taverage=%.3f\tdistinct attributes=%d\n", index, sum, tmp, cnt);
    fprintf(out, "\nattribute\tcount\tfraction\n");
    for (set<pair<int, string>, greater<pair<int, string> > >::iterator it = SET.begin(); it != SET.end(); it++) {
        tmp = it->first * 1.0 / index;
        fprintf(out, "%s\t%d\t%.3f\n", (it->second).data(), it->first, tmp);
    }
    fclose(out);
    fclose(attributeF);
    return;
}

void Transform2Int(const char* fileName, const char* outFileName1, const char* outFileName2)
{
    int index = 1, attr_index = 1; // start from 1; 0 can stand for "none"
    map<string, int> attribute_index;
    FILE *attributeF = fopen(fileName, "r");
    FILE *out1 = fopen(outFileName1, "w");
    while (!feof(attributeF)) {
        index++;
        if (index % 100000 == 0) cout << index << endl;

        int node;
        char attr[20005];
        fscanf(attributeF, "%d\t%s\n", &node, attr);

        vector<string> vec;
        string str = string(attr);
        if (str.size() == 0) continue;
        SplitString(str, vec, ",");
        int len = vec.size();
        for (int i = 0; i < len; i++) {
            // map each string attribute to an int
            if (attribute_index[vec[i]] == 0) {
                attribute_index[vec[i]] = attr_index;
                fprintf(out1, "%s\t%d\n", vec[i].data(), attr_index);
                attr_index++;
            }
        }
    }

    index = 1;
    FILE *in = fopen(fileName, "r");
    FILE *out2 = fopen(outFileName2, "w");
    while (!feof(in)) {
        index++;
        if (index % 10000 == 0) cout << "progress=" << index << endl;

        int node;
        char attr[20005];
        fscanf(in, "%d\t%s\n", &node, attr);
        fprintf(out2, "%d\t", node);

        vector<string> vec;
        string str = string(attr);
        if (str.size() == 0) continue;
        SplitString(str, vec, ",");
        bool flag = true;
        for (size_t i = 0; i < vec.size(); i++) {
            if (flag) {
                flag = false;
                fprintf(out2, "%d", attribute_index[vec[i]]);
            }
            else fprintf(out2, ",%d", attribute_index[vec[i]]);
        }
        fprintf(out2, "\n");
    }
    fclose(in);
    fclose(out1);
    fclose(out2);
    fclose(attributeF);
    return;
}
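For reference, Transform2Int emits two files: dblp_String2Int.txt holds one attribute-TAB-id line per distinct attribute (ids assigned in order of first appearance, starting at 1 so that 0 can mean "absent"), and dblp_attributes_int.txt mirrors dblp_attributes.txt with each string attribute replaced by its integer id, i.e. lines of the form vertex-TAB-id1,id2,....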

typedef pair<string, int> psi;
ostream& operator<<(ostream& out, const psi& p) {
    return out << p.first << "\t" << p.second;
}

bool cmp(psi a, psi b)
{
    return a.second > b.second;
}

// Drop lines that carry no attributes (vertex id only).
void Debug(const char* inFileName, const char* outFileName)
{
    int index = 0;
    FILE *in = fopen(inFileName, "r");
    FILE *out = fopen(outFileName, "w");

    while (!feof(in)) {
        index++;
        if (index % 10000 == 0) cout << index << endl;

        char attr[20005];
        if (fgets(attr, 20005, in) == NULL) break; // EOF: stop before reusing a stale buffer
        // strlen includes the tab and the newline
        if (strlen(attr) > 9) { // the vertex id itself is up to 7 chars, plus the tab and newline
            fputs(attr, out);
        }
        else {
            cout << index << endl; // report the attribute-less line
        }
    }
    fclose(in);
    fclose(out);
    return;
}

int main()
{
    // remove vertices that have no attributes
    //const char* fileName1 = "dblp_nltk_attributes.txt";
    //const char* fileName2 = "out.txt";
    //Debug(fileName1, fileName2);
    //return 0;


    const char* fileName = "dblp_nltk_attributes.txt";
    const char* outFileName = "dblp_attributes.txt";
    const char* freFileName = "frequency.txt";
    const char* transFileName1 = "dblp_String2Int.txt";
    const char* transFileName2 = "dblp_attributes_int.txt";
    Vertex_Attribute(fileName, outFileName);    // self-built dictionary for s/ed/ing (2019-06-23: remember to deduplicate)
    ComputeFrequency(outFileName, freFileName); // compute attribute frequencies
    Transform2Int(outFileName, transFileName1, transFileName2); // convert strings to ints
    return 0;
}

The file could not be read all the way to the end??? It turned out that one vertex had no attributes at all.
The fix uses fgets and fputs and drops any line of at most 9 characters (the Debug function, shown again below).

// Drop lines that carry no attributes (vertex id only).
void Debug(const char* inFileName, const char* outFileName)
{
    int index = 0;
    FILE *in = fopen(inFileName, "r");
    FILE *out = fopen(outFileName, "w");

    while (!feof(in)) {
        index++;
        if (index % 10000 == 0) cout << index << endl;

        char attr[20005];
        if (fgets(attr, 20005, in) == NULL) break; // EOF: stop before reusing a stale buffer
        // strlen includes the tab and the newline
        if (strlen(attr) > 9) { // the vertex id itself is up to 7 chars, plus the tab and newline
            fputs(attr, out);
        }
        else {
            cout << index << endl; // report the attribute-less line
        }
    }
    fclose(in);
    fclose(out);
    return;
}

progress=2070000
progress=2080000

Process returned 0 (0x0) execution time : 414.761 s
Press any key to continue.

So a vertex ends up with duplicate attributes during the self-built-dictionary normalization step. The handling is still not perfect: the Python lemmatization stage should not truncate to the top 20; the top-20 selection belongs here, after normalization (see the sketch below). Time is tight, and the Python pass alone takes about half an hour, so it stays as is.
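The suggested fix would keep all lemmas in the Python stage and truncate here, after the dictionary normalization. Roughly (a sketch, not the original code; Top20 is a hypothetical helper taking one vertex's normalized attribute list before dedup, and assumes the same headers and using-directive as the file above):

// Sketch (hypothetical helper): keep a vertex's 20 most frequent normalized attributes.
vector<string> Top20(const vector<string>& normalized)
{
    map<string, int> freq; // per-vertex counts of the normalized words
    for (const string& w : normalized) freq[w]++;
    vector<pair<int, string> > ranked;
    for (map<string, int>::const_iterator it = freq.begin(); it != freq.end(); ++it)
        ranked.push_back(make_pair(it->second, it->first));
    sort(ranked.begin(), ranked.end(), greater<pair<int, string> >()); // most frequent first
    if (ranked.size() > 20) ranked.resize(20);
    vector<string> result;
    for (size_t i = 0; i < ranked.size(); i++) result.push_back(ranked[i].second);
    return result; // deduplicated by construction, at most 20 entries
}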

Run: build the supergraph and the attribute keyword index

[hejian@sklse ETAttriOnLinux]$ ./main_by_hejian 
HanKin.build: 16:16:09, Jun 23 2019. Time: 16:16:29 [Dec 13 2017]
=================================================================
usage: main_by_hejian

<<Dense subgraph query (k-truss-based attributed-network community detection and search)>>

### Select a dataset ###:
1.dblp
2.dbpedia
3.yago
4.tencent
5.test

Input: 1
************* Preprocessing *************
1. Compute the dataset's average vertex degree
2. Randomly generate query vertices for the dataset
3. Randomly generate vertex-scalability subsets of the dataset
4. Randomly generate attribute-scalability subsets of the dataset
5. Show graph information
6. Other (skip)
Input: 23
aRate = 1.00 vRate = 0.20
attrNumMax = 5 kValue = 6

Enter the vertex percentage to use (0-100): 100

==========1.Read Local Data File (build the graph)==========
DataSet: ./DataSets/Graph/dblp100%_vertices.txt
nodes = 1981567
edges = 9735108
Run Time:5.020000 s

==========Start EquiTrussIndex==========
==========2.Compute the Support of Edges==========
Run Time:48.290000 s

==========3.Compute the Trussness of Edges==========
kMax = 287
Node = 2 287
Edge = 2 287
Run Time:169.820000 s

==========4.Index Construction for EquiTruss==========
superNode's size:972699
super-nodes saved to disk!!!
superEdge's size:1461225
super-edges saved to disk!!!
global trussness saved to disk!!!
supergraph node count = 972699
supergraph edge count = 1461225
EquiTruss index construction time: 359.810000 s

vertex-to-supernode mapping time: 367.950000 s

==========5.Community Search Based on AttributeSearch (prepare for search)==========
==========5-1.Read Attribute Dataset File (load attributes into memory)==========
DataSet: ./DataSets/Attribute/dblp_attributes_int.txt
Run Time:585.270000 s


========5-2 start find attribute truss (build the attribute index)...=====
attr_max_id = 259969
time = 2548.5s

attribute index construction time: 2548.500000 s

Frustrating. Back then the latter two methods did not exist yet, so I picked my own ATCImprove design for the comparison; I would like to change that, but there is no time left, so this is how it stays.