Data, Knowledge & Life

Weiwei Cheng's blog

Archive for August 2009

Timelapse – Los Angeles Wildfire

Written by Weiwei

31/08/2009 at 21:30

Posted in Reposts

Google celebrating MJ today

Written by Weiwei

29/08/2009 at 15:28

Posted in Miscellany

The ISI Case – A Critical View

An issue of concern to computer scientists is the common tendency to use as reference for publications and citations the database of Thomson Scientific’s ISI Web of Science. This resource was devised for the natural sciences; while the corresponding community seems to be satisfied with its applicability there, it is inadequate for computer science.

The principal deficiency of ISI comes from its arbitrary classification of what is or is not worthy of being counted. The selection criteria are arbitrary and opaque. Most conferences are not listed; books are generally not listed; conversely, some references are included indiscriminately.

The results make any computer scientist cringe. For Niklaus Wirth, a famous computer scientist honored by the Turing Award (the highest honor in computer science, commonly referred to as the Nobel equivalent of the discipline) and known in particular for his design of Pascal, the ISI database lists a number of minor papers that happen to have appeared in indexed publications, but not his 1970 “Pascal User’s Manual and Report” (with Kathleen Jensen), published as a book and one of the best known references in all of computer science. Ask any computer scientist what is the most influential publication in the discipline, and most will cite Donald E. Knuth’s The Art of Computer Programming book series, which has acquired legend status; that reference does not figure in the ISI database. (On Google Scholar it gets over 15,000 citations, an astounding number.) Of the many articles that Knuth — also a Turing Award winner — has published, the three most frequently cited according to Google Scholar, each with about 1000 citations, do not even appear in the ISI records.

Evidence of how ISI collapses for computer science is “internal coverage”: the percentage of citations that cite a publication in the same database. Whereas ISI’s internal coverage exceeds 80% for physics or chemistry, it is only 38% for computer science.
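Internal coverage as defined above can be computed directly from a citation list. The following is a minimal sketch, not ISI's actual procedure: the function name `internal_coverage`, the pair-based input format and the toy identifiers are all assumptions made for illustration.

```python
def internal_coverage(citations, indexed):
    """Return the fraction of citations whose cited work is itself indexed.

    citations: list of (citing_id, cited_id) pairs taken from indexed papers.
    indexed:   set of publication ids present in the database.
    """
    cited = [target for _, target in citations]
    if not cited:
        raise ValueError("no citations to measure")
    hits = sum(1 for target in cited if target in indexed)
    return hits / len(cited)


# Hypothetical toy data: all ids are invented for illustration only.
indexed = {"j1", "j2", "j3"}
citations = [
    ("j1", "j2"),      # citation to an indexed journal paper
    ("j1", "conf-a"),  # citation to a conference paper that is not indexed
    ("j2", "book-b"),  # citation to a book that is not indexed
    ("j3", "j1"),      # citation to an indexed journal paper
    ("j3", "conf-c"),  # citation to a conference paper that is not indexed
]
coverage = internal_coverage(citations, indexed)
print(f"internal coverage: {coverage:.0%}")
```

In this toy set, the three citations to conferences and books fall outside the database, which is the mechanism that pulls the figure down for a discipline that publishes mainly in such venues.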

An example of the arbitrariness of ISI criteria is Springer’s Lecture Notes in Computer Science (LNCS), which ISI until recently classified as a journal, whereas it is simply a book series offering numerous conference proceedings and some monographs. Lumping all LNCS publications into a single journal category was inappropriate, especially since many high-quality conferences not published by LNCS are not listed. For example:

  • The International Conference on Software Engineering (ICSE), the top conference in a field that has its own special ISI category, is considered a premier publication venue by anyone in the field; it is not indexed by ISI.
  • Any software engineering workshop published in LNCS, the kind of venue where an author would typically try out an idea before it is ready for submission to ICSE, is indexed by ISI.

As another example, ISI indexes SIGPLAN Notices, a publication of the Programming Languages group of the ACM (one of the two major professional societies in computer science). SIGPLAN Notices is actually an unrefereed publication, used in its ordinary issues to publish drafts, notes, letters; but it devotes special issues to the proceedings of some of the most prestigious conferences such as POPL and PLDI. Unlike those to ICSE, contributions to these conferences will appear in ISI, but treated in the same way as an informal reader’s note in a regular issue.

The database has little understanding of what constitutes computer science. The 50 most cited references in computer science according to ISI include such entries as Chemometrics in food science (#13), from a journal called Chemometrics and Intelligent Laboratory Systems, a topic and a publication entirely alien to computer science. This is not just an isolated example; most of the entries on the list, even those which are related to computer science (usually from specialized areas rather than the core of the discipline), are not recognizable to a computer scientist as milestone contributions. The cruelest comparison is with the list of most cited computer science works on the CiteSeer site, devoted to computer science; while imperfect like any such selection, the CiteSeer list contains many articles and books familiar to all computer scientists. It has not a single entry in common with the ISI list.

Merlet et al. note that the top-ranked ISI journal is 195th on CiteSeer, and the top CiteSeer journal is 26th for ISI. While some might be tempted to use this as a reason to dismiss rankings altogether, examination of the differences shows that they simply reflect how far off ISI is from the general understanding of computer scientists.

The ISI list of “highly cited researchers” reflects the database’s ignorance of computer science. Wirth, Parnas and Knuth, all iconic names in the field, do not appear. Of the ten Turing Award winners between 2000 and 2006, only one is listed (Ronald Rivest, the R of the RSA cryptographic algorithm), but not, for example, Adi Shamir (the S of RSA), another revered figure of theoretical computer science.

Although one might indeed expect ISI to give better results for Theory work, closer than Systems to mathematics (a long-established discipline), the difference is marginal because of the fundamental deficiencies in the process.

Since ISI indexing is based on an opaque process with no room for assessment or appeal of decisions, the situation is unlikely to improve.

Alternatives such as CiteSeer and Google Scholar are subject to criticism as well. While CiteSeer attempts to eliminate self-references, Google Scholar does not; neither project publishes its precise inclusion criteria. These deficiencies, however, are negligible when viewed against those of the ISI Web of Science:

In assessing publications and citations, the ISI Web of Science is inadequate for most areas of computer science and must not be used. Alternatives, imperfect but preferable, include Google Scholar, CiteSeer and (potentially) the ACM Digital Library.

Anyone in charge of research assessment should be aware that attempts to use ISI for computer science will cause massive opposition. Announcements of such plans have led some computer scientists to reject all measurement-based techniques, as in the examples cited above. This is an overreaction; but decision-makers and scientists from other fields must not try to impose on computer scientists a scheme that is demonstrably inapplicable to their discipline.

Beyond the specific deficiencies of ISI, all systematic studies of citation databases show wide variations between the results they yield, in particular for computer science. This has led some authors to suggest systematic reliance on several databases, a commendable approach that seems hard to impose in practice. Perhaps software will appear that performs this automatically. In the meantime it is again essential to remember the limitations of data quality in today’s databases, and avoid any career-affecting decision based on metric indicators whose validity has not been checked thoroughly.
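As a rough illustration of what consulting several databases at once might look like, the sketch below compares the citation counts reported for one publication and flags it for manual checking when the databases disagree widely. The function names, the database labels, the counts and the threshold `max_ratio` are all hypothetical; no real database is queried.

```python
from statistics import median


def summarize(counts):
    """Summarize citation counts for one publication.

    counts: dict mapping a database name to the citation count it reports.
    """
    values = sorted(counts.values())
    low, high = values[0], values[-1]
    return {
        "median": median(values),
        "min": low,
        "max": high,
        # Ratio between the highest and lowest count; infinite if a
        # database reports zero citations (e.g. the work is not indexed).
        "spread": high / low if low else float("inf"),
    }


def needs_manual_check(counts, max_ratio=3.0):
    """Flag a publication whose counts differ by more than max_ratio."""
    return summarize(counts)["spread"] > max_ratio


# Hypothetical counts, invented for illustration only.
paper_a = {"db_a": 12, "db_b": 150, "db_c": 140}
paper_b = {"db_a": 90, "db_b": 100, "db_c": 110}

print(summarize(paper_a)["median"], needs_manual_check(paper_a))
print(summarize(paper_b)["median"], needs_manual_check(paper_b))
```

Reporting the median together with the spread keeps a single outlying database from dominating the figure while still making the disagreement visible.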

Written by Weiwei

29/08/2009 at 15:14

Posted in Reposts

The Human Journey

Written by Weiwei

23/08/2009 at 11:35

Posted in Reposts

Word-cloud from my university homepage

Top-50 frequent words from www.chengweiwei.com

Written by Weiwei

12/08/2009 at 20:32

Posted in Miscellany

Get a true random number

As you may know, most random numbers used in computer programs are pseudo-random, which means they are generated in a predictable fashion using a mathematical formula. They are not really random! Though this is fine for many purposes, it may not be random in the way you expect if you’re used to dice rolls and lottery drawings.

A true random number generator is offered by random.org. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs. The site also has more on randomness.
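To see the predictability for yourself, here is a small Python sketch; the helper `pseudo_rolls` and the seed value are made up for illustration. It also draws one number from the operating system's entropy source via the standard `secrets` module. That is not atmospheric noise like random.org, and it is still an algorithmic generator fed by system entropy, but it cannot be replayed from a seed the way `random` can.

```python
import random
import secrets


def pseudo_rolls(seed, n=5):
    """Roll a six-sided die n times with a seeded pseudo-random generator."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(n)]


# The same seed reproduces exactly the same "random" rolls.
first = pseudo_rolls(2009)
second = pseudo_rolls(2009)
print(first == second)

# One roll drawn from the OS entropy source; there is no seed to replay.
os_roll = secrets.randbelow(6) + 1
print(1 <= os_roll <= 6)
```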

Written by Weiwei

11/08/2009 at 14:15

Posted in Miscellany

My First Week at Deutsche Bank

with 9 comments

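I have now been working at Deutsche Bank (DB for short) for a full week. When I decided to take on this DB project, many friends and colleagues asked me in surprise, "Why?" In their view, working at DB is certainly exciting, but for a researcher like me, in the second year of a PhD and supposed to devote all of my energy to the dissertation, choosing to work for DB at this point does not look like a very logical move. What is more, in order to take the DB project I even had to accept a lower salary at the university. All of this made them curious. For me, the decision really reflects my own uncertainty about the future: graduation is approaching, yet I still cannot decide which direction to take. I do not know whether I am better suited to staying at the university or to going into industry. I have been at the university for a long time, and everything has been reasonably satisfying, but I have never known what the world outside looks like. Perhaps I would be a better fit in a company? That question can never be answered without experiencing a company first-hand. This is roughly the original motivation for stepping outside the campus at this moment. People are often afraid of change, but only by choosing change at the right time can we keep moving forward. Unlike IT-oriented companies such as Google and Microsoft, DB is a completely different world for a computer science researcher like me.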

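That is exactly how it turned out. Over this week, my strongest impression of DB has been change. Every aspect of working at a bank is completely different from working at a university. Generally speaking, when change first happens, people instinctively react negatively. I am no exception. For example, I still cannot understand why everyone has to wear a suit to work every day, why nobody listens to music while working, or why I cannot access Gmail at the office. At DB, even after spending just one day there, you can strongly feel the rules and restrictions all around you. This is very different from the free atmosphere of a university. It is fair to say that the efficiency problems that arise in the company are institutional, not personal; DB's employees are all excellent. Perhaps for a huge enterprise like DB these rules and restrictions are necessary? The answer is obviously yes; I just have not come to appreciate it yet.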

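Many interns who have just arrived at DB complain that they have no real tasks. In this respect I am lucky. I had good discussions with my DB managers beforehand; they have a fairly good understanding of both my expertise and their own needs, so my work at DB is well defined (my direct manager at DB holds a PhD in computer science from NYU, majoring in Information Systems). I can make full use of my expertise in data mining and do not feel that it is being wasted. Another thing that makes me happy is that I get access to a lot of DB's rather sensitive data (mostly related to data security). But this is not entirely a good thing: I have to handle the data very carefully and must never let it leak; besides, publishing papers based on this data in the future will probably be quite difficult.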

Written by Weiwei

09/08/2009 at 20:57

Posted in Miscellany