Data Mining Social Cyberspaces: Tools for Enhancing Online Communities
Marc Smith, Microsoft Research

Sociologist at MS Research in the Collective Technologies group. Applies data mining techniques to conversational social cyberspaces: anywhere a thread might live. Roams across Usenet with the Netscan interface (a reputation system for conversational cyberspace). A second project, Aura, uses a barcode-reading wireless handheld that lets one find metadata about scanned books and then annotate that data (aka ObjectBlogging).

The net is the best thing for sociologists. An environment of 700M network-accessible people. A vast petri dish of human interaction. The spontaneous emergence of places where association can take place. These groups then take part in collective actions that are more than the sum of their parts.

There are many tools that enable this collaboration - mail, IRC, lists, chat, IM, Usenet, web boards, wikis, blogs, MUDs, etc. Each has dimensions of variation: public, private, etc.

IT catalyzes collective action because it CHANGES the nature of core social resources like Identity, Reputation, Exchange, and individual Incentive.

New technologies do raise issues of trust and authenticity. Is the person on the other end who one believes they are, and are they thus trustworthy? Slide: advertisement for a phone system, where identity is unclear.

The Challenge: Offline it is easy to determine where people are, where they are going to or coming from, how they are grouped and what they are doing. But online it is difficult to isolate the size and composition of a crowd, or where people are in the (virtual) crowd. IRL we can gather all those signifiers, but online we can't readily do so.

The state of the art in newsgroup tools (j/k): Outlook Express.

Problem in Usenet: how does one find what one cares about? Common interfaces are very dry, not rich in user-relatable content. We don't know the whos or the veracity of those whos. Is the community active or inactive?
"An interface to one of the most social communicaties designed by some of the most antisocial people on earth" http://netscan.research.microsoft.com/ Laying of metadata across - earliest to latest, link to deeper reports, how many people to messages, returning posters, line count avg per message, number of replied messages. Try to identify core groups. Looking to add additional data - mailing lists, distribution lists. Comparable to large urban aggregations that are often studied by social scientists. The 3000+ microsoft.* groups are his core target - 1.5M authors, 9.7M messages a year. Upward trend of posts and posters (graph) Do see spam spikes and spam storms in graph. (discussion of how to ID spam, etc) [aside in timeseries data: professional newsgroups across time series display differently than entertainment newsgroups (duh)] Provide a social accounting page - proof of concept for ideas about selecting content. Show neighbor newsgroups; 1st timers, other stats, degree of crossposting. Looking for signpostings. Mine for big threads - those topics that grew, and how long they grow over time. Derive threaded view from these. Broad vs Deep threads. 67% of messages have 2 messages only. Another metric "high-value person" find me their threads. A certain quality shared by answer people. 1) number of postings that are replies 2) consistent time intervals (daily) 3) longevity of connection to group. Q) wouldn't a clever spammer just post replies? A) arms race, yes. This is a method, generically, of evaluating, but not the only way? The author track gives some level of authenticity that is useful, if incomplete. Thread tree visualisation. Time series, etc Author profiling, but 'nothing special' How many people post per month - most post only once per month; minority (2%) post more than 3days a month. Usenet tree visualization. Based on UMA research for flattened tree. Message count, with change over prior month. A halo'd dataspace for *comp*, 11K newsgroups. 
More interesting graphs of the hierarchy for traffic patterns, with time relevance. Maps of inter/intra-newsgroup linkages.

What is a healthy community? It's about getting people to come back.
Retention of leaders - more is good.
Interaction - more is good, unreplied-to is bad.
Size and growth - within reason.
Speed - should be fast.
MSFT involvement - a fuzzy read (don't want to stomp on outsiders, but need to support them as well).

Proposed my.microsoft page: aggregation of data, including a Usenet drop-in, displaying the derived reputation. My watched people, tracking of asked questions and answered questions. Release delayed until later - they want to be able to do daily updates, which they expect to be able to do by end of year.

Will be moving beyond the grid map to a 'visualised' newsgroup crowd (developed by an intern). Size of circle for poster volume, time frequency, posts per thread, etc. Comparison of newsgroups across crowd pictures.

Authorlines, showing threads created or contributed to by a person, identifying the volume. Genealogy of a particular thread, to see time started and resurgence. Providing ways to sort by multiple properties.

They watch for themselves at groups.google: were people talking about them? People are using it largely as they expected it to be used, for comparative matching.

Social implications: how does it move into the broader environment?

Aura: off-the-shelf products (PocketPC and barcode reader). Scan an object, backhaul over 802.11 to a lookup service/object database, then on to Google for information.

"Every object has a story to tell and one of those stories is that if you eat me I will kill you"

---metadata---
Add email address for bounceback below. Please annotate with additional links.
esinclai@pobox.com
etcon@crystalflame.net
jmay@pobox.com

Links referred to in talk:
Whyte's book, City: Rediscovering the Center: http://www.amazon.com/exec/obidos/ASIN/0385262094/
NetScan: http://netscan.research.microsoft.com/