EPCA Integration

Interactive Information Extraction and Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory, UMass Amherst

Motivation
- Capture confidence of records in the extracted database.
- Confidence alerts data mining to possible errors in the database.

First Name   Last Name   Confidence
Bill         Gates       0.96
Bill         banks       0.43

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Finite State Lattice
[Figure: lattice of FSM states (ORG, OTHER, ..., PERSON, TITLE) over the output sequence y_{t-1}, ..., y_{t+3}; the input sequence of observations x_{t-1}, ..., x_{t+3} is "said Arden Bement NSF Director"]

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_{y}(y_t, y_{t-1})\, \Phi_{xy}(x_t, y_t)$$

Constrained Forward-Backward
[Figure: the same lattice and input sequence]

$$p(\text{Arden Bement} = \text{PERSON} \mid x) = \frac{1}{Z(x)} \sum_{y \in C} \prod_{t=1}^{T} \Phi_{y}(y_t, y_{t-1})\, \Phi_{xy}(x_t, y_t)$$

where C is the set of label sequences consistent with the constraint.
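The constrained sum over label sequences can be computed with the same forward recursion that computes Z(x), simply by zeroing out lattice cells that violate the constraint. The following is a minimal sketch under stated assumptions, not the authors' implementation: the two-state label set, the potential values, and the names `forward_sum` and `field_confidence` are all hypothetical, and no separate start-state potential is modeled.

```python
def forward_sum(trans, emit, constraints=None):
    """Sum of unnormalized path scores through a linear-chain lattice.

    trans[i][j]: potential for moving from state i to state j.
    emit[t][j]:  potential for state j at position t.
    constraints: optional {position: required_state}; paths that violate
                 a constraint contribute zero to the sum.
    """
    constraints = constraints or {}
    n_states = len(trans)

    def allowed(t, j):
        return t not in constraints or constraints[t] == j

    # Forward variables for the first position (no start potential assumed).
    alpha = [emit[0][j] if allowed(0, j) else 0.0 for j in range(n_states)]
    for t in range(1, len(emit)):
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[t][j]
            if allowed(t, j) else 0.0
            for j in range(n_states)
        ]
    return sum(alpha)


def field_confidence(trans, emit, constraints):
    """Constrained path mass divided by the full partition function Z(x)."""
    return forward_sum(trans, emit, constraints) / forward_sum(trans, emit)


# Toy example: states 0 = OTHER, 1 = PERSON; tokens "said Arden Bement NSF Director".
trans = [[0.6, 0.4], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4], [0.7, 0.3]]
conf = field_confidence(trans, emit, {1: 1, 2: 1})  # "Arden Bement" = PERSON
print(round(conf, 3))
```

Because the numerator and denominator share the same recursion, the confidence of a field costs about as much as one extra forward pass.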

Forward-Backward Confidence Estimation improves accuracy/coverage
[Figure: accuracy/coverage curves for: optimal; our forward-backward confidence; traditional token-wise confidence; no use of confidence]

Application of Confidence Estimation
Interactive Information Extraction: to correct predictions, direct the user to the least confident field.

Interactive Information Extraction
- IE algorithm calculates confidence scores.
- UI uses confidence scores to alert the user to possible errors.
- IE algorithm takes corrections into account and propagates each correction to other fields.

User Correction
User corrects a field, e.g. dragging "Stanley" to the First Name field.
[Figure: tokens x1..x5 = "Charles Stanley 100 Charles Street" with labels y1..y5 drawn from First Name, Last Name, Address Line]

Remove Paths
User corrects a field, e.g. dragging "Stanley" to the First Name field.
[Figure: the same lattice]

Constrained Viterbi
Viterbi algorithm is constrained to pass through the designated state.
Adjacent field changed: correction propagation.
[Figure: the same lattice]

Constrained Viterbi
After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%.
Recent work reduces annotation effort further: it simplifies annotation to multiple choice.
[Figure: candidate records A) and B) with fields First Name, Last Name, City: "Bill Gates, Redmond WA" and "Bill Gates, Redmond"]

User feedback in the wild as labeling
Example text shown for both tasks:
  Seminar: How to Organize your Life
  by Jane Smith, Stevenson & Smith
  Mezzanine Level, Papadopoulos Sq
  3:30 pm Thursday March 31
  In this seminar we will learn how to use CALO to...
Labeling for Classification
- Labels: Seminar announcement, Todo request, Other
- Easy: often found in user interfaces, e.g. CALO IRIS, Apple Mail
Labeling for Extraction
- Click, drag, adjust, label, click, drag, adjust, label, ...
- Painful: difficult even for paid labelers; complex tools

Multiple-choice Annotation for Learning Extractors in the wild

[Culotta, McCallum 2005]
Task: Information extraction. Fields: NAME, COMPANY, ADDRESS (and others).
Input: Jane Smith, Stevenson & Smith, Mezzanine Level, Papadopoulos Sq.
- The interface presents the top hypothesized segmentations.
- The user corrects labels, not segmentations.
[Figure: three candidate segmentations of the input, differing in field boundaries]
Result: 29% reduction in user actions needed to train.

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

Target task: emailed seminar announcement entities over email English words. 60k words of training data. Too little labeled training data.
Example:
  GRAND CHALLENGES FOR MACHINE LEARNING
  Jaime Carbonell, School of Computer Science, Carnegie Mellon University
  3:30 pm, 7500 Wean Hall
  Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Train on a related task with more data: newswire named entities over newswire English words. 200k words of training data.
Example:
  CRICKET - MILLNS SIGNS FOR BOLAND
  CAPE TOWN 1996-08-22
  South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

At test time, label email with newswire NEs (newswire named entities over email English words), then use these labels as features for the final task (emailed seminar announcement entities over newswire named entities over email English words).

Use joint inference at test time (seminar announcement entities, newswire named entities, English words).
- An alternative to hierarchical Bayes.
- Needn't know anything about the parameterization of the subtask.
Accuracy: No transfer < Cascaded transfer < Joint inference transfer

A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
Andrew McCallum, Kedar Bellare, Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.

String Edit Distance
Distance between sequences x and y: the cost of the lowest-cost sequence of edit operations that transforms string x into y.

Applications
- Database record deduplication: "Apex International Hotel, Grassmarket Street" vs. "Apex Internatl, Grasmarket Street". Are the records duplicates of the same hotel?
- Biological sequences: AGCTCTTACGATAGAGGACTCCAGA vs. AGGTCTTACCAAAGAGGACTTCAGA
- Machine translation: "Il a acheté une pomme" vs. "He bought an apple"
- Textual entailment: "He bought a new car last night" vs. "He purchased a brand new automobile yesterday evening"

Levenshtein Distance [1966]

Edit operations (align two strings):
  copy     Copy a character from x to y            (cost 0)
  insert   Insert a character into y               (cost 1)
  delete   Delete a character from y               (cost 1)
  subst    Substitute one character for another    (cost 1)

Example: x1 = "William W. Cohon", x2 = "Willleam Cohen"

Lowest cost alignment (c = copy, i = insert, d = delete, s = subst; "-" marks a gap, "_" a space):
  x1:    W i l l - i a m _ W . _ C o h o n
  x2:    W i l l l e a m - - - _ C o h e n
  op:    c c c c i s c c d d d c c c c s c
  cost:  0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0
Total cost = 6 = Levenshtein Distance

Dynamic program
D(i,j) = score of the best alignment from x_1 ... x_i to y_1 ... y_j.

$$D(i,j) = \min \begin{cases} D(i-1,j-1) + \delta(x_i \neq y_j) \\ D(i-1,j) + 1 \\ D(i,j-1) + 1 \end{cases}$$
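The recurrence above translates directly into a table-filling routine. This is a minimal sketch with unit costs as on the slide; the function name `levenshtein` and the full-table layout are illustrative choices, not taken from the talk.

```python
def levenshtein(x: str, y: str) -> int:
    """Edit distance with copy cost 0 and insert/delete/subst cost 1."""
    rows, cols = len(x) + 1, len(y) + 1
    # D[i][j] = cost of the best alignment of x[:i] with y[:j].
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i  # align a prefix of x with the empty string
    for j in range(cols):
        D[0][j] = j  # align the empty string with a prefix of y
    for i in range(1, rows):
        for j in range(1, cols):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy or subst
                D[i - 1][j] + 1,                           # skip a character of x
                D[i][j - 1] + 1,                           # skip a character of y
            )
    return D[-1][-1]


print(levenshtein("William W. Cohon", "Willleam Cohen"))  # prints 6
```

The table has (|x|+1)(|y|+1) cells, so time and memory are both proportional to the product of the string lengths.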

DP table for "William" (columns) and "Willleam" (rows):

         W  i  l  l  i  a  m
      0  1  2  3  4  5  6  7
  W   1  0  1  2  3  4  5  6
  i   2  1  0  1  2  3  4  5
  l   3  2  1  0  1  2  3  4
  l   4  3  2  1  0  1  2  3
  l   5  4  3  2  1  1  2  3
  e   6  5  4  3  2  2  2  3
  a   7  6  5  4  3  3  2  3
  m   8  7  6  5  4  4  3  2

The best path contains one insert and one subst. Total cost = distance.

Levenshtein Distance with Markov Dependencies
Edit operations: copy, insert, delete, subst.
The cost of an operation depends on the previous operation; for example, a repeated delete is cheaper.
[Table: cost of each edit operation after a previous copy (c), insert (i), delete (d), or subst (s)]

Learn these costs from training data.
[Figure: the "William" / "Willleam" DP table extended with one layer per previous operation (subst, copy, delete, insert), giving a 3D DP table]

Ristad & Yianilos (1997)
Essentially a Pair-HMM, generating an edit/state/alignment sequence and two strings.
[Figure: alignment of string 1 "William W. Cohon" with string 2 "Willleam Cohen"; each alignment step a_t records an edit operation a.e and positions a.i1 in x1 and a.i2 in x2]

Complete data likelihood:
$$p(a, x_1, x_2) = \prod_t p(a_t \mid a_{t-1})\, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t)$$

Incomplete data likelihood (sum over all alignments consistent with x1 and x2), used as the match score:
$$p(x_1, x_2) = \sum_{a : x_1, x_2} \prod_t p(a_t \mid a_{t-1})\, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t)$$

Given a training set of matching string pairs, the objective function is
$$O = \prod_j p(x_1^{(j)}, x_2^{(j)})$$

Learn via EM:
- Expectation step: calculate the likelihood of alignment paths.
- Maximization step: make those paths more likely.

Ristad & Yianilos Regrets
- Limited features of input strings
  - Examine only a single character pair at a time
  - Difficult to use upcoming string context, lexicons, ...
  - Example: "Senator John Green" vs. "John Green"
- Limited edit operations
  - Difficult to generate arbitrary jumps in both strings
  - Example: "UMass" vs. "University of Massachusetts"
- Trained only on positive match data

  - Doesn't include information-rich near misses
  - Example: "ACM SIGIR" vs. "ACM SIGCHI"
So, consider a model trained by conditional probability.

Conditional Probability (Sequence) Models
We prefer a model that is trained to maximize a conditional probability rather than a joint probability, P(y|x) instead of P(y,x):
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.

From HMMs to (Linear-chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
State sequence s = s_1, s_2, ..., s_n and observation sequence o = o_1, o_2, ..., o_n (written y and x in the equations).
[Figure: graphical models over y_{t-1}, y_t, y_{t+1} and x_{t-1}, x_t, x_{t+1} for the joint and conditional models]

Joint:
$$P(y, x) = \prod_{t=1}^{|x|} P(y_t \mid y_{t-1})\, P(x_t \mid y_t)$$

Conditional:
$$P(y \mid x) = \frac{1}{P(x)} \prod_{t=1}^{|x|} P(y_t \mid y_{t-1})\, P(x_t \mid y_t) = \frac{1}{Z(x)} \prod_{t=1}^{|x|} \Phi_s(y_t, y_{t-1})\, \Phi_o(x_t, y_t)$$

where
$$\Phi_o(x_t, y_t) = \exp\Big(\sum_k \lambda_k f_k(y_t, x_t)\Big)$$

(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on L.
Wide-spread interest, positive experimental results in many applications.

- Noun phrase, named entity [HLT03], [CoNLL03]
- Protein structure prediction [ICML04]
- IE from bioinformatics text [Bioinformatics 04]
- Asian word segmentation [COLING04], [ACL04]
- IE from research papers [HLT04]
- Object classification in images [CVPR 04]

CRF String Edit Distance
[Figure: alignment of string 1 "William W. Cohon" with string 2 "Willleam Cohen"; each alignment step a_t records an edit operation a.e and positions a.i1 in x1 and a.i2 in x2]

Joint complete data likelihood:
$$p(a, x_1, x_2) = \prod_t p(a_t \mid a_{t-1})\, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t)$$

Conditional complete data likelihood:
$$p(a \mid x_1, x_2) = \frac{1}{Z_{x_1,x_2}} \prod_t \Phi(a_{t-1}, a_t, x_1, x_2)$$

Want to train from a set of string pairs, each labeled one of {match, non-match}:

  match       William W. Cohon    Willlleam Cohen
  non-match   Bruce D'Ambrosio    Bruce Croft
  match       Tommi Jaakkola      Tommi Jakola
  match       Stuart Russell      Stuart Russel
  non-match   Tom Dietterich      Tom Dean

CRF String Edit Distance FSM
[Figure: from a Start state, two copies of the edit-operation states (subst, copy, delete, insert): one set for match (m=1) and one for non-match (m=0)]

Conditional incomplete data likelihood, summing over the alignments S_m that stay in the states for label m:
$$p(m \mid x_1, x_2) = \frac{1}{Z_{x_1,x_2}} \sum_{a \in S_m} \prod_t \Phi(a_{t-1}, a_t, x_1, x_2)$$

Examples:
- x1 = "Tommi Jaakkola", x2 = "Tommi Jakola": probability summed over all alignments in match states = 0.8; in non-match states = 0.2.
- x1 = "Tom Dietterich", x2 = "Tom Dean": probability summed over all alignments in match states = 0.1; in non-match states = 0.9.

Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood:
$$O = \sum_j \log p(m^{(j)} \mid x_1^{(j)}, x_2^{(j)})$$

The complete log likelihood:
$$\sum_j \sum_a \log\Big( p(m^{(j)} \mid a, x_1^{(j)}, x_2^{(j)})\, p(a \mid x_1^{(j)}, x_2^{(j)}) \Big)$$

Expectation Maximization
- E-step: Estimate the distribution over alignments, $p(a \mid x_1^{(j)}, x_2^{(j)})$, using the current parameters.
- M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).
This is "conditional EM", but it avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.

Efficient Training
- The dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, it has 120,000 entries.
- Use beam search during the E-step [Pal, Sutton, McCallum 2005].
- Unlike completely observed CRFs, the objective function is not convex.
- Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

What Alignments are Learned?
[Figures: learned alignments through the match (m=1) and non-match (m=0) states for three pairs: x1 = "Tommi Jaakkola", x2 = "Tommi Jakola"; x1 = "Bruce Croft", x2 = "Tom Dean"; x1 = "Jaime Carbonell", x2 = "Jamie Callan"]

Summary of Advantages
- Arbitrary features of the input strings

  - Examine past, future context
  - Use lexicons, WordNet
- Extremely flexible edit operations
  - A single operation may make arbitrary jumps in both strings, of size determined by input features
- Discriminative training
  - Maximize ability to predict match vs. non-match

Experimental Results: Data Sets
- Restaurant name, Restaurant address
  - 864 records, 112 matches
  - E.g. "Abe's Bar & Grill, E. Main St" vs. "Abe's Grill, East Main Street"
- People names, UIS DB generator
  - Synthetic noise
  - E.g. "John Smith" vs. "Snith, John"
- CiteSeer Citations
  - In four sections: Reason, Face, Reinforce, Constraint
  - E.g. "Rusell & Norvig, Artificial Intelligence: A Modern..." vs. "Russell & Norvig, Artificial Intelligence: An Intro..."

Experimental Results: Features
- same, different
- same-alphabetic, different-alphabetic
- same-numeric, different-numeric
- punctuation1, punctuation2
- alphabet-mismatch, numeric-mismatch
- end-of-1, end-of-2
- same-next-character, different-next-character

Experimental Results: Edit Operations
- insert, delete, substitute/copy
- swap-two-characters
- skip-word-if-in-lexicon
- skip-parenthesized-words
- skip-any-word
- substitute-word-pairs-in-translation-lexicon
- skip-word-if-present-in-other-string

Experimental Results [Bilenko & Mooney 2003]
F1 (harmonic mean of precision and recall)

                      Restaurant  Restaurant  CiteSeer
  Distance metric     name        address     Reason   Face    Reinf   Constraint
  Levenshtein         0.290       0.686       0.927    0.952   0.893   0.924
  Learned Leven.      0.354       0.712       0.938    0.966   0.907   0.941
  Vector              0.365       0.380       0.897    0.922   0.903   0.923
  Learned Vector      0.433       0.532       0.924    0.875   0.808   0.913
  CRF Edit Distance   0.448       0.783       0.964    0.918   0.917   0.976

Experimental Results
Data set: person names, with word-order noise added.

                                            F1
  Without skip-if-present-in-other-string   0.856
  With skip-if-present-in-other-string      0.981

Joint Co-reference Decisions, Discriminative Model [Culotta & McCallum 2005]
[Figure: People mentions "Stuart Russell", "Stuart Russell", "S. Russel" joined by pairwise Y/N co-reference decisions]

Co-reference for Multiple Entity Types [Culotta & McCallum 2005]
[Figure: People mentions as above and Organizations mentions "University of California at Berkeley", "Berkeley", "Berkeley", each set joined by pairwise Y/N decisions]

Joint Co-reference of Multiple Entity Types [Culotta & McCallum 2005]
[Figure: the same People and Organizations mentions with joint Y/N decisions]
Reduces error by 22%.

Social network from my email

[Figure: social network graph]

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]
Generative process:
- For each document:
  - Sample a distribution over topics, θ. Example: 70% Iraq war, 30% US election.
  - For each word in the document:
    - Sample a topic, z. Example: Iraq war.
    - Sample a word from the topic, w. Example: bombing.

Example topics induced from a large collection of text [Tennenbaum et al]
Each line is one topic, words in rank order:
- JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
- SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
- BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
- FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
- STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
- MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
- DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
- WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

From LDA to Author-Recipient-Topic (ART)

Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast

Outline
- Email, motivation
- ART
- Graphical Model
- Experimental Results
  - Enron Email (corpus)
  - Academic Email (one person)
- RART: Roles for ART
- Group-Topic Model
  - Experiments on voting data
  - Voting data from U.S. Senate and the U.N.

Enron Email Corpus
- 250k email messages
- 23k people
Example message:
  Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
  From: [email protected]
  To: [email protected]
  Subject: Enron/TransAltaContract dated Jan 1, 2001
  Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
  DP
  Debra Perlingiere
  Enron North America Corp., Legal Department
  1400 Smith Street, EB 3885, Houston, Texas 77002
  [email protected]

Topics, and prominent senders / receivers discovered by ART
Topic names were assigned by hand.
[Table: topics with their most prominent senders and receivers]
- Beck = Chief Operations Officer
- Dasovich = Government Relations Executive
- Shapiro = Vice President of Regulatory Affairs
- Steffes = Vice President of Government Affairs
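The LDA generative process summarized earlier in the deck (sample a per-document distribution over topics, then a topic and a word for each token) can be sketched in a few lines. This is a toy illustration only: the vocabulary, the two topic-word distributions, the Dirichlet parameter, and all function names are assumptions made for the example, and it covers plain LDA, not the ART extension.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn according to the probabilities in probs."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point rounding


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: theta, then (topic z, word w) per token."""
    theta = sample_dirichlet(alpha, rng)            # distribution over topics
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)          # sample a topic
        w = sample_categorical(topic_word[z], rng)  # sample a word from it
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


rng = random.Random(0)
vocab = ["iraq", "war", "bombing", "election", "vote", "senate"]
topic_word = [
    [0.5, 0.25, 0.25, 0.0, 0.0, 0.0],  # hypothetical "Iraq war" topic
    [0.0, 0.0, 0.0, 0.5, 0.25, 0.25],  # hypothetical "US election" topic
]
theta, topics, words = generate_document(topic_word, vocab, [0.5, 0.5], 10, rng)
print(theta, words)
```

Inference runs this process in reverse: given only the words, Gibbs sampling resamples each token's topic assignment z in turn.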

Comparing Role Discovery
connection strength (A,B) is computed from:
- Traditional SNA: distribution over recipients
- ART: distribution over authored topics
- Author-Topic: distribution over authored topics

Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
- Traditional SNA: similar roles
- ART: different roles
- Author-Topic: different roles
Geaconne = Secretary; McCarty = Vice President

Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
- Traditional SNA: different roles
- ART: not very similar
- Author-Topic: very similar
Geaconne = Secretary; Hayslett = Vice President & CTO

Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
- Traditional SNA: different roles
- ART: very similar
- Author-Topic: very different
Blair = Gas pipeline logistics; Watson = Pipeline facilities planning

McCallum Email Corpus 2004
- January - October 2004
- 23k email messages
- 825 people
Example message:
  From: [email protected]
  Subject: NIPS and ....
  Date: June 14, 2004 2:27:41 PM EDT
  To: [email protected]
  There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate

McCallum Email Blockstructure
[Figure]

Four most prominent topics in discussions with ____?
[Figure]

Two most prominent topics in discussions with ____?

Topic 1
  Words: love, house, time, great, hope, dinner, saturday, left, ll, visit, evening, stay, bring, weekend, road, sunday, kids, flight
  Prob: 0.030514, 0.015402, 0.013659, 0.012351, 0.011334, 0.011043, 0.00959, 0.009154, 0.009154, 0.009009, 0.008282, 0.008137, 0.008137, 0.007847, 0.007701, 0.007411, 0.00712, 0.006829, 0.006539, 0.006539

Topic 2
  Words: today, tomorrow, time, ll, meeting, week, talk, meet, morning, monday, back, call, free, home, won, day, hope, leave, office, tuesday
  Prob: 0.051152, 0.045393, 0.041289, 0.039145, 0.033877, 0.025484, 0.024626, 0.023279, 0.022789, 0.020767, 0.019358, 0.016418, 0.015621, 0.013967, 0.013783, 0.01311, 0.012987, 0.012987, 0.012742, 0.012558

Role-Author-Recipient-Topic Models
[Figure]

Results with RART: People in "Role #3" in Academic Email
  olc        lead Linux sysadmin
  gauthier   sysadmin for CIIR group
  irsystem   mailing list CIIR sysadmins
  system     mailing list for dept. sysadmins
  allan      Prof., chair of computing committee
  valerie    second Linux sysadmin
  tech       mailing list for dept. hardware
  steve      head of dept. I.T. support

Roles for allan (James Allan)
- Role #3: I.T. support
- Role #2: Natural Language researcher

Roles for pereira (Fernando Pereira)
- Role #2: Natural Language researcher
- Role #4: SRI CALO project participant
- Role #6: Grant proposal writer
- Role #10: Grant proposal coordinator
- Role #8: Guests at McCallum's house

ART & RART: Roles but not Groups
Enron TransWestern Division:
- Traditional SNA: block structured
- ART: not
- Author-Topic: not

A Group Model: Stochastic Blockstructures Model
[Figure]

Group-Topic Model [Wang, Mohanty, McCallum 2005]

[Figure]

U.S. Senate Data Set
- 3426 bills from 16 years of voting records from the U.S. Senate
- Yea / Nay / Abstain (absent) votes
- Each bill comes with an abstract (text describing the contents of the bill).

Topics Discovered
[Table: topics from a traditional mixture of unigrams vs. the Group-Topic Model]

Groups Discovered
[Figure: agreement index; groups from topic Education + Domestic]

Senators who change Coalition Dependent on Topic
E.g. Senator Shelby (D-AL) votes
- with the Republicans on Economic
- with the Democrats on Education + Domestic
- with a small group of maverick Republicans on Social Security + Medicaid

U.N. Data Set
- 931 U.N. resolutions, voted on by 192 countries, from 1990-2003
- Yes / No / Abstain votes
- A list of keywords summarizes the content of each resolution.
- Also experiments later with resolutions from 1960-2003

Topics Discovered
[Table: topics from a traditional mixture of unigrams vs. the Group-Topic Model]

Groups Discovered
[Figure]

Groups and Topics, Trends over Time
[Figure]
