Designing ML - Week 4!

michelle.carney@berkeley.edu

This is a jupyter notebook on clustering meetup.com data! In this notebook, I have used the location and group information to cluster the members into 6 clusters - but it is up to you to figure out what they mean!

In [1]:
#these are all of the libraries i'll be using - and I load the groups.csv data
import pandas as pd
import numpy as np
import random
import sklearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
df = pd.read_csv('groups.csv')

Data Step

This step we're going to look at our dataframes and become familiar with what's in them. This data was collected via meetup.com API in Dec 2017

In [2]:
#this is what the groups.csv looks like as a dataframe - it is about the groups
df.head()
Out[2]:
group_id category_id category.name category.shortname city_id city country created description group_photo.base_url ... organizer.photo.photo_link organizer.photo.thumb_link organizer.photo.type rating state timezone urlname utc_offset visibility who
0 6388 14 health/wellbeing health-wellbeing 10001 New York US 2002-11-21 16:50:46 Those who practice or hold a strong interest i... https://secure.meetupstatic.com ... https://secure.meetupstatic.com/photos/member/... https://secure.meetupstatic.com/photos/member/... member 4.39 NY US/Eastern alternative-health-nyc -14400 public Explorers of Health
1 6510 4 community/environment community-environment 10001 New York US 2003-05-20 14:48:54 The New York Alternative Energy Meetupis for t... https://secure.meetupstatic.com ... https://secure.meetupstatic.com/photos/member/... https://secure.meetupstatic.com/photos/member/... member 4.31 NY US/Eastern alternative-energy-meetup -14400 public Clean Energy Supporters
2 8458 26 pets/animals pets-animals 10001 New York US 2004-03-27 09:55:41 not_found https://secure.meetupstatic.com ... https://secure.meetupstatic.com/photos/member/... https://secure.meetupstatic.com/photos/member/... member 4.84 NY US/Eastern Animals -14400 public Animal Voices
3 8940 29 sci-fi/fantasy sci-fi-fantasy 10001 New York US 2002-11-16 04:49:16 Welcome to the The New York City Anime Meetup ... https://secure.meetupstatic.com ... https://secure.meetupstatic.com/photos/member/... https://secure.meetupstatic.com/photos/member/... member 4.46 NY US/Eastern NYC-Anime -14400 public Anime Fans
4 10104 26 pets/animals pets-animals 10001 New York US 2003-10-22 21:39:49 We welcome those who support pits, even if you... https://secure.meetupstatic.com ... https://secure.meetupstatic.com/photos/member/... https://secure.meetupstatic.com/photos/member/... member 4.09 NY US/Eastern NYC-Pitbull -14400 public_limited NYC Pits & People, Dog Lovers

5 rows × 36 columns

In [3]:
#this is the df about the members
df2 = pd.read_csv('members.csv', encoding = "ISO-8859-1")
In [4]:
df2.head()
Out[4]:
member_id bio city country hometown joined lat link lon member_name state member_status visited group_id
0 3 not_found New York us New York, NY 2007-05-01 22:04:37 40.72 http://www.meetup.com/members/3 -74.0 Matt Meeker NY active 2009-09-18 18:32:23 490552
1 3 not_found New York us New York, NY 2011-01-23 14:13:17 40.72 http://www.meetup.com/members/3 -74.0 Matt Meeker NY active 2011-03-20 01:02:11 1474611
2 3 Hi, I'm Matt. I'm an entrepreneur who has star... New York us New York, NY 2010-12-30 18:47:34 40.72 http://www.meetup.com/members/3 -74.0 Matt Meeker NY active 2011-01-18 20:37:23 1490492
3 3 Hi, I'm Matt. I'm an entrepreneur who has star... New York us New York, NY 2011-01-03 14:45:21 40.72 http://www.meetup.com/members/3 -74.0 Matt Meeker NY active 2011-07-23 03:42:28 1515830
4 3 Hi, I'm Matt. I'm an entrepreneur who has star... New York us New York, NY 2010-12-30 18:34:50 40.72 http://www.meetup.com/members/3 -74.0 Matt Meeker NY active 2011-06-13 18:33:23 1574965

Data Cleaning Phase

In this phase, we'll sample a % of the data, and then use "one hot" encoding to turn string features into numbers for our mathematical models! (read more here: http://www.insightsbot.com/blog/zuyVu/python-one-hot-encoding-with-pandas-made-simple)

In [7]:
# I am sampling to 50,000 random samples from the dataframe since it is just SOOO big! This will help my code run faster
df2_sample = df2.sample(n=50000)
In [12]:
#let's explore the df by member id and the first record number of each - what are the features we want to use?
df2_sample.groupby(['member_id']).first()
Out[12]:
bio city country hometown joined lat link lon member_name state member_status visited group_id
member_id
3 not_found New York us New York, NY 2007-05-01 22:04:37 40.72 http://www.meetup.com/members/3 -74.00 Matt Meeker NY active 2009-09-18 18:32:23 490552
6 Community organizer New York us not_found 2013-09-11 00:42:06 40.73 http://www.meetup.com/members/6 -74.00 Scott Heiferman NY active 2014-09-20 12:28:38 113455
36 not_found New York us New York 2010-07-27 18:44:24 40.80 http://www.meetup.com/members/36 -73.97 Mark Hurst NY active 2013-06-26 13:31:37 703741
65 I work on Go at Google. San Francisco us Portland 2012-03-20 05:29:10 37.74 http://www.meetup.com/members/65 -122.44 Brad Fitzpatrick CA active 2017-06-03 06:22:28 2701562
82 I write code for a living and occasionally dab... San Francisco us NY / SF 2014-05-16 23:24:44 37.78 http://www.meetup.com/members/82 -122.42 Maggie Nelson CA active 2014-05-16 23:24:44 1811614
117 not_found New York us not_found 2002-06-16 17:10:32 40.75 http://www.meetup.com/members/117 -73.99 DaveVockell NY active 2004-08-05 02:48:10 131291
150 not_found New York us New York 2008-05-12 20:29:12 40.71 http://www.meetup.com/members/150 -74.02 Rex Sorgatz NY active 2017-01-06 21:16:26 272793
176 not_found San Francisco us not_found 2006-04-01 20:36:27 37.79 http://www.meetup.com/members/176 -122.40 Cal Henderson CA active 2014-05-12 21:18:22 120903
210 Chris Kramer - I've been running affiliate pro... New York us not_found 2008-03-12 00:11:27 40.74 http://www.meetup.com/members/210 -74.00 chris kramer NY active 2013-09-28 11:38:07 255307
227 not_found San Francisco us not_found 2003-07-10 22:08:47 37.79 http://www.meetup.com/members/227 -122.41 Patrick Breitenbach CA active 2011-07-26 21:34:33 54659
335 not_found San Francisco us Grass Valley, CA 2013-04-22 19:02:36 37.78 http://www.meetup.com/members/335 -122.42 Barak CA active 2016-07-10 17:30:51 107592
428 not_found San Francisco us San Francisco 2012-01-16 00:15:20 37.72 http://www.meetup.com/members/428 -122.44 David Pippenger CA active 2012-01-16 00:15:20 1060260
819 not_found Chicago us Chicago 2015-03-19 04:00:42 42.01 http://www.meetup.com/members/819 -87.74 James IL active 2015-03-19 04:00:42 514628
848 not_found New York us not_found 2015-03-25 21:19:29 40.75 http://www.meetup.com/members/848 -73.99 alex chan NY active 2016-06-02 14:00:11 87095
883 not_found San Francisco us Boston 2014-09-18 21:43:44 37.80 http://www.meetup.com/members/883 -122.44 Todd Agulnick CA active 2014-10-10 19:42:14 17009192
887 I'm a software developer consultant with Thoug... New York us Dallas 2012-10-30 17:43:59 40.76 http://www.meetup.com/members/887 -73.97 Kris NY active 2013-02-07 00:37:16 1777521
1230 Hi, I'm Ryan... I like hacking hardware, 3d pr... San Francisco us San Francisco 2013-12-15 19:59:59 37.78 http://www.meetup.com/members/1230 -122.46 ryan nelson CA active 2014-03-04 22:28:34 1240980
1502 not_found New York us New York 2009-10-08 06:31:31 40.72 http://www.meetup.com/members/1502 -73.98 Anil NY active 2009-10-08 06:31:31 1282709
1581 not_found San Francisco us not_found 2012-08-06 18:52:59 37.78 http://www.meetup.com/members/1581 -122.44 Mark Ballew CA active 2017-05-23 05:21:55 1788730
1945 not_found Chicago us Chicago 2005-06-30 13:15:46 41.94 http://www.meetup.com/members/1945 -87.65 Ben IL active 2008-04-14 21:52:45 107575
2629 Co-founder of awe.sm. I do databases and anyth... San Francisco us not_found 2012-03-15 23:01:22 37.75 http://www.meetup.com/members/2629 -122.42 Seldo CA active 2017-09-28 20:38:26 107604
2889 I've done a little bit of everything in the di... New York us not_found 2010-08-26 00:25:22 40.74 http://www.meetup.com/members/2889 -73.99 Lisa NY active 2010-09-08 18:42:58 1642043
3045 Hello there. I'm an entrepreneur and developer... San Francisco us USA 2011-03-30 00:39:36 37.79 http://www.meetup.com/members/3045 -122.41 Neil Mansilla CA active 2013-05-28 21:15:17 54659
3402 not_found New York us NYC 2016-06-10 20:00:51 40.79 http://www.meetup.com/members/3402 -73.95 Chaz Antonelli NY active 2017-07-15 14:56:01 107592
3588 not_found San Francisco us not_found 2015-03-16 18:14:28 37.78 http://www.meetup.com/members/3588 -122.42 David Gustafson CA active 2015-10-03 20:59:39 230033
3705 not_found Chicago us not_found 2011-06-06 15:54:08 41.97 http://www.meetup.com/members/3705 -87.70 Jeremy McMillan IL active 2015-02-15 16:08:22 192016
3735 Frequent commuter and occasional weekend rider San Francisco us not_found 2013-05-05 02:06:42 37.79 http://www.meetup.com/members/3735 -122.40 Chris CA active 2013-09-29 17:55:27 618694
3811 not_found San Francisco us not_found 2016-01-08 19:32:37 37.79 http://www.meetup.com/members/3811 -122.40 David Barr CA active 2016-01-26 19:35:01 19253477
3944 not_found San Francisco us not_found 2007-06-25 09:08:38 37.77 http://www.meetup.com/members/3944 -122.44 Liz Dizon CA active 2016-05-09 20:51:30 228852
3999 Rubyist at Goldbely.com New York us New York 2011-01-19 18:13:57 40.75 http://www.meetup.com/members/3999 -73.99 Trevor Stow NY active 2017-08-14 15:32:03 1768544
... ... ... ... ... ... ... ... ... ... ... ... ... ...
240816906 not_found San Francisco us not_found 2017-11-09 06:07:09 37.78 http://www.meetup.com/members/240816906 -122.42 Christi Spann CA active 2017-11-09 06:07:09 25137308
240817162 not_found New York us not_found 2017-11-09 06:13:59 40.75 http://www.meetup.com/members/240817162 -73.99 GOKTUG KASAL NY active 2017-11-09 06:13:59 18899254
240817878 not_found San Francisco us not_found 2017-11-09 06:32:18 37.77 http://www.meetup.com/members/240817878 -122.40 Tanay Rashinkar CA active 2017-11-09 06:32:18 18825676
240818081 not_found New York us not_found 2017-11-09 06:38:56 40.75 http://www.meetup.com/members/240818081 -73.99 Kevin Wright NY active 2017-11-09 06:38:56 20197789
240819321 not_found New York us not_found 2017-11-09 07:08:11 40.75 http://www.meetup.com/members/240819321 -73.99 Jungyoon Kim NY active 2017-11-09 07:08:11 19435902
240820767 not_found New York us not_found 2017-11-09 07:47:04 40.72 http://www.meetup.com/members/240820767 -73.98 Dannah Gottlieb NY active 2017-11-09 07:47:04 24834040
240823125 not_found New York us not_found 2017-11-09 08:42:40 40.75 http://www.meetup.com/members/240823125 -73.99 Christine Pandjaitan NY active 2017-11-09 08:42:40 23695230
240830560 Parts Sales Rep @ Standard Equipment Company &... Chicago us not_found 2017-11-09 11:46:56 41.94 http://www.meetup.com/members/240830560 -87.75 Mike Kowalczyk IL active 2017-11-09 11:46:56 24317440
240830739 not_found Chicago us not_found 2017-11-09 13:12:54 41.70 http://www.meetup.com/members/240830739 -87.66 Jalen Onorati IL active 2017-11-09 13:12:54 7508692
240833111 not_found Chicago us not_found 2017-11-09 12:44:06 41.92 http://www.meetup.com/members/240833111 -87.65 Leaquat Hassan Junu IL active 2017-11-09 12:44:06 24252421
240833981 not_found New York us not_found 2017-11-09 13:04:26 40.75 http://www.meetup.com/members/240833981 -73.99 Daniel Valcourt NY active 2017-11-09 13:04:26 25484015
240834173 not_found New York us not_found 2017-11-09 13:04:22 40.75 http://www.meetup.com/members/240834173 -73.98 Felix NY active 2017-11-09 13:04:22 21016346
240835211 not_found New York us not_found 2017-11-09 13:25:49 40.75 http://www.meetup.com/members/240835211 -73.99 Vijay Shingala NY active 2017-11-09 13:25:49 26327411
240837395 not_found New York us not_found 2017-11-09 14:09:47 40.75 http://www.meetup.com/members/240837395 -73.99 Brooke Noell NY active 2017-11-09 14:09:47 20343769
240837474 not_found New York us not_found 2017-11-09 14:15:49 40.75 http://www.meetup.com/members/240837474 -73.99 Joseph Cahill NY active 2017-11-09 14:15:49 23412860
240838597 not_found New York us not_found 2017-11-09 14:43:07 40.75 http://www.meetup.com/members/240838597 -73.98 Kary Herrera NY active 2017-11-09 14:43:07 24834040
240838614 not_found New York us not_found 2017-11-09 14:33:20 40.74 http://www.meetup.com/members/240838614 -73.99 Andriana NY active 2017-11-09 14:33:20 860035
240840567 not_found New York us not_found 2017-11-09 15:03:19 40.75 http://www.meetup.com/members/240840567 -73.99 Aileen Z NY active 2017-11-09 15:03:19 26327411
240840580 not_found New York us not_found 2017-11-09 15:13:39 40.75 http://www.meetup.com/members/240840580 -73.99 Stefaniya Lexandrovna NY active 2017-11-09 15:13:39 20648888
240841318 not_found Chicago us not_found 2017-11-09 15:18:18 41.88 http://www.meetup.com/members/240841318 -87.62 Altan Erdemir IL active 2017-11-09 15:18:18 26071452
240841346 not_found New York us not_found 2017-11-09 15:18:04 40.75 http://www.meetup.com/members/240841346 -73.99 Missy Smith NY active 2017-11-09 15:18:04 25815190
240841863 not_found New York us not_found 2017-11-09 15:32:37 40.75 http://www.meetup.com/members/240841863 -73.99 Tara M. NY active 2017-11-09 15:32:37 25815190
240842594 not_found New York us not_found 2017-11-09 15:38:53 40.75 http://www.meetup.com/members/240842594 -73.99 Jade Wang NY active 2017-11-09 15:38:53 20167049
240842680 not_found San Francisco us not_found 2017-11-09 15:55:24 37.77 http://www.meetup.com/members/240842680 -122.41 Liviu-Marian Negrila CA active 2017-11-09 15:55:24 20234705
240842986 not_found Chicago us not_found 2017-11-09 15:48:47 41.94 http://www.meetup.com/members/240842986 -87.65 Trisha Orozco IL active 2017-11-09 15:48:47 23270826
240845614 not_found New York us not_found 2017-11-09 16:39:43 40.84 http://www.meetup.com/members/240845614 -73.94 Priya NY active 2017-11-09 16:39:43 23738973
240845866 not_found New York us not_found 2017-11-09 16:40:42 40.75 http://www.meetup.com/members/240845866 -73.99 Eric Seaman NY active 2017-11-09 16:40:42 25783205
240846998 not_found New York us not_found 2017-11-09 16:51:53 40.75 http://www.meetup.com/members/240846998 -73.99 Janeille Pita NY active 2017-11-09 16:51:53 20979932
240849026 not_found New York us not_found 2017-11-09 17:24:14 40.81 http://www.meetup.com/members/240849026 -73.95 HU Yang NY active 2017-11-09 17:24:14 26298738
240852081 not_found New York us not_found 2017-11-09 18:18:05 40.71 http://www.meetup.com/members/240852081 -74.00 James Weitz NY active 2017-11-09 18:18:05 26226036

1087923 rows × 13 columns

In [48]:
#One feature I want to use is the GROUP ID - one thing that we can do is "get dummies" or "one-hot encoding" to 
#turn string variables into numbers! look at it below
df2_sample_dummies = pd.get_dummies(df2_sample['group_id'], prefix = 'group_id')
In [11]:
# df2_sample_dummies_first = df2_sample_dummies.groupby(['member_id']).first()
In [7]:
#this is what it looks like to have "dummies" or one-hot encoded variables! 
#http://www.insightsbot.com/blog/zuyVu/python-one-hot-encoding-with-pandas-made-simple 
df2_dummies.head()
Out[7]:
group_id_6388 group_id_6510 group_id_8458 group_id_8940 group_id_12542 group_id_12907 group_id_14573 group_id_15324 group_id_16620 group_id_17921 ... group_id_26371769 group_id_26372763 group_id_26373602 group_id_26374579 group_id_26374655 group_id_26375445 group_id_26376543 group_id_26377698 group_id_26378067 group_id_26378128
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 12546 columns

In [10]:
#Let's combine it back to our original dataframe 
df2_sample_dummies_concat = pd.concat([df2_sample, df2_sample_dummies], axis=1)
In [12]:
df2_sample_dummies_concat.head()
Out[12]:
member_id bio city country hometown joined lat link lon member_name ... group_id_26344309 group_id_26347789 group_id_26350181 group_id_26350972 group_id_26352410 group_id_26355546 group_id_26361954 group_id_26365189 group_id_26371769 group_id_26372763
5715312 234880949 not_found San Francisco us not_found 2017-08-29 18:43:26 37.78 http://www.meetup.com/members/234880949 -122.42 Justine Jennings ... 0 0 0 0 0 0 0 0 0 0
49395 1831033 not_found San Francisco us not_found 2011-04-29 05:36:18 37.77 http://www.meetup.com/members/1831033 -122.40 Ines Sombra ... 0 0 0 0 0 0 0 0 0 0
1177488 15422371 looking forward to playing more soccer Chicago us Chicago 2013-02-01 05:41:47 41.92 http://www.meetup.com/members/15422371 -87.70 Enrique ... 0 0 0 0 0 0 0 0 0 0
2243458 101653742 not_found New York us not_found 2013-07-10 18:53:45 40.76 http://www.meetup.com/members/101653742 -73.99 M ... 0 0 0 0 0 0 0 0 0 0
5738270 235589417 not_found New York us not_found 2017-09-08 02:03:22 40.80 http://www.meetup.com/members/235589417 -73.97 Yuanyuan (Yoannie) Lei ... 0 0 0 0 0 0 0 0 0 0

5 rows × 8200 columns

In [15]:
#let's repeat the same process for the "cities" feature
df2_sample_dummies_cities = pd.get_dummies(df2_sample_dummies_concat['city'], prefix = 'cities_')
df2_sample_dummies_concat_cities2 = pd.concat([df2_sample_dummies_concat, df2_sample_dummies_cities], axis=1)
In [16]:
df2_sample_dummies_concat_cities2.head()
Out[16]:
member_id bio city country hometown joined lat link lon member_name ... cities__Chicago cities__Chicago Heights cities__Chicago Ridge cities__East Chicago cities__New York cities__North Chicago cities__San Francisco cities__South San Francisco cities__West Chicago cities__West New York
5715312 234880949 not_found San Francisco us not_found 2017-08-29 18:43:26 37.78 http://www.meetup.com/members/234880949 -122.42 Justine Jennings ... 0 0 0 0 0 0 1 0 0 0
49395 1831033 not_found San Francisco us not_found 2011-04-29 05:36:18 37.77 http://www.meetup.com/members/1831033 -122.40 Ines Sombra ... 0 0 0 0 0 0 1 0 0 0
1177488 15422371 looking forward to playing more soccer Chicago us Chicago 2013-02-01 05:41:47 41.92 http://www.meetup.com/members/15422371 -87.70 Enrique ... 1 0 0 0 0 0 0 0 0 0
2243458 101653742 not_found New York us not_found 2013-07-10 18:53:45 40.76 http://www.meetup.com/members/101653742 -73.99 M ... 0 0 0 0 1 0 0 0 0 0
5738270 235589417 not_found New York us not_found 2017-09-08 02:03:22 40.80 http://www.meetup.com/members/235589417 -73.97 Yuanyuan (Yoannie) Lei ... 0 0 0 0 1 0 0 0 0 0

5 rows × 8210 columns

In [17]:
#write it to a CSV before your kernel dies! this could be helpful if you want to use the same sample again in the future
# df2_sample_dummies_concat_cities2.to_csv('members2.csv')
In [38]:
#print(list(df2_sample_dummies_concat_cities2.columns.values))

Training the model

I am deciding that we use groupIDs and Cities as our main features to train the model - and since we want the data itself to group and tell us what the major groups are - we are going to use a technique called "clustering" (or k-means clustering, where k= the number of clusters). I don't know in advance what the best number of clusters will be, so I will try a bunch of different k (k= 2,4,6,8) and then see how well our clusters are performing. See below :)

In [21]:
df2_sample_dummies_concat_cities2_train = df2_sample_dummies_concat_cities2.loc[:, 'group_id_6388':'cities__West New York']
df2_sample_dummies_concat_cities2_train.head()
Out[21]:
group_id_6388 group_id_6510 group_id_8458 group_id_8940 group_id_12542 group_id_12907 group_id_14573 group_id_15324 group_id_17921 group_id_18843 ... cities__Chicago cities__Chicago Heights cities__Chicago Ridge cities__East Chicago cities__New York cities__North Chicago cities__San Francisco cities__South San Francisco cities__West Chicago cities__West New York
5715312 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
49395 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
1177488 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
2243458 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
5738270 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0

5 rows × 8196 columns

In [22]:
#k = 8 training model
km = KMeans(n_clusters=8)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k8 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
CPU times: user 6min 12s, sys: 2min 54s, total: 9min 7s
Wall time: 10min 23s
In [ ]:
# km = KMeans(n_clusters=7)
# %time km.fit(df2_sample_dummies_concat_cities2_train)
# clusters = km.labels_.tolist()
# silhouette_k7 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:10000], clusters[0:10000])
In [23]:
#k=6 training model
km = KMeans(n_clusters=6)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k6 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
CPU times: user 5min 21s, sys: 2min 37s, total: 7min 59s
Wall time: 8min 54s
In [ ]:
# km = KMeans(n_clusters=5)
# %time km.fit(df2_sample_dummies_concat_cities2_train)
# clusters = km.labels_.tolist()
# silhouette_k5 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:10000], clusters[0:10000])
In [24]:
# k = 4 training model
km = KMeans(n_clusters=4)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k4 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
CPU times: user 4min 34s, sys: 2min 21s, total: 6min 55s
Wall time: 8min
In [ ]:
# km = KMeans(n_clusters=3)
# %time km.fit(df2_sample_dummies_concat_cities2_train)
# clusters = km.labels_.tolist()
# silhouette_k3 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:10000], clusters[0:10000])
In [25]:
# k = 2 training model
km = KMeans(n_clusters=2)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k2 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
CPU times: user 3min 51s, sys: 2min 43s, total: 6min 34s
Wall time: 8min 23s

Model Evaluation

Ok now that we've trained 4 models on different size clusters (different k) and we calculated a silhouette coefficient - The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

We can plot the number of clusters by the silhouette score and use the elbow method (visually looking at the data to see where's an "elbow") to see which cluster does the best. The Elbow method is a method of interpretation and validation of consistency within cluster analysis designed to help finding the appropriate number of clusters in a dataset.

http://www.awesomestats.in/python-cluster-validation/

In [26]:
# silhouette = [silhouette_k2, silhouette_k3, silhouette_k4, silhouette_k5, silhouette_k6, silhouette_k7, silhouette_k8]
# count_k = [2, 3, 4, 5, 6, 7, 8]

silhouette = [silhouette_k2, silhouette_k4, silhouette_k6, silhouette_k8]
count_k = [2, 4, 6, 8]

count_silhouette = list(zip(count_k, silhouette))
print(count_silhouette)
[(2, 0.20347691177588109), (4, 0.23169483895092327), (6, 0.075606478197469767), (8, 0.0028336940867501044)]
In [27]:
plt.plot(*zip(*count_silhouette))
Out[27]:
[<matplotlib.lines.Line2D at 0x11947ee48>]

From this plot - I am going to go with k=6 being the "elbow" of the data - it is doing the best in terms of clustering and not significantly better than k=8 clusters.

In [29]:
km = KMeans(n_clusters=6)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters6 = km.labels_.tolist()
silhouette_k6 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
CPU times: user 5min 8s, sys: 2min 39s, total: 7min 47s
Wall time: 8min 40s
In [32]:
#Let's assign these clusters back to the original df and take a look!
df2_sample_dummies_concat_cities2_train.loc[:, "cluster_number"] = clusters6
In [34]:
df2_sample_dummies_concat_cities2_train.head()
Out[34]:
group_id_6388 group_id_6510 group_id_8458 group_id_8940 group_id_12542 group_id_12907 group_id_14573 group_id_15324 group_id_17921 group_id_18843 ... cities__Chicago Heights cities__Chicago Ridge cities__East Chicago cities__New York cities__North Chicago cities__San Francisco cities__South San Francisco cities__West Chicago cities__West New York cluster_number
5715312 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
49395 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
1177488 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
2243458 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
5738270 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 8197 columns

In [36]:
#it is important to investigate how many samples are in each of your clusters - we can seee here that the first 3 
#clusters have WAAAY more samples than the last 3! So, when we plot our visualizations, let's see what makes them
#so different!
df2_sample_dummies_concat_cities2_train["cluster_number"].value_counts()
Out[36]:
0    27013
1    12500
2    10266
3      106
4       91
5       24
Name: cluster_number, dtype: int64

Model Output

Now that we've decided on k=6 clusters, let's assign the cluster labels back to the original data, and make it interpretable!

In [39]:
df2_sample.head()
Out[39]:
member_id bio city country hometown joined lat link lon member_name state member_status visited group_id
5715312 234880949 not_found San Francisco us not_found 2017-08-29 18:43:26 37.78 http://www.meetup.com/members/234880949 -122.42 Justine Jennings CA active 2017-09-05 17:34:49 4260482
49395 1831033 not_found San Francisco us not_found 2011-04-29 05:36:18 37.77 http://www.meetup.com/members/1831033 -122.40 Ines Sombra CA active 2015-11-29 04:23:38 1811614
1177488 15422371 looking forward to playing more soccer Chicago us Chicago 2013-02-01 05:41:47 41.92 http://www.meetup.com/members/15422371 -87.70 Enrique IL active 2017-09-06 23:02:43 565564
2243458 101653742 not_found New York us not_found 2013-07-10 18:53:45 40.76 http://www.meetup.com/members/101653742 -73.99 M NY active 2016-11-07 20:56:57 2662432
5738270 235589417 not_found New York us not_found 2017-09-08 02:03:22 40.80 http://www.meetup.com/members/235589417 -73.97 Yuanyuan (Yoannie) Lei NY active 2017-09-08 02:03:22 8639012
In [40]:
df2_sample_dummies_concat_cities2_train.head()
Out[40]:
group_id_6388 group_id_6510 group_id_8458 group_id_8940 group_id_12542 group_id_12907 group_id_14573 group_id_15324 group_id_17921 group_id_18843 ... cities__Chicago Heights cities__Chicago Ridge cities__East Chicago cities__New York cities__North Chicago cities__San Francisco cities__South San Francisco cities__West Chicago cities__West New York cluster_number
5715312 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
49395 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
1177488 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
2243458 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
5738270 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0

5 rows × 8197 columns

In [41]:
df2_sample.loc[:, "cluster_number"] = clusters6
In [42]:
df2_sample_merged = df2_sample.merge(df[['group_id', 'category.shortname']], on=['group_id'])

Tadah! We have a merged dataframe of members, clustered by their city and groups they're interested in, merged on what the group categories are (from the original groups df) and we can now export this and explore!

In [43]:
df2_sample_merged.head()
Out[43]:
member_id bio city country hometown joined lat link lon member_name state member_status visited group_id cluster_number category.shortname
0 234880949 not_found San Francisco us not_found 2017-08-29 18:43:26 37.78 http://www.meetup.com/members/234880949 -122.42 Justine Jennings CA active 2017-09-05 17:34:49 4260482 1 socializing
1 204944223 not_found San Francisco us not_found 2016-05-11 14:15:36 37.76 http://www.meetup.com/members/204944223 -122.48 Leslie W CA active 2017-05-05 23:19:38 4260482 1 socializing
2 235052959 not_found San Francisco us not_found 2017-08-28 03:03:10 37.72 http://www.meetup.com/members/235052959 -122.44 Lauren Waterman CA active 2017-08-28 03:03:10 4260482 1 socializing
3 118747522 Love walking! San Francisco us not_found 2014-06-05 01:12:03 37.78 http://www.meetup.com/members/118747522 -122.42 Hannah K CA active 2014-11-30 21:56:50 4260482 1 socializing
4 81700682 not_found San Francisco us not_found 2015-05-23 21:43:23 37.76 http://www.meetup.com/members/81700682 -122.44 Charlotte CA active 2017-09-24 06:09:25 4260482 1 socializing
In [47]:
# This is the final file you will be using for this assignment to explore :) 
df2_sample_merged.to_csv('members_cluster_group.csv')