This is a Jupyter notebook on clustering meetup.com data! In this notebook, I use the location and group information to cluster the members into 6 clusters - but it is up to you to figure out what they mean!
#these are all of the libraries I'll be using - and I load the groups.csv data
import pandas as pd
import numpy as np
import random
import sklearn
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
df = pd.read_csv('groups.csv')
In this step we're going to look at our dataframes and become familiar with what's in them. This data was collected via the meetup.com API in December 2017.
#this is what the groups.csv looks like as a dataframe - it is about the groups
df.head()
#this is the df about the members
df2 = pd.read_csv('members.csv', encoding = "ISO-8859-1")
df2.head()
In this phase, we'll sample a percentage of the data, and then use "one-hot" encoding to turn string features into numbers for our mathematical models! (read more here: http://www.insightsbot.com/blog/zuyVu/python-one-hot-encoding-with-pandas-made-simple)
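Before applying it to the real data, here is a minimal sketch of what one-hot encoding looks like, using a tiny made-up dataframe (not the meetup data):
import pandas as pd

# hypothetical toy dataframe just to illustrate pd.get_dummies
toy = pd.DataFrame({'member_id': [1, 2, 3], 'city': ['New York', 'Chicago', 'New York']})
# each city becomes its own column (city_Chicago, city_New York) holding 0/1
# flags for every row (newer pandas versions may show True/False instead)
pd.get_dummies(toy['city'], prefix='city')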
# I am taking a random sample of 50,000 rows from the dataframe since it is just SOOO big! This will help my code run faster
df2_sample = df2.sample(n=50000)
#let's explore the df by member id and the first record number of each - what are the features we want to use?
df2_sample.groupby(['member_id']).first()
#One feature I want to use is the GROUP ID - one thing that we can do is "get dummies" or "one-hot encoding" to
#turn string variables into numbers! look at it below
df2_sample_dummies = pd.get_dummies(df2_sample['group_id'], prefix = 'group_id')
# df2_sample_dummies_first = df2_sample_dummies.groupby(['member_id']).first()
#this is what it looks like to have "dummies" or one-hot encoded variables!
#http://www.insightsbot.com/blog/zuyVu/python-one-hot-encoding-with-pandas-made-simple
df2_sample_dummies.head()
#Let's combine it back to our original dataframe
df2_sample_dummies_concat = pd.concat([df2_sample, df2_sample_dummies], axis=1)
df2_sample_dummies_concat.head()
#let's repeat the same process for the "cities" feature
df2_sample_dummies_cities = pd.get_dummies(df2_sample_dummies_concat['city'], prefix = 'cities_')
df2_sample_dummies_concat_cities2 = pd.concat([df2_sample_dummies_concat, df2_sample_dummies_cities], axis=1)
df2_sample_dummies_concat_cities2.head()
#write it to a CSV before your kernel dies! this could be helpful if you want to use the same sample again in the future
# df2_sample_dummies_concat_cities2.to_csv('members2.csv')
#print(list(df2_sample_dummies_concat_cities2.columns.values))
I've decided to use group IDs and cities as our main features to train the model - and since we want the data itself to tell us what the major groups are, we're going to use a technique called "clustering" (specifically k-means clustering, where k = the number of clusters). I don't know in advance what the best number of clusters will be, so I will try a few different values of k (k = 2, 4, 6, 8) and then see how well our clusters are performing. See below :)
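Here's a compact, loop-based sketch of that idea on a toy dataset generated with make_blobs (an assumption for illustration, not the meetup sample): fit KMeans for each candidate k and record the silhouette score. The cells below do the same thing on the real one-hot encoded features, one k at a time.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# toy data standing in for the one-hot encoded meetup features
X_toy, _ = make_blobs(n_samples=500, centers=4, random_state=42)
for k in [2, 4, 6, 8]:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_toy)
    print(k, silhouette_score(X_toy, labels))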
df2_sample_dummies_concat_cities2_train = df2_sample_dummies_concat_cities2.loc[:, 'group_id_6388':'cities__West New York']
df2_sample_dummies_concat_cities2_train.head()
#k = 8 training model
km = KMeans(n_clusters=8)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k8 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
# km = KMeans(n_clusters=7)
# %time km.fit(df2_sample_dummies_concat_cities2_train)
# clusters = km.labels_.tolist()
# silhouette_k7 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:10000], clusters[0:10000])
#k=6 training model
km = KMeans(n_clusters=6)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k6 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
# km = KMeans(n_clusters=5)
# %time km.fit(df2_sample_dummies_concat_cities2_train)
# clusters = km.labels_.tolist()
# silhouette_k5 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:10000], clusters[0:10000])
# k = 4 training model
km = KMeans(n_clusters=4)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k4 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
# km = KMeans(n_clusters=3)
# %time km.fit(df2_sample_dummies_concat_cities2_train)
# clusters = km.labels_.tolist()
# silhouette_k3 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:10000], clusters[0:10000])
# k = 2 training model
km = KMeans(n_clusters=2)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters = km.labels_.tolist()
silhouette_k2 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters[0:50000])
Ok, now that we've trained 4 models with different numbers of clusters (different k), we've also calculated a silhouette coefficient for each. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
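As a rough sketch of how it's computed (on toy blobs, not the meetup data): for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points in the nearest other cluster, and the per-point silhouette value is (b - a) / max(a, b); silhouette_score is just the mean of these values.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# toy example, assuming 3 well-separated blobs
X_toy, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels_toy = KMeans(n_clusters=3, random_state=0).fit_predict(X_toy)
print(silhouette_samples(X_toy, labels_toy)[:5])  # per-point values, each in [-1, 1]
print(silhouette_score(X_toy, labels_toy))        # the mean over all points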
We can plot the number of clusters against the silhouette score and use the elbow method (visually looking at the plot to see where there's an "elbow") to see which number of clusters does best. The elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset.
# silhouette = [silhouette_k2, silhouette_k3, silhouette_k4, silhouette_k5, silhouette_k6, silhouette_k7, silhouette_k8]
# count_k = [2, 3, 4, 5, 6, 7, 8]
silhouette = [silhouette_k2, silhouette_k4, silhouette_k6, silhouette_k8]
count_k = [2, 4, 6, 8]
count_silhouette = list(zip(count_k, silhouette))
print(count_silhouette)
plt.plot(*zip(*count_silhouette))
km = KMeans(n_clusters=6)
%time km.fit(df2_sample_dummies_concat_cities2_train)
clusters6 = km.labels_.tolist()
silhouette_k6 = silhouette_score(df2_sample_dummies_concat_cities2_train[0:50000], clusters6[0:50000])
#Let's assign these clusters back to the original df and take a look!
df2_sample_dummies_concat_cities2_train.loc[:, "cluster_number"] = clusters6
df2_sample_dummies_concat_cities2_train.head()
#it is important to investigate how many samples are in each of your clusters - we can see here that the first 3
#clusters have WAAAY more samples than the last 3! So, when we plot our visualizations, let's see what makes them
#so different!
df2_sample_dummies_concat_cities2_train["cluster_number"].value_counts()
Now that we've decided on k=6 clusters, let's assign the cluster labels back to the original data, and make it interpretable!
df2_sample.head()
df2_sample_dummies_concat_cities2_train.head()
df2_sample.loc[:, "cluster_number"] = clusters6
df2_sample_merged = df2_sample.merge(df[['group_id', 'category.shortname']], on=['group_id'])
df2_sample_merged.head()
# This is the final file you will be using for this assignment to explore :)
df2_sample_merged.to_csv('members_cluster_group.csv')