Below are some random notes from the discussion.
Generally speaking, the eventual goal is growth curve that indicates whether subject is getting better or worse over time. This might split into three time-regions with different dynamics: pre-treatment, treatment, recovery.
The population could be split into three groups: radiation, chemo, untreated; start by picking one and modeling it (probably chemo is a good start). For now, untreated participant will probably be left out until we have a better idea of what we're modeling.
Absolute elapsed time may be less interesting than relative progress through treatment (but need to investigate).
Will probably want to synchronize sequential data on an event (e.g. first treatment), since the first visit holds no strong meaning.
Immediate taks: Cluster trajectories using GMM with missing data. Use visit-number as time index, syncing on (x \in) {first-treatment, final treatment}. Split into chemo and radiation groups and handle separately. Be smart about choosing dataset to minimize missing data. Visualize by plotting single dimension, one curve per cluster.
Send kobus MARRS paper, Hinton tech report Read relevant chapters of murphy
Emily says to use these datasets specifically for the first approach:
['sri3.csv', 'das4.csv', 'ea4.csv', 'fact4.csv', 'pss4.csv']
Several redundant columns here. Questions 4 and 5 ( "PartnerUpsetting4" and "Conflicted5") are recoded as "repartnerupsetting4" and "reconflicted5", and then all columns are cloned into XXXn_conv, for n in {1,...,5}. Thus, the relevant columns are:
partnerimp1_conv
partnerpredictable2_conv
partnerhelpful3_conv
repartnerupsetting4_conv
reconflicted5_conv
Relevant columns:
Affection1
Sex2
Kiss3
TooTired4
ShowLove5
Relevant columns:
FigureOut1
Comfortable2
GladPleased3
Worthwhile4
Cherish5
AttendTo6
Enjoy7
Appreciate8
LetGo9
TakeCareOf10
EnergyFigureOut11
Like12
InTouch13
Relevant columns (all should be reversed)
LackEnergy1
Nausea2
TrblMeetNeedsFam3
HavePain4
SideEffectsBother5
FeelSick6
TimeInBed7
Like SRI, we used the recoded *_conv
fields:
upsetunexpectantly1_conv
unabletocontrol2_conv
stressed3_conv
redealtwithhassles4_conv
ineffectivecoping5_conv
reconfidenthandleprob6_conv
regoingyourway7_conv
notcopewithall8_conv
recontrolirritations9_conv
reontopofthings10_conv
Forty total dimensions.
These datasets apply to everyone, and should have minimal missing data (30 total):
PSS
FACT
EA
These datasets apply only to those in a relationship (10 total):
DAS
SRI
Take three approaches:
(1) look for nontemporal relationships among these dimensions (PCA). Several of these dimensions may collapse into one. (2) concatenate all visits into a vector and cluster (40 * 9 = 360 dimensional points) (3) repeat (2) but re-center based on treatment date
999 nonempty records in the full 40-dimensional model.
651 records are missing-data-free in the full 40-dimensional model.
863 records are missing-data-free in the 30-dimensional relationship-free model.
TODO: concatenate for approach (2) above, and re-evaluate coverage.
We didn't really expect PCA to give interesting results, but I ran it anyway. Definitely no obvious clusters.
About 20 of the 40 dimensions seem relevant?
Posted by Kyle Simek