Many methods have been proposed for human activity classification, relying either on Inertial Measurement Unit (IMU) data or on data from static cameras observing subjects. There has been relatively little work using egocentric videos, and even fewer approaches combining egocentric video and IMU data. Systems relying only on IMU data are limited in the complexity of the activities they can detect. In this paper, we present a robust and autonomous method for fine-grained activity classification that leverages data from multiple wearable sensor modalities to differentiate between activities that are similar in nature, with an accuracy that would be unattainable by either sensor alone. We use both egocentric videos and body-worn IMU sensors. We employ Capsule Networks together with convolutional Long Short-Term Memory (LSTM) networks to analyze the egocentric videos, and an LSTM framework to analyze the IMU data, capturing the temporal aspect of actions. In experiments on the CMU-MMAC dataset, we achieve overall recall and precision rates of 85.8% and 86.2%, respectively. We also present results obtained using each sensor modality alone, showing that the proposed approach yields 19.47% and 39.34% increases in accuracy compared to using only ego-vision data and only IMU data, respectively.
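The abstract does not specify how the two modality streams are combined, but the intuition of multi-sensor fusion can be sketched as a simple score-level (late) fusion of per-class probabilities. This is only an illustrative assumption, not the paper's actual fusion scheme; the function name, weights, and probability values below are all hypothetical.

```python
import numpy as np

# Hypothetical sketch of score-level (late) fusion between a video branch
# and an IMU branch. All numbers and the weighting scheme are assumptions
# for illustration; they are not taken from the paper.

def fuse_scores(video_probs: np.ndarray,
                imu_probs: np.ndarray,
                video_weight: float = 0.5) -> np.ndarray:
    """Weighted average of per-class probabilities from the two branches."""
    fused = video_weight * video_probs + (1.0 - video_weight) * imu_probs
    return fused / fused.sum()  # renormalize to a probability distribution

# Three hypothetical fine-grained classes (e.g. similar kitchen actions).
video_probs = np.array([0.50, 0.45, 0.05])  # video branch alone is ambiguous
imu_probs   = np.array([0.20, 0.70, 0.10])  # IMU branch favors class 1

fused = fuse_scores(video_probs, imu_probs)
print(fused.argmax())  # combined evidence resolves the ambiguity -> 1
```

The point of the sketch is that two individually ambiguous modality-specific classifiers can yield a confident joint decision, which mirrors the abstract's claim that the combination outperforms either sensor alone.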