Hearing-impaired people and non-native speakers rely on captions for access to video content, yet most videos remain uncaptioned or have machine-generated captions with high error rates. In this paper, we present the design, implementation and evaluation of BandCaption, a system that combines automatic speech recognition with input from crowd workers to provide a cost-efficient captioning solution for accessible online videos. We consider four stakeholder groups as our source of crowd workers: (i) individuals with hearing impairments, (ii) second-language speakers with low proficiency, (iii) second-language speakers with high proficiency, and (iv) native speakers. Each group has different abilities and incentives, which our workflow leverages. Our findings show that BandCaption enables crowd workers who have different needs and strengths to accomplish micro-tasks and make complementary contributions. Based on our results, we outline opportunities for future research and provide design suggestions to deliver cost-efficient captioning solutions.