Keywords: Task analysis, Encoding, Resource management, Termination of employment, Machine learning algorithms, Vectors, Optimization, Deep reinforcement learning (DRL), Distributed coded machine learning (DCML), Worker selection, Workload allocation
Abstract:
Machine learning (ML) has been successfully applied to a wide variety of problems across diverse domains, such as robotics, healthcare, and finance. However, high-complexity ML algorithms often require excessively long computation times, which significantly limits their feasibility. Distributed ML (DML) has been used to tackle the slow computation of high-complexity ML algorithms. Nevertheless, with DML, the computation results from all participating computing devices must be collected to complete an ML task. When some of the participating devices, known as stragglers, fail to return their results in time, the overall computation time is extended. Distributed coded ML (DCML) is a promising solution for mitigating the negative impact of stragglers: redundancy is injected into an ML task so that only a subset of the results from the participating devices is required to finish it. In DCML, two challenging problems arise: how to select the proper participating devices, referred to as workers, and how to allocate appropriate workloads to the selected workers. In this article, we consider a DCML scenario in which numerous computing devices are available for an ML task and are willing to offer their computation capacity in exchange for compensation. To encourage the devices to participate in the distributed computation, a reverse-auction-based incentive mechanism is employed. With the objective of minimizing both the completion time of the ML task and the compensation paid to the participating devices, we propose a deep-reinforcement-learning-based workload allocation and worker selection scheme for DCML, termed DAS. To our knowledge, this is the first attempt to tackle the workload allocation and worker selection problems in DCML simultaneously. Our experimental results indicate that DAS outperforms state-of-the-art schemes in terms of both completion time and compensation.
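The straggler-tolerance idea behind DCML can be made concrete with the textbook coded-computation example (an illustrative sketch, not the scheme proposed in this article): a (3, 2) MDS-coded matrix-vector multiplication, in which three workers receive encoded workloads and any two of their results suffice to recover the full product, so one straggler can be ignored.

```python
import numpy as np

# Illustrative (3,2) MDS-coded matrix-vector multiplication.
# Split A into blocks A1, A2 and add a redundant block A1 + A2;
# any 2 of the 3 worker results recover A @ x, tolerating 1 straggler.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

A1, A2 = A[:2], A[2:]
tasks = {"w1": A1, "w2": A2, "w3": A1 + A2}  # encoded workloads

# Suppose worker w2 straggles; decode from w1 and w3 alone:
r1 = tasks["w1"] @ x
r3 = tasks["w3"] @ x
recovered = np.concatenate([r1, r3 - r1])  # A2 @ x = (A1 + A2) @ x - A1 @ x

assert np.allclose(recovered, A @ x)
```

The worker-selection and workload-allocation problems the abstract refers to then amount to deciding which devices receive such encoded blocks and how large each block should be.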