Skip to content
This repository was archived by the owner on Dec 9, 2024. It is now read-only.
This repository was archived by the owner on Dec 9, 2024. It is now read-only.

How to evaluate worker performance independently on a distributed training #520

@delucca

Description

@delucca

Hi

I'm trying to evaluate the performance of each worker independently in a cluster with multiple machines while training them using the same model. My goal is to record each worker training performance.

Every setup and config that I try I always get the same time for all workers (probably because of synchronization issues). So, even if one of my workers is a machine that is 4x faster, it would still record the same time as the slowest machine in the cluster.

Anyone has any idea how can I do that?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions