@nsplattner, @thempel and I were discussing the generally recommended flow of data, which is related to the questions in #23.
Question
In #23 I asked about the structure of reduced trajectories (multiple files, ...) so that PyEMMA or another analysis can always be used. Now, with the new directory approach, there is no fixed structure, and hence the framework cannot guess what to do with the trajectories, which files in the directory to use, etc. Still, the Trajectory objects will have information about strides, and I guess that an engine will subclass Trajectory to add whatever information it needs to restart with the engine's particular way of storing things.
This means that an engine writing the files in a trajectory folder could also add information about filenames, etc. to the Trajectory object it returns. We could agree that an engine needs to provide functions or bash snippets to extract a frame from such a trajectory. A trajectory knows its generating engine, so a trajectory would have access to code that can extract frames, etc.
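To make this concrete, here is a rough sketch of what such a subclass could look like. Everything in it is an assumption for illustration: the class names `EngineTrajectory` and `Engine`, the `outputs` attribute, the `locate_frame` method, and the filenames and strides are made up and are not the framework's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical stand-in for the framework's Trajectory object."""
    location: str            # trajectory folder
    length: int              # number of simulation steps
    engine: "Engine" = None  # the engine that generated this trajectory


@dataclass
class EngineTrajectory(Trajectory):
    """Engine-specific subclass that records what was written where."""
    # maps a role to (filename, stride); names and strides are assumptions
    outputs: dict = field(default_factory=dict)


class Engine:
    """Hypothetical engine that knows the layout of its own folders."""

    def run(self, location, length):
        # A real engine would run the simulation here; the sketch only
        # returns the bookkeeping object.
        return EngineTrajectory(
            location=location,
            length=length,
            engine=self,
            outputs={"full": ("full.dcd", 10), "reduced": ("reduced1.dcd", 2)},
        )

    def locate_frame(self, traj, step, role="full"):
        """Return (path, frame index) of the frame saved at simulation step `step`."""
        filename, stride = traj.outputs[role]
        if step % stride != 0:
            raise ValueError(f"step {step} was not saved with stride {stride}")
        return f"{traj.location}/{filename}", step // stride
```

Since the trajectory carries a reference to its engine, analysis code could call `traj.engine.locate_frame(traj, step)` without knowing how the folder is organized.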
So, in theory, it would be possible to write the trajectory analysis independently of the engine that generated the data. That was my original approach, but I guess it does not reflect the way things are currently done by people. Everyone wants something specific for whatever reason, so the trivial solution is that
1. Trajectory generation and trajectory analysis go in pairs.
You need to pass exactly the files PyEMMA needs and tell PyEMMA about the stride, etc. you used to generate these. The downside is that the code becomes less reusable and hence easier to screw up.
This is easy because everyone writes their own code.
2. Write engine-specific functions to read trajectories into PyEMMA
Hmmm, that would mean adding analysis-specific code to the engine, and I would really like to keep these separate. Still, it could make sense to have functions that allow you to get certain files for certain aspects:
t = Trajectory(...)
reduced_traj = engine.get_reduced(t) # find the file for the reduced traj
full_traj = engine.get_full(t) # find the file for the full traj
# this would be the normal way and you need to know `reduced1.dcd` as filename
reduced_trajs_for_analysis = project.trajectories.all.to_path('reduced1.dcd')
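One way the lookup functions above might be implemented without putting any analysis code into the engine is a plain role-to-filename table. This is only a sketch: the `output_files` table, the `get_file` helper and the filename `full.dcd` are assumptions, and the `Trajectory` class here is a minimal placeholder.

```python
class Trajectory:
    """Hypothetical minimal trajectory: just the folder it lives in."""

    def __init__(self, location):
        self.location = location


class Engine:
    """Hypothetical engine; the filenames in the table are assumptions."""

    output_files = {"reduced": "reduced1.dcd", "full": "full.dcd"}

    def get_file(self, traj, role):
        # Only the engine knows which file plays which role.
        return f"{traj.location}/{self.output_files[role]}"

    def get_reduced(self, traj):
        return self.get_file(traj, "reduced")

    def get_full(self, traj):
        return self.get_file(traj, "full")


engine = Engine()
trajs = [Trajectory("trajs/00000000"), Trajectory("trajs/00000001")]

# The analysis side only ever sees a list of paths.
reduced_files = [engine.get_reduced(t) for t in trajs]
```

The engine stays free of PyEMMA imports this way; it only answers the question "which file is the reduced one?".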
3. Use feature trajectories
This is what we discussed and could make sense. Instead of rewriting the PyEMMA input you need to write an engine-specific featurizer, which could be much simpler. It will also cache features for all trajectories, which is useful if these are expensive to compute but cheap to store.
It requires an intermediate featurization step, but then you just pass featurized trajectories to PyEMMA.
In this approach we still need to figure out where to store the feature_trajs. They could go in the trajectory folder, since this needs to exist before you can compute features.
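A minimal sketch of the caching idea, assuming the features are stored inside the trajectory folder. The function name `feature_traj`, the cache filename `features.pkl` and the use of `pickle` are placeholders chosen for the example, not a decision on the storage format.

```python
import os
import pickle


def feature_traj(traj_dir, featurize, name="features.pkl"):
    """Return the features for one trajectory folder, computing them only once.

    `featurize` stands for the (hypothetical) engine-specific featurizer:
    a callable that takes the trajectory folder and returns one feature
    vector per frame.
    """
    cache = os.path.join(traj_dir, name)
    if os.path.exists(cache):
        # Features were computed before: load them instead of recomputing.
        with open(cache, "rb") as fh:
            return pickle.load(fh)
    features = featurize(traj_dir)
    with open(cache, "wb") as fh:
        pickle.dump(features, fh)
    return features
```

The analysis would then collect `feature_traj(t, featurize)` for every trajectory folder and hand the resulting list to PyEMMA, without caring which engine wrote the files.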
IDEAS?