What applications would this have that differ from regular transformers? Perhaps a stupid question.
For training, would it be useful to stabilize the footage first?
Stabilization appears to be a subset of a literally wider, but more rewarding, challenge: reconstructing the whole area scanned by the camera. It might be better to work on that challenge rather than on simple stabilization.
That's similar to how the human visual system 'paints' a coherent scene from a quite narrow field of high-resolution view, with educated guesses and assumptions.
I guess yes. Having worked on video processing, it's always better to stabilize if you can, because it significantly reduces the number of unique tokens, which would be even more useful for the present method. However, you probably lose some generalization performance, and not all videos can be stabilized.
Isn't this like Differential Transformers, which work based on differences?
That was my feeling too for the most part, but the run length is a significant source of information, and if it enables tokens to be skipped, it is essentially gaining performance by working with a smaller but denser form of the same information. My instinct is that run-length encoding is just the most basic case of a more generalized method for storing token information across time and area, so that the density of information in tokens is more even: the area and duration would be variable, but the token stream would contain a series of tokens each carrying similar quantities of semantic data.
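To make the run-length point concrete, here's a minimal sketch in plain Python (the function name and the string tokens are mine, purely illustrative) of collapsing repeated patch tokens into (token, run_length) pairs:

```python
from itertools import groupby

def run_length_encode(tokens):
    """Collapse consecutive repeats into (token, run_length) pairs.

    A patch that stays identical across many frames becomes a single
    token plus a count, instead of one token per frame.
    """
    return [(tok, sum(1 for _ in group)) for tok, group in groupby(tokens)]

# Example: a static background patch repeated over 5 frames,
# interrupted by two frames of motion.
stream = ["bg", "bg", "bg", "bg", "bg", "ball", "ball", "bg"]
print(run_length_encode(stream))
# [('bg', 5), ('ball', 2), ('bg', 1)]
```

The generalized version would make both the spatial extent and the duration variable, so that each emitted token carries a roughly equal amount of semantic information.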
I feel like this is very much like the early days of data compression, where a few logical but somewhat ad-hoc principles are being investigated in advance of a more sophisticated theory that integrates what is being attempted, how to measure success, and how to recognize pathways toward the optimal solution.
These papers are the foundations of that work.
As far as I can tell, though, while the core idea is the same (focus on the differences), the implementation is different. The Differential Transformer 'calculates attention scores as the difference between two separate softmax attention maps', so it must still process the redundant areas. This method removes them altogether, which would significantly reduce compute. Very neat idea.
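For reference, a rough NumPy sketch of that difference-of-softmaxes map (simplified from my reading of the Diff Transformer paper; the real model splits heads and learns lambda per layer, so treat this as illustrative only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Differential attention: the attention map is the *difference*
    of two softmax maps, which cancels common-mode (noise) attention.
    Note both maps still span every token in the sequence."""
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V

# Toy shapes: 8 tokens, head dim 16.
rng = np.random.default_rng(0)
Q1, K1, Q2, K2 = (rng.standard_normal((8, 16)) for _ in range(4))
V = rng.standard_normal((8, 16))
print(diff_attention(Q1, K1, Q2, K2, V).shape)  # (8, 16)
```

Both softmax maps are computed over every token, so redundant tokens still cost compute; dropping them from the sequence entirely avoids that cost.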
However, I do think that background information can sometimes be important. I reckon a mild improvement on this model would be to leave the background in on the first frame, and perhaps every N frames after that, so that the model gets better context cues. This would also more closely mirror how video codecs periodically insert full keyframes.
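A toy sketch of that keyframe idea (the function and the `every_n` parameter are hypothetical, not from the paper): keep all patch tokens on periodic keyframes and only the changed patches in between, much like I-frames versus P-frames in a codec:

```python
def select_tokens(frames, every_n=30):
    """Keep every patch token on keyframes (every `every_n` frames),
    and only the patches that changed since the previous frame otherwise.
    `frames` is a list of dicts mapping patch position -> token."""
    kept = []
    prev = {}
    for i, frame in enumerate(frames):
        if i % every_n == 0:
            # Keyframe: keep the full background for context cues.
            kept.append((i, dict(frame)))
        else:
            # Delta frame: keep only patches that differ from the last frame.
            kept.append((i, {p: t for p, t in frame.items() if prev.get(p) != t}))
        prev = frame
    return kept

frames = [
    {(0, 0): "sky", (0, 1): "tree"},
    {(0, 0): "sky", (0, 1): "bird"},   # only (0, 1) changed
    {(0, 0): "sky", (0, 1): "bird"},   # nothing changed
]
for i, toks in select_tokens(frames, every_n=3):
    print(i, toks)
# 0 {(0, 0): 'sky', (0, 1): 'tree'}
# 1 {(0, 1): 'bird'}
# 2 {}
```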