Reverse engineering the obfuscated TikTok VM

(github.com)

412 points | by xfeeefeee 8 months ago ago

130 comments

kleiba 8 months ago

I've been using a shitty streaming website whose player interrupts the playback of a video in irregular intervals and presents a cryptic error message. I've started looking into the JavaScript code to see if I can't code up a work-around mechanism (basically debugging their garbage implementation), and of course (why actually?) their player code is also obfuscated.

And I've gotta say, emplying an AI assistant has proven to be an invaluable help in trying to understand obfuscated code. It's actually really cool to take a function of gobbledegook JavaScript and ask the AI to rewrite it in a more canonical and easily understandable way, with inline comments. Of course, there are flaws every now and then, but the ability to do this has been such a game changer for reverse engineering, IMO.

I can even ask to take a guess at finding better variable/function names and the AI can infer from the code (maybe has seen the unobfuscated libraries during training?) what this code is actually doing on a high-level and turn something like e.g(e.g) into player.initialize(player.state) which is nothing short of amazing.

So for anyone doing similar work, I cannot recommend highly enough to have an AI agent as another tool in your tool belt.

[-]

poincaredisk 8 months ago

I'm surprised by this. As a professional reverse engineering I've actually found LLMs to be terrible at deobfuscation of JS (especially in the context of JS malware). But maybe my requirements are higher and it's actually OK for occasional use against weak packers?

[-]

ctoth 8 months ago

Have you seen this?

https://github.com/jehna/humanify

What they do is ground the LLM to the AST with Babel to ensure you still get the same shape of AST out of your deobfuscation pass. Probably this tool could be cleaned up, made to work with multiple llm and parser backends, have its prompts improved, &c.

[-]

rfoo 8 months ago

This is great idea! But it's more about having LLMs to give function & variables names, instead of having LLM to deobfuscate. The (traditional) deobfuscations (e.g. unpack, de-flatten, de-virtualization etc) were done by 100% precise human made Babel plugins and is totally unrelated to a LLM.

Bilal_io 8 months ago

I've used it for small files and it did very well prettifying, naming the variables and adding comments for context. But I can imagine it doing a bad job with large files.

saagarjha 8 months ago

Is it truly obfuscated, or just minified?

[-]

johann8384 8 months ago

Well the example in the article was obfuscated with several specific examples.

[-]

saagarjha 8 months ago

I mean the JavaScript the LLM reversed for them

pcwalton 8 months ago

I tried ChatGPT 4o to help me reverse engineer some game code with the symbols missing and the results were quite disappointing. To say it had a tendency to hallucinate is an understatement. It didn't have any clue what was going on.

For me, those AI tools are much better at saving me time looking up documentation when doing simple things where it has examples of the exact code pattern I'm looking for in its training set. ChatGPT is great at writing one-off Blender scripts for me to give to artists, for instance.

lukan 8 months ago

Which AI agents did you use?

[-]

kleiba 8 months ago

I've tried different ones, they all seem to do a great job.

[-]

sureIy 8 months ago

Could you name a couple?

ImPostingOnHN 8 months ago

next up is using AI to obfuscate it better in the first place, and then the terrible code gets scraped and used in further training, with an arms race ensuing, until all code on the internet is unintelligible but somehow works and can only be maintained by a specific AI that has a particularly encoded form of insanity

titaphraz 8 months ago

> they all seem to do a great job

Yeah right.

klabetron 8 months ago

Out of curiosity (as someone disappointingly new to prompt engineering), what’s an example prompt you used with some success?

[-]

nurettin 8 months ago

Actually knowing the subject and presenting insights gives me much better results than simply asking it to do what I mean.

Loughla 8 months ago

For help with prompt engineering, take a graduate level grant writing course. It teaches you how to ask the right questions to get answers from humans and how to break down complicated processes into bite size pieces; really useable for llm's.

[-]

specialist 8 months ago

Heh. Probably also useful should a djinn ever grant you three wishes.

esseph 8 months ago

Ask questions. Be disappointed in the outcomes.

Ask more questions. Get some right answers. Repeat.

Make question asking muscle get swole.

SoKamil 8 months ago

> As this is a Javascript file executed on the web, it is actually possible to replace the normal webmssdk.js with the deobfuscated file and use TikTok normally.

> This can be achieved by using two browser extensions known as Tampermonkey for executing custom code and CSP to disable CSP so I can fetch files from blocked origins. This is so I can put latestDeobf.js in my own file server and have it be fetched each time, this is so I can easily edit the file and let the changes take effect each time I refresh. This makes it much easier to bebug when reversing functions.

I believe you can achieve the same effect without any 3rd party extensions. You can use Local Overrides in Chrome DevTools.

Great work!

[-]

wutwutwat 8 months ago

You can also install some trusted certs and MITM the requests, replacing the content with whatever you'd like

Likely overkill for this use case, but no matter the client, you can in theory do whatever you want to any traffic up until the point it leaves your network.

[-]

ImPostingOnHN 8 months ago

what toolset do you use for on-the-fly translation?

ad-hoc code, or something with a more structured workflow, maybe?

this sounds like a fun thing to try, thanks for your time

[-]

SoKamil 8 months ago

Charles, Proxyman, or mitmproxy if you like open source + terminal would do the job.

[-]

geoka9 8 months ago

mitmproxy will even allow you to script the intercept/override behavior, which can be really handy.

18172828286177 8 months ago

See Burpsuite

godelski 8 months ago

This seems like quite a lot of work to hide the code. What would the legitimate reasons for this be? Because it looks like it would make the program less optimized and more complexity just leads to more errors.

I understand the desire to make it harder for bots, but 1) it doesn't seem to be effective and bots seem to be going a very different route 2) there's got to be better ways that are more effective. It's not like you're going to stop clones through this because clones can replicate by just seeing how things work and reverse engineer blackbox style.

[-]

noduerme 8 months ago

A generous take would be that they have their own internal GUI tools that make it easier for non-programmers to set up visual elements in this. That was historically the reason to invent VMs like Flash. A less generous take would account for the enormous potential for hiding nefarious code inside such a thing, and account for the nature of the government which deployed it, and conclude that it was a national security / defense project disguised as a candy-coated trojan horse.

[-]

supriyo-biswas 8 months ago

VM-based architectures are really common in the obfuscation space, which is why you have executable packers[1], JS packers[2] and bot management products[3][4] leveraging similar techniques.

As for why the obfuscation is needed: bot management products suffer from a fundamental weakness in that ultimately, all of them simply collect static data from the environment, therefore it would make much more sense to make the steps involved as difficult to reverse engineer as possible. Once that is done, all you need to do is slightly change the schematics of your script every few weeks and publish a new bundle, and you've got yourself a pretty unsubvertible* protection scheme.

Regarding the "trojan horse", I think someone is yet to show proof that it's a Javascript exploit.

(*Unsubvertible is obviously relative, but raising the cost the attack, from say, $0.01/1000 requests to $10/1000 requests would massively cut down on abuse.)

[1] https://vmpsoft.com/

[2] https://jscrambler.com/

[3] https://github.com/neuroradiology/InsideReCaptcha

[4] https://www.zenrows.com/blog/bypass-cloudflare#_qEu5MvVdnILJ...

[-]

noduerme 8 months ago

[flagged]

[-]

supriyo-biswas 8 months ago

> Packers and obfuscators are not a "VM". It may look like a VM and act like a VM, in that it has opcodes that you can write to in another higher-level "language", but that is not a VM. It's in the same sandbox as the thing you're trying to obfuscate.

Indignant, caustic comments that simply discard all presented evidence such as this is the primary reason why people with the relevant experience have reduced their contributions on HN. I suspect I'll join their ranks too.

[-]

esseph 8 months ago

If you explain what you disagree with, maybe people would learn something.

If you joint "their ranks", you've simply gone down the bit of the "social media + expertise" bell curve, where the more expertise you get on a particular topic, the less you want to engage about it in public.

This is not unique to any one field or realm of knowledge.

[-]

oefrha 8 months ago

> I can't think of a scenario where you'd need to deconstruct their front-end to mimic the calls. Just observe the calls and mimic them.

Just about everything in that wall of text is wrong, but it’s rather pointless to engage someone who clearly has never reverse engineered a single thing with a modicum of defense, yet has tons of opinions on the subject. It’s like debating the pros and cons of programming languages with someone whose claim of expertise is having used computers. I totally get gp’s frustration. Denial of service with too many absurd points is real.

> If you explain what you disagree with, maybe people would learn something.

TFA explains in detail why reverse engineering is necessary, and specifically what is achieved by reverse engineering which part. ggp clearly has zero interest in learning when much of what they wrote has been refuted in the damn post being discussed. Other people can RTFA to learn as well, it’s a good technical post, and we don’t get enough of those these days.

[-]

noduerme 8 months ago

k, you're saying you can't understand the obfuscated bullet they're sending to the backend unless you understand how it's turned into a particular nut on the front, and understand both sides of the request. I see how that makes sense, if you can't even figure out how to decipher the requests as you read them going out. Is that what you're trying to say? If so, say it, rather this this "wall of text" about why technical posts are scarce and bla bla bla. I'm more interested in the reason for this kind of obfuscation (and no, I don't think it's just to protect against bots, as there are many ways to do that).

[-]

gruez 8 months ago

>I'm more interested in the reason for this kind of obfuscation (and no, I don't think it's just to protect against bots, as there are many ways to do that).

see: https://news.ycombinator.com/item?id=43748681 and https://news.ycombinator.com/item?id=43749282

all of those can't really work if the javascript payload is easily comprehensible, because you can just write your own implementation in python or whatever.

oefrha 8 months ago

You see a parameter called signature in the request with a random looking value, you try to "mimic" it (how?) and you always get 403 back. How do you proceed? TFA tells you know, and tells you why reverse engineering is necessary (TFA in fact goes one step further than what is necessary, but you have to do at least half of the work there — I have done exactly that myself in the past).

All you've posted so far is "I don't think <other people's points>" while being wildly wrong. It's on you to explain handwaves like "just observe the calls and mimic them", but I don't think you'll do that, plus anyone with experience here can tell it's nonsense anyway, so here's where I disengage.

Edit: I'll add another point of view as someone who has implemented my own obfuscation scheme in a product where throwing up a third party CAPTCHA isn't an option (the above is from having studied and worked around other people's obfuscation schemes, including TikTok's). Obfuscation is an arms race so there's no 100% winning, but my implementation, while vastly simpler and probably won't stop LukasOgunfeitimi, reduced the observable abuse of our product down to effectively zero. Turns out most hackers are pretty dumb. So, this shit works, "I don't think" be damned.

supriyo-biswas 8 months ago

> Denial of service with too many absurd points is real.

Otherwise also known as sealioning: https://en.wikipedia.org/wiki/Sealioning

[-]

noduerme 8 months ago

[flagged]

supriyo-biswas 8 months ago

This is all correct, just that I'm lamenting the decline of technical discussions over uninformed positions and polarizing diatribes at a place I loved to come to be informed. /soapbox

[-]

esseph 8 months ago

You're the one doing the informing on this topic, congrats, you've forum-peaked.

noduerme 8 months ago

It's also a learning experience on both sides if people who have the knowledge share it with others. Teaching is a great way to learn; I exercise a great deal in my mind when I'm not working by trying to help other people solve problems.

[-]

8 months ago

[deleted]

noduerme 8 months ago

I'm not trying to be caustic or indignant. Not at all! I'm very surprised someone would take it that way. I have a different definition of what a "VM" is. I've been writing code for 30+ years and this is my understanding of what that means. Please, by all means, if I am completely wrong then take the time to explain what your view is.

marvinmorvin 8 months ago

[flagged]

saagarjha 8 months ago

Amazing. Every word of what you said was wrong.

[-]

noduerme 8 months ago

Well, then correct me.

[-]

saagarjha 8 months ago

Your definition of VM diverges from the standard one. This is a VM:

> it has opcodes that you can write to in another higher-level "language"

VMs aren't just VirtualBox.

TikTok obfuscates their frontend because they fingerprint the device and send it to their server, likely as an anti-fraud thing. Generally these kinds of things will make it so that you can't treat the frontend as a black box and replay requests or do anything simple like that. For example, if they add an incrementing counter to each request and then encrypt it, and then they see you send the same request again, then they will flag you for doing something fishy. You'd only know what they were doing if you reverse engineered the frontend. This isn't security per se but it makes people have to reverse engineer the code to mimic genuine behavior, which is their goal.

As for TikTok being a cyberweapon: you're going to have to back up that claim. Obfuscating an app and then installing it on people's phones is not as clever an idea as you think it is.

noduerme 8 months ago

I seem to be misunderstood. In a sibling part of this thread that was responding to someone who was later flagged, I wrote this:

But that's basically an emulator of a VM, isn't it? It's like rewriting the Flash AVM2 into JS... it's still running in JS whereas the original VM was C++. It could JIT compile stuff but only because it literally was reserving memory that could overflow, and (semi-technical take here) from that advantage, of being closer to the metal, flowed all of the flaws in AVM2 that precipitated most of Adobe's woes with Flash. A VM implant in a web page that uses a plugin like Java or Flash, to get around running browser-sandboxed code, which can take over physical memory, is far different from just emulating a VM in Javascript. I wouldn't call writing a ton of opcodes in JS, which resolved to JS functions, a "virtual machine", because it isn't reserving anything or doing anything that Javascript can't do. Someone correct me here if I'm wrong... this is just heavy-duty obfuscation.

Also, one major purpose of a VM is to improve performance over what's available in the browser. If you use that as a measurement, this clearly doesn't fit that goal.

[-]

lxgr 8 months ago

> But that's basically an emulator of a VM, isn't it? It's like rewriting the Flash AVM2 into JS... it's still running in JS whereas the original VM was C++.

I think you're using a different definition of the term VM than most other people here. An "emulated VM" is a VM too.

> one major purpose of a VM is to improve performance over what's available in the browser.

That's definitely a very nonstandard interpretation. Many VMs are, intentionally, much less capable (in a permissions sense; in a computational sense, they're almost always exactly as capable) than the host environment they run in.

wzdd 8 months ago

> I can't think of a scenario where you'd need to deconstruct their front-end to mimic the calls.

The article mentioned that important API calls are signed. So you would need at least partially to deconstruct their front-end to invoke the calls.

Any other nefarious purposes aside, this seems explicitly anti-bot because you can change the obfuscation whenever you like, forcing another RE effort.

davidsojevic 8 months ago

Making it harder for bots usually means that it drives up the cost for the bots to operate; so if they need to run in a headless browser to get around the anti-bot measures it might mean that it takes, for example, 1.5 seconds to execute a request as compared to the 0.1 seconds it would without them in place.

On top of that 1.5 seconds is also that there is a much larger CPU and memory cost from having to run that browser compared to a simple direct HTTP request which is near negligible.

So while you'll never truly defeat a sufficiently motivated actor, you may be able to drive their costs up high enough that it makes it difficult to enter the space or difficult to turn a profit if they're so inclined.

[-]

godelski 8 months ago

I understand the argument. You can't have perfect defense and speedbumps are quite effective. I'm not trying to disagree with that.

But it does not seem like the solution is effective at mitigating bots. Presumably bots are going a different route considering how prolific they are, which warrants another solution. If they are going through this route then it certainly isn't effective either and also warrants another solution.

It seems like this obscurification requires a fair amount of work, especially since you need to frequently update the code to rescramble it. Added complexity also increases risks for bugs and vulnerabilities, which ultimately undermine the whole endeavor.

I'm trying to understand why this level of effort is worth the cost. (Other than nefarious reasons. Those ones are rather obvious)

rfoo 8 months ago

Google has been doing this since forever for recaptcha. And, to be fair, it seems to be fairly effectively for bot detection.

https://github.com/neuroradiology/InsideReCaptcha

> bots seem to be going a very different route

If the "very different route" means running a headless browser, then it's a success for this tech. Because the bot must run a blackbox JS now, and this gives people a whole new street of ways to run bot detection, using the bot's CPU.

[-]

godelski 8 months ago

Okay... but those bots exist... and in high numbers... By "very different route" I mean "measure to effectively stop the bots" (or dramatically reduce). It seems like if they're using a headless browser then they're still being quite effective in accomplishing their goals.

[-]

mike_hearn 8 months ago

Google's obfuscating VM based anti-bot system (BotGuard) was very effective. Source: I wrote it. We used it to completely wipe out numerous botnets that were abusing Google's products e.g. posting spam, clickfraud, phishing campaigns. BotGuard is still deployed on basically every Google product and they later did similar systems for Android and iOS, so I guess it continues to work well.

AFAIK Google was the first to use VM based obfuscation in JavaScript. Nobody was using this technique at the time for anti-spam so I was inspired primarily by the work Nate Lawson did on BluRay.

What most people didn't realize back then is that if you can force your adversary to run a full blown web browser there are numerous tricks to detect that the browser is being automated. When BotGuard was new most of those tricks were specific to Internet Explorer, none were already known (I had to discover them myself) and I never found any evidence that any of them were rediscovered outside of Google. The original bag of tricks is obsolete now of course, nobody is using Internet Explorer anymore. I don't know what it does these days.

The VM isn't merely about protecting the tricks, though. That's useful but not the main reason for it. The main reason is to make it easier to generate random encrypted programs for the VM, and thus harder to write a static analysis. If you can't write a static analysis for the program supplied by your adversary you're forced to actually execute it and therefore can't write a "safe" bot. If the program changes in ways that are designed to detect your bot, done well there's no good way to detect this and bring the botnet to a safe halt because you don't know what the program is actually doing at the semantic level. Therefore the generated programs can detect your bot and then report back to the server what it found, triggering delayed IP/account/phone number bans. It's very expensive for abusers to go through these bans but because they have to blindly execute the generated programs they can't easily reduce the risk. Once the profit margin shrinks below the margin from abusing a different website, they leave and you win.

throwaway48476 8 months ago

Makes it easier to hide code that does browser fingerprinting.

Scaevolus 8 months ago

Obfuscation is one part of defense in depth. Tiktok also has a variety of captchas to block scrapers, independent of this.

None of it's perfect, and they can be worked around, but by providing a barrier you've restricted some of the bad actors (spambots, scrapers) from acting at all.

It's easier to deal with 100 spambots than 1000!

[-]

like_any_other 8 months ago

Unless the scrapers are DDoSing the site, I refuse to consider the downloading of publicly posted data as malicious. It shows how captured the conversation has become by corporate interests, that viewing or storing data posted free of charge, publicly, by their users, in a way not approved by that corporation, is seen as malicious, and the only morally allowed way to view it is to use their spyware-laden client.

[-]

areyourllySorry 8 months ago

this is also a measure against bots that write, not just those that read

Scaevolus 8 months ago

What if the user has disabled downloads of a video? Should the creator (and copyright owner) of a piece of media not be allowed even token attempts to prevent copying?

[-]

ndriscoll 8 months ago

No because that interferes with fair use. If someone publicly posts a video, everyone has the right to copy it without any permission or awareness from the original author for things like commentary/criticism (it would be silly to require the copyright owner's permission to criticise a work!).

hoseja 8 months ago

Here's a great way to prevent people copying your precious video: don't post in on the internet.

8 months ago

[deleted]

8 months ago

[deleted]

aaron695 8 months ago

[dead]

davidsojevic 8 months ago

Very impressive work! I always enjoy a good write up about reverse engineering efforts and yours was really simple to follow.

Many popular/large websites and bot protection services usually have environment checking as a baseline and mouse-movement tracking in some of the more aggressive anti-bot checks.

It's always interesting to see how long it takes from when the measures have been defeated/publicised until the service ends up making changes to their mechanism to make you start over (hopefully not from scratch).

[-]

xfeeefeee 8 months ago

All credit should go to Lukas https://github.com/LukasOgunfeitimi

I was sharing this here since I thought it was a great write up, but did not intend to pass it off as my own!

There is certainly always a good amount of push and pull, though my personal concern as a contributor to yt-dlp under another alias is more about archival of the underlying media rather than automating things like comments.

YouTube also uses an interesting scheme for authenticating requests for media as well which required implementing a very basic JavaScript interpreter within Python for yt-dlp too. I expect this kind of thing to continue to become even more common and complicated.

mrkramer 8 months ago

In my bookmarks I found this RE examples as well: https://www.nullpt.rs/reverse-engineering-tiktok-vm-1

https://ibiyemiabiodun.com/projects/reversing-tiktok-pt2/

ronsor 8 months ago

There is no legitimate reason for a social media platform to employ this much obfuscation.

[-]

fidotron 8 months ago

If you believe this you underestimate how adversarial the software world really is. TikTok will be on the receiving end of botnets by everything from commercial entities, state backed groups and criminals.

They won't be betting that this stops that entirely, but it adds a layer of friction that is easy for them to change on a continuous basis. These things are also very good for leaving honeypots in where if someone is found to still be using something after a change you can tag them as a bot or otherwise hacking. Both of those approaches are also widely used in game anti-cheat mechanisms, and as shown there the lengths people will go to anyway are completely insane.

[-]

fmxsh 8 months ago

It's an excellent strategy for the reasons you mention. And a kind of "security by principle of least privilege".

lazyeye 8 months ago

Nah..I agree with the parent comment, there is simply no legitimate reason for a social media app to employ this level of obsfucation.

[-]

Thorrez 8 months ago

If you ran a social media site and app, and had a problem of many different groups employing bots to post tons of content for nefarious purposes to your site, what would you do?

[-]

lazyeye 8 months ago

I guess Id probably be doing something similar to what all the other social media apps are doing (unless of course, I had something to hide...)

[-]

Thorrez 8 months ago

What are the other social media apps doing? Are you sure they're not using obfuscated VMs as well?

I'm guessing a lot of them use reCAPTCHA, and according to this comment, reCAPTCHA uses an obfuscated VM:

https://news.ycombinator.com/item?id=43748994

[-]

lazyeye 8 months ago

Yep I'd probably go with a reCaptcha like everybody else except TikTok then.

krackers 8 months ago

The legitimate reason could be bot protection, the same way recaptcha uses a similar VM technique for obfuscation.

vasco 8 months ago

You not being able to come up with one is different from there not being any possible reason.

supriyo-biswas 8 months ago

See my other comment on this thread: https://news.ycombinator.com/item?id=43748994

miohtama 8 months ago

It's to keep bots away and not turn to be another Twitter.

[-]

dns_snek 8 months ago

That's probably not the goal. There are bots advertising illegal services (e.g. ads for "hacking services", illegal drugs) in most comment sections. If you report these comments, 99.9% of the time the report will be rejected with "no violations found" and the spam stays up.

[-]

bolognafairy 8 months ago

That doesn’t mean that it’s “probably not the intention”.

[-]

dns_snek 8 months ago

The balance of evidence suggests otherwise. If they cared about spam bots they would take action when spammers are handed to them on a silver platter. The kinds of spammers who will leave 30 identical comments advertising illegal services, not some weird moderation corner case.

If you ever end up on a video that's related to drugs, there will be entire chains of bots just advertising to each other and TikTok won't find any violations when reported. But sure, I'm sure they care a whole lot about not ending up like Twitter.

[-]

wpietri 8 months ago

A large company is much less cohesive than you realize. You can't reliably reason about the goals of one part because another part isn't consistent. This particular difference could easily be explained by insufficient funding to moderation, which is endemic in social media.

[-]

dns_snek 8 months ago

I've said this twice already, it's not that another part "isn't consistent" (I would agree that this is to be expected), they're CONSISTENTLY acting in the opposite manner than is being speculated here and I subscribe to the "purpose of a system is what it does" world view.

[-]

wpietri 8 months ago

If you really subscribed to POSIWID, you wouldn't be making arguments like "That's probably not the goal", as that's nonsensical from the POSIWID perspective.

The nominal goal of the code could well be bots at the same time the POSIWID purpose is about the exec impressing his superiors and the developers feeling smart and indulging their pet technical interests. Similarly, the nominal goal of the abuse reporting system would include spam, even if the POSIWID analysis would show that the true current purpose is to say they're doing something while keeping costs low.

So again, I don't think you have a lot of understanding of how large companies work. Whereas I, among other things, ran an anti-abuse engineering team at Twitter back in the day, so I'm reasonably familiar with the dynamics.

TheDong 8 months ago

So you're saying that TikTok's support team doing a poor job of handling reports is proof that the engineering team wasn't tasked with reducing spam by writing code obfuscation?

TikTok is a huge company, evidence of what the support department does or doesn't do has only minor bearing on the whole company, and basically none on the engineering department.

The thing that seems most likely to me is that they care about spam, the engineering department did this one thing, and the support department is either overworked or cares less. Or really efficient which is why you only see "a lot of spam", not "literally nothing but spam".

lazyeye 8 months ago

Because bots cant interact with web pages at the browser level like humans do...

yard2010 8 months ago

This is not a social media platform but a government backed tool for doing stuff for the government.

Wowfunhappy 8 months ago

...can I ask a really stupid question? What is a VM in this context?

I've used VM's for years to run Windows on top of macOS or Linux on top of Windows or macOS on top of macOS when I need an isolated testing environment. I also know that Java works via the "Javascript Virtual Machine" which I've always thought of as "Java code actually runs in its own lightweight operating system on top of the host OS, which makes it OS-agnostic". The JVM can't run on bare metal because it doesn't have hardware drivers, but presumably it could if you wrote those drivers.

But presumably the VM being discussed in TFA isn't that kind of VM, right? Bytedance didn't write an operating system in Javascript?

I've been seeing "VM" used in lots of contexts like this recently and it makes me think I must be missing something, but it's the sort of question I don't know how to Google. AIs have not been helpful either, plus I don't trust them.

[-]

ngneer 8 months ago

This is not a stupid question. I have seen other comments on the thread that confuse the two terms and run with it. Better to ask than assume. Especially since "VM" is the same label for two or three distinct yet related notions in security.

The VM you are familiar with indeed can run an OS, and is indeed not what TikTok does.

#1 VMM - hypervisor runs VMs

#2 JVM/.NET - efficient bytecode

#3 Obfuscation - obscure bytecode

The main thing is that for #2 and #3 the machine language changes.

With "virtualization" as used in most contexts, involving a virtual machine monitor, or hypervisor, one creates zero or more new (virtual) machines, to execute on multiple software recipes. All the recipes are written in the same (machine) language, for all the machines. This can help security by introducing isolation, for example, where one VM cannot read memory belonging to another VM unless the hypervisor allows it.

With the "virtual machine" used for obfuscation, the machine language changes. The system performs the same actions as it would without obfuscation, but now it is performing those actions using a different machine language. Behaviorally, the result is the same. But, the new language makes it harder to reverse engineer the behavior.

Stupid example:

Original instruction: MOV A,B

Under hypervisor virtualization, VM0 and VM1 will perform this same instruction.

Under obfuscation virtualization, software will perform instructions that amount to the same result, but are harder to figure out. So, the MOV instruction is redefined and mapped onto a new (virtual) machine. The new machine does not simply leverage the existing instruction, rather an obfuscated sequence. For example:

A <- B + C + D * E

A <- A - C

A <- A - D * E

Obviously, the above transformation is easy to understand and undo. Others are harder to understand and undo. Look up MOVfuscator to see how crazy things may get.

turtleyacht 8 months ago

Virtual Machine Decompiling: https://github.com/LukasOgunfeitimi/TikTok-ReverseEngineerin...

And also VM223, with statements that do stuff to an array "stack": https://github.com/LukasOgunfeitimi/TikTok-ReverseEngineerin...

One obvious giveaway for a VM is laying out memory, or processing some intermediate language. In this case, it could be the latter.

In-browser, you have Chrome V8 running Javascript; that Javascript could be running an interpreted environment where abstractions are not purely business logic, but an execution model separate from domain stuff: auth, video, user, etc.

By that observation, this C snippet is a VM:

  char instruction = 'p'; /* or array */

  if (instruction == 'p') {
    println("document.appendChild(...)");
  }

If the program outputs to a vm.js file, it's kinda-sorta a "VM." I would call it something else, maybe a generator of sorts (for now). Just in my opinion, for me, if I were working on a VM, the threshold of calling it that would be much higher than the above.

On the other hand, if I had to comment in the generated Javascript debugging hints referring to execution stack or stack pointers, it is kind of a VM idea.

yjftsjthsd-h 8 months ago

Nit:

> I also know that Java works via the "Javascript Virtual Machine"

Java Virtual machine. That Java and JavaScript are named the way they are is... basically a historical accident of a cross-promotion gone too far, IMO. They aren't really related (at least, in the way that the name might imply).

Now to your real question. Virtual machines are anything that is one computer pretending to be another computer. Sometimes, that's an x86_64 PC pretending to be another x86_64 PC to run a different OS. Sometimes that's an x86_64 PC pretending to be a 50-year-old mainframe ( https://opensimh.org/ really shines there). Sometimes it's an ARM laptop running macOS pretending to be an x86_64 PC so it can run Windows. And, relevant here, sometimes it's a phone pretending to be a machine that has never actually existed in hardware. You can just make up an imaginary machine that has any old characteristics you want. Maybe it has a built-in high-level network card that magically turns HTTP requests into responses without programs having to implement HTTP themselves. Maybe it has an imaginary graphics card that directly renders buttons. Maybe you imagine a CPU that runs Java opcodes directly. Whatever it is, if you can imagine a system and then write a program that emulates it, you can make a virtual machine and run stuff in it.

[-]

Wowfunhappy 8 months ago

> Java Virtual machine. That Java and JavaScript are named the way they are is... basically a historical accident of a cross-promotion gone too far

Oops, that was a typo! Thank you.

Jasper_ 8 months ago

The words "virtual machine" and "interpreter" are mostly interchangeable; they both refer to a mechanism to run a computer program not by compiling it to machine code, but to some intermediate "virtual" machine code which will then get run. The terminology is new, but the idea is older, "P-code" was the term we used to use before it fell out of favor.

Sun popularized the term "virtual machine" when marketing Java instead of using "interpreter" or "P-code", both for marketing reasons (VMware had just come on the scene and was making tech headlines), but also to get away from the perception of classic interpreters being slower than native code since Java had a JIT compiler. Just-in-time compilers that compiled to the host's machine code at runtime were well-known in research domains at the time, but were much less popular than the more dominant execution models of "AST interpreter" and "bytecode interpreter".

There might be some gatekeepers that suggest that "interpreter" means AST interpreter (not true for the Python interpreter, for instance), or VM always means JIT compiled (not true for Ruby, which calls its bytecode-based MRI "RubyVM" in a few places), but you can ignore them.

fmxsh 8 months ago

It sounds more advanced than it is.

It's a function wrapping the functionality of its host environment. Then provides the caller with its own byte code language to execute instructions. The virtual machine translates those instructions to the corresponding real functionality of the host environment (Javascript) upon execution.

This particular case is sophisticated but the idea is simple.

Correct me if I'm wrong. I'm not knowledgeable in this. This is my current understanding of it.

jacobp100 8 months ago

Yes the VM discussed is similar to JVM

heinternets 8 months ago

Is TikTok so obfuscated to prevent people from knowing the full extent of data collection and device fingerprinting?

[-]

gruez 8 months ago

1. Practically speaking all this javascript fingerprinting pales in comparison to what native apps have access to. Most people aren't using tiktok on their browsers, and the browser version heavily pushes you to using the app, so you should be far more worried about whatever's happening in the app.

2. Despite tiktok having a giant target painted on its back for its perceived connections to the CCP, I haven't really seen any evidence that it does any more tracking/fingerprinting that most other websites (eg. facebook) or security services (eg. cloudflare or recaptcha) already do.

[-]

nicce 8 months ago

> 2. Despite tiktok having a giant target painted on its back for its perceived connections to the CCP, I haven't really seen any evidence that it does any more tracking/fingerprinting that most other websites (eg. facebook) or security services (eg. cloudflare or recaptcha) already do.

Take a look for request parameters in TikTok vs. Instagram for example.

Every request for TikTok forces you to pass most of the information that browser can collect from the end-user before server responds:

https://www.nullpt.rs/reverse-engineering-tiktok-vm-1

[-]

gruez 8 months ago

>Every request for TikTok forces you to pass most of the information that browser can collect from the end-user before server responds:

Half of the parameters are stuff relating to the app itself, or could be inferred from other sources like user-agent. The other fingerprinting stuff (eg. canvas or webgl fingerprinting) is basically industry standard and by no means unique to tiktok. Even the claim that "browser can collect from the end-user before server responds" doesn't hold up to scrutiny, because there's no meaningful difference between that, and browser check interstitials (eg. the cloudflare checkbox), which fingerprint you before letting you access the content. It's also unclear how that's more sinister than the alternative approach of sending telemetry/fingerprinting data to a separate endpoint.

RexM 8 months ago

Is this VM somehow related to Lynx (their cross platform dev tooling?)

https://lynxjs.org/

Also discussed on HN

https://news.ycombinator.com/item?id=43264957

0xDEADFED5 8 months ago

this is cool. i briefly worked on a TikTok bot a while back and it was a huge pain in the ass.

weinzierl 8 months ago

Is there also a VM in their iOS app? I thought a VM would be against Apple's policies?

[-]

xmodem 8 months ago

Apple's policies prevent using JIT compilation, they don't ban VM's outright.

[-]

jacobp100 8 months ago

This is the correct answer. They even expose JavaScript Core to apps

Scaevolus 8 months ago

Their mobile apps have equivalent signature code, but it's compiled to native binaries instead.

lazyeye 8 months ago

An oldie but a goodie. A guide to manipulating online comments to hide/dilute/obsfucate undesirable commentary....

https://cryptome.org/2012/07/gent-forum-spies.htm

sylware 8 months ago

What's terrible are the humans writing such software...

But if AI can help to fight those people's work, good for humanity I guess.

That said... Is AI going to de-obfuscate/reverse engineer their obsfuscated AI prompts or web apps?

domfie 8 months ago

Looks like a lot of work. I recently discovered webcrack and the tool jehna/humanify for such deobfuscate tasks

[-]

3abiton 8 months ago

It could be interesting to see a comparison to OP's work.

itsthecourier 8 months ago

this level of obfuscation in a social app is super suspicious

[-]

doublerabbit 8 months ago

I wouldn't say so, pretty common. It used to add a layer of security. You should take a look at an casino app.

Did you know that every chip on a Chip & Pin bank card is powered by a Java Virtual Machine that when you go to tap or insert in to a card reader it's activated.

https://en.wikipedia.org/wiki/Java_Card

8 months ago

[deleted]

lukaso112 8 months ago

[dead]

worldsavior 8 months ago

That's a very strong obfuscation. Takes a lot of work to deobfuscate such a thing. Great writeup.

xfeeefeee 8 months ago

The fascinating process of reverse engineering this VM is detailed here.

TikTok uses a custom virtual machine (VM) as part of its obfuscation and security layers. This project includes tools to:

Deobfuscate webmssdk.js that has the virtual machine.

Decompile TikTok’s virtual machine instructions into readable form.

Script Inject Replace webmssdk.js with the deobfuscated VM injector.

Sign URLs Generate signed URLs which can be used to perform auth-based requests eg. Post comments.

[-]

noduerme 8 months ago

Is calling a massive embedded JS obfuscator a "VM" a bit of a stretch? Ultimately it's not translating anything to a lower-level language.

Still, I had no idea. This is really taking JS obfuscation to the next level.

One kind of wonders, what is the purpose of that level of obfuscation? The naive take is that obfuscation is usually to protect intellectual property... but this is client-side code that wouldn't give away anything about their secret sauce algorithm.

[-]

MonkeyClub 8 months ago

> Is calling a massive embedded JS obfuscator a "VM" a bit of a stretch? Ultimately it's not translating anything to a lower-level language.

From the Repo's README:

"TikTok is using a full-fledged bytecode VM, if you browse through it, it supports scopes, nested functions and exception handling. This isn't a typical VM and shows that it is definitely sophiscated."

[-]

noduerme 8 months ago

But that's basically an emulator of a VM, isn't it? It's like rewriting the Flash AVM2 into JS... it's still running in JS whereas the original VM was C++. It could JIT compile stuff but only because it literally was reserving memory that could overflow, and (semi-technical take here) from that advantage, of being closer to the metal, flowed all of the flaws in AVM2 that precipitated most of Adobe's woes with Flash. A VM implant in a web page that uses a plugin like Java or Flash, to get around running browser-sandboxed code, which can take over physical memory, is far different from just emulating a VM in Javascript. I wouldn't call writing a ton of opcodes in JS, which resolved to JS functions, a "virtual machine", because it isn't reserving anything or doing anything that Javascript can't do. Someone correct me here if I'm wrong... this is just heavy-duty obfuscation.

Also, one major purpose of a VM is to improve performance over what's available in the browser. If you use that as a measurement, this clearly doesn't fit that goal.

[-]

gruez 8 months ago

>But that's basically an emulator of a VM, isn't it?

Emulators and VMs aren't mutually exclusive.

>Also, one major purpose of a VM is to improve performance over what's available in the browser. If you use that as a measurement, this clearly doesn't fit that goal.

And from your other comment:

>I would define it as a custom instruction set plus some sort of plug-in that allows those opcodes to be run closer to the metal than the language they're written in.

A virtual machine just means a machine that's virtual. All the other expectations you apply on top of it (eg. "improve performance over what's available in the browser") is totally irrelevant. The JVM clearly doesn't improve performance of java code than running natively, but nobody denies it's a virtual machine. The same goes for VMWare products ("VM" is literally in its name!), which executes x86 code but is further away from "the metal" that it's running on.

8 months ago

[deleted]

throwaway48476 8 months ago

VM obfuscation is a common technique for malware developers.

The VM term is applied because the obfuscator creates a custom instruction set and executes custom byte code. This is generated per build.

[-]

noduerme 8 months ago

I appreciate you making the distinction that anything which creates a custom instruction set is thus a VM. I think that's the way a lot of people here who are currently at my throat seem to define it, so I'm glad you put it in clear terms. I would define it as a custom instruction set plus some sort of plug-in that allows those opcodes to be run closer to the metal than the language they're written in. FWIW I'd call this thing more of an obfuscation framework. But maybe I'm just a dino. I am really glad you made this comment, though. It clarified for me why so many people went bananas when I said this wasn't a VM.

userbinator 8 months ago

You are replying to a comment that looks extremely unhuman.

[-]

codetrotter 8 months ago

It looks like OP filled out the text area alongside with the URL when submitting the post.

HN takes that text and turns it into a comment. I’ve seen it happen before.

The unfortunate outcome of that IMO is that sometimes text that makes sense as a description of a submission feels a bit out of place as a comment due to how they are worded. And these comments sometimes then end up getting downvoted.

I wouldn’t be completely sure it was not human written. Even though it feels a bit weird to read it as a comment.

[-]

xfeeefeee 8 months ago

> It looks like OP filled out the text area alongside with the URL when submitting the post. HN takes that text and turns it into a comment.

Yeah, this is exactly what happened, but I decided to keep it rather than delete and filled it out more with the synopsis from the repo.

Looking back at it, it really does look like an AI bulleted summary. I probably should have noted that the last part was indeed a quotation.

lukaso112 8 months ago

[dead]

dmitrygr 8 months ago

What is the purpose of you posting a bad ChatGPT summary of the original post?

[-]

xfeeefeee 8 months ago

I quoted the synopsis from the readme thinking it would be helpful.

pests 8 months ago

It was the submission statement along with him submitting this post. It was detached as a comment and I don't think it AI.