I see a lot about source code being leaked and I’m wondering how you could make something like an exact replica of Super Mario Bros without the source code, or why you can’t take the finished product and run it back through the compilation software?
I actually work on a C++ compiler… I think I should weigh in. The general consensus here that things are lossy is correct but perhaps non-obvious if you’re not familiar with the domain.
When you compile a program you’re taking the source, turning it into a graph that represents every aspect of the program, and then generating some kind of intermediate representation (IR) that then gets turned into machine code.
You lose things like code comments because the machine doesn’t care about the comments right off the bat.
Then you lose local variable and function parameter names because the machine doesn’t care about those things.
Then you lose your class structure … because the machine really just cares about the total size of the thing it’s passing around. You can recover some of this information by looking at the functions but it’s not always going to be straightforward because not every constructor initializes everything and things like unions add further complexity … and not every memory allocation uses a constructor. You won’t get any names of any data members/fields though because … again the machine doesn’t care.
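To make the "size and offsets" point concrete, here is a minimal sketch. The two struct types and the helper `read_int_at_offset` are made up for illustration; they are not from any real codebase. They mean completely different things in the source, yet compiled code would treat them the same way: as a block of bytes with an int at a certain offset.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical types: same layout, completely different meaning in the source.
struct PlayerPosition {
    float x;
    float y;
    int   floor_number;
};

struct ShopItem {
    float price;
    float weight;
    int   stock_count;
};

// Both are plain standard-layout structs, so all the compiler needs to keep
// is their size and the offset of each member.
static_assert(std::is_standard_layout<PlayerPosition>::value, "plain struct");
static_assert(std::is_standard_layout<ShopItem>::value, "plain struct");
static_assert(sizeof(PlayerPosition) == sizeof(ShopItem), "same size");
static_assert(offsetof(PlayerPosition, floor_number) ==
              offsetof(ShopItem, stock_count), "same offsets");

// Reads an int by raw offset, roughly the way compiled code does:
// no type name, no field name, just "the int at this offset".
inline int read_int_at_offset(const void* object, std::size_t offset) {
    int value;
    std::memcpy(&value,
                static_cast<const unsigned char*>(object) + offset,
                sizeof value);
    return value;
}
```

Someone looking only at the compiled code sees the offset access and has to guess whether it was a floor number, a stock count, or something else entirely.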
So what you’re left with is basically the mangled names of functions and what you can derive from how instructions access memory.
The mangled names normally tell you a lot: the namespace, the class (if any), and the argument count and types. Of course that’s not guaranteed either; names look that way only because that’s how we come up with unique, stable names for the various things in your program. It could function with a bunch of UUIDs if you set up a table on the compiler’s side to associate everything.
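As a rough illustration of how much a mangled name carries, here is a small sketch that decodes one. It assumes GCC or Clang, which follow the Itanium C++ ABI and ship `abi::__cxa_demangle` in `<cxxabi.h>`; MSVC uses a different scheme. The symbol `game::Player::move` is invented for the example, and `demangle` is just a hypothetical wrapper name.

```cpp
#include <cxxabi.h>   // GCC/Clang only (Itanium C++ ABI)
#include <cstdlib>
#include <string>

// Turns a mangled symbol name back into a readable declaration.
// Returns an empty string if the demangler rejects the input.
inline std::string demangle(const char* mangled) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return std::string();
    }
    std::string result(readable);
    std::free(readable);  // __cxa_demangle allocates the buffer with malloc
    return result;
}
```

With this, `demangle("_ZN4game6Player4moveEif")` gives back `game::Player::move(int, float)`: namespace, class, function name, and argument types. Note what is not there: no parameter names, and for an ordinary function like this one, no return type.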
But wait! There’s more! The optimizer can do some really wild things in the name of speed… Including combining functions. Those constructors? Gone, now they’re just some more operations in the function bodies. That function you wrote to help improve readability of your code? Gone. That function you wrote to deduplicate code? Gone. That eloquent recursive logic you wrote? Gone, now it’s the moral equivalent of a giant mess of goto statements. That template code that makes use of dozens of instantiated functions? Those functions are gone now too; instead it’s all the instantiated logic puked out into one giant function. That piece of logic computing a value? Well the compiler figured out it’s always 27, so the logic to compute it? Gone.
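The "it's always 27" case can be sketched like this. The function names are made up, and I'm using `constexpr` so the `static_assert` can prove the compiler is able to work the answer out on its own; with ordinary functions an optimizer will usually do the same thing, but that part is not guaranteed.

```cpp
// Hypothetical helpers, written for readability in the source.
constexpr int cube(int n) { return n * n * n; }
constexpr int side_length() { return 3; }

// In the source this is a chain of calls and multiplications.
constexpr int compute_volume() { return cube(side_length()); }

// The compiler can evaluate the whole chain while compiling;
// this line would fail to build if it could not.
static_assert(compute_volume() == 27, "folded at compile time");

// An optimizing build will typically reduce this to "return 27",
// with no trace of cube() or side_length() left in the binary.
int volume() { return compute_volume(); }
```

If all that survives is the constant 27, nobody decompiling the binary can tell that it used to be a cube with a side length of 3.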
Now, all of that doesn’t happen every time; not all of those optimizations are always possible, or always a good idea … But you can see how incredibly difficult it is to reconstruct a program once it’s been compiled and gone through optimization. Even if you do reconstruct it, there’s a very low chance that it will look anything like what you started with.
Just wait until you see the crazy optimizers for embedded systems. They take the complete code of a system into consideration and, over a number of compile passes, reuse code snippets from the app, libraries, and OS layer to create one big tangled mess that is hard to follow even if you have the source code…
The long answer involves a lot of technical jargon, but the short answer is that the compilation process turns high level source code into something that the machine can read, and that process usually drops a lot of unneeded data and does some low-level optimization to make things more efficient during actual processing.
One can use a decompiler to take that machine code and attempt to turn it back into something human readable, but the result will usually be missing data on variable names, function calls, comments, etc. and will include compiler-added optimizations, which make it nearly impossible to reconstruct the original code.
It’s sort of the code equivalent of putting a sentence into Google Translate and then immediately translating it back to the original language. You often end up with differences in word choice that give you a good general idea of intent, but it’s impossible to know exactly which words were in the original sentence.
Thank you, sorry to push further but my understanding is that computers deal with binary so every language is compiled to machine code, which I took as binary.
So if the language has elements being removed and the machine doesn’t need them shouldn’t you get back out exactly what is needed to do the task? Like if you compiled some code and then uncompiled it you would get the most efficient version of it because the computer took what it needed, discarded the rest and gave it back to you?
One thing that’s missing is variables. In programming you can have multiple variables, you can assign variables to new variables with new names and names are very important to understand what’s going on.
The machine doesn’t care about variable names; it deals with registers and memory locations. Variable names are present in debug information, but that is commonly removed before releasing to the public.
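Here is a small sketch of what losing the names looks like. Both functions are invented for illustration. The second one imitates the kind of placeholder naming decompilers tend to generate (`sub_401000`, `a1`, `v1` are just stand-ins, not output from a real tool).

```cpp
// Descriptive names, as a programmer would write it.
int damage_after_armor(int base_damage, int armor_rating) {
    int reduced_damage = base_damage - armor_rating;
    int final_damage = reduced_damage < 0 ? 0 : reduced_damage;
    return final_damage;
}

// Same logic with placeholder names, roughly what you get back
// when the names were never stored in the binary.
int sub_401000(int a1, int a2) {
    int v1 = a1 - a2;
    return v1 < 0 ? 0 : v1;
}
```

Both versions do exactly the same work, so the machine has no reason to prefer one. Only the first tells a human what the subtraction is for.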
Not everything in the source material actually ends up in the final result. The compiler gets rid of any unnecessary components, and it will also strip away or collapse anything that can be simplified, deduplicated, or condensed.
Take the recent GTA V leak. The source code was some 1.2 TB of raw data and source material. Then add on the fact that there are probably thousands of code libraries to pull in bits and pieces from. And somehow we get a ~150GB playable version of that.
In that process, to be more efficient, it will also strip away anything representing a variable name, comments, spaces, and new lines.
Last, it crunches all of that down to machine code (the binary form of assembly), so only the hardware on your computer has any real chance of making sense of it.
The way modders make their tools is by poking software at the RAM and identifying chunks of memory used for certain things. If I’m not mistaken, these have to be hand-labeled after the fact as any reference names used internally by Rockstar would be lost along the way.
Trick question: it can be, that’s what a decompiler does. The problem is that a lot of information is lost in compiling source code, such as names and the exact implementation of loops.
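On the loops point, here is a quick sketch with made-up function names. All three compute the same sum. After compiling, each one is just compare-and-jump instructions, and an optimizer may well turn the recursive one into a loop too (that is common but not guaranteed), so a decompiler has no way to know which form the author originally wrote.

```cpp
// Written as a for loop.
int sum_with_for(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Written as a while loop.
int sum_with_while(const int* values, int count) {
    int total = 0;
    int i = 0;
    while (i < count) {
        total += values[i];
        ++i;
    }
    return total;
}

// Written recursively: the first element plus the sum of the rest.
int sum_recursive(const int* values, int count) {
    return count == 0 ? 0 : values[0] + sum_recursive(values + 1, count - 1);
}
```

A decompiler will pick whichever loop shape it likes best for all three, which is why the output gives you the behavior but not the original code.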