Skip to content

ExeRay AI detects malicious Windows executables using ML. Analyzes entropy, imports, and metadata for rapid classification, aiding incident response. Built with Python and scikit-learn.

License

Notifications You must be signed in to change notification settings

MohamedMostafa010/ExeRay

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

40 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

ExeRay 2.0 πŸ₯

TruxTrace banner

Advanced X-ray Vision for Windows Executables

  • Detect malicious .exe files using machine learning. Extracts static features (entropy, imports, metadata) and combines ML with heuristic rules for fast, automated classification.

πŸš€ What's New in v2.0

  • 50+ New Detection Features (VM detection, anti-debugging, API call chains)
  • Enhanced Prediction Engine with detailed suspicious behavior reports if malware found
  • Recall-Optimized Training: Custom scorer prioritizing malware detection
  • Streamlined 3-Script Architecture (faster workflow)
  • Improved Accuracy (F1-score up to 0.99 in testing)
  • Dataset Provided !!

πŸ“Š Dataset Information

- Source & Composition:

Dataset From Examples Total
Malicious Dataset DikeDataset, theZoo, MalwareBazaar WannaCry.exe, njRAT.exe 10,925
Benign Dataset Windows Files, Ninite.com, PortableApps.com Putty.exe, notepad.exe, ida.exe 3,590
Total 14,515
Size 11.81 GB
  • Dataset Processing: From 10,925 malware samples, we processed 4,200 for feature extraction, then applied Undersampling to balance with 3,500 benign samples (7,000 total). Used RandomUnderSampler (random_state=42) to prevent malware bias while preserving key patterns.
  • Benign and Malicious Dataset Link (11.81 GB on MEGA): https://mega.nz/folder/iAU3iARQ#nKPwCQIW4jZgAEFmRJlR6Q
  • ⚠️ Safety Notice:
    • Please exercise caution when downloading and handling this dataset. It contains both benign and malicious files for research purposes.
    • Do not execute or open any files unless you're in a secure, isolated environment (e.g., a virtual machine or sandbox). Executing malicious files can harm your system or compromise your data.
  • ⚠️ Important Notice About the Dataset:
    • To keep this repository lightweight and easy to download, the full dataset is not included here. Specifically:
      • The data/ folder does not contain any executable files.
      • You will find only two empty directories: benign/ and malware/.
      • If you wish to work with the actual dataset, you need to download it manually from the MEGA link.

MEGA Logo

βš™οΈ Enhanced Features

  • Hybrid AI detection (XGBoost + Random Forest)
  • Detailed Malware Fingerprinting:
    • VM/Sandbox detection markers
    • Anti-debugging technique identification
    • Suspicious API call patterns
  • Confidence Scoring with threat level classification

πŸ”§ Upgraded Tech Stack

New Components:

  • Advanced PE Analysis: Full directory parsing (TLS, Debug, Resources)
  • String Analysis: Unicode/ASCII pattern detection
  • Behavioral Indicators: 15+ new malware behavior signatures

Key Improvements:

- Structural Features:

# PE File Structure
'num_sections', 
'num_unique_sections',
'section_names_entropy',
'avg_section_size',
'min_section_size',
'max_section_size',
'total_section_size',
'avg_entropy',
'min_entropy',
'max_entropy',
'has_packed_sections',
'has_executable_sections',
'writable_executable_sections',
'is_dll',
'is_executable',
'is_system_file',
'has_aslr',
'has_dep',
'is_signed',
'has_rich_header',
'rich_header_entries',
'has_resources',
'num_resources',
'has_embedded_exe',
'has_debug',
'has_tls',
'has_relocations',
'ep_in_first_section',
'ep_in_last_section',
'ep_section_entropy',
'has_suspicious_sections'

- Behavioral Features:

# API/Import Analysis
'num_imports',
'num_unique_dlls',
'num_unique_imports',
'imports_to_dlls_ratio',
'has_import_name_mismatches',
'suspicious_imports_count',
'num_exports',
'suspicious_exports',
'suspicious_api_chains',
'has_delayed_imports',
'has_vm_detection_imports',
'has_anti_debug_imports',
'has_process_creation_imports',
'has_createprocess',
'has_setwindowshookex',

# String Patterns
'num_strings',
'avg_string_length',
'has_suspicious_strings',
'has_anti_debug',
'has_vm_detection_strings',
'has_vm_mac_addresses',
'has_anti_debug_strings',
'has_nop_sleds',
'has_anti_debug_strings'

- Detection Signatures:

vm_detection_strings = {
    b'vbox', b'vmware', b'virtualbox', b'qemu', b'xen', b'hypervisor',
    b'virtual machine', b'vmcheck', b'vboxguest', b'vboxsf', b'vboxvideo'
}

vm_mac_prefixes = {
    b'00:0C:29', b'00:1C:14', b'00:05:69', b'00:50:56',  # VMware
    b'08:00:27',  # VirtualBox
    b'00:16:3E',  # Xen
    b'00:1C:42',  # Parallels
    b'00:15:5D'   # Hyper-V
}

anti_debug_strings = {
    b'IsDebuggerPresent', b'CheckRemoteDebuggerPresent', b'OutputDebugString',
    b'NtQueryInformationProcess', b'NtSetInformationThread', b'ZwSetInformationThread'
}

suspicious_patterns = {
    b'payload', b'malware', b'inject', b'virus', b'trojan',
    b'backdoor', b'rat', b'worm', b'spyware', b'keylog',
    b'xored', b'encrypted', b'packed', b'obfus'
}

# API Groups
vm_detection_apis = {
    'cpuid', 'hypervisor', 'vmcheck', 'vbox', 'vmware', 'virtualbox',
    'wine_get_unix_file_name', 'wine_get_dos_file_name'
}

anti_debug_apis = {
    'IsDebuggerPresent', 'CheckRemoteDebuggerPresent', 'OutputDebugStringA',
    'NtQueryInformationProcess', 'NtSetInformationThread', 'NtQuerySystemInformation',
    'GetTickCount', 'QueryPerformanceCounter', 'RDTSC', 'GetProcessHeap',
    'ZwSetInformationThread', 'DbgBreakPoint', 'DbgUiRemoteBreakin'
}

process_creation_apis = {
    'CreateProcessA', 'CreateProcessW', 'CreateProcessAsUserA', 'CreateProcessAsUserW',
    'SetWindowsHookExA', 'SetWindowsHookExW', 'ShellExecuteA', 'ShellExecuteW',
    'WinExec', 'System'
}

# Suspicious API Chains
api_sequences = {
    ('VirtualAlloc', 'WriteProcessMemory', 'CreateRemoteThread'): 'Process Injection',
    ('RegCreateKey', 'RegSetValue', 'RegCloseKey'): 'Registry Persistence',
    ('LoadLibraryA', 'GetProcAddress', 'VirtualProtect'): 'Dynamic API Resolution',
    ('OpenProcess', 'ReadProcessMemory', 'WriteProcessMemory'): 'Process Hollowing',
    ('NtUnmapViewOfSection', 'MapViewOfFile', 'ResumeThread'): 'RunPE Technique',
    ('CreateProcessA', 'WriteProcessMemory', 'ResumeThread'): 'Process Injection',
    ('SetWindowsHookExA', 'GetMessage', 'DispatchMessage'): 'Hook Injection'
}

πŸ“ Directory Structure

ExeShield_AI/
β”œβ”€β”€ assets/                      # Repo Images
β”œβ”€β”€ data/                        # Raw Samples  
β”‚   β”œβ”€β”€ malware/                 # Malicious Executables  
β”‚   └── benign/                  # Clean Executables
β”œβ”€β”€ dependencies/                # Installation Dependencies
β”œβ”€β”€ models/                      # Saved Models/Thresholds  
β”‚   β”œβ”€β”€ malware_detector.joblib  
β”‚   └── optimal_threshold.npy  
β”œβ”€β”€ output/                      # Processed Data (CSV/features)
β”‚   └── processed_features_dataset.csv
β”œβ”€β”€ scripts/                     # Core Scripts  
β”‚   β”œβ”€β”€ extract_features.py  
β”‚   β”œβ”€β”€ train_model.py  
β”‚   └── predict.py  
└── README.md

πŸ’» Installation and Usage (Commands & Outputs)

1. Clone the repository:

git clone https://github.com/MohamedMostafa010/ExeRay.git
cd ExeRay

2. Install dependencies:

pip install -r dependencies/requirements.txt

3. Extract Features:

> python extract_features.py
[*] Processing benign samples from ../data\benign...
[!] Not a valid PE file: adaminstall.exe
[!] Not a valid PE file: adamsync.exe
[!] Not a valid PE file: AddSuggestedFoldersToLibraryDialog.exe
[!] Not a valid PE file: AgentService.exe
[!] Not a valid PE file: AggregatorHost.exe
[!] Not a valid PE file: appcmd.exe
[!] Not a valid PE file: AppHostRegistrationVerifier.exe
[!] Not a valid PE file: ApplySettingsTemplateCatalog.exe
[!] Not a valid PE file: ApplyTrustOffline.exe
[!] Not a valid PE file: ApproveChildRequest.exe
[!] Not a valid PE file: AppVClient.exe
[!] Not a valid PE file: ARPPRODUCTICON.exe
[!] Not a valid PE file: audit.exe
[!] Not a valid PE file: AuditShD.exe
[!] Not a valid PE file: autofstx.exe
...
[*] Processing malware samples (limited to 3500) from ../data\malware...

[+] Processed Features Dataset saved to ../output/processed_features_dataset.csv
[+] Total samples: 6857
[+] Malware samples: 3500
[+] Benign samples: 3357

4. Train Model (Metrics Also Provided After Training to Know Your Model's Performance):

> python train_model.py
Training models:   0%|                                                                                                                                                 | 0/2 [00:00<?, ?it/s]
New best model: XGBoost (Recall=0.990)
Training models: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:07<00:00,  3.90s/it]

=== Evaluation ===
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       672
           1       0.99      0.99      0.99       700

    accuracy                           0.99      1372
   macro avg       0.99      0.99      0.99      1372
weighted avg       0.99      0.99      0.99      1372

ROC AUC: 1.000

Model saved to ../models/malware_detector.joblib

5. Predict Executable:

> python predict.py "path/to/[benign_file]"
Malware Detection Results:
========================================
File: CapCut.exe
Prediction: BENIGN
Malware Probability: 0.41%
Confidence Level: VERY_LOW
Decision Threshold: 35.93%

> python predict.py "path/to/[suspicious_file]"
Malware Detection Results:
========================================
Malware Detection Results:
========================================
File: Mh1.exe
Prediction: MALWARE
Malware Probability: 98.39%
Confidence Level: VERY_HIGH
Decision Threshold: 35.93%

Top Suspicious Features:
- High maximum section entropy: 7.887
- Suspicious API imports: 6
- High average section entropy: 5.743
- High section name entropy: 2.807
- Writable and executable sections: 2
- Suspicious API call chains: 1
- Anti-debugging API imports: 1
- Anti-debugging strings in code: 1

πŸ” Handling False Positives

  • While ExeShield AI achieves high accuracy, occasional false positives (legitimate files flagged as malware) may occur. Common causes:
    • Legitimate tools with behaviors resembling malware (e.g., putty.exe).
    • Packed/obfuscated benign files (high entropy).

- Example False Positive Output (Below is an example of a malicious file (1.exe) predicted as BENIGN:

> python predict.py python predict.py "C:\Users\[USERNAME]\Documents\Maicious_TEST\1.exe"
Malware Detection Results:
========================================
File: 1.exe
Prediction: BENIGN
Malware Probability: 1.56%
Confidence Level: VERY_LOW
Decision Threshold: 35.93%

Mitigation Strategies:

  • Adjust Threshold:
    • Lower the decision threshold in predict.py for stricter filtering
  • Whitelist Trusted Files:
    • Manually verify and exclude known-safe executables.
  • Retrain the Model:
    • Add misclassified samples to your dataset and rerun train_model.py.

πŸ“„ Scientific Research Paper

  • Presented the ExeRay scientific research paper at the 9th International Undergraduate Research Conference (IUGRC 2025), held at the Military Technical College (MTC), Cairo, Egypt.
  • The full academic paper (PDF) is included in this repository under the assets/ folder: ExeRay Paper.pdf
  • The dataset in the paper is smaller for benchmarking under academic constraints.

🎞️ Paper Preview (GIF)

ExeRay Paper Preview

πŸ“ˆ ROC Curve Analysis

ROC Curve: Malware Detection Performance

Model Performance Metrics:

  • 🟦 Our Model (AUC = 1.00): Perfect classification capability
  • πŸŸ₯ Random Guess (AUC = 0.5): Baseline for comparison
  • πŸ“Š Optimal Threshold: 36% (0.36 in probability units)

Key Interpretation:

  • X-axis (False Positive Rate): Lower values = fewer false alarms
  • Y-axis (True Positive Rate): Higher values = more malware caught
  • Perfect Score: Curve touching top-left corner (achieved in our case)

🀝 Contributing

  • Pull requests are welcome! If you have ideas for new user profiles, simulation modes, or forensic artifacts, feel free to contribute.

πŸ“– License

  • This project is released under the MIT License.

About

ExeRay AI detects malicious Windows executables using ML. Analyzes entropy, imports, and metadata for rapid classification, aiding incident response. Built with Python and scikit-learn.

Topics

Resources

License

Stars

Watchers

Forks

Languages