#!/usr/bin/env python3
"""
generate_dataset.py — NVD CVE Dataset Generator and Analysis Handoff
Entry point: python generate_dataset.py [options]
Purpose
-------
Queries the NVD 2.0 API to fetch CVE records,
populates the local NVD 2.0 and CVE List V5 disk caches, then hands off to
analysis_tool.py for analysis. analysis_tool only references cached files;
generate_dataset ensures those files are cached and current.
Generation Modes
----------------
--cve <ID>
Single targeted CVE. Calls query_nvd_cve_by_id(). Mutually exclusive
with all date/status options.
--last-days N / --start-date + --end-date
Date-range mode. Calls generate_last_days() or generate_date_range(),
which delegate to query_nvd_cves_by_date_range().
(default)
Status-based mode via --statuses. Calls query_nvd_cves_by_status().
Key Functions
-------------
query_nvd_cves_by_status() Status-based bulk fetch.
query_nvd_cves_by_date_range() Date-range bulk fetch.
query_nvd_cve_by_id() Single targeted CVE fetch + cache write.
_save_nvd_cve_to_cache_*() Writes NVD 2.0 records to disk cache.
_save_cve_list_v5_to_cache_*() Writes CVE List V5 records to disk cache.
_flush_cache_batches() Persists all queued cache writes atomically.
run_analysis_tool() Sole handoff point to analysis_tool.main()
via sys.argv injection with --file.
main() Argument parsing and mode dispatch.
Important Rules
---------------
- All three generation modes converge at run_analysis_tool() before analysis.
- Cache writes are batched in memory and flushed explicitly; on interrupt,
_handle_interrupt() ensures the batch is flushed before exit.
- --alias-report requires --source-uuid (except in --nvd-ish-only mode).
- --cve cannot be combined with --last-days, --start-date, --end-date, or
--statuses.
"""
import json
import os
import re
import sys
import signal
from datetime import datetime, timezone, timedelta
from time import sleep
import argparse
from pathlib import Path
import traceback
import uuid
from src.analysis_tool.logging.workflow_logger import get_logger
from src.analysis_tool.core.gatherData import config
def _handle_interrupt(signum, frame):
"""Handle interrupt signals - flush cache and exit cleanly"""
try:
logger = get_logger()
logger.warning("\nInterrupt received - flushing cache batches...", group="DATASET")
_flush_cache_batches()
logger.warning("Cache flushed - exiting", group="DATASET")
except Exception as e:
print(f"\nWarning: Cache flush on interrupt failed: {e}", file=sys.stderr)
sys.exit(130)
def validate_uuid(uuid_string):
"""Validate that a string is a proper UUID format"""
try:
uuid.UUID(uuid_string)
return True
except ValueError:
return False
def resolve_output_path(output_file, run_directory=None):
"""Resolve output file path - write to run directory if provided, otherwise use absolute path"""
if os.path.isabs(output_file):
return Path(output_file)
elif run_directory:
# Write directly to logs directory (consolidated storage)
logs_dir = run_directory / "logs"
logs_dir.mkdir(parents=True, exist_ok=True)
return logs_dir / output_file
else:
# No run directory provided - fail fast
raise RuntimeError("Run directory required for dataset generation - standalone usage not supported")
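# Illustrative behavior (hypothetical paths):
#   resolve_output_path("cve_dataset.txt", Path("runs/2025-01-01_dataset"))
#     -> runs/2025-01-01_dataset/logs/cve_dataset.txt
#   resolve_output_path("/tmp/out.txt")  -> /tmp/out.txt (absolute path used as-is)
#   resolve_output_path("out.txt")       -> RuntimeError (no run directory provided)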
VERSION = config['application']['version']
TOOLNAME = config['application']['toolname']
# Get global logger instance (shared with analysis_tool)
logger = get_logger()
# Lists track all NVD and CVE List record cache updates
_cache_batch = {
'nvd_updates': [],
'cve_list_updates': []
}
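# Queued item shapes (see the append calls in the two
# _save_*_to_cache_during_bulk_generation functions below):
#   nvd_updates:      {'cve_id', 'file_path', 'vulnerability_record', 'repo_path'}
#   cve_list_updates: {'cve_id', 'file_path', 'repo_path'}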
def _flush_cache_batches():
"""Process all pending cache updates in batches"""
_flush_nvd_cache_batch()
_flush_cve_list_cache_batch()
def _flush_cve_list_cache_batch():
"""Process queued CVE List V5 cache updates in batch"""
if not _cache_batch['cve_list_updates']:
return
from src.analysis_tool.core.gatherData import _refresh_cvelist_from_mitre_api, _update_cache_metadata, load_schema
batch = _cache_batch['cve_list_updates']
_cache_batch['cve_list_updates'] = []
logger.info(f"{len(batch)} CVE List v5 records require cache operations", group="CACHE_MANAGEMENT")
# Pre-load CVE List V5 schema once for entire batch to avoid repeated loads
try:
cve_schema = load_schema('cve_cve_5_2')
except Exception as e:
logger.warning(f"Failed to load CVE List V5 schema for batch validation: {e} - Processing without validation", group="CACHE_MANAGEMENT")
cve_schema = None
for item in batch:
try:
# Use centralized gatherData.py function for API call and caching
# Disable per-file metadata updates during batch processing (efficient)
_refresh_cvelist_from_mitre_api(
item['cve_id'],
item['file_path'],
refresh_reason="batch cache update",
cve_schema=cve_schema,
update_metadata=False
)
except Exception as e:
logger.debug(f"CVE List v5 local cache update failed for {item['cve_id']}: {e}", group="CACHE_MANAGEMENT")
# Single metadata update for entire batch
if batch:
try:
cve_config = config['cache_settings']['cve_list_v5']
cve_repo_path = cve_config.get('path', 'cache/cve_list_v5')
_update_cache_metadata('cve_list_v5', cve_repo_path)
except Exception as e:
logger.warning(f"Failed to update CVE List V5 cache metadata after batch: {e}", group="CACHE_MANAGEMENT")
def _flush_nvd_cache_batch():
"""Process queued NVD cache updates in batch"""
if not _cache_batch['nvd_updates']:
return
from src.analysis_tool.core.gatherData import _save_nvd_cve_to_local_file, _update_cache_metadata, load_schema
batch = _cache_batch['nvd_updates']
_cache_batch['nvd_updates'] = []
logger.info(f"{len(batch)} NVD 2.0 records require cache operations", group="CACHE_MANAGEMENT")
# Pre-load schema once for entire batch to avoid repeated loads
try:
nvd_schema = load_schema('nvd_cves_2_0')
except Exception as e:
logger.warning(f"Failed to load NVD schema for batch validation: {e} - Processing without validation", group="CACHE_MANAGEMENT")
nvd_schema = None
# Extract repo_path from first item (all items use same path)
nvd_repo_path = batch[0]['repo_path'] if batch else None
for item in batch:
try:
# Create NVD API response format
nvd_response_data = {
"resultsPerPage": 1,
"startIndex": 0,
"totalResults": 1,
"format": "NVD_CVE",
"version": "2.0",
"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z",
"vulnerabilities": [item['vulnerability_record']]
}
# Disable per-file metadata updates during batch processing
status = _save_nvd_cve_to_local_file(item['cve_id'], nvd_response_data, nvd_schema, update_metadata=False)
label = {"cached": "ADDED", "updated": "UPDATED", "up-to-date": "CURRENT"}.get(status, "ERROR")
logger.info(f"NVD 2.x {item['cve_id']:<20} {label}", group="CACHE_MANAGEMENT")
except Exception as e:
logger.info(f"NVD 2.x {item['cve_id']:<20} ERROR", group="CACHE_MANAGEMENT")
logger.debug(f"NVD 2.0 local cache update failed for {item['cve_id']}: {e}", group="CACHE_MANAGEMENT")
# Single metadata update for entire batch (efficient)
if batch and nvd_repo_path:
try:
_update_cache_metadata('nvd_2_0_cve', nvd_repo_path)
except Exception as e:
logger.warning(f"Failed to update NVD cache metadata after batch: {e}", group="CACHE_MANAGEMENT")
def _save_nvd_cve_to_cache_during_bulk_generation(cve_id, vulnerability_record):
"""
Queue NVD CVE record for batch cache update during bulk dataset generation.
Updates cache if API data is newer than existing cached data, or if the cache
file is missing, has no timestamp, or is corrupted. Only skips if cached data
is already up-to-date (API lastModified <= cached lastModified).
"""
try:
# Import cache functions (delayed import to avoid circular dependencies)
from src.analysis_tool.core.gatherData import _resolve_cve_cache_file_path
nvd_config = config['cache_settings']['nvd_2_0_cve']
# Use 'cache/nvd_2.0_cves' as default path (parallel to cve_list_v5)
nvd_repo_path = nvd_config.get('path', 'cache/nvd_2.0_cves')
nvd_file_path = _resolve_cve_cache_file_path(cve_id, nvd_repo_path)
if not nvd_file_path:
return {'action': 'no_action', 'reason': 'path_resolution_failed'}
# Extract API lastModified timestamp
api_last_modified = vulnerability_record.get('cve', {}).get('lastModified')
if not api_last_modified:
logger.warning(f"NVD 2.0 local cache missing lastModified for {cve_id} - No Action", group="CACHE_MANAGEMENT")
return {'action': 'no_action', 'reason': 'missing_timestamp'}
# Parse API timestamp
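# e.g. "2024-01-15T10:30:00.000Z"  -> "2024-01-15T10:30:00.000+00:00"
#      "2024-01-15T10:30:00.000"   -> "2024-01-15T10:30:00.000+00:00" (naive: assume UTC)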
try:
if 'Z' in api_last_modified:
api_datetime_str = api_last_modified.replace('Z', '+00:00')
elif '+' not in api_last_modified and api_last_modified.count(':') >= 2:
api_datetime_str = api_last_modified + '+00:00'
else:
api_datetime_str = api_last_modified
api_datetime = datetime.fromisoformat(api_datetime_str)
except ValueError:
logger.warning(f"NVD 2.0 local cache timestamp parse failed for {cve_id}: {api_last_modified} - No Action", group="CACHE_MANAGEMENT")
return {'action': 'no_action', 'reason': 'timestamp_parse_error'}
# Determine queue reason based on cache file state; only 'up-to-date' short-circuits.
# All other outcomes (stale, missing_timestamp, corrupted, new_or_missing) fall
# through to queueing.
if nvd_file_path.exists():
try:
with open(nvd_file_path, 'r', encoding='utf-8') as f:
cached_data = json.load(f)
cached_vulns = cached_data.get('vulnerabilities', [])
cached_last_modified = cached_vulns[0].get('cve', {}).get('lastModified') if cached_vulns else None
if cached_last_modified:
# Parse cached timestamp
if 'Z' in cached_last_modified:
cached_datetime_str = cached_last_modified.replace('Z', '+00:00')
elif '+' not in cached_last_modified and cached_last_modified.count(':') >= 2:
cached_datetime_str = cached_last_modified + '+00:00'
else:
cached_datetime_str = cached_last_modified
cached_datetime = datetime.fromisoformat(cached_datetime_str)
if api_datetime <= cached_datetime:
# API data is same or older — no update needed
return {'action': 'no_action', 'reason': 'up-to-date'}
queue_reason = 'stale'
else:
logger.warning(f"NVD 2.0 local cache missing lastModified timestamp for {cve_id} - Queued for Update", group="CACHE_MANAGEMENT")
queue_reason = 'missing_timestamp'
except (json.JSONDecodeError, IOError) as e:
logger.warning(f"NVD 2.0 local cache file corrupted for {cve_id}: {e} - Queued for Update", group="CACHE_MANAGEMENT")
queue_reason = 'corrupted'
else:
queue_reason = 'new_or_missing'
_cache_batch['nvd_updates'].append({
'cve_id': cve_id,
'file_path': nvd_file_path,
'vulnerability_record': vulnerability_record,
'repo_path': nvd_repo_path
})
return {'action': 'queued', 'reason': queue_reason}
except Exception as e:
logger.warning(f"NVD 2.0 local cache update queue failed for {cve_id}: {e}", group="CACHE_MANAGEMENT")
return {'action': 'no_action', 'reason': 'error'}
def _save_cve_list_v5_to_cache_during_bulk_generation(cve_id, nvd_last_modified=None):
"""
Queue CVE List V5 record for batch cache update during bulk dataset generation.
Fast path: if NVD lastModified <= last_manual_update, the V5 record is current.
Fallback: TTL-based file age check when the fast-path comparison is unavailable.
Args:
cve_id: CVE identifier (e.g., "CVE-2024-12345")
nvd_last_modified: NVD lastModified timestamp string from the in-memory API response.
When provided, enables the last_manual_update shortcut check.
"""
try:
# Import cache functions (delayed import to avoid circular dependencies)
from src.analysis_tool.core.gatherData import _resolve_cve_cache_file_path
cve_config = config['cache_settings']['cve_list_v5']
# Use 'cache/cve_list_v5' as default path
cve_repo_path = cve_config.get('path', 'cache/cve_list_v5')
cve_file_path = _resolve_cve_cache_file_path(cve_id, cve_repo_path)
if not cve_file_path:
return {'action': 'no_action', 'reason': 'path_resolution_failed'}
if cve_file_path.exists():
# Fast path: NVD lastModified is already in memory from the bulk API response.
if nvd_last_modified:
last_manual_update_str = cve_config.get('refresh_strategy', {}).get('last_manual_update')
if last_manual_update_str:
try:
def _normalize_dt(dt_str):
if 'Z' in dt_str:
dt_str = dt_str.replace('Z', '+00:00')
elif '+' not in dt_str and dt_str.count(':') >= 2:
dt_str = dt_str + '+00:00'
return datetime.fromisoformat(dt_str)
nvd_dt = _normalize_dt(nvd_last_modified)
lmu_dt = _normalize_dt(last_manual_update_str)
if nvd_dt <= lmu_dt:
# V5 cache was refreshed after this CVE's last NVD modification — current
return {'action': 'no_action', 'reason': 'current_by_last_manual_update'}
except ValueError:
pass # Unparseable timestamp — fall through to TTL check
# Fallback: TTL-based file age check
cache_ttl_hours = cve_config.get('refresh_strategy', {}).get('notify_age_hours', 168)
file_modified_time = datetime.fromtimestamp(cve_file_path.stat().st_mtime, tz=timezone.utc)
file_age_hours = (datetime.now(timezone.utc) - file_modified_time).total_seconds() / 3600
if file_age_hours < cache_ttl_hours:
return {'action': 'no_action', 'reason': 'up-to-date', 'age_hours': file_age_hours, 'ttl_hours': cache_ttl_hours}
file_status = {'action': 'queued', 'reason': 'stale', 'age_hours': file_age_hours, 'ttl_hours': cache_ttl_hours}
else:
# File doesn't exist - queue for creation
file_status = {'action': 'queued', 'reason': 'new_or_missing'}
_cache_batch['cve_list_updates'].append({
'cve_id': cve_id,
'file_path': cve_file_path,
'repo_path': cve_repo_path
})
return file_status
except Exception as e:
logger.debug(f"CVE List v5 local cache update queue failed for {cve_id}: {e}", group="CACHE_MANAGEMENT")
return {'action': 'no_action', 'reason': 'error'}
def query_nvd_cves_by_status(api_key=None, target_statuses=None, output_file="cve_dataset.txt", run_directory=None, source_uuid=None, statuses_explicitly_provided=False):
"""
Query NVD API for CVEs with specific vulnerability statuses
Args:
api_key (str): NVD API key (required for reasonable performance)
target_statuses (list): List of vulnerability statuses to filter by
['Received', 'Awaiting Analysis', 'Undergoing Analysis']
output_file (str): Output file path
run_directory (Path): Run directory where dataset should be written
source_uuid (str): Optional UUID to filter CVEs by sourceIdentifier (server-side filtering)
statuses_explicitly_provided (bool): Whether user explicitly provided status filters
"""
if target_statuses is None:
target_statuses = ['Received', 'Awaiting Analysis', 'Undergoing Analysis']
logger.info("Starting CVE dataset generation...", group="DATASET")
if source_uuid and not statuses_explicitly_provided:
logger.info("Target vulnerability statuses: ALL (inclusive mode with UUID filtering)", group="DATASET")
else:
logger.info(f"Target vulnerability statuses: {', '.join(target_statuses)}", group="DATASET")
if source_uuid:
logger.info(f"Source UUID filter (server-side): {source_uuid}", group="DATASET")
logger.info(f"Output file: {output_file}", group="DATASET")
logger.info(f"Using API key: {'Yes' if api_key else 'No'}", group="DATASET")
# Import dataset contents collector functions for progress tracking
from src.analysis_tool.reporting.dataset_contents_collector import (
record_api_call, record_api_call_unified, record_output_file, update_cve_discovery_progress
)
logger.info("=== CVE Record Cache Preparation ===", group="DATASET")
from src.analysis_tool.core.gatherData import query_nvd_cve_page, build_nvd_api_headers
base_url = config['api']['endpoints']['nvd_cves']
headers = build_nvd_api_headers(api_key)
# Results per page (max 2000)
results_per_page = 2000
start_index = 0
total_results = 0
matching_cves = []
api_failure_occurred = False # Track if API failures prevented complete data collection
logger.info("Starting CVE collection...", group="CVE_QUERY")
while True:
# Construct URL with pagination and optional UUID filtering
url = f"{base_url}?resultsPerPage={results_per_page}&startIndex={start_index}"
if source_uuid:
url += f"&sourceIdentifier={source_uuid}"
logger.info(f"Processing CVE dataset queries: Starting at index {start_index}...", group="CVE_QUERY")
page_data = query_nvd_cve_page(url, headers, f"NVD CVE Dataset Collection (index {start_index})")
if page_data is None:
logger.error("Dataset generation failed: Unable to retrieve page data after all retry attempts - stopping data collection", group="data_processing")
# Record failed API call in unified tracking
if run_directory:
record_api_call(0, False)
record_api_call_unified("NVD CVE API", success=False)
api_failure_occurred = True # Mark that API failure prevented complete collection
break
# Record successful API call
vulnerabilities = page_data.get('vulnerabilities', [])
if run_directory:
record_api_call(len(vulnerabilities), False) # Rate limiting handled internally by query function
# Also record in unified dashboard tracking
record_api_call_unified("NVD CVE API", success=True)
if not vulnerabilities:
logger.info("No more vulnerabilities found. Collection complete.", group="CVE_QUERY")
break
logger.info(f"Processing {len(vulnerabilities)} CVEs from this page...", group="CVE_QUERY")
for vuln in vulnerabilities:
cve_data = vuln.get('cve', {})
cve_id = cve_data.get('id', '')
vuln_status = cve_data.get('vulnStatus', '')
# Determine if we should apply status filtering
should_include = False
if source_uuid and not statuses_explicitly_provided:
# UUID filtering with default statuses - include all CVEs from this source
should_include = True
matching_cves.append(cve_id)
elif vuln_status in target_statuses:
# Either traditional status filtering or UUID + explicit status filtering
should_include = True
matching_cves.append(cve_id)
# OPTIMIZATION: Cache both NVD and CVE List V5 records now to avoid re-fetching later
if should_include and cve_id:
# Check cache status for both NVD and CVE List V5
nvd_status = _save_nvd_cve_to_cache_during_bulk_generation(cve_id, vuln)
cvelist_status = _save_cve_list_v5_to_cache_during_bulk_generation(cve_id, nvd_last_modified=cve_data.get('lastModified'))
# Log consolidated cache status (single line per CVE)
if nvd_status and cvelist_status:
# Format NVD message (timestamp-based comparison only)
nvd_msg = f"NVD: {nvd_status['reason']}"
# Format CVE List message (TTL-based with age context)
if cvelist_status['reason'] == 'stale':
cvelist_msg = f"CVE-List: stale ({cvelist_status.get('age_hours', 0):.1f}h > {cvelist_status.get('ttl_hours', 0)}h)"
elif cvelist_status['reason'] == 'up-to-date':
cvelist_msg = f"CVE-List: up-to-date ({cvelist_status.get('age_hours', 0):.1f}h < {cvelist_status.get('ttl_hours', 0)}h)"
elif cvelist_status['reason'] == 'current_by_last_manual_update':
cvelist_msg = "CVE-List: current (last_manual_update)"
else:
cvelist_msg = f"CVE-List: {cvelist_status['reason']}"
logger.info(f"Cache: {cve_id} | {nvd_msg} | {cvelist_msg}", group="CACHE_MANAGEMENT")
# Flush cache writes accumulated for this page before moving to the next
_flush_cache_batches()
# Check if we have more pages
total_results = page_data.get('totalResults', 0)
current_end = start_index + len(vulnerabilities)
progress_pct = (current_end / total_results * 100) if total_results > 0 else 0
logger.info(f"Processing CVE dataset generation: {current_end}/{total_results} ({progress_pct:.1f}%) - {len(matching_cves)} matching CVE records found", group="CVE_QUERY")
# Update dashboard progress
if run_directory:
update_cve_discovery_progress(current_end, total_results, len(matching_cves))
if current_end >= total_results:
logger.info("Reached end of available CVEs.", group="CVE_QUERY")
break
# Move to next page
start_index += results_per_page
# Rate limiting - wait between pages
if not api_key:
wait_time = config['api']['retry']['page_delay_without_key']
logger.info(f"Waiting {wait_time} seconds before next page (rate limiting)...", group="CVE_QUERY")
sleep(wait_time)
else:
wait_time = config['api']['retry']['page_delay_with_key']
if wait_time > 0:
logger.info(f"Waiting {wait_time} seconds before next page...", group="CVE_QUERY")
sleep(wait_time)
# Flush any remaining cache updates
_flush_cache_batches()
logger.info("=== END CVE Record Cache Preparation ===", group="DATASET")
# Check if API failure occurred - return False before attempting to write incomplete data
if api_failure_occurred:
logger.error("Dataset generation incomplete: API failures prevented complete data collection", group="data_processing")
logger.error(f"Partial data collected: {len(matching_cves)} CVEs (may be incomplete)", group="data_processing")
return False
# Write results to file
output_file_resolved = resolve_output_path(output_file, run_directory)
logger.info(f"Writing {len(matching_cves)} CVE IDs to {output_file_resolved}...", group="DATASET")
try:
with open(output_file_resolved, 'w') as f:
for cve_id in matching_cves:
f.write(f"{cve_id}\n")
logger.info("Dataset generated successfully!", group="DATASET")
logger.info(f"Collected {len(matching_cves)} CVE records", group="DATASET")
logger.info(f"File saved: {output_file_resolved}", group="DATASET")
# Record output file in dataset contents collector
if run_directory:
record_output_file(output_file, str(output_file_resolved), len(matching_cves))
except Exception as e:
logger.error(f"Dataset file creation failed: Unable to write dataset output to '{output_file_resolved}' - {e}", group="data_processing")
return False
return True
def generate_last_days(days, api_key=None, output_file="cve_recent_dataset.txt", run_directory=None, source_uuid=None):
"""Generate dataset for CVEs modified in the last N days"""
end_date = datetime.now(timezone.utc)
start_date = end_date - timedelta(days=days)
# Limit to 120 days max
if days > 120:
logger.error("Cannot query more than 120 days (NVD API limit)", group="DATASET")
return False
logger.info(f"Generating dataset for CVEs modified in the last {days} days", group="DATASET")
return query_nvd_cves_by_date_range(
start_date.strftime('%Y-%m-%dT%H:%M:%S.000Z'),
end_date.strftime('%Y-%m-%dT%H:%M:%S.000Z'),
api_key, output_file, run_directory, source_uuid
)
def generate_date_range(start_date_str, end_date_str, api_key=None, output_file="cve_range_dataset.txt", run_directory=None, source_uuid=None):
"""Generate dataset for CVEs modified in a specific date range"""
try:
# Parse dates
if 'T' not in start_date_str:
start_date_str += 'T00:00:00.000Z'
if 'T' not in end_date_str:
end_date_str += 'T23:59:59.000Z'
start_date = datetime.fromisoformat(start_date_str.replace('Z', '+00:00'))
end_date = datetime.fromisoformat(end_date_str.replace('Z', '+00:00'))
# Validate range
if (end_date - start_date).days > 120:
logger.error("Date range cannot exceed 120 days (NVD API limit)", group="DATASET")
return False
logger.info(f"Generating dataset for date range: {start_date_str} to {end_date_str}", group="DATASET")
return query_nvd_cves_by_date_range(start_date_str, end_date_str, api_key, output_file, run_directory, source_uuid)
except ValueError as e:
logger.error(f"Invalid date format: {e}", group="DATASET")
return False
def query_nvd_cves_by_date_range(start_date, end_date, api_key=None, output_file="cve_dataset.txt", run_directory=None, source_uuid=None):
"""Query NVD API for CVEs modified within a date range"""
logger.info(f"Querying CVEs modified between {start_date} and {end_date}", group="DATASET")
if source_uuid:
logger.info(f"Source UUID filter (server-side): {source_uuid}", group="DATASET")
# Import dataset contents collector functions for progress tracking
from src.analysis_tool.reporting.dataset_contents_collector import (
record_api_call, record_output_file, update_cve_discovery_progress
)
logger.info("=== CVE Record Cache Preparation ===", group="DATASET")
from src.analysis_tool.core.gatherData import query_nvd_cve_page, build_nvd_api_headers
base_url = config['api']['endpoints']['nvd_cves']
headers = build_nvd_api_headers(api_key)
results_per_page = 2000
start_index = 0
matching_cves = []
api_failure_occurred = False # Track if API failures prevented complete data collection
# URL encode dates
start_date_encoded = start_date.replace('+', '%2B')
end_date_encoded = end_date.replace('+', '%2B')
while True:
url = (f"{base_url}?"
f"lastModStartDate={start_date_encoded}&"
f"lastModEndDate={end_date_encoded}&"
f"resultsPerPage={results_per_page}&"
f"startIndex={start_index}")
if source_uuid:
url += f"&sourceIdentifier={source_uuid}"
logger.info(f"Querying CVEs modified in date range: Starting at index {start_index}...", group="CVE_QUERY")
page_data = query_nvd_cve_page(url, headers, f"NVD CVE Date Range Query (index {start_index})")
if page_data is None:
logger.error("Failed to retrieve page data - stopping collection", group="data_processing")
# Record failed API call
if run_directory:
record_api_call(0, False)
api_failure_occurred = True # Mark that API failure prevented complete collection
break
# Record successful API call
vulnerabilities = page_data.get('vulnerabilities', [])
if run_directory:
record_api_call(len(vulnerabilities), False) # Rate limiting handled internally
if not vulnerabilities:
logger.info("No more vulnerabilities found. Collection complete.", group="CVE_QUERY")
break
logger.info(f"Processing {len(vulnerabilities)} CVEs from this page...", group="CVE_QUERY")
for vuln in vulnerabilities:
cve_data = vuln.get('cve', {})
cve_id = cve_data.get('id', '')
if cve_id:
matching_cves.append(cve_id)
# Check cache status for both NVD and CVE List V5
nvd_status = _save_nvd_cve_to_cache_during_bulk_generation(cve_id, vuln)
cvelist_status = _save_cve_list_v5_to_cache_during_bulk_generation(cve_id, nvd_last_modified=cve_data.get('lastModified'))
# Log consolidated cache status (single line per CVE)
if nvd_status and cvelist_status:
# Format NVD message (timestamp-based comparison only)
nvd_msg = f"NVD: {nvd_status['reason']}"
# Format CVE List message (TTL-based with age context)
if cvelist_status['reason'] == 'stale':
cvelist_msg = f"CVE-List: stale ({cvelist_status.get('age_hours', 0):.1f}h > {cvelist_status.get('ttl_hours', 0)}h)"
elif cvelist_status['reason'] == 'up-to-date':
cvelist_msg = f"CVE-List: up-to-date ({cvelist_status.get('age_hours', 0):.1f}h < {cvelist_status.get('ttl_hours', 0)}h)"
elif cvelist_status['reason'] == 'current_by_last_manual_update':
cvelist_msg = "CVE-List: current (last_manual_update)"
else:
cvelist_msg = f"CVE-List: {cvelist_status['reason']}"
logger.debug(f"Cache: {cve_id} | {nvd_msg} | {cvelist_msg}", group="CACHE_MANAGEMENT")
# Flush cache writes accumulated for this page before moving to the next
_flush_cache_batches()
total_results = page_data.get('totalResults', 0)
current_end = start_index + len(vulnerabilities)
progress_pct = (current_end / total_results * 100) if total_results > 0 else 0
logger.info(f"Progress: {current_end}/{total_results} ({progress_pct:.1f}%) - {len(matching_cves)} CVEs found", group="CVE_QUERY")
# Update dashboard progress
if run_directory:
update_cve_discovery_progress(current_end, total_results, len(matching_cves))
if current_end >= total_results:
break
start_index += results_per_page
# Rate limiting
wait_time = config['api']['retry']['page_delay_without_key'] if not api_key else config['api']['retry']['page_delay_with_key']
if wait_time > 0:
sleep(wait_time)
# Flush any remaining cache updates
_flush_cache_batches()
logger.info("=== END CVE Record Cache Preparation ===", group="DATASET")
# Check if API failure occurred - return False before attempting to write incomplete data
if api_failure_occurred:
logger.error("Dataset generation incomplete: API failures prevented complete data collection", group="data_processing")
logger.error(f"Partial data collected: {len(matching_cves)} CVEs (may be incomplete)", group="data_processing")
return False
# Write results
output_file_resolved = resolve_output_path(output_file, run_directory)
try:
with open(output_file_resolved, 'w') as f:
for cve_id in matching_cves:
f.write(f"{cve_id}\n")
# Record output file in dataset contents collector (logs silently)
if run_directory:
record_output_file(output_file, str(output_file_resolved), len(matching_cves))
return True
except Exception as e:
logger.error(f"Failed to write output file: {e}", group="data_processing")
return False
def query_nvd_cve_by_id(cve_id, api_key=None, output_file="cve_dataset.txt", run_directory=None):
"""Fetch NVD 2.0 + CVE List V5 cache for a single CVE ID, write the ID to the dataset file. Used by --cve."""
logger.info(f"Targeted CVE query: {cve_id}", group="DATASET")
from src.analysis_tool.reporting.dataset_contents_collector import (
record_api_call, record_api_call_unified, record_output_file
)
logger.info("=== CVE Record Cache Preparation ===", group="DATASET")
from src.analysis_tool.core.gatherData import query_nvd_cve_page, build_nvd_api_headers
base_url = config['api']['endpoints']['nvd_cves']
headers = build_nvd_api_headers(api_key)
url = f"{base_url}?cveId={cve_id}"
page_data = query_nvd_cve_page(url, headers, f"NVD CVE Targeted Query ({cve_id})")
if page_data is None:
logger.error(f"Failed to fetch NVD record for {cve_id}", group="CVE_QUERY")
if run_directory:
record_api_call_unified("NVD CVE API", success=False)
return False
vulnerabilities = page_data.get('vulnerabilities', [])
if not vulnerabilities:
logger.warning(f"NVD returned no vulnerability data for {cve_id}", group="CVE_QUERY")
return False
if run_directory:
record_api_call(1, False)
record_api_call_unified("NVD CVE API", success=True)
vuln = vulnerabilities[0]
nvd_status = _save_nvd_cve_to_cache_during_bulk_generation(cve_id, vuln)
cvelist_status = _save_cve_list_v5_to_cache_during_bulk_generation(cve_id, nvd_last_modified=vuln.get('cve', {}).get('lastModified'))
if nvd_status and cvelist_status:
nvd_msg = f"NVD: {nvd_status['reason']}"
if cvelist_status['reason'] == 'stale':
cvelist_msg = f"CVE-List: stale ({cvelist_status.get('age_hours', 0):.1f}h > {cvelist_status.get('ttl_hours', 0)}h)"
elif cvelist_status['reason'] == 'up-to-date':
cvelist_msg = f"CVE-List: up-to-date ({cvelist_status.get('age_hours', 0):.1f}h < {cvelist_status.get('ttl_hours', 0)}h)"
elif cvelist_status['reason'] == 'current_by_last_manual_update':
cvelist_msg = "CVE-List: current (last_manual_update)"
else:
cvelist_msg = f"CVE-List: {cvelist_status['reason']}"
logger.info(f"Cache: {cve_id} | {nvd_msg} | {cvelist_msg}", group="CACHE_MANAGEMENT")
_flush_cache_batches()
logger.info("=== END CVE Record Cache Preparation ===", group="DATASET")
output_file_resolved = resolve_output_path(output_file, run_directory)
logger.info(f"Writing {cve_id} to {output_file_resolved}...", group="DATASET")
try:
with open(output_file_resolved, 'w') as f:
f.write(f"{cve_id}\n")
logger.info("Dataset generated successfully!", group="DATASET")
logger.info(f"File saved: {output_file_resolved}", group="DATASET")
if run_directory:
record_output_file(output_file, str(output_file_resolved), 1)
except Exception as e:
logger.error(f"Failed to write dataset file: {e}", group="data_processing")
return False
return True
def main():
"""Parse arguments and dispatch to the appropriate generation function (--cve, date-range, or status-based), then hand off to run_analysis_tool()."""
# Register interrupt handler to flush cache on Ctrl+C
signal.signal(signal.SIGINT, _handle_interrupt)
signal.signal(signal.SIGTERM, _handle_interrupt)
parser = argparse.ArgumentParser(description='Generate CVE dataset from NVD API')
# Group 1: Tool Output - What analysis outputs to generate
output_group = parser.add_argument_group('Tool Output', 'Select which analysis outputs to generate')
output_group.add_argument("--nvd-ish-only", nargs='?', const='true', choices=['true', 'false'], default='false',
help="Generate complete NVD-ish enriched records without report files (ignores other output flags)")
output_group.add_argument('--sdc-report', nargs='?', const='true', choices=['true', 'false'], default='false',
help='Generate Source Data Concerns report (default: false, true if flag provided without value)')
output_group.add_argument('--cpe-determination', nargs='?', const='true', choices=['true', 'false'], default='false',
help='Generate CPE suggestions via NVD CPE API calls (default: false, true if flag provided without value)')
output_group.add_argument('--alias-report', nargs='?', const='true', choices=['true', 'false'], default='false',
help='Generate alias report via curator features (default: false, true if flag provided without value)')
output_group.add_argument('--cpe-as-generator', nargs='?', const='true', choices=['true', 'false'], default='false',
help='Generate CPE Applicability Statements (default: false, true if flag provided without value)')
# Group 2: Data Input/Sources - Specify what data to process and where to get it
input_group = parser.add_argument_group('Data Input/Sources', 'Specify input data and data sources')
input_group.add_argument('--api-key', nargs='?', const='CONFIG_DEFAULT', help='NVD API key. Use without value to use config default, or provide explicit key')
# Group 3: Dataset Generation - Control what CVE data is included in the dataset
dataset_group = parser.add_argument_group('Dataset Generation', 'Control CVE data selection and dataset creation')
dataset_group.add_argument('--source-uuid', type=str,
help='Filter CVEs by sourceIdentifier (CNA/ADP providerMetadata.orgId) - must be valid UUID format. When used without --statuses, includes all vulnerability statuses.')
dataset_group.add_argument('--statuses', nargs='+',
default=['Received', 'Awaiting Analysis', 'Undergoing Analysis'],
help='Vulnerability statuses to include (default when no UUID: Received, Awaiting Analysis, Undergoing Analysis; default with UUID: all statuses)')
dataset_group.add_argument('--cve', type=str,
help='Single CVE ID to target directly (bypasses status/date filtering, populates cache and runs analysis)')
dataset_group.add_argument('--last-days', type=int,
help='Generate dataset for CVEs modified in the last N days')
dataset_group.add_argument('--start-date', type=str,
help='Start date for lastModified filter (YYYY-MM-DD or ISO format)')
dataset_group.add_argument('--end-date', type=str,
help='End date for lastModified filter (YYYY-MM-DD or ISO format)')
# Group 4: Run Organization - Control directory structure for multi-run orchestration
run_org_group = parser.add_argument_group('Run Organization', 'Control run directory hierarchy')
run_org_group.add_argument('--parent-run-dir', type=str,
help='Parent run directory path - creates this run as child within parent (used by harvest script)')
args = parser.parse_args()
logger.info("=" * 80, group="DATASET")
logger.info("NVD CVE Dataset Generator", group="DATASET")
logger.info("=" * 80, group="DATASET")
# Validate UUID if provided
if args.source_uuid and not validate_uuid(args.source_uuid):
logger.error(f"Invalid UUID format: {args.source_uuid}", group="DATASET")
logger.error("Source UUID must be a valid UUID format", group="DATASET")
return 1
# --cve is mutually exclusive with date/status filtering modes
if args.cve and any([args.last_days, args.start_date, args.end_date, '--statuses' in sys.argv]):
logger.error("--cve cannot be combined with --last-days, --start-date, --end-date, or --statuses", group="DATASET")
return 1
# Validate CVE ID format if provided
if args.cve:
cve_id = args.cve.strip().upper()
if not re.fullmatch(r'CVE-[0-9]{4}-[0-9]{4,19}', cve_id):
logger.error(f"Invalid CVE ID format: {args.cve}", group="DATASET")
return 1
# Detect if statuses were explicitly provided (not using defaults)
statuses_explicitly_provided = '--statuses' in sys.argv
# Resolve API key from command line, config, or default to None
if args.api_key == 'CONFIG_DEFAULT':
resolved_api_key = config['api']['api_key']
logger.info("NVD API key detected | Source: Configuration", group="DATASET")
elif args.api_key:
resolved_api_key = args.api_key
logger.info("NVD API key detected | Source: Direct Input", group="DATASET")
else:
# No --api-key flag provided, fall back to config
resolved_api_key = config['api']['api_key'] or None
if resolved_api_key:
logger.info("NVD API key detected | Source: Configuration", group="DATASET")
if not resolved_api_key:
logger.error("API key is required for dataset generation", group="DATASET")
logger.error("Either use --api-key parameter or set api.api_key in config.json", group="DATASET")
logger.error("NVD API without a key has severe rate limits that make dataset generation impractical", group="DATASET")
return 1
logger.info("Using API key for enhanced rate limits", group="DATASET")
if args.source_uuid:
logger.info(f"UUID filtering enabled: {args.source_uuid}", group="DATASET")
# Create run directory first - ALL dataset generation creates runs
from src.analysis_tool.storage.run_organization import create_run_directory
# Convert string boolean arguments to actual booleans
sdc_report = args.sdc_report.lower() == 'true'
cpe_determination = args.cpe_determination.lower() == 'true'
alias_report = args.alias_report.lower() == 'true'
cpe_as_generator = args.cpe_as_generator.lower() == 'true'
nvd_ish_only = args.nvd_ish_only.lower() == 'true'
# Handle --nvd-ish-only flag processing (override behavior)
if nvd_ish_only:
# Override other output flags (ignore their values)
sdc_report = False
cpe_determination = False
alias_report = False
cpe_as_generator = False
logger.info("NVD-ish only mode enabled: generating complete enriched records without report files", group="DATASET")
# Validate feature combinations
if alias_report and not args.source_uuid:
print("ERROR: --alias-report requires --source-uuid to determine the appropriate confirmed mappings file")
print("Example usage:")
print(" python generate_dataset.py --last-days 7 --alias-report --source-uuid your-uuid-here")
return 1
# Validate that at least one feature is enabled (or nvd-ish-only mode)
if not any([sdc_report, cpe_determination, alias_report, cpe_as_generator, nvd_ish_only]):
print("ERROR: At least one feature must be enabled for dataset generation!")
print("Available features:")
print(" --sdc-report : Generate Source Data Concerns report")
print(" --cpe-determination : Generate CPE determination via NVD CPE API calls")
print(" --alias-report : Generate alias report via curator features")
print(" --cpe-as-generator : Generate CPE Applicability Statements")
print(" --nvd-ish-only : Generate complete NVD-ish enriched records without report files")
print("")
print("Example usage:")
print(" python generate_dataset.py --last-days 7 --sdc-report")
print(" python generate_dataset.py --last-days 7 --cpe-determination --cpe-as-generator")
return 1
# Generate initial run context with source shortname resolution
if args.source_uuid:
# Resolve source shortname for better human readability in directory names
from src.analysis_tool.storage.nvd_source_manager import get_or_refresh_source_manager
# Get NVD source manager (uses cache or refreshes as needed)
source_manager = get_or_refresh_source_manager(resolved_api_key, log_group="DATASET")
# Get human-readable shortname (capped to 7-8 characters)
full_shortname = source_manager.get_source_shortname(args.source_uuid)
source_shortname = full_shortname[:8] if len(full_shortname) > 8 else full_shortname
logger.info(f"Resolved source UUID {args.source_uuid[:8]}... to shortname: '{source_shortname}'", group="DATASET")
else:
# No source UUID provided - use default shortname
source_shortname = None
# Prepare enhanced naming parameters
execution_type = "dataset"
# Determine range specification
range_spec = None
if args.cve:
range_spec = cve_id # already normalized
elif args.last_days:
range_spec = f"last_{args.last_days}_days"
elif args.start_date and args.end_date:
range_spec = f"range_{args.start_date}_to_{args.end_date}"
# Prepare tool flags (only include those that are true)
tool_flags = {}
if nvd_ish_only:
tool_flags['nvd-ish'] = True
if sdc_report:
tool_flags['sdc'] = True
if cpe_determination:
tool_flags['cpe-sug'] = True
if alias_report:
tool_flags['alias'] = True
if cpe_as_generator:
tool_flags['cpe-as-gen'] = True
# Create run directory using enhanced naming
# If parent_run_dir provided, create this run as child within parent hierarchy
parent_run_path = Path(args.parent_run_dir) if args.parent_run_dir else None
# Determine subdirectories - all dataset runs only need logs
subdirs = ["logs"]
# Check if we're in a test environment to enable consolidated test run handling
is_test = os.environ.get('CONSOLIDATED_TEST_RUN') == '1'
run_directory, run_id = create_run_directory(
execution_type=execution_type,
source_shortname=source_shortname,
range_spec=range_spec,
status_list=args.statuses if args.statuses else None,
tool_flags=tool_flags if tool_flags else None,
parent_run_dir=parent_run_path,
subdirs=subdirs,
is_test=is_test
)
logger.info(f"Created dataset generation run: {run_id}", group="DATASET")
logger.info(f"Run directory: {run_directory}", group="DATASET")
# Configure file logging to write to run-specific logs directory
logs_dir = run_directory / "logs"
logger.set_run_logs_directory(str(logs_dir))
logger.start_file_logging("cve_dataset")
# Initialize dataset contents report immediately for periodic updates
from src.analysis_tool.reporting.dataset_contents_collector import initialize_dataset_contents_report, get_dataset_contents_collector
# Pre-initialize collector with config before calling initialize_dataset_contents_report