# dcnn-nlp
**Repository Path**: linius/dcnn-nlp
## Basic Information
- **Project Name**: dcnn-nlp
- **Description**: An implementation of ACL2014 paper "A Convolutional Neural Network for Modelling Sentences"
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 10
- **Forks**: 1
- **Created**: 2014-06-17
- **Last Updated**: 2023-10-27
## Categories & Tags
**Categories**: ai
**Tags**: None
## README
dcnn-nlp (In development)
==========================================================
dcnn-nlp is a tool for natural language processing and text classification with convolutional neural networks. It implements and extends the ACL 2014 paper "A Convolutional Neural Network for Modelling Sentences".
Its features:
- Compared with a traditional bag-of-words model, it makes better use of the sequential order of words, and therefore captures the semantics of short texts better
- Reaches state-of-the-art results on different text-classification tasks without any manual feature engineering
- Written in Python, with the core heavily optimized using Cython, BLAS, and similar techniques
- Extends the network structure of the paper "A Convolutional Neural Network for Modelling Sentences": the layers of the deep convolutional network can be freely configured
- Improves on the paper's word-vector training method and supports the gensim word2vec toolkit
- Unsupervised training of word, sentence, and paragraph vectors (TODO)
- GPU acceleration (TODO)
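The network described in the paper selects, after each convolutional layer, the k largest activations of every feature-map row while preserving their order (k-max pooling; the `k_top` parameter in the example below is k at the topmost layer). A minimal NumPy sketch of that operation, not the tool's own implementation (the function name is illustrative):

```python
import numpy as np

def k_max_pooling(feature_map, k):
    """feature_map: (n_filters, length); returns (n_filters, k), keeping
    the k largest values of each row in their original order."""
    # column indices of the k largest values in each row
    idx = np.argpartition(feature_map, -k, axis=1)[:, -k:]
    idx.sort(axis=1)  # restore left-to-right order
    return np.take_along_axis(feature_map, idx, axis=1)

fm = np.array([[1.0, 5.0, 2.0, 9.0, 3.0]])
print(k_max_pooling(fm, 3))  # keeps 5, 9, 3 in their original order
```

In the paper the pooling parameter is *dynamic*: intermediate layers use a k that shrinks with depth down to `k_top`, which this sketch omits for brevity.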
Examples
==========================================================
```python
# Stanford Sentiment Treebank experiment
# Run `python prepare.py` in the data/stanford directory first.
import numpy
# LineSentence and DCNNDeep are provided by this package
total_data_file = 'data/stanford/total.data'
total_sentences = LineSentence(total_data_file, repeat=5)
train_data_file = 'data/stanford/train2.data'
train_label_file = 'data/stanford/train2.label'
train_sentences = LineSentence(train_data_file)
train_labels = numpy.fromfile(train_label_file, sep='\n', dtype=numpy.int32)
dev_data_file = 'data/stanford/dev2.data'
dev_label_file = 'data/stanford/dev2.label'
dev_sentences = LineSentence(dev_data_file)
dev_labels = numpy.fromfile(dev_label_file, sep='\n', dtype=numpy.int32)
test_data_file = 'data/stanford/test2.data'
test_label_file = 'data/stanford/test2.label'
test_sentences = LineSentence(test_data_file)
test_labels = numpy.fromfile(test_label_file, sep='\n', dtype=numpy.int32)
# n_filters=[6,14] in the paper
# n_filters=[4,6] in LeNet
# But you can go deeper
model = DCNNDeep(sentences=train_sentences, output_layer_size=2, wordvec_dim=48,
alpha=0.012, entropy_descent_m=0.995, dropout_rate_in_hiddens=0.5,
dropout_rate_in_input=0.2, min_count=2, full_con_layer_size=5,
filter_width=[7,5,3], k_top=4, n_filters=[6,14,6], alpha_m=0.999995,
min_alpha=0.00001, pre_train_word_vec=True, pre_train_sentences=total_sentences)
model.train(train_sentences=train_sentences, train_labels=train_labels, patience=5,
validate_freq=2000, max_entropy_allowed=0.38, validate_sentences=dev_sentences,
validate_labels=dev_labels, chunksize=5)
print('test accuracy: %f' % model.accuracy(test_sentences, test_labels))
```
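The `filter_width` and `n_filters` parameters above configure the *wide* one-dimensional convolutions of the paper, where each row of the sentence matrix is convolved with a filter row over the full range so that edge words receive as many filter applications as interior words. A minimal NumPy sketch under that reading (the function name is illustrative, not the tool's API):

```python
import numpy as np

def wide_conv_1d(sentence_matrix, filter_matrix):
    """sentence_matrix: (dim, length); filter_matrix: (dim, width).
    Returns (dim, length + width - 1): one full convolution per row."""
    return np.stack([np.convolve(row, f, mode='full')
                     for row, f in zip(sentence_matrix, filter_matrix)])

s = np.array([[1.0, 2.0, 3.0]])   # one embedding dimension, three words
f = np.array([[1.0, 1.0]])        # filter of width 2
print(wide_conv_1d(s, f))         # [[1. 3. 5. 3.]]
```

The output is wider than the input (length + width - 1), which is why the subsequent k-max pooling step is needed to bring each layer back to a fixed size.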